The Świgra Parser
Author: Marcin Woliński
Web-based demo: http://swigra.nlp.ipipan.waw.pl/
Świgra is a constituency parser of Polish using an extended version of the Definite Clause Grammar formalism. The parser generates constituency forests, which can be disambiguated by a statistical component. The program exists in two versions:
- Świgra 1 is a faithful implementation of Marek Świdziński's Formal Grammar of Polish (Świdziński 1992, Gramatyka formalna języka polskiego, GFJP).
- Świgra 2 operates with a new grammar stemming from GFJP but developed by Marcin Woliński. Compared to GFJP, Świgra 2 generates much simpler, more readable, and more intuitive parse trees. The grammar describes a number of phenomena not covered by GFJP, e.g., coordinated phrases of various types, constructions involving numerals, argument structures of non-verbal forms, constructions involving particles, common discontinuous structures, common sentence-like constructions without a verb.
Both versions use almost the same runtime (each version adds somewhat different new features to DCG, but the general processing scheme is the same). The runtime is called Birnam (its slogan is "We bring forests to your door").
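To make the notion of a constituency forest more concrete, here is a minimal Python sketch. It is not Świgra's grammar and not the Birnam runtime: the toy English grammar, the lexicon, and the function names `build_forest` and `count_trees` are all assumptions made for this illustration. It uses a plain CYK chart rather than DCG, and only shows how one chart can hold several trees for an ambiguous sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (NOT Świgra's grammar).
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "she": ["NP"], "saw": ["V"], "stars": ["NP"],
    "with": ["P"], "telescopes": ["NP"],
}

def build_forest(words):
    """Return a packed chart: forest[(i, j, sym)] lists every alternative
    way of building `sym` over words[i:j]. An alternative is either the
    word itself (a leaf) or a pair of child nodes."""
    forest = defaultdict(list)
    n = len(words)
    for i, word in enumerate(words):
        for sym in LEXICON.get(word, []):
            forest[(i, i + 1, sym)].append(word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (left, right), parents in BINARY.items():
                    if (i, k, left) in forest and (k, j, right) in forest:
                        for parent in parents:
                            forest[(i, j, parent)].append(
                                ((i, k, left), (k, j, right)))
    return forest

def count_trees(forest, node):
    """Count the distinct trees rooted at `node` without enumerating them."""
    total = 0
    for alternative in forest.get(node, []):
        if isinstance(alternative, str):  # leaf: a single word
            total += 1
        else:
            left, right = alternative
            total += count_trees(forest, left) * count_trees(forest, right)
    return total
```

For the sentence "she saw stars with telescopes", `count_trees(build_forest(words), (0, 5, "S"))` finds two trees, because the prepositional phrase can attach to the verb phrase or to the noun phrase. Choosing between such alternatives is the job of a disambiguation component.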
Świgra can use the morphological analyser Morfeusz (http://morfeusz.sgjp.pl/) as its first processing step, or it can be fed preprocessed data in the format of the NKJP corpus.
A very important component influencing the quality of the trees generated by a parser is its valence dictionary. Świgra 2 uses Walenty (http://walenty.ipipan.waw.pl/) as its source of valence data. Previous versions of the parser depended on Marek Świdziński's valence dictionary SDPV (1998), extended with the most frequent verbs appearing in NKJP1M. That dictionary is also included in the distribution package below.
Publications
Świgra 1 and the Birnam runtime are documented in the work:
- Marcin Woliński. Komputerowa weryfikacja gramatyki Świdzińskiego. Ph.D. dissertation, Institute of Computer Science, Polish Academy of Sciences, Warsaw, 2004. http://nlp.ipipan.waw.pl/Bib/woli:04.pdf
The grammar and implementation of Świgra 2 are presented in the book:
- Marcin Woliński. Automatyczna analiza składnikowa języka polskiego. Wydawnictwa Uniwersytetu Warszawskiego, Warsaw, 2019. https://www.wuw.pl/data/include/cms/Automatyczna_analiza_skladnikowa_Wolinski_Marcin_2019.pdf
Availability and license
- Świgra 2: swigra_current.zip (http://zil.ipipan.waw.pl/%C5%9Awigra?action=AttachFile&do=get&target=swigra_current.zip)
Most of the files in the package are made available under the GNU General Public License v3. Some auxiliary files are put in the public domain for the sake of simplicity — the author disclaims copyright to these files. This applies to the sample grammar for Birnam and the C code used to interface Morfeusz with SWI Prolog.
The release of Świgra 1 and 2 under GPL is possible thanks to the kind permission of the following persons and institutions:
- Prof. Marek Świdziński, the author of the grammar used by Świgra 1,
- Prof. Janusz S. Bień, the leader of project 8T11C 00213 “Zestaw testów do weryfikacji i oceny analizatorów języka polskiego” (1997–1999), within which a prototype for Świgra 1 was developed (under the name AS),
- Institute of Computer Science, Polish Academy of Sciences, my current employer.
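The development of Świgra 2 was co-financed by the following projects:
- MNiSW N N104 224735 (2008–2011),
- Nekst, http://www.ipipan.waw.pl/nekst/ (2012–2013),
- CLARIN-PL, http://clarin-pl.eu/ (2014–2018).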
The copyright holder of Świgra is the Institute of Computer Science, Polish Academy of Sciences.
Installation
- Świgra is implemented in SWI Prolog, which can be downloaded from its site (http://www.swi-prolog.org/download/stable). On Linux, a prepackaged version from the system repositories is preferable.
- Extract the files from the Świgra distribution into an empty folder, retaining the structure of sub-folders (including the parser/ folder).
- Morfeusz is needed for interactive use. On Linux, install it system-wide as described on the Morfeusz download page. On Windows, download the command-line version (http://sgjp.pl/morfeusz/download/20190415/morfeusz2-1.9.10.sgjp.20190415-Windows-amd64.tar.gz), extract the file morfeusz2.dll, and put it in Świgra's parser/ folder.
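The installation steps above can be checked with a short Python sketch. It is not part of the Świgra distribution: the function name `check_install` is hypothetical, and the sketch assumes that the SWI Prolog executable is called `swipl` and is on the PATH, which may not hold for every installation (on Windows in particular, the installer does not always add it).

```python
import shutil
import sys
from pathlib import Path

def check_install(root, platform=sys.platform):
    """Return a list of problems found in a Świgra folder (empty if none).

    `root` is the folder into which the distribution was extracted.
    """
    root = Path(root)
    problems = []
    # Assumption: SWI Prolog is reachable as `swipl` on the PATH.
    if shutil.which("swipl") is None:
        problems.append("SWI-Prolog executable 'swipl' not found on PATH")
    parser_dir = root / "parser"
    if not parser_dir.is_dir():
        problems.append(f"missing folder: {parser_dir}")
    elif platform.startswith("win") and not (parser_dir / "morfeusz2.dll").is_file():
        # On Windows, morfeusz2.dll has to sit in the parser/ folder.
        problems.append(f"missing file: {parser_dir / 'morfeusz2.dll'}")
    return problems
```

Calling `check_install(".")` from the folder that contains parser/ returns an empty list when nothing is found to be wrong. The sketch does not verify the system-wide Morfeusz installation on Linux.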
Interactive use
For interactive use we suggest the web-based interface, which can be accessed in a web browser at the address http://localhost:3333/ after a Świgra server is activated in the following way:
On Linux
In the parser folder execute the command:
./swigra -w
Świgra 2 is used by default. For Świgra 1 add -1 to the command line.
On Windows
Execute the command swigra2_web.cmd in the parser folder.
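Once the server has been started, a small Python sketch like the one below can confirm that the web interface answers before a browser is opened. The function name `server_is_up` and its retry parameters are assumptions made for this example. Only the address http://localhost:3333/ comes from the instructions above, and the sketch checks nothing beyond the fact that something responds to HTTP at that address.

```python
import time
import urllib.error
import urllib.request

def server_is_up(url="http://localhost:3333/", attempts=10, delay=1.0):
    """Return True as soon as `url` answers an HTTP request,
    or False after `attempts` unsuccessful tries."""
    # The address is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    for attempt in range(attempts):
        try:
            with opener.open(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status, so it is running.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet: wait and retry.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling `server_is_up()` with the defaults waits up to roughly ten seconds, which leaves the server some time to start.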