
Diff for "Świgra"

Differences between revisions 2 and 8 (spanning 6 versions)
Revision 2 as of 2013-01-29 16:39:04
Size: 1489
Editor: MichalLenart
Comment:
Revision 8 as of 2019-11-15 17:47:35
Size: 5342
Comment:
Line 2: Line 2:
== Świgra Parser ==
Line 4: Line 3:
This page presents a new version of the ''Świgra'' syntactic parser, which works with a
new formal grammar. The program's grammar derives from Marek Świdziński's ''Gramatyka formalna języka
polskiego'' (Formal Grammar of Polish, 1992), but it has been substantially rebuilt for the purposes of the project.
The structure of the generated parse trees is considerably simpler,
more readable, and more intuitive. Many phenomena not covered by the earlier grammar are now described: coordinated nominal and adjectival phrases,
constructions containing numeral forms, syntactic requirements of nominal
and adjectival forms, and constructions with particles. Many irregularities
that appear in Polish texts have also been taken into account.
= The Świgra Parser =
Line 13: Line 5:
A very important factor in the quality of parse trees generated by an automatic parser is the valence dictionary. Świgra uses a dictionary compiled in 1998 by Marek Świdziński. This dictionary has been extended
with the most frequent verbs occurring in the treebank under construction. As a result, ¾
of the verb occurrences in the studied corpus could be assigned a valence frame (for the remaining ones, a permissive default frame is used during parsing).
Author: Marcin Woliński
Line 17: Line 7:
Web-based demo: http://swigra.nlp.ipipan.waw.pl/
Line 18: Line 9:
 * ''Świgra'' version 1.5: [[attachment:Świgra_1.5.zip]]
 * Valence dictionary of the parser: [[attachment:Słownik-walencyjny.txt.gz]]
Świgra is a constituency parser of Polish using an extended version of the Definite Clause Grammar formalism. The parser generates constituency forests, which can be disambiguated by a statistical component. The program exists in two versions:

 * Świgra 1 is a faithful implementation of Marek Świdziński's ''Formal Grammar of Polish'' (Świdziński 1992, ''Gramatyka formalna języka polskiego'', GFJP)
 * Świgra 2 operates with a new grammar stemming from GFJP but developed by Marcin Woliński. Compared to GFJP, Świgra 2 generates much simpler, more readable, and more intuitive parse trees. The grammar describes a number of phenomena not covered by GFJP, e.g., coordinated phrases of various types, constructions involving numerals, argument structures of non-verbal forms, constructions involving particles, common discontinuous structures, and common sentence-like constructions without a verb.

Both versions use almost the same runtime (each version adds somewhat different new features to DCG, but the general processing scheme is the same). The runtime is called Birnam (its slogan is ''We bring forests to your door'').

Świgra can use the morphological analyser [[http://morfeusz.sgjp.pl/|Morfeusz]] as its first processing step, or it can be fed preprocessed data in the format of the NKJP corpus.

A very important component influencing the quality of trees generated by a parser is its valence dictionary. Świgra 2 uses [[http://walenty.ipipan.waw.pl/|Walenty]] as the source of valency data. Previous versions of the parser depended on Marek Świdziński's valence dictionary SDPV (1998) extended with the most frequent verbs appearing in NKJP1M. This dictionary is also available in the distribution package below.

== Publications ==

Świgra 1 and the Birnam runtime are documented in the work:

 * Marcin Woliński. ''[[http://nlp.ipipan.waw.pl/Bib/woli:04.pdf|Komputerowa weryfikacja gramatyki Świdzińskiego]]''. Ph.D. dissertation, Institute of Computer Science, Polish Academy of Sciences, Warsaw, 2004.

The grammar and implementation of Świgra 2 are presented in the book:

 * Marcin Woliński. ''[[https://www.wuw.pl/data/include/cms/Automatyczna_analiza_skladnikowa_Wolinski_Marcin_2019.pdf|Automatyczna analiza składnikowa języka polskiego]]''. Wydawnictwa Uniwersytetu Warszawskiego, Warsaw, 2019.

== Availability and license ==

 * Świgra 2 [[attachment:swigra_current.zip]]
 * Maximum entropy disambiguator for Świgra trees [[attachment:disambiguator-maxent-current.zip]]

Most of the files in the package are made available under the GNU General Public License v3. Some auxiliary files are put in the public domain for the sake of simplicity; the author disclaims copyright to these files. This applies to the sample grammar for Birnam and the C code used to interface Morfeusz with SWI-Prolog.

The release of Świgra 1 and 2 under GPL is possible thanks to the kind permission of the following persons and institutions:

 1. Prof. Marek Świdziński, the author of the grammar used by Świgra 1,
 1. Prof. Janusz S. Bień, the leader of project 8T11C 00213 “Zestaw testów do weryfikacji i oceny analizatorów języka polskiego” (1997–1999), within which a prototype of Świgra 1 was developed (under the name AS),
 1. the Institute of Computer Science, Polish Academy of Sciences, my current employer.

The development of Świgra 2 was co-financed by the following projects:

 * MNiSW N N104 224735 (2008–2011),
 * [[http://www.ipipan.waw.pl/nekst/|Nekst]] (2012–2013),
 * [[http://clarin-pl.eu/|CLARIN-PL]] (2014–2018).

The copyright holder of Świgra is the Institute of Computer Science, Polish Academy of Sciences.

== Installation ==

 1. Świgra is implemented in SWI-Prolog, which can be downloaded from its [[http://www.swi-prolog.org/download/stable|site]]. On Linux, a prepackaged version from the system repositories is preferable.
 1. The files from the Świgra distribution should be extracted into an empty folder, retaining the structure of sub-folders (including the {{{parser/}}} folder).
 1. For interactive use, Morfeusz is needed. On Linux, install it system-wide as described on the Morfeusz download page. On Windows, download the [[http://sgjp.pl/morfeusz/download/20190415/morfeusz2-1.9.10.sgjp.20190415-Windows-amd64.tar.gz|command-line version]], extract the file {{{morfeusz2.dll}}}, and put it in Świgra's {{{parser/}}} folder.
 1. For the web-based interface, the disambiguator module is also needed. Its archive should be extracted in the same directory as the Świgra package, so that the directories {{{parser/}}} and {{{disambiguator-maxent}}} become siblings.
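
The directory layout required by the steps above can be checked with a small shell sketch such as the one below. The function name {{{check_swigra_layout}}} and the convention of passing the installation root as an argument are assumptions made for illustration; only the folder names {{{parser/}}} and {{{disambiguator-maxent}}} come from the instructions. It is not part of the Świgra distribution.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Świgra): verify that parser/ and
# disambiguator-maxent/ are sibling directories under the given root.
check_swigra_layout() {
    root="${1:-.}"   # installation root; defaults to the current directory
    rc=0
    for dir in parser disambiguator-maxent; do
        if [ -d "$root/$dir" ]; then
            echo "ok: $root/$dir"
        else
            echo "missing: $root/$dir" >&2
            rc=1
        fi
    done
    # Return 0 only when both directories were found.
    return "$rc"
}
```

Calling {{{check_swigra_layout /path/to/installation}}} returns a non-zero status if either directory is absent, which makes the helper usable as a guard before starting the server.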

=== Interactive use ===

For interactive use we suggest the web-based interface, which can be accessed in a web browser at the address http://localhost:3333/ after a Świgra server has been started in the following way:

==== On Linux ====

In the parser folder execute the command:

{{{
./swigra -w
}}}

Świgra 2 is used by default. For Świgra 1 add {{{-1}}} to the command line.
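
When the server is started from a script, it can be useful to wait until the web interface actually responds. The following is a minimal sketch under the assumption that {{{curl}}} is installed; the function name {{{wait_for_url}}} and its retry parameters are hypothetical and not part of Świgra.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers successfully
# or the given number of attempts is exhausted.
wait_for_url() {
    url="$1"
    attempts="${2:-10}"   # number of tries; defaults to 10
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP error statuses.
        if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        # Pause between tries, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep 1
        fi
    done
    return 1
}
```

For example, after launching the server in the background with {{{./swigra -w &}}}, a call such as {{{wait_for_url http://localhost:3333/ 30}}} would return once the interface is reachable, assuming the server answers the root address with a success status.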

==== On Windows ====

Execute the command {{{swigra2_web.cmd}}} in the parser folder.
