Diff for "SkładnicaMWE"

Differences between revisions 2 and 10 (spanning 8 versions)
Revision 2 as of 2017-02-20 23:09:05
Size: 3576
Editor: AgataSavary
Comment:
Revision 10 as of 2017-05-17 08:27:54
Size: 2742
Editor: AgataSavary
Comment:
Line 4: Line 4:
SkładnicaMWE is a constituency version of the [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica]] treebank annotated with various types of multiword expressions. It was created within the PhD thesis work by Jakub Waszczuk, and partly funded by the IC1207 COST action [[http://www.parseme.eu|PARSEME]].
SkładnicaMWE is a constituency version of the [[http://zil.ipipan.waw.pl/Składnica|Składnica]] treebank annotated with various types of multiword expressions. It was created within the PhD thesis work by Jakub Waszczuk, and partly funded by the IC1207 COST action [[http://www.parseme.eu|PARSEME]].
Line 8: Line 8:
 * SAVARY, A., WASZCZUK, J., (2017): "Projecting multiword expression resources on a Polish treebank", in the Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017), 4 April 2017, Valencia, Spain.
 * SAVARY, A., WASZCZUK, J., (2017): "[[http://aclweb.org/anthology/W/W17/W17-1404.pdf|Projecting multiword expression resources on a Polish treebank]]", in the Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing ([[http://aclweb.org/anthology/W/W17/W17-1404.pdf|BSNLP 2017]]), 4 April 2017, Valencia, Spain.
Line 10: Line 10:
The annotation was performed by automatically projecting 3 Polish MWE resources:
 * named entity layer of the [http://clip.ipipan.waw.pl/NationalCorpusOfPolish|National Corpus of Polish]],
 *  
The pre-annotation was performed by automatically projecting 3 Polish MWE resources:
 * named entity layer of the [[http://clip.ipipan.waw.pl/NationalCorpusOfPolish|National Corpus of Polish]],
 * [[http://zil.ipipan.waw.pl/SEJF|SEJF]], the grammatical lexicon of Polish nominal, adjectival and adverbial MWEs,
 * [[http://zil.ipipan.waw.pl/Walenty|Walenty]], a Polish valence dictionary with phraseological component (2015 version).
All automatic pre-annotation results were manually validated.
Line 15: Line 17:
 * 1,304 multiword named entities (e.g. ''Rozgłośnia Polska Radia Wolna Europa''),
 * 368 nominal, adjectival and adverbial compounds (e.g. ''prosty jak strzała'', ''wprost proporcjonalny''),
 * 365 verbal MWEs (e.g. ''chcąc nie chcąc'').
 
 * 1,304 multiword named entities (e.g. ''Buenos Aires'', ''Ministerstwo Pracy i Polityki Socjalnej''),
 * 368 nominal, adjectival and adverbial compounds (e.g. ''związki zawodowe'', ''jedyny w swoim rodzaju'', ''przede wszystkim''),
 * 365 verbal MWEs (e.g. ''wejść w życie'', ''pominąć milczeniem'', ''zaleźć za skórę'', ''udzielić rady'').
Line 21: Line 22:
 * [[http://www.uwm.edu.pl/polonistyka/index.php?option=com_content&view=article&id=95&catid=50&Itemid=9|Monika Czerepowicka]] - lexicography
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] - automatic inflection and validation
 * [[http://zil.ipipan.waw.pl/JakubWaszczuk|Jakub Waszczuk]]
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]]
Line 24: Line 25:
== Tools ==
The lexicon has been created within [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]] -- a morphological analyser and generator of Polish,
 * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]] -- a morpho-syntactic generator of multi-word units,
 * graph editor stemming from [[http://igm.univ-mlv.fr/~unitex/|Unitex]].
Line 32: Line 27:
The data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]].
The data are available under the [[https://www.gnu.org/licenses/gpl-3.0.en.html|GPLv3 license]].
Line 36: Line 31:
 * SEJF version 1.1
   * [[attachment:SEJF-1.1-Slownik.tar.gz|Slownik]] -- the binary source file in [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]] format
   * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]]-compatible [[attachment:SEJF-1.1.tar.gz|archive]] including:
     * the list of morphologically annotated lexemes,
     * the list of corresponding inflected forms and variants,
     * inflection graphs compatible with [[http://igm.univ-mlv.fr/~unitex/|Unitex]] graph editor,
     * list of known problems,
     * a README.txt file.

 * SEJF version 1.0 - 3200 multi-word lexemes (2121 nominal, 446 adjectival, 604 adverbial, 43 others), 68,000 corresponding inflected forms, and 160 graph-based inflection paradigms
   * [[attachment:Slownik.tar.gz|Slownik]] -- the binary source file in [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]] format
   * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]]-compatible [[attachment:SEJF.tar.gz|archive]] including:
     * the list of morphologically annotated lexemes,
     * the list of corresponding inflected forms and variants,
     * inflection graphs compatible with [[http://igm.univ-mlv.fr/~unitex/|Unitex]] graph editor,
     * list of known problems,
     * a README.txt file.
 * [[attachment:SkladnicaMWE-1.0.zip|SkładnicaMWE v 1.0]] -- a version of Składnica (containing only the correct parses) with MWE annotations, in a custom XML format. Token identifiers are compatible with the original Składnica corpus.
Line 56: Line 35:
Defining an [[http://www.lexicalmarkupframework.org/|LMF]] format for the lexicon.
 * Repeating the automatic mapping and manual validation with more recent versions of [[http://zil.ipipan.waw.pl/Składnica|Składnica]] and of [[http://zil.ipipan.waw.pl/Walenty|Walenty]].
 * Enhancing the lexicon projection to include more fine-grained syntactic constraints.
 * Enhancing the annotation schema towards a standard format.
 * Linking the MWE occurrences in the treebank with their entries in lexicons.

SkładnicaMWE

SkładnicaMWE is a constituency version of the Składnica treebank annotated with various types of multiword expressions. It was created within the PhD thesis work by Jakub Waszczuk, and partly funded by the IC1207 COST action PARSEME.

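Some aspects of its construction, contents and use have been described in:

  • SAVARY, A., WASZCZUK, J. (2017): "Projecting multiword expression resources on a Polish treebank", in the Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017), 4 April 2017, Valencia, Spain.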

The pre-annotation was performed by automatically projecting 3 Polish MWE resources:

  • named entity layer of the National Corpus of Polish,

  • SEJF, the grammatical lexicon of Polish nominal, adjectival and adverbial MWEs,

  • Walenty, a Polish valence dictionary with phraseological component (2015 version).

All automatic pre-annotation results were manually validated.
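
The idea of projecting a lexicon onto a corpus and then validating the hits by hand can be illustrated with a minimal sketch. It is not the procedure used to build SkładnicaMWE: it assumes a toy lexicon keyed by contiguous lemma sequences and a flat, already lemmatized sentence, whereas the treebank works with constituency trees. The names `LEXICON` and `project` and the category labels are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: lemma sequences mapped to an MWE category.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("wejść", "w", "życie"): "verbal",
    ("przede", "wszystkim"): "compound",
    ("Buenos", "Aires"): "named_entity",
}


def project(lemmas: List[str],
            lexicon: Dict[Tuple[str, ...], str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, category) candidates; `end` is exclusive.

    Only contiguous, exact lemma matches are found. Every candidate is
    meant to be confirmed or rejected by a human annotator afterwards.
    """
    candidates = []
    max_len = max((len(key) for key in lexicon), default=0)
    for start in range(len(lemmas)):
        for length in range(2, max_len + 1):
            key = tuple(lemmas[start:start + length])
            # A slice near the end of the sentence may be shorter than `length`.
            if len(key) == length and key in lexicon:
                candidates.append((start, start + length, lexicon[key]))
    return candidates


# Lemmatized toy sentence (an assumption, not taken from the treebank).
sentence = ["ustawa", "wejść", "w", "życie", "przede", "wszystkim",
            "w", "Buenos", "Aires"]
candidates = project(sentence, LEXICON)
for start, end, category in candidates:
    print(category, " ".join(sentence[start:end]))
```

Such matching over-generates and under-generates (for example, it misses discontinuous verbal MWEs), which is one reason why a manual validation pass follows the automatic step.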

The treebank contains about 2,000 MWE annotations in about 9,000 constituency trees, with the following distribution:

  • 1,304 multiword named entities (e.g. Buenos Aires, Ministerstwo Pracy i Polityki Socjalnej),

  • 368 nominal, adjectival and adverbial compounds (e.g. związki zawodowe, jedyny w swoim rodzaju, przede wszystkim),

  • 365 verbal MWEs (e.g. wejść w życie, pominąć milczeniem, zaleźć za skórę, udzielić rady).
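
As a quick check of the figures above, the three counts can be summed and turned into shares. The snippet uses only the numbers stated in the list; the dictionary keys are just labels chosen for illustration.

```python
# Counts as stated in the distribution list above.
counts = {
    "multiword named entities": 1304,
    "nominal, adjectival and adverbial compounds": 368,
    "verbal MWEs": 365,
}

total = sum(counts.values())  # exact total behind "about 2,000"
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The exact total is 2,037 annotations, of which roughly 64% are named entities and about 18% each are compounds and verbal MWEs.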

Authors

  • Jakub Waszczuk

  • Agata Savary

License

The data are available under the GPLv3 license.

Available resources

  • SkładnicaMWE v 1.0 -- a version of Składnica (containing only the correct parses) with MWE annotations, in a custom XML format. Token identifiers are compatible with the original Składnica corpus.

Future work

  • Repeating the automatic mapping and manual validation with more recent versions of Składnica and of Walenty.

  • Enhancing the lexicon projection to include more fine-grained syntactic constraints.
  • Enhancing the annotation schema towards a standard format.
  • Linking the MWE occurrences in the treebank with their entries in lexicons.