SkładnicaMWE
SkładnicaMWE is a constituency version of the [[http://zil.ipipan.waw.pl/Składnica|Składnica]] treebank annotated with various types of multiword expressions (MWEs). It was created as part of Jakub Waszczuk's PhD thesis work and was partly funded by the IC1207 COST action [[http://www.parseme.eu|PARSEME]].
Some aspects of its construction, contents and use have been described in:
- SAVARY, A., WASZCZUK, J. (2017): "Projecting multiword expression resources on a Polish treebank", in the Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017), 4 April 2017, Valencia, Spain.
The annotation was performed by automatically projecting 3 Polish MWE resources:
the named entity layer of the [[http://clip.ipipan.waw.pl/NationalCorpusOfPolish|National Corpus of Polish]],
The treebank contains about 2,000 MWE annotations in about 9,000 constituency trees, with the following distribution:
1,304 multiword named entities (e.g. Rozgłośnia Polska Radia Wolna Europa),
368 nominal, adjectival and adverbial compounds (e.g. prosty jak strzała, wprost proporcjonalny),
365 verbal MWEs (e.g. chcąc nie chcąc).
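As a quick sanity check of the figures above, the three category counts can be summed and turned into shares of the total. This is only an illustrative sketch: the dictionary and the `shares` helper are hypothetical names introduced here, and the numbers are copied from the list above rather than computed from the treebank itself.

```python
# Category counts copied from the distribution listed above
# (not read from the treebank files).
mwe_counts = {
    "multiword named entities": 1304,
    "nominal, adjectival and adverbial compounds": 368,
    "verbal MWEs": 365,
}

# Overall number of MWE annotations implied by the three categories.
total = sum(mwe_counts.values())


def shares(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    whole = sum(counts.values())
    return {name: round(100 * n / whole, 1) for name, n in counts.items()}


if __name__ == "__main__":
    print(f"total MWE annotations: {total}")
    for name, pct in shares(mwe_counts).items():
        print(f"{name}: {pct}%")
```

The total comes to 2,037, which agrees with the "about 2,000" stated above; named entities account for roughly two thirds of the annotations.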
Authors
Monika Czerepowicka - lexicography
Agata Savary - automatic inflection and validation
Tools
The lexicon was created with Toposław, a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:
Morfeusz SGJP -- a morphological analyser and generator of Polish,
Multiflex -- a morpho-syntactic generator of multi-word units,
a graph editor stemming from Unitex.
License
The data are available under the CC BY-SA license.
Available resources
- SEJF version 1.1
- SEJF version 1.0 - 3,200 multi-word lexemes (2,121 nominal, 446 adjectival, 604 adverbial, 43 others), 68,000 corresponding inflected forms, and 160 graph-based inflection paradigms
Future work
Defining an LMF format for the lexicon.