Line 4: | Line 4: |
= Słownik elektroniczny języka polskiego dla wyrażeń frazeologicznych = | The Grammatical Lexicon of Polish Phraseology (SEJF = Słownik Elektroniczny Jednostek Frazeologicznych) is an electronic lexicon containing multi-word units (mainly nominal, adjectival and adverbial compounds) of the general (non-terminological) Polish language. It was created within the ERDF [[http://zil.ipipan.waw.pl/NEKST|Nekst]] project. Some aspects of its construction, contents and use have been described in: |
Line 6: | Line 8: |
The Gazetteer for Polish Named Entities was used within the ''[[http://sprout.dfki.de/|SProUT]]'' platform, initially for information extraction from Polish texts, and then for the automatic pre-annotation of the ''[[http://nkjp.pl/index.php?page=0&lang=1 |National Corpus of Polish]]'' (NKJP) on the level of named entities. Its construction, contents and use have been described in: | * GRALIŃSKI, F., SAVARY, A., CZEREPOWICKA, M., MAKOWIECKI, F. (2010): ''[[http://multiword.sourceforge.net/CONF_30_MWE_2010___lb__COLING__rb__/CONF_50_Online_Proceedings/pdf/MWE01.pdf|Computational Lexicography of Multi-Word Units: How Efficient Can It Be?]]'', in Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), Workshop at COLING 2010, Beijing, China, August 28. * CZEREPOWICKA, M., KOSEK, I. (2011): ''Problemy opisu związków frazeologicznych w formalizmie „Multifleks” (na przykładzie rodzaju wyrażeń frazeologicznych)'', in Kopcińska, D., Bańko, M. (eds.) "Różne formy, różne treści", pp. 117–126, Warszawa 2011. * CZEREPOWICKA, M. (2011): ''„Toposław” jako narzędzie znakowania jednostek wieloczłonowych'', in Matusiak-Kempa, I., Przybyszewski, S. (eds.) Nowe zjawiska w języku, tekście, komunikacji. Kontekst a komunikacja, Olsztyn, pp. 28–35. * CZEREPOWICKA, M. (2014): ''Jednostki obce w słowniku języka polskiego na przykładzie „Słownika elektronicznego jednostek frazeologicznych” (SEJF)'', in [[http://www.akademicka.pl/index.php?detale=1&e=1&a=1&id=100000087|LingVaria (IX), vol. 1 (17)]], pp. 59-68 [doi: 10.12797/LV.09.2014.17.04]. * CZEREPOWICKA, M. (2014): ''SEJF - Słownik elektroniczny jednostek frazeologicznych'', in [[http://www.jezyk-polski.pl|Język Polski]] (XCIV), v. 2, pp. 116-129. |
Line 8: | Line 14: |
* SAVARY, A., PISKORSKI, J. (2011). ''Language Resources for Named Entity Annotation in the National Corpus of Polish'', to appear in Control and Cybernetics. * SAVARY, A., PISKORSKI, J. (2010). ''[[http://iis.ipipan.waw.pl/2010/proceedings/iis10-15.pdf|Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish]]'', in Proceedings of the 18th International Conference Intelligent Information Systems (IIS'10), Siedlce, Poland. * PISKORSKI, J. (2005). ''Named-Entity Recognition for Polish with SProUT'', in LNCS Vol 3490: Proceedings of IMTCI 2004, Warsaw, Poland. |
The lexicon contains about 3200 multi-word lexemes, 68,000 corresponding inflected forms, and 160 graph-based inflection paradigms, with the following distribution: * 2121 nominal compounds (e.g. ''bajońskie sumy''), * 446 adjectival compounds (e.g. ''prosty jak strzała'', ''wprost proporcjonalny''), * 604 adverbial compounds (e.g. ''chcąc nie chcąc''), * 43 others (e.g. ''ni z gruszki, ni z pietruszki''). |
Line 12: | Line 20: |
The file contains 153,477 inflected entries of Polish (and some foreign) proper names and named entity components: * forenames and surnames, * city, country, mountain, region and river names, * institution names, * relational adjectives and inhabitant names stemming from country names, * named entity triggers (months, days, positions, etc.). The file DOES NOT contain inhabitant names and relational adjectives stemming from Polish settlements. These data, owned by the PWN publisher, were used within the NKJP project under a particular licence and are subject to copyright. |
== Authors == * [[http://www.uwm.edu.pl/polonistyka/index.php?option=com_content&view=article&id=95&catid=50&Itemid=9|Monika Czerepowicka]] - lexicography * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] - automatic inflection and validation |
Line 21: | Line 24: |
== Authors == * Agata Savary <<MailTo(agata DOT savary AT SPAMFREE univ-tours DOT fr)>> - NKJP version of the gazetteer; LMF format definition * Michał Lenart <<MailTo(michal DOT lenart AT SPAMFREE gmail DOT com)>> - LMF conversion and validation * Jakub Piskorski <<MailTo(jakub DOT piskorski AT SPAMFREE ipipan DOT waw DOT pl)>> - earlier version of the gazetteer used for information extraction from Polish texts |
== Tools == The lexicon was created within [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates: * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]] -- a morphological analyser and generator of Polish, * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]] -- a morpho-syntactic generator of multi-word units, * a graph editor derived from [[http://igm.univ-mlv.fr/~unitex/|Unitex]]. |
Line 28: | Line 32: |
The data are available under the [[http://en.wikipedia.org/wiki/BSD_licenses#2-clause_license_.28.22Simplified_BSD_License.22_or_.22FreeBSD_License.22.29|2-clause BSD licence]]. | The data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]]. |
Line 32: | Line 36: |
* [[attachment:gazetteer-nkjp-no-pwn.zip|Text version]] as used with Sprout for NKJP pre-annotation | * SEJF v 1.0 * [[attachment:Slownik.tar.gz|Slownik]] -- the binary source file in [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]] format * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]]-compatible [[attachment:SEJF.tar.gz|archive]] containing: * the list of morphologically annotated lexemes, * the list of corresponding inflected forms and variants, * inflection graphs compatible with [[http://igm.univ-mlv.fr/~unitex/|Unitex]] graph editor, * list of known problems. |
Line 34: | Line 44: |
* [[attachment:PNEG-LMF-v1.tar.gz|LMF-compliant version]] containing: * LMF format definition and conversion guidelines, * Relax NG schema, morphosyntax configuration file and validation scripts, * grammatically complete gazetteer entries (9,060 lemmas and 95,359 word forms), * grammatically incomplete gazetteer entries (35,884 lemmas and 40,612 word forms). |
== Future work == Defining an [[http://www.lexicalmarkupframework.org/|LMF]] format for the lexicon. |
Grammatical Lexicon of Polish Phraseology
The Grammatical Lexicon of Polish Phraseology (SEJF = Słownik Elektroniczny Jednostek Frazeologicznych) is an electronic lexicon containing multi-word units (mainly nominal, adjectival and adverbial compounds) of the general (non-terminological) Polish language. It was created within the ERDF Nekst project.
Some aspects of its construction, contents and use have been described in:
- GRALIŃSKI, F., SAVARY, A., CZEREPOWICKA, M., MAKOWIECKI, F. (2010): Computational Lexicography of Multi-Word Units: How Efficient Can It Be?, in Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), Workshop at COLING 2010, Beijing, China, August 28.
- CZEREPOWICKA, M., KOSEK, I. (2011): Problemy opisu związków frazeologicznych w formalizmie „Multifleks” (na przykładzie rodzaju wyrażeń frazeologicznych), in Kopcińska, D., Bańko, M. (eds.) "Różne formy, różne treści", pp. 117–126, Warszawa 2011.
- CZEREPOWICKA, M. (2011): „Toposław” jako narzędzie znakowania jednostek wieloczłonowych, in Matusiak-Kempa, I., Przybyszewski, S. (eds.) Nowe zjawiska w języku, tekście, komunikacji. Kontekst a komunikacja, Olsztyn, pp. 28–35.
- CZEREPOWICKA, M. (2014): Jednostki obce w słowniku języka polskiego na przykładzie „Słownika elektronicznego jednostek frazeologicznych” (SEJF), in LingVaria (IX), vol. 1 (17), pp. 59–68 [doi: 10.12797/LV.09.2014.17.04].
- CZEREPOWICKA, M. (2014): SEJF - Słownik elektroniczny jednostek frazeologicznych, in Język Polski (XCIV), vol. 2, pp. 116–129.
The lexicon contains about 3200 multi-word lexemes, 68,000 corresponding inflected forms, and 160 graph-based inflection paradigms, with the following distribution:
- 2121 nominal compounds (e.g. bajońskie sumy),
- 446 adjectival compounds (e.g. prosty jak strzała, wprost proporcjonalny),
- 604 adverbial compounds (e.g. chcąc nie chcąc),
- 43 others (e.g. ni z gruszki, ni z pietruszki).
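As a quick consistency check on the figures above, the following minimal sketch sums the four category counts and derives each category's share, plus a rough average number of inflected forms per lexeme. The variable names are illustrative only, and the 68,000 figure is the approximate value quoted in the description, so the average is an estimate rather than an exact statistic of the lexicon.

```python
# Category counts as quoted in the lexicon description.
counts = {
    "nominal": 2121,
    "adjectival": 446,
    "adverbial": 604,
    "other": 43,
}

# Approximate number of inflected forms quoted in the description.
INFLECTED_FORMS = 68_000

# Total number of multi-word lexemes across all categories.
total = sum(counts.values())

# Percentage share of each category in the whole lexicon.
shares = {name: 100 * n / total for name, n in counts.items()}

# Rough average; only as precise as the rounded 68,000 figure.
forms_per_lexeme = INFLECTED_FORMS / total

print(f"total lexemes: {total}")
for name, share in shares.items():
    print(f"{name:<11}{counts[name]:>5}  {share:5.1f}%")
print(f"inflected forms per lexeme (approx.): {forms_per_lexeme:.1f}")
```

The total comes to 3214, which matches the "about 3200" quoted in the description, and nominal compounds account for roughly two thirds of the entries.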
Authors
- Monika Czerepowicka - lexicography
- Agata Savary - automatic inflection and validation
Tools
The lexicon was created within Toposław, a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:
- Morfeusz SGJP -- a morphological analyser and generator of Polish,
- Multiflex -- a morpho-syntactic generator of multi-word units,
- a graph editor derived from Unitex.
License
The data are available under the CC BY-SA license.
Available resources
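- SEJF v 1.0
  - Slownik (Slownik.tar.gz) -- the binary source file in Toposław format
  - Multiflex-compatible archive (SEJF.tar.gz) containing:
    - the list of morphologically annotated lexemes,
    - the list of corresponding inflected forms and variants,
    - inflection graphs compatible with the Unitex graph editor,
    - a list of known problems.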
Future work
Defining an LMF format for the lexicon.