Prolexbase
Prolexbase is a multilingual relational dictionary of proper names, conceived at the University of Tours, France, and further developed at the University of Belgrade, Serbia, and at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy, accessibility, variation, etc.). A pivot-oriented design of concepts aligns the proper names of one language with their counterparts in other languages. A large majority of the data have been extracted from Wikipedia. All data have been manually validated.
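The pivot-oriented design can be illustrated with a minimal sketch. It is not the actual Prolexbase schema: the class `Pivot`, the function `counterparts`, the type label "city" and the sample entries are all hypothetical, chosen only to show how names in different languages can be aligned through a shared language-independent concept.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Pivot:
    """Hypothetical language-independent concept that names attach to."""
    pivot_id: int
    type_name: str  # illustrative type label, not taken from the real typology
    names: dict = field(default_factory=lambda: defaultdict(list))

    def add_name(self, lang, name):
        # Attach a language-specific name to this pivot.
        self.names[lang].append(name)


def counterparts(pivots, name, source_lang, target_lang):
    """Return the target-language names that share a pivot with `name`."""
    result = []
    for pivot in pivots:
        if name in pivot.names.get(source_lang, []):
            result.extend(pivot.names.get(target_lang, []))
    return result


# Sample data, assumed for illustration only.
paris = Pivot(1, "city")
paris.add_name("fr", "Paris")
paris.add_name("pl", "Paryż")
paris.add_name("en", "Paris")

warsaw = Pivot(2, "city")
warsaw.add_name("fr", "Varsovie")
warsaw.add_name("pl", "Warszawa")
warsaw.add_name("en", "Warsaw")

pivots = [paris, warsaw]
print(counterparts(pivots, "Warszawa", "pl", "fr"))  # ['Varsovie']
```

Because every name points to a pivot rather than directly to its translations, adding a new language only requires attaching its names to existing pivots, not linking them pairwise to every other language.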
Prolexbase creation has been supported by the following projects:
- Technolangue programme from the French Ministry of Industry (2003-2005),
- Egide Pavle-Savic programme from the Serbian Ministry of Science, the French Ministry of Foreign Affairs and the French Ministry of Research,
- ERDF Nekst project.
Some aspects of its construction, contents and use have been described in:
- Savary, A., Manicki, L., Baron, M.: ProlexFeeder – Populating a Multilingual Ontology of Proper Names from Open Sources. Submitted to Journal of Language Modelling.
- Bouchou, B., Maurel, D. (2008): Prolexbase et LMF : vers un standard pour les ressources lexicales sur les noms propres (http://www.atala.org/IMG/pdf/TAL-2008-49-1-04-Bouchou.pdf). In Traitement Automatique des Langues, 49(1).
- Maurel, D. (2008): Prolexbase: a Multilingual Relational Lexical Database of Proper Names. In Proceedings of LREC 2008, Marrakech, Morocco.
- Tran, M., Maurel, D. (2006): Prolexbase. Un dictionnaire relationnel multilingue de noms propres. In Traitement Automatique des Langues, 47(3).
- Krstev, S., Vitas, D., Maurel, D., Tran, M. (2005): Multilingual Ontology of Proper Names. In Second Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznań, Poland.
Currently, the resource contains the following interlinked data:
- 40,000 Polish proper names and their corresponding 165,000 inflected forms,
- 33,000 English proper names and their corresponding 18,000 inflected forms,
- 100,000 French proper names and their corresponding 142,393 inflected forms.
See also Prolexbase on CNRTL (http://www.cnrtl.fr/lexiques/prolex/) for a previous version of the French data, serialized in an LMF standard format.
Authors
- Filip Makowiecki – lexicography
- Agata Savary – automatic inflection and validation
Tools
The lexicon has been created within Toposław, a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:
- Morfeusz SGJP – a morphological analyser and generator of Polish,
- Multiflex – a morpho-syntactic generator of multi-word units,
- a graph editor derived from Unitex.
License
The data are available under the CC BY-SA license.
Available resources
A Multiflex-compatible archive containing:
- the list of morphologically annotated lexemes,
- the list of corresponding inflected forms and variants,
- inflection graphs compatible with the Unitex graph editor,
- a list of known problems.
Future work
Defining an LMF format for the lexicon.