Diff for "Prolexbase"

Differences between revisions 1 and 2
Revision 1 as of 2013-01-22 21:51:33
Size: 4210
Editor: AgataSavary
Comment:
Revision 2 as of 2013-01-22 21:54:54
Size: 4318
Editor: AgataSavary
Comment:

Prolexbase

Prolexbase is a multilingual relational dictionary of proper names, conceived at the University of Tours, France, and further developed at the University of Belgrade, Serbia, and at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy, accessibility, variation etc.). A pivot-oriented design of concepts aligns proper names in one language with their counterparts in other languages. Currently, the resource contains about 40,000 Polish, 33,000 English and 100,000 French proper names together with their corresponding 165,000 Polish, 18,000 English and 142,393 French inflected forms. A large majority of the data have been extracted from Wikipedia. All data have been manually validated.
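
The pivot-oriented alignment described above can be illustrated with a minimal Python sketch. It is not the actual Prolexbase schema: the ProperName class, the pivot_id and language fields, the helper functions and the sample entries are all assumptions made for illustration. The idea shown is only that names in different languages are linked through a shared language-independent pivot rather than directly to each other.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProperName:
    """A proper name in one language, attached to a language-independent pivot.

    All field names are hypothetical; they do not mirror the real database.
    """
    pivot_id: int   # assumed identifier of the language-independent concept
    language: str   # assumed language code, e.g. "pl", "en", "fr"
    lemma: str


def build_pivot_index(names):
    """Group names by pivot: pivot_id -> language -> list of lemmas."""
    index = defaultdict(lambda: defaultdict(list))
    for name in names:
        index[name.pivot_id][name.language].append(name.lemma)
    return index


def counterparts(index, pivot_id, target_language):
    """Return the names in target_language that share the given pivot."""
    return list(index.get(pivot_id, {}).get(target_language, []))


# Illustrative entries only; they are not taken from Prolexbase.
names = [
    ProperName(1, "pl", "Warszawa"),
    ProperName(1, "en", "Warsaw"),
    ProperName(1, "fr", "Varsovie"),
    ProperName(2, "pl", "Paryż"),
    ProperName(2, "fr", "Paris"),
]
index = build_pivot_index(names)
```

With this layout, adding a new language only requires attaching its names to existing pivots; no pairwise links between languages are needed. For example, counterparts(index, 1, "fr") looks up the French names attached to pivot 1.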

Prolexbase creation has been supported by national and international projects:

  • Technolangue programme from the French Ministry of Industry (2003-2005),

  • Egide Pavle-Savic programme from the Serbian Ministry of Science, the French Ministry of Foreign Affairs and the French Ministry of Research,

  • ERDF Nekst project,

  • European (CIP ICT-PSP) CESAR project, part of META-NET.

Some aspects of its construction, contents and use have been described in:

  • GRALIŃSKI, F., SAVARY, A., CZEREPOWICKA, M., MAKOWIECKI, F. (2010): Computational Lexicography of Multi-Word Units: How Efficient Can It Be?, in Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), Workshop at COLING 2010, Beijing, China, August 28.

  • SAVARY, A., ZABOROWSKI, B., KRAWCZYK-WIECZOREK, A., MAKOWIECKI, F. (2012): SEJFEK — a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units, in Proceedings of Cognitive Aspects of the Lexicon (COGALEX-III), a Workshop at COLING 2012, Mumbai, India.

The lexicon contains:

  • 11,212 multi-word nominal lexemes (e.g. aktywne ryzyko płynności),

  • 146,861 corresponding inflected forms (e.g. aktywnego ryzyka płynności),

  • 305 graph-based inflection paradigms.

See also SEJFEK4Spejd – a shallow grammar for Spejd with fully lexicalized rules automatically generated from SEJFEK lexicon entries.

Authors

  • Filip Makowiecki – lexicography
  • Agata Savary – automatic inflection and validation

Tools

The lexicon has been created within Toposław, a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:

  • Morfeusz SGJP – a morphological analyser and generator of Polish,

  • Multiflex – a morpho-syntactic generator of multi-word units,

  • a graph editor stemming from Unitex.

License

The data are available under the CC BY-SA license.

Available resources

  • Slownik – the binary source file in Toposław format

  • Multiflex-compatible archive containing:

    • the list of morphologically annotated lexemes,
    • the list of corresponding inflected forms and variants,
    • inflection graphs compatible with the Unitex graph editor,

    • a list of known problems.

Future work

Defining an LMF format for the lexicon.