
Diff for "Prolexbase"

Differences between revisions 6 and 7
Revision 6 as of 2013-01-22 22:24:43
Size: 4330
Editor: AgataSavary
Comment:
Revision 7 as of 2013-01-22 22:44:00
Size: 4527
Editor: AgataSavary
Comment:
Line 2: Line 2:
Revision 6: = Prolexbase =
Revision 7: = Prolexbase 2.0 =
Line 4: Line 4:
Revision 6: Prolexbase is a multilingual relational dictionary of proper names, conceived at the University of Tours, France and further developed at the University of Belgrade, Serbia, and at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy accessibility, variation etc.). A pivot-oriented design of concepts yields alignment of proper names in a language with their counterparts if other languages. A large majority of the data have been extracted from Wikipedia. All data have been manually validated.
Revision 7: Prolexbase 2.0 is a multilingual relational dictionary of proper names, conceived initially at the University of Tours, France and at the University of Belgrade, Serbia, and further developed at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy accessibility, variation etc.). A pivot-oriented design of concepts yields alignment of proper names in a language with their counterparts if other languages. A large majority of the data have been extracted from Wikipedia. ''All data have been manually validated''.
Line 13: Line 13:
Line 20: Line 19:
Revision 6: Currently, the resource counts the following interlinked data:
Revision 7: Currently, version 2.0 of the resource contains the following interlinked data:
 * 67,000 languge-independent pivots,
Line 24: Line 24:
 * 65,500 relations.
Line 28: Line 29:
 * Filip Makowiecki – lexicography
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] – automatic inflection and validation
 
== Tools ==
The lexicon has been created within [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], tool for developping and managing inflectional dictionaries of multi-word units. Toposław integrates:
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]] – a morphological analyser and generator of Polish,
 * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]] – a morpho-syntactic generator of multi-word units,
 * graph editor stemming from [[http://igm.univ-mlv.fr/~unitex/|Unitex]].
 * Małgorzata Baron - lexicography,
 * [[http://www.info.univ-tours.fr/~bouchou/index_a.html|Béatrice Bouchou Markhoff]] - LMF format design,
 * Pierre-François Laurand - server administration,
 * Leszek Manicki - design and implementation of ProlexFeeder (Prolexbase population from Wikipedia),
 * [[http://www.univ-tours.fr/acces-rapide/m-maurel-denis-84407.kjsp|Denis Maurel]] - design and dissemination, project management,
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] – project manager for the Polish and English modules,
 * Mickaël Tran - design and implementation.
Line 39: Line 39:
Revision 6: The data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]].
Revision 7: The data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]], i.e. the same as for [[http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License|Wikipedia]].

Prolexbase 2.0

Prolexbase 2.0 is a multilingual relational dictionary of proper names, conceived initially at the University of Tours, France, and at the University of Belgrade, Serbia, and further developed at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy, accessibility, variation, etc.). A pivot-oriented design of concepts aligns proper names in one language with their counterparts in other languages. A large majority of the data have been extracted from Wikipedia. All data have been manually validated.
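
The pivot-oriented design described above can be illustrated with a minimal Python sketch. It is not the actual Prolexbase schema: the class names (Pivot, PivotLexicon), the method names, the pivot identifier and the type label "city" are all hypothetical, chosen only to show how a language-independent pivot can link names across languages.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Pivot:
    """A language-independent concept (hypothetical structure, not the real schema)."""
    pivot_id: int
    type_name: str  # illustrative label standing in for one of the 34 types
    # Maps a language code to the proper names attached to this pivot.
    names: dict = field(default_factory=lambda: defaultdict(list))


class PivotLexicon:
    """Toy lexicon in which names in any language meet at a shared pivot."""

    def __init__(self):
        self._pivots = {}  # pivot_id -> Pivot
        self._index = {}   # (language, name) -> pivot_id

    def add_name(self, pivot_id, type_name, lang, name):
        # Create the pivot on first use, then attach the name to it.
        pivot = self._pivots.setdefault(pivot_id, Pivot(pivot_id, type_name))
        pivot.names[lang].append(name)
        self._index[(lang, name)] = pivot_id

    def counterparts(self, name, source_lang, target_lang):
        # Go from the name to its pivot, then to the names in the target language.
        pivot_id = self._index.get((source_lang, name))
        if pivot_id is None:
            return []
        return list(self._pivots[pivot_id].names.get(target_lang, []))


lexicon = PivotLexicon()
lexicon.add_name(1, "city", "fr", "Paris")
lexicon.add_name(1, "city", "pl", "Paryż")
lexicon.add_name(1, "city", "en", "Paris")
print(lexicon.counterparts("Paryż", "pl", "fr"))
```

Because every name is attached to a pivot rather than to a name in another language, adding a new language in this sketch only requires linking its names to existing pivots, not to each language already present.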

Prolexbase creation has been supported by the following projects:

  • Technolangue programme from the French Ministry of Industry (2003-2005),

  • Egide Pavle-Savic programme from the Serbian Ministry of Science, the French Ministry of Foreign Affairs and the French Ministry of Research,

  • ERDF Nekst project,

  • European (CIP ICT-PSP) CESAR project, part of META-NET.

Some aspects of its construction, contents and use have been described in:

Currently, version 2.0 of the resource contains the following interlinked data:

  • 67,000 language-independent pivots,
  • 40,000 Polish proper names and their corresponding 165,000 inflected forms,
  • 33,000 English proper names and their corresponding 18,000 inflected forms,
  • 100,000 French proper names and their corresponding 142,393 inflected forms,
  • 65,500 relations.

See also Prolexbase on CNRTL for a previous version of the French data, serialized in an LMF standard format.

Authors

  • Małgorzata Baron - lexicography,
  • Béatrice Bouchou Markhoff - LMF format design,

  • Pierre-François Laurand - server administration,
  • Leszek Manicki - design and implementation of ProlexFeeder (Prolexbase population from Wikipedia),

  • Denis Maurel - design and dissemination, project management,

  • Agata Savary - project manager for the Polish and English modules,

  • Mickaël Tran - design and implementation.

License

The data are available under the CC BY-SA license, i.e. the same as for Wikipedia.

Available resources

  • Slownik – the binary source file in Toposław format

  • Multiflex-compatible archive containing:

    • the list of morphologically annotated lexemes,
    • the list of corresponding inflected forms and variants,
    • inflection graphs compatible with the Unitex graph editor,

    • a list of known problems.

Future work

Defining an LMF format for the lexicon.