Diff for "Prolexbase"

Differences between revisions 5 and 56 (spanning 51 versions)
Revision 5 as of 2013-01-22 22:19:09
Size: 4515
Editor: AgataSavary
Comment:
Revision 56 as of 2015-02-11 19:37:25
Size: 6086
Editor: AgataSavary
Comment:
Line 2: Line 2:
= Prolexbase = = Prolexbase 2.0 and 2.2 =
Line 4: Line 4:
Prolexbase is a multilingual relational dictionary of proper names, conceived at the University of Tours, France and further developed at the University of Belgrade, Serbia, and at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy accessibility, variation etc.). A pivot-oriented design of concepts yields alignment of proper names in a language with their counterparts if other languages. Currently, the resources counts about 40,000 Polish, 33,000 English and 100,000 French proper names together with their corresponding 165,000 Polish, 18,000 English and 142,393 English inflected forms. A large majority of the data have been extracted from Wikipedia. All data have been manually validated. Prolexbase 2.0 and 2.2 is a multilingual relational dictionary of proper names, conceived initially at the University of Tours, France and at the University of Belgrade, Serbia, and further developed at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy accessibility, variation etc.). A pivot-oriented design of concepts yields alignment of proper names in a language with their counterparts if other languages. A large majority of the data have been extracted from [[http://pl.wikipedia.org|Wikipedia]] and [[http://www.geonames.org/|GeoNames]]. ''All data have been manually validated''.
Line 8: Line 8:
 * Egide [[http://tln.li.univ-tours.fr/Tln_Pavle_Savic.html|Pavle-Savic]] programme from the Serbian Ministry of Science, the French Ministry of Foreign Affairs and the French Ministry of Research,
 * ERDF [[http://zil.ipipan.waw.pl/NEKST|Nekst]] project,
 * European (CIP ICT-PSP) [[http://clip.ipipan.waw.pl/CESAR|CESAR]] project, part of [[http://www.meta-net.eu/|META-NET]].
 * Egide [[http://tln.li.univ-tours.fr/Tln_Pavle_Savic.html|Pavle-Savic]] programme from the French Ministry of Foreign Affairs, the French Ministry of Research, and the Serbian Ministry of Science (2004-2005),
 * ERDF [[http://zil.ipipan.waw.pl/NEKST|Nekst]] project (2009-2014),
 * European (CIP ICT-PSP) [[http://clip.ipipan.waw.pl/CESAR|CESAR]] project, part of [[http://www.meta-net.eu/|META-NET]] (2011-2013).
Line 12: Line 12:
Some aspects of its construction, contents and use have been described in: The construction and contents of Prolexbase have been described in:
 * Savary, A., Manicki, L., Baron, M. (2013): [[http://jlm.ipipan.waw.pl/index.php/JLM/article/view/63|Populating a Multilingual Ontology of Proper Names from Open Sources]]. In Journal of Language Modelling, Vol 2, No. 2, pp. 189-225.
 * Savary, A., Manicki, L., Baron, M. (2013): [[http://www.info.univ-tours.fr/~savary/Papers/sav-man-bar-rapport-306.pdf|ProlexFeeder - Populating a Multilingual Ontology of Proper Names from Open Sources]]. Technical Report 306, Laboratoire d'Informatique, Université François Rabelais Tours, France.
 * Maurel, D., Bouchou, B. (2013): ''Prolmf, a Multilingual Dictionary of Proper Names and their Relations''. In Gil Francopoulo (ed.), LMF: Lexical Markup Framework, theory and practice, Iste-Wiley, pp. 67-81.
 * Spędzia, M., Maurel, D., Savary, A. (2011): ''[[attachment:prolexbase.documentation.pdf|Multilingual Relational Database of Proper Names: Prolexbase Documentation]]''. Technical report #297, Laboratoire d'informatique, Université François Rabelais Tours.
 * Bouchou, B., Maurel, D. (2008): ''[[http://www.atala.org/IMG/pdf/TAL-2008-49-1-04-Bouchou.pdf|Prolexbase et LMF : vers un standard pour les ressources lexicales sur les noms propres]]''. In Traitement Automatique des Langues, 49(1).
 * Maurel, D. (2008): ''[[http://www.lrec-conf.org/proceedings/lrec2008/summaries/91.html|Prolexbase: a Multilingual Relational Lexical Database of Proper Names]]''. In proceedings of LREC 2008, Marrakech, Morocco.
 * Tran, M., Maurel, D. (2006): ''[[http://www.atala.org/IMG/pdf/TAL-2006-47-3-06-Tran.pdf|Prolexbase. Un dictionnaire relationnel multilingue de noms propres]]''. In Traitement Automatique des Langues, 47(3).
 * Krstev S., Vitas D., Maurel D., Tran M. (2005): ''Multilingual Ontology of Proper Names''. In Second Language & Technology Conference (LTC'05), Poznań, Poland.
Line 14: Line 22:
 * Savary, A., Manicki, L., Baron, M.: [[http://|ProlexFeeder— Populating a Multilingual Ontology of Proper Names from Open Sources]]. Submitted to Journal of Language Modelling.
 * Bouchou, B., Maurel, D. (2008): [[http://www.atala.org/IMG/pdf/TAL-2008-49-1-04-Bouchou.pdf|Prolexbase et LMF : vers un standard pour les ressources lexicales sur les noms propres]]. In Traitement Automatique des Langues, 49(1).
 * Maurel, D. (2008): [[http://www.lrec-conf.org/proceedings/lrec2008/|Prolexbase: a Multilingual Relational Lexical Database of Proper Names]]. In proceedings of LREC 2008, Marrakech, Morocco.
 * Tran, M., Maurel, D. (2006): [[http://www.atala.org/IMG/pdf/TAL-2006-47-3-06-Tran.pdf|Prolexbase. Un dictionnaire relationnel multilingue de noms propres]]. In Traitement Automatique des Langues, 47(3).
 * Krstev S., Vitas D., Maurel D., Tran M. (2005), Multilingual Ontology of Proper Names, Second Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland.
Prolexbase 2.0 and 2.2 contains the following interlinked data:
 * 67,000 language-independent pivots,
 * 40,000 Polish proper names and their corresponding 165,000 inflected forms,
 * 33,000 English proper names and their corresponding 18,000 inflected forms,
 * 100,000 French proper names and their corresponding 142,000 inflected forms,
 * 65,500 relations.
Line 20: Line 29:
The lexicon contains:
 * 11,212 multi-word nominal lexemes (e.g. ''aktywne ryzyko płynności''),
 * 146,861 corresponding inflected forms (e.g. ''aktywnego ryzyka płynności''),
 * 305 graph-based inflection paradigms.

See also [[http://zil.ipipan.waw.pl/SEJFEK4Spejd|SEJFEK4Spejd]] – a shallow grammar for [[http://zil.ipipan.waw.pl/Spejd|Spejd]] with fully lexicalized rules automatically generated from SEJFEK lexicon entries.
See also [[http://www.cnrtl.fr/lexiques/prolex/|Prolexbase on CNRTL]] for a proofread version (2.2), serialized in an LMF standard format.
Line 28: Line 32:
 * Filip Makowiecki – lexicography
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] – automatic inflection and validation
 
 * Małgorzata Baron - lexicography,
 * [[http://www.info.univ-tours.fr/~bouchou/index_a.html|Béatrice Bouchou Markhoff]] - LMF format design,
 * Leszek Manicki - design and implementation of Prolexbase population from Wikipedia,
 * [[http://www.univ-tours.fr/acces-rapide/m-maurel-denis-84407.kjsp|Denis Maurel]] - design and dissemination, project management,
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] – project management,
 * Mickaël Tran - database design and implementation.
 * [[http://poincare.matf.bg.ac.rs/~vitas//index-en.html|Duško Vitas]] - design and management of Serbian data.
Line 32: Line 41:
The lexicon has been created within [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], tool for developping and managing inflectional dictionaries of multi-word units. Toposław integrates:
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]] – a morphological analyser and generator of Polish,
 * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]] – a morpho-syntactic generator of multi-word units,
 * graph editor stemming from [[http://igm.univ-mlv.fr/~unitex/|Unitex]].
 * [[http://jlm.ipipan.waw.pl/index.php/JLM/article/view/63|ProlexFeeder]], a tool for semi-automatic population of Prolexbase from open sources, notably Wikipedia and Geonames,
 * [[http://www.translatica.pl/|Translatica]]'s automatic inflection tool for multi-word units.
Line 39: Line 46:
The data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]]. All Prolexbase 2.0 data are available under the [[http://creativecommons.org/licenses/by-sa/3.0/|CC BY-SA license]], i.e. the same as for [[http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License|Wikipedia]] and [[http://www.geonames.org/|GeoNames]].
The Prolexbase 2.2 data are available under the [[http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html|LGPL-LR license]].
Line 43: Line 51:
 * [[attachment:Slownik.tar.gz|Slownik]] – the binary source file in [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]] format
 * [[http://www.springerlink.com/content/n265j22n73084433/|Multiflex]]-compatible [[attachment:SEJFEK.tar.gz|archive]] containing:
   * the list of morphologically annotated lexemes,
   * the list of corresponding inflected forms and variants,
   * inflection graphs compatible with [[http://igm.univ-mlv.fr/~unitex/|Unitex]] graph editor,
   * list of known problems.
For version 2.2
 * XML serialisation in an LMF standard format, and its documentation in French -- see [[http://www.cnrtl.fr/lexiques/prolex/|Prolexbase on CNRTL]].

For version 2.0
 * Prolexbase [[attachment:prolexbase.documentation.pdf|documentation]].
 * Prolexbase [[attachment:prolexbase-schema.tar.gz|schema]] description.
 * [[attachment:prolexbase_en_fr_pl_20130204.sql.tar.gz|MySQL dump]] file,
 * List of all inflected forms of [[attachment:prolexbase-polish-inflected.tar.gz|Polish names]] together with their semantic and grammatical tags.
Line 52: Line 62:
Defining an [[http://www.lexicalmarkupframework.org/|LMF]] format for the lexicon.  * Opening a web interface for Prolexbase 2.0/2.2 navigation.
 * Releasing [[http://jlm.ipipan.waw.pl/index.php/JLM/article/view/63|ProlexFeeder]] under an open license.

Prolexbase 2.0 and 2.2

Prolexbase (versions 2.0 and 2.2) is a multilingual relational dictionary of proper names, conceived initially at the University of Tours, France, and at the University of Belgrade, Serbia, and further developed at the Polish Academy of Sciences (IPIPAN). It contains a language-independent typology of proper names with 4 supertypes and 34 types, as well as various language-independent or language-specific relations (synonymy, meronymy, accessibility, variation, etc.). Its pivot-oriented design of concepts aligns proper names in one language with their counterparts in other languages. A large majority of the data have been extracted from Wikipedia and GeoNames. All data have been manually validated.
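
The pivot-oriented alignment described above can be illustrated with a minimal Python sketch. It is not the Prolexbase schema: the class ProperName, the functions build_pivot_index and counterparts, the integer pivot identifiers and the sample entries are all assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProperName:
    """A language-specific name attached to a language-independent pivot.

    Hypothetical structure; the real Prolexbase tables may differ.
    """
    pivot_id: int   # assumed identifier of the pivot (concept)
    language: str   # language code such as "pl", "en" or "fr"
    lemma: str      # base form of the proper name


def build_pivot_index(names):
    """Group lemmas first by pivot, then by language."""
    index = defaultdict(lambda: defaultdict(list))
    for name in names:
        index[name.pivot_id][name.language].append(name.lemma)
    return index


def counterparts(index, pivot_id, target_language):
    """Return the lemmas in target_language that share the given pivot."""
    return list(index.get(pivot_id, {}).get(target_language, []))


# Illustrative entries only; they are not taken from the Prolexbase data.
NAMES = [
    ProperName(1, "pl", "Paryż"),
    ProperName(1, "en", "Paris"),
    ProperName(1, "fr", "Paris"),
    ProperName(2, "pl", "Wisła"),
    ProperName(2, "en", "Vistula"),
]

INDEX = build_pivot_index(NAMES)
```

Under these assumptions, counterparts(INDEX, 2, "en") looks up the English names attached to the same pivot as the Polish "Wisła", and a pivot with no name in the requested language yields an empty list. Routing every lookup through the pivot means each language only has to be linked to the pivots, not to every other language.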

Prolexbase creation has been supported by the following projects:

  • Technolangue programme from the French Ministry of Industry (2003-2005),

  • Egide Pavle-Savic programme from the French Ministry of Foreign Affairs, the French Ministry of Research, and the Serbian Ministry of Science (2004-2005),

  • ERDF Nekst project (2009-2014),

  • European (CIP ICT-PSP) CESAR project, part of META-NET (2011-2013).

The construction and contents of Prolexbase have been described in:

  • Savary, A., Manicki, L., Baron, M. (2013): Populating a Multilingual Ontology of Proper Names from Open Sources. In Journal of Language Modelling, Vol 2, No. 2, pp. 189-225.
  • Savary, A., Manicki, L., Baron, M. (2013): ProlexFeeder - Populating a Multilingual Ontology of Proper Names from Open Sources. Technical Report 306, Laboratoire d'Informatique, Université François Rabelais Tours, France.
  • Maurel, D., Bouchou, B. (2013): Prolmf, a Multilingual Dictionary of Proper Names and their Relations. In Gil Francopoulo (ed.), LMF: Lexical Markup Framework, theory and practice, Iste-Wiley, pp. 67-81.
  • Spędzia, M., Maurel, D., Savary, A. (2011): Multilingual Relational Database of Proper Names: Prolexbase Documentation. Technical report #297, Laboratoire d'informatique, Université François Rabelais Tours.
  • Bouchou, B., Maurel, D. (2008): Prolexbase et LMF : vers un standard pour les ressources lexicales sur les noms propres. In Traitement Automatique des Langues, 49(1).
  • Maurel, D. (2008): Prolexbase: a Multilingual Relational Lexical Database of Proper Names. In proceedings of LREC 2008, Marrakech, Morocco.
  • Tran, M., Maurel, D. (2006): Prolexbase. Un dictionnaire relationnel multilingue de noms propres. In Traitement Automatique des Langues, 47(3).
  • Krstev S., Vitas D., Maurel D., Tran M. (2005): Multilingual Ontology of Proper Names. In Second Language & Technology Conference (LTC'05), Poznań, Poland.

Prolexbase 2.0 and 2.2 contain the following interlinked data:

  • 67,000 language-independent pivots,
  • 40,000 Polish proper names and their corresponding 165,000 inflected forms,
  • 33,000 English proper names and their corresponding 18,000 inflected forms,
  • 100,000 French proper names and their corresponding 142,000 inflected forms,
  • 65,500 relations.

See also Prolexbase on CNRTL for a proofread version (2.2), serialized in an LMF standard format.

Authors

  • Małgorzata Baron - lexicography,
  • Béatrice Bouchou Markhoff - LMF format design,

  • Leszek Manicki - design and implementation of Prolexbase population from Wikipedia,
  • Denis Maurel - design and dissemination, project management,

  • Agata Savary - project management,

  • Mickaël Tran - database design and implementation,
  • Duško Vitas - design and management of Serbian data.

Tools

  • ProlexFeeder, a tool for semi-automatic population of Prolexbase from open sources, notably Wikipedia and GeoNames,

  • Translatica's automatic inflection tool for multi-word units.

License

All Prolexbase 2.0 data are available under the CC BY-SA license, i.e. the same as for Wikipedia and GeoNames. The Prolexbase 2.2 data are available under the LGPL-LR license.

Available resources

For version 2.2

  • XML serialisation in an LMF standard format, and its documentation in French -- see Prolexbase on CNRTL.

For version 2.0

  • Prolexbase documentation,
  • Prolexbase schema description,
  • MySQL dump file,
  • List of all inflected forms of Polish names together with their semantic and grammatical tags.

Future work

  • Opening a web interface for Prolexbase 2.0/2.2 navigation.
  • Releasing ProlexFeeder under an open license.