Revision 7 as of 2012-07-20 10:25:17

SAWA

Grammatical Lexicon of Warsaw Urban Proper Names

The Grammatical Lexicon of Warsaw Urban Proper Names (SAWA - Słownik elektroniczny nAzewnictwa WArszawy) is an electronic lexicon containing about 9,000 proper names of places related to the Warsaw transportation system, i.e. names of streets, squares, monuments, buildings, bus, tram and subway stops, etc., as well as names of persons to whom some objects (notably streets) are dedicated. Stylistically marked names (e.g. Czterech Śpiących) and former names (notably those used before 1989) are also included. The morphosyntax of the names is described by over 450 graph-based inflection paradigms, which allow the automatic generation of about 300,000 inflectional and syntactic variants. The lexicon has been developed within a French-Polish Polonium project and within a nationally funded Polish project.

Some aspects of its construction, contents and use have been described in:

  • SAVARY, A., RABIEGA-WIŚNIEWSKA, J., WOLIŃSKI, M. (2009): "Inflection of Polish Multi-Word Proper Names with Morfeusz and Multiflex", in MARCINIAK, M., MYKOWIECKA, A. (eds.) "Aspects of Natural Language Processing", Lecture Notes in Computer Science 5070, Springer Verlag, pp. 111-141.
  • MARCINIAK, M., RABIEGA-WIŚNIEWSKA, J., SAVARY, A., WOLIŃSKI, M., HELIASZ, C. (2009): "Constructing an Electronic Dictionary of Polish Urban Proper Names", in Recent Advances in Intelligent Information Systems (Proceedings of the Balto-Slavonic Natural Language Processing Workshop, Kraków), Academic Publishing House EXIT, Warsaw, pp. 743-749.

The lexicon contains names of objects of the following types:

  • 4837 communication ways: streets (e.g. ulica Generała Kazimierza Pułaskiego), squares (e.g. Plac Komuny Paryskiej), and bridges (e.g. most Śląsko-Dąbrowski),

  • 1933 communication points: bus, tram, subway and city train stops (e.g. przystanek Aleja Zjednoczenia, stacja Warszawa-ZOO), railway stations (e.g. Warszawa Wschodnia), and airports (e.g. Port Lotniczy imienia Fryderyka Chopina w Warszawie),

  • 435 buildings (e.g. Hala Marymoncka, kościół Świętego Jakuba Apostoła, Muzeum Historii Żydów Polskich, teatr „Kwadrat”, Akademia Medyczna),

  • 385 districts and areas (e.g. Sady Żoliborskie, Stegny, park imienia Romualda Traugutta, Cmentarz Ewangelicko-Augsburski),

  • 115 monuments (e.g. Grób Nieznanego Żołnierza),

  • 34 hydronyms (e.g. Kanał Żerański),

  • 1195 names of persons to whom some urban objects (notably streets) are dedicated (e.g. Kazimierz Pułaski).
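
The per-category counts above can be tallied to check them against the "about 9,000" figure given in the introduction. The following is only a minimal sketch: the dictionary keys are shorthand labels chosen here for illustration, not identifiers used in the lexicon itself.

```python
# Entry counts per category, copied from the list above.
# The keys are illustrative labels, not names taken from the lexicon files.
counts = {
    "communication ways": 4837,
    "communication points": 1933,
    "buildings": 435,
    "districts and areas": 385,
    "monuments": 115,
    "hydronyms": 34,
    "person names": 1195,
}

total = sum(counts.values())
print(f"total entries: {total}")

# Show each category's share of the lexicon, largest first.
for category, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:22s} {n:5d} {100 * n / total:5.1f}%")
```

The total comes to 8934 entries, which agrees with the rounded figure of about 9,000 names.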

Authors

  • Małgorzata Marciniak -- project management,

  • Celina Heliasz -- lexicography,
  • Joanna Rabiega-Wiśniewska -- lexicography,
  • Piotr Sikora -- programming,
  • Marcin Woliński -- morphology of single words,

  • Agata Savary -- automatic inflection and validation.

Tools

The lexicon has been created within Toposław, a tool for developing and managing inflectional dictionaries of multi-word units. Toposław integrates:

  • Morfeusz SGJP -- a morphological analyser and generator of Polish,

  • Multiflex -- a morpho-syntactic generator of multi-word units,

  • a graph editor stemming from Unitex.

License

The data are available under the CC-BY-SA license.

Available resources

  • Slownik -- the binary source file in Toposław format,

  • a Multiflex-compatible archive containing:

    • the list of morphologically annotated lexemes,
    • the list of corresponding inflected forms and variants,
    • inflection graphs compatible with the Unitex graph editor,

    • a list of known problems.

Future work

Defining an LMF format for the lexicon.