PoliMorf is a new morphological dictionary for Polish, resulting from the standardization and merger of Morfeusz SGJP and Morfologik, financed by the CESAR project.
License: Both the source data and the resulting resource are available under the 2-clause BSD license.
Please cite the LREC paper:
SGJP is the result of several years of work by an informal group led by Prof. Saloni. The work started in the 1980s with the digitisation of the list of headwords of Doroszewski’s 11-volume dictionary of Polish (1958–1969). The grammatical description in SGJP is based on new concepts proposed in the second half of the 20th century, with many detailed solutions contributed by members of the team (Tokarski, Gruszczyński, Saloni). PoliMorf will use data from the second edition of SGJP. 244,341 lexemes correspond to 4,223,981 word forms (counting syncretic forms of the same lexeme as one unit).
Inflection in SGJP is represented with inflectional patterns, which describe each form as a stem common to all forms of a lexeme plus an ending that differentiates the form. The actual model of inflection is more complicated, but even so the high level of irregularity in Polish inflection leads to numerous inflectional patterns: over a thousand.
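The stem-plus-endings idea can be illustrated with a minimal Python sketch. It is not SGJP's actual data format: the tag names, the dictionary layout, and the helper functions inflect and count_distinct_forms are assumptions made for illustration only. The example uses the singular of the noun "dom" (house), whose stem does not change, and also shows what counting syncretic forms as one unit means.

```python
# Illustrative only: the tag names and the pattern layout are assumptions,
# not the representation used in SGJP.
# Endings for the singular of a noun such as "dom" (house).
DOM_SINGULAR_ENDINGS = {
    "sg:nom": "",
    "sg:gen": "u",
    "sg:dat": "owi",
    "sg:acc": "",
    "sg:inst": "em",
    "sg:loc": "u",
    "sg:voc": "u",
}


def inflect(stem, endings):
    """Build every form of a lexeme by attaching each ending to the stem."""
    return {tag: stem + ending for tag, ending in endings.items()}


def count_distinct_forms(forms):
    """Count word forms, treating syncretic (identical) forms as one unit."""
    return len(set(forms.values()))


forms = inflect("dom", DOM_SINGULAR_ENDINGS)
print(forms["sg:dat"])                           # domowi
print(len(forms), count_distinct_forms(forms))   # 7 4
```

Seven tagged forms collapse to four distinct strings (dom, domu, domowi, domem). A single pattern like this cannot capture stem alternations, which is one reason a real model needs more machinery and many more patterns.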
Morfologik is an open-source morphological dictionary of Polish. It contains 216,992 lexemes and 3,475,809 word forms.
The dictionary was created by enriching the Polish ispell/hunspell dictionary with morphological information, which was possible because the structure of the original dictionary retained important grammatical distinctions. The conversion relied on a series of scripts, and the resulting dictionary was later augmented with manually entered information. Unfortunately, the original source dictionary did not contain sufficient structure to allow reliable detection of some information, such as the exact masculine subgender of substantives. This information was added manually and with heuristic methods, but its reliability is low. Since substantives make up about one third of the dictionary content (and almost half of them are masculine), this limitation is severe.
The tagset of the dictionary is inspired by the IPI PAN Tagset. However, Morfologik diverges from that tagset and from Morfeusz in that it never splits orthographic (“space-to-space”) words into smaller dictionary words (i.e. so-called agglutination is not considered). Moreover, due to the lack of information in the ispell dictionary, some forms are not completely annotated and are marked as irregular. There is, however, some additional markup for reflexive verbs that is not present in the original IPI PAN Tagset. It was introduced for the purposes of the grammar checker LanguageTool, which used the dictionary extensively.