NKJP1M re-annotated using the Morfeusz SGJP tagset

NKJP1M is a manually annotated, 1-million-word subcorpus of the National Corpus of Polish (NKJP). It is the main resource used for training taggers of Polish. Unfortunately, NKJP was annotated according to a tagset that differs somewhat from the tagset of the morphological analyser Morfeusz SGJP.

Here, we present NKJP1M-SGJP, a version of NKJP1M re-annotated in accordance with the tagset of Morfeusz SGJP. Taggers trained on it are therefore compatible with Morfeusz without any tagset conversion. We intend to maintain this version of the corpus, both by correcting errors and by keeping it compatible with Morfeusz.

The version of NKJP1M-SGJP compatible with the current Morfeusz is available at http://download.sgjp.pl/morfeusz/current/. (Older versions, starting from July 2020, are kept in the respective subdirectories of http://download.sgjp.pl/morfeusz/.) The corpus is available as a set of NKJP-TEI XML files (file nkjp1m-sgjp-tei-‹release_date›.tgz) and as a set of files in a simple column-based format used by the tagger Concraft-PL (file nkjp1m-sgjp-dag-‹data›.tgz; see Concraft’s page for a description of the format). We find the latter format easier to use.

The DAG version was prepared with tagger training in mind. For each text in the corpus there are two files, ann_morphosyntax_disamb.dag and ann_morphosyntax_ambig.dag. The disamb files contain complete information and can be used for training. Correct interpretations are marked with disamb in column 12 and have a non-zero probability in column 8. Interpretations unknown to Morfeusz (added by annotators) are marked manual in column 9 (“interpretation related metadata”). The nps (no preceding space) marker of NKJP is present in column 11 (“segment related metadata”). The ambig files can be used for fair testing: they contain no disamb marks, and all manual interpretations and manual segmentation variants have been stripped from them. Thus, an ideal tagger, given an “ambig” file, should produce the same sequence of interpretations as the corresponding “disamb” file.
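The column layout described above can be read with a few lines of Python. The following is only a minimal sketch, not an official reader: it assumes that columns are separated by tabs, that blank lines merely separate units and can be skipped, and that the markers appear as the literal strings manual, nps and disamb somewhere in their columns. The names Interp, parse_dag_lines and gold_interpretations are hypothetical; the Concraft-PL format description remains the authoritative source.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Interp(NamedTuple):
    """One row of a .dag file, i.e. one candidate interpretation."""
    fields: List[str]    # all raw columns, in file order
    probability: float   # column 8
    is_manual: bool      # "manual" in column 9 (interpretation metadata)
    is_nps: bool         # "nps" in column 11 (segment metadata)
    is_disamb: bool      # "disamb" in column 12


def _col(fields: List[str], n: int) -> str:
    """Return the 1-indexed column n, or '' if the row is shorter."""
    return fields[n - 1] if len(fields) >= n else ""


def parse_dag_lines(lines: Iterable[str]) -> Iterator[Interp]:
    """Yield one Interp per non-blank line (assumption: tab-separated)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines only separate units
        fields = line.split("\t")
        try:
            probability = float(_col(fields, 8))
        except ValueError:
            probability = 0.0  # empty or non-numeric probability column
        yield Interp(
            fields=fields,
            probability=probability,
            is_manual="manual" in _col(fields, 9),
            is_nps="nps" in _col(fields, 11),
            is_disamb="disamb" in _col(fields, 12),
        )


def gold_interpretations(interps: Iterable[Interp]) -> List[Interp]:
    """Keep only the interpretations chosen by the annotators."""
    return [i for i in interps if i.is_disamb]
```

With a disamb file opened as text (for example with `open(path, encoding="utf-8")`, assuming UTF-8), `gold_interpretations(parse_dag_lines(f))` returns the rows a tagger is expected to reproduce from the matching ambig file.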

NKJP1M-SGJP is available under Creative Commons Attribution (CC-BY), since this is the license of NKJP1M.

The preparation and maintenance of this resource are possible thanks to CLARIN-PL.
