Concraft-pl
Concraft-pl is a morphosyntactic tagger for the Polish language based on conditional random fields. The tool is coupled with Morfeusz 2, a morphosyntactic analyzer for Polish. Both tools represent morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG) of morphosyntactic interpretations. The current version of Concraft uses a simple column-based format for input and output (described below).
Downloads
- A tagging model trained on NKJP1M-SGJP:
  - for Morfeusz 1.99.5 (2022/02/20) and newer:
  - for Morfeusz 1.99.4 and older:
- Compiled version of Concraft:
  - for Linux (compiled on Ubuntu 18.04): Concraft-Linux.zip
  - for Windows: Concraft-Windows.zip
  - for Mac OS X: Concraft-MacOSX.zip
- Source code of Concraft: https://github.com/kawu/concraft-pl
- Examples of Concraft input and output format: format-examples.zip
Usage
The provided model is compatible with the current tagset of Morfeusz SGJP and was trained on a version of NKJP1M adapted to that tagset. The preparation of this version of the corpus was financed by CLARIN-PL.
The compiled binaries provided above are standard executables that depend only on basic system C-language libraries. If none of them matches your system, try compiling from source (which requires the Haskell stack).
Concraft can be used from the command line, as in the following example. For more details, see the documentation on the source page.
./concraft-pl tag concraft-pl-model-SGJP-08022020.gz -i example-input.dag -o example-output.dag
Note that the model is almost 100 MB in size. It may take several seconds for Concraft to load the model into memory, so please do not despair in the meantime.
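For scripted use, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_tag_command` and `run_tagger` are hypothetical, and the only Concraft arguments used are the ones shown in the example (`tag`, the model path, `-i` and `-o`).

```python
import subprocess
from pathlib import Path


def build_tag_command(binary, model, input_dag, output_dag):
    """Assemble the argument list for the `tag` call shown in the example above."""
    return [str(binary), "tag", str(model), "-i", str(input_dag), "-o", str(output_dag)]


def run_tagger(binary, model, input_dag, output_dag):
    """Run Concraft once; raises CalledProcessError on a non-zero exit status."""
    # Fail early with a clear error instead of letting the tagger start and abort.
    for path in (binary, model, input_dag):
        if not Path(path).exists():
            raise FileNotFoundError(path)
    # Loading the model can take several seconds, so no timeout is set here.
    subprocess.run(build_tag_command(binary, model, input_dag, output_dag), check=True)
```

Because the model is loaded on every invocation, it is probably cheaper to tag one large input file than to call the tagger once per sentence.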
Data format
The common format for input and output of Concraft is as follows. Each line in the file represents a single morphosyntactic interpretation of a segment. The line comprises the following tab-separated fields:
- columns 1 and 2 contain the numerical IDs of the starting and ending nodes of the current segment in the morphosyntactic graph of Morfeusz 2,
- column 3 — segment (token),
- column 4 — lemma,
- column 5 — morphosyntactic tag,
- column 6 — proper name type as given by Morfeusz (not yet used for tagging),
- column 7 — any labels as given by Morfeusz (not yet used for tagging),
- column 8 — probability of this interpretation as determined by Concraft (use 0.0 for input),
- column 9 — interpretation-related meta information,
- column 10 — end-of-sentence mark assigned by Concraft,
- column 11 — segment-related meta information,
- column 12 (only in the output) — the interpretation of the segment chosen by Concraft is marked “disamb” in this column.
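The column layout listed above can be read with a short Python parser. This is a sketch under stated assumptions: the `Interpretation` class and the function names are hypothetical, blank lines are skipped, and the two sample lines are made up for illustration rather than taken from real Concraft output.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Interpretation:
    start: int          # column 1: ID of the starting node
    end: int            # column 2: ID of the ending node
    segment: str        # column 3
    lemma: str          # column 4
    tag: str            # column 5
    name_type: str      # column 6
    labels: str         # column 7
    probability: float  # column 8
    interp_meta: str    # column 9
    eos: str            # column 10
    seg_meta: str       # column 11
    disamb: bool        # column 12, present only in the output


def parse_line(line: str) -> Interpretation:
    """Parse one tab-separated line with 11 (input) or 12 (output) columns."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) not in (11, 12):
        raise ValueError(f"expected 11 or 12 columns, got {len(f)}")
    return Interpretation(
        start=int(f[0]), end=int(f[1]), segment=f[2], lemma=f[3], tag=f[4],
        name_type=f[5], labels=f[6], probability=float(f[7]),
        interp_meta=f[8], eos=f[9], seg_meta=f[10],
        disamb=len(f) == 12 and f[11] == "disamb",
    )


def parse_dag(lines: Iterable[str]) -> List[Interpretation]:
    """Parse all non-blank lines of a file in the format described above."""
    return [parse_line(line) for line in lines if line.strip()]


def chosen(interps: Iterable[Interpretation]) -> List[Interpretation]:
    """Keep only the interpretations marked "disamb"."""
    return [i for i in interps if i.disamb]


# Made-up sample: two competing interpretations of the same edge (0 -> 1).
SAMPLE = [
    "\t".join(["0", "1", "ma", "mieć", "fin:sg:ter:imperf", "", "", "0.9", "", "", "", "disamb"]),
    "\t".join(["0", "1", "ma", "mój", "adj:sg:nom:f:pos", "", "", "0.1", "", "", ""]),
]
```

Calling `chosen(parse_dag(SAMPLE))` returns the single interpretation with the lemma `mieć`. For a real file, the same functions can be applied to an open file object, for example `parse_dag(open("example-output.dag", encoding="utf-8"))`.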
The columns for meta information are simply copied from the input to the output. They can be used to carry some additional information, for example IDs of XML elements representing the interpretations in some external format (interpretation-related) or NKJP-style no-preceding-space markers (segment-related).
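As an illustration of how the meta columns might be filled when preparing input, the sketch below builds one 11-column input line with the probability set to 0.0. The function name `format_input_line` and the example meta values (`xml-id-17`, `nps`) are hypothetical; leaving the end-of-sentence column empty on input is an assumption, since that mark is assigned by Concraft.

```python
def format_input_line(start, end, segment, lemma, tag,
                      name_type="", labels="", interp_meta="", seg_meta=""):
    """Build one tab-separated input line (columns 1-11 of the format above)."""
    fields = [
        str(start), str(end), segment, lemma, tag,
        name_type, labels,
        "0.0",        # column 8: probability, 0.0 for input
        interp_meta,  # column 9: copied unchanged to the output
        "",           # column 10: end-of-sentence mark, left empty here (assumption)
        seg_meta,     # column 11: copied unchanged to the output
    ]
    # A tab or newline inside a field would shift or split the columns.
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


line = format_input_line(0, 1, "ma", "mieć", "fin:sg:ter:imperf",
                         interp_meta="xml-id-17", seg_meta="nps")
```

Since the meta values are carried through untouched, they can later be used to map each output line back to its source element.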
Publications
- Jakub Waszczuk. (2012). Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2789–2804, Mumbai, India, 2012.
- Jakub Waszczuk, Witold Kieraś, and Marcin Woliński. (2018). Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In: Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech, and Dialogue: 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018.