Concraft

Concraft-pl

Concraft-pl is a morphosyntactic tagger for the Polish language based on conditional random fields. The tool is coupled with Morfeusz 2, a morphosyntactic analyzer for Polish. Both tools represent morphosyntactic and segmentation ambiguities as a directed acyclic graph (DAG) of morphosyntactic interpretations. The current version of Concraft uses a simple column-based format for input and output (described below).

Downloads

Usage

The provided model is compatible with the current tagset of Morfeusz SGJP and was trained on a version of NKJP1M adapted to that tagset. The preparation of this version of the corpus was financed by CLARIN-PL.

The compiled binaries provided above are standard executables that depend only on basic system C-language libraries. If none of them matches your system, try compiling from source (which requires the Haskell Stack tool).

Concraft can be used from the command line, as in the following example. For more details, see the documentation at the source page.

./concraft-pl tag concraft-pl-model-SGJP-08022020.gz -i example-input.dag -o example-output.dag

Note that the model is almost 100 MB in size. It may take several seconds for Concraft to load the model into memory; please do not despair in the meantime.
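For scripted use, the same invocation can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names build_tag_command and run_tagger are hypothetical, and it assumes the concraft-pl binary sits in the current directory, exactly as in the example above. Only the tag subcommand and the -i and -o options shown above are used.

```python
import subprocess
from typing import List


def build_tag_command(model: str, input_dag: str, output_dag: str,
                      binary: str = "./concraft-pl") -> List[str]:
    """Build the argument list for the `tag` invocation shown above.

    The default binary path is an assumption; adjust it to wherever
    the executable lives on your system.
    """
    return [binary, "tag", model, "-i", input_dag, "-o", output_dag]


def run_tagger(model: str, input_dag: str, output_dag: str,
               binary: str = "./concraft-pl") -> None:
    """Run the tagger and raise CalledProcessError on a non-zero exit code."""
    cmd = build_tag_command(model, input_dag, output_dag, binary)
    # Loading the model may take several seconds, so no timeout is set here.
    subprocess.run(cmd, check=True)
```

Calling run_tagger("concraft-pl-model-SGJP-08022020.gz", "example-input.dag", "example-output.dag") would then be equivalent to the command line above.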

Data format

The common format for input and output of Concraft is as follows. Each line in the file represents a single morphosyntactic interpretation of a segment. The line comprises the following tab-separated fields:

  • columns 1 and 2 contain the numerical IDs of the starting and ending nodes of the current segment in the morphosyntactic graph of Morfeusz 2,
  • column 3 — segment (token),
  • column 4 — lemma,
  • column 5 — morphosyntactic tag,
  • column 6 — proper name type as given by Morfeusz (not yet used for tagging),
  • column 7 — any labels as given by Morfeusz (not yet used for tagging),
  • column 8 — probability of this interpretation as determined by Concraft (use 0.0 for input),
  • column 9 — interpretation-related meta information,
  • column 10 — end-of-sentence mark assigned by Concraft,
  • column 11 — segment-related meta information
  • column 12 (only in the output) — the interpretation of the segment chosen by Concraft is marked “disamb” in this column.
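The column layout above can be read with a short Python sketch like the one below. It is not part of Concraft: the Interpretation class, the parse_dag_lines and chosen helpers, and the two sample lines (including their tags and probabilities) are illustrative assumptions. The sketch also assumes that blank lines carry no interpretation and that missing trailing columns can be treated as empty.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Interpretation:
    start: int              # column 1: starting node ID
    end: int                # column 2: ending node ID
    segment: str            # column 3
    lemma: str              # column 4
    tag: str                # column 5
    proper_name_type: str   # column 6
    labels: str             # column 7
    probability: float      # column 8
    interp_meta: str        # column 9
    eos: str                # column 10
    segment_meta: str       # column 11
    disamb: bool            # column 12 (output only)


def parse_dag_lines(lines: Iterable[str]) -> List[Interpretation]:
    """Parse tab-separated lines into Interpretation records."""
    result = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no interpretation
        fields = line.split("\t")
        fields += [""] * (12 - len(fields))  # pad missing trailing columns
        result.append(Interpretation(
            start=int(fields[0]),
            end=int(fields[1]),
            segment=fields[2],
            lemma=fields[3],
            tag=fields[4],
            proper_name_type=fields[5],
            labels=fields[6],
            probability=float(fields[7]) if fields[7] else 0.0,
            interp_meta=fields[8],
            eos=fields[9],
            segment_meta=fields[10],
            disamb=fields[11] == "disamb",
        ))
    return result


def chosen(interps: List[Interpretation]) -> List[Interpretation]:
    """Keep only the interpretations marked "disamb" in column 12."""
    return [i for i in interps if i.disamb]


# Made-up sample output: two competing interpretations of one segment.
SAMPLE = (
    "0\t1\tMam\tmieć\tfin:sg:pri:imperf\t\t\t0.9\t\t\t\tdisamb\n"
    "0\t1\tMam\tmama\tsubst:pl:gen:f\t\t\t0.1\t\t\t\t\n"
)
interps = parse_dag_lines(SAMPLE.splitlines())
for interp in chosen(interps):
    print(interp.segment, interp.lemma, interp.tag)
```

Because several lines may share the same pair of node IDs, grouping the records by (start, end) recovers the competing interpretations of each edge in the graph.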

The meta-information columns are simply copied from the input to the output. They can be used to carry additional information, for example IDs of XML elements representing the interpretations in some external format (interpretation-related) or NKJP-style no-preceding-space markers (segment-related).
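When preparing input by hand, a line can be assembled as in the following Python sketch. The format_input_line helper is hypothetical. The sketch assumes that input lines carry the first 11 columns, that unused fields are left empty, and that the end-of-sentence column is empty on input. The "nps" value in the usage line is only an illustrative segment-related marker.

```python
def format_input_line(start: int, end: int, segment: str, lemma: str, tag: str,
                      proper_name_type: str = "", labels: str = "",
                      interp_meta: str = "", eos: str = "",
                      segment_meta: str = "") -> str:
    """Build one 11-column, tab-separated input line.

    Column 8 (probability) is set to 0.0, as the format prescribes for input.
    """
    fields = [
        str(start), str(end), segment, lemma, tag,
        proper_name_type, labels,
        "0.0",          # probability placeholder for input
        interp_meta,    # copied unchanged to the output
        eos,            # assumption: left empty on input
        segment_meta,   # copied unchanged to the output
    ]
    return "\t".join(fields)


# Illustrative usage with a made-up interpretation and meta marker.
line = format_input_line(0, 1, "Mam", "mieć", "fin:sg:pri:imperf",
                         segment_meta="nps")
```

Since the meta columns pass through the tagger untouched, whatever is stored in interp_meta or segment_meta here can be read back from the same columns of the output.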

Publications