
Concraft-pl

Concraft-pl is a morphosyntactic tagger for the Polish language based on conditional random fields. The tool is coupled with Morfeusz 2, a morphosyntactic analyzer for Polish. Both tools represent morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG) of morphosyntactic interpretations. The current version of Concraft uses a simple column-based format for input and output (described below).

Downloads

Usage

The provided model is compatible with the current tagset of Morfeusz SGJP and was trained on a version of NKJP1M adapted to that tagset. The preparation of this version of the corpus was financed by CLARIN-PL.

The compiled binaries provided above are standard executables that depend only on the basic system C libraries. If none matches your system, try compiling from source (which requires the Haskell Stack build tool).

Concraft can be used from the command line as in the following example. For more details, see the documentation on the source page.

./concraft-pl tag concraft-pl-model-SGJP-08022020.gz -i example-input.dag -o example-output.dag

Note that the model is almost 100 MB in size. It may take several seconds for Concraft to load the model into memory, so please do not despair in the meantime.
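When tagging is driven from a script, the same invocation can be wrapped in a small helper. The following is a minimal Python sketch, not part of Concraft itself: the function names build_tag_command and tag_file are hypothetical, and the sketch assumes only the tag subcommand and the -i and -o options shown in the example above.

```python
import subprocess
from pathlib import Path


def build_tag_command(model, input_dag, output_dag, binary="./concraft-pl"):
    """Return the argument list for a `concraft-pl tag` call.

    Only the subcommand and options from the command-line example
    are used; the helper itself is a hypothetical convenience.
    """
    return [str(binary), "tag", str(model),
            "-i", str(input_dag), "-o", str(output_dag)]


def tag_file(model, input_dag, output_dag, binary="./concraft-pl"):
    """Run the tagger once and return the output path.

    check=True makes subprocess.run raise CalledProcessError when
    the executable exits with a non-zero status.
    """
    subprocess.run(build_tag_command(model, input_dag, output_dag, binary),
                   check=True)
    return Path(output_dag)
```

Calling tag_file("concraft-pl-model-SGJP-08022020.gz", "example-input.dag", "example-output.dag") would run the command shown above. Since each call starts a new process, the model is presumably loaded again every time, so tagging one larger file is likely cheaper than tagging many small ones.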

Data format

The common format for input and output of Concraft is as follows. Each line in the file represents a single morphosyntactic interpretation of a segment. The line comprises the following tab-separated fields:

The columns for meta information are simply copied from the input to the output. They can be used to carry additional information, for example the IDs of XML elements representing the interpretations in some external format (interpretation-related) or NKJP-style no-preceding-space markers (segment-related).
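Because the fields are tab-separated and some of them may be empty, a reader should split on the tab character only, rather than on arbitrary whitespace. The following Python sketch illustrates this; the function name read_interpretations is hypothetical, the sketch makes no assumption about the number or meaning of the columns, and skipping blank lines is an assumption rather than a documented rule of the format.

```python
def read_interpretations(lines):
    """Yield one list of fields per non-empty line.

    Splitting on "\t" alone keeps empty fields in place, so the
    column positions stay aligned across lines.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # Assumption: blank lines carry no interpretation.
            continue
        yield line.split("\t")
```

With a file opened in text mode using UTF-8 encoding, list(read_interpretations(f)) gives one list of strings per interpretation; the meta columns can then be read from or written back to their positions unchanged.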

Publications

last edited 2022-02-22 13:45:39 by MarcinWolinski