
Diff for "Concraft"

Differences between revisions 9 and 41 (spanning 32 versions)
Revision 9 as of 2013-01-09 22:59:23
Size: 1721
Comment:
Revision 41 as of 2022-02-22 13:45:39
Size: 4189
Comment:
Line 2: Line 2:
= Concraft =
Line 4: Line 3:
This page provides the official release of Concraft, a morphosyntactic disambiguation tool based on constrained conditional random fields.
= Concraft-pl =
Line 6: Line 5:
'''Author:'''
[[http://zil.ipipan.waw.pl/JakubWaszczuk|Jakub Waszczuk]] <<BR>>
'''License:''' 2-clause BSD

== Documentation ==

See the [[https://github.com/kawu/concraft/blob/master/README.md#concraft|README]] file from the development repository.
Concraft-pl is a morphosyntactic tagger for the Polish language based on conditional random fields. The tool is coupled with [[http://morfeusz.sgjp.pl|Morfeusz 2]], a morphosyntactic analyzer for Polish. Both tools represent morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG) of morphosyntactic interpretations. The current version of Concraft uses a simple column-based format for input and output (described below).
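
To make the DAG representation more concrete, here is a minimal Python sketch that enumerates the alternative segmentations encoded by such a graph. The edge list, the function name `segmentations`, and the assumption that the lowest node ID is the start of the graph and the highest is its end are all illustrative; they are not taken from Concraft or Morfeusz.

```python
from collections import defaultdict

# Hypothetical edges (start node, end node, segment) for the Polish word
# "miałem", which can be read as one segment or split into "miał" + "em".
EDGES = [
    (0, 1, "miał"),
    (1, 2, "em"),
    (0, 2, "miałem"),
]


def segmentations(edges):
    """Return every path through the DAG as a list of segments."""
    outgoing = defaultdict(list)
    for start, end, segment in edges:
        outgoing[start].append((end, segment))
    # Assumption: node IDs increase along the graph.
    first = min(start for start, _, _ in edges)
    last = max(end for _, end, _ in edges)

    def walk(node):
        if node == last:
            return [[]]
        paths = []
        for end, segment in outgoing[node]:
            for rest in walk(end):
                paths.append([segment] + rest)
        return paths

    return walk(first)
```

Calling `segmentations(EDGES)` yields one list of segments per path, which is the segmentation ambiguity that the tagger has to resolve alongside the choice of tags.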
Line 16: Line 9:
Concraft is available in a form of a software distribution which can be downloaded from [[http://hackage.haskell.org/package/concraft|Hackage]] using the [[http://www.haskell.org/cabal/|Cabal]] tool. To compile Concraft you will also need the [[http://www.haskell.org/ghc/|Glasgow Haskell Compiler]] (GHC). The simplest way to get both Cabal and GHC is to install the [[http://www.haskell.org/platform/|Haskell Platform]]. Please see the documentation for more information about the installation process.
 * A tagging model trained on NKJP1M-SGJP:
   || For Morfeusz 1.99.5 (2022/02/20) and newer: || [[attachment:concraft-pl-model-SGJP-20220221.gz]] ||
   || For Morfeusz 1.99.4 and older: || [[attachment:concraft-pl-model-SGJP-20200818.gz]] ||
 * Compiled version of Concraft
   * for Linux (compiled on Ubuntu 18.04): [[attachment:Concraft-Linux.zip]]
   * for Windows: [[attachment:Concraft-Windows.zip]]
   * for Mac OS X: [[attachment:Concraft-MacOSX.zip]]
 * Source code of Concraft: [[https://github.com/kawu/concraft-pl]]
 * Examples of Concraft input and output format: [[attachment:format-examples.zip]]
Line 18: Line 19:
=== Pre-trained model ===
== Usage ==
Line 20: Line 21:
You can download a pre-trained Concraft model for the Polish language from here. The training material, manually annotated 1-million word subcorpus of the National Corpus of Polish, has been first re-analysed using the [[http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki|Maca]] tool set up to use the `morfeusz-nkjp-official` configuration. The same preprocessing pipeline should be used to prepare input data for subsequent disambiguation.
The provided model is compatible with the current tagset of Morfeusz SGJP and was trained on a version of NKJP1M adapted to that tagset. The preparation of this version of the corpus was financed by [[http://clarin-pl.eu|CLARIN-PL]].

The compiled binaries provided above are standard executables that depend only on basic system C-language libraries. If none of them matches your system, try compiling from source (which requires the Haskell stack).

Concraft can be run from the command line as in the following example (substitute the file name of the model you downloaded). For more details, see the documentation on the [[https://github.com/kawu/concraft-pl|source page]].
{{{
./concraft-pl tag concraft-pl-model-SGJP-08022020.gz -i example-input.dag -o example-output.dag
}}}

Note that the model is almost 100 MB in size. It may take several seconds for Concraft to load the model into memory; please do not despair in the meantime.
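
When the tagger has to be called from a script, the command above can be wrapped as in the following Python sketch. It uses only the `tag` subcommand and the `-i` and `-o` options shown in the example; the function names and the default executable path are assumptions made for illustration.

```python
import subprocess


def build_tag_command(model, input_path, output_path, executable="./concraft-pl"):
    """Build the argument list for the `tag` invocation shown above."""
    return [executable, "tag", model, "-i", input_path, "-o", output_path]


def run_tagger(model, input_path, output_path, executable="./concraft-pl"):
    """Run the tagger; check=True raises CalledProcessError if the
    process exits with a non-zero status."""
    command = build_tag_command(model, input_path, output_path, executable)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Since loading the model takes several seconds, it is probably worth tagging one large input file per call rather than many small ones.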

== Data format ==

The common format for input and output of Concraft is as follows. Each line in the file represents a single morphosyntactic interpretation of a segment. The line comprises the following tab-separated fields:

 * columns 1 and 2 contain the numerical IDs of the start and end nodes of the current segment in the morphosyntactic graph of Morfeusz 2,
 * column 3 — segment (token),
 * column 4 — lemma,
 * column 5 — morphosyntactic tag,
 * column 6 — proper name type as given by Morfeusz (not yet used for tagging),
 * column 7 — any labels as given by Morfeusz (not yet used for tagging),
 * column 8 — probability of this interpretation as determined by Concraft (use 0.0 for input),
 * column 9 — interpretation-related meta information,
 * column 10 — end-of-sentence mark assigned by Concraft,
 * column 11 — segment-related meta information
 * column 12 (only in the output) — the interpretation of the segment chosen by Concraft is marked “disamb” in this column.

The meta-information columns are simply copied from the input to the output. They can carry additional information, for example the IDs of XML elements representing the interpretations in some external format (interpretation-related) or NKJP-style no-preceding-space markers (segment-related).
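
The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of Concraft: the names `Interpretation`, `parse_line` and `chosen` are hypothetical, and the handling of blank lines and of lines with fewer than 12 fields is an assumption rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class Interpretation(NamedTuple):
    start: int
    end: int
    segment: str
    lemma: str
    tag: str
    proper_name_type: str
    labels: str
    probability: float
    interp_meta: str
    eos: str
    segment_meta: str
    disamb: bool


def parse_line(line: str) -> Optional[Interpretation]:
    """Parse one tab-separated line; return None for a blank line."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    fields = line.split("\t")
    # Assumption: pad to 12 fields so lines without trailing columns parse.
    fields += [""] * (12 - len(fields))
    return Interpretation(
        start=int(fields[0]),
        end=int(fields[1]),
        segment=fields[2],
        lemma=fields[3],
        tag=fields[4],
        proper_name_type=fields[5],
        labels=fields[6],
        probability=float(fields[7] or 0.0),
        interp_meta=fields[8],
        eos=fields[9],
        segment_meta=fields[10],
        disamb=fields[11] == "disamb",
    )


def chosen(lines):
    """Yield only the interpretations marked "disamb" in column 12."""
    for line in lines:
        interp = parse_line(line)
        if interp is not None and interp.disamb:
            yield interp
```

For an output file, `list(chosen(open("example-output.dag", encoding="utf-8")))` would then give the interpretations selected by the tagger, with node IDs and probabilities already converted to numbers.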
Line 24: Line 52:
 * Jakub Waszczuk. (2012). [[attachment:coling2012.pdf|Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language]]. <<BR>> In: Proceedings of COLING 2012, Mumbai, India.
 * Jakub Waszczuk. (2012). [[attachment:coling2012.pdf|Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language]]. <<BR>> In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2789–2804, Mumbai, India, 2012.
 * Jakub Waszczuk, Witold Kieraś, and Marcin Woliński. (2018). [[https://hal.archives-ouvertes.fr/hal-01835573/document|Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields]]. <<BR>> In: Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech, and Dialogue: 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018.
