NKJP1M re-annotated using the Morfeusz SGJP tagset
NKJP1M is a 1-million-word manually annotated sub-corpus of the National Corpus of Polish (NKJP). It is the main resource used for training taggers of Polish. Unfortunately, NKJP was annotated according to a tagset that differs somewhat from the tagset of the morphological analyser Morfeusz SGJP.
Here we present NKJP1M-SGJP, a version of NKJP1M re-annotated in accordance with the tagset of Morfeusz SGJP. Taggers trained on it are therefore compatible with Morfeusz without any tagset conversion. We intend to maintain this version of the corpus, both by correcting errors and by keeping it compatible with Morfeusz.
The version of NKJP1M-SGJP compatible with the current Morfeusz is available at http://download.sgjp.pl/morfeusz/current/. (Older versions, starting from July 2020, are kept in the respective subdirectories of http://download.sgjp.pl/morfeusz/.) The corpus is available as a set of NKJP-TEI XML files (file nkjp1m-sgjp-tei-‹release_date›.tgz) and as a set of files in a simple column-based format used by the tagger Concraft-PL (file nkjp1m-sgjp-dag-‹data›.tgz); see Concraft’s page for a description of the format. We find the column-based format easier to use.
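Once one of the archives has been downloaded, it can be unpacked with standard tools. The following is only a minimal sketch in Python using the standard tarfile module. The function name extract_corpus and the destination directory are hypothetical, chosen for illustration, and nothing is assumed about the layout of the files inside the archive.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Unpack a downloaded .tgz archive and return the extracted file paths, sorted.

    The name `extract_corpus` is hypothetical; it is not part of any NKJP or
    Morfeusz tooling.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For example, after saving the nkjp1m-sgjp-tei-‹release_date›.tgz file locally, calling extract_corpus with its path and a target directory of your choice returns the list of unpacked files, which can then be passed to whatever tool reads the chosen format.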
NKJP1M-SGJP is available under the GNU General Public License v. 3 (GPL-3), since this is the license of NKJP1M.
The preparation and maintenance of this resource are possible thanks to CLARIN-PL.