N-grams from the balanced subcorpus of the National Corpus of Polish
The resource is a set of N-grams, for N from 1 to 5, extracted from the balanced subcorpus of the National Corpus of Polish (300M tokens). Each unigram is a maximal contiguous chunk of non-whitespace, lower-case characters. The resource contains every unique N-gram, each followed by its number of occurrences.
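To make the unigram definition concrete, here is a minimal Python sketch of how such counts could be produced. It is not the tool used to build the resource. It assumes that the text is lower-cased before splitting on whitespace and that N-grams are taken over the whole input without regard to sentence boundaries; the function name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(text, max_n=5):
    """Count N-grams for N = 1..max_n (illustrative sketch only)."""
    # Assumption: lower-case the text, then treat every maximal run of
    # non-whitespace characters as one unigram.
    tokens = text.lower().split()
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

For example, `count_ngrams("Ala ma kota")[2]` holds the bigrams `("ala", "ma")` and `("ma", "kota")`, each with a count of 1. Note that under this definition punctuation separated by whitespace is a unigram of its own.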
Downloads
Licence
The NKJP N-grams are made available under the CC-BY licence.