All listed resources have been made available under the GPLv3 license.
Development version of Składnica
This version is the result of development in the NEKST project and in two stages of CLARIN-PL (CLARIN-PL, CLARIN-PL-2).
Constituency forests as XML files, version 2018.07.23 (final result of CLARIN-PL-2, using Walenty notation for phrase types; 13035 full trees): Składnica-frazowa-180723.tar.gz
Składnica v. 0.5 (2011)
This page presents the results of the research project N N104 224735, "Construction of a treebank for Polish using machine parsing", financed by the Ministry of Science and Higher Education in 2008-2011.
Składnica frazowa — constituency treebank
The primary resource presented is the constituency treebank (Składnica frazowa), version 0.5. The treebank is the result of parsing 20,000 Polish sentences with the syntactic parser Świgra. For every sentence, the parser generates all syntactic parse trees permitted by the rules of its grammar. Within the Dendrarium system, linguists (termed "dendrologists") then selected the single correct parse tree for each sentence. Dendrologists confirmed a correct parse tree for 8,227 sentences. The remaining sentences were classified according to their (un)grammaticality and the reasons for their rejection by the parser. The largest class among the rejected sentences consists of utterances with no finite verb; their analysis will be the subject of separate research.
Constituency forests as XML files: Składnica-frazowa-0.5.tar.bz2
- The files contain all trees generated by the parser; the interpretation selected by dendrologists is marked through attributes.
XML schema for the constituency treebank files: Składnica-frazowa.xsd
Trees in the Tiger XML format: Składnica-frazowa-0.5-TigerXML.xml.gz
- This file contains only the parse trees selected by dendrologists (one interpretation per sentence).
Składnica zależnościowa — dependency treebank
The dependency treebank (Składnica zależnościowa), version 0.5, is a result of an automatic conversion of manually disambiguated constituency trees into dependency structures.
Dependency structures take the shape of directed graphs: nodes represent the tokens of the sentence (plus an artificial root node), edges represent binary dependency relations between tokens (head and dependent), and edge labels mark the type of dependency relation involved (Polish Dependency Relation Types). Nodes carry indices corresponding to the position of the token within the sentence, with the root always indexed as 0.
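To make the indexing convention concrete, here is a minimal Python sketch that checks whether a set of head assignments forms a tree hanging from the artificial root 0. It assumes that every token has exactly one head; the function name `is_rooted_tree` and the dictionary representation are illustrative choices, not part of the treebank tooling.

```python
from typing import Dict


def is_rooted_tree(heads: Dict[int, int]) -> bool:
    """Check that every token reaches the artificial root (index 0).

    `heads` maps a token index (1..n) to the index of its head.
    Returns False if a head index is unknown or if following heads loops.
    """
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen or node not in heads:
                return False  # cycle, or head pointing outside the sentence
            seen.add(node)
            node = heads[node]
    return True


# Three tokens: token 2 depends on the root, tokens 1 and 3 depend on token 2.
well_formed = is_rooted_tree({1: 2, 2: 0, 3: 2})
# Tokens 1 and 2 depend on each other and never reach the root.
cyclic = is_rooted_tree({1: 2, 2: 1})
print(well_formed, cyclic)
```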
The conversion is an entirely automatic and unambiguous process. Conversion rules have been based on morphosyntactic information, syntactic categories of phrases, and parsing rule references encoded within constituency trees. The majority of constituency trees contain specified syntactic centres, which made conversion easier. For other cases, heuristics were constructed in order to select the head.
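The general idea of head-driven conversion can be sketched in Python as follows. This is not the project's conversion procedure: it only illustrates how a marked syntactic centre determines heads, with "take the first child" standing in as an assumed fallback heuristic. The `Node` class, the category labels and the example sentence are all hypothetical, and no dependency labels are assigned.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str                                   # hypothetical category label
    children: List["Node"] = field(default_factory=list)
    centre: Optional[int] = None                 # position of the centre child, if marked
    token_index: Optional[int] = None            # 1-based index, terminals only


def lexical_head(node: Node, heads: Dict[int, int]) -> int:
    """Return the index of the token heading `node`, filling `heads` on the way."""
    if node.token_index is not None:
        return node.token_index
    # Assumed fallback heuristic: use the first child when no centre is marked.
    centre = node.centre if node.centre is not None else 0
    child_heads = [lexical_head(child, heads) for child in node.children]
    head = child_heads[centre]
    for position, child_head in enumerate(child_heads):
        if position != centre:
            heads[child_head] = head  # non-centre children depend on the centre
    return head


def to_dependencies(root: Node) -> Dict[int, int]:
    """Map each token index to its head index; the top token is attached to 0."""
    heads: Dict[int, int] = {}
    heads[lexical_head(root, heads)] = 0
    return heads


# Invented example: "Ala ma małego kota" with made-up labels.
tree = Node("S", centre=1, children=[
    Node("NP", children=[Node("N", token_index=1)]),          # no centre marked
    Node("V", token_index=2),
    Node("NP", centre=1, children=[Node("Adj", token_index=3),
                                   Node("N", token_index=4)]),
])
deps = to_dependencies(tree)
print(sorted(deps.items()))
```

In this sketch the verb (token 2) becomes the dependent of the root, the two nouns depend on the verb, and the adjective depends on its noun.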
Dependency trees are encoded in the CoNLL format (Buchholz and Marsi, 2006). The choice of format was guided by the available parsing systems and the formats they accept. In the CoNLL format, each token is described by the following information: index (ID), orthographic form or punctuation mark (FORM), base form (LEMMA), coarse-grained part of speech (CPOSTAG), fine-grained part of speech (POSTAG), morphosyntactic features (FEATS), head index (HEAD) and type of dependency relation (DEPREL).
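A file in this format can be read with a few lines of Python. The sketch below assumes the usual CoNLL layout: one token per line, tab-separated columns in the order listed above, and a blank line between sentences. It keeps only the eight fields named here and ignores any further columns. The sample sentence, its tags and its relation labels are invented for illustration and are not taken from the treebank.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group tab-separated token lines into sentences split by blank lines."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:              # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                             cols[4], cols[5], int(cols[6]), cols[7]))
    if current:                      # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


# Invented sample; the tags and labels are placeholders, not treebank data.
SAMPLE = "\n".join([
    "1\tAla\tAla\tsubst\tsubst\tsg|nom|f\t2\tsubj\t_\t_",
    "2\tma\tmieć\tfin\tfin\tsg|ter|imperf\t0\tpred\t_\t_",
    "3\tkota\tkot\tsubst\tsubst\tsg|acc|m2\t2\tobj\t_\t_",
    "4\t.\t.\tinterp\tinterp\t_\t2\tpunct\t_\t_",
    "",
])

sentences = read_conll(SAMPLE.splitlines())
for tok in sentences[0]:
    print(tok.id, tok.form, tok.deprel, tok.head)
```

To read the compressed treebank file directly, the same function can be given a handle from `gzip.open(path, "rt", encoding="utf-8")`; the UTF-8 encoding is an assumption here.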
Dependency trees in the CoNLL format: Składnica-zależnościowa-0.5.conll.gz
A MaltParser model trained on Składnica zależnościowa is available at http://zil.ipipan.waw.pl/PolishDependencyParser.
Search engines
Online search engine
The project also involved the creation of a treebank search engine hosted on a web server and allowing for online queries. This makes searching the treebank possible without installing any additional software (the only requirement is the Firefox browser). A unique feature of the engine is the ability to search not only for disambiguated nodes selected by linguists, but for all tree nodes created by the parser. This makes it useful for evaluating dendrologists' decisions.
Source code: http://github.com/balrog-kun/forestsearch
Local mirror: Wyszukiwarka-drzew-sieciowa.tar.gz
Tiger Search
Converting the constituency treebank to the TigerXML format has made it possible to search it with Tiger Search as well. We suggest that Windows users obtain a CD containing the program and the constituency treebank data. Tiger Search will start automatically once the CD is inserted in the drive. Users running other operating systems should install the Tiger Search program manually and load the treebank in TIGERRegistry. After these steps, the treebank (Składnica) will become visible in TIGERSearch.
CD image for Windows along with Składnica and Tiger Search, ready for launch: Składnica-frazowa-0.5+TigerSearch.iso.bz2
Installer version: Tiger Search
Składnica frazowa (constituency treebank) in the Tiger XML format (as above): Składnica-frazowa-0.5-TigerXML.xml.gz
Script for converting constituency treebank files into the Tiger format: forest2tiger.py
MaltEval
Dependency trees can be viewed with MaltEval, a publicly available tool for evaluating dependency parsers that includes a built-in module for visualising dependency structures. Call: java -jar …/MaltEval.jar -v 1 -g …/Składnica-zależnościowa-0.5.conll
Installer version: MaltEval