Diff for "PDB"

Differences between revisions 12 and 16 (spanning 4 versions)
Revision 12 as of 2018-07-18 18:53:57
Size: 2850
Comment:
Revision 16 as of 2018-07-18 19:02:44
Size: 3215
Comment:

Polish Dependency Bank (PDB)

The current version of PDB consists of more than 22K trees, with an average of 15.8 tokens per sentence.

  # sentences               22,208
  # tokens                  351,715
  # tokens per sentence     15.84
  % non-projective trees    8.61
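
For readers unfamiliar with the last statistic, the following minimal sketch shows one common way to test whether a dependency tree is non-projective: a tree counts as non-projective if any two of its arcs cross. It is an illustration only. The function names and the head-list encoding are assumptions made here, and the exact convention behind the PDB figure (for instance, whether the arc from the artificial root is included) is not stated in the source.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based position of the head of token i+1;
    0 stands for the artificial root (an assumed encoding).
    """
    # Store each arc as (left end, right end), including the arc from the root.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross when exactly one end of one arc
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True


def non_projective_share(trees):
    """Percentage of trees in `trees` that are non-projective."""
    non_projective = sum(1 for heads in trees if not is_projective(heads))
    return 100.0 * non_projective / len(trees)


# Toy trees, not taken from PDB.
toy_trees = [
    [2, 0, 2],     # all arcs nested: projective
    [3, 4, 0, 3],  # arcs 1-3 and 2-4 cross: non-projective
]
print(non_projective_share(toy_trees))
```

The pairwise comparison is quadratic in sentence length, which is acceptable for sentences of the size reported above.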

PDB consists of four parts:

  1. NKJP1M trees (14K)
  2. PDB_projected trees (4K)
  3. CDScorpus trees (2K)
  4. OTHER trees (2K)

PDB data

To obtain the entire PDB corpus (22K sentences annotated with dependency trees), please contact alina <at> ipipan.waw.pl (replace <at> with @).

PDB relation types

Descriptions of Polish dependency relation types are available at http://zil.ipipan.waw.pl/PDB/DepRelTypes (outdated).

PDB parser

Some dependency parsing models estimated on PDB are available at http://zil.ipipan.waw.pl/PDB/PDBparser

Polish Dependency Bank in Universal Dependencies format (PDBUD)

PDB is an extended version of the Składnica treebank. Since the UD conversion of Składnica trees constitutes the Polish treebank in the Universal Dependencies collection (release 2.1), PDBUD is an extended and corrected version of that treebank.

The converted PDBUD trees are largely consistent with Polish UD trees. Składnica trees are rather simple, and the sentences underlying that data set lack a number of linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, direct speech, interpolations and comments, and nominative noun phrases used in the vocative function, among many others. The repertoire of UD relation subtypes and language-specific features is therefore slightly extended to cover these phenomena. Furthermore, PDBUD trees contain enhanced edges encoding shared dependents of coordinated elements, e.g. Dziewczynka śpiewa i tańczy. (The girl sings and dances.), and shared governors of coordinated elements, e.g. Dziewczynka i chłopiec śpiewają. (A girl and a boy sing.)
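
As an illustration of what such an enhanced edge can look like, the sketch below encodes the first example sentence in CoNLL-U, the file format used by Universal Dependencies, and reads the enhanced graph from the DEPS column. The rows are hand-written for this example and are not copied from PDBUD: the relation labels follow general UD practice, the XPOS and FEATS columns are left empty, and the helper names `parse_deps` and `multi_headed_tokens` are hypothetical.

```python
# Hand-written CoNLL-U rows for illustration; the labels are assumptions, not PDBUD data.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = (
    "1\tDziewczynka\tdziewczynka\tNOUN\t_\t_\t2\tnsubj\t2:nsubj|4:nsubj\t_\n"
    "2\tśpiewa\tśpiewać\tVERB\t_\t_\t0\troot\t0:root\t_\n"
    "3\ti\ti\tCCONJ\t_\t_\t4\tcc\t4:cc\t_\n"
    "4\ttańczy\ttańczyć\tVERB\t_\t_\t2\tconj\t2:conj\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t2:punct\t_\n"
)


def parse_deps(field):
    """Split a DEPS value such as '2:nsubj|4:nsubj' into (head, relation) pairs."""
    if field == "_":
        return []
    # Heads stay strings, since enhanced graphs may refer to empty nodes like '8.1'.
    return [tuple(item.split(":", 1)) for item in field.split("|")]


def multi_headed_tokens(conllu):
    """Map each word form with more than one enhanced head to its enhanced edges."""
    result = {}
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        deps = parse_deps(cols[8])
        if len(deps) > 1:
            result[cols[1]] = deps
    return result


print(multi_headed_tokens(CONLLU))
```

In this fragment only Dziewczynka has two enhanced heads: it is the subject of śpiewa in the basic tree and, through the enhanced edge, also the subject of tańczy.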

PDBUD trees are currently used in:

  • Shared task on automatic identification of verbal multiword expressions (LAW-MWE-CxG-2018)
  • Shared task on dependency parsing of Polish (PolEval 2018, http://poleval.pl)

PDBUD data

Note! As PDBUD data are used in PolEval 2018, test data are currently not publicly available.

Publications

Acknowledgements

The creation of PDB was supported by grant no POIG.01.01.02-14-013/09 from Innovative Economy Operational Programme co-financed by the European Union (European Regional Development Fund) and by the grant from the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure (2016-2018).

Licence

The resource is distributed under the CC BY-NC-SA 4.0 licence.

Contact

Questions or comments? Please send them to Alina Wróblewska <alina AT SPAMFREE ipipan DOT waw DOT pl>.