Polish Dependency Bank (PDB)
The current version of PDB consists of more than 22K trees, with 15.8 tokens per sentence on average.
# sentences | 22,208
# tokens | 351,715
# tokens per sentence | 15.84
% non-projective trees | 8.61
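The statistics above can be reproduced from the trees themselves. The following is a minimal sketch, not the script used to produce the table: it assumes each tree is given as a list of 1-based head indices (0 for the root), and it treats a tree as non-projective when two of its arcs cross, which is one common way of defining projectivity. The function names `is_projective` and `non_projective_share` are hypothetical.

```python
def is_projective(heads):
    """Return True if no two arcs of the tree cross.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    The arc from the artificial root (position 0) is included in the test.
    """
    arcs = [(min(head, dep), max(head, dep))
            for dep, head in enumerate(heads, start=1)]
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Arc b starts strictly inside arc a and ends strictly outside it.
            if a_left < b_left < a_right < b_right:
                return False
    return True


def non_projective_share(trees):
    """Percentage of trees that contain at least one crossing arc."""
    crossing = sum(1 for heads in trees if not is_projective(heads))
    return 100.0 * crossing / len(trees)


# Average sentence length from the totals in the table.
print(round(351_715 / 22_208, 2))  # 15.84
```

The quadratic arc comparison is sufficient for sentences of about 16 tokens; a stack-based test would be preferable for very long sentences.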
PDB consists of four parts:
1. NKJP1M part (14K trees)
2. PDB_projected part (4K trees)
3. CDScorpus (2K trees)
4. OTHER (2K trees)
PDB data
If you wish to get the entire PDB corpus (22K sentences annotated with dependency trees), please contact alina <at> ipipan.waw.pl (replace <at> with @).
PDB relation types
Descriptions of Polish dependency relation types are available at http://zil.ipipan.waw.pl/PDB/DepRelTypes (outdated).
PDB parser
Some dependency parsing models estimated on PDB are available at http://zil.ipipan.waw.pl/PDB/PDBparser
Polish Dependency Bank in Universal Dependencies format (PDBUD)
PDB is an extended version of the Składnica treebank. Since the UD conversion of the Składnica trees constitutes the Polish treebank in the Universal Dependencies collection (release 2.1), PDBUD is an extended and corrected version of that treebank.
The converted PDBUD trees are largely consistent with Polish UD trees. The Składnica trees are rather simple, and the sentences underlying that data set lack some linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, direct speech, interpolations and comments, nominative noun phrases used in the vocative function, and many others. The repertoire of UD relation subtypes and language-specific features is therefore slightly extended to cover these phenomena. Furthermore, PDBUD trees contain enhanced edges encoding shared dependents of coordinated elements, e.g. "Dziewczynka śpiewa i tańczy." (The girl sings and dances.), and shared governors of coordinated elements, e.g. "Dziewczynka i chłopiec śpiewają." (A girl and a boy sing.).
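To make the enhanced edges concrete, here is a small sketch of how the two example sentences might look in the CoNLL-U DEPS column, where each value lists `head:relation` pairs separated by `|`. The annotation below is illustrative only: it follows general UD conventions and is not copied from the PDBUD files, so the actual relation labels (including any PDBUD subtypes) may differ. The helper names `parse_deps` and `multi_governed` are hypothetical, and empty nodes with decimal IDs are not handled.

```python
def parse_deps(field):
    """Split a DEPS value such as '2:nsubj|4:nsubj' into (head, relation) pairs."""
    if field == "_":
        return []
    pairs = []
    for item in field.split("|"):
        head, relation = item.split(":", 1)
        pairs.append((int(head), relation))  # assumes integer heads (no empty nodes)
    return pairs


def multi_governed(tokens):
    """Map each word form with more than one enhanced governor to its edges."""
    result = {}
    for _, form, _, _, deps in tokens:
        edges = parse_deps(deps)
        if len(edges) > 1:
            result[form] = edges
    return result


# (ID, FORM, HEAD, DEPREL, DEPS): assumed, illustrative annotation.
SHARED_DEPENDENT = [
    (1, "Dziewczynka", 2, "nsubj", "2:nsubj|4:nsubj"),  # subject of both verbs
    (2, "śpiewa", 0, "root", "0:root"),
    (3, "i", 4, "cc", "4:cc"),
    (4, "tańczy", 2, "conj", "2:conj"),
    (5, ".", 2, "punct", "2:punct"),
]

SHARED_GOVERNOR = [
    (1, "Dziewczynka", 4, "nsubj", "4:nsubj"),
    (2, "i", 3, "cc", "3:cc"),
    (3, "chłopiec", 1, "conj", "1:conj|4:nsubj"),  # second conjunct is also a subject
    (4, "śpiewają", 0, "root", "0:root"),
    (5, ".", 4, "punct", "4:punct"),
]

print(multi_governed(SHARED_DEPENDENT))
print(multi_governed(SHARED_GOVERNOR))
```

In the first sentence the shared dependent gains an extra edge to the second verb; in the second, the second conjunct gains an extra edge to the shared governor.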
PDBUD trees are currently used in:
- Shared task on automatic identification of verbal multiword expressions (LAW-MWE-CxG-2018)
- Shared task on dependency parsing of Polish (PolEval 2018, http://poleval.pl)
PDBUD data
Note! As PDBUD data are used in PolEval 2018, test data are currently not publicly available.
- Basic PDBUD trees: PDBUD_nosem.zip
- PDBUD trees with enhanced edges and semantic labels: PDBUD_sem.zip
Publications
Licence
The resources are distributed under the CC BY-NC-SA 4.0 licence (https://creativecommons.org/licenses/by-nc-sa/4.0/).
Contact
Questions or comments? Please send them to Alina Wróblewska <alina AT SPAMFREE ipipan DOT waw DOT pl>.