Polish Dependency Bank 2.0 (PDB 2.0)
PDB 2.0 is an extended version of the Składnica zależnościowa treebank. It consists of 22,208 trees and 351,175 tokens (i.e. 15.8 tokens per sentence on average). PDB 2.0 has four parts:
- NKJP1M trees (14K)
- PDB_projected trees (4K)
- CDScorpus-based trees (2K)
- OTHER trees (2K)
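The per-sentence average quoted above follows directly from the two totals. The sketch below shows one way to recompute such figures from a CoNLL-style file. It assumes (this is not stated on this page) that sentences are separated by blank lines, that each token occupies one line, and that lines starting with `#` are comments. The function name `treebank_stats` is hypothetical.

```python
def treebank_stats(conll_text):
    """Return (sentences, tokens, tokens per sentence) for CoNLL-style text.

    Assumptions: blank lines separate sentences, every other line that
    does not start with '#' is exactly one token.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment/metadata line, not a token.
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    average = tokens / sentences if sentences else 0.0
    return sentences, tokens, average


# Sanity check with the totals reported for PDB 2.0.
print(round(351175 / 22208, 1))  # 15.8
```

This is a simplification: in full CoNLL-U files, multiword-token ranges and empty nodes also occupy their own lines and would have to be filtered out before counting.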
The PDB sentences contain some problematic linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, direct speech, interpolations and comments, nominative noun phrases used in the vocative function, and many others. Descriptions of the Polish dependency relation types are at http://zil.ipipan.waw.pl/PDB/DepRelTypes (outdated). Some dependents are annotated with semantic roles, e.g. Beneficiary/Recipient.
The updated version of Składnica zależnościowa (the first version of PDB) is attached to this page as NKJP1M_Skladnica_sem.conll. If you wish to get the entire PDB corpus (22K sentences annotated with dependency trees), please contact alina <at> ipipan.waw.pl (replace <at> with @).
PDB in Universal Dependencies format (PDBUD)
PDBUD is a conversion of PDB into a UD-like format. It is an extended and corrected version of the Polish UD treebank (release 2.1). PDBUD contains enhanced graphs, i.e. trees with enhanced edges encoding shared dependents of coordinated elements, e.g. Dziewczynka śpiewa i tańczy (The girl sings and dances), and shared governors of coordinated elements, e.g. Dziewczynka i chłopiec śpiewają (A girl and a boy sing). PDBUD trees were used in:
- Shared task on automatic identification of verbal multiword expressions (LAW-MWE-CxG-2018)
- Shared task on dependency parsing of Polish (PolEval 2018, http://poleval.pl)
PDBUD is publicly available at http://git.nlp.ipipan.waw.pl/alina/PDBUD
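The shared-dependent case mentioned above can be illustrated with a small sketch. The rows below are hand-written in the standard ten-column CoNLL-U layout for the sentence Dziewczynka śpiewa i tańczy. They are not copied from PDBUD, and the relation labels are assumptions made for illustration. The hypothetical helper `enhanced_only_edges` lists the edges that appear in the enhanced graph (DEPS column) but not in the basic tree (HEAD and DEPREL columns).

```python
# Hand-written CoNLL-U-style rows (illustrative only, not taken from PDBUD).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = """\
1\tDziewczynka\tdziewczynka\tNOUN\t_\t_\t2\tnsubj\t2:nsubj|4:nsubj\t_
2\tśpiewa\tśpiewać\tVERB\t_\t_\t0\troot\t0:root\t_
3\ti\ti\tCCONJ\t_\t_\t4\tcc\t4:cc\t_
4\ttańczy\ttańczyć\tVERB\t_\t_\t2\tconj\t2:conj\t_
"""


def enhanced_only_edges(conllu_sentence):
    """Return (governor, label, dependent_id, form) for enhanced-only edges."""
    extra = []
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        basic_edge = (cols[6], cols[7])  # HEAD and DEPREL of the basic tree
        deps = cols[8]
        if deps == "_":
            continue
        for item in deps.split("|"):
            # Split on the first colon only, so subtyped labels stay intact.
            governor, _, label = item.partition(":")
            if (governor, label) != basic_edge:
                extra.append((governor, label, token_id, form))
    return extra


print(enhanced_only_edges(SAMPLE))
# [('4', 'nsubj', '1', 'Dziewczynka')]
```

In this toy sentence the only enhanced-only edge makes Dziewczynka a subject of tańczy as well as of śpiewa, which is the shared-dependent pattern described above.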
PDB-based parsers
Some dependency parsing models estimated on PDB are available at http://zil.ipipan.waw.pl/PDB/PDBparser
Publications
Acknowledgements
The creation of PDB was supported by grant no. POIG.01.01.02-14-013/09 from the Innovative Economy Operational Programme, co-financed by the European Union (European Regional Development Fund), and by a grant from the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure (2016-2018).
Licence
The resources are distributed under the CC BY-NC-SA 4.0 licence.
Contact
Any questions or comments? Please send them to Alina Wróblewska <alina AT SPAMFREE ipipan DOT waw DOT pl>.