Diff for "PDB"

Differences between revisions 24 and 59 (spanning 35 versions)
Revision 24 as of 2018-09-10 13:33:22
Size: 3061
Comment:
Revision 59 as of 2022-09-09 07:04:06
Size: 3539
Comment:
Line 7: Line 7:
The current version of PDB consists of 22,208 trees and 351,175 tokens (i.e. 15.8 tokens per sentence on average). There are four parts of PDB:
 1. NKJP1M trees (14K)
 2. PDB_projected trees (4K)
 3. CDScorpus trees (2K)
PDB 2.0 is an extended version of ''Składnica zależnościowa'' (the first Polish dependency treebank). It consists of 22,152 trees and 350,001 tokens (i.e. 15.8 tokens per sentence on average). There are four parts of PDB 2.0:
 1. NKJP1M-based trees (14K)
 2. Projection-based trees (4K)
 3. CDScorpus-based trees (2K)
Line 13: Line 13:
=== PDB data ===
The PDB sentences contain some problematic linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, directed speech, interpolations and comments, nominative noun phrases used in the vocative function and many others. The Polish dependency relation types are as follows: abbrev_punct, adjunct, adjunct_compar, adjunct_qt, adjunct_rc, aglt, app, aux, comp, comp_ag, comp_fin, comp_inf, cond, conjunct, imp, item, mwe, ne, neg, obj, obj_th, pd, pre_coord, punct, refl, root, subj, vocative. Descriptions of Polish dependency relation types are at [[http://zil.ipipan.waw.pl/PDB/DepRelTypes]]. Some dependents are annotated with semantic roles, e.g. Beneficiary/Recipient.
Line 15: Line 15:
 * Updated [[attachment:NKJP1M_Skladnica_sem.conll|Składnica zależnościowa]] (It is the previous version of PDB with 8K trees)

If you wish to get the entire PDB corpus (22K sentences annotated with the dependency trees) please contact ''alina'' <at> ''ipipan.waw.pl'' (replace <at> with @).

=== PDB relation types ===
Descriptions of Polish dependency relation types are at [[http://zil.ipipan.waw.pl/PDB/DepRelTypes]] (outdated).
'''Download''': The updated version of [[attachment:NKJP1M_Skladnica_sem.conll|Składnica zależnościowa]] (the first version of PDB). If you wish to get the entire PDB corpus (22K sentences annotated with the dependency trees) please contact ''alina'' <at> ''ipipan.waw.pl'' (replace <at> with @).
Line 23: Line 18:
== Polish Dependency Bank in Universal Dependencies format (PDBUD) ==

PDB is an extended version of Składnica treebank. Since the UD conversion of Składnica trees constitutes the Polish treebank in Universal Dependencies collection (the release 2.1), PDBUD is thus an extended and corrected version of this treebank.

The converted PDBUD trees are largely consistent with Polish UD trees. Składnica trees are rather simple and the sentences underlying this data set do not contain some linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, directed speech, interpolations and comments, nominative noun phrases used in the vocative function and many others. Therefore, the repertoire of UD relation subtypes and language-specific features is slightly extended to cover these phenomena. Furthermore, PDBUD trees contain enhanced edges encoding shared dependents of coordinated elements, e.g. Dziewczynka śpiewa i tańczy. (The girl sings and dances.), and shared governors of coordinated elements, e.g. Dziewczynka i chłopiec śpiewają. (A girl and a boy sing.)

PDBUD trees are currently used in:
 * Shared task on automatic identification of verbal multiword expressions (LAW-MWE-CxG-2018)
 * Shared task on dependency parsing of Polish (PolEval 2018, [[http://poleval.pl]])

=== PDBUD data ===
Link to [[http://git.nlp.ipipan.waw.pl/alina/PDBUD|PDBUD]]

== PDB parser ==
Some dependency parsing models estimated on PDB are available at [[http://zil.ipipan.waw.pl/PDB/PDBparser]]

== Publications ==

== Acknowledgements ==
The creation of PDB was supported by grant no POIG.01.01.02-14-013/09 from Innovative Economy Operational Programme co-financed by the European Union (European Regional Development Fund) and by the grant from the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure (2016-2018).
Line 45: Line 20:
== Licence ==
== PDB in Universal Dependencies format (PDB-UD) ==
Line 47: Line 22:
The resources is distributed under the [[https://creativecommons.org/licenses/by-nc-sa/4.0/|CC BY-NC-SA 4.0]] licence.
PDB-UD is a conversion of PDB in the UD-like format. It is an extended and corrected version of the Polish UD treebank (the release 2.1). PDB-UD contains enhanced graphs, i.e. trees with enhanced edges encoding shared dependents of coordinated elements, e.g. ''Dziewczynka śpiewa i tańczy'' (The girl sings and dances), and shared governors of coordinated elements, e.g. ''Dziewczynka i chłopiec śpiewają'' (A girl and a boy sing). The Polish dependency types are listed [[https://universaldependencies.org/pl/dep|here]].
Line 49: Line 24:
== Contact ==
PDB-UD trees were used in two shared tasks: [[http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_04_LAW-MWE-CxG_2018___lb__COLING__rb__&subpage=CONF_40_Shared_Task|LAW-MWE-CxG-2018]] and [[http://poleval.pl|PolEval 2018]].

'''Download:''' PDB-UD is publicly available on [[http://git.nlp.ipipan.waw.pl/alina/PDBUD]]

'''Download''': Alternatively, you can download PDB-UD trees from [[https://github.com/UniversalDependencies/UD_Polish-PDB|UD repository]].

== PDB-trained COMBO's models ==
Natural language preprocessing models estimated on PDB and PDB-UD are available at [[http://zil.ipipan.waw.pl/PDB/COMBO]].

=== Publications ===

<<BibMate(key, "wro:14", omitYears=true)>>
<<BibMate(key, "wrob:18", omitYears=true)>>
<<BibMate(key, "wro:2020", omitYears=true)>>

=== Acknowledgements ===
The creation of PDB was supported by grant no POIG.01.01.02-14-013/09 from Innovative Economy Operational Programme co-financed by the European Union (European Regional Development Fund) and by the grant from the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure (2016-2020).


=== Licence ===
The resources are distributed under the [[https://creativecommons.org/licenses/by-nc-sa/4.0/|CC BY-NC-SA 4.0]] licence.

=== Contact ===

Polish Dependency Bank 2.0 (PDB 2.0)

PDB 2.0 is an extended version of Składnica zależnościowa (the first Polish dependency treebank). It consists of 22,152 trees and 350,001 tokens (i.e. 15.8 tokens per sentence on average). There are four parts of PDB 2.0:

  1. NKJP1M-based trees (14K)
  2. Projection-based trees (4K)
  3. CDScorpus-based trees (2K)
  4. OTHER trees (2K)
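
The reported average follows directly from the two totals above. As a minimal sketch, the snippet below counts trees and tokens in CoNLL-style text, assuming sentences are separated by blank lines and each token occupies one line; the function name `corpus_stats` and the two-sentence sample are hypothetical and are not taken from the PDB distribution.

```python
from typing import Tuple


def corpus_stats(conll_text: str) -> Tuple[int, int]:
    """Return (number of trees, number of tokens) for CoNLL-style text.

    Assumes one token per line, blank lines between sentences,
    and optional comment lines starting with '#'.
    """
    trees = 0
    tokens = 0
    in_sentence = False
    for raw_line in conll_text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines carry metadata, not tokens.
            continue
        if not in_sentence:
            trees += 1
            in_sentence = True
        tokens += 1
    return trees, tokens


# Hypothetical two-sentence sample; only ID and FORM columns are shown.
SAMPLE = (
    "1\tDziewczynka\n2\tśpiewa\n3\ti\n4\ttańczy\n5\t.\n"
    "\n"
    "1\tDziewczynka\n2\ti\n3\tchłopiec\n4\tśpiewają\n5\t.\n"
)

trees, tokens = corpus_stats(SAMPLE)
print(trees, tokens, tokens / trees)  # 2 10 5.0

# Totals reported for PDB 2.0 on this page.
print(round(350_001 / 22_152, 1))  # 15.8
```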

The PDB sentences contain some problematic linguistic phenomena, e.g. ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction JAKO, directed speech, interpolations and comments, nominative noun phrases used in the vocative function and many others. The Polish dependency relation types are as follows: abbrev_punct, adjunct, adjunct_compar, adjunct_qt, adjunct_rc, aglt, app, aux, comp, comp_ag, comp_fin, comp_inf, cond, conjunct, imp, item, mwe, ne, neg, obj, obj_th, pd, pre_coord, punct, refl, root, subj, vocative. Descriptions of the Polish dependency relation types are available at http://zil.ipipan.waw.pl/PDB/DepRelTypes. Some dependents are annotated with semantic roles, e.g. Beneficiary/Recipient.
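
The relation inventory listed above can serve as a simple sanity check when processing PDB trees. The sketch below is illustrative only: the set is copied from the list on this page, while the helper name `unknown_labels` is hypothetical and not part of any PDB tool.

```python
# The 28 PDB dependency relation types as listed on this page.
PDB_RELATIONS = frozenset(
    """
    abbrev_punct adjunct adjunct_compar adjunct_qt adjunct_rc aglt app aux
    comp comp_ag comp_fin comp_inf cond conjunct imp item mwe ne neg obj
    obj_th pd pre_coord punct refl root subj vocative
    """.split()
)


def unknown_labels(labels):
    """Return the sorted labels that are not in the PDB inventory."""
    return sorted(set(labels) - PDB_RELATIONS)


print(len(PDB_RELATIONS))                        # 28
print(unknown_labels(["subj", "obj", "nsubj"]))  # ['nsubj']
```

In this example `nsubj` is flagged because it is a Universal Dependencies label rather than one of the native PDB relation types.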

Download: The updated version of Składnica zależnościowa (the first version of PDB) is available as the attachment NKJP1M_Skladnica_sem.conll. To obtain the entire PDB corpus (22K sentences annotated with dependency trees), please contact alina <at> ipipan.waw.pl (replace <at> with @).

PDB in Universal Dependencies format (PDB-UD)

PDB-UD is a conversion of PDB to a UD-like format. It is an extended and corrected version of the Polish UD treebank (release 2.1). PDB-UD contains enhanced graphs, i.e. trees with enhanced edges encoding shared dependents of coordinated elements, e.g. Dziewczynka śpiewa i tańczy (The girl sings and dances), and shared governors of coordinated elements, e.g. Dziewczynka i chłopiec śpiewają (A girl and a boy sing). The Polish dependency types are listed at https://universaldependencies.org/pl/dep.
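
To illustrate what the two kinds of enhanced edges look like, here is a minimal sketch that derives them from a basic UD tree for the two example sentences. It is not the conversion procedure used for PDB-UD. It assumes UD-style coordination (later conjuncts attached to the first one with `conj`), treats every `nsubj` of the first conjunct as shared, and does not propagate the `root` edge; the function names and the token dictionaries are hypothetical.

```python
from collections import defaultdict

# Assumption: only subjects are treated as shared dependents in this sketch.
SHARED_DEPENDENT_RELS = {"nsubj"}


def enhance(tokens):
    """Map each token id to its sorted list of (head, relation) edges,
    i.e. the basic edge plus edges propagated across coordination."""
    by_id = {t["id"]: t for t in tokens}
    deps = defaultdict(set)
    for t in tokens:
        deps[t["id"]].add((t["head"], t["deprel"]))
    for conj in tokens:
        if conj["deprel"] != "conj":
            continue
        first = by_id[conj["head"]]
        # Shared governor: copy the first conjunct's incoming edge
        # (skipped for the root in this simplified sketch).
        if first["head"] != 0:
            deps[conj["id"]].add((first["head"], first["deprel"]))
        # Shared dependents: attach the first conjunct's subject
        # to the other conjunct as well.
        for t in tokens:
            if t["head"] == first["id"] and t["deprel"] in SHARED_DEPENDENT_RELS:
                deps[t["id"]].add((conj["id"], t["deprel"]))
    return {i: sorted(edges) for i, edges in deps.items()}


def deps_column(edges):
    """Format edges in the style of the CoNLL-U DEPS column."""
    return "|".join(f"{head}:{rel}" for head, rel in edges)


# Dziewczynka śpiewa i tańczy (shared dependent)
S1 = [
    {"id": 1, "form": "Dziewczynka", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "śpiewa", "head": 0, "deprel": "root"},
    {"id": 3, "form": "i", "head": 4, "deprel": "cc"},
    {"id": 4, "form": "tańczy", "head": 2, "deprel": "conj"},
]

# Dziewczynka i chłopiec śpiewają (shared governor)
S2 = [
    {"id": 1, "form": "Dziewczynka", "head": 4, "deprel": "nsubj"},
    {"id": 2, "form": "i", "head": 3, "deprel": "cc"},
    {"id": 3, "form": "chłopiec", "head": 1, "deprel": "conj"},
    {"id": 4, "form": "śpiewają", "head": 0, "deprel": "root"},
]

print(deps_column(enhance(S1)[1]))  # 2:nsubj|4:nsubj
print(deps_column(enhance(S2)[3]))  # 1:conj|4:nsubj
```

In the first sentence Dziewczynka becomes the subject of both verbs; in the second, chłopiec receives the same governor and relation as Dziewczynka. Whether a given dependent is really shared is an annotation decision in the treebank, which this sketch does not model.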

PDB-UD trees were used in two shared tasks: LAW-MWE-CxG-2018 and PolEval 2018.

Download: PDB-UD is publicly available at http://git.nlp.ipipan.waw.pl/alina/PDBUD

Download: Alternatively, you can download PDB-UD trees from the UD repository (https://github.com/UniversalDependencies/UD_Polish-PDB).

PDB-trained COMBO models

Natural language preprocessing models estimated on PDB and PDB-UD are available at http://zil.ipipan.waw.pl/PDB/COMBO.

Publications

List of publications

Alina Wróblewska. Polish Dependency Parser Trained on an Automatically Induced Dependency Bank. Ph.D. dissertation, Institute of Computer Science, Polish Academy of Sciences, Warsaw, 2014.

Alina Wróblewska. Extended and enhanced Polish dependency bank in Universal Dependencies format. In Marie-Catherine de Marneffe, Teresa Lynn, and Sebastian Schuster, editors, Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 173–182. Association for Computational Linguistics, 2018.

Alina Wróblewska. Towards the Conversion of National Corpus of Polish to Universal Dependencies. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5308–5315, Marseille, France, 2020. European Language Resources Association (ELRA).

Acknowledgements

The creation of PDB was supported by grant no. POIG.01.01.02-14-013/09 from the Innovative Economy Operational Programme, co-financed by the European Union (European Regional Development Fund), and by a grant from the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure (2016-2020).

Licence

The resources are distributed under the CC BY-NC-SA 4.0 licence.

Contact

Any questions, comments? Please send them to Alina Wróblewska <alina AT SPAMFREE ipipan DOT waw DOT pl>.