Diff for "PolishDiscourseCorpus"

Differences between revisions 1 and 30 (spanning 29 versions)
Revision 1 as of 2020-12-18 16:27:31
Size: 1705
Comment:
Revision 30 as of 2026-04-29 14:29:34
Size: 3803
Comment:
Line 1: Line 1:
## page was renamed from PolishDiscourseCorpus
Line 4: Line 5:
This page offers the official [[http://creativecommons.org/licenses/by/3.0/deed.en_US|Creative Commons Attribution 3.0 Unported License]] release of the corpus of discourse relations created as a part of the [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] project. By downloading the corpus data you accept the conditions of that licence. The corpus of discourse relations is based on the [[PCC|Polish Coreference Corpus]].
== Version 0.1 ==
Line 6: Line 8:
'''Contact person:'''
[[MaciejOgrodniczuk|Maciej Ogrodniczuk]]<<BR>>
'''License:''' CC BY v.3
=== Documentation ===
Line 10: Line 10:
{{http://i.creativecommons.org/l/by/3.0/88x31.png}}
The [[attachment:instrukcja-anotacji-metatekstu.pdf|annotation instructions]] (in Polish) were created by Celina Heliasz.
Line 12: Line 12:
== Documentation ==
=== Download ===
Line 14: Line 14:
 * [[attachment:PCC_README_EN.pdf|Description of the corpus, in English]]
 * [[attachment:PCC_README_PL.pdf|Description of the corpus, in Polish]]
The corpus is available for download as a [[attachment:corpus.tar.gz|compressed archive]] in the format of the [[Discann|Discann annotation tool]], containing:
 * 1773 source XML TEI files of the Polish Coreference Corpus
 * the metatext.xml file containing descriptions of all relations
Line 17: Line 18:
== Downloads ==
=== Funding ===
Line 19: Line 20:
The corpus is available for download in 3 formats:
 * [[attachment:PCC-1.5-MMAX.zip|full corpus in MMAX format]] ([[attachment:example_text_mmax.zip|example text in MMAX format]])
 * [[attachment:PCC-1.5-TEI.zip|full corpus in TEI format]] ([[attachment:example_text_tei.zip|example text in TEI format]])
 * [[attachment:PCC-1.5-BRAT.zip|full corpus in BRAT format]] ([[attachment:example_text_brat.zip|example text in BRAT format]])
Version 0.1 of the corpus was financed by the Polish Ministry of Education and Science under the agreement DIR/WK/2016/02.
Line 24: Line 22:
== Online version ==
== Version 1.0 ==
Line 26: Line 24:
The corpus is available:
 * [[http://cothec.nlp.ipipan.waw.pl/|for browsing]]
 * [[http://pcc.nlp.ipipan.waw.pl/|for search]]
=== Documentation ===
Line 30: Line 26:
You may also want to see [[PolishCoreferenceTools|Polish Coreference Tools site]].
The [[attachment:anotacja-pdc.pdf|annotation guidelines]] (in Polish) were created by Maciej Ogrodniczuk.
Line 32: Line 28:
== Citing ==
When using Polish Coreference Corpus, please cite our book on coreference:
<<BibMate(key, "ogr:etal:15:gruyter", omitYears=true)>>
=== Download ===
Line 36: Line 30:
but you can also check [[http://core.ipipan.waw.pl/|the project page]] for earlier publications.
The corpus is available for download in the form of a [[attachment:pdc.zip|zip file]] in the [[https://clarin.biz/tools/inforex|Inforex]] format.

=== Funding ===

Version 1.0 of the corpus was financed by the European Regional Development Fund as part of the 2014–2020 Smart Growth Operational Programme (CLARIN: Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19); by the Polish Ministry of Education and Science, grant 2022/WK/09, continued as part of the investment CLARIN ERIC (European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure, period: 2024–2026) funded by the Polish Ministry of Science and Higher Education under the programme “Support for the participation of Polish scientific teams in international research infrastructure projects”, agreement number 2024/WK/01; and by CLARIN-PL, the European Regional Development Fund, FENG programme, agreement number FENG.02.04-IP.040004/24.

== DISRPT 2025 Version ==

The PDC dataset was also converted to the format of the [[https://sites.google.com/view/disrpt2025/|DISRPT2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification]].

This version of the dataset contains annotation for discontinuous discourse units and connectives. Explicit relation connectives are not included in the argument spans for the .rels data.
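
To illustrate what a discontinuous unit means in terms of token positions, here is a minimal sketch. It assumes that token spans are written as comma-separated ranges such as `3-5,9-10`. That notation and the helper names `parse_token_span` and `is_discontinuous` are assumptions made for illustration, not part of the PDC release, so check the actual .rels files before relying on it.

```python
from typing import List, Set


def parse_token_span(span: str) -> List[int]:
    """Expand a span string such as "3-5,9-10" into sorted token indices.

    Assumed notation: comma-separated parts, each either a single index
    ("7") or an inclusive range ("3-5").
    """
    indices: Set[int] = set()
    for part in span.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            indices.update(range(int(start), int(end) + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


def is_discontinuous(span: str) -> bool:
    """Return True if the span skips at least one token position."""
    indices = parse_token_span(span)
    return any(b - a > 1 for a, b in zip(indices, indices[1:]))
```

Under this assumption, `is_discontinuous("3-5,9-10")` is true because tokens 6 to 8 lie outside the unit, while a plain range such as `"3-5"` is contiguous.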

POS tags, morphology, and syntactic parses were added using Stanza's `default_accurate` model for Polish (`pl`) while preserving tokenization and sentence splits from the Polish Coreference Corpus.

{{{#!highlight python
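import stanza

# tokenize_pretokenized=True makes Stanza keep the tokenization and
# sentence splits of the input instead of running its own tokenizer.
nlp = stanza.Pipeline(
    'pl',
    pretokenized=True,
    tokenize_pretokenized=True,
    package='default_accurate',
)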
}}}

=== Download ===

The corpus is available for download in the form of CoNLL-U annotations from the [[https://github.com/disrpt/sharedtask2025/tree/master/data/pol.iso.pdc|DISRPT 2025 GitHub]].
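
For readers who want to inspect the downloaded annotations without extra dependencies, the following is a minimal sketch of a reader for the standard ten-column CoNLL-U layout. The function name `read_conllu` and the toy sentence in `SAMPLE` are made up for illustration and are not taken from the corpus; the sketch also skips all comment lines, so any metadata stored there is ignored.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment / metadata line: skipped in this sketch.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Toy example (hypothetical data, not from the corpus).
SAMPLE = (
    "# sent_id = toy-1\n"
    "1\tTo\tto\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tdziała\tdziałać\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(SAMPLE.splitlines()))
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded .conllu files.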

=== Funding ===

This research was funded in whole by the National Science Centre, Poland, grant 2023/50/A/HS2/00559 (''Universal Discourse: a multilingual model of discourse relations'').

== Licence ==

[[https://creativecommons.org/licenses/by-nc/4.0/|Creative Commons Attribution-NonCommercial 4.0 International License]]

{{http://i.creativecommons.org/l/by-nc/4.0/88x31.png}}

== Universal Discourse Version ==

Reannotation of the PDC data according to the [[http://udisc.org|Universal Discourse]] annotation guidelines is ongoing. An earlier, already outdated version of the [[attachment:anotacja-iso.pdf|annotation guidelines]] is available for reference.


== Please cite ==
<<BibMate(key, "ogr:etal:24", "tom:etal:24:iso", "zur:etal:23:ldk", "hel:ogr:19:lc", omitYears=true)>>

Polish Discourse Corpus / Polski Korpus Metatekstowy

The corpus of discourse relations is based on the Polish Coreference Corpus.

Version 0.1

Documentation

The annotation instructions (in Polish) were created by Celina Heliasz.

Download

The corpus is available for download as a compressed archive in the format of the Discann annotation tool, containing:

  • 1773 source XML TEI files of the Polish Coreference Corpus
  • the metatext.xml file containing descriptions of all relations

Funding

Version 0.1 of the corpus was financed by the Polish Ministry of Education and Science under the agreement DIR/WK/2016/02.

Version 1.0

Documentation

The annotation guidelines (in Polish) were created by Maciej Ogrodniczuk.

Download

The corpus is available for download in the form of a zip file in the Inforex format.

Funding

Version 1.0 of the corpus was financed by the European Regional Development Fund as part of the 2014–2020 Smart Growth Operational Programme (CLARIN: Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19); by the Polish Ministry of Education and Science, grant 2022/WK/09, continued as part of the investment CLARIN ERIC (European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure, period: 2024–2026) funded by the Polish Ministry of Science and Higher Education under the programme “Support for the participation of Polish scientific teams in international research infrastructure projects”, agreement number 2024/WK/01; and by CLARIN-PL, the European Regional Development Fund, FENG programme, agreement number FENG.02.04-IP.040004/24.

DISRPT 2025 Version

The PDC dataset was also converted to the format of the DISRPT2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification.

This version of the dataset contains annotation for discontinuous discourse units and connectives. Explicit relation connectives are not included in the argument spans for the .rels data.

POS tags, morphology, and syntactic parses were added using Stanza's default_accurate model for Polish (pl) while preserving tokenization and sentence splits from the Polish Coreference Corpus.

    nlp = stanza.Pipeline(
        'pl',
        pretokenized=True,
        tokenize_pretokenized=True,
        package='default_accurate',
    )

Download

The corpus is available for download in the form of CoNLL-U annotations from the DISRPT 2025 GitHub.

Funding

This research was funded in whole by the National Science Centre, Poland, grant 2023/50/A/HS2/00559 (Universal Discourse: a multilingual model of discourse relations).

Licence

Creative Commons Attribution-NonCommercial 4.0 International License

Universal Discourse Version

Reannotation of the PDC data according to the Universal Discourse annotation guidelines is ongoing. An earlier, already outdated version of the annotation guidelines is available for reference.

Please cite

List of publications

Maciej Ogrodniczuk, Aleksandra Tomaszewska, Daniel Ziembicki, Sebastian Żurowski, Ryszard Tuora, and Aleksandra Zwierzchowska. Polish Discourse Corpus (PDC): Corpus design, ISO-compliant annotation, data highlights, and parser development. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12829–12835, Torino, Italy, 2024. ELRA and ICCL.

Aleksandra Tomaszewska, Purificação Silvano, António Leal, and Evelin Amorim. ISO 24617-8 applied: Insights from multilingual discourse relations annotation in English, Polish, and Portuguese. In Harry Bunt, Nancy Ide, Kiyong Lee, Volha Petukhova, James Pustejovsky, and Laurent Romary, editors, Proceedings of the 20th Joint ACL – ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024, pages 99–110, Torino, Italy, 2024. ELRA and ICCL.

Sebastian Żurowski, Daniel Ziembicki, Aleksandra Tomaszewska, Maciej Ogrodniczuk, and Agata Drozd. Adopting ISO 24617-8 for discourse relations annotation in Polish: Challenges and future directions. In Sara Carvalho, Anas Fahad Khan, Ana Ostroski Anić, Blerina Spahiu, Jorge Gracia, John P. McCrae, Dagmar Gromann, Barbara Heinisch, and Ana Castro Salgado, editors, Proceedings of the 4th Conference on Language, Data and Knowledge, pages 482–492, Vienna, Austria, 2023. NOVA CLUNL, Portugal.

Celina Heliasz and Maciej Ogrodniczuk. Eksplicytność a implicytność w świetle analizy korpusowej (meta)tekstu. Linguistica Copernicana, 16:75–100, 2019.