Revision 27 as of 2026-04-22 15:30:31
| Line 1: | Line 1: |
| ## page was renamed from PolishDiscourseCorpus | |
| Line 4: | Line 5: |
| The following corpus of discourse relations is based on the [[PCC|Polish Coreference Corpus]] as part of the [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] project. | The corpus of discourse relations is based on the [[PCC|Polish Coreference Corpus]]. The annotation of the corpus was completed using the [[Discann|Discann annotation tool]]. |
| Line 6: | Line 7: |
| == Documentation == | == Version 0.1 == |
| Line 8: | Line 9: |
| Please see the [[attachment:instrukcja-anotacji-metatekstu.pdf|annotation instructions]], in Polish. | === Documentation === |
| Line 10: | Line 11: |
| == Licence == | The [[attachment:instrukcja-anotacji-metatekstu.pdf|annotation instructions]] (in Polish) were created by Celina Heliasz. |
| Line 12: | Line 13: |
| [[http://creativecommons.org/licenses/by/3.0/deed.en_US|Creative Commons Attribution 3.0 Unported License]] {{http://i.creativecommons.org/l/by/3.0/88x31.png}} == Downloads == |
=== Download === |
| Line 22: | Line 19: |
| == Citing == Please cite: <<BibMate(key, "hel:ogr:19:lc", omitYears=true)>> |
=== Funding ===
Version 0.1 of the corpus was financed by the Polish Ministry of Education and Science under the agreement DIR/WK/2016/02.

== Version 1.0 ==
=== Documentation ===
The [[attachment:anotacja-pdc.pdf|annotation instructions]] (in Polish) were created by Maciej Ogrodniczuk.

=== Download ===
The corpus is available for download in the form of a [[attachment:pdc.zip|zip file]] in the [[https://clarin.biz/tools/inforex|Inforex]] format.

=== Funding ===
Version 1.0 of the corpus was financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19, the Polish Ministry of Education and Science grant 2022/WK/09, continued as part of the investment: CLARIN ERIC – European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (period: 2024–2026) funded by the Polish Ministry of Science and Higher Education (Programme: “Support for the participation of Polish scientific teams in international research infrastructure projects”), agreement number 2024/WK/01, and by CLARIN-PL, the European Regional Development Fund, FENG programme, agreement number FENG.02.04-IP.040004/24.

== DISRPT 2025 Version ==
The PDC dataset was also converted to the format of the [[https://sites.google.com/view/disrpt2025/|DISRPT 2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification]].

This version of the dataset contains annotation for discontinuous discourse units and connectives. Explicit relation connectives are not included in the argument spans for the .rels data.

POS tags, morphology, and syntactic parses were added using Stanza's `default_accurate` model for Polish (`pl`) while preserving tokenization and sentence splits from the Polish Coreference Corpus.

{{{#!highlight python
import stanza

nlp = stanza.Pipeline(
    'pl',
    pretokenized=True,
    tokenize_pretokenized=True,
    package='default_accurate',
)
}}}

=== Download ===
The corpus is available for download in the form of CoNLL-U annotations from the [[https://github.com/disrpt/sharedtask2025/tree/master/data/pol.iso.pdc|DISRPT 2025 GitHub]] repository.

=== Funding ===
This research was funded in whole by the National Science Centre, Poland, grant 2023/50/A/HS2/00559 (''Universal Discourse: a multilingual model of discourse relations'').

== Licence ==
[[https://creativecommons.org/licenses/by-nc/4.0/|Creative Commons Attribution-NonCommercial 4.0 International License]] {{http://i.creativecommons.org/l/by-nc/4.0/88x31.png}}

== Please cite ==
<<BibMate(key, "ogr:etal:24", "tom:etal:24:iso", "zur:etal:23:ldk", "hel:ogr:19:lc", omitYears=true)>> |
Polish Discourse Corpus / Polski Korpus Metatekstowy
The corpus of discourse relations is based on the Polish Coreference Corpus. The annotation of the corpus was completed using the Discann annotation tool.
Version 0.1
Documentation
The annotation instructions (in Polish) were created by Celina Heliasz.
Download
The corpus is available for download in the form of a zip file containing:
- 1773 source XML TEI files of the Polish Coreference Corpus
- metatext.xml file containing descriptions of all relations
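A quick sanity check of the downloaded archive could look like the sketch below. It is a minimal example, not part of the corpus distribution: it assumes that the TEI sources carry an `.xml` extension and that the relation file is stored under the base name `metatext.xml`. The function name `summarize_archive` is hypothetical, and nothing is assumed about the internal structure of `metatext.xml`.

```python
import zipfile
from pathlib import PurePosixPath


def summarize_archive(zip_source):
    """Count XML members and report whether metatext.xml is present.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Skip directory entries, which end with a slash.
        names = [n for n in archive.namelist() if not n.endswith("/")]

    base_names = [PurePosixPath(n).name for n in names]
    has_metatext = "metatext.xml" in base_names

    # Assumption: every other .xml member is a TEI source file.
    tei_count = sum(
        1
        for name in base_names
        if name.lower().endswith(".xml") and name != "metatext.xml"
    )
    return {"tei_files": tei_count, "has_metatext": has_metatext}
```

Calling `summarize_archive` on the path of the downloaded zip file returns the two values, which can then be compared with the file count listed above.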
Funding
Version 0.1 of the corpus was financed by the Polish Ministry of Education and Science under the agreement DIR/WK/2016/02.
Version 1.0
Documentation
The annotation instructions (in Polish) were created by Maciej Ogrodniczuk.
Download
The corpus is available for download in the form of a zip file in the Inforex format.
Funding
Version 1.0 of the corpus was financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19, the Polish Ministry of Education and Science grant 2022/WK/09, continued as part of the investment: CLARIN ERIC – European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (period: 2024–2026) funded by the Polish Ministry of Science and Higher Education (Programme: “Support for the participation of Polish scientific teams in international research infrastructure projects”), agreement number 2024/WK/01, and by CLARIN-PL, the European Regional Development Fund, FENG programme, agreement number FENG.02.04-IP.040004/24.
DISRPT 2025 Version
The PDC dataset was also converted to the format of the DISRPT 2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification.
This version of the dataset contains annotation for discontinuous discourse units and connectives. Explicit relation connectives are not included in the argument spans for the .rels data.
POS tags, morphology, and syntactic parses were added using Stanza's default_accurate model for Polish (pl) while preserving tokenization and sentence splits from the Polish Coreference Corpus.
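The setup described above can be sketched as follows. This is a minimal illustration rather than the exact script used for the conversion: the helper names `build_pipeline_kwargs` and `annotate` are hypothetical, and the sketch assumes that Stanza and its Polish `default_accurate` models are installed. The `tokenize_pretokenized=True` option tells Stanza to accept input that is already split into sentences and tokens, which is how existing tokenization can be preserved.

```python
def build_pipeline_kwargs():
    """Keyword arguments for a Polish Stanza pipeline that keeps input tokenization."""
    return {
        "lang": "pl",
        "package": "default_accurate",
        # Treat the input as already tokenized and sentence-split.
        "tokenize_pretokenized": True,
    }


def annotate(sentences):
    """Tag and parse pre-tokenized sentences given as a list of token lists."""
    # Imported lazily so that build_pipeline_kwargs works without Stanza installed.
    import stanza

    nlp = stanza.Pipeline(**build_pipeline_kwargs())
    doc = nlp(sentences)
    return [
        [(w.text, w.upos, w.feats, w.head, w.deprel) for w in sentence.words]
        for sentence in doc.sentences
    ]
```

For example, `annotate([["Ala", "ma", "kota", "."]])` would return one list of tuples for the single input sentence, with the four tokens left exactly as supplied.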
Download
The corpus is available for download in the form of CoNLL-U annotations from the DISRPT 2025 GitHub repository.
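CoNLL-U files can be read with a few lines of standard-library code. The sketch below relies only on the general CoNLL-U conventions (ten tab-separated columns per token line, comment lines starting with `#`, blank lines between sentences); the function name `read_conllu` is hypothetical, and no DISRPT-specific columns or comment keys are assumed.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U formatted lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence
```

An open file object can be passed directly, for example `with open(path, encoding="utf-8") as handle: sentences = list(read_conllu(handle))`, where `path` points to one of the downloaded files.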
Funding
This research was funded in whole by the National Science Centre, Poland, grant 2023/50/A/HS2/00559 (Universal Discourse: a multilingual model of discourse relations).
Licence
Creative Commons Attribution-NonCommercial 4.0 International License
Please cite


