<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD DocBook XML V4.4//EN'  'http://www.docbook.org/xml/4.4/docbookx.dtd'><article><articleinfo><title>PolishDiscourseCorpus</title><revhistory><revision><revnumber>30</revnumber><date>2026-04-29 14:29:34</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>29</revnumber><date>2026-04-29 14:12:26</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>28</revnumber><date>2026-04-29 14:12:08</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>27</revnumber><date>2026-04-22 15:30:31</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>26</revnumber><date>2026-04-22 15:30:07</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>25</revnumber><date>2026-04-22 15:28:54</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>24</revnumber><date>2026-04-22 15:28:03</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>23</revnumber><date>2026-04-22 15:27:01</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>22</revnumber><date>2026-02-28 01:31:30</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>21</revnumber><date>2026-02-11 20:57:43</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>20</revnumber><date>2026-02-11 20:55:43</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>19</revnumber><date>2026-02-11 20:51:29</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>18</revnumber><date>2024-10-15 10:23:13</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>17</revnumber><date>2024-06-10 
13:38:51</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>16</revnumber><date>2024-06-10 13:38:40</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>15</revnumber><date>2024-03-20 14:02:59</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>14</revnumber><date>2024-03-20 14:01:03</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>13</revnumber><date>2024-03-20 14:00:44</date><authorinitials>MaciejOgrodniczuk</authorinitials><revremark>Renamed from 'PolskiKorpusMetatekstowy'.</revremark></revision><revision><revnumber>12</revnumber><date>2023-10-18 23:32:40</date><authorinitials>MaciejOgrodniczuk</authorinitials><revremark>Renamed from 'PolishDiscourseCorpus'.</revremark></revision><revision><revnumber>11</revnumber><date>2023-10-18 23:31:28</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>10</revnumber><date>2023-06-16 05:51:32</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>9</revnumber><date>2023-06-16 05:51:14</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>8</revnumber><date>2022-02-01 10:47:55</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>7</revnumber><date>2022-02-01 10:47:42</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>6</revnumber><date>2020-12-30 15:22:51</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>5</revnumber><date>2020-12-30 15:21:36</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>4</revnumber><date>2020-12-30 15:19:35</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>3</revnumber><date>2020-12-18 
16:34:55</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>2</revnumber><date>2020-12-18 16:34:33</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>1</revnumber><date>2020-12-18 16:27:31</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision></revhistory></articleinfo><section><title>Polish Discourse Corpus / Polski Korpus Metatekstowy</title><para>The corpus of discourse relations is based on the <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PCC#">Polish Coreference Corpus</ulink>. </para><section><title>Version 0.1</title><section><title>Documentation</title><para>The <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PolishDiscourseCorpus?action=AttachFile&amp;do=get&amp;target=instrukcja-anotacji-metatekstu.pdf">annotation instructions</ulink> (in Polish) were created by Celina Heliasz. </para></section><section><title>Download</title><para>The corpus is available for download as a <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PolishDiscourseCorpus?action=AttachFile&amp;do=get&amp;target=corpus.tar.gz">tar.gz archive</ulink> in the format of the <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/Discann#">Discann annotation tool</ulink>, containing: </para><itemizedlist><listitem><para>1773 source TEI XML files of the Polish Coreference Corpus </para></listitem><listitem><para>a metatext.xml file containing descriptions of all relations </para></listitem></itemizedlist></section><section><title>Funding</title><para>Version 0.1 of the corpus was financed by the Polish Ministry of Education and Science under the agreement DIR/WK/2016/02. 
</para></section></section><section><title>Version 1.0</title><section><title>Documentation</title><para>The <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PolishDiscourseCorpus?action=AttachFile&amp;do=get&amp;target=anotacja-pdc.pdf">annotation guidelines</ulink> (in Polish) were created by Maciej Ogrodniczuk. </para></section><section><title>Download</title><para>The corpus is available for download as a <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PolishDiscourseCorpus?action=AttachFile&amp;do=get&amp;target=pdc.zip">zip file</ulink> in the <ulink url="https://clarin.biz/tools/inforex">Inforex</ulink> format. </para></section><section><title>Funding</title><para>Version 1.0 of the corpus was financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19; by the Polish Ministry of Education and Science grant 2022/WK/09, continued as part of the investment CLARIN ERIC – European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (period: 2024–2026), funded by the Polish Ministry of Science and Higher Education (Programme: “Support for the participation of Polish scientific teams in international research infrastructure projects”), agreement number 2024/WK/01; and by CLARIN-PL, the European Regional Development Fund, FENG programme, agreement number FENG.02.04-IP.040004/24. </para></section></section><section><title>DISRPT 2025 Version</title><para>The PDC dataset was also converted to the format of the <ulink url="https://sites.google.com/view/disrpt2025/">DISRPT2025 Shared Task on Discourse Unit Segmentation, Connective Detection, and Discourse Relation Classification</ulink>. </para><para>This version of the dataset contains annotations for discontinuous discourse units and connectives. 
Explicit relation connectives are not included in the argument spans for the .rels data. </para><para>POS tags, morphology, and syntactic parses were added using Stanza's <code>default_accurate</code> model for Polish (<code>pl</code>) while preserving tokenization and sentence splits from the Polish Coreference Corpus. </para><programlisting format="linespecific" language="highlight" linenumbering="numbered" startinglinenumber="1"><![CDATA[import ]]><methodname><![CDATA[stanza]]></methodname>

<![CDATA[# Input is already tokenized and sentence-split, so Stanza's own tokenizer is bypassed.]]>
<methodname><![CDATA[nlp]]></methodname><![CDATA[ = ]]><methodname><![CDATA[stanza]]></methodname><![CDATA[.]]><methodname><![CDATA[Pipeline]]></methodname><![CDATA[(]]>
<![CDATA[    ]]><phrase><![CDATA[']]></phrase><phrase><![CDATA[pl]]></phrase><phrase><![CDATA[']]></phrase><![CDATA[,]]>
<![CDATA[    ]]><methodname><![CDATA[pretokenized]]></methodname><![CDATA[=]]><token><![CDATA[True]]></token><![CDATA[,]]>
<![CDATA[    ]]><methodname><![CDATA[tokenize_pretokenized]]></methodname><![CDATA[=]]><token><![CDATA[True]]></token><![CDATA[,]]>
<![CDATA[    ]]><methodname><![CDATA[package]]></methodname><![CDATA[=]]><phrase><![CDATA[']]></phrase><phrase><![CDATA[default_accurate]]></phrase><phrase><![CDATA[']]></phrase><![CDATA[,]]>
<![CDATA[)]]>
</programlisting><section><title>Download</title><para>The corpus is available for download as CoNLL-U annotations from the <ulink url="https://github.com/disrpt/sharedtask2025/tree/master/data/pol.iso.pdc">DISRPT 2025 GitHub repository</ulink>. </para></section><section><title>Funding</title><para>This research was funded in whole by the National Science Centre, Poland, grant 2023/50/A/HS2/00559 (<emphasis>Universal Discourse: a multilingual model of discourse relations</emphasis>). </para></section></section><section><title>Licence</title><para><ulink url="https://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</ulink> </para><para><inlinemediaobject><imageobject><imagedata fileref="http://i.creativecommons.org/l/by-nc/4.0/88x31.png"/></imageobject><textobject><phrase>http://i.creativecommons.org/l/by-nc/4.0/88x31.png</phrase></textobject></inlinemediaobject> </para></section><section><title>Universal Discourse Version</title><para>Reannotation of the PDC data according to the <ulink url="http://udisc.org">Universal Discourse</ulink> annotation guidelines is ongoing. An earlier, already outdated version of the <ulink url="http://zil.ipipan.waw.pl/PolishDiscourseCorpus/PolishDiscourseCorpus?action=AttachFile&amp;do=get&amp;target=anotacja-iso.pdf">annotation guidelines</ulink> is available. </para></section><section><title>Please cite</title></section></section></article>