Polish Coreference Corpus
This page describes the Polish Coreference Corpus, which was created as part of the CORE project.
Approximate distribution of corpus texts by type:
Text type | # of texts | # of segments | Percent |
Dailies | 459 | 127500 | 25.5% |
Magazines | 406 | 117500 | 23.5% |
Fiction literature (prose, poetry, drama) | 288 | 80000 | 16% |
Non-fiction literature | 96 | 27500 | 5.5% |
Instructive writing and textbooks | 100 | 27500 | 5.5% |
Spoken – conversational | 83 | 25000 | 5% |
Internet – interactive (blogs, forums, usenet) | 63 | 17500 | 3.5% |
Internet – non-interactive (static pages, Wikipedia) | 63 | 17500 | 3.5% |
Miscellaneous written (legal, advertisements, user manuals, letters) | 55 | 15000 | 3% |
Spoken from the media | 44 | 12500 | 2.5% |
Quasi-spoken (parliamentary transcripts) | 43 | 12500 | 2.5% |
Academic writing and textbooks | 35 | 10000 | 2% |
Unclassified written | 19 | 5000 | 1% |
Journalistic books | 19 | 5000 | 1% |
Total | 1773 | 500000 | 100% |
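The Percent column appears to be each text type's share of the 500000 segments, not of the 1773 texts. A minimal Python sketch for checking this is given here; the names `SEGMENTS` and `percent_shares` are illustrative and not part of any corpus tooling, and the counts are simply copied from the table.

```python
# Segment counts per text type, copied from the distribution table.
# SEGMENTS and percent_shares are hypothetical names used for illustration.
SEGMENTS = {
    "Dailies": 127500,
    "Magazines": 117500,
    "Fiction literature (prose, poetry, drama)": 80000,
    "Non-fiction literature": 27500,
    "Instructive writing and textbooks": 27500,
    "Spoken - conversational": 25000,
    "Internet - interactive (blogs, forums, usenet)": 17500,
    "Internet - non-interactive (static pages, Wikipedia)": 17500,
    "Miscellaneous written (legal, advertisements, user manuals, letters)": 15000,
    "Spoken from the media": 12500,
    "Quasi-spoken (parliamentary transcripts)": 12500,
    "Academic writing and textbooks": 10000,
    "Unclassified written": 5000,
    "Journalistic books": 5000,
}


def percent_shares(segments):
    """Return each type's share of the total segment count, in percent."""
    total = sum(segments.values())
    return {name: 100 * count / total for name, count in segments.items()}


total_segments = sum(SEGMENTS.values())
shares = percent_shares(SEGMENTS)

if __name__ == "__main__":
    print(f"Total segments: {total_segments}")
    for name, share in shares.items():
        print(f"{name}: {share:.1f}%")
```

Since the figures are described as approximate, the shares computed this way should be read as approximate as well.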
To be updated.
You may also want to see the Polish Coreference Tools site.