Diff for "PolishCoreferenceCorpus"

Differences between revisions 5 and 6
Revision 5 as of 2013-01-08 09:33:34
Size: 2380
Editor: MateuszKopec
Comment:
Revision 6 as of 2013-01-08 12:06:40
Size: 2466
Editor: MateuszKopec
Comment:
Polish Coreference Corpus

This page describes the Polish Coreference Corpus, which was created as part of the CORE project.

Approximate distribution of corpus texts by type:

||Text type||# of texts||# of segments||Percent||
||Dailies||459||127500||25.5%||
||Magazines||406||117500||23.5%||
||Fiction literature (prose, poetry, drama)||288||80000||16%||
||Non-fiction literature||96||27500||5.5%||
||Instructive writing and textbooks||100||27500||5.5%||
||Spoken – conversational||83||25000||5%||
||Internet – interactive (blogs, forums, usenet)||63||17500||3.5%||
||Internet – non-interactive (static pages, Wikipedia)||63||17500||3.5%||
||Miscellaneous written (legal, advertisements, user manuals, letters)||55||15000||3%||
||Spoken from the media||44||12500||2.5%||
||Quasi-spoken (parliamentary transcripts)||43||12500||2.5%||
||Academic writing and textbooks||35||10000||2%||
||Unclassified written||19||5000||1%||
||Journalistic books||19||5000||1%||
||Total||1773||500000||100%||
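
The Percent column appears to be each text type's share of the 500000 segments, not of the 1773 texts. A minimal Python sketch that recomputes those shares from the segment counts listed above is given here as a sanity check. The names `SEGMENTS` and `shares` are illustrative only and are not part of any corpus tooling.

```python
# Segment counts per text type, copied from the distribution table.
SEGMENTS = {
    "Dailies": 127500,
    "Magazines": 117500,
    "Fiction literature": 80000,
    "Non-fiction literature": 27500,
    "Instructive writing and textbooks": 27500,
    "Spoken - conversational": 25000,
    "Internet - interactive": 17500,
    "Internet - non-interactive": 17500,
    "Miscellaneous written": 15000,
    "Spoken from the media": 12500,
    "Quasi-spoken": 12500,
    "Academic writing and textbooks": 10000,
    "Unclassified written": 5000,
    "Journalistic books": 5000,
}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print("total segments:", sum(SEGMENTS.values()))
    for name, pct in shares(SEGMENTS).items():
        # :g drops trailing zeros, so 16.0 is shown as 16
        print(f"{name}: {pct:g}%")
```

Since the table is marked as approximate and still to be updated, the counts in the sketch would need to be revised whenever the table changes.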

To be updated.

You may also want to see the Polish Coreference Tools site.