Diff for "Segment"

Differences between revisions 14 and 15
Revision 14 as of 2012-05-17 20:05:54
Size: 1030
Editor: MichalLenart
Comment:
Revision 15 as of 2012-05-17 20:22:26
Size: 1433
Comment: add _one and _two

Segment

The Segment program splits text into segments (sentences, paragraphs, words). Splitting rules are read from a file in the XML-based Segmentation Rules Exchange (SRX) standard format. Segment can also be used as a programming library.
Homepage: http://sourceforge.net/projects/segment/

Documentation

For detailed information, see the docs directory in the Segment package.

Using as sentence splitter for Polish

In the Segment main directory, run:

./bin/segment -l pl -s sample.srx

This way the program reads text from stdin and writes sentences to stdout, each on a separate line.

Note: the SRX file can also handle text in which paragraphs are separated by two line breaks and a single line break stays inside the paragraph (the same convention as the default TeX mode). To get this behavior, use:

./bin/segment -l pl_two -s sample.srx

To add an end-of-sentence marker at every single line break, use:

./bin/segment -l pl_one -s sample.srx
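
The three invocations above differ only in the language code passed to -l, so they can be wrapped in one small shell function. This is a minimal sketch, not part of Segment itself: the function name split_sentences, the mode names (default, paragraphs, lines) and the SEGMENT_BIN and SRX_FILE variables are all hypothetical, chosen for illustration. Only the -l and -s options and the pl, pl_two and pl_one codes come from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands shown above.
# SEGMENT_BIN and SRX_FILE are illustrative variables, not Segment settings.
SEGMENT_BIN="${SEGMENT_BIN:-./bin/segment}"
SRX_FILE="${SRX_FILE:-sample.srx}"

# split_sentences MODE
# Reads text from stdin and writes one sentence per line to stdout.
# MODE (hypothetical names) selects the language code given to -l:
#   default    -> pl      (plain sentence splitting)
#   paragraphs -> pl_two  (two line breaks separate paragraphs)
#   lines      -> pl_one  (a single line break ends a sentence)
split_sentences() {
    case "$1" in
        default)    lang=pl ;;
        paragraphs) lang=pl_two ;;
        lines)      lang=pl_one ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    "$SEGMENT_BIN" -l "$lang" -s "$SRX_FILE"
}
```

Assuming the script is sourced from the Segment main directory, a call such as split_sentences paragraphs < input.txt > sentences.txt would run the pl_two variant on input.txt.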

Recommended resources: