Diff for "Segment"

Differences between revisions 1 and 20 (spanning 19 versions)
Revision 1 as of 2012-05-17 19:51:40
Size: 806
Editor: MichalLenart
Comment:
Revision 20 as of 2012-05-18 17:56:13
Size: 1524
Editor: MichalLenart
Comment:
Deletions are prefixed with "-". Additions are prefixed with "+".
Line 11: Line 11:
- See docs directory in the Segment package file.
+ For detailed info see the docs directory in the Segment package file.
Line 13: Line 13:
- == Using as sentence splitter ==
+ === Using as sentence splitter for Polish ===
Line 15: Line 15:
- In segment main directory type:
+ In segment main directory:
Line 17: Line 17:
- ./bin/segment -l <language_code> -s <SRX_file>
+ {{{./bin/segment -l pl -s segment.srx}}}
Line 19: Line 19:
- For example:
- {{{./bin/segment -l pl -s sample.srx}}}
+ This way the program will read text from stdin and write sentences to stdout - each one in separate line.
Line 22: Line 21:
- This way the program will read data from stdin and write sentences to stdout - each one in separate line.
+ Note: the SRX file can deal with text that contains paragraphs separated with two line breaks, and where a single line break is still inside the paragraph (which is also the default TeX mode). To have such behavior, use:
Line 24: Line 23:
- Recommended SRX rules file for sentence splitting in Polish written by Marcin Miłkowski: [[attachment:sample.srx]].
+ {{{./bin/segment -l pl_two -s segment.srx}}}
+
+ Adding an end-of-sentence marker on a single line break is achieved this way:
+
+ {{{./bin/segment -l pl_one -s segment.srx}}}
+
+
+ ==== Recommended resources: ====
+  * SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx.
+  * More information about SRX format: http://morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html
+  * Some notes on sentence splitting performance (in Polish): http://morfologik.blogspot.com/2010/03/testy-segmentacji.html

Segment

The Segment program splits text into segments (sentences, paragraphs, words). Splitting rules are read from a file in the XML-based Segmentation Rules eXchange (SRX) standard format. Segment can also be used as a programming library.
Homepage: http://sourceforge.net/projects/segment/

Documentation

For detailed info see the docs directory in the Segment package file.

Using as sentence splitter for Polish

In the Segment main directory, run:

./bin/segment -l pl -s segment.srx

The program then reads text from stdin and writes sentences to stdout, one per line.
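Since input arrives on stdin and output leaves on stdout, plain shell redirection is enough to process a file. The following is only a minimal sketch: the helper name `split_file` and the variables `SEGMENT_BIN` and `SRX_FILE` are hypothetical conveniences introduced here, not part of Segment.

```shell
# Hypothetical helper, not part of the Segment package.
# SEGMENT_BIN and SRX_FILE are illustrative variables; their defaults
# match the command line given above.
SEGMENT_BIN="${SEGMENT_BIN:-./bin/segment}"
SRX_FILE="${SRX_FILE:-segment.srx}"

# split_file INPUT OUTPUT
# Feeds INPUT to the program on stdin and stores its stdout in OUTPUT.
split_file() {
    "$SEGMENT_BIN" -l pl -s "$SRX_FILE" < "$1" > "$2"
}
```

With the function loaded in the current shell, `split_file input.txt sentences.txt` would write the sentences found in `input.txt` to `sentences.txt` (both file names are placeholders).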

Note: the SRX file can also handle text in which paragraphs are separated by two line breaks and a single line break stays inside the paragraph (this is also the default TeX mode). To get this behavior, use:

./bin/segment -l pl_two -s segment.srx

To add an end-of-sentence marker at every single line break, use:

./bin/segment -l pl_one -s segment.srx
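The three invocations differ only in the value passed to `-l`, so a small wrapper can make the choice explicit. This is a sketch under assumptions: the mode names `default`, `paragraphs` and `lines`, the function names, and the variables `SEGMENT_BIN` and `SRX_FILE` are all made up for illustration; only the codes `pl`, `pl_two` and `pl_one` come from this page.

```shell
# Hypothetical wrapper, not part of the Segment package.
# Maps a readable mode name (invented for this sketch) to one of the
# language codes mentioned on this page.
segment_lang_code() {
    case "$1" in
        default)    echo pl ;;
        paragraphs) echo pl_two ;;   # two line breaks separate paragraphs
        lines)      echo pl_one ;;   # a single line break ends a sentence
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# segment_polish MODE
# Reads text from stdin and writes sentences to stdout using the chosen mode.
segment_polish() {
    code=$(segment_lang_code "$1") || return 1
    "${SEGMENT_BIN:-./bin/segment}" -l "$code" -s "${SRX_FILE:-segment.srx}"
}
```

For example, `segment_polish paragraphs < input.txt` would run the `pl_two` variant on a placeholder file `input.txt`.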