Revision 14 as of 2012-05-17 20:05:54. Editor: MichalLenart.

Segment

Segment is a program that splits text into segments (sentences, paragraphs, words). The splitting rules are read from a file in the XML-based Segmentation Rules Exchange (SRX) standard format. Segment can also be used as a programming library.
Homepage: http://sourceforge.net/projects/segment/

Documentation

For details, see the docs directory in the Segment package.

Using as sentence splitter for Polish

In the Segment main directory, run:

./bin/segment -l pl -s sample.srx

Run this way, the program reads text from stdin and writes sentences to stdout, one per line.
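Because the program works on stdin and stdout, it can be combined with ordinary shell redirection. The following is only a sketch: the function name split_sentences, the SEGMENT_BIN and SRX_FILE variables, and the file names in the usage comment are assumptions made for illustration, while the -l and -s options are the ones shown above.

```shell
# split_sentences INPUT OUTPUT
# Hypothetical helper: feeds INPUT to Segment on stdin and stores
# the sentences (one per line) in OUTPUT.
# SEGMENT_BIN and SRX_FILE are illustrative variables, not part of Segment.
split_sentences() {
    segment_bin="${SEGMENT_BIN:-./bin/segment}"
    "$segment_bin" -l pl -s "${SRX_FILE:-sample.srx}" < "$1" > "$2"
}

# Example call (file names are placeholders):
#   split_sentences input.txt sentences.txt && wc -l < sentences.txt
```

Since each sentence ends up on its own line, counting lines in the output file with wc -l gives the number of sentences found.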

Recommended resources: