Segment
Segment is a program that splits text into segments (sentences, paragraphs, or words). The splitting rules are read from a file in the XML-based Segmentation Rules Exchange (SRX) standard format. Segment can also be used as a programming library.
Homepage: http://sourceforge.net/projects/segment/
Documentation
For detailed information, see the docs directory in the Segment package.
Using as sentence splitter for Polish
In the Segment main directory, run:
./bin/segment -l pl -s sample.srx
Run this way, the program reads text from stdin and writes sentences to stdout, one per line.
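Because input and output go through the standard streams, ordinary shell redirection is enough to process whole files. The following is a minimal sketch, assuming a POSIX shell; the function name split_polish and the SEGMENT_BIN variable are illustrative and not part of Segment. Only the -l and -s options from the command above are used.

```shell
# Hypothetical helper: split the text in file $1 into sentences,
# writing one sentence per line to file $2.
# SEGMENT_BIN is an assumed override; it defaults to the path used above.
split_polish() {
    "${SEGMENT_BIN:-./bin/segment}" -l pl -s sample.srx < "$1" > "$2"
}
```

With this defined, split_polish input.txt sentences.txt (both file names are placeholders) would leave the sentences in sentences.txt, and wc -l sentences.txt would then report how many were found.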
Recommended resources:
SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: sample.srx.
More information about the SRX format: http://morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html