Segment
Segment is a program that splits text into segments (sentences, paragraphs, words). The splitting rules are read from a file in the XML-based Segmentation Rules Exchange (SRX) standard format. It can also be used as a programming library.
Homepage: http://sourceforge.net/projects/segment/
Documentation
See the docs directory in the Segment package file.
Using as a sentence splitter
In the Segment main directory, type:
./bin/segment -l <language_code> -s <SRX_file>
For example:
./bin/segment -l pl -s sample.srx
Run this way, the program reads data from stdin and writes sentences to stdout, each on a separate line.
The recommended SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: sample.srx.
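Because the program reads stdin and writes stdout, it can be wrapped in a small shell function for splitting whole files. The following is only a sketch: the function name split_sentences and the SEGMENT_BIN variable are hypothetical and not part of Segment. Only the -l and -s options shown above are used, and the function is assumed to be called from the Segment main directory unless SEGMENT_BIN points elsewhere.

```shell
# Hypothetical helper (not part of Segment): split a text file into sentences.
# Usage: split_sentences <language_code> <SRX_file> <input_file>
# SEGMENT_BIN is an assumed override for the launcher path; it defaults to
# ./bin/segment, which only resolves from the Segment main directory.
split_sentences() {
    if [ "$#" -ne 3 ]; then
        echo "usage: split_sentences <language_code> <SRX_file> <input_file>" >&2
        return 2
    fi
    # Feed the input file on stdin; sentences arrive on stdout, one per line.
    "${SEGMENT_BIN:-./bin/segment}" -l "$1" -s "$2" < "$3"
}
```

For instance, split_sentences pl sample.srx input.txt > sentences.txt would write one sentence per line to sentences.txt, where input.txt and sentences.txt are placeholder file names.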