#acl +All:read Default

= Segment =

The Segment program splits text into segments (sentences, paragraphs, words). Splitting rules are read from a file in the XML-based Segmentation Rules eXchange (SRX) standard format. Segment can also be used as a programming library.
<<BR>>
'''Homepage:''' https://github.com/loomchild/segment

== Documentation ==

For detailed information, see the docs directory in the Segment package.

=== Using as sentence splitter for Polish ===

In the segment main directory:

{{{./bin/segment -l pl -s segment.srx}}}

With this command the program reads text from stdin and writes sentences to stdout, one sentence per line.

Note: the SRX file can also handle text in which paragraphs are separated by two line breaks and a single line break stays inside the paragraph (which is also the default TeX mode). For this behavior, use:

{{{./bin/segment -l pl_two -s segment.srx}}}

To have an end-of-sentence marker added at each single line break, use:

{{{./bin/segment -l pl_one -s segment.srx}}}

==== Recommended resources: ====

 * SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
 * More information about the SRX format: http://morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html
 * Some notes on sentence splitting performance (in Polish): http://morfologik.blogspot.com/2010/03/testy-segmentacji.html
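
=== Sketch: how SRX-style rules split text ===

The SRX format mentioned above describes splitting with ordered rules, each made of a "before break" and an "after break" regular expression plus a flag saying whether the position is a break or an exception. The first rule that matches at a position decides it. The following Python sketch illustrates that idea only. It is not Segment's implementation and does not read an SRX file; the two rules are simplified, hypothetical stand-ins for those in segment.srx, and the names `RULES` and `split_segments` are made up for this example.

```python
import re

# Hypothetical, simplified rules in SRX spirit (NOT taken from segment.srx):
# (is_break, before_break_pattern, after_break_pattern), checked in order.
RULES = [
    # Exception rule: no break after a few Polish abbreviations.
    (False, r"\b(?:np|prof|dr)\.\s+", r""),
    # Break rule: sentence-final punctuation, whitespace, then any text.
    (True, r"[.!?]\s+", r"\S"),
]


def split_segments(text, rules=RULES):
    """Split text at positions where the first matching rule is a break rule."""
    decided = {}  # position -> is_break, fixed by the first rule that matches
    for is_break, before, after in rules:
        after_re = re.compile(after)
        for match in re.finditer(before, text):
            pos = match.end()
            # Earlier rules take precedence, so never overwrite a decision.
            if pos not in decided and after_re.match(text, pos):
                decided[pos] = is_break

    segments, start = [], 0
    for pos in sorted(p for p, brk in decided.items() if brk):
        segments.append(text[start:pos].strip())
        start = pos
    if start < len(text):
        segments.append(text[start:].strip())
    return segments


if __name__ == "__main__":
    sample = "Pan prof. Kowalski lubi owoce, np. jabłka. Mieszka w Gdyni."
    for sentence in split_segments(sample):
        print(sentence)
```

Because the exception rule comes first, the positions after "prof." and "np." are marked as non-breaks before the general break rule is tried, so the sample is split into two sentences instead of four. Real SRX rule sets, such as the Polish one listed in the resources above, follow the same ordering principle with many more rules.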