Spejd 0.8.3

Copyright (C) IPI PAN, 2007-2010. All rights reserved.
Available under the terms of the GNU General Public License;
see the file doc/gpl.txt for details.

ABOUT

Spejd is a shallow parser that performs syntactic parsing and
morphological disambiguation simultaneously. It was developed at the
Institute of Computer Science, Polish Academy of Sciences, Warsaw.

Spejd homepage:
http://nlp.ipipan.waw.pl/Spejd/

Recent releases:
0.8.4: bugfix release
0.8.3: bugfix release
0.8.2: bugfix release
0.8.1:

Compared to the previous release, major changes in this version include:
- Integrated plain text processing module based on the morphological
  analyzer Morfologik (http://morfologik.blogspot.com/). This module requires
  appropriately encoded input, as defined by the inputEncoding config parameter.
  The plain text module is enabled by the inputType parameter (auto or txt).
- Parallel processing (benefits are immediate on multicore CPUs).
  The number of processing threads is defined by the maxThreads parameter.
- A simple spelling correction module that restores missing Polish
  diacritics. Possible transformations are listed in ogonkifier.ini.
- Changes listed in doc/changes0_5.txt.

REQUIREMENTS

Sun Java Runtime Environment version 1.5 or higher.

Note: it may be possible to run the program on an alternative Java
implementation, but because of differences in regular expression
implementations, we cannot guarantee its behaviour.
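
A minimal sketch of a version check that can be run before starting
the parser. The helper name java_version_ok is hypothetical and not
part of the Spejd distribution. It only compares version numbers and
does not verify which vendor's Java implementation is installed.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Spejd.
# Succeeds when a Java version string denotes 1.5 or newer.
# Handles the legacy "1.x" scheme (e.g. 1.5.0_22) and the newer scheme
# in which the first number is the major version (e.g. 11.0.2).
java_version_ok() {
    major=${1%%.*}
    major=${major%%[!0-9]*}    # keep leading digits only
    rest=${1#*.}
    minor=${rest%%[!0-9]*}     # digits of the second component
    if [ "${major:-0}" -gt 1 ]; then
        return 0
    fi
    [ "${major:-0}" -eq 1 ] && [ "${minor:-0}" -ge 5 ]
}

# Ask the installed JVM, if any, for its version string.
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if [ -n "$installed" ] && java_version_ok "$installed"; then
        echo "Java $installed meets the version requirement"
    else
        echo "Java version could not be confirmed as 1.5 or higher"
    fi
fi
```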

INSTALLATION

Unzip the file spade.zip.  Installation finished!

SYNOPSIS

java -jar spejd.jar path [options]

where:

- path - a single file or a folder with XML CES (see doc/xcesIPIAna.dtd) 
    or plain text files (.txt, encoding defined by inputEncoding parameter)
    to parse; the parser looks for files matching a pattern defined in 
    config.ini (inputFiles parameter) and recursively checks subdirectories.

- options - optional list of assignments var=value; var has to be one
    of the variables from config.ini; values passed as invocation
    arguments override the default values from the file.

Examples:

java -jar spejd.jar corpus nullAgreement=1
java -jar spejd.jar corpus rules=rules2.sr logDir=log2
java -jar spejd.jar corpus discardDeleted=true outputSuffix=.sh2.xml
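
The plain text and threading parameters (inputType, inputEncoding,
maxThreads) can be passed in the same way. The sketch below only
prints such a command line so that it can be reviewed first. The helper
name spejd_txt_command, the UTF-8 encoding value and the thread count
are illustrative assumptions, not documented defaults.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Spejd.
# Prints a Spejd command line for a directory of plain text files.
# Parameter names are taken from this README; the values are assumptions.
spejd_txt_command() {
    corpus_dir=$1
    threads=${2:-4}    # assumed thread count when none is given
    printf 'java -jar spejd.jar %s inputType=txt inputEncoding=UTF-8 maxThreads=%s\n' \
        "$corpus_dir" "$threads"
}

# Print the command for review instead of running it.
spejd_txt_command corpus 4
```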

RESULTS

For XML input, a new file filenameSh.xml is created in each directory
in which filename.xml(.gz) has been found.  It is a copy of the
corresponding .xml file with additional annotation: token
identifiers, disambiguation attributes, syntactic words and groups.
For plain text input, a new XML file (with a name ending in Sh.xml)
is created for each .txt file.
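
The naming scheme described above can be sketched as a small shell
function. The helper name spejd_output_name is hypothetical, and the
sketch assumes that outputSuffix has not been overridden on the
command line.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Spejd.
# Predicts the output file name for a given input file:
# filename.xml, filename.xml.gz and filename.txt all map to filenameSh.xml.
spejd_output_name() {
    input=$1
    base=${input%.gz}                  # drop an optional .gz suffix
    case $base in
        *.xml) base=${base%.xml} ;;    # XML input
        *.txt) base=${base%.txt} ;;    # plain text input
    esac
    printf '%sSh.xml\n' "$base"
}

spejd_output_name doc/morph.xml
spejd_output_name corpus/notes.txt
```

For doc/morph.xml this yields doc/morphSh.xml, which matches the example
output file listed in the EXAMPLE section.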

A few additional files are generated in the logs subdirectory of the
spade directory:

rules.compiled - a compiled set of rules

rules.matched.csv - rule statistics: for each rule, gives the number
    of completed (evaluated to true) matches, the number of matches,
    matching time, evaluation time, and total time

tagdict.ini - tag dictionary, translating the tagset defined in
    the configuration file to the internal positional tagset

DOCUMENTATION

doc/spade.pdf      - a paper about Spejd
doc/xcesAnaIPI.dtd - DTD of the input format
api/               - technical documentation

EXAMPLE

./sample-morfeusz.cfg      - example Morfeusz tagset file
./sample-morfologik.cfg    - example Morfologik tagset file (for plain text input)
./rules.sr                 - example set of rules
doc/morph.xml              - example XML input to the parser
doc/morphSh.xml            - example output 
doc/display.*              - stylesheets and example output

WHAT'S NEW IN THIS VERSION



FOR DEVELOPERS

Please feel free to play around with the sources, modify them and post
patches on Spejd's bug tracker at SourceForge (linked from the homepage)!
See api/ for a brief introduction to the code structure.

