CorpCor 1.0
A web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NCP).

Authors: Łukasz Kobyliński, Nestor Pawłowski.
License: GPL v.3.

ABOUT
CorpCor is a tool integrating Poliqarp (http://poliqarp.sourceforge.net/), a library which allows for querying large corpora, with a web-based interface to correct morphosyntactic annotation in a TEI XML encoded text corpus.

Specifically, it has been used to correct annotation mistakes in the morphosyntactic annotation layer of the National Corpus of Polish.

HOW IT WORKS
CorpCor depends on two sources of information:
 - text corpus in its source version (XML format),
 - text corpus in the binary format, created by Poliqarp.
 
The binary version of the corpus must contain the text identifiers used in the source version. Consequently, the .bp.conf file used to generate the binary version needs the following mappings:
[meta]
name = idno
path = /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']

[meta]
name = textId
path = /tei:teiHeader/@xml:id
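
To illustrate what the two mappings select, here is a minimal Python sketch that reads the same two values from a TEI header with the standard library. It is not part of CorpCor or Poliqarp; the sample header, its identifier values, and the function name read_text_identifiers are hypothetical and serve only as an example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, heavily shortened header; the identifier values are made up.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0" xml:id="sample-text-id">
  <fileDesc>
    <sourceDesc>
      <bibl>
        <idno type="nkjp">sample-idno</idno>
      </bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""


def read_text_identifiers(header_xml):
    """Return the values addressed by the 'idno' and 'textId' mappings."""
    root = ET.fromstring(header_xml)  # the root element is tei:teiHeader
    ns = {"tei": TEI_NS}
    # Same path as the 'idno' mapping, relative to the root element.
    idno = root.find(
        "tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']", ns
    )
    return {
        "idno": idno.text if idno is not None else None,
        # xml:id lives in the reserved XML namespace.
        "textId": root.get("{%s}id" % XML_NS),
    }


if __name__ == "__main__":
    print(read_text_identifiers(SAMPLE_HEADER))
```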

***

A query entered by a linguist is first executed as a Poliqarp query by an instance of poliqarpd running on the server. A pool of poliqarpd daemons is maintained so that simultaneous queries can be answered in parallel.
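
The pooling idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not CorpCor's actual Java implementation: the class name DaemonPool and the assumption that daemons listen on consecutive ports counted up from a starting port are both hypothetical.

```python
import queue
from contextlib import contextmanager


class DaemonPool:
    """Hands out (host, port) endpoints so that each one serves one query at a time."""

    def __init__(self, hostname, start_port, size):
        self._free = queue.Queue()
        # Assumption: daemons listen on consecutive ports from start_port.
        for offset in range(size):
            self._free.put((hostname, start_port + offset))

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks while every daemon is busy; raises queue.Empty on timeout.
        endpoint = self._free.get(timeout=timeout)
        try:
            yield endpoint
        finally:
            # Always return the endpoint, even if the query failed.
            self._free.put(endpoint)


if __name__ == "__main__":
    pool = DaemonPool("127.0.0.1", 45678, 2)
    with pool.borrow() as (host, port):
        print("would send the query to", host, port)
```

A blocking queue keeps the sketch short: a request arriving while all daemons are busy simply waits until one is returned.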

The result of a poliqarpd query is analyzed to find text identifiers of returned contexts. The identifiers are used to match XML sources of the corpus with returned contexts and to find paragraph and segment identifiers of segments in the returned contexts.

Each modification to the corpus annotation is saved in the database. Such an entry consists of the identifier of the text in which the change was made, the paragraph ID, the segment ID, the linguist ID, the change in annotation itself, the query used to find the edited context, and additional comments.
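
As a rough illustration of such an entry, the record could be modelled as below. The class name, the field names, and the example values are hypothetical; the actual database schema is generated by the application and may differ.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Correction:
    """One saved correction; field names are illustrative, not the real schema."""
    text_id: str        # identifier of the text in which the change was made
    paragraph_id: str
    segment_id: str
    linguist_id: str
    change: str         # the change in annotation itself
    query: str          # the query used to find the edited context
    comment: str = ""   # additional comments, if any


if __name__ == "__main__":
    # All values below are made up for the example.
    entry = Correction(
        text_id="sample-text-id",
        paragraph_id="p-1",
        segment_id="seg-1",
        linguist_id="linguist-1",
        change="old-tag -> new-tag",
        query="[orth=example]",
    )
    print(asdict(entry))
```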

INSTALLATION AND CONFIGURATION
The application has been tested on Windows and Linux platforms. You may need to provide poliqarpd binaries for your specific Linux installation (a Windows binary is provided in the package).

You need a Java web application server to run CorpCor. It has been tested on:
apache-tomcat-6.0.35

A Java DataSource must be configured in order to be able to create user (linguist) accounts and to save corrections. A MySQL datasource is configured in the provided package (with login/password 'corpcor' for a MySQL instance running on localhost and default port), but may be changed to suit your needs in:
WEB-INF/classes/META-INF/persistence.xml

The necessary tables are created on the first run of the application. You may then import the provided import.sql file to create the 'gwt' and 'admin' users (with passwords 'gwt' and 'admin', respectively).

Other configuration files in WEB-INF/classes include:
---
corpus.properties, which contains source corpus location, specified in a config.xml file such as:
---
<apiconf>
        <corpus type="TEI" id="NKJP1M">
                <text relativePath="false" path="/home/lkobylin/corpcor/nkjp1M-1.1-source" />
        </corpus>
        <senseInventory type="TEI" path="/home/lkobylin/corpcor/nkjp1M-1.1-source/NKJP_WSI.xml" />
</apiconf>
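
To check a config.xml file before deployment, its contents can be read with a short Python sketch like the one below. The function name read_corpus_config and the returned dictionary keys are hypothetical, and treating relativePath as the literal string "true" or "false" is an assumption based on the example above.

```python
import xml.etree.ElementTree as ET

# The example file from above, embedded so that the sketch is self-contained.
CONFIG_XML = """<apiconf>
    <corpus type="TEI" id="NKJP1M">
        <text relativePath="false" path="/home/lkobylin/corpcor/nkjp1M-1.1-source" />
    </corpus>
    <senseInventory type="TEI" path="/home/lkobylin/corpcor/nkjp1M-1.1-source/NKJP_WSI.xml" />
</apiconf>"""


def read_corpus_config(xml_text):
    """Collect the corpus settings from a config.xml document."""
    root = ET.fromstring(xml_text)
    corpus = root.find("corpus")
    text = corpus.find("text")
    inventory = root.find("senseInventory")
    return {
        "corpus_id": corpus.get("id"),
        "corpus_type": corpus.get("type"),
        "source_path": text.get("path"),
        # Assumption: the attribute holds the literal "true" or "false".
        "relative_path": text.get("relativePath") == "true",
        "sense_inventory": inventory.get("path") if inventory is not None else None,
    }


if __name__ == "__main__":
    print(read_corpus_config(CONFIG_XML))
```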

---
pqc.properties, which contains the location of the binary version of the corpus:
---
corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5

---
pqd.properties, which contains the configuration of poliqarpd daemon pool run on the server:
---
hostname=127.0.0.1
# starting port of poliqarpd daemons
port=45678
logging=on
#log-file=poliqarpd.log
match-buffer-size=10000
max-match-length=1000
max-session-idle=86400
corpus=any
pqd.max-connections=10

---
pqlm.properties, which contains the daemon monitor configuration:
---
# log directory
log-root-dir=/tmp/pqdlogs
# where the poliqarpd binary will be copied
deployment-root-dir=/tmp/pqdinstalls
# should the deployed binaries be deleted on application exit
delete-deployment-on-exit=true
# longest time we wait for the daemon to respond (after which it is killed and recreated)
watchdog-delay-milis=30000
# pool size
min-daemons=1
max-daemons=10
# binary platform selection
daemon-platform=linux
daemon-version=1.3.13
# alternative:
# daemon-platform=win32
# daemon-version=1.3.12
#
# session timeout
session-timeout-secs=900
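
The settings above can be sanity-checked before deployment with a small script. The following Python sketch parses simple key=value lines and verifies the pool size; the helper names are hypothetical, the parser ignores the escapes and line continuations that full Java .properties files allow, and the rule 1 <= min-daemons <= max-daemons is an assumption about what a sensible configuration looks like, not documented CorpCor behaviour.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_pool_settings(props):
    """Return (min, max) pool size, or raise ValueError if they look inconsistent."""
    low = int(props["min-daemons"])
    high = int(props["max-daemons"])
    # Assumed rule: at least one daemon, and min must not exceed max.
    if not 1 <= low <= high:
        raise ValueError("expected 1 <= min-daemons <= max-daemons")
    return low, high


if __name__ == "__main__":
    sample = """
    # pool size
    min-daemons=1
    max-daemons=10
    watchdog-delay-milis=30000
    """
    print(check_pool_settings(parse_properties(sample)))
```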

A Maven script is included in the package. To create a deployment .war file, run:
mvn package