CorpCor 1.0

A web-based tool for correcting morphosyntactic annotation in TEI XML
encoded corpora (e.g. NCP).

Authors: Łukasz Kobyliński, Nestor Pawłowski.
License: GPL v.3.


ABOUT

CorpCor integrates Poliqarp (http://poliqarp.sourceforge.net/), a library
for querying large corpora, with a web-based interface for correcting
morphosyntactic annotation in a TEI XML encoded text corpus. Specifically,
it has been used to correct mistakes in the morphosyntactic annotation
layer of the National Corpus of Polish.


HOW IT WORKS

CorpCor depends on two sources of information:
- the text corpus in its source version (XML format),
- the text corpus in the binary format created by Poliqarp.

The binary version of the corpus must contain the text identifiers used in
the source version. Consequently, you need the following mappings in your
.bp.conf file when generating the binary version:

[meta]
name = idno
path = /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']

[meta]
name = textId
path = /tei:teiHeader/@xml:id

***

A query submitted by a linguist is first executed as a Poliqarp query by an
instance of poliqarpd running on the server. A pool of poliqarpd daemons is
maintained so that simultaneous queries can be answered in parallel.

The result of a poliqarpd query is analyzed to find the text identifiers of
the returned contexts. These identifiers are used to match the XML sources
of the corpus with the returned contexts and to find the paragraph and
segment identifiers of the segments in those contexts.

Each modification to the corpus annotation is saved in the database. Such an
entry consists of the identifier of the text in which the change was made,
the paragraph ID, the segment ID, the linguist ID, the change in annotation
itself, the query used to find the edited context, and additional comments.


INSTALLATION AND CONFIGURATION

The application has been tested on Windows and Linux platforms.
You may need to provide poliqarpd binaries for your specific Linux
installation (a Windows binary is provided in the package).

You need a Java web application server to run CorpCor. It has been tested
on:
apache-tomcat-6.0.35

A Java DataSource must be configured so that user (linguist) accounts can be
created and corrections can be saved. A MySQL datasource is configured in the
provided package (with login/password 'corpcor' for a MySQL instance running
on localhost and the default port), but it may be changed to suit your needs
in:
WEB-INF/classes/META-INF/persistence.xml

The necessary tables are created on the first run of the application. You may
then import the provided import.sql file to create the 'gwt' and 'admin'
users (passwords 'gwt' and 'admin').

Other configuration files in WEB-INF/classes include:

--- corpus.properties, which contains source corpus location, specified in a config.xml file such as: ---

--- pqc.properties, which contains the location of the binary version of the corpus: ---
corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5

--- pqd.properties, which contains the configuration of poliqarpd daemon pool run on the server: ---
hostname=127.0.0.1
# starting port of poliqarpd daemons
port=45678
logging=on
#log-file=poliqarpd.log
match-buffer-size=10000
max-match-length=1000
max-session-idle=86400
corpus=any
pqd.max-connections=10

--- pqlm.properties, which contains the daemon monitor configuration ---
# log directory
log-root-dir=/tmp/pqdlogs
# where the poliqarpd binary will be copied
deployment-root-dir=/tmp/pqdinstalls
# should the deployed binaries be deleted on application exit
delete-deployment-on-exit=true
# longest time we wait for the daemon to respond (after which it is killed and recreated)
watchdog-delay-milis=30000
# pool size
min-daemons=1
max-daemons=10
# binary platform selection
daemon-platform=linux
daemon-version=1.3.13
# alternative:
# daemon-platform=win32
# daemon-version=1.3.12
#
# session timeout
session-timeout-secs=900

A Maven script is included in the package. To create a deployment .war file,
run:
mvn package
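
Before packaging, it may help to sanity-check the pool settings in
pqlm.properties. The following is a minimal Java sketch and is not part of
CorpCor: the class name PoolConfigCheck, its methods, and the two rules it
checks (min-daemons not greater than max-daemons, and a positive
watchdog-delay-milis) are assumptions made for illustration. CorpCor itself
may validate these values differently, or not at all.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; not shipped with CorpCor.
public class PoolConfigCheck {

    // Parses text in the key=value format used by the .properties files.
    public static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Reads an integer setting; records a problem and returns null
    // if the key is missing or its value is not an integer.
    static Integer readInt(Properties props, String key, List<String> problems) {
        String raw = props.getProperty(key);
        if (raw == null) {
            problems.add(key + " is missing");
            return null;
        }
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            problems.add(key + " is not an integer: " + raw);
            return null;
        }
    }

    // Returns the problems found; an empty list means the assumed rules hold.
    public static List<String> validate(Properties props) {
        List<String> problems = new ArrayList<>();
        Integer min = readInt(props, "min-daemons", problems);
        Integer max = readInt(props, "max-daemons", problems);
        Integer watchdog = readInt(props, "watchdog-delay-milis", problems);
        // Assumed rule: the pool minimum should not exceed the maximum.
        if (min != null && max != null && min > max) {
            problems.add("min-daemons (" + min + ") exceeds max-daemons (" + max + ")");
        }
        // Assumed rule: the watchdog delay should be a positive number.
        if (watchdog != null && watchdog <= 0) {
            problems.add("watchdog-delay-milis must be positive");
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Values copied from the pqlm.properties example above.
        String sample = "min-daemons=1\nmax-daemons=10\nwatchdog-delay-milis=30000\n";
        List<String> problems = validate(parse(sample));
        System.out.println(problems.isEmpty()
                ? "pool settings look consistent"
                : problems.toString());
    }
}
```

To check a real file instead of the inline sample, the text passed to
parse() could be read from WEB-INF/classes/pqlm.properties.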