CorpCor 1.0
A web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NCP).

Authors: Łukasz Kobyliński, Nestor Pawłowski.
License: GPL v.3.

ABOUT
CorpCor is a tool that integrates Poliqarp (http://poliqarp.sourceforge.net/), a library for querying large corpora, with a web-based interface for correcting morphosyntactic annotation in a TEI XML encoded text corpus.

Specifically, it has been used to correct mistakes in the morphosyntactic annotation layer of the National Corpus of Polish.

HOW IT WORKS
CorpCor depends on two sources of information:
- the text corpus in its source version (XML format),
- the text corpus in the binary format created by Poliqarp.

The binary version of the corpus must contain the text identifiers used in the source version. Consequently, you need the following mappings in your .bp.conf file when generating the binary version:
[meta]
name = idno
path = /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']

[meta]
name = textId
path = /tei:teiHeader/@xml:id

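Before building the binary corpus, it can help to check that both mappings are actually present in the .bp.conf file. The following is only a minimal sketch in Python: the function names are hypothetical and not part of CorpCor or Poliqarp, and it assumes the file consists of plain [section] headers followed by "key = value" lines, as in the excerpt above.

```python
def parse_meta_sections(conf_text):
    """Collect every [meta] section as a dict of its key = value pairs."""
    sections = []
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Only [meta] sections are collected; any other section is skipped.
            current = {} if line[1:-1].strip() == "meta" else None
            if current is not None:
                sections.append(current)
        elif current is not None and "=" in line:
            # Split on the first "=" only, since XPath values may contain "=".
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


def missing_mappings(conf_text, required=("idno", "textId")):
    """Return the required meta names that the config does not define."""
    defined = {section.get("name") for section in parse_meta_sections(conf_text)}
    return [name for name in required if name not in defined]
```

Calling missing_mappings() on the contents of your .bp.conf returns an empty list when both 'idno' and 'textId' are defined.
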
***

Queries provided by a linguist are first processed by running a Poliqarp query. To run a query, an instance of poliqarpd is run on the server. A pool of poliqarpd daemons is maintained so that simultaneous queries can be answered in parallel.

The result of a poliqarpd query is analyzed to find the text identifiers of the returned contexts. These identifiers are used to match the XML sources of the corpus with the returned contexts and to find the paragraph and segment identifiers of the segments in those contexts.

Each modification to the corpus annotation is saved in the database. Such an entry consists of the identifier of the text in which the change was made, the paragraph ID, the segment ID, the linguist ID, the change in annotation itself, the query used to find the edited context, and additional comments.

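The shape of a saved correction, as listed above, can be pictured as a simple record. This is only an illustrative sketch in Python: the class name, the field names and all example values are made up for this README and do not reflect the actual database schema or Java classes used by CorpCor.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Correction:
    """One saved annotation change (hypothetical field names)."""
    text_id: str       # identifier of the text in which the change was made
    paragraph_id: str  # paragraph ID within that text
    segment_id: str    # segment ID within that paragraph
    linguist_id: str   # who made the change
    change: str        # the change in annotation itself
    query: str         # the query used to find the edited context
    comment: str = ""  # additional comments


# All values below are invented placeholders, not real corpus data.
example = Correction(
    text_id="text-001",
    paragraph_id="p-3",
    segment_id="seg-3.12",
    linguist_id="gwt",
    change="old-tag -> new-tag",
    query="[orth=to]",
    comment="illustrative entry only",
)

row = asdict(example)  # a plain dict, e.g. for storing or logging
```

Making the record immutable (frozen=True) is a design choice of this sketch, reflecting that a logged correction describes a past change and should not be edited afterwards.
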
INSTALLATION AND CONFIGURATION
The application has been tested on Windows and Linux platforms. You may need to provide poliqarpd binaries for your specific Linux installation (a Windows binary is provided in the package).

You need a Java web application server to run CorpCor. It has been tested on:
apache-tomcat-6.0.35

A Java DataSource must be configured in order to create user (linguist) accounts and to save corrections. A MySQL datasource is configured in the provided package (with login/password 'corpcor' for a MySQL instance running on localhost on the default port), but it may be changed to suit your needs in:
WEB-INF/classes/META-INF/persistence.xml

The necessary tables are created on the first run of the application. You may then import the provided import.sql file to create the 'gwt' and 'admin' users (with passwords 'gwt' and 'admin', respectively).

Other configuration files in WEB-INF/classes include:
---
corpus.properties, which contains the source corpus location, specified in a config.xml file such as:
---
<apiconf>
  <corpus type="TEI" id="NKJP1M">
    <text relativePath="false" path="/home/lkobylin/corpcor/nkjp1M-1.1-source" />
  </corpus>
  <senseInventory type="TEI" path="/home/lkobylin/corpcor/nkjp1M-1.1-source/NKJP_WSI.xml" />
</apiconf>

---
pqc.properties, which contains the location of the binary version of the corpus:
---
corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5

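To illustrate what the context settings control, here is a minimal sketch in Python that reads the values above and cuts a left/right context around a match. It is not CorpCor code: the function names are hypothetical, and it assumes that context lengths are counted in segments, which this README does not state explicitly.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def context_window(segments, match_start, match_end, left, right):
    """Split a segment list into (left context, match, right context)."""
    left_ctx = segments[max(0, match_start - left):match_start]
    match = segments[match_start:match_end]
    right_ctx = segments[match_end:match_end + right]
    return left_ctx, match, right_ctx


PQC = """\
corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5
"""

props = parse_properties(PQC)
segments = [f"s{i}" for i in range(20)]  # placeholder segments s0..s19
left_ctx, match, right_ctx = context_window(
    segments, 10, 11,
    int(props["left-context"]), int(props["right-context"]),
)
```

With the default values, a one-segment match at position 10 is shown with five segments on each side; near the start or end of a text the window is simply shorter.
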
---
pqd.properties, which contains the configuration of the poliqarpd daemon pool run on the server:
---
hostname=127.0.0.1
# starting port of poliqarpd daemons
port=45678
logging=on
#log-file=poliqarpd.log
match-buffer-size=10000
max-match-length=1000
max-session-idle=86400
corpus=any
pqd.max-connections=10

---
pqlm.properties, which contains the daemon monitor configuration:
---
# log directory
log-root-dir=/tmp/pqdlogs
# where the poliqarpd binary will be copied
deployment-root-dir=/tmp/pqdinstalls
# should the deployed binaries be deleted on application exit
delete-deployment-on-exit=true
# longest time we wait for the daemon to respond (after which it is killed and recreated)
watchdog-delay-milis=30000
# pool size
min-daemons=1
max-daemons=10
# binary platform selection
daemon-platform=linux
daemon-version=1.3.13
# alternative:
# daemon-platform=win32
# daemon-version=1.3.12
#
# session timeout
session-timeout-secs=900

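When planning firewall rules or checking for port conflicts, it may be useful to estimate which ports the daemon pool can occupy. The sketch below, in Python, combines 'port' from pqd.properties with the pool size from pqlm.properties. It rests on an assumption that is NOT confirmed by this README: that each daemon in the pool listens on the next consecutive port after the starting port. The function names are hypothetical.

```python
def check_pool_size(min_daemons, max_daemons):
    """Raise ValueError if the pool bounds are inconsistent."""
    if min_daemons < 1:
        raise ValueError("min-daemons must be at least 1")
    if max_daemons < min_daemons:
        raise ValueError("max-daemons must not be smaller than min-daemons")


def daemon_ports(start_port, max_daemons):
    """Ports the pool could use, ASSUMING consecutive port allocation."""
    return list(range(start_port, start_port + max_daemons))


# Values taken from the pqd.properties and pqlm.properties examples above.
check_pool_size(min_daemons=1, max_daemons=10)
ports = daemon_ports(start_port=45678, max_daemons=10)
```

If the assumption holds, the default configuration would need ports 45678 through 45687 to be free on the server.
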
A Maven script is included in the package. To create a deployment .war file, run:
mvn package