CorpCor 1.0
A web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NCP).

Authors: Łukasz Kobyliński, Nestor Pawłowski.
License: GPL v.3.

ABOUT
CorpCor integrates Poliqarp (http://poliqarp.sourceforge.net/), a library for querying large corpora, with a web-based interface for correcting morphosyntactic annotation in a TEI XML encoded text corpus.

Specifically, it has been used to correct mistakes in the morphosyntactic annotation layer of the National Corpus of Polish.

HOW IT WORKS
CorpCor depends on two sources of information:
 - the text corpus in its source version (XML format),
 - the text corpus in the binary format created by Poliqarp.

The binary version of the corpus must contain the text identifiers used in the source version. Consequently, your .bp.conf file needs the following mappings when you generate the binary version:
[meta]
name = idno
path = /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']

[meta]
name = textId
path = /tei:teiHeader/@xml:id
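To check which values the two mappings above would pick up from a given header, the same paths can be evaluated outside Poliqarp. The following is a minimal sketch in Python, not part of CorpCor: the sample header, its identifier values and the function name are made up for illustration, and the header is assumed to have teiHeader as its root element, as the paths suggest.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
NS = {"tei": TEI_NS}

# Hypothetical, heavily trimmed header; the identifier values are invented.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0" xml:id="sample-text-001">
  <fileDesc>
    <sourceDesc>
      <bibl>
        <idno type="nkjp">SampleIdno</idno>
      </bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""


def read_text_identifiers(header_xml):
    """Return the values selected by the 'textId' and 'idno' mappings."""
    root = ET.fromstring(header_xml)
    if root.tag != "{%s}teiHeader" % TEI_NS:
        raise ValueError("expected a tei:teiHeader root element")
    # /tei:teiHeader/@xml:id
    text_id = root.get("{%s}id" % XML_NS)
    # /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']
    idno_el = root.find(
        "tei:fileDesc/tei:sourceDesc/tei:bibl/tei:idno[@type='nkjp']", NS
    )
    idno = idno_el.text.strip() if idno_el is not None and idno_el.text else None
    return {"textId": text_id, "idno": idno}


if __name__ == "__main__":
    print(read_text_identifiers(SAMPLE_HEADER))
```

A header for which either value comes back as None would probably leave the binary corpus without the identifier CorpCor needs for that text.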
  25 
  26 ***
  27 
  28 Queries provided by a linguist are first processed by conducting a poliqarp query. To run a query, an instance of poliqarpd is run on the server. A pool of poliqarpd daemons is maintained to answer simultaneous queries in parallel.
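The idea of a daemon pool can be illustrated with a small sketch in Python. It is not CorpCor's actual implementation (the application itself is a Java web application). The class and method names are hypothetical, and each daemon is represented only by a port number, on the assumption that daemons listen on consecutive ports counted from the configured starting port.

```python
import queue
from contextlib import contextmanager


class DaemonPool:
    """Hands out idle daemons so that each running query has one to itself."""

    def __init__(self, ports):
        self._idle = queue.Queue()
        for port in ports:
            self._idle.put(port)

    def idle_count(self):
        """Approximate number of daemons currently not serving a query."""
        return self._idle.qsize()

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a daemon is idle; raises queue.Empty if the timeout expires.
        port = self._idle.get(timeout=timeout)
        try:
            yield port
        finally:
            # Always return the daemon to the pool, even if the query failed.
            self._idle.put(port)


if __name__ == "__main__":
    # Assumption: three daemons on consecutive ports from a starting port.
    pool = DaemonPool(range(45678, 45678 + 3))
    with pool.borrow() as port:
        print("query would be sent to 127.0.0.1:%d" % port)
```

Because borrowing blocks while all daemons are busy, the number of queries running at the same time never exceeds the pool size.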

The result of a poliqarpd query is analyzed to find the text identifiers of the returned contexts. These identifiers are used to match the returned contexts with the XML sources of the corpus and to find the paragraph and segment identifiers of the segments in those contexts.

Each modification to the corpus annotation is saved in the database. Such an entry consists of the identifier of the text in which the change was made, the paragraph ID, the segment ID, the linguist ID, the change in annotation itself, the query used to find the edited context, and additional comments.
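The fields listed above can be pictured as a simple record. The sketch below is in Python and only mirrors that list: the class name, the field names, the field types and the sample values are all assumptions, not the actual database schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Correction:
    """One saved change to the corpus annotation (hypothetical layout)."""

    text_id: str       # identifier of the text in which the change was made
    paragraph_id: str  # paragraph ID
    segment_id: str    # segment ID
    linguist_id: str   # the linguist who made the change
    change: str        # the change in annotation itself
    query: str         # the query used to find the edited context
    comment: str = ""  # additional comments


if __name__ == "__main__":
    # All values below are invented for illustration.
    example = Correction(
        text_id="sample-text-001",
        paragraph_id="p-1",
        segment_id="seg-1.5",
        linguist_id="linguist-7",
        change="old-tag -> new-tag",
        query="[orth=example]",
        comment="illustrative entry",
    )
    print(asdict(example))
```

Keeping the query alongside the change makes it possible to see later how each edited context was found.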

INSTALLATION AND CONFIGURATION
The application has been tested on Windows and Linux platforms. You may need to provide poliqarpd binaries for your specific Linux installation (a Windows binary is provided in the package).

You need a Java web application server to run CorpCor. It has been tested on:
apache-tomcat-6.0.35

A Java DataSource must be configured so that user (linguist) accounts can be created and corrections saved. A MySQL datasource is configured in the provided package (with login/password 'corpcor' for a MySQL instance running on localhost on the default port), but it may be changed to suit your needs in:
WEB-INF/classes/META-INF/persistence.xml

The necessary tables are created on the first run of the application. You may then import the provided import.sql file to create the 'gwt' and 'admin' users (with passwords 'gwt' and 'admin', respectively).

Other configuration files in WEB-INF/classes include:
---
corpus.properties, which contains the location of the source corpus, specified in a config.xml file such as:
---
<apiconf>
        <corpus type="TEI" id="NKJP1M">
                <text relativePath="false" path="/home/lkobylin/corpcor/nkjp1M-1.1-source" />
        </corpus>
        <senseInventory type="TEI" path="/home/lkobylin/corpcor/nkjp1M-1.1-source/NKJP_WSI.xml" />
</apiconf>

---
pqc.properties, which contains the location of the binary version of the corpus:
---
corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5
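Before deploying, it can be useful to check that a file such as pqc.properties parses as intended. The following Python sketch reads the simple key=value form used in this README. It is a simplification: it ignores features of the full Java properties format such as line continuations and ':' separators, and the function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key=value line: %r" % raw_line)
        props[key.strip()] = value.strip()
    return props


def context_lengths(props):
    """Return the (left, right, wide) context lengths as integers."""
    return (
        int(props["left-context"]),
        int(props["right-context"]),
        int(props["wide-context"]),
    )


if __name__ == "__main__":
    sample = """corpus-image=/home/lkobylin/corpcor/nkjp1M-1.1-binary/nkjp1M
# context lengths
wide-context=50
left-context=5
right-context=5
"""
    props = parse_properties(sample)
    print(props["corpus-image"], context_lengths(props))
```

A missing key raises KeyError and a non-numeric length raises ValueError, so a mistyped file fails early rather than at query time.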

---
pqd.properties, which contains the configuration of the poliqarpd daemon pool run on the server:
---
hostname=127.0.0.1
# starting port of poliqarpd daemons
port=45678
logging=on
#log-file=poliqarpd.log
match-buffer-size=10000
max-match-length=1000
max-session-idle=86400
corpus=any
pqd.max-connections=10

---
pqlm.properties, which contains the daemon monitor configuration:
---
# log directory
log-root-dir=/tmp/pqdlogs
# where the poliqarpd binary will be copied
deployment-root-dir=/tmp/pqdinstalls
# should the deployed binaries be deleted on application exit
delete-deployment-on-exit=true
# longest time we wait for the daemon to respond (after which it is killed and recreated)
watchdog-delay-milis=30000
# pool size
min-daemons=1
max-daemons=10
# binary platform selection
daemon-platform=linux
daemon-version=1.3.13
# alternative:
# daemon-platform=win32
# daemon-version=1.3.12
#
# session timeout
session-timeout-secs=900
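A few of the monitor settings are related to one another, so a quick sanity check may be worthwhile before deployment. The Python sketch below assumes the settings have already been read into a dictionary of strings. The checks themselves (a non-negative minimum that does not exceed the maximum, positive timeouts) are assumptions about what a sensible configuration looks like, not rules documented by CorpCor, and the function name is hypothetical.

```python
def check_monitor_settings(props):
    """Validate pool size and timeouts; return them as numbers."""
    min_daemons = int(props["min-daemons"])
    max_daemons = int(props["max-daemons"])
    if min_daemons < 0 or min_daemons > max_daemons:
        raise ValueError("expected 0 <= min-daemons <= max-daemons")

    # Note: the key really is spelled 'milis' in the configuration file.
    watchdog_ms = int(props["watchdog-delay-milis"])
    session_secs = int(props["session-timeout-secs"])
    if watchdog_ms <= 0 or session_secs <= 0:
        raise ValueError("timeouts must be positive")

    return {
        "min_daemons": min_daemons,
        "max_daemons": max_daemons,
        "watchdog_seconds": watchdog_ms / 1000.0,
        "session_timeout_seconds": session_secs,
    }


if __name__ == "__main__":
    settings = {
        "min-daemons": "1",
        "max-daemons": "10",
        "watchdog-delay-milis": "30000",
        "session-timeout-secs": "900",
    }
    print(check_monitor_settings(settings))
```

With the values shown in this README, the watchdog waits 30 seconds for a daemon to respond and a session times out after 15 minutes.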

A Maven script is included in the package. To create a deployment .war file, run:
mvn package
