
Diff for "TermoPL"

Differences between revisions 80 and 150 (spanning 70 versions)
Revision 80 as of 2020-05-28 10:19:12
Size: 3114
Editor: PiotrRychlik
Comment:
Revision 150 as of 2025-03-05 16:11:49
Size: 5712
Editor: PiotrRychlik
Comment:
Line 8: Line 8:
TermoPL is a tool for automatic extraction of Polish terminology. Its description can be found in the corrected version of the [[attachment:article-LREC2016.pdf|article]] and the [[attachment:poster-LREC2016.pdf|poster]] presented at LREC 2016. TermoPL is a tool created to extract terminology from domain corpora. It can also be used for other languages as long as you define the appropriate tagset and grammar.
The program extracts phrases that are candidates for terms, using Universal Dependencies (UD) structures obtained from UD parsers or a simple, customizable grammar.
It applies the C-value method to rank term candidates (either the longest identified acceptable phrases or their nested subphrases). The method operates on simplified base forms in order to unify morphological variants of terms and to recognise their contexts. For the method using simple grammar templates, the program supports the recognition of nested terms based on word connection strength, which makes it possible to eliminate truncated phrases from the top of the term list. For Polish, the program has an option to convert simplified forms of phrases into correct phrases in the nominative case. TermoPL accepts as input morphologically annotated and disambiguated domain texts and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using four different coefficients that show the asymmetry of term occurrences in the two data sets. For Polish texts, TermoPL can group semantically related terms using plWordNet.
Line 10: Line 12:

Its description can be found in the corrected version of the [[attachment:article-LREC2016.pdf|article]] and the [[attachment:poster-LREC2016.pdf|poster]] presented at LREC 2016.

Małgorzata Marciniak, Agnieszka Mykowiecka, and Piotr Rychlik. [[attachment:article-LREC2016.pdf|TermoPL — a flexible tool for terminology extraction]]. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pages 2278–2284, Portorož, Slovenia, 2016. European Language Resources Association (ELRA). [[attachment:mar_myk_rych_lrec16.bib|bibtex|&do-view]]
Line 13: Line 19:
=== Downloads (current version: 5.1.0) ===

=== Downloads (current version: 8.0.0) ===
Line 17: Line 25:
 * [[attachment:parse.py|Parser|&do-get]]
Line 19: Line 28:
 * [[attachment:jars.zip|Additionlal jars|&do-get]]
 * [[attachment:filters.zip|Filters|&do-get]]
 * [[attachment:grammars.zip|Grammars|&do-get]]
 * [[attachment:jars.zip|Additional jars|&do-get]]
 * [[attachment:languages.zip|Languages|&do-get]]
 * [[attachment:tagset.pdf|Default tagset|&do-get]]
Line 23: Line 32:
=== Installing prebuilt packages === === Requirements ===
Line 25: Line 34:
1. Basic requirements TermoPL is written in Java and therefore requires the Java Runtime Environment (version 8 or later) on the target machine (Windows, Linux or Mac OS X). It can be downloaded from https://www.java.com/en/download/. TermoPL uses Morfeusz 2.0 to generate base forms of terms and to produce simplified forms for the list of common terms, so the Morfeusz 2.0 libraries have to be installed as well; they are not required if the user does not need to work on base forms. Installing Morfeusz separately is necessary only on Unix. For other systems, the relevant libraries are supplied with TermoPL. Morfeusz can be downloaded from http://sgjp.pl/morfeusz/dopobrania.html by following the instructions there. To enable tagging and parsing of plain text files, the Python interpreter has to be installed. The Python script that creates CoNLL-U files was tested with Python 3.11.5. The Python interpreter can be downloaded from https://www.python.org/downloads/. After installing Python, install the stanza module by running the following command:
Line 27: Line 36:
To take part in the workshop you need a computer with Internet access, running Windows, Linux (Ubuntu) or Mac OS X, with a Java virtual machine (version 8 or later) installed. On Unix, Morfeusz 2 also has to be installed. '''> pip install stanza'''
Line 29: Line 38:
2. Installing Java
Line 31: Line 39:
Java can be downloaded from https://www.java.com/pl/.
Make sure that the installed version matches the architecture (32- or 64-bit) of your computer.
The program requires about 1 GB of RAM to process a corpus of approximately 500,000 tokens. For considerably larger data, reserve more memory by invoking the program with the -Xmx and -Xms Java options, e.g.:
Line 34: Line 41:
3. Installing Morfeusz 2 '''> java -Xmx5G -Xms4G -jar TermoPL.jar''',
Line 36: Line 43:
Installing this software is necessary only on Unix. For other systems, the relevant libraries are supplied with TermoPL. Morfeusz can be downloaded from http://sgjp.pl/morfeusz/dopobrania.html by following the instructions there. which reserves at least 4 GB and at most 5 GB of memory for the program. The Python script that converts files to the CoNLL-U format requires approximately 1 GB of additional RAM.
Line 38: Line 45:
4. Downloading and running TermoPL === Installing and Running TermoPL ===
Line 40: Line 47:
TermoPL can be downloaded from zil.ipipan.waw.pl/TermoPL. The page offers prebuilt packages for Mac OS X, Linux (Ubuntu) and Windows in 32- and 64-bit versions. Select the appropriate package by clicking on its name and then click "Download".
The software is downloaded as a zip file to the "Downloads" (or "Pobrane") directory. The downloaded zip file then has to be unpacked. On Mac OS X, unpacking starts automatically.
Three TermoPL software packages have been developed for Windows, Mac OS and Linux (Ubuntu):
Line 43: Line 49:
It is also advisable to download the exercise data.

The program is started by double-clicking the icon of the TermoPL.jar file.
TermoPL can also be started with the command contained in the termopl.bat file.

On Unix, the command is

java -Djava.library.path=/usr/lib/jni/ -jar TermoPL.jar

=== Packages ===

 * [[attachment:TermoPL_Mac_OS_X.zip|TermoPL Mac OS X|&do-get]]
 * [[attachment:TermoPL_Mac_OS_X.zip|TermoPL Mac OS X (arm_64)|&do-get]]
Line 56: Line 51:
 * [[attachment:TermoPL_Win32.zip|TermoPL Win32|&do-get]]
Line 59: Line 53:
Select the relevant package by clicking on its name and then click on ‘Download’.
Line 60: Line 55:
=== Exercise data === The program is distributed as an executable jar file, so it can be started by double-clicking its icon. TermoPL can also be run with the command contained in the termopl.bat (Windows) or termopl.sh (Mac OS, Linux) file. On a Unix system, the command is:
Line 62: Line 57:
 * [[attachment:warsztaty-dane.zip|data|&do-get]] '''> java -Djava.library.path=/usr/lib/jni/ -jar TermoPL.jar'''

-----
==== Miscellaneous resources ====
-----
Line 65: Line 64:
== Seminar on computer science terminology, 5.07.2017 ==  * [[attachment:TermoUD-results.zip|EACL 2023 (results for ACTER, Genia, RD-TEC and RSDO5)]]
 * [[attachment:Finnish.zip|Results for small Finnish corpus concerning economy]]
 * [[attachment:TermoUD.mp4|TermoUD demo for EACL 2023 (.mp4)]]
Line 67: Line 68:
 * [[attachment:sem-ptj.pdf|Talk|&do-get]] -----

 * [[attachment:warsztaty-dane.zip|Exercise data|&do-get]]
 * [[attachment:sem-ptj.pdf|Seminar on computer science terminology, 5.07.2017|&do-get]]
Line 69: Line 73:

-----

TermoPL

TermoPL is a tool created to extract terminology from domain corpora. It can also be used for other languages as long as you define the appropriate tagset and grammar. The program extracts phrases that are candidates for terms, using Universal Dependencies (UD) structures obtained from UD parsers or a simple, customizable grammar. It applies the C-value method to rank term candidates (either the longest identified acceptable phrases or their nested subphrases). The method operates on simplified base forms in order to unify morphological variants of terms and to recognise their contexts. For the method using simple grammar templates, the program supports the recognition of nested terms based on word connection strength, which makes it possible to eliminate truncated phrases from the top of the term list. For Polish, the program has an option to convert simplified forms of phrases into correct phrases in the nominative case. TermoPL accepts as input morphologically annotated and disambiguated domain texts and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using four different coefficients that show the asymmetry of term occurrences in the two data sets. For Polish texts, TermoPL can group semantically related terms using plWordNet.

Its description can be found in the corrected version of the article and the poster presented at LREC 2016.

Małgorzata Marciniak, Agnieszka Mykowiecka, and Piotr Rychlik. TermoPL — a flexible tool for terminology extraction. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pages 2278–2284, Portorož, Slovenia, 2016. European Language Resources Association (ELRA). bibtex
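
The C-value ranking mentioned in the description above can be illustrated with a small sketch. It follows a simplified reading of the classic C-value formulation (Frantzi, Ananiadou and Mima), which is an assumption here: TermoPL's own variant, described in the cited article, may differ, for example in how one-word terms and nested counts are handled. The function names and the toy frequencies are hypothetical.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous, strictly shorter
    subsequence of the token tuple `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score candidate phrases with a simplified classic C-value.

    `freq` maps each candidate (a tuple of base-form tokens) to its total
    corpus frequency, nested occurrences included. Note that with a plain
    log2 of the phrase length, one-word candidates always score 0.
    """
    scores = {}
    for phrase, f in freq.items():
        # Frequencies of the longer candidates that contain this phrase.
        containers = [fb for other, fb in freq.items() if contains(other, phrase)]
        if containers:
            # Nested phrase: subtract the mean frequency of its containers.
            f = f - sum(containers) / len(containers)
        scores[phrase] = math.log2(len(phrase)) * f
    return scores


# Toy data (invented for illustration only).
freq = {
    ("soft", "contact", "lens"): 4,
    ("contact", "lens"): 10,
}
scores = c_value(freq)
for phrase, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(phrase), round(score, 3))
```

In this toy example the nested phrase is penalised for the occurrences it owes to the longer phrase, which is the mechanism that keeps truncated phrases from dominating the top of the list.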

LICENSE

Downloads (current version: 8.0.0)

Requirements

TermoPL is written in Java and therefore requires the Java Runtime Environment (version 8 or later) on the target machine (Windows, Linux or Mac OS X). It can be downloaded from https://www.java.com/en/download/. TermoPL uses Morfeusz 2.0 to generate base forms of terms and to produce simplified forms for the list of common terms, so the Morfeusz 2.0 libraries have to be installed as well; they are not required if the user does not need to work on base forms. Installing Morfeusz separately is necessary only on Unix. For other systems, the relevant libraries are supplied with TermoPL. Morfeusz can be downloaded from http://sgjp.pl/morfeusz/dopobrania.html by following the instructions there. To enable tagging and parsing of plain text files, the Python interpreter has to be installed. The Python script that creates CoNLL-U files was tested with Python 3.11.5. The Python interpreter can be downloaded from https://www.python.org/downloads/. After installing Python, install the stanza module by running the following command:

> pip install stanza
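
After running the command above, a quick check can confirm that the interpreter sees the stanza module. The snippet below is a minimal sketch using only the Python standard library; the function name `check_environment` is hypothetical and is not part of TermoPL or of its parser script.

```python
import importlib.util
import sys


def check_environment():
    """Report the running Python version and whether stanza can be imported."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # find_spec returns None when a top-level module is not installed.
        "stanza_installed": importlib.util.find_spec("stanza") is not None,
    }


report = check_environment()
print("Python version:", report["python_version"])
print("stanza installed:", report["stanza_installed"])
```

If `stanza installed` is reported as `False`, the pip command was most likely run for a different Python installation than the one executing this check.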

The program requires about 1 GB of RAM to process a corpus of approximately 500,000 tokens. For considerably larger data, reserve more memory by invoking the program with the -Xmx and -Xms Java options, e.g.:

> java -Xmx5G -Xms4G -jar TermoPL.jar,

which reserves at least 4 GB and at most 5 GB of memory for the program. The Python script that converts files to the CoNLL-U format requires approximately 1 GB of additional RAM.

Installing and Running TermoPL

Three TermoPL software packages have been developed for Windows, Mac OS and Linux (Ubuntu):

Select the relevant package by clicking on its name and then click on ‘Download’.

The program is distributed as an executable jar file, so it can be started by double-clicking its icon. TermoPL can also be run with the command contained in the termopl.bat (Windows) or termopl.sh (Mac OS, Linux) file. On a Unix system, the command is:

> java -Djava.library.path=/usr/lib/jni/ -jar TermoPL.jar
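
For scripted runs, the command above can be assembled programmatically and combined with the Java heap options described under Requirements. The following is a sketch only: the helper `termopl_command` is hypothetical, the defaults mirror the command lines shown on this page, and the actual contents of termopl.bat and termopl.sh are not reproduced here.

```python
def termopl_command(jar="TermoPL.jar", jni_dir="/usr/lib/jni/",
                    min_heap=None, max_heap=None):
    """Build the argument list for launching TermoPL.

    jni_dir  -- directory passed as java.library.path (the Unix location
                shown above); pass None to omit the option.
    min_heap -- value for -Xms, e.g. "4G"; omitted when None.
    max_heap -- value for -Xmx, e.g. "5G"; omitted when None.
    """
    command = ["java"]
    if max_heap:
        command.append(f"-Xmx{max_heap}")
    if min_heap:
        command.append(f"-Xms{min_heap}")
    if jni_dir:
        command.append(f"-Djava.library.path={jni_dir}")
    command += ["-jar", jar]
    return command


# Print the command line instead of executing it.
print(" ".join(termopl_command(min_heap="4G", max_heap="5G")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the directory that contains TermoPL.jar.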


Miscellaneous resources