Diff for "PDB/PDBparser"

Differences between revisions 70 and 92 (spanning 22 versions)
Revision 70 as of 2020-01-20 10:19:33
Size: 10424
Comment:
Revision 92 as of 2022-09-08 16:19:40
Size: 12716
Comment:
Line 2: Line 2:
== PDB-trained dependency parsing models for Polish == == Polish COMBO models ==
Line 4: Line 4:
The PDB-based models are trained on the current version of [[http://zil.ipipan.waw.pl/PDB|Polish Dependency Bank]] with the publicly available parsing systems – [[https://github.com/360er0/COMBO|COMBO]], [[https://code.google.com/archive/p/mate-tools/|MateParser]] and [[http://maltparser.org|MaltParser]]. /* ''MaltParser'' is a transition-based dependency parser that uses a deterministic parsing algorithm. The deterministic parsing algorithm builds a dependency structure of an input sentence based on transitions (shift-reduce actions) predicted by a classifier. The classifier learns to predict the next transition given training data and the parse history. `MateParser`, in turn, is a graph-based parser that defines a space of well-formed candidate dependency trees for an input sentence, scores them given an induced parsing model, and selects the highest scoring dependency tree as a correct analysis of the input sentence. */ The [[https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/tree/master|COMBO]] models for Polish are trained on the current version of [[http://zil.ipipan.waw.pl/PDB|Polish Dependency Bank]]. The models use the [[https://huggingface.co/allegro/herbert-base-cased|HerBERT]] language model.
Line 6: Line 6:
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/200118_COMBO_PDB_nosem_parseonly.pkl|COMBO model]] for dependency parsing only
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/200118_COMBO_PDB_nosem_full.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/200118_COMBO_PDB_sem_full.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, dependency parsing, and semantic role labelling
== PDB-trained models ==
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO_pytorch/combo_PDB_parseonly_220906.tar.gz|model]] for dependency parsing only
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO_pytorch/combo_PDB_full_220906.tar.gz|model]] for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing (dependency relation types '''without''' semantic extensions, e.g. adjunct instead of adjunct_temp)
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO_pytorch/combo_PDB_full_SEMLAB_220906.tar.gz|model]] for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing (dependency relation types '''with''' semantic extensions, e.g. adjunct_temp)

== PDB-UD-trained model ==
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO_pytorch/combo_PDBUD_full_220906.tar.gz|model]] for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing
Line 10: Line 15:
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/191107_COMBO_PDB_semlab_parseonly.pkl|COMBO model]] for (semantic) dependency parsing only}}} [[https://github.com/360er0/COMBO|COMBO]], [[https://code.google.com/archive/p/mate-tools/|MateParser]] and [[http://maltparser.org|MaltParser]]. /* ''MaltParser'' is a transition-based dependency parser that uses a deterministic parsing algorithm. The deterministic parsing algorithm builds a dependency structure of an input sentence based on transitions (shift-reduce actions) predicted by a classifier. The classifier learns to predict the next transition given training data and the parse history. `MateParser`, in turn, is a graph-based parser that defines a space of well-formed candidate dependency trees for an input sentence, scores them given an induced parsing model, and selects the highest scoring dependency tree as a correct analysis of the input sentence. */

 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO_pytorch/combo_PDB_parseonly_220906.tar.gz|COMBO-pytorch model]] for dependency parsing only (with [[https://huggingface.co/allegro/herbert-base-cased|HerBERT-base]] embeddings),
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/20200930_COMBO_PDB_nosem_parseonly.pkl|COMBO model]] for dependency parsing only
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/20200930_COMBO_PDB_nosem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/20200930_COMBO_PDB_sem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, dependency parsing, and semantic role labelling

 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/191107_COMBO_PDB_semlab_parseonly.pkl|COMBO model]] for (semantic) dependency parsing only
Line 17: Line 29:
The PDB-UD-based models are trained on the current version of [[http://git.nlp.ipipan.waw.pl/alina/PDBUD|Polish Dependency Bank in Universal Dependencies format]] with the publicly available parsing systems – [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] and [[https://github.com/360er0/COMBO|COMBO]]. The PDB-UD-based models are trained on the current version of [[http://git.nlp.ipipan.waw.pl/alina/PDBUD|Polish Dependency Bank in Universal Dependencies format]] with the publicly available parsing systems – [[https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/tree/master|COMBO-pytorch]], [[https://github.com/360er0/COMBO|COMBO]], [[http://ufal.mff.cuni.cz/udpipe|UDPipe]].
Line 19: Line 31:
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/190423_COMBO_PDBUD_nosem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/190423_COMBO_PDBUD_sem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, dependency parsing, and semantic role labelling
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/UDPIPE/190423_PDBUD_ttp_embedd.udpipe|UDPipe model]] for tokenisation, part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/UDPIPE/190423_PDBUD_tokeniser.udpipe|UDPipe model]] for tokenisation
 * [[http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz|COMBO-pytorch model]] for part-of-speech tagging, lemmatisation, and dependency parsing (with [[https://huggingface.co/allegro/herbert-base-cased|HerBERT-base]] embeddings),
 * [[http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-large.tar.gz|COMBO-pytorch model]] for part-of-speech tagging, lemmatisation, and dependency parsing (with [[https://huggingface.co/allegro/herbert-large-cased|HerBERT-large]] embeddings),
 * [[http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-ud27.tar.gz|COMBO-pytorch model]] for part-of-speech tagging, lemmatisation, and dependency parsing (with fastText embeddings),
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/20200930_COMBO_PDBUD_nosem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/COMBO/20200930_COMBO_PDBUD_sem.pkl|COMBO model]] for part-of-speech tagging, lemmatisation, dependency parsing, and semantic role labelling
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/UDPIPE/20200930_PDBUD_ttp_embedd.udpipe|UDPipe model]] for tokenisation, part-of-speech tagging, lemmatisation, and dependency parsing
 * [[http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/UDPIPE/20200930_PDBUD_tokeniser.udpipe|UDPipe model]] for tokenisation}}}
Line 28: Line 43:
== Parsing performance == == Parsing performance (outdated) ==
Line 30: Line 45:
See [[http://clip.ipipan.waw.pl/benchmarks|Dependency parsing]] section. See [[http://clip.ipipan.waw.pl/benchmarks#Dependency_parsing|Dependency parsing]] section.
Line 91: Line 106:
== PDB-based MaltParser in Multiservice ==
 * The performance of !MaltParser model for Polish may be tested in Multiservice NLP – [[http://multiservice.nlp.ipipan.waw.pl]].
 * To parse a Polish text in Multiservice "Select predefined chain of actions": 5: Concraft, !DependencyParser, input your text, and press the button "Run".
 * To download the parser's output in CoNLL format, "Select output format:":
== PDB-based dependency parsing demos ==

 * [[http://scwad-demo.nlp.ipipan.waw.pl:8000/dependency-parsing|COMBO demo]] (only in Polish)
 * [[http://multiservice.nlp.ipipan.waw.pl|MaltParser demo in Multiservice NLP]]
  * To parse a Polish text in Multiservice "Select predefined chain of actions": 5: Concraft, !DependencyParser, input your text, and press the button "Run".
  * To download the parser's output in CoNLL format, "Select output format:".
Line 109: Line 126:
== Founding == == Acknowledgment ==

Polish COMBO models

The COMBO models for Polish are trained on the current version of Polish Dependency Bank. The models use the HerBERT language model.

PDB-trained models

  • model for dependency parsing only

  • model for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing (dependency relation types without semantic extensions, e.g. adjunct instead of adjunct_temp)

  • model for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing (dependency relation types with semantic extensions, e.g. adjunct_temp)

PDB-UD-trained model

  • model for part-of-speech tagging, morphological analysis, lemmatisation, and dependency parsing
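
The following is a minimal sketch of how one of the archives listed above might be fetched and used from Python. The download URL is the PDB-UD archive linked from this page; the `models` directory mentioned in the note below, the helper names, and the example sentence are hypothetical. The `parse` helper assumes that the `combo` package from the linked COMBO repository is installed separately and that it provides `COMBO.from_pretrained`, which accepts a local archive path, as described in that project's README; check this against the version you install.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL of the PDB-UD-trained model archive linked from this page.
MODEL_URL = (
    "http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/"
    "COMBO_pytorch/combo_PDBUD_full_220906.tar.gz"
)


def archive_name(url: str) -> str:
    """Return the file name that a download URL ends with."""
    return Path(urlparse(url).path).name


def download_model(url: str, target_dir: Path) -> Path:
    """Download the archive unless it is already present; return its local path."""
    target_dir.mkdir(parents=True, exist_ok=True)
    local_path = target_dir / archive_name(url)
    if not local_path.exists():
        urlretrieve(url, local_path)
    return local_path


def parse(model_path: Path, text: str):
    """Parse text with COMBO (assumed API; requires the combo package)."""
    # Assumption: the package exposes combo.predict.COMBO with from_pretrained().
    from combo.predict import COMBO

    parser = COMBO.from_pretrained(str(model_path))
    return parser(text)
```

A typical call would be `parse(download_model(MODEL_URL, Path("models")), "Ala ma kota.")`. The archive is passed to COMBO as downloaded, without unpacking, and the downloaded file is reused on later runs.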

Parsing performance (outdated)

See Dependency parsing section.

PDB-based dependency parsing demos

  • COMBO demo (only in Polish)

  • MaltParser demo in Multiservice NLP

    • To parse a Polish text in Multiservice, choose chain 5 (Concraft, DependencyParser) under "Select predefined chain of actions", enter your text, and press the "Run" button.

    • To download the parser's output in CoNLL format, choose that format under "Select output format:".
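
Once the output has been downloaded in CoNLL format, it can be read with a few lines of Python. The sketch below assumes the usual tab-separated, ten-column CoNLL layout, with the token index in column 1, the word form in column 2, the head index in column 7 and the dependency label in column 8. The sample sentence and its labels are illustrative only and are not actual parser output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int
    form: str
    head: int
    deprel: str


def read_conll(text: str) -> list[list[Token]]:
    """Split CoNLL text into sentences of (index, form, head, deprel) tokens."""
    sentences: list[list[Token]] = []
    current: list[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip non-token rows such as multiword ranges (1-2)
        current.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input; the labels are hypothetical.
SAMPLE = (
    "1\tAla\tAla\tsubst\tsubst\t_\t2\tsubj\t_\t_\n"
    "2\tma\tmieć\tfin\tfin\t_\t0\troot\t_\t_\n"
    "3\tkota\tkot\tsubst\tsubst\t_\t2\tobj\t_\t_\n"
)

for token in read_conll(SAMPLE)[0]:
    print(token.index, token.form, token.head, token.deprel)
```

A head index of 0 marks the root of the sentence. The reader expects an integer in the head column, so rows with an unfilled head ("_") would need extra handling.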

Publications

List of publications

Alina Wróblewska and Piotr Rybak. Dependency parsing of Polish. Poznań Studies in Contemporary Linguistics, 55(2):305–337, 2019.

(Note: Please contact the first author to get a copy of this article.)

Alina Wróblewska. Polish Dependency Parser Trained on an Automatically Induced Dependency Bank. Ph.D. dissertation, Institute of Computer Science, Polish Academy of Sciences, Warsaw, 2014.

Alina Wróblewska and Adam Przepiórkowski. Projection-based annotation of a Polish dependency treebank. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pages 2306–2312, Reykjavík, Iceland, 2014. European Language Resources Association (ELRA).

Alina Wróblewska. Polish dependency bank. Linguistic Issues in Language Technology, 7(1), 2012.

Alina Wróblewska and Marcin Woliński. Preliminary experiments in Polish dependency parsing. In Pascal Bouvry, Mieczysław A. Kłopotek, Franck Leprevost, Małgorzata Marciniak, Agnieszka Mykowiecka, and Henryk Rybiński, editors, Security and Intelligent Information Systems: International Joint Conference, SIIS 2011, Warsaw, Poland, June 13-14, 2011, Revised Selected Papers, number 7053 in Lecture Notes in Computer Science, pages 279–292. Springer-Verlag, 2012.

Licensing

The dependency parsing models for Polish are released under the CC BY-NC-SA 4.0 licence and by downloading them you accept the conditions of that licence.

Acknowledgment

The research was funded by SONATA 8 grant no. 2014/15/D/HS2/03486 from the National Science Centre Poland and by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure. The computations were performed at the Poznań Supercomputing and Networking Center.

Contact

Any questions, comments? Please send them to <alina AT SPAMFREE ipipan DOT waw DOT pl>.