Natural Language Processing Seminar 2018–2019
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube. |
1 October 2018 |
Janusz S. Bień (University of Warsaw – prof. emeritus) |
We will focus on the indexes to lexicographical resources available online in DjVu format. Such indexes can be browsed, searched, modified and created with the open-source program djview4poliqarp; the origins and history of the program will be briefly presented. Index support was originally added to the program to handle the list of entries in Linde's 19th-century dictionary, but it can also be used conveniently for other resources, as will be demonstrated on selected examples. In particular, some new features introduced to the program in recent months will be presented publicly for the first time. |
15 October 2018 |
Wojciech Jaworski, Szymon Rutkowski (University of Warsaw) |
The presentation will be devoted to the multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish; it does not use the concept of an inflection paradigm. The model consists of three layers of hand-made rules: the "orthographic-phonetic layer", which converts a segment to a representation reflecting the morphological patterns of the language, the "analytic layer", which generates the lemma and determines the affix, and the "interpretation layer", which gives a morphosyntactic interpretation based on the detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with a guesser, i.e. a function that guesses lemmas and morphosyntactic interpretations for non-dictionary forms. The second use of the model is the generation of word forms based on a lemma and a morphosyntactic interpretation. The presentation will also cover the issue of disambiguating the results provided by the morphological analyzer. The demo version of the program is available on the Internet. |
29 October 2018 |
Jakub Waszczuk (Heinrich-Heine-Universität Düsseldorf) |
The first part of the talk was dedicated to Concraft-pl 2.0, the new version of a morphosyntactic tagger for Polish based on conditional random fields. Concraft-pl 2.0 performs morphosyntactic segmentation as a by-product of disambiguation, which allows it to be used directly on the segmentation graphs provided by the analyser Morfeusz. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. During the second part, an approach to identifying verbal multiword expressions (VMWEs) based on dependency parsing results was presented. In this approach, VMWE identification is reduced to the problem of dependency tree labeling, where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree. The underlying labeling model can be seen as conditional random fields (as used in Concraft) adapted to tree structures. A system based on this approach ranked 1st in the closed track of the PARSEME shared task 2018. |
19 November 2018 |
Daniel Zeman (Institute of Formal and Applied Linguistics, Charles University in Prague) |
I will present Universal Dependencies, a worldwide community effort aimed at providing multilingual corpora, annotated at the morphological and syntactic levels following unified annotation guidelines. I will discuss the concept of core arguments, one of the cornerstones of the UD framework. In the second part of the talk I will focus on some interesting problems and challenges of applying Universal Dependencies to the Slavic languages. I will discuss examples from 12 Slavic languages that are currently represented in UD and show that cross-linguistic consistency can still be improved. |
3 December 2018 |
Ekaterina Lapshinova-Koltunski (Saarland University) |
In this talk, I will report on the ongoing work on coreference analysis in a multilingual context. I will present two approaches to the analysis of coreference and coreference-related phenomena: (1) top-down or theory-driven: here we start from some linguistic knowledge derived from existing frameworks, define the linguistic categories to analyse and create an annotated corpus that can be used either for further linguistic analysis or as training data for NLP applications; (2) bottom-up or data-driven: in this case, we start from a set of shallow features that we believe are discourse-related. We extract these structures from a huge amount of data and analyse them from a linguistic point of view, trying to describe and explain the observed phenomena in terms of existing theories and grammars. |
7 January 2019 |
Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw), Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford) |
The aim of this talk is to present the two threads of our recent work on Universal Dependencies (UD), a standard for syntactically annotated corpora (http://universaldependencies.org/). The first thread is concerned with the development of a new UD treebank of Polish, one that makes extensive use of the enhanced level of representation made available in the current UD standard. The treebank is the result of conversion from an earlier ‘treebank’ of Polish, one that was annotated with constituency and functional structures as they are understood in Lexical Functional Grammar. We will outline the conversion procedure and present the resulting UD treebank of Polish. The second thread is concerned with various inconsistencies and deficiencies of UD that we identified in the process of developing the UD treebank of Polish. We will concentrate on two particularly problematic areas in UD, namely, on the core/oblique distinction, which aims to – but does not really – replace the infamous argument/adjunct dichotomy, and on coordination, a phenomenon problematic for all dependency approaches. |
14 January 2019 |
Agata Savary (François Rabelais University Tours) |
Literal occurrences of multiword expressions: quantitative and qualitative analyses |
Multiword expressions (MWEs) such as “to pull strings” (to use one's influence), “to take part” or “to do in” (to kill) are word combinations which exhibit lexical, syntactic, and especially semantic idiosyncrasies. They pose special challenges to linguistic modeling and computational linguistics due to their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language. Additionally, MWEs can have both idiomatic and literal occurrences. For instance, “pulling strings” can be understood either as making use of one’s influence, or literally. Even if this phenomenon has been largely addressed in psycholinguistics, linguistics and natural language processing, the notion of a literal reading has rarely been formally defined or subjected to quantitative analyses. I will propose a syntax-based definition of a literal reading of an MWE. I will also present the results of a quantitative and qualitative analysis of this phenomenon in Polish, as well as in four typologically distinct languages: Basque, German, Greek and Portuguese. This study, performed on a multilingual annotated corpus of the PARSEME network, shows that literal readings constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts. |
21 January 2019 |
Marek Łaziński (University of Warsaw), Michał Woźniak (Jagiellonian University) |
Aspect in dictionaries and corpora. Why and how should aspect pairs be tagged in corpora? |
Corpora are generally tagged for grammatical categories, including the verbal aspect value. They all choose between pf and ipf; some of them add a third value: bi-aspectual (not present in the National Corpus of Polish). However, no Slavic corpus tags the aspect value of a verb form in reference to an aspect partner. If we can mark aspect pairs in dictionaries, it should also be possible in corpora, on the condition that we extrapolate the aspect features of a lexeme to specific verb forms in specific uses. While retaining the existing morphological tagging, including the aspect value, two more aspect tags have been added: 1) morphological markers of aspect and 2) a reference to a superlemma. The tag of every verb form in the corpus thus has three parts: 1) the existing grammatical characteristics (TAKIPI), 2) the repeated or corrected aspect value (including bi-aspectual) and morphological markers, 3) a reference to the aspect pair (superlemma). A corpus tagged for aspect pairs, even with alternative references for every lexeme, opens new perspectives for research. The possibilities are especially rich in a parallel corpus with one Slavic and one aspectless language, such as the Mainz-Warsaw Corpus. In order to check the usefulness of our aspect pair tagging, a series of queries will be built that allow us to compare the grammatical profiles of pf and ipf partners in suffixal and prefixal aspect pairs. |
11 February 2019 |
|
Anna Wróblewska (Applica / Warsaw University of Technology), Filip Graliński (Applica / Adam Mickiewicz University) |
|
|
|
How do we tackle text modeling challenges in business applications? We will present a prototype architecture for the automation of processes in text-based work and a few use cases of machine learning models. The use cases will cover emotion detection, abusive language recognition and more. We will also show our tool for explaining suspicious findings in datasets and in the models' behaviour. |
28 February 2019 |
|
Jakub Dutkiewicz (Poznan University of Technology) |
|
We discuss the results and evaluation procedures of the bioCADDIE 2016 challenge on the search of precision medical data. Our good results are due to word embedding query expansion with appropriate weights. Information Retrieval (IR) evaluation is demanding because of the considerable effort required to judge over 10,000 documents. A simple sampling method was proposed over 10 years ago for estimating Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) despite incomplete judgments. For this method to work, the number of judged documents has to be relatively large. Such conditions were not fulfilled in the bioCADDIE 2016 challenge or in TREC PM 2017 and 2018. The specificity of the bioCADDIE evaluation makes post-challenge results incompatible with those judged during the contest. In bioCADDIE, for some questions there were no documents judged relevant. The results are strongly dependent on the cut-off rank. As a result, in the bioCADDIE challenge infAP is weakly correlated with infNDCG, and the error could be up to 0.15-0.20 in absolute value. We believe that the deviation of evaluation measures may override the primary role of the measure in such a case. We corroborate this claim by simulating synthetic results. We propose a simulated environment with properties that mirror those of real systems. We implement a number of evaluation measures within the simulation and discuss the usefulness of the measures for a partially annotated collection of documents with regard to the collection size, the number of annotated documents and the proportion between the numbers of relevant and irrelevant documents. In particular, we focus on the behavior of the aforementioned AP and NDCG and their inferred versions. Other studies suggest that infNDCG correlates weakly with other measures and therefore should not be selected as the most important measure. |
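For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of the standard, fully judged AP and NDCG in Python. It is not the inferred variants (infAP, infNDCG) and not the speaker's simulation code. The function names, the optional `total_relevant` and `k` parameters, the choice to build the ideal ranking from the retrieved list only, and the convention of returning 0.0 when a query has no relevant documents are all assumptions made for illustration.

```python
import math


def average_precision(relevance, total_relevant=None):
    """AP of a ranked list of binary labels (1 = relevant, 0 = not relevant).

    Assumption: if total_relevant is not given, every relevant document
    is taken to appear somewhere in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denom = total_relevant if total_relevant is not None else hits
    # Assumed convention: a query with no relevant documents scores 0.0.
    return precision_sum / denom if denom else 0.0


def dcg(gains, k=None):
    """Discounted cumulative gain of graded gains, cut off at rank k."""
    if k is not None:
        gains = gains[:k]
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG divided by the DCG of the ideal reordering of the same list."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranking = [0, 1]  # the only relevant document is ranked second
    print(average_precision(ranking))  # 0.5
    print(ndcg(ranking, k=1))          # 0.0: the cut-off hides the relevant document
    print(ndcg(ranking))               # about 0.63 without a cut-off
```

The last two calls illustrate, on a toy ranking, how strongly a score can depend on the cut-off rank; a query with no relevant judgments yields 0.0 for both measures under the convention assumed here.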
21 March 2019 (NOTE: Thursday!) |
Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences) |
During the seminar, the results of work on reducing the size of language models will be discussed. The author will review the literature on the size reduction of recurrent neural networks (in terms of language models). Then, the author's own implementations will be presented along with evaluation results on different Polish and English corpora. |
25 March 2019 |
Łukasz Dębowski (Institute of Computer Science, Polish Academy of Sciences) |
GPT-2 is the latest neural statistical language model by the OpenAI team. A statistical language model is a probability distribution over texts that can be used for automatic text generation. In essence, GPT-2 turned out to be a surprisingly good generator of semantically coherent texts several paragraphs long, pushing the boundaries of what has seemed technically possible so far. Anticipating the use of GPT-2 to generate fake news, the OpenAI team decided to publish only a version of the model reduced tenfold. In my talk, I will share some remarks about GPT-2. |
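The idea of a language model as a probability distribution over texts that doubles as a text generator can be illustrated with a toy count-based bigram model. This is only a sketch of the general concept, not of GPT-2, which is a large neural network; the corpus, the function names and the fixed random seed below are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def prob(counts, prev, nxt):
    """Estimate P(nxt | prev) by relative frequency (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


def sequence_prob(counts, tokens):
    """Probability of a token sequence, conditioned on its first token."""
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= prob(counts, prev, nxt)
    return p


def generate(counts, start, max_len=10, seed=0):
    """Sample a continuation token by token from the estimated distribution."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_len and counts[out[-1]]:
        followers = counts[out[-1]]
        choice = rng.choices(list(followers), weights=list(followers.values()))[0]
        out.append(choice)
    return out


if __name__ == "__main__":
    # Hypothetical toy corpus, already tokenised.
    corpus = "the cat sat on the mat . the cat ate .".split()
    model = train_bigram(corpus)
    print(sequence_prob(model, ["the", "cat", "sat"]))  # 2/3 * 1/2
    print(" ".join(generate(model, "the")))
```

The same two operations, scoring a text and sampling one, are what a neural model provides as well; the difference lies in how the conditional probabilities are estimated and how much context they take into account.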
15 April 2019 (NOTE: planned for 13:00!) |
Alina Wróblewska, Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences) |
Talk title will be available shortly |
Talk summary will be available shortly. |
13 May 2019 |
Piotr Niewiński, Maria Pszona (Samsung R&D Poland), Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences) |
Talk title will be available shortly |
Talk summary will be available shortly. |
27 May 2019 |
Magdalena Zawisławska (University of Warsaw) |
Polish corpus of synaesthetic metaphor |
Talk summary will be available shortly. |
Please see also the talks given in 2000–2015 and 2015–2018. |