Natural Language Processing Seminar 2017–2018

2 October 2017

Paweł Rutkowski (University of Warsaw)

https://www.youtube.com/watch?v=Acfdv6kUe5I Polish Sign Language from the perspective of corpus linguistics  Talk delivered in Polish. Slides in English.

Polish Sign Language (polski język migowy, PJM) is a full-fledged visual-spatial language used by the Polish Deaf community. It started to evolve in the second decade of the nineteenth century, with the foundation of the first school for the deaf in Poland. Until recently, PJM attracted very little attention from the linguistic community in Poland. The aim of this talk is to present a large-scale research project aimed at creating an extensive and representative corpus of PJM. The corpus is currently being compiled at the University of Warsaw. It is a collection of video clips showing Deaf people using PJM in a variety of communication contexts. The videos are richly annotated: they are segmented, lemmatized, translated into Polish, tagged for various grammatical features and transcribed with HamNoSys symbols. The Corpus of PJM is currently one of the two largest sets of annotated sign language data in the world. Special attention will be paid to the issue of lexical frequency in PJM. Studies of this type are available for only a handful of sign languages, including American Sign Language, New Zealand Sign Language, British Sign Language, Australian Sign Language and Slovene Sign Language. Their empirical basis ranged from 100,000 tokens (NZSL) to as few as 4,000 tokens (ASL). The present talk contributes to our understanding of lexical frequency in sign languages by analyzing a much larger set of relevant data from PJM.

23 October 2017

Katarzyna Krasnowska-Kieraś, Piotr Rybak, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=8qzqn69nCmg Towards the evaluation of feature embedding models of the fusional languages in the context of morphosyntactic disambiguation and dependency parsing  Talk delivered in Polish.

Neural networks have recently been very successful in various natural language processing tasks. An important component of a neural network approach is a dense vector representation of features, i.e. feature embedding. Various feature types are possible, e.g. words or part-of-speech tags. In our talk we are going to present the results of an analysis showing what should be used as features when estimating embedding models for fusional languages – tokens or lemmata. Furthermore, we are going to discuss the methodological question of whether the results of the intrinsic evaluation of embeddings are informative for downstream applications, or whether the embedding models should be evaluated extrinsically. The accompanying experiments were conducted on Polish – a fusional Slavic language with a relatively free word order. This research has inspired us to implement a morphosyntactic disambiguator – Toygger (Krasnowska-Kieraś, 2017). The tool won Task 1 (A) of the PolEval 2017 competition and will be presented in our talk.

6 November 2017

Szymon Łęski (Samsung R&D Poland)

https://www.youtube.com/watch?v=266ftzwmKeU Deep neural networks in language models  Talk delivered in Polish. Slides in English.

In my talk I will first give an introduction to language models: traditional ones, based on n-grams, and new ones, based on recurrent networks. Then, based on recent papers, I will discuss the most interesting extensions and modifications to RNN-based language models, such as modified word representations or models whose output is not limited to a pre-defined vocabulary.
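As a rough illustration of the traditional n-gram approach mentioned in the abstract (it is not taken from the talk), a bigram model can be sketched in a few lines. The function names, the toy corpus, the `<s>`/`</s>` boundary markers and the choice of add-one smoothing are all assumptions made for this example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenised sentences."""
    contexts = Counter()  # how often each token appears as a left context
    bigrams = Counter()   # how often each (previous, current) pair appears
    vocab = {"</s>"}      # tokens the model can predict
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Hypothetical toy corpus, used only to exercise the functions.
contexts, bigrams, vocab = train_bigram_model(["the cat sat", "the dog sat"])
p_cat_given_the = bigram_prob("the", "cat", contexts, bigrams, vocab)
```

With smoothing, an unseen pair such as ("the", "sat") still receives a non-zero probability, which is the main practical weakness of raw n-gram counts that such models have to work around.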

20 November 2017

Michał Ptaszyński (Kitami Institute of Technology, Japan)

https://www.youtube.com/watch?v=hUtI5lCyUew Capturing Emotions in Context as a way towards Computational Phronesis  Talk delivered in Polish.

Research on emotions within Artificial Intelligence and related fields has flourished in recent years. Unfortunately, in most research emotions are analyzed without their context. I will argue that recognizing emotions without recognizing their context is incomplete and cannot be sufficient for real-world applications. I will also describe some consequences of disregarding the context of emotions. Finally, I will present one approach in which the context of emotions is considered, and briefly describe some of the first experiments performed in this area.

27 November 2017

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Automated coreference resolution in Polish  Talk delivered in Polish.

The talk presents the description of nominal referential constructs in Polish (i.e. textual fragments referencing the same discourse entities) and the computational-linguistic methods implemented for their decoding. The algorithms are corpus-based with manual annotation of coreferential constructs and are evaluated using standard metrics.

4 December 2017

Adam Dobaczewski, Piotr Sobotka, Sebastian Żurowski (Nicolaus Copernicus University in Toruń)

https://www.youtube.com/watch?v=az06czLflMw Dictionary of Polish reduplications and repetitions  Talk delivered in Polish.

In our talk we will present a dictionary prepared by the team from the Institute of Polish Language of the Nicolaus Copernicus University in Toruń (grant NPRH 11H 13 0265 82). In the dictionary we document expressions of the Polish language in which reduplication or repetition of forms of the same lexeme can be observed. We distinguish the units of language according to Bogusławski's operational grammar framework and divide them into two basic groups: (i) lexical units consisting of two such segments or forms of the same lexeme (Pol. całkiem całkiem; fakt faktem); (ii) operational units based on some pattern of repetition of words belonging to a certain class predicted by this scheme (Pol. N[nom] N[inst] ale _, where N stands for any noun, e.g. sąd sądem, ale _; miłość miłością, ale _). We have prepared the dictionary in traditional (printed) form due to the relatively small number of registered units. Its material base is the resources of the NKJP, which were searched using a dedicated search engine for repetitions in the NKJP. This tool was specially prepared for this project at the LEG ICS PAS.

29 January 2018

Roman Grundkiewicz (Adam Mickiewicz University in Poznań/University of Edinburgh)

https://www.youtube.com/watch?v=dj9rTwzDCdA Automatic Grammatical Error Correction using Machine Translation  Talk delivered in Polish. Slides in English.

In my presentation I will talk about the task of automated grammatical error correction (GEC) in texts written by non-native English speakers. I will present our experiments on applying phrase-based statistical machine translation (SMT), and our GEC system, which achieved new state-of-the-art results. I will discuss the importance of optimizing parameters towards the task-specific evaluation metric, as well as new GEC-adapted dense and sparse features. I will also briefly describe the results of further research using neural machine translation (NMT).

12 February 2018

Agnieszka Mykowiecka, Aleksander Wawer, Małgorzata Marciniak, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=9QPldbRyIzU Recognition of metaphorical noun phrases in Polish with distributional semantics  Talk delivered in Polish.

Our talk addresses the use of vector models for Polish based on lemmas and forms. We compare the results for two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. Then we apply vector models to detect metaphorical and literal meaning of adjective-noun (AN) phrases. We show the results of our method for isolated phrases and compare them to other known methods. Finally, we discuss the problem of recognition of metaphorical/literal meaning of AN phrases in sentences.

26 February 2018

Celina Heliasz (University of Warsaw)

To create or to contribute? On the search for synergy between computer scientists and linguists  Talk delivered in Polish.

The main topic of my presentation is the methods of conducting research in the field of corpus linguistics, which is currently being addressed by both computer scientists and linguists. I will present the attempts to recognize and visualize semantic relations in text undertaken by computer scientists as part of two projects: RST (Rhetorical Structure Theory) and PDTB (Penn Discourse Treebank). Then, I will contrast RST and PDTB with analogous attempts made by computer scientists and linguists at IPI PAN as part of the CLARIN-PL venture. The aim of the presentation is to show the determinants of effective linguistic analysis which must be taken into account when designing IT tools, if these tools are to support research on text and yield strong foundations for linguistic theories, and not only to implement existing theories in this field.

9 April 2018

Jan Kocoń (Wrocław University of Technology)

https://www.youtube.com/watch?v=XgSyuWEHWhU Recognition of temporal expressions and events in Polish text documents  Talk delivered in Polish.

A temporal expression is a sequence of words that conveys when or how often an event occurs, or how long it lasts. Event descriptions are words which indicate a change of state in the description of reality (and also some states). These issues fall within the scope of information extraction. They are well defined and described for English and partly for other languages. The TimeML specification, whose temporal information description language has been accepted as an ISO standard, has been officially adapted for six languages, and the temporal expressions description section is defined for eleven languages. The result of the work carried out within CLARIN-PL is the adaptation of TimeML guidelines for Polish. The motivation for this topic was the fact that temporal information is used in various natural language processing tasks, including question answering, automatic text summarisation, semantic relation extraction and many others. These methods allow researchers in Digital Humanities and Social Sciences to work with very large collections of texts whose analysis would otherwise be very time-consuming, if possible at all. In addition to the adaptation of the temporal information description language itself, the quality and efficiency of methods is a key aspect of recognizing temporal expressions and events. The presentation will discuss both the analysis of the quality of data prepared by domain experts (including annotation agreement analysis) and the results of research aimed at reducing the complexity of the computational problem while preserving the quality of methods.

23 April 2018

Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska (Institute of the Polish Language, Polish Academy of Sciences), Witold Kieraś, Dorota Komosińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=APvZdALq6ZU Historical corpus – problems of transliteration, transcription and annotation on the example of the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772)  Talk delivered in Polish.

During the seminar, the process of creating the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), also called the Baroque Corpus, will be discussed. Particular emphasis will be placed on those tasks and problems that are specific to historical corpora, in contrast to corpora of contemporary texts, e.g. the National Corpus of Polish. We will also show the tools that were created for the project or adapted to its needs. After a general presentation of the project (assumptions, financing, team, current status, purpose of the corpus) we will discuss particular problems in the order in which they appeared during the creation of the corpus: selecting texts, gathering them and incorporating them into a database, the need to transcribe them into modern spelling (resulting from the great variation in spelling in old prints and manuscripts), issues of morphological analysis, morphosyntactic annotation (manual and automatic) and corpus searching.

14 May 2018

Łukasz Kobyliński, Michał Wasiluk, Zbigniew Gawłowicz (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=QpmLVzqQfcM MTAS corpus search engine and its implementation for Polish language corpora  Talk delivered in Polish.

During the seminar we will discuss our experiences with the MTAS search engine in the context of Polish language corpora. We will present several implementations of MTAS in such corpus-related projects as KORBA (the corpus of 17th and 18th century Polish), the 19th century language corpus, as well as the National Corpus of Polish. We will also discuss preliminary experiments with implementing MTAS in Korpusomat, a tool that allows users to create their own corpora. During the presentation we will share our solutions to the problems encountered during the adaptation of MTAS to Polish, as well as preliminary efficiency test results. We will also discuss the search capabilities of the engine and our plans for enhancing MTAS.

21 May 2018 (IPI PAN seminar presentation, 13:00)

Piotr Borkowski (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=o2FFtfrqh3I Semantic methods of categorization in the tasks of text document analysis  Talk delivered in Polish.

In my PhD thesis entitled "Semantic methods of categorization in the tasks of text document analysis", a new algorithm for the semantic categorization of documents was proposed and examined. On its basis, a new algorithm for category aggregation, a family of semantic classification algorithms, and a heterogeneous classifier committee (which combines the semantic categorization algorithm with previously known classifiers) were developed. In my talk I will briefly present the ideas behind them and the results of studies of their effectiveness.

28 May 2018

Krzysztof Wołk (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=FyeVRSXbBOg Exploration and usage of comparable corpora in machine translation  Talk delivered in Polish.

The problem that will be presented in the seminar is how to improve machine translation of speech between Polish and English. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation. Polish language resources are lacking in parallel and monolingual data. Therefore, the main objective of the study was to develop an automatic toolkit for preparing textual resources by mining comparable and quasi-comparable corpora. Experiments were conducted mostly on casual human speech, consisting of lectures, movie subtitles, European Parliament proceedings, and European Medicines Agency texts. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., to adapt techniques and training parameters so as to maximize the Bilingual Evaluation Understudy (BLEU) score. A further aim was to create additional bilingual and monolingual data resources by using available online data and by obtaining and mining comparable corpora for parallel sentence pairs. For this task, a methodology employing a Support Vector Machine and the Needleman-Wunsch algorithm was used, along with a chain of specialized tools.
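The abstract does not say how the Needleman-Wunsch algorithm was applied in the toolkit, so the following is only a generic sketch of the algorithm itself: a global alignment score computed by dynamic programming over two sequences (characters or tokens). The function name and the match, mismatch and gap scores are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


# Hypothetical token sequences, used only to exercise the function.
alignment_score = needleman_wunsch_score(["a", "b", "c"], ["a", "c"])
```

Because the function only indexes and compares elements, it works equally on strings and on lists of tokens; recovering the alignment itself would additionally require a traceback through the score matrix.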

4 June 2018

Piotr Przybyła (University of Manchester)

https://www.youtube.com/watch?v=thHOtqsfsys Supporting document screening for systematic reviews using machine learning and text mining  Talk delivered in Polish.

Systematic reviews, which aim to aggregate and analyse all the literature for a given research question, are a crucial tool in medical research. Their most laborious stage is screening, i.e. manual selection of dozens of relevant articles from the thousands returned by search engines. Formulating the problem as a text classification task and using appropriate unsupervised text mining tools could save a significant amount of work. The presentation will cover the adaptation of machine learning algorithms to the problem, tools for extracting and visualising terms and topics in collections, and system deployment and evaluation at NICE (National Institute for Health and Care Excellence), a UK agency publishing health technology guidelines.

11 June 2018

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=mc8T5rXlk1I Preparing a speech corpus using the recordings of the Polish Film Chronicle  Talk delivered in Polish. Slides in English.

The presentation will describe how a speech corpus based on the Polish Film Chronicle, a collection of short historical news segments, was created during the CLARIN-PL project. This resource is an extremely useful tool for linguistic research, specifically in the context of historical speech and language. The years 1945–1960 were chosen for this purpose. The presentation will discuss various topics, from the legal issues of acquiring the resources to the more technical aspects of adapting speech analysis tools to this rather uncommon domain.