

Natural Language Processing Seminar 2023–2024

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.


9 October 2023

Agnieszka Mikołajczyk-Bareła, Wojciech Janowski (VoiceLab), Piotr Pęzik (University of Łódź / VoiceLab), Filip Żarnecki, Alicja Golisowicz (VoiceLab)

https://www.youtube.com/watch?v=q5nCUwhj2us TRURL.AI: Fine-tuning large language models on multilingual instruction datasets  Talk delivered in Polish.

This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived, in a number of areas such as information extraction from customer-agent interactions and data labeling, with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.
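
For readers unfamiliar with instruction tuning, the sketch below shows what a single record in such a dataset typically looks like; the field names and the Polish example are illustrative assumptions, not the actual Trurl training data.

```python
# A generic illustration of an instruction-tuning record; the field names and
# the example content are assumptions, not the actual Trurl training data.
import json

record = {
    "instruction": "Wypisz imiona i nazwiska osób wymienionych w rozmowie.",
    "input": "Konsultant: Dzień dobry. Klient: Mówi Jan Kowalski, dzwonię w sprawie faktury.",
    "output": "Jan Kowalski",
}
# During fine-tuning, many such records (in Polish and English) are rendered
# into prompts and the model is trained to generate the "output" field.
print(json.dumps(record, ensure_ascii=False, indent=2))
```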

16 October 2023

Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki (Wrocław University of Science and Technology)

https://www.youtube.com/watch?v=ehBE6qTKlcM Evaluation of information retrieval models in zero-shot settings on different document domains  Talk delivered in English.

Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in the open-domain setting, but current architectures and extensive datasets allow neural models to surpass lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.
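
As a point of reference for the lexical-matching baseline mentioned above, here is a minimal, self-contained sketch of Okapi BM25 scoring over a toy whitespace-tokenized corpus; the parameter values k1 and b are the common defaults and the example documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with classic BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    N = len(tokenized)
    # document frequency of each query term
    df = {t: sum(1 for d in tokenized if t in d) for t in set(query.lower().split())}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

print(bm25_scores("neural retrieval", ["neural models for retrieval", "lexical matching with BM25"]))
```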

30 October 2023

Agnieszka Faleńska (University of Stuttgart)

https://www.youtube.com/watch?v=6Kgj0N4MvIA Steps towards Bias-Aware NLP Systems  Talk in English.

For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.

In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.

13 November 2023

Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences)

Advancing Polish Question Answering: Datasets and Models  Talk delivered in Polish. Slides in English.

Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent from the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the PolQA and MAUPQA datasets and the Silver Retriever model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.

11 December 2023 (a series of short invited talks by Coventry University researchers)

Xiaorui Jiang, Opeoluwa Akinseloyin, Vasile Palade (Coventry University)

https://www.youtube.com/watch?v=_BnuR3fY1FY Towards More Human-Effortless Systematic Review Automation  Talk delivered in English.

Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as citation screening, i.e., the selection of primary studies based on title and abstract. Some tools exist, but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic reviews, namely the fact that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported.

Kacper Sówka (Coventry University)

https://www.youtube.com/watch?v=Of8-cfhvzXU Attack Tree Generation Using Machine Learning  Talk delivered in English.

My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models from public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities, using a DistilBERT encoder fine-tuned with Masked Language Modelling.
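
To make the setup concrete, below is a minimal sketch (not the author's actual model) of comparing two vulnerability descriptions with a DistilBERT encoder and cosine similarity over mean-pooled hidden states; the checkpoint name and the CVE-style texts are assumptions.

```python
# Minimal sketch: score the relatedness of two vulnerability descriptions
# with a DistilBERT encoder; a real siamese setup would fine-tune this.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states into a single vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

cve_a = "Buffer overflow in the HTTP parser allows remote code execution."
cve_b = "Heap-based overflow in request parsing leads to arbitrary code execution."
similarity = torch.nn.functional.cosine_similarity(embed(cve_a), embed(cve_b), dim=0)
print(float(similarity))
```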

Xiaorui Jiang (Coventry University)

https://www.youtube.com/watch?v=UCiOk0AZa0M Towards Semantic Science Citation Index  Talk delivered in English.

It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which we believe can serve as a certain type of discourse model for summarisation (in future work). As a small step further, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.

Xiaorui Jiang, Alireza Daneshkhah (Coventry University)

https://www.youtube.com/watch?v=5z7rdnafpjU Natural Language Processing for Automated Triaging at NHS  Talk in English.

In the face of a post-COVID global economic slowdown and an aging society, primary care units in the National Health Service (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools, automating work and reducing delays. This short presentation reports on initial studies conducted with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early-alert application for gout flare prediction from chief complaints (to avoid delays in patient treatment and management).

8 January 2024

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=W_A8W_Hu73I ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project  Talk delivered in Polish.

The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the ParlaMint project with their corresponding audio and video footage available online. The task was naturally more involved than it may seem intuitively, and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.

12 February 2024

Tsimur Hadeliya, Dariusz Kajtoch (Allegro ML Research)

https://www.youtube.com/watch?v=b8FE2_lzfE8 Evaluation and analysis of in-context learning for Polish classification tasks  Talk in English.

With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.
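
The core of the ICL approach described above is the prompt itself; the sketch below shows one hypothetical way to assemble such a prompt for a Polish sentiment task, with invented demonstrations and labels.

```python
# A minimal sketch of an in-context learning prompt for Polish sentiment
# classification; the demonstrations and labels are illustrative only.
def build_icl_prompt(demonstrations, query):
    lines = ["Classify the sentiment of the Polish sentence as positive or negative.", ""]
    for text, label in demonstrations:
        lines.append(f"Sentence: {text}\nLabel: {label}\n")
    lines.append(f"Sentence: {query}\nLabel:")
    return "\n".join(lines)

demos = [
    ("Obsługa była bardzo miła i pomocna.", "positive"),
    ("Produkt przyszedł uszkodzony, nie polecam.", "negative"),
]
prompt = build_icl_prompt(demos, "Świetna jakość w rozsądnej cenie.")
print(prompt)  # the prompt would then be sent to a generative model
```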

29 February 2024

Seminar on analysis of parliamentary data  All talks in Polish.

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Polish Parliamentary Corpus and ParlaMint corpus

Bartłomiej Klimowski (University of Warsaw)

An application for analysing the sentiment of utterances by Polish MPs

Konrad Kiljan (University of Warsaw), Ewelina Gajewska (Warsaw University of Technology)

Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine

Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences), Anna Jamka (University of Warsaw)

Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus

Marek Łaziński (University of Warsaw)

Changes in the Polish language of the last hundred years in the mirror of parliamentary debates

25 March 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=IS_Miy2o8-A Are text credibility classifiers robust to adversarial actions?  Talk in Polish.

Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that the authors of such texts often have a strong motivation to spread them and might try to modify the original content until they find a reformulation that gets through automatic filters. Such modified variants of the original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at the CheckThat! evaluation lab at CLEF 2024.
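
A minimal sketch of the kind of word-substitution attack described above; the classify function and the synonym table are hypothetical stand-ins, not the BODEGA framework.

```python
# Sketch of an adversarial word-substitution loop: try synonym replacements
# until the (toy) classifier's decision flips.
def classify(text: str) -> str:
    # stand-in for a trained credibility classifier
    return "non-credible" if "miracle" in text.lower() else "credible"

SYNONYMS = {"miracle": ["remarkable", "unexpected"], "cure": ["treatment", "remedy"]}

def adversarial_rewrite(text: str) -> str:
    original_label = classify(text)
    words = text.split()
    for i, word in enumerate(words):
        for candidate in SYNONYMS.get(word.lower(), []):
            attempt = " ".join(words[:i] + [candidate] + words[i + 1:])
            if classify(attempt) != original_label:
                return attempt  # a single replacement flipped the decision
    return text

print(adversarial_rewrite("This miracle cure works instantly"))
```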

28 March 2024

Krzysztof Węcel (Poznań University of Economics and Business)

https://www.youtube.com/watch?v=Om1ypFnYUIE Credibility of information in the context of fact-checking process  Talk in Polish.

The presentation will focus on the topics of the OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources, whose reputation may result from publications sourced from OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs in various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.

25 April 2024

Seminar summarising the work on the Corpus of Modern Polish (Decade 2011-2020)  All talks in Polish.

11:30–11:35: About the project (Małgorzata Marciniak)

11:35–12:05: The Corpus of Modern Polish, Decade 2011-2020 (Marek Łaziński)

12:05–12:35: Annotation, lemmatisation, frequency lists (Witold Kieraś)

12:35–13:00: Coffee break

13:00–13:30: Hybrid representation of syntactic information (Marcin Woliński)

13:30–14:15: Discussion on the future of corpora

13 May 2024

Michal Křen (Charles University in Prague)

Latest developments in the Czech National Corpus  Talk in English.

The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized. The talk will conclude with an outline of future plans.

3 June 2024 (the talk given at the institute seminar)

Marcin Woliński, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)

Constituency and dependency parsing of natural language using neural networks  Talk in Polish.

In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. The method was used to introduce a syntactic layer of annotation in the Corpus of Contemporary Polish Language developed at IPI PAN.

4 July 2024

Purificação Silvano (University of Porto)

https://www.youtube.com/watch?v=VUnZIrr2Av8 Unifying Semantic Annotation with ISO 24617 for Narrative Extraction, Understanding and Visualisation  Talk in English.

In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. Subsequently, I will explore the applications of this annotation, specifically highlighting the Text2Story and StorySense projects, which focus on narrative extraction, understanding and visualisation of the journalistic text.



Natural Language Processing Seminar 2022–2023

3 October 2022

Sławomir Dadas (National Information Processing Institute)

https://www.youtube.com/watch?v=TGwLeE1Y5X4 Our experience with training neural sentence encoders for the Polish language  Talk delivered in Polish.

Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering or plagiarism detection. A simple method to construct such a representation is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture as well as our experiences with training such models for the Polish language. In addition, we will discuss approaches for building large datasets of paraphrases using publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.
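
The simple aggregation baseline mentioned at the start of the abstract can be illustrated in a few lines; the toy embedding table below is invented, and fine-tuned sentence encoders replace exactly this kind of crude averaging.

```python
# Sketch of the averaging baseline: a sentence vector is the mean of its
# word vectors. The tiny 4-dimensional "embeddings" are illustrative only.
import numpy as np

EMB = {
    "umowa":    np.array([0.9, 0.1, 0.0, 0.2]),
    "klauzula": np.array([0.8, 0.2, 0.1, 0.1]),
    "kot":      np.array([0.0, 0.9, 0.7, 0.0]),
}

def sentence_vector(tokens):
    vectors = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vectors, axis=0) if vectors else np.zeros(4)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

v1 = sentence_vector(["umowa", "klauzula"])
v2 = sentence_vector(["kot"])
print(cosine(v1, v2))
```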

14 November 2022

Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Dominik Szymczak, Arkadiusz Janz, Piotr Szymański, Marcin Wątroba, Mikołaj Morzy, Tomasz Jan Kajdanowicz, Maciej Piasecki (Wrocław University of Science and Technology)

https://pwr-edu.zoom.us/j/96657909989?pwd=VXFmcEc5blNyM0M3ekxvNGc3Q2Rsdz09 This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish  Talk delivered in Polish. Slides in English.

The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools for comparing large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark (klej is the word for glue in Polish) has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (lepiszcze is the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.

28 November 2022

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

https://www.youtube.com/watch?v=zVbQ7gmbqvA Lying in Polish: language analysis and methods of automated detection  Talk delivered in Polish.

Lying is an integral part of daily communication in both written and oral form. In this presentation, we will present the results obtained on a collection of nearly 1,500 true and false statements, half of which are transcripts and the other half are written statements, from probably the largest study on lying in the Polish language. In the first part of the presentation, we will examine the differences between true and false statements: we will check whether they differ in terms of complexity and sentiment, as well as characteristics such as length, concreteness and distribution of parts of speech. In the second part of the presentation, we will discuss models that automatically distinguish true from false statements. We will cover simple approaches, such as models trained on dictionary features, as well as more complex, pre-trained transformer neural networks. We will also talk about an attempt to detect lying with the use of automated fact-checking and present the preliminary results of work on the interpretability (explanations) of lie detection models.

19 December 2022

Wojciech Kryściński (Salesforce Research)

https://www.youtube.com/watch?v=54qidiBmiok Long Story Short: A Talk about Text Summarization  Talk and slides in English.

Automatic Text Summarization is a challenging task within Natural Language Processing that requires advanced language understanding and generation capabilities. In recent years substantial progress has been made in developing neural models for the task thanks to the efforts of the research community and advancements in the broader field of NLP. Despite this progress, text summarization remains a challenging task that is far from being solved. In this talk, we will first discuss the early approaches and the current state of the field. Next, we will critically evaluate key ingredients of the existing research setup: datasets, evaluation metrics, and models. Finally, we will focus on emerging research directions and consider the future of text summarization.

9 January 2023

Marzena Karpińska (University of Massachusetts Amherst)

Challenges in Evaluation of Machine Generated Text  Talk delivered in Polish.

The recent progress in natural language generation (NLG) has made it difficult for researchers to effectively evaluate the output of their models. Traditional metrics, such as BLEU and ROUGE, are no longer sufficient to distinguish between high quality and low quality outputs, especially in open-ended tasks like story and poetry generation, or at the paragraph level. As a result, many researchers rely on crowdsourced human evaluations of text quality, using platforms like Amazon Mechanical Turk (AMT) to collect ratings of coherence or grammaticality. In this talk, I will first present a series of experiments highlighting the challenges and pitfalls of such approaches showing that even experts may struggle to accurately evaluate model-generated text using Likert-style scales, especially in the story generation task. Next, I will address similar issues in automatic evaluation of machine translation of the literary domain, and outline some unique difficulties inherent in the translation task itself.

6 February 2023

Agnieszka Mikołajczyk-Bareła (VoiceLab / Politechnika Gdańska / HearAI)

https://www.youtube.com/watch?v=f5wt381IYeI HearAI: Towards Deep learning-based Sign Language Recognition  Talk delivered in Polish. Slides in English.

Deaf and hearing-impaired people face a huge communication barrier. Different nationalities use different sign languages, and there is no universal one, as they are natural human languages with their own grammatical rules and lexicons. Deep learning-based methods for sign language translation need a lot of adequately labeled training data to perform well. In the HearAI non-profit project, we addressed this problem and investigated different multilingual open sign language corpora labeled by linguists in the language-agnostic Hamburg Notation System (HamNoSys). First, we simplified the difficult-to-understand structure of the HamNoSys without significant loss of gloss meaning by introducing numerical multilabels. Second, we utilized estimated pose landmarks and selected video keyframes' image-level features to recognize isolated glosses. We separately analyzed possibilities of dominant hand location, its position and shape, and overall movement symmetry, which allowed us to deeply explore the usefulness of HamNoSys for gloss recognition.

13 February 2023

Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka (Adam Mickiewicz University in Poznań)

AMU at WMT 2022: state-of-the-art machine translation methods  Talk delivered in Polish. Slides in English.

The majority of machine translation systems are trained at the sentence level. However, today, the expectation is that the translation system will take into account the context of the entire document. To meet this expectation, the organizers of the WMT 2022 conference created the General MT Task, which involves translating texts from different domains: news articles, social media content, conversations, and e-commerce texts. The presentation will discuss the task faced during the WMT 2022 conference in the Czech-Ukrainian and Ukrainian-Czech translation directions. The encountered problems such as correct translation of named entities, consideration of document context, and proper inclusion of rarely used characters like emojis will be discussed. Additionally, methods for selecting the best translation among the translations generated by the system using automatic translation quality assessment models will be presented. The primary goal of the presentation is to showcase the components of the system that contributed to achieving the best results among all shared task participants.

27 February 2023

Sebastian Vincent (University of Sheffield)

https://www.youtube.com/watch?v=An6sNU50UVM MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation  Talk partially delivered in Polish; most of the talk and the slides in English.

Efficient use of both intra- and extra-textual context is one of the critical gaps between human and neural machine translation. Research so far has mostly focused on individual, well-defined types of context, such as the surrounding text or discrete external variables such as the gender of the speaker. This work introduces MTCue, a novel neural machine translation framework which rewrites all context as text and learns an abstract representation of context enabling transfer across different data settings and leveraging similar attributes in low resource settings. Focusing on the domain of dialogue with access to document and metadata context, we evaluate multiple variants of MTCue, with four choices for context-source combination and several context vectorisation functions. Our experiments across six language pairs show gains in translation quality over a non-contextual baseline. Further analysis shows that the context encoder of MTCue learns a context space representation which is organised w.r.t. specific attributes such as formality, effectively enabling their zero-shot control. Pre-training on context embeddings also lets MTCue learn new control codes with less data than a tagging baseline.

27 March 2023

Julian Zubek, Joanna Rączaszek-Leonardi (Faculty of Psychology, University of Warsaw)

https://www.youtube.com/watch?v=RJrYftyDIzw Agent-based models of symbol emergence in communication inspired by processes of language development  Talk delivered in Polish.

Influenced by computer science, we come to understand symbols as discrete elements of an abstract structure on which formal operations are performed. Semiotically, symbols are a particular type of signs that function within a system of interdependencies and whose interpretation requires knowledge of the rules governing this system. From the perspective of language evolution and development, the emergence of symbolic structures and the ability to use them pose a number of basic questions. In our research program, we focus on how abstract symbols emerge together with the ability to perform physical actions in the world and how symbols can control these actions. To illustrate these relations, we use computer simulations in which agents coordinate their actions using a communication protocol that emerges from the bottom-up in a reinforcement learning scheme. We point out the assumptions underlying these types of models and the existing difficulties in modeling multiple sources of pressure shaping the structure of language. We present the results of our own simulations, illustrating a) the influence of interaction history on the structure of language, b) the relations between context availability and communication protocol ambiguity, c) the role of dialogue in the coordination and structuring of actions in a dynamic environment. The results show the complex nature of symbols, which requires complementary description at the level of formal structure and at the level of system dynamics. This complexity should also be reflected in the design and evaluation of artificial intelligence algorithms intended for interaction with humans.

24 April 2023

Mateusz Krubiński (Charles University in Prague)

A picture is worth a thousand words – on Multimodal Summarization  Talk delivered in Polish.

Automatic summarization is one of the basic tasks both in Natural Language Processing – text summarization – and in Computer Vision – video summarization. Multimodal summarization connects those two fields by creating a summary based on information from different modalities. To motivate such research, it’s enough to visit any news portal: the most popular multimedia news formats are now multimodal – the reader is often presented not only with a textual article but also with a short, vivid video. To draw the attention of the reader, such video-based articles are usually presented as a short textual summary paired with an image thumbnail.

In this talk, I will present a brief history of text-centric Multimodal Summarization - a formulation in which we require the textual modality to be present both in the input and in the output. I will show how the task evolved over the years and highlight what I believe to be the major challenges. In the second part, I will talk about my own experiments, focusing on pre-training and evaluation methodologies. I will also share my experience with creating a dataset based on information automatically collected from internet webpages, which shows that sometimes aiming lower may lead to a great outcome.

25 May 2023

Agata Savary (Université Paris-Saclay)

https://www.youtube.com/watch?v=Hzbjw5A7uec We thought the eyes of coreference were shut to multiword expressions and they mostly are  Talk delivered in Polish. Slides in English.

Multiword expressions are combinations of words which exhibit peculiar semantic properties such as different degrees of non-compositionality, decomposability, transparency and figuration. Long-standing linguistic debates suggest that such semantic idiosyncrasy conditions the morpho-syntactic configurations in which a given multiword expression can occur. This paper extends this argumentation to nominal coreference. Namely, we hypothesise that internal components of a multiword expression are unlikely to occur in coreference chains. While previous work noticed the rareness of coreference-related phenomena in the presence of multiword expressions, this observation has never been quantified, to the best of our knowledge. We bridge this gap by performing an automated corpus-based study of the intersections between verbal multiword expressions and nominal coreference in French. The results largely corroborate our hypothesis but also display various tendencies depending on the types of multiword expressions and the corpus genre. The analysis of the corpus examples highlights interesting properties of coreference, notably in speech.



Natural Language Processing Seminar 2021–2022

11 October 2021

Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Polyadic Quantifiers in Heterofunctional Coordination  Talk delivered in Polish.

The aim of this talk is to provide a semantic analysis of a construction – Heterofunctional Coordination – which is typical of Slavic and some neighbouring languages. In this construction, expressions bearing different grammatical functions may be conjoined. In this talk, I will propose a semantic analysis of such constructions based on the concept of generalized quantifiers (Mostowski; Lindström; Barwise and Cooper), and more specifically – polyadic quantifiers (van Benthem; Keenan; Westerståhl). Some familiarity with the language of predicate logic should suffice to fully understand the talk; all linguistic concepts (including "coordination", "grammatical functions") and logical concepts (including "generalized quantifiers" and "polyadic quantifiers") will be explained in the talk.

18 October 2021

Przemysław Kazienko, Jan Kocoń (Wrocław University of Technology)

https://www.youtube.com/watch?v=mvjO4R1r6gM Personalized NLP  Talk delivered in English.

Many natural language processing tasks, such as classifying offensive, toxic, or emotional texts, are inherently subjective in nature. This is a major challenge, especially with regard to the annotation process. Humans tend to perceive textual content in their own individual way. Most current annotation procedures aim to achieve a high level of agreement in order to generate a high quality reference source. Existing machine learning methods commonly rely on agreed output values that are the same for all annotators. However, annotation guidelines for subjective content can limit annotators' decision-making freedom. Motivated by moderate annotation agreement on offensive and emotional content datasets, we hypothesize that a personalized approach should be introduced for such subjective tasks. We propose new deep learning architectures that take into account not only the content but also the characteristics of the individual. We propose different approaches for learning the representation and processing of data about text readers. Experiments were conducted on four datasets: Wikipedia discussion texts labeled with attack, aggression, and toxicity, and opinions annotated with ten numerical emotional categories. All of our models based on human biases and their representations significantly improve prediction quality in subjective tasks evaluated from an individual's perspective. Additionally, we have developed requirements for annotation, personalization, and content processing procedures to make our solutions human-centric.

8 November 2021

Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=KeeVWXXQlw8 Dependency Trees in Automatic Inflection of Multi Word Expressions in Polish  Talk delivered in Polish.

Natural language generation for morphologically rich languages can benefit from automatic inflection systems. This work presents such a system, which can tackle inflection, with particular emphasis on Multi Word Expressions (MWEs). This is done using rules induced automatically from a dependency treebank. The system is evaluated on a dictionary of Polish MWEs. Additionally, a similar algorithm can be utilized for lemmatization of MWEs. In principle, the system can also be applied to other languages with similar morphological mechanisms. To prove that, we will present a simple solution for Russian.

29 November 2021

Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=zJssN3-5cyg When classification accuracy is not enough: Explaining news credibility assessment and measuring users' reaction  Talk delivered in Polish.

Automatic assessment of text credibility has recently become a very popular task in NLP, with many solutions proposed and evaluated through accuracy-based measures. However, little attention has been given to the deployment scenarios for such models that would reduce the spread of misinformation, as intended. Within the study presented here, two credibility assessment techniques were implemented in a browser extension, which was then used in a user study, allowing us to answer questions in three areas. Firstly, how can resource-intensive NLP models be compressed to work in a constrained environment? Secondly, which interpretability and visualisation techniques are most effective in human-computer cooperation? Thirdly, are users relying on such automated tools really more effective in spotting fake news?

6 December 2021

Joanna Byszuk (Institute of Polish Language, Polish Academy of Sciences)

Towards multimodal stylometry – possibilities and challenges of new approach to film and TV series analysis  Talk delivered in Polish.

This talk will present a proposal of a novel approach to the quantitative analysis of multimodal works on the example of a corpus of the Doctor Who television series, which draws from stylometry and the multimodal theory of film analysis. Stylometric methods have long been popular in the analysis of literary texts. They usually include comparison of texts based on the frequencies of use of selected features which create "stylometric fingerprints", i.e. patterns characteristic of authors, genres and other factors. They are, however, rarely applied to data other than text, with a few new approaches applying stylometry to the study of dance movements (works by Miguel Escobar Varela) or music (Backer and Kranenburg). The multimodal theory of film analysis is in turn a relatively new approach (developed primarily by John Bateman and Janina Wildfeuer), emphasizing the importance of examining information from various image, language and sound modalities for a more comprehensive interpretation. The presented approach uses the stylometric method of comparison but takes multiple types of features from various film modalities, i.e. features of image and sound as well as the content of the spoken dialogues. The talk will discuss the benefits and challenges of such an approach and of quantitative film media analysis in general.

20 December 2021

Piotr Pęzik (University of Łódź / VoiceLab), Agnieszka Mikołajczyk, Adam Wawrzyński (VoiceLab), Bartłomiej Nitoń, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Keyword Extraction with a Text-to-text Transfer Transformer (T5)  Talk delivered in Polish.

The talk will explore the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the newly released Polish Open Science Metadata Corpus (POSMAC), which is currently a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, i.e. plT5, extremeText, TermoPL and KeyBERT, and conclude that the T5 model yields particularly promising results for sparsely represented keywords. Furthermore, a plT5 keyword generation model trained on the POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
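
A minimal sketch of keyword generation with a text-to-text model via the Hugging Face transformers API; the checkpoint name and prompt prefix are assumptions for illustration, and the base plT5 checkpoint would still need fine-tuning of the kind described above to produce useful keywords.

```python
# Sketch (not the authors' pipeline) of text-to-text keyword generation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "allegro/plt5-base"  # assumed; any T5-style checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

abstract = "Praca opisuje metody automatycznej ekstrakcji słów kluczowych z tekstów naukowych."
inputs = tokenizer("keywords: " + abstract, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. comma-separated keywords
```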

31 January 2022

Tomasz Limisiewicz (Charles University in Prague)

https://www.youtube.com/watch?v=d1WHbE2gLjk Interpreting and Controlling Linguistic Features in Neural Networks’ Representations  Talk delivered in English.

Neural networks have achieved state-of-the-art results in a variety of tasks in natural language processing. Nevertheless, neural models are black boxes; we do not understand the mechanisms behind their successes. I will present the tools and methodologies used to interpret black box models. The talk will primarily focus on the representations of Transformer-based language models and our novel method — orthogonal probe, which offers good insight into the network's hidden states. The results show that specific linguistic signals are encoded distinctly in the Transformer. Therefore, we can effectively separate their representations. Additionally, we demonstrate that our findings generalize to multiple diverse languages. Identifying specific information encoded in the network allows removing unwanted biases from the representation. Such an intervention increases system reliability for high-stakes applications.

28 February 2022

Maciej Chrabąszcz (Sages)

https://www.youtube.com/watch?v=zB26bW-t5wA Natural Language Generation  Talk delivered in Polish.

The seminar focuses on the problem of generating image descriptions. The models presented were tested as part of creating a solution for automatic photo annotation. Among others, models with attention and models which use pre-trained vision and text-generation models will be presented.

28 March 2022

Tomasz Stanisławek (Applica)

https://www.youtube.com/watch?v=NrDh-UIfgwU Information extraction from documents with complex layout  Talk delivered in Polish.

The rapid development of the domain of NLP in recent years, and particularly the introduction of new language models (BERT, RoBERTa, T5, GPT-3), has popularised the use of information extraction techniques to automate business processes. Unfortunately, most business documents contain not only plain text, but also various types of graphical structures (for example: tables, lists, bold text, forms) that prevent correct processing with the currently existing methods (reading text as a sequence of tokens). During the presentation, I will discuss: a) problems with the existing methods used in the Information Extraction domain, b) Kleister - new datasets created for the purpose of testing new models, c) LAMBERT - a new language model with injection of information about the position of tokens, d) further directions of development of the field.

11 April 2022

Daniel Ziembicki (University of Warsaw), Anna Wróblewska, Karolina Seweryn (Warsaw University of Technology)

https://www.youtube.com/watch?v=cU1y78uFCps Polish natural language inference and factivity — an expert-based dataset and benchmarks  Talk delivered in Polish.

The presentation will focus on four themes: (1) the phenomenon of factivity in contemporary Polish, (2) the prediction of entailment, contradiction, and neutrality relations in text, (3) the linguistic dataset we built centered on the factivity-nonfactivity opposition, and (4) a discussion of the results of ML models trained on the dataset in (3) that aimed to predict the semantic relations from (2).

16 May 2022

Inez Okulska, Anna Zawadzka, Michał Szczyszek, Anna Kołos, Zofia Cieślińska (NASK)

https://www.youtube.com/watch?v=u5A3SNw0a7M Style effect(iveness): How and why to encode morphosyntactic features of entire documents  Talk delivered in Polish.

What if we could represent a text of any length with a single vector of the same length that is additionally fully interpretable? No corpus to train on, no dictionary of pretrained embeddings, one document at a time, ready to be analyzed by humans or classifiers? Why not! StyloMetrix vectors are a combination of linguistic metrics that build on the richness of the spaCy library. This approach, of course, misses the semantics of individual words or phrases; thus, it theoretically does not allow for the detection of specific topics. Unless semantics is also carried by style. And in fact, previous experiments and the results of philological research show that these areas are strongly intertwined. For it turns out that – for example – content inappropriate for children or young people is not only an obvious set of forbidden keywords but also a combination of characteristic morphosyntactic indicators of the text. These are so clear and distinctive that, using only the StyloMetrix representation, one can achieve a precision of 90% in a multi-class classification task. Moreover, it turns out that since each vector value is a normalized indicator of a particular grammatical feature of a document, one can also learn something about the linguistic determinants of a given style. This construction of metrics is also a step toward the interpretability of algebraic feature selection methods. All the experiments presented in the talk will be based on content published on the Internet.
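
To illustrate the general idea (not the StyloMetrix implementation itself), the sketch below turns a whole document into a vector of normalized part-of-speech frequencies using spaCy; the Polish pipeline name and the short feature list are assumptions.

```python
# Sketch: represent a document as a vector of grammatical-feature shares.
import spacy

nlp = spacy.load("pl_core_news_sm")  # assumed Polish pipeline; any spaCy model works

FEATURES = ["NOUN", "VERB", "ADJ", "PRON", "ADV"]

def style_vector(text: str) -> list[float]:
    doc = nlp(text)
    n_tokens = max(len(doc), 1)
    # each value is the share of tokens with a given part of speech
    return [sum(tok.pos_ == pos for tok in doc) / n_tokens for pos in FEATURES]

print(style_vector("Szybko przeczytała tę długą, nudną książkę."))
```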

23 May 2022

Karolina Stańczak (Copenhagen University)

https://www.youtube.com/watch?v=3oCLO-CRExM A Latent-Variable Model for Intrinsic Probing  Talk delivered in Polish. Slides in English.

The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute, but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and yields tighter mutual information estimates than two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.

6 June 2022

Cezary Klamra, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences), Sebastian Żurowski (Nicolaus Copernicus University in Toruń), Paulina Rosalska (Nicolaus Copernicus University in Toruń / Applica.ai), Matylda Kozłowska (Oracle Poland), Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=SnjqVft5SzA Devulgarization of Polish Texts Using Pre-trained Language Models  Talk delivered in Polish.

We will propose a text style transfer method for replacing vulgar expressions in Polish utterances with their non-vulgar equivalents while preserving the main characteristics of the text. After fine-tuning three pre-trained language models (GPT-2, GPT-3 and T5) on a newly created parallel corpus of vulgar/non-vulgar sentence pairs, we evaluate their style transfer accuracy, content preservation and language quality. To the best of our knowledge, the proposed solution is the first of its kind for Polish. The paper presenting the solution was accepted to ICCS 2022.

13 June 2022

Michał Ulewicz

https://www.youtube.com/watch?v=4ZcVXg2Y_fA Semantic Role Labeling – data and models  Talk delivered in Polish.

Semantic Role Labeling (SRL) represents the meaning of a sentence in the form of predicate-argument structures (so-called frames). This approach allows us to divide the sentence into meaningful parts and, for each part, precisely answer the questions: who did what to whom, when, where, and how. SRL consists of two steps: i) predicate identification and sense disambiguation, ii) argument identification and classification. High-quality training data in the form of propbanks is crucial for building accurate SRL models. Such datasets are available for English; unfortunately, most languages simply do not have corresponding propbanks due to the high effort and cost of constructing such resources. In my presentation, I will describe how SRL can help in precise text processing. I will present attempts to automatically generate datasets for various languages, including Polish, using the annotation projection technique, and the identified problems specific to projection from English into Polish. I will tell you about SRL models that I built based on the Transformer architecture.
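
As an illustration of the predicate-argument structures mentioned above, here is a toy frame in the spirit of PropBank-style annotation; the sentence and labels are a constructed example, not data from the talk.

```python
# A toy predicate-argument frame for one English sentence.
sentence = "Mary sold the old car to her neighbour yesterday."

frame = {
    "predicate": "sold",
    "sense": "sell.01",                  # predicate sense after disambiguation
    "arguments": {
        "ARG0": "Mary",                  # who did it (the seller)
        "ARG1": "the old car",           # what was sold
        "ARG2": "to her neighbour",      # to whom
        "ARGM-TMP": "yesterday",         # when
    },
}

print(frame["predicate"], frame["arguments"])
```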



Natural Language Processing Seminar 2020–2021

5 October 2020

Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology)

https://www.youtube.com/watch?v=LkR-i2Z1RwM Review of BERT-based Models for Polish Language  Delivered in Polish.

In recent years, a series of BERT-based models improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.

2 November 2020

Inez Okulska (NASK National Research Institute)

https://www.youtube.com/watch?v=B7Y9fK2CDWw Concise, robust, sparse? Algebraic transformations of word2vec embeddings versus precision of classification  Talk delivered in Polish.

The introduction of the vector representation of words, containing the weights of context and central words, calculated by mapping giant corpora of a given language rather than by encoding manually selected linguistic features of words, proved to be a breakthrough for NLP research. After the first delight came revision and a search for improvements - primarily in order to broaden the context, to handle homonyms, etc. Nevertheless, the classic embeddings still apply to many tasks - for example, content classification - and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they maintain the information in a way that still preserves the original "meaning"? What is meaning here? How far can these vectors be deformed, and how does this relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.
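
The flavour of such "torture" experiments can be sketched as follows: project document vectors to progressively lower dimensions and watch how a simple classifier's accuracy responds. The data below are random stand-ins rather than real word2vec or GloVe embeddings.

```python
# Sketch: how does classification accuracy react to dimensionality reduction?
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))              # stand-in for 300-d document embeddings
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # synthetic labels tied to a few dimensions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for dim in (300, 50, 10):
    if dim < X.shape[1]:
        pca = PCA(n_components=dim).fit(X_tr)
        X_tr_d, X_te_d = pca.transform(X_tr), pca.transform(X_te)
    else:
        X_tr_d, X_te_d = X_tr, X_te
    clf = LogisticRegression(max_iter=1000).fit(X_tr_d, y_tr)
    print(dim, round(clf.score(X_te_d, y_te), 3))
```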

16 November 2020

Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=MxbgQL316DQ PINC (Polish Interpreting Corpus): how a corpus can help study the process of simultaneous interpreting  Talk delivered in Polish.

PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter’s lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter’s mind.

30 November 2020

Findings of ACL: EMNLP 2020: Polish session

Łukasz Borchmann et al. (Applica.ai)

https://www.youtube.com/watch?v=THe1URk40Nk Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines  Talk delivered in Polish. Slides in English.

Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or their retrieval for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification from legal texts, we proposed one along with hard-to-beat baselines. The system is expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. Instead, the process should be considered as a few-shot span identification task. In this particular setting, pretrained, universal encoders fail to provide satisfactory results. In contrast, solutions based on language models perform well, especially when unsupervised fine-tuning is applied.

Piotr Szymański (Wrocław Technical University), Piotr Żelasko (Johns Hopkins University)

https://www.youtube.com/watch?v=TXSDhCtTRpw WER we are and WER we think we are  Talk delivered in Polish. Slides in English.

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
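
For reference, Word Error Rate is simply the word-level edit distance between hypothesis and reference divided by the reference length, as in the short sketch below.

```python
# Word Error Rate: Levenshtein distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("we are testing the system", "we are testing a system"))  # 0.2
```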

17 December 2020

Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=newobY5cBJo Multi-Word Lexical Simplification  Talk delivered in Polish.

The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing one of its fragments with a simpler alternative, where both the fragment and its replacement can consist of many words. In order to explore this new direction, a corpus (MWLS1) including 1462 sentences in English from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution (Plainifier) for the problem, based on a purpose-trained neural language model, will be discussed along with its evaluation against human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title.

18 January 2021

Norbert Ryciak, Maciej Chrabąszcz, Maciej Bartoszuk (Sages)

https://www.youtube.com/watch?v=L8RRx9KVhJs Classification of patent applications  Talk delivered in Polish. Slides in English.

During our presentation we will discuss the solution to the patent application classification task, which was one of the GovTech competition problems. We will describe the characteristics of the problem and the proposed solution, especially the original method of representing documents as “clouds of word embeddings”.

1 February 2021

Adam Jatowt (University of Innsbruck)

https://www.youtube.com/watch?v=e7NblngMe6A Question Answering & Finding Temporal Analogs in News Archives  Talk delivered mostly in English (introduction in Polish).

News archives offer immense value to our society, helping users to learn details of events that occurred in the past. Currently, access to such collections is difficult for average users due to their large size and the need for expertise in history. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles by using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogical to entities in the present (e.g., walkman as a temporal analog of iPad).

15 February 2021

Aleksandra Nabożny (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=Rd0nHiVuSZk Methods of optimizing the work of experts during the annotation of non-credible medical texts  Talk delivered in Polish.

Automatic credibility assessment of medical content is an extremely difficult task, because expert assessment is burdened with a large interpretive bias, which depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of my team's and my efforts to optimize the annotation process. We proposed a sentence ordering method in which non-credible sentences are more likely to be placed at the beginning of the evaluation queue. I will also present our proposals for extending the annotation protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions for detecting harmful narratives in so-called medical fake news.

9 March 2021

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Izabela Chojnicka (Faculty of Psychology, University of Warsaw), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

https://www.youtube.com/watch?v=ja04r8WW4Nk Machine learning in detecting schizophrenia and autism from textual utterances  Talk delivered in Polish.

Detection of mental disorders from textual input is an emerging field for applied machine and deep learning methods. In our talk, we will explore the limits of automated detection of autism spectrum disorder and schizophrenia. We will analyse both disorders and describe two diagnostic tools: TLC and ADOS-2, along with the characteristics of the collected data. We will compare the performance of: (1) TLC and ADOS-2, (2) machine learning and deep learning methods applied to the data gathered by these tools, and (3) psychiatrists. We will discuss the effectiveness of several baseline approaches such as bag-of-words and dictionary-based methods, including sentiment and language abstraction. We will then introduce the newest approaches using deep learning for text representation and inference. Owing to the related nature of both disorders, we will describe experiments with transfer and zero-shot learning techniques. Finally, we will explore few-shot methods dedicated to low data size scenarios, which is a typical problem for the clinical setting. Psychiatry is one of the few medical fields in which the diagnosis of most disorders is based on the subjective assessment of a psychiatrist. Therefore, the introduction of objective tools supporting diagnostics seems to be pivotal. This work is a step in this direction.

15 March 2021

Filip Graliński, Agnieszka Kaliska (Applica.ai / Adam Mickiewicz University), Tomasz Stanisławek, Anna Wróblewska (Applica.ai / Warsaw University of Technology), Dawid Lipiński, Bartosz Topolski (Applica.ai), Paulina Rosalska (Applica.ai / Nicolaus Copernicus University), Przemysław Biecek (Warsaw University of Technology / Samsung R&D Institute Poland)

https://www.youtube.com/watch?v=uDBaqxmzppk Key Information Extraction from documents: Kleister NDA/Charity challenges  Talk delivered in Polish. Slides in English.

This presentation will showcase two new datasets for Key Information Extraction: Kleister NDA and Kleister Charity. They comprise a mix of born-digital and scanned long formal documents in English. In these datasets, an NLP system is expected to find or infer various types of entities by utilizing both textual and structural layout features.

12 April 2021

Marek Kubis (Adam Mickiewicz University)

https://www.youtube.com/watch?v=37d0br2axyQ Quantitative analysis of character networks in Polish 19th- and 20th-century novels  Talk delivered in Polish.

I will present a study on induction and quantitative analysis of character networks inferred from Polish novels. The corpus compiled for this study includes both 19th- and 20th-century literary works obtained from publicly available sources. I will discuss the development of the corpus and the network extraction procedure. The structural properties observed for the networks induced from Polish novels will be confronted with the results observed for English novels. Furthermore, I will compare the networks induced from 19th-century novels to the 20th-century networks.

7 June 2021

Maciej Ogrodniczuk, Michał Rudolf (Institute of Computer Science, Polish Academy of Sciences)

ParlaMint: Towards Comparable Parliamentary Corpora  The first part of the slides in Polish.

Marta Kołczyńska (Institute of Political Studies, Polish Academy of Sciences)

Parliamentary debates in COVID times  The second part of the slides in English.

In the first part of the talk we will present the CLARIN-ERIC-funded ParlaMint project, which aims to create a multilingual comparable corpus of parliamentary data based on national corpora of transcripts of parliamentary sittings. The second part of the talk will focus on the work of a research group that used the ParlaMint corpus data in the parliamentary debate analysis task during the Helsinki Digital Humanities Hackathon #DHH21.



Natural Language Processing Seminar 2019–2020

23 September 2019

Igor Boguslavsky (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)

Semantic analysis based on inference  Talk delivered in English.

I will present the semantic analyzer SemETAP, which is a module of the linguistic processor ETAP designed to perform analysis and generation of NL texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences makes it possible to extract implicit information.

Salient features of SemETAP include:

— the knowledge base contains both linguistic and background knowledge;

— inference types include strict entailments and plausible expectations;

— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes;

— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences;

— a new logical formalism, Etalog, has been developed, in which all inference rules are written.

7 October 2019

Tomasz Stanisz (Institute of Nuclear Physics, Polish Academy of Sciences)

https://www.youtube.com/watch?v=sRreAjtf2Jo What can a complex network say about a text?  Talk delivered in Polish.

Complex networks, which have found application in the quantitative description of many different phenomena, have also proven useful in research on natural language. The network formalism makes it possible to study language from various points of view – a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One type of linguistic network is the word-adjacency network, which describes mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties allowing for their practical use. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; applying machine learning methods to collections of those quantities may be used, for example, for authorship attribution.
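
As a rough illustration of the idea (not code from the talk), the sketch below builds a word-adjacency network with the networkx library and computes a few global quantities of the kind that can feed an authorship-attribution classifier.

    # Illustrative sketch: a word-adjacency network and a few global quantities.
    import networkx as nx

    def word_adjacency_network(text: str) -> nx.Graph:
        words = text.lower().split()
        g = nx.Graph()
        for w1, w2 in zip(words, words[1:]):   # one edge per pair of adjacent words
            if g.has_edge(w1, w2):
                g[w1][w2]["weight"] += 1
            else:
                g.add_edge(w1, w2, weight=1)
        return g

    g = word_adjacency_network("the cat sat on the mat and the dog sat on the rug")
    features = {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "density": nx.density(g),
        "avg_clustering": nx.average_clustering(g),
    }
    print(features)   # such quantities can serve as features for authorship attribution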

21 October 2019

Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Coordination in the Universal Dependencies standard  Talk delivered in Polish. Slides in English.

Universal Dependencies (UD; https://universaldependencies.org/) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.

4 November 2019

Marcin Będkowski (University of Warsaw / Educational Research Institute), Wojciech Stęchły, Leopold Będkowski, Joanna Rabiega-Wiśniewska (Educational Research Institute), Michał Marcińczuk (Wrocław University of Science and Technology), Grzegorz Wojdyga, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=-oSBqG4_VDk Similarity of descriptions of qualifications contained in the Integrated Qualifications Register  Talk delivered in Polish.

Analysis of existing solutions for grouping of qualifications  Talk delivered in Polish.

In the talk we will discuss the problem of comparing documents contained in the Integrated Qualifications Register in terms of their content similarity.

In the first part, we characterize the background of the issue, including the structure of descriptions of learning outcomes in qualifications and of the sentences describing learning outcomes. According to the definition in the Act on the Integrated Qualifications System, a learning outcome is the knowledge, skills and social competences acquired in the learning process, and a qualification is a set of learning outcomes whose achievement is confirmed by an appropriate document (e.g. a diploma or certificate). Sentences whose referents are learning outcomes have a stable structure and consist essentially of a so-called operational verb (describing an activity constituting a learning outcome) and a nominal phrase that complements it (naming the object of this activity, in short: the skill object). For example: "Determines vision defects and how to correct them based on eye refraction measurement" or "The student reads technical drawings."

In the second part, we outline an approach that allows determining the degree of similarity between qualifications and grouping them, along with its assumptions and the intuitions behind them. We define the accepted understanding of content similarity, i.e. we outline an approach to determining text similarity in a variant that allows automatic text processing with computer tools. We present a simple representation model, the so-called bag of words, in two versions.

The first of them assumes the full atomization of learning outcomes (including the nominal phrases, i.e. skill objects) and their representation as sets of single nouns representing skill objects. The second is based on n-grams weighted with the TF-IDF measure (term frequency – inverse document frequency), which allows the extraction of key words and phrases from texts.

The first approach can be described as "wasteful", the second as "frugal". The first allows many similar qualifications to be presented for each qualification (although the degree of similarity may be low). The second, on the other hand, allows a situation in which no similar qualifications are found for a given qualification.
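
As an illustration of the second, n-gram-based variant described above (a hedged sketch, not the project's code), the snippet below computes TF-IDF-weighted similarity between invented learning-outcome sentences with scikit-learn.

    # Hedged sketch of the n-gram + TF-IDF representation; the texts are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    qualifications = [
        "determines vision defects and how to correct them based on eye refraction measurement",
        "reads technical drawings and prepares sketches of machine parts",
        "measures eye refraction and selects corrective lenses",
    ]

    # Unigrams and bigrams weighted by term frequency * inverse document frequency.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(qualifications)

    # Pairwise content similarity between the descriptions.
    print(cosine_similarity(X).round(2))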

In the third part, we describe sample groupings and ranking lists based on both approaches, obtained with multidimensional scaling, the k-means algorithm and hierarchical clustering. We also present a case study illustrating the advantages and disadvantages of both approaches.

In the fourth part we present conclusions on grouping qualifications, as well as more general conclusions related to the definition of key words. In particular, we present conclusions on the use of the indicated methods for comparing texts of varying length, as well as texts that partially overlap (contain common fragments).

The talk was prepared in cooperation with the authors of an expert report on the automatic analysis and comparison of qualifications for the purpose of grouping them, prepared under the project "Keeping and developing the Integrated Qualifications Register" (POWR.02.11.00-00-0001/17).

18 November 2019

Alexander Rosen (Charles University in Prague)

https://www.youtube.com/watch?v=kkqlUnq7jGE The InterCorp multilingual parallel corpus: representation of grammatical categories  Talk delivered in English.

InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliament proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, currently using language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.

21 November 2019

Alexander Rosen (Charles University in Prague)

https://www.youtube.com/watch?v=OQ-3B4-MXCw A learner corpus of Czech  Talk delivered in English.

Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, which complicates the task of linguistic annotation and requires an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, and later extended to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially in view of Czech being a highly inflectional language with free word order.

12 December 2019

Aleksandra Tomaszewska (Institute of Applied Linguistics, University of Warsaw)

https://www.youtube.com/watch?v=_WJF6BuQML4 Cross-Genre Analysis of EU Borrowings in Polish — the Need for Research Automation  Talk delivered in Polish.

During this presentation, the project “EU Borrowings – formation mechanisms, functions, evolution, and assimilation in the Polish language”, funded by a Diamond Grant from the Polish Ministry of Science and Higher Education, will be presented. The project aims to analyze and categorize EU borrowings, that is, the effects of language contact occurring in the European Union.

First, the author will discuss the theoretical background of the phenomenon and the aims of the research project, and present a compiled corpus of EU-related Polish genres composed of three sub-corpora: transcriptions of interviews with MEPs, EU law (regulations and directives), and press releases of EU institutions. In the next part of the presentation, the methods and tools used in this research will be presented, including the methods of conducting analyses on the collected research material. On the basis of specific examples, the need for automating research on the latest borrowings in Polish will also be signaled.

13 January 2020

Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=sux6l5glZrA Integrating Polish Language Tools and Resources in spaCy  Talk delivered in Polish.

In our project we aim to fill the niche between the robust tools developed during research work and dedicated to particular NLP tasks in Polish, and users looking for easily accessible resources. spaCy is one of the leading open-source NLP frameworks, but it has no official support for Polish. In our talk we will present the model for spaCy that we have been working on. It currently allows for segmentation, lemmatization, morphosyntactic analysis, dependency parsing and named entity recognition. We will discuss the tools which we have integrated, the results of evaluation, a real-world case in which the model was used, and some possible paths for further development.
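
A minimal usage sketch of such a pipeline is shown below; it is not the authors' code and it assumes that some Polish spaCy model is installed (the package name pl_core_news_sm is used here only as an example).

    # Minimal usage sketch of a spaCy pipeline for Polish (assumes an installed model,
    # e.g.: python -m spacy download pl_core_news_sm).
    import spacy

    nlp = spacy.load("pl_core_news_sm")
    doc = nlp("Maria Skłodowska-Curie urodziła się w Warszawie.")

    for token in doc:
        # segmentation, lemmatization, morphosyntactic analysis, dependency parsing
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

    for ent in doc.ents:
        # named entity recognition
        print(ent.text, ent.label_)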

27 January 2020

Alina Wróblewska, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=v6YncOiFMuY Empirical Linguistic Study of Sentence Embeddings  Talk delivered in Polish.

The results of an empirical linguistic study on the retention of linguistic information in sentence embeddings will be presented. The research methods are based on universal probing tasks and downstream tasks. The results of experiments on English and Polish indicate that different types of sentence embeddings encode linguistic information to varying degrees. The research was published in the article Empirical Linguistic Study of Sentence Embeddings in the proceedings of ACL 2019.

24 February 2020

Piotr Niewiński (Samsung R&D Polska), Aleksander Wawer, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=kU79Q00fCI0 Fact-checking in FEVER competition  Talk delivered in Polish. Slides in English.

Aleksander Wawer, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

Fact Checking or Psycholinguistics: How to Distinguish Fake and True Claims?  Talk delivered in Polish. Slides in English.

Piotr Niewiński, Maria Pszona, Maria Janicka (Samsung R&D Polska)

Generative Enhanced Model (extended, redesigned & fine-tuned GPT language model) for adversarial attacks  Talk delivered in Polish. Slides in English.

During the seminar we will present our work for the FEVER (Fact Extraction and Verification) competition. "Fake news" has become a dangerous phenomenon in modern information circulation. There are many approaches to the problem of recognizing fake messages – in the FEVER competition, given a piece of text, the task is to find specific evidence from given sources for its verification. During the presentation, we will show the most interesting ideas submitted by the participants of previous editions, discuss our article comparing fact verification approaches with psycholinguistic analysis, and present a winning model for cheating fact verification systems.

9 March 2020

Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=YWdqlMR6bfs Assessing a document credibility based on style  Talk delivered in Polish.

The presentation will cover my work on automatically detecting documents of low credibility, such as fake news, based on their stylistic properties. During the study, a new corpus of 103,219 documents from 229 sources was gathered and used to evaluate general-purpose text classifiers. Given their unsatisfactory performance, new methods were implemented based on stylometric features and neural architectures. It was also verified whether the proposed classifiers indeed pay attention to the vocabulary known to be typical of fake news. The results of the presented research were published at the AAAI 2020 conference in an article entitled Capturing the Style of Fake News.
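
The snippet below is only a toy sketch of the general idea of style-based classification (hand-picked surface features plus a linear classifier); it is not the corpus, the features, or the models from the paper.

    # Toy sketch of style-based credibility classification; data and features are invented.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def style_features(text):
        words = text.split()
        n = max(len(words), 1)
        return [
            len(words),                                  # document length in words
            sum(len(w) for w in words) / n,              # average word length
            sum(w.isupper() for w in words) / n,         # share of all-caps words
            text.count("!") / max(len(text), 1),         # exclamation-mark density
        ]

    texts = [
        "SHOCKING!!! You won't believe what they found!",
        "The ministry published the annual budget report on Tuesday.",
    ]
    labels = [1, 0]   # 1 = low credibility, 0 = credible (toy labels)

    clf = LogisticRegression().fit(np.array([style_features(t) for t in texts]), labels)
    print(clf.predict([style_features("UNBELIEVABLE!!! Doctors HATE this trick!")]))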



Natural Language Processing Seminar 2018–2019

1 October 2018

Janusz S. Bień (University of Warsaw – prof. emeritus)

https://www.youtube.com/watch?v=mOYzwpjTAf4 Electronic indexes to lexicographical resources  Talk delivered in Polish.

We will focus on the indexes to lexicographical resources available online in DjVu format. Such indexes can be browsed, searched, modified and created with the open-source program djview4poliqarp; the origins and the history of the program will be briefly presented. The index support was originally added to the program to handle the list of entries in Linde's 19th-century dictionary, but it can also be used conveniently for other resources, as will be demonstrated on selected examples. In particular, some new features, introduced to the program in recent months, will be presented publicly for the first time.

15 October 2018

Wojciech Jaworski, Szymon Rutkowski (University of Warsaw)

https://www.youtube.com/watch?v=SbPAdmRmW08 A multilayer rule based model of Polish inflection  Talk delivered in Polish.

The presentation will be devoted to a multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish and does not use the concept of an inflection paradigm. The model consists of three layers of hand-made rules: an "orthographic-phonetic layer" converting a segment to a representation reflecting the morphological patterns of the language, an "analytic layer" generating the lemma and determining the affixes, and an "interpretation layer" giving a morphosyntactic interpretation based on the detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with a function for guessing lemmas and morphosyntactic interpretations of out-of-dictionary forms (a guesser). The second use of the model is the generation of word forms from a lemma and a morphosyntactic interpretation. The presentation will also cover the issue of disambiguating the results provided by the morphological analyzer. A demo version of the program is available on the Internet.

29 October 2018

Jakub Waszczuk (Heinrich-Heine-Universität Düsseldorf)

https://www.youtube.com/watch?v=zjGQRG2PNu0 From morphosyntactic tagging to identification of verbal multiword expressions: a discriminative approach  Talk delivered in Polish. Slides in English.

The first part of the talk was dedicated to Concraft-pl 2.0, the new version of a morphosyntactic tagger for Polish based on conditional random fields. Concraft-pl 2.0 performs morphosyntactic segmentation as a by-product of disambiguation, which makes it possible to use it directly on the segmentation graphs provided by the analyser Morfeusz. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. During the second part, an approach to identifying verbal multiword expressions (VMWEs) based on dependency parsing results was presented. In this approach, VMWE identification is reduced to the problem of dependency tree labeling, where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree. The underlying labeling model can be seen as conditional random fields (as used in Concraft) adapted to tree structures. A system based on this approach ranked first in the closed track of the PARSEME shared task 2018.

5 November 2018

Jakub Kozakoszczak (Faculty of Modern Languages, University of Warsaw / Heinrich-Heine-Universität Düsseldorf)

https://www.youtube.com/watch?v=sz7dGmf8p3k Mornings to Wednesdays — semantics and normalization of Polish quasi-periodical temporal expressions  Talk delivered in Polish.

The standard interpretations of expressions like “Januarys” and “Fridays” in temporal representation and reasoning are slices of collections of 2nd order, e.g. all the sixth elements of day sequences of cardinality 7 aligned with calendar weeks. I will present the results of work on normalizing the most frequent Polish quasi-periodical temporal expressions for online booking systems. On the linguistic side, I will argue against synonymy of the kind “Fridays” = “sixth days of the weeks” and give semantic tests for a rudimentary classification of quasi-periodicity. In the formal part, I will propose an extension to existing formalisms covering the intensional quasi-periodical operators “from”, “to”, “before” and “after”, restricted to monotonic domains. In the implementation part, I will demonstrate an algorithm for lazy generation of the generalized intersection of collections.
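
The sketch below is only a toy illustration of the lazy-evaluation idea: intersecting two strictly increasing, potentially infinite streams without materializing them. The actual algorithm from the talk handles generalized intersections of collections, which this sketch does not attempt.

    # Toy sketch: lazy intersection of two strictly increasing (possibly infinite) streams.
    from itertools import count, islice

    def lazy_intersection(xs, ys):
        x, y = next(xs), next(ys)
        while True:
            if x == y:
                yield x
                x, y = next(xs), next(ys)
            elif x < y:
                x = next(xs)
            else:
                y = next(ys)

    fridays = (5 + 7 * k for k in count())   # toy encoding: day indices that fall on Fridays
    even_days = (2 * k for k in count(1))    # toy encoding: even day indices

    print(list(islice(lazy_intersection(fridays, even_days), 5)))   # [12, 26, 40, 54, 68]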

19 November 2018

Daniel Zeman (Institute of Formal and Applied Linguistics, Charles University in Prague)

https://www.youtube.com/watch?v=xUmZ8Mxcmg0 Universal Dependencies and the Slavic Languages  Talk delivered in English.

I will present Universal Dependencies, a worldwide community effort aimed at providing multilingual corpora, annotated at the morphological and syntactic levels following unified annotation guidelines. I will discuss the concept of core arguments, one of the cornerstones of the UD framework. In the second part of the talk I will focus on some interesting problems and challenges of applying Universal Dependencies to the Slavic languages. I will discuss examples from 12 Slavic languages that are currently represented in UD and show that cross-linguistic consistency can still be improved.

3 December 2018

Ekaterina Lapshinova-Koltunski (Saarland University)

https://www.youtube.com/watch?v=UQ_6dDNEw8E Analysis and Annotation of Coreference for Contrastive Linguistics and Translation Studies  Talk delivered in English.

In this talk, I will report on ongoing work on coreference analysis in a multilingual context. I will present two approaches to the analysis of coreference and coreference-related phenomena: (1) top-down or theory-driven: here we start from linguistic knowledge derived from existing frameworks, define linguistic categories to analyse and create an annotated corpus that can be used either for further linguistic analysis or as training data for NLP applications; (2) bottom-up or data-driven: in this case, we start from a set of shallow features that we believe are discourse-related. We extract these structures from a huge amount of data and analyse them from a linguistic point of view, trying to describe and explain the observed phenomena from the point of view of existing theories and grammars.

7 January 2019

Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw), Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford)

Enhanced Universal Dependencies  Talk delivered in Polish. Slides in English.

The aim of this talk is to present two threads of our recent work on Universal Dependencies (UD), a standard for syntactically annotated corpora (http://universaldependencies.org/). The first thread is concerned with the development of a new UD treebank of Polish, one that makes extensive use of the enhanced level of representation made available in the current UD standard. The treebank is the result of a conversion from an earlier ‘treebank’ of Polish, one that was annotated with constituency and functional structures as they are understood in Lexical Functional Grammar. We will outline the conversion procedure and present the resulting UD treebank of Polish. The second thread is concerned with various inconsistencies and deficiencies of UD that we identified in the process of developing the UD treebank of Polish. We will concentrate on two particularly problematic areas in UD, namely the core/oblique distinction, which aims to – but does not really – replace the infamous argument/adjunct dichotomy, and coordination, a phenomenon problematic for all dependency approaches.

14 January 2019

Agata Savary (François Rabelais University Tours)

Literal occurrences of multiword expressions: quantitative and qualitative analyses  Talk delivered in Polish. Slides in English.

Multiword expressions (MWEs) such as “to pull strings” (to use one's influence), “to take part” or “to do in” (to kill) are word combinations which exhibit lexical, syntactic, and especially semantic idiosyncrasies. They pose special challenges to linguistic modeling and computational linguistics due to their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language. Additionally, MWEs can have both idiomatic and literal occurrences. For instance, “pulling strings” can be understood either as making use of one's influence, or literally. Even though this phenomenon has been largely addressed in psycholinguistics, linguistics and natural language processing, the notion of a literal reading has rarely been formally defined or subjected to quantitative analyses. I will propose a syntax-based definition of a literal reading of an MWE. I will also present the results of a quantitative and qualitative analysis of this phenomenon in Polish, as well as in four typologically distinct languages: Basque, German, Greek and Portuguese. This study, performed on the multilingual annotated corpus of the PARSEME network, shows that literal readings constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts.

21 January 2019

Marek Łaziński (University of Warsaw), Michał Woźniak (Jagiellonian University)

Aspect in dictionaries and corpora. Why and how should aspect pairs be tagged in corpora?  Talk delivered in Polish.

Corpora are generally tagged for grammatical categories, including verbal aspect. They all choose between perfective (pf) and imperfective (ipf), and some of them add a third value: bi-aspectual (not present in the National Corpus of Polish). However, no Slavic corpus tags the aspect value of a verb form with reference to its aspect partner. If we can mark aspect pairs in dictionaries, it should also be possible in corpora, provided that we extrapolate the aspect features of a lexeme to specific verb forms in specific uses. While retaining the existing morphological tagging, including the aspect value, two more aspect tags have been added: 1) morphological markers of aspect and 2) a reference to a superlemma. Every verb form in the corpus thus has three parts: 1) the existing grammatical characteristics (TAKIPI), 2) the repeated or corrected aspect value (including bi-aspectual) and morphological markers, 3) a reference to the aspect pair (superlemma). A corpus tagged for aspect pairs, even with alternative references for every lexeme, opens new perspectives for research. The possibilities are especially rich in a parallel corpus pairing a Slavic language with an aspectless language, such as the Mainz-Warsaw Corpus. In order to check the usefulness of our aspect pair tagging, a series of queries will be built which allow comparing the grammatical profiles of suffixal and prefixal pf and ipf aspect partners.

11 February 2019

Anna Wróblewska (Applica / Warsaw University of Technology), Filip Graliński (Applica / Adam Mickiewicz University)

https://www.youtube.com/watch?v=tZ_rkR7XqRY Text-based machine learning processes and their interpretability  Talk delivered in Polish. Slides in English.

How do we tackle text modeling challenges in business applications? We will present a prototype architecture for the automation of processes in text-based work, along with several use cases of machine learning models, covering emotion detection, abusive language recognition and more. We will also show our tool for explaining suspicious findings in datasets and model behaviour.

28 February 2019

Jakub Dutkiewicz (Poznan University of Technology)

https://www.youtube.com/watch?v=Ap2zn8-RfWI Empirical research on medical information retrieval  Talk delivered in Polish. Slides in English.

We discuss the results and evaluation procedures of the bioCADDIE 2016 challenge on the retrieval of precision medicine data. Our good results are due to word-embedding query expansion with appropriate weights. Information Retrieval (IR) evaluation is demanding because of the considerable effort required to judge over 10,000 documents. A simple sampling method was proposed over 10 years ago for the estimation of Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) in spite of incomplete judgments. For this method to work, the number of judged documents has to be relatively large. Such conditions were not fulfilled in the bioCADDIE 2016 challenge or in TREC PM 2017 and 2018. The specificity of the bioCADDIE evaluation makes the post-challenge results incompatible with those judged during the contest. In bioCADDIE, for some questions there were no judged relevant documents at all, and the results are strongly dependent on the cut-off rank. As a result, in the bioCADDIE challenge infAP is weakly correlated with infNDCG, and the error can be up to 0.15–0.20 in absolute value. We believe that such a deviation of evaluation measures may override the primary role of the measure. We corroborate this claim by simulating synthetic results: we propose a simulated environment with properties which mirror real systems, implement a number of evaluation measures within the simulation, and discuss their usefulness on partially annotated document collections with regard to collection size, the number of annotated documents and the proportion of relevant to irrelevant documents. In particular, we focus on the behavior of the aforementioned AP and NDCG and their inferred versions. Other studies suggest that infNDCG correlates weakly with other measures and therefore should not be selected as the most important measure.
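
For readers less familiar with the measures mentioned above, a minimal sketch of DCG/NDCG on a single ranked list is given below; the inferred variants (infAP, infNDCG) additionally correct for sampled, incomplete judgments, which this sketch does not attempt.

    # Minimal illustration of DCG/NDCG for one ranked list of graded relevance judgments.
    import math

    def dcg(relevances, k=None):
        rel = relevances[:k] if k else relevances
        return sum(r / math.log2(i + 2) for i, r in enumerate(rel))

    def ndcg(relevances, k=None):
        ideal = dcg(sorted(relevances, reverse=True), k)
        return dcg(relevances, k) / ideal if ideal > 0 else 0.0

    # Relevance of the documents returned at ranks 1..6 (0 = not relevant).
    print(round(ndcg([2, 0, 1, 0, 0, 2], k=5), 3))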

21 March 2019

Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences)

Size optimisation of language models  Talk delivered in Polish.

During the seminar, the results of work on reducing the size of language models will be discussed. The author will review the literature on the size reduction of recurrent neural networks (used as language models). Then, the author's own implementations will be presented, along with evaluation results on different Polish and English corpora.

25 March 2019

Łukasz Dębowski (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=gIoI-A00Y7M GPT-2 – Some remarks of an observer  Talk delivered in Polish.

GPT-2 is the latest neural statistical language model from the OpenAI team. A statistical language model is a probability distribution over texts that can be used for automatic text generation. In essence, GPT-2 turned out to be a surprisingly good generator of semantically coherent texts of the length of several paragraphs, pushing the boundaries of what had seemed technically possible so far. Anticipating the use of GPT-2 to generate fake news, the OpenAI team decided to publish only a version of the model reduced tenfold. In my talk, I will share some remarks about GPT-2.

8 April 2019

Agnieszka Wołk (Polish-Japanese Academy of Information Technology and Institute of Literary Research, Polish Academy of Sciences)

https://www.youtube.com/watch?v=QVrY4rRzMOI Language collocations in quantitative research  Talk delivered in Polish.

This presentation aims to aid the enormous effort required to analyze phraseological writing competence by presenting an automatic evaluation tool for texts. An attempt is made to measure both second language (L2) writing proficiency and text quality. We use the CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (bigrams) and calculates the t-score and related information. We used the Level 3 Corpus of Contemporary American English as the reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers. We also show how it can be used for early depression detection and for stylometry.
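
As background for the CollGram-style measure mentioned above, the sketch below shows the standard collocation t-score for a bigram computed against a reference corpus; the counts are invented and this is not the tool's code.

    # Standard collocation t-score for a bigram against a reference corpus (invented counts).
    import math

    def t_score(bigram_count, w1_count, w2_count, corpus_size):
        observed = bigram_count
        expected = (w1_count * w2_count) / corpus_size   # count expected under independence
        return (observed - expected) / math.sqrt(observed)

    # e.g. "strong tea": bigram seen 120 times, "strong" 25,000 times and "tea" 18,000 times
    # in a 100-million-word reference corpus.
    print(round(t_score(120, 25_000, 18_000, 100_000_000), 2))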

15 April 2019

Alina Wróblewska, Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=p-VldtRqvmg Dependency parsing of Polish  Talk delivered in Polish.

Dependency parsing is a crucial issue in various NLP tasks. The predicate-argument structure transparently encoded in dependency-based syntactic representations may support machine translation, question answering, sentiment analysis, etc. In the talk, we will present PDB – the largest dependency treebank for Polish, and COMBO – a language-independent neural system for part-of-speech tagging, morphological analysis, lemmatisation and dependency parsing.

13 May 2019

Piotr Niewiński, Maria Pszona, Alessandro Seganti, Helena Sobol (Samsung R&D Poland), Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Samsung R&D Poland in SemEval 2019 competition  Talk delivered in Polish. Slides in English.

The talk presents Samsung R&D Poland solutions that participated in the SemEval 2019 competition. Both were ranked second in two different tasks of the competition.

1. Fact Checking in Community Question Answering Forums

We present our submission to SemEval 2019 Task 8 on Fact-Checking in Community Forums. The aim was to classify questions from the QatarLiving forum as OPINION, FACTUAL or SOCIALIZING. We will present our primary solution: a Deeply Regularized Residual Neural Network (DRR NN) with Universal Sentence Encoder embeddings, which was ranked second in the official evaluation phase. Moreover, we will compare this solution with two contrastive models based on ensemble methods.

2. Linguistically enhanced deep learning offensive sentence classifier

How do we define offensive content? What is a bad word? In our presentation we will discuss the problem of recognizing what is offensive and what is not in social media (Twitter etc.). Furthermore, we will present the system that we implemented to participate in SemEval 2019 Task 5 and Task 6 (where we took 2nd place in Task 6 Subtask C) and compare our results to other state-of-the-art approaches. We will show that our approach outperformed other models thanks to adding linguistically based observations to the model features.

27 May 2019

Magdalena Zawisławska (University of Warsaw)

https://www.youtube.com/watch?v=157YzQ70bV4 Synamet — Polish Corpus of Synesthetic Metaphors  Talk delivered in Polish.

The aim of the paper is to discuss the procedure of identifying synesthetic metaphors and annotating metaphoric units (MUs) in the Synamet corpus, which was created within the framework of the NCN grant UMO-2014/15/B/HS2/00182. The theoretical basis for the description of metaphors was the Conceptual Metaphor Theory (CMT) of Lakoff and Johnson combined with Fillmore's frame semantics. Lakoff and Johnson define a metaphor as a conceptual mapping from a source domain to a target domain, e.g. LOVE IS A JOURNEY. Because the concept of a domain is unclear, it has been replaced by a frame which, unlike a conceptual domain, links the semantic and linguistic levels (frames are activated by lexical units). The synesthetic metaphor in the narrower sense is defined as a mapping from one perceptual modality to a different perceptual modality, e.g. a bright sound (VISION → HEARING), and in the broader sense as the description of non-perceptual phenomena with expressions referring primarily to sensory perceptions, e.g. a rough character (TOUCH → PERSON). The Synamet project uses an even wider definition of synesthetic metaphor: any expression in which two different frames are activated and one of them is perceptual. Texts in the Synamet corpus come from blogs devoted to perfumes, wine, beer, music, or coffee, in which, due to the topics, the chance of finding synesthetic metaphors was the greatest. The paper presents the basic statistics of the corpus and atypical metaphorical units that required modification of the annotation procedure.



Natural Language Processing Seminar 2017–2018

2 October 2017

Paweł Rutkowski (University of Warsaw)

https://www.youtube.com/watch?v=Acfdv6kUe5I Polish Sign Language from the perspective of corpus linguistics  Talk delivered in Polish. Slides in English.

Polish Sign Language (polski język migowy, PJM) is a full-fledged visual-spatial language used by the Polish Deaf community. It started to evolve in the second decade of the nineteenth century, with the foundation of the first school for the deaf in Poland. Until recently, PJM attracted very little attention from the linguistic community in Poland. The aim of this talk is to present a large-scale research project aimed at creating an extensive and representative corpus of PJM. The corpus is currently being compiled at the University of Warsaw. It is a collection of video clips showing Deaf people using PJM in a variety of different communication contexts. The videos are richly annotated: they are segmented, lemmatized, translated into Polish, tagged for various grammatical features and transcribed with HamNoSys symbols. The Corpus of PJM is currently one of the two largest sets of annotated sign language data in the world. Special attention will be paid to the issue of lexical frequency in PJM. Studies of this type are available for a handful of sign languages only, including American Sign Language, New Zealand Sign Language, British Sign Language, Australian Sign Language and Slovene Sign Language. Their empirical basis ranged from 100,000 tokens (NZSL) to as little as 4,000 tokens (ASL). The present talk contributes to our understanding of lexical frequency in sign languages by analyzing a much larger set of relevant data from PJM.

23 October 2017

Katarzyna Krasnowska-Kieraś, Piotr Rybak, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=8qzqn69nCmg Towards the evaluation of feature embedding models of the fusional languages in the context of morphosyntactic disambiguation and dependency parsing  Talk delivered in Polish.

Neural networks have recently been very successful in various natural language processing tasks. An important component of the neural network approach is a dense vector representation of features, i.e. feature embeddings. Various feature types are possible, e.g. words or part-of-speech tags. In our talk we are going to present the results of an analysis showing what should be used as features when estimating embedding models for fusional languages – tokens or lemmata. Furthermore, we are going to discuss the methodological question of whether the results of the intrinsic evaluation of embeddings are informative for downstream applications, or whether the embedding models should be evaluated extrinsically. The accompanying experiments were conducted on Polish – a fusional Slavic language with a relatively free word order. This research has inspired us to implement a morphosyntactic disambiguator – Toygger (Krasnowska-Kieraś, 2017). The tool won Task 1 (A) of the PolEval 2017 competition and will be presented in our talk.

6 November 2017

Szymon Łęski (Samsung R&D Poland)

https://www.youtube.com/watch?v=266ftzwmKeU Deep neural networks in language models  Talk delivered in Polish. Slides in English.

In my talk I will first give an introduction to language models: traditional, n-gram-based models and new ones based on recurrent networks. Then, based on recent papers, I will discuss the most interesting extensions and modifications of RNN-based language models, such as modified word representations or models whose output is not limited to a pre-defined vocabulary.

20 November 2017

Michał Ptaszyński (Kitami Institute of Technology, Japan)

https://www.youtube.com/watch?v=hUtI5lCyUew Capturing Emotions in Context as a way towards Computational Phronesis  Talk delivered in Polish.

Research on emotions in Artificial Intelligence and related fields has flourished in recent years. Unfortunately, in most research emotions are analyzed without their context. I will argue that recognizing emotions without recognizing their context is incomplete and cannot be sufficient for real-world applications. I will also describe some consequences of disregarding the context of emotions. Finally, I will present one approach in which the context of emotions is considered and briefly describe some of the first experiments performed in this matter.

27 November 2017

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Automated coreference resolution in Polish  Talk delivered in Polish.

The talk presents the description of nominal referential constructs in Polish (i.e. textual fragments referencing the same discourse entities) and the computational-linguistic methods implemented for their decoding. The algorithms are corpus-based with manual annotation of coreferential constructs and are evaluated using standard metrics.

4 December 2017

Adam Dobaczewski, Piotr Sobotka, Sebastian Żurowski (Nicolaus Copernicus University in Toruń)

https://www.youtube.com/watch?v=az06czLflMw Dictionary of Polish reduplications and repetitions  Talk delivered in Polish.

In our talk we will present a dictionary prepared by the team from the Institute of Polish Language of the Nicolaus Copernicus University in Toruń (grant NPRH 11H 13 0265 82). In the dictionary we document expressions of the Polish language in which the reduplication or repetition of forms of the same lexeme can be observed. We distinguish the units of language according to Bogusławski's operational grammar framework and divide them into two basic groups: (i) lexical units consisting of two such segments or forms of the same lexeme (Pol. całkiem całkiem; fakt faktem); (ii) operational units based on a pattern of repetition of words belonging to a certain class predicted by this scheme (Pol. N[nom] N[inst] ale _, where N stands for any noun, e.g. sąd sądem, ale _; miłość miłością, ale _). We have prepared the dictionary in traditional (printed) form due to the relatively small number of registered units. Its material base is the resources of the NKJP, which were searched using a dedicated search engine for repetitions in the NKJP. This tool was specially prepared for this project at the LEG ICS PAS.

29 January 2018

Roman Grundkiewicz (Adam Mickiewicz University in Poznań/University of Edinburgh)

https://www.youtube.com/watch?v=dj9rTwzDCdA Automatic Grammatical Error Correction using Machine Translation  Talk delivered in Polish. Slides in English.

In my presentation I will be talking about the task of automated grammatical error correction (GEC) in texts written by non-native English speakers. I will present our experiments on the application of phrase-based statistical machine translation (SMT) and our GEC system, which achieved new state-of-the-art results. The importance of parameter optimization towards the task-specific evaluation metric and of new GEC-adapted dense and sparse features will be discussed. I will also briefly describe the results of further research using neural machine translation (NMT).

12 February 2018

Agnieszka Mykowiecka, Aleksander Wawer, Małgorzata Marciniak, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=9QPldbRyIzU Recognition of metaphorical noun phrases in Polish with distributional semantics  The talk delivered in Polish.

Our talk addresses the use of vector models for Polish based on lemmas and forms. We compare the results for two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. Then we apply vector models to detect metaphorical and literal meaning of adjective-noun (AN) phrases. We show the results of our method for isolated phrases and compare them to other known methods. Finally, we discuss the problem of recognition of metaphorical/literal meaning of AN phrases in sentences.
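
To illustrate the two test types mentioned above (a toy sketch with invented 3-dimensional vectors, not the authors' models for Polish): similarity is typically tested with cosine similarity and analogy with vector arithmetic.

    # Toy sketch: similarity via cosine, analogy via vector arithmetic (invented vectors).
    import numpy as np

    vec = {
        "king":  np.array([0.8, 0.6, 0.1]),
        "queen": np.array([0.7, 0.6, 0.8]),
        "man":   np.array([0.9, 0.1, 0.1]),
        "woman": np.array([0.8, 0.1, 0.8]),
    }

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity test: how close are two words in the vector space?
    print(round(cos(vec["king"], vec["queen"]), 2))

    # Analogy test: king - man + woman should land near queen.
    target = vec["king"] - vec["man"] + vec["woman"]
    print(max(vec, key=lambda w: cos(vec[w], target)))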

26 February 2018

Celina Heliasz (University of Warsaw)

To create or to contribute? On the search for synergy between computer scientists and linguists  The talk delivered in Polish.

The main topic of my presentation is the methods of conducting research in the field of corpus linguistics, a field currently addressed by both computer scientists and linguists. In my talk, I will present attempts to recognize and visualize semantic relations in text undertaken by computer scientists as part of two projects: RST (Rhetorical Structure Theory) and PDTB (Penn Discourse Treebank). Then, I will contrast RST and PDTB with analogous attempts made by computer scientists and linguists at IPI PAN as part of the CLARIN-PL venture. The aim of the presentation is to show the determinants of effective linguistic analysis, which must be taken into account when designing IT tools if these tools are to support research on texts and provide strong foundations for linguistic theories, rather than only implement existing theories in this field.

9 April 2018

Jan Kocoń (Wrocław University of Technology)

https://www.youtube.com/watch?v=XgSyuWEHWhU Recognition of temporal expressions and events in Polish text documents  The talk delivered in Polish.

A temporal expression is a sequence of words that informs when an event occurs, how often it occurs, or how long it lasts. Event descriptions are words which indicate a change of state in the description of reality (and also some states). These issues fall within the scope of information extraction. They are well defined and described for English and partly for other languages. The TimeML specification, whose temporal information description language has been accepted as an ISO standard, has been officially adapted for six languages, and its temporal expressions description section is defined for eleven languages. The result of the work carried out within CLARIN-PL is an adaptation of the TimeML guidelines to Polish. The motivation for this topic was the fact that temporal information is used in various natural language processing tasks, including question answering, automatic text summarisation, semantic relation extraction and many others. These methods allow researchers in the Digital Humanities and Social Sciences to work with very large collections of texts whose analysis, without these methods, would be very time-consuming, if possible at all. In addition to the adaptation of the temporal information description language itself, the quality and efficiency of the recognition methods are a key aspect of temporal expression and event recognition. The presentation will discuss both the analysis of the quality of data prepared by domain experts (including inter-annotator agreement analysis) and the results of research aimed at reducing the complexity of the computational problem while preserving the quality of the methods.

23 April 2018

Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska (Institute of the Polish Language, Polish Academy of Sciences), Witold Kieraś, Dorota Komosińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=APvZdALq6ZU Historical corpus – problems of transliteration, transcription and annotation on the example of the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772)  The talk delivered in Polish.

During the seminar, the process of creating the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), also called the Baroque Corpus, will be discussed. Particular emphasis will be placed on those tasks and problems that are specific to historical corpora, in contrast to corpora of contemporary texts such as the National Corpus of Polish. We will also show the tools that were created for the needs of the project or adapted to them. After a general presentation of the project (assumptions, financing, team, current status, purpose of the corpus), we will discuss particular problems in the order in which they appeared during the creation of the corpus: the selection of texts, gathering them and incorporating them into a database, the necessity of their transcription into modern spelling (resulting from the huge variation in the spelling of old prints and manuscripts), issues of morphological analysis, morphosyntactic annotation (manual and automatic), and corpus search.

14 May 2018

Łukasz Kobyliński, Michał Wasiluk, Zbigniew Gawłowicz (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=QpmLVzqQfcM MTAS corpus search engine and its implementation for Polish language corpora  The talk delivered in Polish.

During the seminar we will discuss our experiences with the MTAS search engine in the context of Polish language corpora. We will present several implementations of MTAS in such corpus-related projects as KORBA (the corpus of the Polish language of the 17th and 18th centuries), the 19th-century language corpus, as well as the National Corpus of Polish. We will also discuss preliminary experiments with implementing MTAS in Korpusomat, a tool that allows users to create their own corpora. During the presentation we will share our solutions to the problems encountered during the adaptation of MTAS to Polish and preliminary efficiency test results. We will also discuss the search capabilities of the engine and our plans for enhancing MTAS.

21 May 2018 (IPI PAN seminar presentation, 13:00)

Piotr Borkowski (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=o2FFtfrqh3I Semantic methods of categorization in the tasks of text document analysis  The talk delivered in Polish.

In my PhD thesis, entitled 'Semantic methods of categorization in the tasks of text document analysis', a new algorithm for the semantic categorization of documents was proposed and examined. On its basis, a new algorithm for category aggregation, a family of semantic classification algorithms, and a heterogeneous classifier committee (which combines the semantic categorization algorithm with previously known classifiers) were developed. In my talk I will briefly present their concepts and the results of studies of their effectiveness.

28 May 2018

Krzysztof Wołk (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=FyeVRSXbBOg Exploration and usage of comparable corpora in machine translation  The talk delivered in Polish.

The problem that will be presented in the seminar is how to improve machine speech translation between Polish and English. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation. Polish language resources are lacking in parallel and monolingual data. Therefore, the main objective of the study was to develop an automatic toolkit for textual resources preparation by mining comparable corpora and quasi comparable corpora. Experiments were conducted mostly on casual human speech, consisting of lectures, movie subtitles, European Parliament proceedings, and European Medicines Agency texts. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., adaptation of techniques and training parameters to increase the Bilingual Evaluation Understudy (BLEU) score for maximum performance. A further aim was to create additional bilingual and monolingual data resources by using available online data and by obtaining and mining comparable corpora for parallel sentence pairs. For this task, a methodology employing a Support Vector Machine and the Needleman-Wunsch algorithm was used, along with a chain of specialized tools.

4 June 2018

Piotr Przybyła (University of Manchester)

https://www.youtube.com/watch?v=thHOtqsfsys Supporting document screening for systematic reviews using machine learning and text mining  The talk delivered in Polish.

Systematic reviews, aiming to aggregate and analyse all the literature for a given research question, are a crucial tool in medical research. Their most laborious stage is screening, i.e. manual selection of dozens of relevant articles from thousands returned by search engines. Formulating the problem as a text classification task and using appropriate unsupervised text mining tools could lead to significant work saved. The presentation will cover adaptation of machine learning algorithms to the problem, tools for extracting and visualising terms and topics in collections, system deployment and evaluation at NICE (National Institute for Health and Care Excellence), a UK agency publishing health technology guidelines.

11 June 2018

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=mc8T5rXlk1I Preparing a speech corpus using the recordings of the Polish Film Chronicle  Talk delivered in Polish. Slides in English.

The presentation will describe how a speech corpus based on the Polish Film Chronicle, a collection of short historical news segments, was created during the CLARIN-PL project. This resource is an extremely useful tool for linguistic research, specifically in the context of historical speech and language. The years 1945–1960 were chosen for this purpose. The presentation will discuss various topics: from the legal issues of acquiring the resources to the more technical aspects of adapting speech analysis tools to this rather uncommon domain.



Natural Language Processing Seminar 2016–2017

10 October 2016

Katarzyna Pakulska, Barbara Rychalska, Krystyna Chodorowska, Wojciech Walczak, Piotr Andruszkiewicz (Samsung)

Paraphrase Detection Ensemble – SemEval 2016 winner  Talk delivered in Polish. Slides in English.

This seminar describes the winning solution designed for a core track of the SemEval 2016 English Semantic Textual Similarity task. The goal of the competition was to measure semantic similarity between two given sentences on a scale from 0 to 5, in a way that replicates human language understanding. The presented model is a novel hybrid of recursive auto-encoders (RAE) from deep learning and a WordNet award-penalty system, enriched with a number of other similarity models and features used as input for Linear Support Vector Regression.
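
A hedged sketch of the general ensemble idea, feeding heterogeneous similarity signals into Linear Support Vector Regression: the concrete feature values and training data below are invented for illustration and are not the competition features.

```python
# Sketch of the ensemble idea: several similarity signals per sentence pair
# are combined by Linear Support Vector Regression into a 0-5 score.
import numpy as np
from sklearn.svm import LinearSVR

# Each row: [RAE similarity, WordNet award-penalty score, word overlap, length ratio]
X_train = np.array([
    [0.91, 0.80, 0.75, 0.95],
    [0.40, 0.10, 0.20, 0.60],
    [0.70, 0.55, 0.50, 0.85],
])
y_train = np.array([4.8, 1.0, 3.2])  # gold similarity on the 0-5 scale (invented)

model = LinearSVR(C=1.0, max_iter=10000).fit(X_train, y_train)

X_new = np.array([[0.85, 0.70, 0.60, 0.90]])
print(float(np.clip(model.predict(X_new)[0], 0.0, 5.0)))
```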

24 October 2016

Adam Przepiórkowski, Jakub Kozakoszczak, Jan Winkowski, Daniel Ziembicki, Tadeusz Teleżyński (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Corpus of formalized textual entailment steps  Talk delivered in Polish.

The authors present resources created within the CLARIN project to help with the qualitative evaluation of RTE systems: two textual derivation corpora and a corpus of textual entailment rules. A textual derivation is a series of atomic steps which connects the Text with the Hypothesis in a textual entailment pair. The original pairs are taken from the FraCaS corpus and a Polish translation of the RTE3 corpus. A textual entailment rule sanctions the textual entailment relation between the input and the output of a step, using syntactic patterns written in the UD standard and other semantic, logical and contextual constraints expressed in FOL.

7 November 2016

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Concordia – translation memory search algorithm  Talk delivered in Polish.

The talk covers the Concordia algorithm which is used to maximize the productivity of a human translator. The algorithm combines the features of standard fuzzy translation memory searching with a concordancer. As the key non-functional requirement of computer-aided translation mechanisms is performance, Concordia incorporates upgraded versions of standard approximate searching techniques, aiming at reducing the computational complexity.
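
For readers unfamiliar with fuzzy translation-memory lookup, a toy version using only the standard library might look like the sketch below. Concordia itself relies on optimized approximate-search structures; the memory contents and threshold here are invented for illustration.

```python
# Toy fuzzy translation-memory lookup using a normalized similarity ratio.
from difflib import SequenceMatcher

translation_memory = {
    "The contract enters into force on the day of signature.":
        "Umowa wchodzi w życie z dniem podpisania.",
    "The invoice shall be paid within 30 days.":
        "Faktura zostanie opłacona w ciągu 30 dni.",
}

def best_match(query, memory, threshold=0.7):
    """Return the most similar source segment and its translation, if any."""
    best = max(memory, key=lambda s: SequenceMatcher(None, query.lower(), s.lower()).ratio())
    score = SequenceMatcher(None, query.lower(), best.lower()).ratio()
    return (best, memory[best], score) if score >= threshold else None

print(best_match("The contract enters into force upon signature.", translation_memory))
```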

21 November 2016

Norbert Ryciak, Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=hGKzZxFa0ik Using recursive deep neural networks and syntax to compute phrase semantics  Talk delivered in Polish.

The seminar presents initial experiments on recursive phrase-level sentiment computation using dependency syntax and deep learning. We discuss neural network architectures and implementations created within Clarin 2 and present results on English language resources. The seminar also covers ongoing work on Polish language resources.

5 December 2016

Dominika Rogozińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Methods of syntax disambiguation for constituent parse trees in Polish as post–processing phase of the Świgra parser  Talk delivered in Polish.

The presentation shows methods of syntactic disambiguation for Polish utterances produced by the Świgra parser. The presented methods include probabilistic context-free grammars and maximum entropy models. The best of the described models achieves an effectiveness of 96.2%. The outcome of our experiments is a module for post-processing Świgra's parses.
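
A minimal sketch of PCFG-style disambiguation (the grammar, probabilities and candidate trees are invented, not the Świgra setup): each candidate analysis is scored by the product of the probabilities of the rules it uses, and the highest-scoring tree is kept.

```python
# Minimal PCFG disambiguation sketch: score candidate trees by the product
# of rule probabilities (in log space) and keep the best one.
import math

RULE_LOGPROB = {
    ("NP", ("Adj", "N")): math.log(0.3),
    ("NP", ("N",)): math.log(0.3),
    ("NP", ("NP", "Conj", "NP")): math.log(0.2),
    ("NP", ("Adj", "NP")): math.log(0.2),
}

def tree_logprob(tree):
    """Tree = (label, children...), with plain strings as leaves."""
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return 0.0  # lexical rules ignored in this toy example
    rhs = tuple(c[0] for c in children)
    return RULE_LOGPROB[(label, rhs)] + sum(tree_logprob(c) for c in children)

# Two analyses of the same phrase "stare koty i psy" (scope of the adjective).
candidates = [
    ("NP", ("NP", ("Adj", "stare"), ("N", "koty")), ("Conj", "i"), ("NP", ("N", "psy"))),
    ("NP", ("Adj", "stare"),
           ("NP", ("NP", ("N", "koty")), ("Conj", "i"), ("NP", ("N", "psy")))),
]
print(max(candidates, key=tree_logprob))
```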

9 January 2017

Agnieszka Pluwak (Institute of Slavic Studies, Polish Academy of Sciences)

Building a domain-specific knowledge representation using an extended method of frame semantics on a corpus of Polish, English and German lease agreements  Talk delivered in Polish.

The FrameNet project is defined by its authors as a lexical base with some ontological features (not an ontology sensu stricto, however, due to its selective approach to the description of frames and lexical units, as well as frame-to-frame relations). Ontologies, as knowledge representations in the field of NLP, should be implementable for specific domains and texts; however, in the FrameNet bibliography published before January 2016 I have not found a single knowledge representation based entirely on frames or on an extensive structure of frame-to-frame relations. I did find a few examples of domain-specific knowledge representations using selected FrameNet frames, such as BioFrameNet or Legal FrameNet, where frames were applied to connect data from different sources. Therefore, in my dissertation, I decided to conduct an experiment and build a knowledge representation of frame-to-frame relations for the domain of lease agreements. The aim of my study was the description of frames useful for building a possible data extraction system for lease agreements, that is, frames containing answers to questions asked by a professional analyst while reading lease agreements. In my work I asked several questions, e.g.: Would I be able to use FrameNet frames for this purpose, or would I have to build my own frames? Will the analysis of Polish cause language-specific problems? How will the professional language affect the use of frames in context?

23 January 2017

Marek Rogalski (Lodz University of Technology)

Automatic paraphrasing  Talk delivered in Polish.

Paraphrasing is conveying the essential meaning of a message using different words. The ability to paraphrase is a measure of understanding: a teacher asking a student "could you please tell us in your own words ..." is testing whether the student has understood the topic. In this presentation we will discuss the task of automatic paraphrasing. We will differentiate between syntax-level paraphrases and essential-meaning-level paraphrases. We will bring up several techniques from seemingly unrelated fields that can be applied to automatic paraphrasing. We will also show the results we have been able to produce with those techniques.

6 February 2017

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=TP9pmPKla1k Korpusomat – a tool for creating one's own searchable corpora  Talk delivered in Polish.

Korpusomat is a web tool facilitating the unassisted creation of corpora for linguistic studies. After a set of text files is uploaded, they are automatically morphologically analysed and lemmatised using Morfeusz and disambiguated using the Concraft tagger. The resulting corpus can then be downloaded and analysed offline using the Poliqarp search engine to query for information related to text segmentation, base forms, inflectional interpretations and (dis)ambiguities. Poliqarp is also capable of calculating frequencies and applying the basic statistical measures necessary for quantitative analysis. Apart from plain text files, Korpusomat can also process more complex textual formats such as popular EPUBs, download source data from the Internet, strip unnecessary information and extract document metadata.

20 February 2017 (invited talk at the Institute seminar)

Elżbieta Hajnicz (Institute of Computer Science, Polish Academy of Sciences)

https://youtu.be/lDKQ9jhIays Representation language of the valency dictionary Walenty  The talk delivered in Polish.

The Polish Valence Dictionary (Walenty) is intended to be used by natural language processing tools, particularly parsers, and thus it offers formalized representation of the valency information. The talk presented the notion of valency and its representation in the dictionary along with examples illustrating how particular syntactic and semantic language phenomena are modelled.

2 March 2017

Wojciech Jaworski (University of Warsaw)

https://youtu.be/VgCsXsicoR8 Integration of dependency parser with a categorial parser  Talk delivered in Polish.

In the talk I will describe the division of texts into sentences and the control of the execution of each parser within the hybrid parser emerging in the Clarin-bis project. I will describe the adopted method of converting dependency structures to make them compatible with the structures of the categorial parser. The conversion has two aspects: changing the attributes of each node and changing the links between nodes. I will show how the method can be extended to convert the compressed forests generated by the Świgra parser. At the end I will talk about the plans and goals of the reimplementation of the MateParser algorithm.

13 March 2017

Marek Kozłowski, Szymon Roziewski (National Information Processing Institute)

https://youtu.be/3mtjJfI3HkU Internet model of Polish and semantic text processing  Talk delivered in Polish.

The presentation shows how BabelNet (a multilingual encyclopaedia and semantic network based on publicly available data sources such as Wikipedia and WordNet) can be used for grouping short texts, sentiment analysis and emotional profiling of movies based on their subtitles. The second part presents work based on CommonCrawl, a publicly available, petabyte-scale open repository of multilingual Web pages. CommonCrawl was used to build two models of Polish: one n-gram-based and one based on semantic distribution.

20 March 2017

Jakub Szymanik (University of Amsterdam)

https://www.youtube.com/watch?v=OzftWhtGoAU Semantic Complexity Influences Quantifier Distribution in Corpora  Talk delivered in Polish. Slides in English.

In this joint work with Camilo Thorne, we study whether semantic complexity influences the distribution of generalized quantifiers in a large English corpus derived from Wikipedia. We take the minimal computational device recognizing a generalized quantifier as the core measure of its semantic complexity. We consider quantifiers belonging to three increasingly complex classes: Aristotelian (recognizable by 2-state acyclic finite automata), counting (recognizable by k+2-state finite automata), and proportional quantifiers (recognizable by pushdown automata). Using regression analysis, we show that semantic complexity is a statistically significant factor explaining 27.29% of the frequency variation. We compare this impact to that of other known sources of complexity, both semantic (quantifier monotonicity and the comparative/superlative distinction) and superficial (e.g., the length of quantifier surface forms). In general, we observe that the more complex a quantifier is, the less frequent it is.
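
A minimal sketch of the kind of regression involved, with invented toy numbers rather than the paper's data: log corpus frequency is regressed on a numeric encoding of the complexity class plus a surface-length control, and the fit is summarised by R².

```python
# Toy regression in the spirit of the study; all numbers are invented.
import numpy as np

# complexity class: 0 = Aristotelian, 1 = counting, 2 = proportional
complexity = np.array([0, 0, 1, 1, 2, 2])
length     = np.array([3, 4, 8, 10, 9, 12])      # characters in the surface form
log_freq   = np.array([9.1, 8.7, 6.2, 5.9, 4.8, 4.1])

X = np.column_stack([np.ones(len(length)), complexity, length])
coef, *_ = np.linalg.lstsq(X, log_freq, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((log_freq - pred) ** 2) / np.sum((log_freq - log_freq.mean()) ** 2)
print("coefficients:", coef, "R^2:", round(r2, 3))
```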

27 March 2017 (invited talk at the institute seminar)

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=onaYI6XY1S4 Introduction to deep neural networks  Talk delivered in Polish.

In the last few years, Deep Neural Networks (DNNs) have become a tool that provides the best solutions to many problems in image and speech recognition. In natural language processing, DNNs have likewise revolutionized how translation, word representation and many other problems are approached. This presentation aims to provide good intuitions about DNNs, their core architectures and how they operate. I will discuss and suggest tools and source materials that can help in further exploration of the topic and in independent experiments.

3 April 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part I)  Talk delivered in English.

Argumentation, the most prominent way people communicate, has been attracting attention since the very beginnings of scientific reflection. The Centre for Argument Technology has been developing the infrastructure for studying argument structures for almost two decades. Our approach demonstrates several characteristics. First, we build upon the graph-based standard for argument representation, the Argument Interchange Format AIF (Rahwan et al., 2007), and Inference Anchoring Theory IAT (Budzynska and Reed, 2011), which allows us to capture the dialogic context of argumentation. Second, we focus on a variety of aspects of argument structures, such as argumentation schemes (Lawrence and Reed, 2016); the illocutionary intentions speakers associate with arguments (Budzynska et al., 2014a); the ethos of arguments' authors (Duthie et al., 2016); the rephrase relation which paraphrases parts of argument structures (Konat et al., 2016); and protocols of argumentative dialogue games (Yaskorska and Budzynska, forthcoming).

10 April 2017

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=6H9oUYsfaw8 Neural nets for natural language processing – selected architectures and problems  Talk delivered in Polish.

For the last few years more and more problems in NLP have been successfully tackled with neural nets, particularly with deep architectures, including sentiment analysis, topic classification, coreference, word representations and image labelling. In this talk I will give some details on the most promising architectures used in NLP, including recurrent and convolutional nets. The presented solutions will be given in the context of a concrete problem, namely coreference resolution in Polish.

15 May 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part II)  Talk delivered in English.

In the second part of our presentation we will describe characteristics of argument structures using examples from our AIF corpora of annotated argument structures in various domains and genres (see also OVA+ annotation tool) including moral radio debates (Budzynska et al., 2014b); Hansard records of the UK parliamentary debates (Duthie et al., 2016); e-participation (Konat et al., 2016; Lawrence et al., forthcoming); and the US 2016 presidential debates (Visser et al., forthcoming). Finally, we will show how such complex argument structures, which on the one hand make the annotation process more time-consuming and less reliable, can on the other hand result in automatic extraction of a variety of valuable information when applying technologies for argument mining (Budzynska and Villata, 2017; Lawrence and Reed, forthcoming) and argument analytics (Reed et al., forthcoming).

12 June 2017 (invited talk at the Institute seminar)

Adam Pawłowski (University of Wroclaw)

https://www.youtube.com/watch?v=RNIThH3b4uQ Sequential structures in texts  Talk delivered in Polish.

The subject of my lecture is the phenomenon of sequentiality in linguistics. Sequentiality is defined here as a characteristic feature of a text or of a collection of texts which expresses the sequential relationship between units of the same type, ordered along the axis of time or according to a different variable (e.g. the sequence of reading or publishing). In order to model sequentiality thus understood, we can use, among others, time series, spectral analysis, the theory of stochastic processes, information theory or some tools of acoustics. Referring to both my own research and the existing literature, in my lecture I will present sequential structures and selected models thereof in continuous texts, as well as models used in relation to sequences of several texts (known as chronologies of works); I will also mention glottochronology, a branch of quantitative linguistics that aims at the mathematical modeling of the development of language over long periods of time. Finally, I will relate to philosophical attempts to elucidate sequentiality (the notion of the text's 'memory', the result chain, Pythagoreanism, Platonism).



Natural Language Processing Seminar 2015–2016

12 October 2015

Vincent Ng (University of Texas at Dallas)

Beyond OntoNotes Coreference  The talk delivered in English.

Recent years have seen considerable progress on the notoriously difficult task of coreference resolution owing in part to the availability of coreference-annotated corpora such as MUC, ACE, and OntoNotes. Coreference, however, is more than MUC/ACE/OntoNotes coreference: it encompasses many interesting cases of anaphora that are not covered in the extensively investigated MUC/ACE/OntoNotes entity coreference task. This talk examined several comparatively less-studied coreference tasks that were arguably no less challenging than the MUC/ACE/OntoNotes entity coreference task, including the Winograd Schema Challenge, zero anaphora resolution, and event coreference resolution.

26 October 2015

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish  The talk delivered in Polish.

The author presented the parser being developed within the CLARIN-PL project, its morphological pre-processing, a categorial grammar of Polish integrated with the valency dictionary and used by the parser, and the semantic graph formalism used for meaning representation. He also discussed the algorithms used by the parser and its optimization strategies, related both to performance and to the concise representation of ambiguous syntactic and semantic parse trees.

16 November 2015

Izabela Gatkowska (Jagiellonian University in Kraków)

The Empirical Network of Lexical Links  The talk delivered in Polish.

The empirical network of lexical links is the result of an experiment using the human associative mechanism: the person who is the subject of the research says the first word that comes to his or her mind after understanding the stimulus word. The study was conducted in a cyclical manner, i.e. response words obtained in the first cycle were used as stimuli in the second cycle, which enabled the creation of a semantic network that differs both from networks built from text corpora, e.g. WORTSCHATZ, and from networks constructed by hand, e.g. WordNet. The empirically obtained links between words in the network have a direction and a strength. The set of incoming and outgoing connections in which a specific word participates creates its lexical node (subnetwork). The manner in which the network characterizes meaning is shown on the example of feedback connections, which are a specific case of the dependencies that appear between two words in a lexical node. A qualitative analysis in terms of the semantic lexical relations known in linguistics, and employed for example in the WordNet dictionary, permits an interpretation of only approximately 25% of the feedback links. The remaining links may be interpreted by referring to the model of meaning description proposed in the FrameNet dictionary. A qualitative interpretation of all the links found in a lexical node may permit a comparative study of lexical network nodes constructed experimentally for different natural languages, and may also allow a separation of the empirical semantic models employed by the same set of links found between nodes in a given network.

30 November 2015

Dora Montagna (Universidad Autónoma de Madrid)

Semantic representation of a polysemous verb in Spanish  The talk delivered in English.

The author presented a theoretical model of representation of meaning, based on Pustejovsky's theory of the Generative Lexicon. The proposal is intended as a base for automatic disambiguation, but also as a new model of lexicographic description. The model will be applied to a highly productive verb in Spanish, assuming the hypothesis of verbal underspecification in order to establish patterns of semantic behaviors.

7 December 2015

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences), Witold Kieraś (University of Warsaw)

Morphosyntactic tagging of Polish – state of the art and future perspectives  The talk delivered in Polish.

During the presentation, the state of the art in automatic morphosyntactic tagging of Polish texts was discussed, with a particular focus on an analysis of the performance of publicly available tools that can be used in real applications. A qualitative and quantitative analysis of the errors made by the taggers was conducted, along with a discussion of the possible causes of and solutions to these problems. Tagging results for Polish were compared and contrasted with the results for other European languages.

8 December 2015

Salvador Pons Bordería (Universitat de València)

Discourse Markers from a pragmatic perspective: The role of discourse units in defining functions  The talk delivered in English.

One of the most disregarded aspects in the description of discourse markers is position. Notions such as "initial position" or "final position" are meaningless unless it can be specified with regard to what a DM is "initial" or "final". The presentation defended the idea that, for this question to be answered, appeal must be made to the notion of "discourse unit". Provided with a set of a) discourse units, and b) discourse positions, determining the function of a given DM is quasi-automatic.

11 January 2016

Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Terminology extraction from Polish data – program TermoPL  The talk delivered in Polish.

The presentation addressed the problems of terminology extraction from Polish domain corpora. The authors described the C-value method, which ranks term candidates based on a frequency measure and the number of term contexts. The method takes into account nested terms that may not appear by themselves in the data. Using this method, however, one obtains nested grammatical subphrases that are syntactically correct but semantically odd, like 'USG jamy' ('USG of cavity'). The recognition of nested terms is therefore supported by word connection strength, which allows truncated phrases to be eliminated from the top part of the term list. The talk concluded with a demo of the TermoPL tool.
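
For reference, a compact sketch of C-value ranking as commonly formulated (following the standard Frantzi-Ananiadou definition as I understand it; the candidate phrases and counts below are invented): nested candidates are penalised by the frequency of the longer terms that contain them, which is how a truncated phrase like 'USG jamy' is pushed down the list.

```python
# Sketch of C-value term ranking; candidate counts are invented.
import math
from collections import defaultdict

# candidate phrase (as a tuple of words) -> raw corpus frequency
freq = {
    ("badanie", "USG"): 30,
    ("badanie", "USG", "jamy", "brzusznej"): 25,
    ("USG", "jamy"): 25,   # truncated, semantically odd candidate
}

def c_value(freq):
    scores = {}
    # For every candidate, collect the longer candidates that contain it.
    nests = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1)):
                nests[a].append(b)
    for a, f in freq.items():
        weight = math.log2(len(a))  # multi-word candidates assumed (log2(1) = 0 otherwise)
        if nests[a]:
            # Subtract the average frequency of the containing terms.
            scores[a] = weight * (f - sum(freq[b] for b in nests[a]) / len(nests[a]))
        else:
            scores[a] = weight * f
    return scores

for term, score in sorted(c_value(freq).items(), key=lambda kv: -kv[1]):
    print(f"{score:6.2f}  {' '.join(term)}")
```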

25 January 2016

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish: integration with lexical resources, parsing  The talk delivered in Polish.

During the lecture the author presented the integration of the syntactic-semantic parser with SGJP, Polimorf, Słowosieć and Walenty, as well as preliminary observations concerning the impact that checking semantic preferences has on parsing. He also described the categorial formalism used for parsing and briefly presented how the parser works.

22 February 2016

Witold Dyrka (Wrocław University of Technology)

Language(s) of proteins? – premises, contributions and perspectives  The talk delivered in Polish.

In his talk the author presented arguments in favour of treating protein sequences, or higher protein structures, as sentences in some language(s). He then showed several interesting results (his own and others') of the application of quantitative methods of text analysis and formal linguistics tools (such as probabilistic context-free grammars) to the analysis of proteins. Finally, he presented plans for his further work on "protein linguistics", which, as he hopes, will inspire an interesting discussion.

22 February 2016

Linguistic Engineering Group (Institute of Computer Science, Polish Academy of Sciences)

Extended seminar  Series of short lectures in Polish presenting Linguistic Engineering Group research topics.

12:00–12:15: People, projects, tools

12:15–12:45: Morfeusz 2: analyzer and inflectional synthesizer for Polish

12:45–13:15: Toposław: Creating MWU lexicons

13:15–13:45: Lunch break

13:45–14:15: TermoPL: Terminology extraction from Polish data

14:15–14:45: Walenty: Valency dictionary of Polish

14:45–15:15: POLFIE: LFG grammar for Polish

7 March 2016

Zbigniew Bronk (Grammatical Dictionary of Polish team member)

JOD – a markup language for Polish declension  The talk delivered in Polish.

JOD, a markup language for Polish declension, was constructed in order to precisely describe the inflectional rules and schemes for nouns and adjectives in Polish. Its first application was the description of the inflection of surnames, taking into account the sex of the person or persons using the given surname. This model has been the basis for the "Automaton of declension of Polish surnames". The author presented the general idea of the language and the implementation of its interpreter, as well as the JOD editor and the website "Automaton of declension of Polish surnames".

21 March 2016

Bartosz Zaborowski, Aleksander Zabłocki (Institute of Computer Science, Polish Academy of Sciences)

Poliqarp2 on the home straight  The talk delivered in Polish.

In this talk the authors present Poliqarp 2, a linguistic data search engine on which they have been working for the last three years. They describe both technical aspects and features interesting from the user's point of view. They briefly recall the data model supported by the engine, the structure of the language supported by the new query engine, its expressive power, and the differences compared to the previous version. In particular, they focus on elements added or modified during the development of the project (support for the Składnica and LFG data models, post-processing, syntactic sugar). On the technical side, they briefly present the software architecture and some details of the implementation of the indexes. They also describe non-trivial decisions related to input data processing (the National Corpus of Polish in particular). They end the talk by presenting the results of preliminary efficiency measurements.

4 April 2016

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Identification of opinion targets in Polish  The talk delivered in Polish.

The seminar concluded and summarised the results of a National Science Centre (NCN) grant completed in January 2016. It presented three resources with labelled sentiments and opinion targets developed within the project: a bank of dependency trees created from a corpus of product reviews, a subset of the Składnica dependency treebank, and a collection of tweets. The seminar included a discussion of experiments on the automated recognition of opinion targets. These involved the use of two parsing methods, dependency and shallow, and a hybrid method in which the results of syntactic analysis were used by statistical models (e.g. CRF).
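
As a rough illustration of the statistical component (not the project's actual models or data), opinion-target recognition can be cast as BIO sequence labelling over tokens, for example with a CRF; the sketch below assumes the third-party sklearn-crfsuite package, and the features, sentences and hyperparameters are invented.

```python
# Illustrative BIO sequence labelling of opinion targets with a CRF.
# Requires the third-party sklearn-crfsuite package; data is invented.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_upper": word[0].isupper(),
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i + 1 < len(sent) else "</s>",
    }

train_sents = [["Bateria", "trzyma", "bardzo", "długo"],
               ["Ekran", "jest", "słaby"]]
train_tags = [["B-TARGET", "O", "O", "O"],
              ["B-TARGET", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

test = ["Bateria", "jest", "świetna"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```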

21 April 2016 (Thursday)

Magdalena Derwojedowa (University of Warsaw)

“Tem lepiej, ale jest to interes miljonowy i traktujemy go poważnie” – A thousand words a thousand times in 5 parts  The talk delivered in Polish.

The talk presented the 1M corpus of the project „Automatic morphological analysis of Polish texts from the 1830–1918 period with respect to the evolution of inflection and spelling" (DEC-2012/07/B/HS2/00570), the structure of the corpus, its stylistic, temporal and regional diversity, as well as the resource's inflectional characteristics in comparison with the features described in Bajerowa's works.

9 May 2016

Daniel Janus (Rebased.pl)

From unstructured data to searchable metadata-rich corpus: Skyscraper, P4, Smyrna  The talk delivered in Polish.

The presentation described tools facilitating the construction of custom datasets, in particular corpora of texts. The author presented Skyscraper, a library for scraping structured data out of whole WWW sites, and Smyrna, a concordancer for Polish texts enriched with metadata. In addition, a dataset built using these tools was presented: the Polish Parliamentary Proceedings Processor (PPPP, or P4), including, inter alia, a continuously updated corpus of speeches in the Polish parliament. The presentation largely focused on the technical solutions used in the tools shown.

19 May 2016 (Thursday)

Kamil Kędzia, Konrad Krulikowski (University of Warsaw)

Generating paraphrases' templates for Polish using parallel corpus  The talk delivered in Polish.

Software for generating paraphrases in Polish was prepared within the CLARIN-PL project. The developers will demonstrate how it works on chosen examples. They will also explain the method of Ganitkevitch et al. (2013), which allowed its authors to create the openly available Paraphrase Database (PPDB). Furthermore, they will discuss its enhancements and the approach to the challenges specific to the Polish language. Additionally, they will demonstrate a way of measuring paraphrase quality.
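
The bilingual pivoting idea behind PPDB can be summarised in a few lines: two Polish phrases are paraphrase candidates when they translate to the same English phrase, scored as p(p2|p1) = sum over e of p(e|p1)·p(p2|e). The toy phrase-table probabilities below are invented and do not come from the project.

```python
# Toy sketch of bilingual pivoting for paraphrase extraction.
from collections import defaultdict

# p(english | polish) and p(polish | english) from an invented phrase table
p_e_given_pl = {
    "w trakcie": {"during": 0.7, "in the course of": 0.3},
    "podczas":   {"during": 0.8, "while": 0.2},
}
p_pl_given_e = {
    "during": {"w trakcie": 0.4, "podczas": 0.6},
    "in the course of": {"w trakcie": 0.9},
    "while": {"podczas": 0.5, "gdy": 0.5},
}

def paraphrase_scores(phrase):
    """Score candidate paraphrases of `phrase` by pivoting through English."""
    scores = defaultdict(float)
    for eng, p_e in p_e_given_pl[phrase].items():
        for pl2, p_pl in p_pl_given_e.get(eng, {}).items():
            if pl2 != phrase:
                scores[pl2] += p_e * p_pl
    return dict(scores)

print(paraphrase_scores("w trakcie"))  # e.g. {'podczas': 0.42}
```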

23 May 2016

Damir Ćavar (Indiana University)

The Free Linguistic Environment  The talk delivered in English.

The Free Linguistic Environment (FLE) started as a project to develop an open and free platform for white-box modeling and grammar engineering, i.e. the development of natural language morphologies, prosody, syntax, and semantic processing components based, for example, on theoretical frameworks like two-level morphology, Lexical Functional Grammar (LFG), Glue Semantics, and similar. FLE provides a platform that makes use of some classical algorithms as well as new approaches based on Weighted Finite State Transducer models to enable probabilistic modeling and parsing at all linguistic levels. Currently its focus is to provide a platform compatible with LFG and an extended version of it that we call Probabilistic Lexical Functional Grammar (PLFG). This probabilistic modeling can apply to the c(onstituent)-structure component, i.e. a Context Free Grammar (CFG) backbone can be extended to a Probabilistic Context Free Grammar (PCFG). Probabilities in PLFG can also be associated with structural representations and corresponding f(unctional feature)-structures or semantic properties, i.e. structural and functional properties and their relations can be modeled using weights that can represent probabilities or other forms of complex scores or metrics. In addition to these extensions of the LFG framework, FLE also provides an open platform for experimenting with algorithms for semantic processing or analyses based on (probabilistic) lexical analyses, c- and f-structures, or similar representations. Its architecture is extensible to cope with different frameworks, e.g. dependency grammar, optimality-theory-based approaches, and many more.

6 June 2016

Karol Opara (Systems Research Institute of the Polish Academy of Sciences)

Grammatical rhymes in Polish poetry – a quantitative analysis  The talk delivered in Polish.

Polish is a highly inflected language, and parts of speech in the same morphological form have common endings. This allows one to easily find a multitude of rhyming words known as grammatical rhymes. Their overuse is strongly discouraged in the contemporary Polish literary canon due to their alleged banality. The talk presented the results of computer-aided investigations into poets' technical mastery, based on estimating the share of grammatical rhymes in their verses. A method of automatic rhyme detection was discussed, as well as the extraction of statistical information from texts and a new "literary" criterion for choosing the sample size for statistical tests. Finally, a ranking of the technical mastery of various Polish poets was presented.
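
A crude sketch of the detection idea (a heuristic for illustration, not the method from the talk): compare the endings of line-final words, and flag pairs that rhyme through a shared inflectional ending as candidate grammatical rhymes. The list of endings and the suffix length are illustrative assumptions.

```python
# Crude heuristic sketch of rhyme detection; endings list and thresholds
# are illustrative assumptions, not the method presented in the talk.

# A few common Polish inflectional endings (far from exhaustive).
INFLECTIONAL_ENDINGS = ("ania", "enia", "ami", "iła", "yła", "ego", "emu")

def rhymes(w1, w2, n=2):
    """Very rough test: the two words share their last n letters."""
    w1, w2 = w1.lower(), w2.lower()
    return w1 != w2 and len(w1) >= n and len(w2) >= n and w1[-n:] == w2[-n:]

def is_grammatical_rhyme(w1, w2):
    """A rhyme carried by a shared inflectional ending."""
    return rhymes(w1, w2) and any(
        w1.lower().endswith(e) and w2.lower().endswith(e) for e in INFLECTIONAL_ENDINGS
    )

pairs = [("kochania", "śpiewania"), ("noc", "moc")]
for a, b in pairs:
    print(a, b, "rhyme:", rhymes(a, b), "grammatical:", is_grammatical_rhyme(a, b))
```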

See the talks given between 2000 and 2015 and the current schedule.