Natural Language Processing Seminar 2023–2024

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.

9 October 2023

Agnieszka Mikołajczyk-Bareła, Wojciech Janowski (VoiceLab), Piotr Pęzik (University of Łódź / VoiceLab), Filip Żarnecki, Alicja Golisowicz (VoiceLab)

https://www.youtube.com/watch?v=q5nCUwhj2us TRURL.AI: Fine-tuning large language models on multilingual instruction datasets  Talk delivered in Polish.

This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived, in a number of areas such as information extraction from customer-agent interactions and data labeling, with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.

16 October 2023

Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki (Wrocław University of Science and Technology)

https://www.youtube.com/watch?v=ehBE6qTKlcM Evaluation of information retrieval models in zero-shot settings on different documents domains  Talk delivered in English.

Information retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in open-domain settings, but current architectures and extensive datasets allow neural models to surpass lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.
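For readers unfamiliar with the lexical baseline mentioned in the abstract, the following is a minimal sketch of Okapi BM25 scoring. It is not taken from the talk: the function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.5` and `b=0.75` are illustrative assumptions, and the IDF uses the common smoothed variant with `+ 1` inside the logarithm.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against a tokenised query with Okapi BM25.

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it positive for frequent terms).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection, whitespace-tokenised.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "information retrieval with lexical matching".split(),
]
scores = bm25_scores("lexical retrieval".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Documents sharing no term with the query score zero, which is the limitation that trained neural retrievers aim to overcome.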

30 October 2023

Agnieszka Faleńska (University of Stuttgart)

https://www.youtube.com/watch?v=6Kgj0N4MvIA Steps towards Bias-Aware NLP Systems  Talk in English.

For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.

In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.

13 November 2023

Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences)

Advancing Polish Question Answering: Datasets and Models  Talk delivered in Polish. Slides in English.

Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the PolQA and MAUPQA datasets and the Silver Retriever model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.

11 December 2023 (a series of short invited talks by Coventry University researchers)

Xiaorui Jiang, Opeoluwa Akinseloyin, Vasile Palade (Coventry University)

https://www.youtube.com/watch?v=_BnuR3fY1FY Towards More Human-Effortless Systematic Review Automation  Talk in English.

Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, for example in citation screening, i.e., the selection of primary studies based on title and abstract. Some tools exist, but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic reviews, namely that each review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported.

Kacper Sówka (Coventry University)

https://www.youtube.com/watch?v=Of8-cfhvzXU Attack Tree Generation Using Machine Learning  Talk in English.

My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling.

Xiaorui Jiang (Coventry University)

https://www.youtube.com/watch?v=UCiOk0AZa0M Towards Semantic Science Citation Index  Talk in English.

It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which is believed to serve as a certain type of discourse model for summarisation (in future work). As a small step from it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.

Xiaorui Jiang, Alireza Daneshkhah (Coventry University)

https://www.youtube.com/watch?v=5z7rdnafpjU Natural Language Processing for Automated Triaging at NHS  Talk in English.

In the face of a post-COVID global economic slowdown and an aging society, the primary care units in the National Health Service (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in primary care settings. A large portion of clinical diagnosis and management can be assisted by AI tools that automate tasks and reduce delays. This short presentation reports initial studies, conducted with an NHS partner, on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early alert application for gout flare prediction from chief complaints (to avoid delays in patient treatment and management).

8 January 2024

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=W_A8W_Hu73I ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project  Talk delivered in Polish.

The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the ParlaMint project with their corresponding audio and video footage available online. The task was naturally more involved than it may intuitively seem, and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.

12 February 2024

Tsimur Hadeliya, Dariusz Kajtoch (Allegro ML Research)

https://www.youtube.com/watch?v=b8FE2_lzfE8 Evaluation and analysis of in-context learning for Polish classification tasks  Talk in English.

With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.
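As an illustration of the idea described in the abstract, the sketch below assembles a few-shot prompt from an instruction, labelled demonstrations, and a new input. It is not the speakers' setup: the function `build_icl_prompt`, the `Text:`/`Label:` template, and the example reviews are all hypothetical, and the call to an actual generative model is deliberately left out.

```python
def build_icl_prompt(instruction, demonstrations, query):
    """Build a few-shot classification prompt.

    demonstrations is a list of (text, label) pairs shown to the model;
    the prompt ends with an open "Label:" slot for the model to complete.
    """
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)

# Hypothetical Polish sentiment demonstrations (invented for this sketch).
demos = [
    ("Świetny produkt, polecam!", "positive"),
    ("Fatalna jakość, nie kupujcie.", "negative"),
]
prompt = build_icl_prompt(
    "Classify the sentiment of each Polish review as positive or negative.",
    demos,
    "Dostawa była szybka i bezproblemowa.",
)
```

The choice and order of demonstrations are part of the prompt design, which is why the abstract treats selecting suitable demonstrations as a task in its own right.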

29 February 2024

Seminar on analysis of parliamentary data  All talks in Polish.

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Polish Parliamentary Corpus and ParlaMint corpus

Bartłomiej Klimowski (University of Warsaw)

Application to analyse the sentiment of utterances of Polish MPs

Konrad Kiljan (University of Warsaw), Ewelina Gajewska (Warsaw University of Technology)

Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine

Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences), Anna Jamka (University of Warsaw)

Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus

Marek Łaziński (University of Warsaw)

Changes in the Polish language of the last hundred years in the mirror of parliamentary debates

25 March 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=IS_Miy2o8-A Are text credibility classifiers robust to adversarial actions?  Talk in Polish.

Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that authors of such texts often have a strong motivation to spread them and might try to modify the original content until they find a reformulation that gets through automatic filters. Such modified variants of original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at the CheckThat! evaluation lab at CLEF 2024.
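To make the notion of an adversarial example concrete, here is a toy sketch of a single-word substitution attack. It does not use or reproduce BODEGA: `toy_classifier`, `find_adversarial`, the trigger words, and the synonym table are all hypothetical stand-ins chosen so that the example is self-contained.

```python
def toy_classifier(text):
    """Hypothetical stand-in classifier: flags text containing a trigger word."""
    triggers = {"hoax", "fake"}
    words = text.lower().split()
    return "non-credible" if any(w in triggers for w in words) else "credible"

def find_adversarial(text, classifier, synonyms):
    """Try single-word substitutions until the classifier's decision flips.

    Returns the first modified text with a different label, or None.
    """
    original = classifier(text)
    words = text.split()
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            if classifier(candidate) != original:
                return candidate
    return None

# Hypothetical synonym table; a real attack would draw on a lexical
# resource or a language model to propose replacements.
synonyms = {"hoax": ["prank"], "vaccine": ["jab"]}
adv = find_adversarial("the vaccine story is a hoax", toy_classifier, synonyms)
```

Here the substitution of one word is enough to change the decision of the toy classifier while the message stays roughly the same, which is the kind of fragility a robustness analysis tries to measure.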

28 March 2024

Krzysztof Węcel (Poznań University of Economics and Business)

https://www.youtube.com/watch?v=Om1ypFnYUIE Credibility of information in the context of fact-checking process  Talk in Polish.

The presentation will focus on the topics of the OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources whose reputation may result from publications sourced from OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs in various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.

25 April 2024

Seminar summarising the work on the Corpus of Modern Polish (Decade 2011-2020)  All talks in Polish.

11:30–11:35: About the project (Małgorzata Marciniak)

11:35–12:05: The Corpus of Modern Polish, Decade 2011-2020 (Marek Łaziński)

12:05–12:35: Annotation, lemmatisation, frequency lists (Witold Kieraś)

12:35–13:00: Coffee break

13:00–13:30: Hybrid representation of syntactic information (Marcin Woliński)

13:30–14:15: Discussion on the future of corpora

13 May 2024

Michal Křen (Charles University in Prague)

Latest developments in the Czech National Corpus  Talk in English.

The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications, where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized. The talk will conclude with an outline of future plans.

3 June 2024 (the talk given at the institute seminar)

Marcin Woliński, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)

Constituency and dependency parsing of natural language using neural networks  Talk in Polish.

In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows a simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. The method was used to introduce a syntactic layer of annotation in the Corpus of Contemporary Polish Language developed at IPI PAN.

4 July 2024

Purificação Silvano (University of Porto)

https://www.youtube.com/watch?v=VUnZIrr2Av8 Unifying Semantic Annotation with ISO 24617 for Narrative Extraction, Understanding and Visualisation  Talk in English.

In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. Subsequently, I will explore the applications of this annotation, specifically highlighting the Text2Story and StorySense projects, which focus on narrative extraction, understanding and visualisation of the journalistic text.