Diff for "seminar"

Differences between revisions 388 and 644 (spanning 256 versions)
Revision 388 as of 2021-02-08 14:35:50
Size: 17442
Comment:
Revision 644 as of 2024-07-05 10:19:54
Size: 29338
Comment:
Line 3: Line 3:
= Natural Language Processing Seminar 2020–2021 = = Natural Language Processing Seminar 2023–2024 =
Line 5: Line 5:
||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
Line 7: Line 7:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''5 October 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''', '''Robert Mroczkowski''', '''Janusz Tracz''' (ML Research at Allegro.pl), '''Ireneusz Gawlik''' (ML Research at Allegro.pl & AGH University of Science and Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=LkR-i2Z1RwM|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-10-05.pdf|Review of BERT-based Models for Polish Language]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In recent years, a series of BERT-based models improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mikołajczyk-Bareła''', '''Wojciech Janowski''' (!VoiceLab), '''Piotr Pęzik''' (University of Łódź / !VoiceLab), '''Filip Żarnecki''', '''Alicja Golisowicz''' (!VoiceLab)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-09.pdf|TRURL.AI: Fine-tuning large language models on multilingual instruction datasets]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived in a number of areas such as information extraction from customer-agent interactions and data labeling with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.||
Line 12: Line 12:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Inez Okulska''' (NASK National Research Institute)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=B7Y9fK2CDWw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-02.pdf|Concise, robust, sparse? Algebraic transformations of word2vec embeddings versus precision of classification]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The introduction of the vector representation of words, containing the weights of context and central words, calculated as a result of mapping giant corpora of a given language, and not encoding manually selected, linguistic features of words, proved to be a breakthrough for NLP research. After the first delight, there came revision and search for improvements - primarily in order to broaden the context, to handle homonyms, etc. Nevertheless, the classic embeddings still apply to many tasks - for example, content classification - and in many cases their performance is still good enough. What do they code? Do they contain redundant elements? If transformed or reduced, will they maintain the information in a way that still preserves the original "meaning"? What is the meaning here? How far can these vectors be deformed and how does it relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and glove) and their precision in the task of classifying texts whose content must remain masked for human users.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Wojtasik''', '''Vadim Shishkin''', '''Kacper Wołowiec''', '''Arkadiusz Janz''', '''Maciej Piasecki''' (Wrocław University of Science and Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-16.pdf|Evaluation of information retrieval models in zero-shot settings on different documents domains]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in Open Domain setting, but current architectures and extensive datasets allow surpassing lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.||
Line 17: Line 17:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Chmiel''' (Adam Mickiewicz University, Poznań), '''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=MxbgQL316DQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-16.pdf|PINC (Polish Interpreting Corpus): how a corpus can help study the process of simultaneous interpreting]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter’s lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter’s mind.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Faleńska''' (University of Stuttgart)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-30.pdf|Steps towards Bias-Aware NLP Systems]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.||
Line 22: Line 23:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Findings of ACL: EMNLP 2020''': Polish session||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Borchmann''' et al. (Applica.ai)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=THe1URk40Nk|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-30a.pdf|Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">Contract Discovery deals with tasks, such as ensuring the inclusion of relevant legal clauses or their retrieval for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification from legal texts, we proposed it along with hard-to-beat baselines. It is expected to process unstructured text, as in most real-world usage scenarios; that is, no legal documents segmentation into the hierarchy of distinct (sub)sections is to be given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. Instead, the process should be considered as a few-shot span identification task. In this particular setting, pretrained, universal encoders fail to provide satisfactory results. In contrast, solutions based on the Language Models perform well, especially when unsupervised fine-tuning is applied.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Szymański''' (Wrocław Technical University), '''Piotr Żelasko''' (Johns Hopkins University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=TXSDhCtTRpw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-30b.pdf|WER we are and WER we think we are]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 November 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-11-13.pdf|Advancing Polish Question Answering: Datasets and Models]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the [[https://huggingface.co/datasets/ipipan/polqa|PolQA]] and [[https://huggingface.co/datasets/ipipan/maupqa|MAUPQA]] and the [[https://huggingface.co/ipipan/silver-retriever-base-v1|Silver Retriever]] model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.||
Line 31: Line 28:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 December 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=newobY5cBJo|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-12-17.pdf|Multi-Word Lexical Simplification]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing its fragment with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) including 1462 sentences in English from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution (Plainifier) for the problem, based on a purpose-trained neural language model, will be discussed along with the evaluation, comparing to human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in [[https://www.aclweb.org/anthology/2020.coling-main.123.pdf|an article of the same title]].||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 December 2023''' (a series of short invited talks by Coventry University researchers)||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Opeoluwa Akinseloyin''', '''Vasile Palade''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-1.pdf|Towards More Human-Effortless Systematic Review Automation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as for citation screening, i.e., the selection of primary studies based on title and abstract. [[http://systematicreviewtools.com/|Some tools exist]], but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic review, which is the fact that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Kacper Sówka''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-2.pdf|Attack Tree Generation Using Machine Learning]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-3.pdf|Towards Semantic Science Citation Index]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which is believed to serve as a certain type of discourse model for summarisation (in future work). As a small step from it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Alireza Daneshkhah''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-4.pdf|Natural Language Processing for Automated Triaging at NHS]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In the face of a post-COVID global economic slowdown and an aging society, the primary care units in the National Health Service (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools that automate work and reduce delays. This short presentation reports the initial studies carried out with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early alert application for Gout Flare prediction from chief complaints (to avoid delays in patient treatment and management).||
Line 36: Line 42:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''18 January 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Norbert Ryciak''', '''Maciej Chrabąszcz''', '''Maciej Bartoszuk''' (Sages)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=L8RRx9KVhJs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2021-01-18.pdf|Classification of patent applications]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">During our presentation we will discuss the solution for patent applications classification task that was one of !GovTech competition problems. We will describe the characteristics of the problem and proposed solution, especially the original method of representing documents as “clouds of word embedding”.||
||<style="border:0;padding-top:15px;padding-bottom:5px">'''8 January 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-01-08.pdf|ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the [[https://www.clarin.eu/parlamint|ParlaMint]] project with their corresponding audio and video footage available online. The task was naturally more involved than it may seem intuitively and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.||
Line 41: Line 47:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 February 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Jatowt''' (University of Innsbruck)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2021-02-01.pdf|Question Answering & Finding Temporal Analogs in News Archives]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered mostly in English (introduction in Polish).}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">News archives offer immense value to our society, helping users to learn details of events that occurred in the past. Currently, the access to such collections is difficult for average users due to large sizes and the need for expertise in history. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles by using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogical to the entities in the present (e.g., walkman as a temporal analog of iPad).||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 February 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Nabożny''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Methods of optimizing the work of experts during the annotation of non-credible medical texts''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Automatic credibility assessment of medical content is an extremely difficult task. This is because expert assessment is burdened with a large interpretive bias, which depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of my and my team's efforts to optimize the annotation process. We proposed a sentence ordering method where non-credible sentences are more likely to be placed at the beginning of the queue for evaluation. I will also present our proposals for extending the annotator protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions to detect harmful narratives in the so-called medical fake news.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 March 2021''' ('''NOTE: the seminar will start at 12:00''')||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksander Wawer''' (Institute of Computer Science, Polish Academy of Sciences), Izabela Chojnicka (Faculty of Psychology, University of Warsaw), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Machine learning in detecting schizophrenia and autism from textual utterances''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Detection of mental disorders from textual input is an emerging field for applied machine and deep learning methods. In our talk, we will explore the limits of automated detection of autism spectrum disorder and schizophrenia. We will analyse both disorders and describe two diagnostic tools: TLC and ADOS-2, along with the characteristics of the collected data. We will compare the performance of: (1) TLC and ADOS-2, (2) machine learning and deep learning methods applied to the data gathered by these tools, and (3) psychiatrists. We will discuss the effectiveness of several baseline approaches such as bag-of-words and dictionary-based methods, including sentiment and language abstraction. We will then introduce the newest approaches using deep learning for text representation and inference. Owing to the related nature of both disorders, we will describe experiments with transfer and zero-shot learning techniques. Finally, we will explore few-shot methods dedicated to low data size scenarios, which is a typical problem for the clinical setting. Psychiatry is one of the few medical fields in which the diagnosis of most disorders is based on the subjective assessment of a psychiatrist. Therefore, the introduction of objective tools supporting diagnostics seems to be pivotal. This work is a step in this direction.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 February 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Tsimur Hadeliya''', '''Dariusz Kajtoch''' (Allegro ML Research)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-12.pdf|Evaluation and analysis of in-context learning for Polish classification tasks]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.||
Line 57: Line 53:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 February 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Seminar on analysis of parliamentary data''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|All talks in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-1.pdf|Polish Parliamentary Corpus and ParlaMint corpus]]'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Bartłomiej Klimowski''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-2.pdf|Application to analyse the sentiment of utterances of Polish MPs]]'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Kiljan''' (University of Warsaw), '''Ewelina Gajewska''' (Warsaw University of Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-3.pdf|Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine]]'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''' (Institute of Computer Science, Polish Academy of Sciences), '''Anna Jamka''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-4.pdf|Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus]]'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Łaziński''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:16px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-5.pdf|Changes in the Polish language of the last hundred years in the mirror of parliamentary debates]]'''||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-25.pdf|Are text credibility classifiers robust to adversarial actions?]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that authors of such text often have a strong motivation to spread it and might try to modify the original content, until they find a reformulation that gets through automatic filters. Such modified variants of original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at [[https://checkthat.gitlab.io/clef2024/task6/|CheckThat! evaluation lab at CLEF 2024]].||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Krzysztof Węcel''' (Poznań University of Economics and Business)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-28.pdf|Credibility of information in the context of fact-checking process]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will focus on the topics of the !OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from !ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources whose reputation may result from publications sourced from !OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs with various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 April 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Seminar summarising the work on the [[https://kwjp.pl|Corpus of Modern Polish (Decade 2011-2020)]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|All talks in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:0px">11:30–11:35: '''[[attachment:seminarium-archiwum/2024-04-25-1.pdf|About the project]]''' (Małgorzata Marciniak)||
||<style="border:0;padding-left:30px;padding-bottom:0px">11:35–12:05: '''[[attachment:seminarium-archiwum/2024-04-25-2.pdf|The Corpus of Modern Polish, Decade 2011-2020]]''' (Marek Łaziński)||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:05–12:35: '''[[attachment:seminarium-archiwum/2024-04-25-3.pdf|Annotation, lemmatisation, frequency lists]]''' (Witold Kieraś)||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:35–13:00: Coffee break||
||<style="border:0;padding-left:30px;padding-bottom:0px">13:00–13:30: '''[[attachment:seminarium-archiwum/2024-04-25-4.pdf|Hybrid representation of syntactic information]]''' (Marcin Woliński)||
||<style="border:0;padding-left:30px;padding-bottom:15px">13:30–14:15: '''[[attachment:seminarium-archiwum/2024-04-25-5.pdf|Discussion on the future of corpora]]'''||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 May 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Michal Křen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-05-13.pdf|Latest developments in the Czech National Corpus]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized. The talk will conclude with an outline of future plans.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 June 2024''' (the talk given at the [[https://ipipan.waw.pl/instytut/dzialalnosc-naukowa/seminaria/ogolnoinstytutowe|institute seminar]])||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marcin Woliński''', '''Katarzyna Krasnowska-Kieraś''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-06-03.pdf|Constituency and dependency parsing of natural language using neural networks]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows a simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. The method was used to introduce a syntactic layer of annotation in the [[https://kwjp.pl|Corpus of Contemporary Polish Language]] developed at IPI PAN.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 July 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Purificação Silvano''' (University of Porto)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-07-04.pdf|Unifying Semantic Annotation with ISO 24617 for Narrative Extraction, Understanding and Visualisation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. Subsequently, I will explore the applications of this annotation, specifically highlighting the [[https://text2story.inesctec.pt/|Text2Story]] and [[https://storysense.inesctec.pt/|StorySense]] projects, which focus on narrative extraction, understanding and visualisation of the journalistic text.||


||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2023]].||
Line 58: Line 104:


||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.||
Line 64: Line 116:

||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2020]].||

Natural Language Processing Seminar 2023–2024

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

seminarium

9 October 2023

Agnieszka Mikołajczyk-Bareła, Wojciech Janowski (VoiceLab), Piotr Pęzik (University of Łódź / VoiceLab), Filip Żarnecki, Alicja Golisowicz (VoiceLab)

http://zil.ipipan.waw.pl/seminarium-online TRURL.AI: Fine-tuning large language models on multilingual instruction datasets  Talk delivered in Polish.

This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived in a number of areas such as information extraction from customer-agent interactions and data labeling with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.

16 October 2023

Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki (Wrocław University of Science and Technology)

http://zil.ipipan.waw.pl/seminarium-online Evaluation of information retrieval models in zero-shot settings on different documents domains  Talk delivered in English.

Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in Open Domain setting, but current architectures and extensive datasets allow surpassing lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.

30 October 2023

Agnieszka Faleńska (University of Stuttgart)

http://zil.ipipan.waw.pl/seminarium-online Steps towards Bias-Aware NLP Systems  Talk in English.

For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.

In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.

13 November 2023

Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online Advancing Polish Question Answering: Datasets and Models  Talk delivered in Polish. Slides in English.

Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the PolQA and MAUPQA and the Silver Retriever model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.

11 December 2023 (a series of short invited talks by Coventry University researchers)

Xiaorui Jiang, Opeoluwa Akinseloyin, Vasile Palade (Coventry University)

http://zil.ipipan.waw.pl/seminarium-online Towards More Human-Effortless Systematic Review Automation  Talk in English.

Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as for citation screening, i.e., the selection of primary studies based on title and abstract. Some tools exist, but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic review, which is the fact that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported.

Kacper Sówka (Coventry University)

http://zil.ipipan.waw.pl/seminarium-online Attack Tree Generation Using Machine Learning  Talk in English.

My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling.

Xiaorui Jiang (Coventry University)

http://zil.ipipan.waw.pl/seminarium-online Towards Semantic Science Citation Index  Talk in English.

It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which is believed to serve a certain type of discourse models for summarisation (in future work). As a small step from it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.

Xiaorui Jiang, Alireza Daneshkhah (Coventry University)

http://zil.ipipan.waw.pl/seminarium-online Natural Language Processing for Automated Triaging at NHS  Talk in English.

In the face of a post-COVID global economic slowdown and aging society, the primary care units in the National Healthcare Services (NHS) are coming under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in the primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools for automation and to reduce delays. This short presentation reports the initial studies carried out with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early alert application for Gout Flare prediction from chief complaints (to avoid delays in patient treatment and management).

8 January 2024

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

http://zil.ipipan.waw.pl/seminarium-online ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project  Talk delivered in Polish.

The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the ParlaMint project with their corresponding audio and video footage available online. The task was naturally more involved than it may intuitively seem and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it was used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.

12 February 2024

Tsimur Hadeliya, Dariusz Kajtoch (Allegro ML Research)

http://zil.ipipan.waw.pl/seminarium-online Evaluation and analysis of in-context learning for Polish classification tasks  Talk in English.

With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.

29 February 2024

Seminar on analysis of parliamentary data  All talks in Polish.

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online Polish Parliamentary Corpus and ParlaMint corpus

Bartłomiej Klimowski (University of Warsaw)

http://zil.ipipan.waw.pl/seminarium-online Application to analyse the sentiment of utterances of Polish MPs

Konrad Kiljan (University of Warsaw), Ewelina Gajewska (Warsaw University of Technology)

http://zil.ipipan.waw.pl/seminarium-online Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine

Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences), Anna Jamka (University of Warsaw)

http://zil.ipipan.waw.pl/seminarium-online Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus

Marek Łaziński (University of Warsaw)

http://zil.ipipan.waw.pl/seminarium-online Changes in the Polish language of the last hundred years in the mirror of parliamentary debates

25 March 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online Are text credibility classifiers robust to adversarial actions?  Talk in Polish.

Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that authors of such text often have a strong motivation to spread it and might try to modify the original content, until they find a reformulation that gets through automatic filters. Such modified variants of original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at CheckThat! evaluation lab at CLEF 2024.

28 March 2024

Krzysztof Węcel (Poznań University of Economics and Business)

http://zil.ipipan.waw.pl/seminarium-online Credibility of information in the context of fact-checking process  Talk in Polish.

The presentation will focus on the topics of the OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources whose reputation may result from publications sourced from OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs with various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.

25 April 2024

http://zil.ipipan.waw.pl/seminarium-online Seminar summarising the work on the Corpus of Modern Polish (Decade 2011-2020)  All talks in Polish.

11:30–11:35: About the project (Małgorzata Marciniak)

11:35–12:05: The Corpus of Modern Polish, Decade 2011-2020 (Marek Łaziński)

12:05–12:35: Annotation, lemmatisation, frequency lists (Witold Kieraś)

12:35–13:00: Coffee break

13:00–13:30: Hybrid representation of syntactic information (Marcin Woliński)

13:30–14:15: Discussion on the future of corpora

13 May 2024

Michal Křen (Charles University in Prague)

http://zil.ipipan.waw.pl/seminarium-online Latest developments in the Czech National Corpus  Talk in English.

The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized. The talk will conclude with an outline of future plans.

3 June 2024 (the talk given at the institute seminar)

Marcin Woliński, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online Constituency and dependency parsing of natural language using neural networks  Talk in Polish.

In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows a simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. The method was used to introduce a syntactic layer of annotation in the Corpus of Contemporary Polish Language developed at IPI PAN.

4 July 2024

Purificação Silvano (University of Porto)

http://zil.ipipan.waw.pl/seminarium-online Unifying Semantic Annotation with ISO 24617 for Narrative Extraction, Understanding and Visualisation  Talk in English.

In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. Subsequently, I will explore the applications of this annotation, specifically highlighting the Text2Story and StorySense projects, which focus on narrative extraction, understanding and visualisation of the journalistic text.

Please see also the talks given in 2000–2015 and 2015–2023.