|
Size: 19690
Comment:
|
Size: 29338
Comment:
|
| Deletions are marked like this. | Additions are marked like this. |
| Line 3: | Line 3: |
| = Natural Language Processing Seminar 2018–2019 = | = Natural Language Processing Seminar 2023–2024 = |
| Line 5: | Line 5: |
| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available [[https://www.youtube.com/channel/UC5PEPpMqjAr7Pgdvq0wRn0w|on YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| |
| Line 7: | Line 7: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 October 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw – prof. emeritus)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=mOYzwpjTAf4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-01.pdf|Electronic indexes to lexicographical resources]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">We will focus on the indexes to lexicographical resources available online in !DjVu format. Such indexes can be browsed, searched, modified and created with the djview4poliqarp open source program; the origins and the history of the program will be briefly presented. Originally the index support was added to the program to handle the list of entries in the 19th century Linde's dictionary, but can be used conveniently also for other resources, as will be demonstrated on selected examples. In particular some new features, introduced to the program in the last months, will be presented publicly for the first time.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mikołajczyk-Bareła''', '''Wojciech Janowski''' (!VoiceLab), '''Piotr Pęzik''' (University of Łódź / !VoiceLab), '''Filip Żarnecki''', '''Alicja Golisowicz''' (!VoiceLab)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-09.pdf|TRURL.AI: Fine-tuning large language models on multilingual instruction datasets]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived in a number of areas such as information extraction from customer-agent interactions and data labeling with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.|| |
| Line 12: | Line 12: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 October 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Wojciech Jaworski, Szymon Rutkowski''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=SbPAdmRmW08|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-15.pdf|A multilayer rule based model of Polish inflection]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will be devoted to the multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish; it does not use the concept of an inflection paradigm. The model consists of three layers of hand-made rules: "orthographic-phonetic layer" converting a segment to representation reflecting morphological patterns of the language, "analytic layer" generating lemma and determining affix and "interpretation layer" giving a morphosyntactic interpretation based on detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with the function of guessing lemmas and morphosyntactic interpretations for non-dictionary forms (guesser). The second use of the model is generation of word forms based on lemma and morphosyntactic interpretation. The presentation will also cover the issue of disambiguation of the results provided by the morphological analyzer. The demo version of the program is available on the Internet.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Wojtasik''', '''Vadim Shishkin''', '''Kacper Wołowiec''', '''Arkadiusz Janz''', '''Maciej Piasecki''' (Wrocław University of Science and Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-16.pdf|Evaluation of information retrieval models in zero-shot settings on different documents domains]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in Open Domain setting, but current architectures and extensive datasets allow surpassing lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.|| |
| Line 17: | Line 17: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 October 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Waszczuk''' (Heinrich-Heine-Universität Düsseldorf)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=zjGQRG2PNu0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-29.pdf|From morphosyntactic tagging to identification of verbal multiword expressions: a discriminative approach]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The first part of the talk was dedicated to Concraft-pl 2.0, the new version of a morphosyntactic tagger for Polish based on conditional random fields. Concraft-pl 2.0 performs morphosyntactic segmentation as a by-product of disambiguation, which allows to use it directly on the segmentation graphs provided by the analyser Morfeusz. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. During the second part, an approach to identifying verbal multiword expressions (VMWEs) based on dependency parsing results was presented. In this approach, VMWE identification is reduced to the problem of dependency tree labeling, where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree. The underlying labeling model can be seen as conditional random fields (as used in Concraft) adapted to tree structures. A system based on this approach ranked 1st in the closed track of the PARSEME shared task 2018.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Faleńska''' (University of Stuttgart)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-30.pdf|Steps towards Bias-Aware NLP Systems]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.|| |
| Line 22: | Line 23: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''5 November 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Faculty of Modern Languages, University of Warsaw / Heinrich-Heine-Universität Düsseldorf)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=sz7dGmf8p3k|{{attachment:seminarium-archiwum/youtube.png}}]] '''Mornings to Wednesdays — semantics and normalization of Polish quasi-periodical temporal expression'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The standard interpretations of expressions like “Januarys” and “Fridays” in temporal representation and reasoning are slices of collections of 2nd order, e.g. all the sixth elements of day sequences of cardinality 7 aligned with calendar weeks. I will present results of the work on normalizing most frequent Polish quasi-periodical temporal expressions for online booking systems. On the linguistic side I will argue against synonymy of the kind “Fridays” = “sixth days of the weeks” and give semantic tests for rudimentary classification of quasi-periodicity. In the formal part I will propose an extension to existing formalisms covering intensional quasi-periodical operators “from”, “to”, “before” and “after” restricted to monotonic domains. In the implementation part I will demonstrate an algorithm for lazy generation of generalized intersection of collections.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 November 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-11-13.pdf|Advancing Polish Question Answering: Datasets and Models]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the [[https://huggingface.co/datasets/ipipan/polqa|PolQA]] and [[https://huggingface.co/datasets/ipipan/maupqa|MAUPQA]] datasets and the [[https://huggingface.co/ipipan/silver-retriever-base-v1|Silver Retriever]] model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.|| |
| Line 27: | Line 28: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 November 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Daniel Zeman''' (Institute of Formal and Applied Linguistics, Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xUmZ8Mxcmg0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-11-19.pdf|Universal Dependencies and the Slavic Languages]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">I will present Universal Dependencies, a worldwide community effort aimed at providing multilingual corpora, annotated at the morphological and syntactic levels following unified annotation guidelines. I will discuss the concept of core arguments, one of the cornerstones of the UD framework. In the second part of the talk I will focus on some interesting problems and challenges of applying Universal Dependencies to the Slavic languages. I will discuss examples from 12 Slavic languages that are currently represented in UD and show that cross-linguistic consistency can still be improved.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 December 2023''' (a series of short invited talks by Coventry University researchers)|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Opeoluwa Akinseloyin''', '''Vasile Palade''' (Coventry University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-1.pdf|Towards More Human-Effortless Systematic Review Automation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:10px">Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as for citation screening, i.e., the selection of primary studies based on title and abstract. [[http://systematicreviewtools.com/|Some tools exist]], but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic review, which is the fact that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge.
Preliminary results are to be reported.|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Kacper Sówka''' (Coventry University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-2.pdf|Attack Tree Generation Using Machine Learning]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:10px">My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling.|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''' (Coventry University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-3.pdf|Towards Semantic Science Citation Index]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:10px">It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so.
This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which is believed to serve as a certain type of discourse model for summarisation (in future work). As a small step beyond it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Alireza Daneshkhah''' (Coventry University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-4.pdf|Natural Language Processing for Automated Triaging at NHS]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In the face of a post-COVID global economic slowdown and an aging society, the primary care units in the National Health Service (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in the primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools that automate tasks and reduce delays. This short presentation reports initial studies conducted with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early alert application for Gout Flare prediction from chief complaints (to avoid delays in patient treatment and management).|| |
| Line 32: | Line 42: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 December 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ekaterina Lapshinova-Koltunski''' (Saarland University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=UQ_6dDNEw8E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-12-03.pdf|Analysis and Annotation of Coreference for Contrastive Linguistics and Translation Studies]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will report on the ongoing work on coreference analysis in a multilingual context. I will present two approaches in the analysis of coreference and coreference-related phenomena: (1) top-down or theory-driven: here we start from some linguistic knowledge derived from the existing frameworks, define linguistic categories to analyse and create an annotated corpus that can be used either for further linguistic analysis or as training data for NLP applications; (2) bottom-up or data-driven: in this case, we start from a set of features of shallow character that we believe are discourse-related. We extract these structures from a huge amount of data and analyse them from a linguistic point of view trying to describe and explain the observed phenomena from the point of view of existing theories and grammars.|| |
||<style="border:0;padding-top:15px;padding-bottom:5px">'''8 January 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-01-08.pdf|ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the [[https://www.clarin.eu/parlamint|ParlaMint]] project with their corresponding audio and video footage available online. The task was naturally more involved than it may seem intuitively and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.|| |
| Line 37: | Line 47: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 January 2019'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Przepiórkowski''' (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw), '''Agnieszka Patejuk''' (Institute of Computer Science, Polish Academy of Sciences / University of Oxford)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-07.pdf|Enhanced Universal Dependencies]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The aim of this talk is to present the two threads of our recent work on Universal Dependencies (UD), a standard for syntactically annotated corpora (http://universaldependencies.org/). The first thread is concerned with the development of a new UD treebank of Polish, one that makes extensive use of the enhanced level of representation made available in the current UD standard. The treebank is the result of conversion from an earlier ‘treebank’ of Polish, one that was annotated with constituency and functional structures as they are understood in Lexical Functional Grammar. We will outline the conversion procedure and present the resulting UD treebank of Polish. The second thread is concerned with various inconsistencies and deficiencies of UD that we identified in the process of developing the UD treebank of Polish. We will concentrate on two particularly problematic areas in UD, namely, on the core/oblique distinction, which aims to – but does not really – replace the infamous argument/adjunct dichotomy, and on coordination, a phenomenon problematic for all dependency approaches.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 February 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Tsimur Hadeliya''', '''Dariusz Kajtoch''' (Allegro ML Research)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-12.pdf|Evaluation and analysis of in-context learning for Polish classification tasks]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.|| |
| Line 42: | Line 52: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 January 2019'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agata Savary''' (François Rabelais University Tours)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-14.pdf|Literal occurrences of multiword expressions: quantitative and qualitative analyses]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Multiword expressions (MWEs) such as “to pull strings” (to use one's influence), “to take part” or “to do in” (to kill) are word combinations which exhibit lexical, syntactic, and especially semantic idiosyncrasies. They pose special challenges to linguistic modeling and computational linguistics due to their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components, and from their syntactic structure, in a way deemed regular for the given language. Additionally, MWEs can have both idiomatic and literal occurrences. For instance “pulling strings” can be understood either as making use of one’s influence, or literally. Even if this phenomenon has been largely addressed in psycholinguistics, linguistics and natural language processing, the notion of a literal reading has rarely been formally defined or subject to quantitative analyses. I will propose a syntax-based definition of a literal reading of an MWE. I will also present the results of a quantitative and qualitative analysis of this phenomenon in Polish, as well as in 4 typologically distinct languages: Basque, German, Greek and Portuguese. This study, performed in a multilingual annotated corpus of the [[http://www.parseme.eu|PARSEME network]], shows that literal readings constitute a rare phenomenon.
We also identify some properties that may distinguish them from their idiomatic counterparts.|| |
|
| Line 47: | Line 53: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 January 2019'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Łaziński''' (University of Warsaw), '''Michał Woźniak''' (Jagiellonian University) || ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-21.pdf|Aspect in dictionaries and corpora. What for and how aspect pairs should be tagged in corpora?]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Corpora are generally tagged for grammatical categories, including verbal aspect value. They all choose between pf and ipf, and some of them add a third value: bi-aspectual (not present in the National Corpus of Polish). However, no Slavic corpus tags the aspect value of a verb form in reference to an aspect partner. If we can mark aspect pairs in dictionaries, it should also be possible in corpora, but only on condition that we extrapolate aspect features of a lexeme to specific verb forms in specific uses. Retaining the existing morphological tagging including aspect value, two more aspect tags have been added: 1) morphological markers of aspect and 2) reference to superlemma. Every verb form in the corpus thus has three parts: 1) The existing grammatical characteristics (TAKIPI), 2) Repeated or corrected aspect value (including bi-aspectual) and morphological markers, 3) Reference to the aspect pair–superlemma. A corpus tagged for aspect pairs, even with alternative reference for every lexeme, opens new perspectives for research. The possibilities are especially rich in a parallel corpus with one Slavic and one aspectless language, such as the Mainz-Warsaw Corpus. In order to check the usefulness of our aspect pair tagging, a series of queries will be built which make it possible to compare grammatical profiles of suffixal and prefixal aspect pf and ipf partners.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 February 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Seminar on analysis of parliamentary data'''  {{attachment:seminarium-archiwum/icon-pl.gif|All talks in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-1.pdf|Polish Parliamentary Corpus and ParlaMint corpus]]'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Bartłomiej Klimowski''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-2.pdf|Application to analyse the sentiment of utterances of Polish MPs]]'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Kiljan''' (University of Warsaw), '''Ewelina Gajewska''' (Warsaw University of Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-3.pdf|Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine]]'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''' (Institute of Computer Science, Polish Academy of Sciences), '''Anna Jamka''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-4.pdf|Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus]]'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Łaziński''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:16px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-29-5.pdf|Changes in the Polish language of the last hundred years in the mirror of parliamentary debates]]'''|| |
| Line 52: | Line 66: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 February 2019'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Anna Wróblewska''' (Applica / Warsaw University of Technology), '''Filip Graliński''' (Applica / Adam Mickiewicz University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=tZ_rkR7XqRY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-02-11.pdf|Text-based machine learning processes and their interpretability]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|||| ||<style="border:0;padding-left:30px;padding-bottom:15px">How do we tackle text modeling challenges in business applications? We will present a prototype architecture for automation of processes in text based work and a few use cases of machine learning models. Use cases will be about emotion detection, abusive language recognition and more. We will also show our tool to explain suspicious findings in datasets and the models behaviour.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-25.pdf|Are text credibility classifiers robust to adversarial actions?]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that authors of such texts often have a strong motivation to spread them and might try to modify the original content until they find a reformulation that gets through automatic filters. Such modified variants of original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at [[https://checkthat.gitlab.io/clef2024/task6/|CheckThat! evaluation lab at CLEF 2024]].|| |
| Line 57: | Line 71: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 February 2019''' (NOTE: the seminar will be held on Thursday!) || ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Dutkiewicz''' (Poznan University of Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Empirical research on medical information retrieval'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">We discuss results and evaluation procedures of the bioCADDIE 2016 challenge on search of precision medical data. Our good results are due to word embedding query expansion with appropriate weights. Information Retrieval (IR) evaluation is demanding because of the considerable effort required to judge over 10000 documents. A simple sampling method was proposed over 10 years ago for estimation of Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) in spite of incomplete judgments. For this method to work the number of judged documents has to be relatively large. Such conditions were not fulfilled in the bioCADDIE 2016 challenge and TREC PM 2017, 2018. The specificity of the bioCADDIE evaluation makes the post-challenge results incompatible with those judged during the contest. In bioCADDIE, for some questions there was no judged relevant document. The results are strongly dependent on the cut-off rank. As a result, in the bioCADDIE challenge infAP is weakly correlated with infNDCG, and the error could be up to 0.15-0.20 in absolute value. We believe that the deviation of evaluation measures may override the primary role of the measure in such a case. We corroborate this claim by simulation of synthetic results. We propose a simulated environment with properties which mirror the real systems.
We implement a number of evaluation measures within the simulation and discuss the usefulness of the measures with a partially annotated collection of documents with regard to the collection size, the number of annotated documents and the proportion between the number of relevant and irrelevant documents. In particular we focus on the behavior of the aforementioned AP and NDCG and their inferred versions. Other studies suggest that infNDCG weakly correlates with other measures and therefore should not be selected as the most important measure.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Krzysztof Węcel''' (Poznań University of Economics and Business)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-28.pdf|Credibility of information in the context of fact-checking process]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will focus on the topics of !OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from !ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources whose reputation may result from publications sourced from !OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs with various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.|| |
| Line 62: | Line 76: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 March 2019'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Dębowski''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''GPT-2'''  {{attachment:seminarium-archiwum/icon-pl.gif|Wystąpienie w języku polskim.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary will be available shortly.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 April 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Seminar summarising the work on the [[https://kwjp.pl|Corpus of Modern Polish (Decade 2011-2020)]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|All talks in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:0px">11:30–11:35: '''[[attachment:seminarium-archiwum/2024-04-25-1.pdf|About the project]]''' (Małgorzata Marciniak)|| ||<style="border:0;padding-left:30px;padding-bottom:0px">11:35–12:05: '''[[attachment:seminarium-archiwum/2024-04-25-2.pdf|The Corpus of Modern Polish, Decade 2011-2020]]''' (Marek Łaziński)|| ||<style="border:0;padding-left:30px;padding-bottom:0px">12:05–12:35: '''[[attachment:seminarium-archiwum/2024-04-25-3.pdf|Annotation, lemmatisation, frequency lists]]''' (Witold Kieraś)|| ||<style="border:0;padding-left:30px;padding-bottom:0px">12:35–13:00: Coffee break|| ||<style="border:0;padding-left:30px;padding-bottom:0px">13:00–13:30: '''[[attachment:seminarium-archiwum/2024-04-25-4.pdf|Hybrid representation of syntactic information]]''' (Marcin Woliński)|| ||<style="border:0;padding-left:30px;padding-bottom:15px">13:30–14:15: '''[[attachment:seminarium-archiwum/2024-04-25-5.pdf|Discussion on the future of corpora]]'''|| |
| Line 67: | Line 85: |
| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2018]].|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 May 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Michal Křen''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-05-13.pdf|Latest developments in the Czech National Corpus]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized.
The talk will conclude with an outline of future plans.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 June 2024''' (the talk given at the [[https://ipipan.waw.pl/instytut/dzialalnosc-naukowa/seminaria/ogolnoinstytutowe|institute seminar]])|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marcin Woliński''', '''Katarzyna Krasnowska-Kieraś''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-06-03.pdf|Constituency and dependency parsing of natural language using neural networks]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows a simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. 
The method was used to introduce a syntactic layer of annotation in the [[https://kwjp.pl|Corpus of Contemporary Polish Language]] developed at IPI PAN.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 July 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Purificação Silvano''' (University of Porto)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-07-04.pdf|Unifying Semantic Annotation with ISO 24617 for Narrative Extraction, Understanding and Visualisation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. 
Subsequently, I will explore the applications of this annotation, specifically highlighting the [[https://text2story.inesctec.pt/|Text2Story]] and [[https://storysense.inesctec.pt/|StorySense]] projects, which focus on narrative extraction, understanding and visualisation of the journalistic text.|| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2023]].|| {{{#!wiki comment ||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. 
We provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thoughts about the use of embeddings in modern NLP.|| }}} |
Natural Language Processing Seminar 2023–2024
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube. |
9 October 2023 |
Agnieszka Mikołajczyk-Bareła, Wojciech Janowski (VoiceLab), Piotr Pęzik (University of Łódź / VoiceLab), Filip Żarnecki, Alicja Golisowicz (VoiceLab) |
|
This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived, in a number of areas such as information extraction from customer-agent interactions and data labeling, with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks. |
16 October 2023 |
Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki (Wrocław University of Science and Technology) |
|
Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in the open-domain setting, but current architectures and extensive datasets allow neural models to surpass lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions. |
30 October 2023 |
Agnieszka Faleńska (University of Stuttgart) |
For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause. |
In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models. |
13 November 2023 |
Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences) |
Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the PolQA and MAUPQA datasets and the Silver Retriever model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering. |
11 December 2023 (a series of short invited talks by Coventry University researchers) |
Xiaorui Jiang, Opeoluwa Akinseloyin, Vasile Palade (Coventry University) |
Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as for citation screening, i.e., the selection of primary studies based on title and abstract. Some tools exist, but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic review, which is the fact that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported. |
Kacper Sówka (Coventry University) |
My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling. |
Xiaorui Jiang (Coventry University) |
It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, one of whose main jobs is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which is believed to serve as a certain type of discourse model for summarisation (in future work). As a small step from it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers. |
Xiaorui Jiang, Alireza Daneshkhah (Coventry University) |
In the face of a post-COVID global economic slowdown and an aging society, the primary care units in the National Healthcare Services (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools that automate work and reduce delays. This short presentation reports on initial studies carried out with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early alert application for Gout Flare prediction from chief complaints (to avoid delays in patient treatment and management). |
8 January 2024 |
Danijel Korzinek (Polish-Japanese Academy of Information Technology) |
|
The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the ParlaMint project with their corresponding audio and video footage available online. The task was naturally more involved than it may intuitively seem and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) to its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed. |
12 February 2024 |
Tsimur Hadeliya, Dariusz Kajtoch (Allegro ML Research) |
|
With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work. |
29 February 2024 |
Seminar on analysis of parliamentary data |
Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences) |
Bartłomiej Klimowski (University of Warsaw) |
|
Konrad Kiljan (University of Warsaw), Ewelina Gajewska (Warsaw University of Technology) |
|
Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences), Anna Jamka (University of Warsaw) |
|
Marek Łaziński (University of Warsaw) |
|
25 March 2024 |
Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences) |
|
Automatic text classifiers are widely used for helping in content moderation for platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that authors of such texts often have a strong motivation to spread them and might try to modify the original content until they find a reformulation that gets through automatic filters. Such modified variants of original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to the attacks of motivated actors. The presentation will be devoted to a systematic analysis of the problem in the context of detecting misinformation. I am going to show concrete examples where a replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at CheckThat! evaluation lab at CLEF 2024. |
28 March 2024 |
Krzysztof Węcel (Poznań University of Economics and Business) |
|
The presentation will focus on the topics of OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from ClaimReview, Common Crawl or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources whose reputation may result from publications sourced from OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine work opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs with various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases. |
25 April 2024 |
|
11:30–11:35: About the project (Małgorzata Marciniak) |
11:35–12:05: The Corpus of Modern Polish, Decade 2011-2020 (Marek Łaziński) |
12:05–12:35: Annotation, lemmatisation, frequency lists (Witold Kieraś) |
12:35–13:00: Coffee break |
13:00–13:30: Hybrid representation of syntactic information (Marcin Woliński) |
13:30–14:15: Discussion on the future of corpora |
13 May 2024 |
Michal Křen (Charles University in Prague) |
The talk will give an overview of the Czech National Corpus (CNC) research infrastructure in all the main areas of its operation: corpus compilation, data annotation, application development and user support. Special attention will be paid to the variety of language corpora and user applications where CNC has recently seen significant progress. In addition, it is the end-user web applications that shape the way linguists and other scholars think about the language data and how they can be utilized. The talk will conclude with an outline of future plans. |
3 June 2024 (the talk given at the institute seminar) |
Marcin Woliński, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences) |
|
In the talk, we will present a method of automatic syntactic analysis (parsing) of natural language. In the proposed approach, syntactic structures are expressed using syntactic spines and their attachments, which allows a simultaneous generation of two popular representations: dependency and constituency trees. We will discuss the implementation of this concept in the form of a set of classifiers fed with the outputs of a BERT-type language model. Tests of the algorithm on Polish and German data showed a high quality of the results obtained. The method was used to introduce a syntactic layer of annotation in the Corpus of Contemporary Polish Language developed at IPI PAN. |
4 July 2024 |
Purificação Silvano (University of Porto) |
|
In this talk, I will present the successful application of Language resource management – Semantic annotation framework (ISO-24617) for representing semantic information in texts. Initially, I will introduce the harmonisation of five parts of ISO 24617 (1, 4, 7, 8, 9) into a comprehensive annotation scheme designed to represent semantic information pertaining to eventualities, times, participants, space, discourse relations and semantic roles. Subsequently, I will explore the applications of this annotation, specifically highlighting the Text2Story and StorySense projects, which focus on narrative extraction, understanding and visualisation of the journalistic text. |
Please see also the talks given in 2000–2015 and 2015–2023. |


