= Natural Language Processing Seminar 2025–2026 =

| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It will restart in October and will take place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mikołajczyk-Bareła''', '''Wojciech Janowski''' (!VoiceLab), '''Piotr Pęzik''' (University of Łódź / !VoiceLab), '''Filip Żarnecki''', '''Alicja Golisowicz''' (!VoiceLab)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Trurl.ai Fine-tuning large language models on multilingual instruction datasets'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived in a number of areas such as information extraction from customer-agent interactions and data labeling with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 September 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Louis Esteve''' (Universite Paris-Saclay) || ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-09-15.pdf|Diversity and dataset size – a quantitative perspective]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The field of Natural Language Processing (NLP) studies the abilities of computer systems to process and generate natural language, and has received increasing attention from the general population since the democratisation of generative and conversational models. However, behind the scenes, state-of-the-art NLP models are trained on ever-larger datasets, reaching trillions of tokens. It may be argued that the creation and use of such immense datasets is motivated by the idea that 'the larger the dataset, the more diverse it is', and that in turn 'if the training set is more diverse, it shall yield better models'. However, these statements thus far remain intuitions and need to be properly tested. To this end, this presentation will tackle methods and caveats of formal diversity quantification including limitations of the literature, a preliminary discussion on the link between diversity and dataset size, as well as their impact on downstream applications.|| |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Wojtasik''', '''Vadim Shishkin''', '''Kacper Wołowiec''', '''Arkadiusz Janz''', '''Maciej Piasecki''' (Wrocław University of Science and Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Evaluation of information retrieval models in zero-shot settings on different documents domains'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary will be available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 October 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Faleńska''' (University of Stuttgart)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Steps towards Bias-Aware NLP Systems'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary will be available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 November 2023'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Advancing Polish Question Answering: Datasets and Models'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary will be available soon.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''6 October 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University / Institute of Computer Science, Polish Academy of Sciences) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Deep, multi-faceted learning of diagnosing mental disorders from clinical interview records'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The key characteristics of mental illnesses are reflected in audio recordings of clinical interviews with patients and their families. We have developed a deep learning method that automatically extracts the relevant features necessary for the diagnosis of mental illnesses (ADHD, depression, bipolar disorder and schizophrenia) from such interviews. We use a variety of pre-trained models to extract representations from both the audio segments of these interviews and their text versions. We use several modern representation techniques (embeddings). We apply a Big Data approach by exploring existing audio and text corpora annotated with emotional labels. We address the problem of annotated data scarcity by using parametric model fine-tuning (Parameter Efficient Fine-Tuning). All these representations are then combined into a single multimodal form. To diagnose the above mental disorders, we use contrastive learning and model synthesis using a committee of experts (Mixture of Experts). The results show that through multimodal analysis of clinical interviews, mental disorders can be diagnosed with satisfactory accuracy (project conducted in collaboration with H. Naderi and R. Uher).|| |
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2023]].|| |
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2025]].|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| |


