Natural Language Processing Seminar 2022–2023
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.
3 October 2022
Sławomir Dadas (National Information Processing Institute)

Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering or plagiarism detection. A simple method to construct such a representation is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher-quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture, and share our experiences with training such models for the Polish language. In addition, we will discuss approaches for building large datasets of paraphrases using publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.
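The aggregation baseline mentioned in the abstract is easy to sketch in code. Below is a minimal illustration (not the speaker's implementation) of mean-pooling token vectors from a Transformer into a fixed-size sentence embedding; the HerBERT checkpoint is just one example of a publicly available Polish language model, and any similar model could be substituted.

{{{
# Mean-pooling sentence embeddings from a Transformer language model.
# A minimal sketch of the aggregation baseline, not the fine-tuned encoders
# discussed in the talk. The checkpoint below is one example of a public
# Polish language model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allegro/herbert-base-cased"  # example Polish LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(sentences):
    """Return one fixed-size vector per sentence, shape (batch, hidden_dim)."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**batch).last_hidden_state  # (batch, seq, dim)
    # Zero out padding positions, then average the remaining token vectors.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)

# Semantically similar sentences should end up close in the vector space,
# e.g. as measured by cosine similarity.
vecs = embed(["Kot siedzi na macie.", "Na macie siedzi kot."])
print(torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:2]).item())
}}}

The fine-tuning step described in the abstract typically replaces this frozen model with one trained so that paraphrase pairs receive high cosine similarity, for instance with a contrastive objective.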
14 November 2022
Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Dominik Szymczak, Arkadiusz Janz, Piotr Szymański, Marcin Wątroba, Mikołaj Morzy, Tomasz Jan Kajdanowicz, Maciej Piasecki (Wrocław University of Science and Technology)
This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish
The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark (klej is the word for glue in Polish) has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (lepiszcze is the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
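To make the benchmark structure described above concrete, here is a hypothetical sketch (not the actual LEPISZCZE codebase or API) of the design the abstract outlines: each experiment is a task–dataset pair evaluated against a model, so extending the benchmark means adding one entry to a list. All identifiers in the snippet are illustrative placeholders.

{{{
# Hypothetical illustration of a benchmark organised as (task, dataset, model)
# experiments -- this mirrors the design described in the abstract, not the
# real LEPISZCZE API. All names below are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    task: str      # e.g. "ner" or "sentiment"
    dataset: str   # version-pinned identifier, for data versioning
    model: str     # checkpoint identifier, for model tracking

# Adding a new model or dataset is one new list entry; the harness then
# evaluates the full cross-product of (task, dataset) pairs and models.
pairs = [("ner", "dataset-x@v1"), ("sentiment", "dataset-y@v2")]
models = ["polish-lm-a", "polish-lm-b"]
experiments = [Experiment(t, d, m) for (t, d) in pairs for m in models]

for exp in experiments:
    # evaluate(exp) would fine-tune and score the model; elided here.
    print(exp)
}}}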
28 November 2022
Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)
Talk title will be made available shortly
Talk summary will be made available shortly.
19 December 2022
Wojciech Kryściński (Salesforce Research)
Long Story Short: A Talk about Text Summarization
Automatic Text Summarization is a challenging task within Natural Language Processing that requires advanced language understanding and generation capabilities. In recent years, substantial progress has been made in developing neural models for the task, thanks to the efforts of the research community and advancements in the broader field of NLP. Despite this progress, text summarization remains a challenging task that is far from being solved. In this talk, we will first discuss the early approaches and the current state of the field. Next, we will critically evaluate key ingredients of the existing research setup: datasets, evaluation metrics, and models. Finally, we will focus on emerging research directions and consider the future of text summarization.
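Since the talk critically evaluates evaluation metrics, a concrete example helps: the most widely used summarization metric, ROUGE, measures n-gram overlap with a reference summary. Below is a self-contained, simplified sketch of ROUGE-1 recall (unigram overlap only, without the stemming or multi-reference handling of the full metric), which also illustrates its known weakness: an abstractive summary that paraphrases the reference scores poorly despite being faithful.

{{{
# Minimal ROUGE-1 recall: the fraction of reference unigrams that also
# appear in the candidate summary. A simplified sketch of the standard
# metric, without stemming or multi-reference handling.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# High lexical overlap scores perfectly...
print(rouge1_recall("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
# ...while a faithful paraphrase is heavily penalised.
print(rouge1_recall("a feline rested on the rug", "the cat sat on the mat"))  # ~0.33
}}}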
9 January 2023
Marzena Karpińska (University of Massachusetts Amherst)
Talk title will be made available shortly
Talk summary will be made available shortly.
23 January 2023
Agnieszka Mikołajczyk (VoiceLab / Gdańsk University of Technology / hear.ai)
Talk title will be made available shortly
Talk summary will be made available shortly.
6 February 2023
Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka (Adam Mickiewicz University in Poznań)
Talk title will be made available shortly
Talk summary will be made available shortly.
Please see also the talks given in 2000–2015 and 2015–2020. |