
Diff for "seminar"

Differences between revisions 1 and 501 (spanning 500 versions)
Revision 1 as of 2016-06-27 22:35:36
Size: 834
Comment:
Revision 501 as of 2022-10-20 10:14:07
Size: 9404
Comment:
Deleted text is shown on lines marked (old); added text on lines marked (new).
Line 3:
(old) = Natural Language Processing Seminar 2016–2017 =
(new) = Natural Language Processing Seminar 2022–2023 =
Line 5:
(old) ||<style="border:0;padding:0">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). ||<style="border:0;padding-left:30px">[[seminarium-archiwum|{{attachment:pl.png}}]]||
(new) ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
Line 7:
(old) ||<style="border:0;padding-top:10px">Please come back in October! And now see [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given between 2000 and 2015]] and [[http://zil.ipipan.waw.pl/seminar|2015-16]]. ||
(new) ||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 October 2022'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Sławomir Dadas''' (National Information Processing Institute)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=TGwLeE1Y5X4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2022-10-03.pdf|Our experience with training neural sentence encoders for the Polish language]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering or plagiarism detection. A simple method to construct such representation is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture as well as our experiences with training such models for the Polish language. In addition, we will discuss approaches for building large datasets of paraphrases using publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 November 2022'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Augustyniak''', '''Kamil Tagowski''', '''Albert Sawczyn''', '''Denis Janiak''', '''Roman Bartusiak''', '''Adrian Dominik Szymczak''', '''Arkadiusz Janz''', '''Piotr Szymański''', '''Marcin Wątroba''', '''Mikołaj Morzy''', '''Tomasz Jan Kajdanowicz''', '''Maciej Piasecki''' (Wrocław University of Science and Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become a de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark (''klej'' is the word for glue in Polish) has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (''lepiszcze'' is the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 November 2022'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksander Wawer''' (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be made available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made avaliable shortly.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 December 2022'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Paula Czarnowska''' (University of Cambridge)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be made available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made avaliable shortly.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2022'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Wojciech Kryściński''' (Salesforce Research)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Current state, challenges, and approaches to Text Summarization''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} &#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made avaliable shortly.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 January 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marzena Karpińska''' (University of Massachusetts Amherst)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be made available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made avaliable shortly.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 January 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mikołajczyk''' (!VoiceLab / Politechnika Gdańska / hear.ai)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be made available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made avaliable shortly.||



||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2020]].||

{{{#!wiki comment

||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. we provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thought s about the use of embeddings in modern NLP.||
}}}
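For reference, the role of the negative examples discussed in the archived abstract above can be illustrated with a generic skip-gram-with-negative-sampling (SGNS) loss. This is a standard textbook sketch, not the kernel-based method from the talk.

{{{#!python
# Generic skip-gram negative-sampling (SGNS) loss: observed word-context
# pairs are pulled together, while sampled unobserved ("negative") pairs
# are pushed apart. A standard illustration, not the talk's kernel method.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    loss = -np.log(sigmoid(w @ c_pos))            # attract the observed pair
    loss -= np.log(sigmoid(-(c_negs @ w))).sum()  # repel the sampled negatives
    return loss

rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=100), rng.normal(size=100)
c_negs = rng.normal(size=(5, 100))                # 5 sampled negatives
print(sgns_loss(w, c_pos, c_negs))
}}}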
