
Diff for "seminar"

Differences between revisions 185 and 275 (spanning 90 versions)
Revision 185 as of 2018-10-03 18:07:16
Size: 5650
Revision 275 as of 2019-10-18 14:31:49
Size: 8500
Line 3: Line 3:
= Natural Language Processing Seminar 2018–2019 = = Natural Language Processing Seminar 2019–2020 =
Line 7: Line 7:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw – prof. emeritus)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=mOYzwpjTAf4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-01.pdf|Electronic indexes to lexicographical resources]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">We will focus on the indexes to lexicographical resources available online in !DjVu format. Such indexes can be browsed, searched, modified and created with the open-source program djview4poliqarp; the origins and history of the program will be briefly presented. Index support was originally added to the program to handle the list of entries in Linde's 19th-century dictionary, but it can also be used conveniently for other resources, as will be demonstrated on selected examples. In particular, some new features introduced to the program in recent months will be presented publicly for the first time.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 September 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Igor Boguslavsky''' (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-09-23.pdf|Semantic analysis based on inference]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">I will present SemETAP, a semantic analyzer that is a module of the ETAP linguistic processor, which is designed to analyze and generate natural-language texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences makes it possible to extract implicit information.||
||<style="border:0;padding-left:30px;padding-bottom:0px">Salient features of SemETAP include: ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— knowledge base contains both linguistic and background knowledge;||
||<style="border:0;padding-left:30px;padding-bottom:0px">— inference types include strict entailments and plausible expectations; ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes; ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences; ||
||<style="border:0;padding-left:30px;padding-bottom:15px">— a new logical formalism Etalog is developed in which all inference rules are written.||
Line 12: Line 18:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Wojciech Jaworski, Szymon Rutkowski''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''A multilayer rule based model of Polish inflection''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will be devoted to a multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish; it does not use the concept of an inflection paradigm. The model consists of three layers of hand-made rules: an "orthographic-phonetic layer" converting a segment to a representation reflecting the morphological patterns of the language, an "analytic layer" generating the lemma and determining the affix, and an "interpretation layer" giving a morphosyntactic interpretation based on the detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with a guesser, that is, a function that guesses lemmas and morphosyntactic interpretations for non-dictionary forms. The second use of the model is the generation of word forms from a lemma and a morphosyntactic interpretation. The presentation will also cover the disambiguation of the results provided by the morphological analyzer. A demo version of the program is available on the Internet.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Tomasz Stanisz''' (Institute of Nuclear Physics, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=sRreAjtf2Jo|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-10-07.pdf|What can a complex network say about a text?]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Complex networks, which have found application in the quantitative description of many different phenomena, have proven to be useful in research on natural language. The network formalism allows language to be studied from various points of view: a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One type of linguistic network is the word-adjacency network, which describes mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties that make them practically useful. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; machine learning methods applied to collections of those quantities may be used, for example, for authorship attribution.||
Line 17: Line 23:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Waszczuk'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be available shortly.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 October 2019''' (NOTE: The seminar will start at 12:30!)||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Patejuk''' (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), '''Adam Przepiórkowski''' (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Coordination in the Universal Dependencies standard''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">''Universal Dependencies'' (UD; [[https://universaldependencies.org/]]) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.||
Line 22: Line 28:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''5 November 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Faculty of Modern Languages, University of Warsaw / Heinrich-Heine-Universität Düsseldorf)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be available shortly.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marcin Będkowski''' (Educational Research Institute), '''Łukasz Kobyliński''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''The title of the talk will be available shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be available shortly.||
Line 27: Line 33:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 November 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Daniel Zeman''' (Institute of Formal and Applied Linguistics, Charles University, Czech Republic)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be available shortly''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be available shortly.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''18 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''The !InterCorp multilingual parallel corpus: representation of grammatical categories''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">!InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliament proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, currently using language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.||
Line 32: Line 38:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 December 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ekaterina Lapshinova-Koltunski''' (Saarland University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be available shortly''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be available shortly.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''A learner corpus of Czech''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For !CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, and later extended to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially given that Czech is a highly inflectional language with free word order.||
Line 37: Line 43:

||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2018]].||
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2019]].||

Natural Language Processing Seminar 2019–2020

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.


23 September 2019

Igor Boguslavsky (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)

Semantic analysis based on inference  Talk delivered in English.

I will present SemETAP, a semantic analyzer that is a module of the ETAP linguistic processor, which is designed to analyze and generate natural-language texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences makes it possible to extract implicit information.

Salient features of SemETAP include:

— knowledge base contains both linguistic and background knowledge;

— inference types include strict entailments and plausible expectations;

— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes;

— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences;

— a new logical formalism Etalog is developed in which all inference rules are written.

7 October 2019

Tomasz Stanisz (Institute of Nuclear Physics, Polish Academy of Sciences)

What can a complex network say about a text? (video: https://www.youtube.com/watch?v=sRreAjtf2Jo)  Talk delivered in Polish.

Complex networks, which have found application in the quantitative description of many different phenomena, have proven to be useful in research on natural language. The network formalism allows language to be studied from various points of view: a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One type of linguistic network is the word-adjacency network, which describes mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties that make them practically useful. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; machine learning methods applied to collections of those quantities may be used, for example, for authorship attribution.
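
As a rough illustration of the word-adjacency networks mentioned in this abstract, the following minimal sketch builds one from a short text. It is not taken from the talk: the function names `word_adjacency_network` and `degrees`, the simple regular-expression tokenization, and the choice of an undirected, weighted graph are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import re


def word_adjacency_network(text):
    """Return undirected weighted edges {(w1, w2): count}.

    Two words are linked when they occur next to each other in the text.
    Tokenization (lowercased runs of word characters) is an assumption.
    """
    words = re.findall(r"\w+", text.lower())
    edges = Counter()
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops from immediate repetitions
            # sort the pair so (a, b) and (b, a) count as the same edge
            edges[tuple(sorted((left, right)))] += 1
    return edges


def degrees(edges):
    """Number of distinct neighbours of each word (one possible network quantity)."""
    deg = defaultdict(int)
    for first, second in edges:
        deg[first] += 1
        deg[second] += 1
    return dict(deg)


edges = word_adjacency_network("the cat sat on the mat and the cat slept")
deg = degrees(edges)
print(edges[("cat", "the")])  # how often "the" and "cat" are adjacent
print(deg["the"])             # number of distinct neighbours of "the"
```

Quantities such as these degrees could then serve as features for a classifier, for instance in an authorship-attribution experiment; which quantities the speaker actually uses is not stated in the abstract.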

21 October 2019 (NOTE: The seminar will start at 12:30!)

Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Coordination in the Universal Dependencies standard  Talk delivered in Polish. Slides in English.

Universal Dependencies (UD; https://universaldependencies.org/) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.

4 November 2019

Marcin Będkowski (Educational Research Institute), Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

The title of the talk will be available shortly  Talk delivered in Polish.

The summary of the talk will be available shortly.

18 November 2019

Alexander Rosen (Charles University in Prague)

The InterCorp multilingual parallel corpus: representation of grammatical categories  Talk delivered in English.

InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliament proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, currently using language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.

21 November 2019

Alexander Rosen (Charles University in Prague)

A learner corpus of Czech  Talk delivered in English.

Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, and later extended to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially given that Czech is a highly inflectional language with free word order.

Please see also the talks given in 2000–2015 and 2015–2019.