Natural Language Processing Seminar 2019–2020
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.
23 September 2019
Igor Boguslavsky (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)
Semantic analysis based on inference
I will present the semantic analyzer SemETAP, a module of the linguistic processor ETAP designed to perform analysis and generation of NL texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences makes it possible to extract implicit information.
Salient features of SemETAP include:
— the knowledge base contains both linguistic and background knowledge;
— inference types include strict entailments and plausible expectations;
— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes;
— two levels of semantic structure are distinguished: the Basic semantic structure (BSemS) interprets the text in terms of ontological elements, while the Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences;
— a new logical formalism, Etalog, has been developed in which all inference rules are written.
7 October 2019
Tomasz Stanisz (Institute of Nuclear Physics, Polish Academy of Sciences)
What can a complex network say about a text?
Complex networks, which have found application in the quantitative description of many different phenomena, have proven to be useful in research on natural language. The network formalism makes it possible to study language from various points of view: a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One type of linguistic network is the word-adjacency network, which describes mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties that allow for their practical use. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; applying machine learning methods to collections of those quantities may be used, for example, for authorship attribution.
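The sketch below illustrates the general idea of a word-adjacency network and of deriving simple network quantities from it; it is an illustration only (not material from the talk), and the crude tokenisation, the undirected unweighted graph and the particular measures are assumptions:
    # A minimal word-adjacency network, assuming a crude regex tokenisation
    # and an undirected, unweighted graph; not code from the talk.
    import re
    import networkx as nx

    def word_adjacency_network(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        graph = nx.Graph()
        for first, second in zip(tokens, tokens[1:]):
            if first != second:
                graph.add_edge(first, second)
        return graph

    net = word_adjacency_network(
        "the quick brown fox jumps over the lazy dog and the quick grey cat")

    # Simple quantities of the kind that can serve as stylometric features.
    print({
        "nodes": net.number_of_nodes(),
        "edges": net.number_of_edges(),
        "average degree": sum(d for _, d in net.degree()) / net.number_of_nodes(),
        "average clustering": nx.average_clustering(net),
    })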
21 October 2019
Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)
Coordination in the Universal Dependencies standard
Universal Dependencies (UD; https://universaldependencies.org/) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.
4 November 2019
Marcin Będkowski (University of Warsaw / Educational Research Institute), Wojciech Stęchły, Leopold Będkowski, Joanna Rabiega-Wiśniewska (Educational Research Institute), Michał Marcińczuk (Wrocław University of Science and Technology), Grzegorz Wojdyga, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)
Similarity of descriptions of qualifications contained in the Integrated Qualifications Register
Analysis of existing solutions for grouping of qualifications
In the talk we will discuss the problem of comparing documents contained in the Integrated Qualifications Register in terms of their content similarity.
In the first part, we characterize the background of the issue, including the structure of the description of learning outcomes in qualifications and of the sentences describing those outcomes. According to the definition in the Act on the Integrated Qualifications System, a learning outcome is the knowledge, skills and social competences acquired in the learning process, and a qualification is a set of learning outcomes whose achievement is confirmed by an appropriate document (e.g. a diploma or certificate). Sentences whose referents are learning outcomes have a stable structure and consist essentially of a so-called operational verb (describing an activity constituting a learning outcome) and a nominal phrase that complements it (naming the object of that activity, in short: the object of the skill). For example: "Determines vision defects and how to correct them based on eye refraction measurement" or "The student reads technical drawings."
In the second part, we outline the approach that allows the degree of similarity between qualifications to be determined and the qualifications to be grouped, along with its assumptions and the intuitions behind them. We will define the accepted understanding of content similarity, namely we outline an approach to determining the similarity of texts in a variant that allows automatic text processing with computer tools. We will present a simple representation model, the so-called bag of words, in two versions.
The first of them assumes full atomization of learning outcomes (including nominal phrases, i.e. the objects of skills) and their representation as sets of single nouns representing those objects. The second is based on n-grams weighted with the TF-IDF measure (term frequency-inverse document frequency), which allows key words and phrases to be extracted from the texts.
The first approach can be described as "wasteful", the second as "frugal". The first allows many similar qualifications to be presented for each qualification (although the degree of similarity may be low); the second, on the other hand, allows a situation in which no similar qualifications are found for a given qualification.
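As a rough illustration of the second, "frugal" variant (not the authors' implementation; the toy descriptions, the n-gram range and the use of scikit-learn are assumptions), a TF-IDF bag-of-words model with cosine similarity can be sketched as follows:
    # A minimal sketch of TF-IDF n-gram similarity between short descriptions;
    # the example texts are invented, the real study used Polish descriptions
    # from the Integrated Qualifications Register.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    descriptions = [
        "reads technical drawings and prepares simple sketches",
        "determines vision defects based on eye refraction measurement",
        "prepares technical documentation and reads technical drawings",
    ]

    # Unigrams and bigrams weighted by term frequency-inverse document frequency.
    vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(descriptions)

    # Pairwise content similarity between the descriptions.
    print(cosine_similarity(vectors).round(2))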
In the third part, we describe sample groupings and ranking lists based on both approaches, using multidimensional scaling and the k-means algorithm, as well as hierarchical clustering. We will also present a case study illustrating the advantages and disadvantages of both approaches.
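The grouping step can be sketched along the following lines (again an illustration only: the toy feature matrix stands in for TF-IDF vectors of qualification descriptions, and the number of clusters is arbitrary):
    # Grouping toy document vectors with k-means, hierarchical clustering and
    # multidimensional scaling; all data and parameters are illustrative.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.manifold import MDS

    features = np.array([
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.1, 0.9, 0.7],
        [0.0, 0.8, 0.9],
    ])

    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    hierarchical_labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)

    # Two-dimensional coordinates, e.g. for plotting the groupings.
    coordinates = MDS(n_components=2, random_state=0).fit_transform(features)

    print(kmeans_labels, hierarchical_labels)
    print(coordinates.round(2))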
In the fourth part we will present conclusions on grouping qualifications, but also general conclusions related to the definition of key words. In particular, we will present conclusions on the use of the indicated methods for comparing texts of varying length, as well as texts that partially overlap (contain common fragments).
The talk was prepared in cooperation with the authors of an expert report on the automatic analysis and comparison of qualifications for the purpose of grouping them, prepared under the project "Keeping and developing the Integrated Qualifications Register", POWR.02.11.00-00-0001/17.
18 November 2019
Alexander Rosen (Charles University in Prague)
The InterCorp multilingual parallel corpus: representation of grammatical categories
InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliamentary proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, which currently uses language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.
21 November 2019
Alexander Rosen (Charles University in Prague)
A learner corpus of Czech
Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not yet emerged. For CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, and later extended to texts produced by Czech schoolchildren. I will show the various pros and cons of these approaches, especially in view of Czech being a highly inflectional language with free word order.
12 December 2019
Aleksandra Tomaszewska (Institute of Applied Linguistics, University of Warsaw)
Cross-Genre Analysis of EU Borrowings in Polish — the Need for Research Automation
During this presentation, the project "EU Borrowings — formation mechanisms, functions, evolution, and assimilation in the Polish language", funded by a Diamond Grant from the Polish Ministry of Science and Higher Education, will be presented. The project aims to analyze and categorize EU borrowings, that is, the effects of language contact that occur in the European Union.
First, the author will discuss the theoretical background of the phenomenon and the aims of the research project, and present the compiled corpus of EU Polish language genres, composed of three sub-corpora: transcriptions of interviews with MEPs, EU law (regulations and directives), and press releases of EU institutions. In the next part of the presentation, the methods and tools used in this research will be presented, including the methods of conducting analyses on the collected research material. Based on specific examples, the need for automation of research on the latest borrowings in Polish will also be signaled.
13 January 2020
Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)
Integrating Polish Language Tools and Resources in spaCy
In our project we aim to fill the niche between the robust tools developed during research work and dedicated to particular NLP tasks in Polish, and users looking for, and expecting, easily accessible resources. spaCy is one of the leading open-source NLP frameworks, but it has no official support for Polish. In our talk we will present the spaCy model for Polish that we have been working on. It currently supports segmentation, lemmatization, morphosyntactic analysis, dependency parsing and named entity recognition. We will discuss the tools we have integrated, the results of evaluation, a real-world case in which the model was used, and some possible paths for further development.
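A minimal usage sketch of such a pipeline is given below; the model name pl_core_news_sm is an assumption (the model discussed in the talk is distributed separately), while the API calls are standard spaCy:
    # Standard spaCy usage; "pl_core_news_sm" is an assumed model name and must
    # be installed first, e.g. with: python -m spacy download pl_core_news_sm
    import spacy

    nlp = spacy.load("pl_core_news_sm")
    doc = nlp("Instytut Podstaw Informatyki PAN mieści się w Warszawie.")

    # Segmentation, lemmatisation, morphosyntax and dependency parsing.
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

    # Named entity recognition.
    for entity in doc.ents:
        print(entity.text, entity.label_)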
27 January 2020
Alina Wróblewska, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)
Empirical Linguistic Study of Sentence Embeddings
The results of an empirical linguistic study on the retention of linguistic information in sentence embeddings will be presented. The research methods are based on universal probing tasks and downstream tasks. The results of experiments on English and Polish indicate that different types of sentence embeddings encode linguistic information to varying degrees. The research was published in the article Empirical Linguistic Study of Sentence Embeddings (https://www.aclweb.org/anthology/P19-1573/) in the proceedings of ACL 2019.
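A probing task of the kind used in such studies can be sketched as follows (a schematic example, not the authors' code: the bag-of-words vectors merely stand in for real sentence embeddings, and the probed property, sentence length, is one simple surface-level choice):
    # A schematic probing task: train a simple classifier to predict a
    # linguistic property from sentence vectors; here bag-of-words vectors
    # stand in for real sentence embeddings, and the data is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    sentences = [
        "Ala ma kota .",
        "Kot śpi .",
        "Pada .",
        "Wczoraj padał deszcz przez cały dzień .",
        "To jest bardzo długie i skomplikowane zdanie testowe .",
        "Jan czyta książkę w ogrodzie za domem .",
    ]
    # Probed property: is the sentence longer than four tokens?
    labels = [len(s.split()) > 4 for s in sentences]

    embeddings = CountVectorizer().fit_transform(sentences)
    x_train, x_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.33, random_state=0, stratify=labels)

    # High probing accuracy suggests the vectors retain the probed information.
    probe = LogisticRegression().fit(x_train, y_train)
    print("probing accuracy:", probe.score(x_test, y_test))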
24 February 2020
Piotr Niewiński, Maria Pszona, Maria Janicka (Samsung R&D Polska), Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences)
Fact-checking in FEVER competition
Generative Enhanced Model (extended, redesigned & fine-tuned GPT language model) for adversarial attacks
During the seminar we will present our work for the FEVER (Fact Extraction and Verification) competition (http://fever.ai/). "Fake news" has become a dangerous phenomenon in modern information circulation. There are many approaches to the problem of recognizing fake messages; in the FEVER competition, given a certain text, the task is to find specific evidence for its verification in specified sources. During the presentation, we will show the most interesting ideas submitted by the participants of previous editions, discuss our article, which compares fact verification approaches with psycholinguistic analysis, and also present a winning model for deceiving fact verification systems.
9 March 2020
Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)
The title of the talk will be made available shortly
The summary of the talk will be made available shortly.
2 April 2020
Stan Matwin (Dalhousie University)
The title of the talk will be made available shortly
The summary of the talk will be made available shortly.
Please see also the talks given in 2000–2015 (http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html) and 2015–2019 (http://zil.ipipan.waw.pl/seminar-archive).