Natural Language Processing Seminar 2015–2016

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.

12 October 2015

Vincent Ng (University of Texas at Dallas)

Beyond OntoNotes Coreference

Recent years have seen considerable progress on the notoriously difficult task of coreference resolution owing in part to the availability of coreference-annotated corpora such as MUC, ACE, and OntoNotes. Coreference, however, is more than MUC/ACE/OntoNotes coreference: it encompasses many interesting cases of anaphora that are not covered in the extensively investigated MUC/ACE/OntoNotes entity coreference task. This talk examined several comparatively less-studied coreference tasks that were arguably no less challenging than the MUC/ACE/OntoNotes entity coreference task, including the Winograd Schema Challenge, zero anaphora resolution, and event coreference resolution.

26 October 2015

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish

The author presented the parser being developed within the CLARIN-PL project, its morphological pre-processing, a categorial grammar of Polish integrated with a valency dictionary and used by the parser, and the semantic graph formalism used for meaning representation. He also discussed the algorithms used by the parser and its optimization strategies, concerning both performance and the concise representation of ambiguous syntactic and semantic parse trees.

16 November 2015

Izabela Gatkowska (Jagiellonian University in Kraków)

The Empirical Network of Lexical Links

The empirical network of lexical links is the result of an experiment that uses the human associative mechanism: the subject gives the first word that comes to mind after understanding the stimulus word. The study was conducted cyclically, i.e. response words obtained in the first cycle were used as stimuli in the second cycle. This made it possible to build a semantic network that differs both from networks derived from text corpora, e.g. WORTSCHATZ, and from networks constructed by hand, e.g. WordNet. The empirically obtained links between words in the network have a direction and a strength. The set of incoming and outgoing links of a specific expression forms a lexical node of the network (a subnet). The way in which the network characterizes meaning is shown on the example of feedback links, a specific kind of dependency between two words in a lexical node. A qualitative analysis in terms of the semantic lexical relations known in linguistics, and employed for example in WordNet, permits an interpretation of only approximately 25% of feedback links. The remaining links may be interpreted by referring to the model of meaning description proposed in FrameNet. A qualitative interpretation of all the links found in a lexical node may permit a comparative study of lexical nodes in networks constructed experimentally for different natural languages, and may also allow the separation of empirical semantic models expressed by the same set of links between nodes in a given network.

30 November 2015

Dora Montagna (Universidad Autónoma de Madrid)

Semantic representation of a polysemous verb in Spanish

The author presented a theoretical model of representation of meaning, based on Pustejovsky's theory of the Generative Lexicon. The proposal is intended as a base for automatic disambiguation, but also as a new model of lexicographic description. The model will be applied to a highly productive verb in Spanish, assuming the hypothesis of verbal underspecification in order to establish patterns of semantic behaviors.

7 December 2015

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences), Witold Kieraś (University of Warsaw)

Morphosyntactic tagging of Polish – state of the art and future perspectives

During the presentation, the state of the art in automatic morphosyntactic tagging of Polish text was discussed, with a particular focus on the performance of publicly available tools that can be used in real applications. A qualitative and quantitative analysis of the errors made by the taggers was conducted, along with a discussion of the possible causes of and solutions to these problems. Tagging results for Polish were compared and contrasted with the results for other European languages.

8 December 2015

Salvador Pons Bordería (Universitat de València)

Discourse Markers from a pragmatic perspective: The role of discourse units in defining functions

One of the most disregarded aspects in the description of discourse markers is position. Notions such as "initial position" or "final position" are meaningless unless it can be specified with regard to what a DM is "initial" or "final". The presentation defended the idea that, for this question to be answered, appeal must be made to the notion of "discourse unit". Provided with a set of a) discourse units, and b) discourse positions, determining the function of a given DM is quasi-automatic.

11 January 2016

Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Terminology extraction from Polish data – program TermoPL

The presentation addressed the problems of terminology extraction from Polish domain corpora. The authors described the C-value method, which ranks term candidates based on their frequency and the number of contexts in which they occur. The method takes into account nested terms that may not appear by themselves in the data. As a result, it yields some nested grammatical subphrases that are syntactically correct but semantically odd, such as 'USG jamy' ('USG of cavity'). The recognition of nested terms is therefore supported by a measure of word connection strength, which makes it possible to eliminate truncated phrases from the top part of the term list. The talk concluded with a demo of the TermoPL tool.
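
The ranking idea can be illustrated with a minimal sketch of the classic C-value formula (Frantzi et al.). It is not TermoPL's implementation: the function name, the longer example terms and all counts are invented for illustration, and TermoPL's handling of one-word terms and of word connection strength is not shown.

```python
import math

def c_value(freq):
    """Classic C-value; freq maps a term (tuple of words) to its frequency."""
    def contains(longer, shorter):
        n, m = len(longer), len(shorter)
        return n > m and any(longer[i:i + m] == shorter
                             for i in range(n - m + 1))

    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidate terms that contain this one.
        containers = [freq[other] for other in freq if contains(other, term)]
        weight = math.log2(len(term))  # classic variant: one-word terms score 0
        if containers:
            # Nested term: discount by the mean frequency of its containers.
            scores[term] = weight * (f - sum(containers) / len(containers))
        else:
            scores[term] = weight * f
    return scores

# Toy counts (assumed): 'USG jamy' never occurs on its own, only inside
# the two longer terms, so its frequency is the sum of theirs.
freq = {
    ("USG", "jamy", "brzusznej"): 8,
    ("USG", "jamy", "opłucnej"): 4,
    ("USG", "jamy"): 12,
}
scores = c_value(freq)
```

With these toy counts the truncated phrase 'USG jamy' still receives a positive score although it never occurs by itself, which is the kind of candidate an additional filter has to remove.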

25 January 2016

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish: integration with lexical resources, parsing

During the lecture the author presented the integration of the syntactic-semantic parser with SGJP, Polimorf, Słowosieć and Walenty, as well as preliminary observations concerning the impact that checking semantic preferences has on parsing. He also described the categorial formalism used for parsing and briefly presented how the parser works.

22 February 2016

Witold Dyrka (Wrocław University of Technology)

Language(s) of proteins? – premises, contributions and perspectives

In his talk the author presented arguments in favour of treating protein sequences, or higher protein structures, as sentences in some language(s). He then showed several interesting results (his own and others') of applying quantitative methods of text analysis and formal linguistic tools (such as probabilistic context-free grammars) to the analysis of proteins. Finally, he presented plans for his further work on "protein linguistics", which, he hopes, would inspire an interesting discussion.

22 February 2016

Linguistic Engineering Group (Institute of Computer Science, Polish Academy of Sciences)

Extended seminar

12:00–12:15: People, projects, tools

12:15–12:45: Morfeusz 2: analyzer and inflectional synthesizer for Polish

12:45–13:15: Toposław: Creating MWU lexicons

13:15–13:45: Lunch break

13:45–14:15: TermoPL: Terminology extraction from Polish data

14:15–14:45: Walenty: Valency dictionary of Polish

14:45–15:15: POLFIE: LFG grammar for Polish

7 March 2016

Zbigniew Bronk (Grammatical Dictionary of Polish team member)

JOD – a markup language for Polish declension

JOD, a markup language for Polish declension, was constructed in order to precisely describe inflectional rules and schemes for nouns and adjectives in Polish. Its first application was the description of the inflection of surnames, taking into account the sex of the person or persons using the given surname. This model has been the basis for the "Automaton of declension of Polish surnames". The author presented the general idea of the language and the implementation of its interpreter, as well as the JOD editor and the website "Automaton of declension of Polish surnames".

21 March 2016

Bartosz Zaborowski, Aleksander Zabłocki (Institute of Computer Science, Polish Academy of Sciences)

Poliqarp2 on the home straight

In this talk the authors present Poliqarp 2, a linguistic data search engine on which they have been working for the last three years. They describe both technical aspects and interesting features from the user's point of view. They briefly recall the data model supported by the engine, the structure of the language supported by the new query engine, its expressive power, and differences compared to the previous version. In particular, they focus on elements added or modified during the development of the project (support for Składnica and LFG data models, post-processing, syntactic sugar). Among technical matters, they briefly present the software architecture and some details of the implementation of indexes. They also describe nontrivial decisions related to input data processing (for the National Corpus of Polish in particular). They end the talk by presenting the results of preliminary efficiency measurements.

4 April 2016

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Identification of opinion targets in Polish

The seminar concluded and summarised the results of a National Science Centre (NCN) grant that finished in January 2016. It presented three resources with labelled sentiments and opinion targets developed within the project: a bank of dependency trees created from a corpus of product reviews, a subset of the Składnica dependency treebank, and a collection of tweets. The seminar included a discussion of experiments on automated recognition of opinion targets. These involved two parsing methods, dependency and shallow, and a hybrid method in which the results of syntactic analysis were used by statistical models (e.g. CRF).

21 April 2016 (Thursday)

Magdalena Derwojedowa (University of Warsaw)

“Tem lepiej, ale jest to interes miljonowy i traktujemy go poważnie” – A thousand words a thousand times in 5 parts

The talk presented the 1M corpus of the project „Automatic morphological analysis of Polish texts from 1830-1918 period with respect to the evolution of inflection and spelling” (DEC-2012/07/B/HS2/00570): the structure of the corpus, its stylistic, temporal and regional diversity, and the inflectional characteristics of the resource in comparison with the features described in Bajerowa's works.

9 May 2016

Daniel Janus (Rebased.pl)

From unstructured data to searchable metadata-rich corpus: Skyscraper, P4, Smyrna

The presentation described tools facilitating the construction of custom datasets, in particular corpora of texts. The author presented Skyscraper, a library for scraping structured data out of whole WWW sites, and Smyrna, a concordancer for Polish texts enriched with metadata. In addition, a dataset built using these tools was presented: the Polish Parliamentary Proceedings Processor (PPPP, or P4), including, inter alia, a continuously updated corpus of speeches in the Polish parliament. The presentation largely focused on the technical solutions used in the tools shown.

19 May 2016 (Thursday)

Kamil Kędzia, Konrad Krulikowski (University of Warsaw)

Generating paraphrases' templates for Polish using parallel corpus

Software for generating paraphrases in Polish was prepared within the CLARIN-PL project. The developers will demonstrate how it works on chosen examples. They will also explain the method of Ganitkevitch et al. (2013), which allowed its authors to create the openly available Paraphrase Database (PPDB). Furthermore, they will discuss its enhancements and their approach to the challenges specific to the Polish language. Additionally, they will demonstrate a way of measuring the quality of paraphrases.

23 May 2016

Damir Ćavar (Indiana University)

The Free Linguistic Environment

The Free Linguistic Environment (FLE) started as a project to develop an open and free platform for white-box modeling and grammar engineering, i.e. development of natural language morphologies, prosody, syntax, and semantic processing components that are for example based on theoretical frameworks like two-level morphology, Lexical Functional Grammar (LFG), Glue Semantics, and similar. FLE provides a platform that makes use of some classical algorithms and also new approaches based on Weighted Finite State Transducer models to enable probabilistic modeling and parsing at all linguistic levels. Currently its focus is to provide a platform that is compatible with LFG and an extended version of it, one that we call Probabilistic Lexical Functional Grammar (PLFG). This probabilistic modeling can apply to the c(onstituent)-structure component, i.e. a Context Free Grammar (CFG) backbone can be extended by a Probabilistic Context Free Grammar (PCFG). Probabilities in PLFG can also be associated with structural representations and corresponding f(unctional feature)-structures or semantic properties, i.e. structural and functional properties and their relations can be modeled using weights that can represent probabilities or other forms of complex scores or metrics. In addition to these extensions of the LFG framework, FLE also provides an open platform for experimenting with algorithms for semantic processing or analyses based on (probabilistic) lexical analyses, c- and f-structures, or similar such representations. Its architecture is extensible to cope with different frameworks, e.g. dependency grammar, optimality theory based approaches, and many more.

6 June 2016

Karol Opara (Systems Research Institute of the Polish Academy of Sciences)

Grammatical rhymes in Polish poetry – a quantitative analysis

Polish is a highly inflected language and parts of speech in the same morphological form have common endings. This allows one to easily find a multitude of rhyming words known as grammatical rhymes. Their overuse is strongly discouraged in the contemporary Polish literary canon due to their alleged banality. The talk presented the results of computer-aided investigations into poets’ technical mastery based on estimating the share of grammatical rhymes in their verses. A method of automatic rhyme detection was discussed, as well as the extraction of statistical information from texts and a new “literary” criterion for choosing the sample size for statistical tests. Finally, a ranking of the technical mastery of various Polish poets was presented.

Natural Language Processing Seminar 2016–2017

10 October 2016

Katarzyna Pakulska, Barbara Rychalska, Krystyna Chodorowska, Wojciech Walczak, Piotr Andruszkiewicz (Samsung)

Paraphrase Detection Ensemble – SemEval 2016 winner

This seminar describes the winning solution designed for a core track within the SemEval 2016 English Semantic Textual Similarity task. The goal of the competition was to measure semantic similarity between two given sentences on a scale from 0 to 5. At the same time the solution should replicate human language understanding. The presented model is a novel hybrid of recursive auto-encoders from deep learning (RAE) and a WordNet award-penalty system, enriched with a number of other similarity models and features used as input for Linear Support Vector Regression.

24 October 2016

Adam Przepiórkowski, Jakub Kozakoszczak, Jan Winkowski, Daniel Ziembicki, Tadeusz Teleżyński (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Corpus of formalized textual entailment steps

The authors present resources created within the CLARIN project to help with the qualitative evaluation of RTE systems: two corpora of textual derivations and a corpus of textual entailment rules. A textual derivation is a series of atomic steps which connects the Text with the Hypothesis in a textual entailment pair. The original pairs are taken from the FraCaS corpus and a Polish translation of the RTE3 corpus. A textual entailment rule sanctions the textual entailment relation between the input and the output of a step, using syntactic patterns written in the UD standard and some other semantic, logical and contextual constraints expressed in FOL.

7 November 2016

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Concordia – translation memory search algorithm

The talk covers the Concordia algorithm which is used to maximize the productivity of a human translator. The algorithm combines the features of standard fuzzy translation memory searching with a concordancer. As the key non-functional requirement of computer-aided translation mechanisms is performance, Concordia incorporates upgraded versions of standard approximate searching techniques, aiming at reducing the computational complexity.
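
The two lookup styles that the abstract says Concordia combines can be illustrated with a deliberately naive sketch. It is not the Concordia algorithm: it scans the whole memory linearly and uses Python's difflib for the similarity score, whereas the talk concerns faster approximate searching techniques. The function names, the threshold and the example sentence pairs are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_lookup(sentence, memory, threshold=0.7):
    """Return (score, source, target) of the most similar stored sentence."""
    query = sentence.lower().split()
    best = None
    for source, target in memory:
        # Token-level similarity in [0, 1]; 1.0 means identical token lists.
        score = SequenceMatcher(None, query, source.lower().split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

def concordance(fragment, memory):
    """Return every stored pair whose source side contains the fragment."""
    needle = fragment.lower()
    return [(s, t) for s, t in memory if needle in s.lower()]

# Hypothetical translation memory: (source sentence, stored translation).
memory = [
    ("The printer is out of paper", "W drukarce skończył się papier"),
    ("The printer is offline", "Drukarka jest offline"),
]
match = fuzzy_lookup("The printer is out of ink", memory)
hits = concordance("offline", memory)
```

A fuzzy lookup suggests a whole stored translation for a similar sentence, while a concordance search returns every pair containing a given fragment; the linear scan here is exactly the cost that a production system needs to avoid.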

21 November 2016

Norbert Ryciak, Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Using recursive deep neural networks and syntax to compute phrase semantics

The seminar presents initial experiments on recursive phrase-level sentiment computation using dependency syntax and deep learning. We discuss neural network architectures and implementations created within Clarin 2 and present results on English language resources. The seminar also covers ongoing work on Polish language resources.

5 December 2016

Dominika Rogozińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Methods of syntax disambiguation for constituent parse trees in Polish as post–processing phase of the Świgra parser

The presentation shows methods of syntactic disambiguation for the parses of Polish utterances produced by the Świgra parser. The presented methods include probabilistic context-free grammars and maximum entropy models. The best of the described models achieves an efficiency measure of 96.2%. The outcome of our experiments is a module for post-processing Świgra's parses.
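
As a rough illustration of PCFG-based disambiguation, the sketch below scores each candidate tree by the product of the probabilities of the rules it uses and keeps the most probable one. The toy grammar, its probabilities and the tree encoding are invented for illustration; they are not Świgra's grammar or the authors' models, and the maximum entropy variant is not shown.

```python
import math

def tree_logprob(tree, rule_probs):
    """Log-probability of a tree given as (label, child, ...); leaves are tags."""
    if isinstance(tree, str):
        return 0.0  # a leaf contributes no rule of its own
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    logp = math.log(rule_probs[(label, rhs)])
    return logp + sum(tree_logprob(c, rule_probs) for c in children)

def disambiguate(trees, rule_probs):
    """Pick the candidate parse with the highest probability."""
    return max(trees, key=lambda t: tree_logprob(t, rule_probs))

# Assumed toy grammar over part-of-speech tags (V, N, P).
rule_probs = {
    ("VP", ("V", "NP", "PP")): 0.3,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("N",)): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}
# Two readings of the tag sequence V N P N: PP attached to the verb or the noun.
verb_attach = ("VP", "V", ("NP", "N"), ("PP", "P", ("NP", "N")))
noun_attach = ("VP", "V", ("NP", ("NP", "N"), ("PP", "P", ("NP", "N"))))
best = disambiguate([verb_attach, noun_attach], rule_probs)
```

Working in log space avoids numerical underflow on larger trees; in a real setting the rule probabilities would be estimated from a treebank rather than written by hand.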

9 January 2017

Agnieszka Pluwak (Institute of Slavic Studies, Polish Academy of Sciences)

Building a domain-specific knowledge representation using an extended method of frame semantics on a corpus of Polish, English and German lease agreements

The FrameNet project is defined by its authors as a lexical base with some ontological features (not an ontology sensu stricto, however, due to a selective approach towards description of frames and lexical units, as well as frame-to-frame relations). Ontologies, as knowledge representations in the field of NLP, should be applicable to specific domains and texts; however, in the FrameNet bibliography published before January 2016 I haven’t found a single knowledge representation based entirely on frames or on an extensive structure of frame-to-frame relations. I did find a few examples of domain-specific knowledge representations with the use of selected FrameNet frames, such as BioFrameNet or Legal FrameNet, where frames were applied to connect data from different sources. Therefore, in my dissertation, I decided to conduct an experiment and build a knowledge representation of frame-to-frame relations for the domain of lease agreements. The aim of my study was to describe frames useful for building a possible data extraction system for lease agreements, that is, frames containing answers to questions asked by a professional analyst while reading lease agreements. In my work I have asked several questions, e.g. would I be able to use FrameNet frames for this purpose or would I have to build my own frames? Will the analysis of Polish cause language-specific problems? How will the professional language affect the use of frames in context?

23 January 2017

Marek Rogalski (Lodz University of Technology)

Automatic paraphrasing

Paraphrasing is conveying the essential meaning of a message using different words. The ability to paraphrase is a measure of understanding. A teacher who asks a student "could you please tell us using your own words ..." tests whether the student has understood the topic. In this presentation we will discuss the task of automatic paraphrasing. We will differentiate between syntax-level paraphrases and essential-meaning-level paraphrases. We will bring up several techniques from seemingly unrelated fields that can be applied in automatic paraphrasing. We will also show results that we've been able to produce with those techniques.

6 February 2017

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Korpusomat – a tool for creation of searchable own corpora

Korpusomat is a web tool facilitating unassisted creation of corpora for linguistic studies. Once a set of text files has been uploaded, the texts are automatically morphologically analysed and lemmatised using Morfeusz and disambiguated using the Concraft tagger. The resulting corpus can then be downloaded and analysed offline using the Poliqarp search engine to query for information related to text segmentation, base forms, inflectional interpretations and (dis)ambiguities. Poliqarp is also capable of calculating frequencies and applying basic statistical measures necessary for quantitative analysis. Apart from plain text files, Korpusomat can also process more complex textual formats such as the popular EPUB, download source data from the Internet, strip unnecessary information and extract document metadata.

20 February 2017 (invited talk at the Institute seminar)

Elżbieta Hajnicz (Institute of Computer Science, Polish Academy of Sciences)

Representation language of the valency dictionary Walenty

The Polish Valence Dictionary (Walenty) is intended to be used by natural language processing tools, particularly parsers, and thus it offers formalized representation of the valency information. The talk presented the notion of valency and its representation in the dictionary along with examples illustrating how particular syntactic and semantic language phenomena are modelled.

2 March 2017

Wojciech Jaworski (University of Warsaw)

Integration of dependency parser with a categorial parser

As part of the talk I will describe the division of texts into sentences and the control of the execution of each parser within the hybrid parser emerging in the Clarin-bis project. I will describe the adopted method of converting dependency structures to make them compatible with the structures of the categorial parser. The conversion has two aspects: changing the attributes of each node and changing the links between nodes. I will show how the method can be extended to convert the compressed forests generated by the Świgra parser. At the end I will talk about the plans for and goals of the reimplementation of the MateParser algorithm.

13 March 2017

Marek Kozłowski, Szymon Roziewski (National Information Processing Institute)

Internet model of Polish and semantic text processing

The presentation shows how BabelNet (the multilingual encyclopaedia and semantic network based on publicly available data sources such as Wikipedia and WordNet) can be used in the tasks of grouping short texts, sentiment analysis, and emotional profiling of movies based on their subtitles. The second part presents work based on CommonCrawl, a publicly available petabyte-size open repository of multilingual Web pages. CommonCrawl was used to build two models of Polish: n-gram-based and semantic distribution-based.

20 March 2017

Jakub Szymanik (University of Amsterdam)

Semantic Complexity Influences Quantifier Distribution in Corpora

In this joint paper with Camilo Thorne, we study whether semantic complexity influences the distribution of generalized quantifiers in a large English corpus derived from Wikipedia. We consider the minimal computational device recognizing a generalized quantifier as the core measure of its semantic complexity. We regard quantifiers that belong to three increasingly more complex classes: Aristotelian (recognizable by 2-state acyclic finite automata), counting (k+2-state finite automata), and proportional quantifiers (pushdown automata). Using regression analysis we show that semantic complexity is a statistically significant factor explaining 27.29% of frequency variation. We compare this impact to that of other known sources of complexity, both semantic (quantifier monotonicity and the comparative/superlative distinction) and superficial (e.g., the length of quantifier surface forms). In general, we observe that the more complex a quantifier, the less frequent it is.
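
The three complexity classes can be made concrete with a small sketch. It assumes the usual semantic-automata encoding, which the abstract does not spell out: for a sentence "Q A are B", each element of A is written as "1" if it is also in B and as "0" otherwise. The choice of "every", "exactly k" and "most" as representatives, and the function names, are illustrative assumptions.

```python
def every(word):
    """Aristotelian: a 2-state automaton with an accepting start state."""
    state = 0
    for symbol in word:
        if state == 0 and symbol == "0":
            state = 1  # rejecting sink: some A is not B
    return state == 0

def exactly(k):
    """Counting: k + 2 states, i.e. counters 0..k plus a rejecting sink."""
    def accepts(word):
        state = 0
        for symbol in word:
            if symbol == "1":
                state = min(state + 1, k + 1)  # state k + 1 is the sink
        return state == k
    return accepts

def most(word):
    """Proportional: needs unbounded memory; a counter stands in for the stack."""
    balance = 0
    for symbol in word:
        balance += 1 if symbol == "1" else -1
    return balance > 0

# Four As, of which the 2nd and 3rd are Bs.
sample = "0110"
```

The first two recognisers keep only a bounded state, whereas "most" has to compare two unbounded counts, which is why it falls outside the finite-automaton classes.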

27 March 2017 (invited talk at the institute seminar)

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Introduction to deep neural networks

In the last few years, Deep Neural Networks (DNNs) have become a tool that provides the best solutions to many problems in image and speech recognition. In natural language processing, too, DNNs have revolutionized the way translation and word representation are done (along with many other problems). This presentation aims to provide good intuitions about DNNs, their core architectures and how they operate. I will discuss and suggest tools and source materials that can help in further exploration of the topic and in independent experiments.

3 April 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part I)

Argumentation, the most prominent way people communicate, has been attracting a lot of attention since the very beginning of the scientific reflection. The Centre for Argument Technology has been developing the infrastructure for studying argument structures for almost two decades. Our approach demonstrates several characteristics. First, we build upon the graph-based standard for argument representation, Argument Interchange Format AIF (Rahwan et al., 2007); and Inference Anchoring Theory IAT (Budzynska and Reed, 2011) which allows us to capture dialogic context of argumentation. Second, we focus on a variety of aspects of argument structures such as argumentation schemes (Lawrence and Reed, 2016); illocutionary intentions speakers associate with arguments (Budzynska et al., 2014a); ethos of arguments' authors (Duthie et al., 2016); rephrase relation which paraphrases parts of argument structures (Konat et al., 2016); and protocols of argumentative dialogue games (Yaskorska and Budzynska, forthcoming).

10 April 2017

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Neural nets for natural language processing – selected architectures and problems

For the last few years more and more problems in NLP have been successfully tackled with neural nets, particularly with deep architectures. These include sentiment analysis, topic classification, coreference, word representations and image labelling. In this talk I will give some details on the most promising architectures used in NLP, including recurrent and convolutional nets. The presented solutions will be given in the context of a concrete problem, namely coreference resolution in Polish.

15 May 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part II)

In the second part of our presentation we will describe characteristics of argument structures using examples from our AIF corpora of annotated argument structures in various domains and genres (see also OVA+ annotation tool) including moral radio debates (Budzynska et al., 2014b); Hansard records of the UK parliamentary debates (Duthie et al., 2016); e-participation (Konat et al., 2016; Lawrence et al., forthcoming); and the US 2016 presidential debates (Visser et al., forthcoming). Finally, we will show how such complex argument structures, which on the one hand make the annotation process more time-consuming and less reliable, can on the other hand result in automatic extraction of a variety of valuable information when applying technologies for argument mining (Budzynska and Villata, 2017; Lawrence and Reed, forthcoming) and argument analytics (Reed et al., forthcoming).

12 June 2017 (invited talk at the Institute seminar)

Adam Pawłowski (University of Wrocław)

Sequential structures in texts

The subject of my lecture is the phenomenon of sequentiality in linguistics. Sequentiality is defined here as a characteristic feature of a text or of a collection of texts, which expresses the sequential relationship between units of the same type, ordered along the axis of time or according to a different variable (e.g. the sequence of reading or publishing). In order to model sequentiality thus understood, we can use, among others, time series, spectral analysis, the theory of stochastic processes, information theory or some tools of acoustics. Referring to both my own research and existing literature, in my lecture I will present sequential structures and selected models thereof in continuous texts, as well as models used in relation to sequences of several texts (known as chronologies of works); I will also mention glottochronology, which is a branch of quantitative linguistics that aims at mathematical modeling of the development of language over long periods of time. Finally, I will discuss philosophical attempts to elucidate sequentiality (the notion of the text’s ‘memory’, the result chain, Pythagoreanism, Platonism).

Natural Language Processing Seminar 2017–2018

2 October 2017

Paweł Rutkowski (University of Warsaw)

Polish Sign Language from the perspective of corpus linguistics

Polish Sign Language (polski język migowy, PJM) is a full-fledged visual-spatial language used by the Polish Deaf community. It started to evolve in the second decade of the nineteenth century, with the foundation of the first school for the deaf in Poland. Until recently, PJM attracted very little attention from the linguistic community in Poland. The aim of this talk is to present a large-scale research project aimed at creating an extensive and representative corpus of PJM. The corpus is currently being compiled at the University of Warsaw. It is a collection of video clips showing Deaf people using PJM in a variety of different communication contexts. The videos are richly annotated: they are segmented, lemmatized, translated into Polish, tagged for various grammatical features and transcribed with HamNoSys symbols. The Corpus of PJM is currently one of the two largest sets of annotated sign language data in the world. Special attention will be paid to the issue of lexical frequency in PJM. Studies of this type are available for a handful of sign languages only, including American Sign Language, New Zealand Sign Language, British Sign Language, Australian Sign Language and Slovene Sign Language. Their empirical basis ranged from 100,000 tokens (NZSL) to as little as 4,000 tokens (ASL). The present talk contributes to our understanding of lexical frequency in sign languages by analyzing a much larger set of relevant data from PJM.

23 October 2017

Katarzyna Krasnowska-Kieraś, Piotr Rybak, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

Towards the evaluation of feature embedding models of the fusional languages in the context of morphosyntactic disambiguation and dependency parsing

Neural networks have recently been very successful in various natural language processing tasks. An important component of a neural network approach is a dense vector representation of features, i.e. feature embedding. Various feature types are possible, e.g. words or part-of-speech tags. In our talk we are going to present results of an analysis showing what should be used as features in estimating embedding models of fusional languages – tokens or lemmata. Furthermore, we are going to discuss the methodological question of whether the results of the intrinsic evaluation of embeddings are informative for downstream applications, or whether the embedding models should be evaluated extrinsically. The accompanying experiments were conducted on Polish – a fusional Slavic language with a relatively free word order. This research has inspired us to implement a morphosyntactic disambiguator – Toygger (Krasnowska-Kieraś, 2017). The tool won shared task 1 (A) in the PolEval 2017 competition and will be presented in our talk.

6 November 2017

Szymon Łęski (Samsung R&D Poland)

Deep neural networks in language models

In my talk I will first give an introduction to language models: traditional, n-gram based, and new, based on recurrent networks. Then, based on recent papers, I will discuss the most interesting extensions and modifications to RNN-based language models, such as modifying word representations or models with output not limited to a pre-defined vocabulary.
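
As a rough illustration of the traditional n-gram approach mentioned in the abstract, the sketch below estimates a bigram model with add-one smoothing. It is not taken from the talk: the toy corpus, the function names and the choice of smoothing are assumptions made only for this example.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over tokenised sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])              # every token that acts as a context
        bigrams.update(zip(padded, padded[1:]))   # adjacent token pairs
    # Words that can be predicted: all corpus words plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_prob(tokens, model):
    """Probability of a whole sentence as a product of bigram probabilities."""
    unigrams, bigrams, vocab = model
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= bigram_prob(prev, word, unigrams, bigrams, vocab)
    return p

# Toy corpus, invented for illustration only.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
model = train_bigram_model(corpus)
print(sentence_prob(["the", "cat", "sleeps"], model))
print(sentence_prob(["cat", "the", "sleeps"], model))
```

A model of this kind can only score words from its fixed vocabulary, which is the limitation that the open-vocabulary extensions mentioned in the abstract address.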

20 November 2017

Michał Ptaszyński (Kitami Institute of Technology, Japan)

Capturing Emotions in Context as a way towards Computational Phronesis

Research on emotions within Artificial Intelligence and related fields has flourished in recent years. Unfortunately, in most research emotions are analyzed without their context. I will argue that recognizing emotions without recognizing their context is incomplete and cannot be sufficient for real-world applications. I will also describe some consequences of disregarding the context of emotions. Finally, I will present one approach in which the context of emotions is considered, and briefly describe some of the first experiments performed in this matter.

27 November 2017

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Automated coreference resolution in Polish

The talk presents the description of nominal referential constructs in Polish (i.e. textual fragments referencing the same discourse entities) and the computational-linguistic methods implemented for their decoding. The algorithms are corpus-based with manual annotation of coreferential constructs and are evaluated using standard metrics.

4 December 2017

Adam Dobaczewski, Piotr Sobotka, Sebastian Żurowski (Nicolaus Copernicus University in Toruń)

Dictionary of Polish reduplications and repetitions

In our talk we will present a dictionary prepared by the team from the Institute of Polish Language of the Nicolaus Copernicus University in Toruń (grant NPRH 11H 13 0265 82). In the dictionary we document expressions of the Polish language in which reduplication or repetition of forms of the same lexeme can be observed. We distinguish the units of language according to Bogusławski's operational grammar framework and divide them into two basic groups: (i) lexical units consisting of two such segments or forms of the same lexeme (Pol. całkiem całkiem; fakt faktem); (ii) operational units based on some pattern of repetition of words belonging to a certain class predicted by this scheme (Pol. N[nom] N[inst] ale _, where N stands for any noun, e.g. sąd sądem, ale _; miłość miłością, ale _). We have prepared the dictionary in traditional (printed) form due to the relatively small number of registered units. Its material base is the resources of the NKJP, which were searched using a dedicated search engine for repetitions in the NKJP. This tool was specially prepared for this project at the LEG ICS PAS.

29 January 2018

Roman Grundkiewicz (Adam Mickiewicz University in Poznań/University of Edinburgh)

Automatic Grammatical Error Correction using Machine Translation

In my presentation I will talk about the task of automated grammatical error correction (GEC) in texts written by non-native English speakers. I will present our experiments on the application of phrase-based statistical machine translation (SMT), and our GEC system, which achieved new state-of-the-art results. The importance of parameter optimization towards the task-specific evaluation metric and new GEC-adapted dense and sparse features will be discussed. I will also briefly describe the results of further research using neural machine translation (NMT).

12 February 2018

Agnieszka Mykowiecka, Aleksander Wawer, Małgorzata Marciniak, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Recognition of metaphorical noun phrases in Polish with distributional semantics

Our talk addresses the use of vector models for Polish based on lemmas and forms. We compare the results for two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. Then we apply vector models to detect metaphorical and literal meaning of adjective-noun (AN) phrases. We show the results of our method for isolated phrases and compare them to other known methods. Finally, we discuss the problem of recognition of metaphorical/literal meaning of AN phrases in sentences.

26 February 2018

Celina Heliasz (University of Warsaw)

To create or to contribute? On the search for synergy between computer scientists and linguists

The main topic of my presentation is the methods of conducting research in the field of corpus linguistics, which is currently being addressed by both computer scientists and linguists. In my speech, I will present the attempts to recognize and visualize semantic relations in the text undertaken by computer scientists as part of two projects: RST (Rhetorical Structure Theory) and PDTB (Penn Discourse Treebank). Then, I will contrast RST and PDTB with analogous attempts made by computer scientists and linguists at IPI PAN as part of the CLARIN-PL venture. The aim of the presentation is to show the determinants of effective linguistic analysis, which must be taken into account when designing IT tools, if these tools are to be used to conduct research on text and derive strong foundations of linguistic theories from it, and not only to implement existing theories in this field.

9 April 2018

Jan Kocoń (Wrocław University of Technology)

Recognition of temporal expressions and events in Polish text documents

A temporal expression is a sequence of words that informs you about when or how often an event occurs, or how long it lasts. Event descriptions are words which indicate a change of state in the description of reality (and also some states). These issues fall within the scope of information extraction. They are well defined and described for English and partly for other languages. The TimeML specification, whose temporal information description language has been accepted as an ISO standard, has been officially adapted for six languages, and the temporal expressions description section is defined for eleven languages. The result of the work carried out within CLARIN-PL is the adaptation of TimeML guidelines for the Polish language. The motivation for this topic was the fact that temporal information is used by various natural language processing tasks, including methods for question answering, automatic text summarisation, semantic relations extraction and many others. These methods allow researchers in the domain of Digital Humanities and Social Sciences to work with a very large collection of texts whose analysis, without these methods, would be very time-consuming, if possible at all. In addition to the adaptation of the temporal information description language itself, the quality and efficiency of methods is a key aspect of temporal expression and event recognition. The presentation will discuss both the analysis of the quality of data prepared by domain experts (including annotation agreement analysis) and the results of research aimed at reducing the complexity of the computational problem while preserving the quality of methods.

23 April 2018

Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska (Institute of the Polish Language, Polish Academy of Sciences), Witold Kieraś, Dorota Komosińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Historical corpus – problems of transliteration, transcription and annotation on the example of the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772)

During the seminar, the process of creating the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), also called the Baroque Corpus, will be discussed. Particular emphasis will be placed on those tasks and problems that are specific to historical corpora, in contrast to corpora of contemporary texts, e.g. the National Corpus of Polish. We will also show the tools that were created for the needs of the project or adapted to these needs. After the general presentation of the project (assumptions, financing, team, current status, the corpus's purpose) we will discuss particular problems in the order in which they appeared during the creation of the corpus: the selection of texts, gathering them and incorporating them into a database, the necessity of their transcription into modern spelling (resulting from the huge spelling variation in old prints and manuscripts), issues of morphological analysis, morphosyntactic annotation (manual and automatic) and corpus searching.

14 May 2018

Łukasz Kobyliński, Michał Wasiluk, Zbigniew Gawłowicz (Institute of Computer Science, Polish Academy of Sciences)

MTAS corpus search engine and its implementation for Polish language corpora

During the seminar we will discuss our experiences with the MTAS search engine in the context of Polish language corpora. We will present several implementations of MTAS in such corpus-related projects as KORBA (the corpus of Polish language of the XVII and XVIII century), the XIX century language corpus, as well as National Corpus of Polish. We will also discuss preliminary experiments with implementing MTAS in Korpusomat - a tool that allows users to create their own corpora. During the presentation we will share our solutions to the problems encountered during the adaptation of MTAS to Polish and preliminary efficiency test results. We will also discuss the search capabilities of the engine and our plans for enhancing MTAS.

21 May 2018 (IPI PAN seminar presentation, 13:00)

Piotr Borkowski (Institute of Computer Science, Polish Academy of Sciences)

Semantic methods of categorization in the tasks of text document analysis

In my PhD thesis entitled ‘Semantic methods of categorization in the tasks of text document analysis’, a new algorithm of semantic categorization of documents was proposed and examined. On its basis, a new algorithm for category aggregation, a family of semantic classification algorithms, and a heterogeneous classifier committee (which combines the algorithm of semantic categorization with previously known classifiers) were developed. In my talk I will briefly present their concepts and the results of their effectiveness studies.

28 May 2018

Krzysztof Wołk (Polish-Japanese Academy of Information Technology)

Exploration and usage of comparable corpora in machine translation

The problem that will be presented in the seminar is how to improve machine speech translation between Polish and English. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation. Polish language resources are lacking in parallel and monolingual data. Therefore, the main objective of the study was to develop an automatic toolkit for textual resources preparation by mining comparable corpora and quasi comparable corpora. Experiments were conducted mostly on casual human speech, consisting of lectures, movie subtitles, European Parliament proceedings, and European Medicines Agency texts. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., adaptation of techniques and training parameters to increase the Bilingual Evaluation Understudy (BLEU) score for maximum performance. A further aim was to create additional bilingual and monolingual data resources by using available online data and by obtaining and mining comparable corpora for parallel sentence pairs. For this task, a methodology employing a Support Vector Machine and the Needleman-Wunsch algorithm was used, along with a chain of specialized tools.
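
The Needleman-Wunsch algorithm named in the abstract is a dynamic-programming method for global sequence alignment. The sketch below shows the textbook form applied to two token sequences; the scoring values (match 1, mismatch -1, gap -1) and the function name are assumptions for illustration, and the abstract does not say how the algorithm was combined with the Support Vector Machine in the actual toolkit.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score and one optimal alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))   # gap in b
            i -= 1
        else:
            aligned.append((None, b[j - 1]))   # gap in a
            j -= 1
    aligned.reverse()
    return score[n][m], aligned

# Toy token sequences, invented for illustration only.
best_score, alignment = needleman_wunsch(["a", "b", "c"], ["a", "c"])
print(best_score, alignment)
```

Because the function only compares elements for equality, it works on strings, word lists or any other sequences.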

4 June 2018

Piotr Przybyła (University of Manchester)

Supporting document screening for systematic reviews using machine learning and text mining

Systematic reviews, aiming to aggregate and analyse all the literature for a given research question, are a crucial tool in medical research. Their most laborious stage is screening, i.e. manual selection of dozens of relevant articles from thousands returned by search engines. Formulating the problem as a text classification task and using appropriate unsupervised text mining tools could lead to significant work saved. The presentation will cover adaptation of machine learning algorithms to the problem, tools for extracting and visualising terms and topics in collections, system deployment and evaluation at NICE (National Institute for Health and Care Excellence), a UK agency publishing health technology guidelines.

11 June 2018

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

Preparing a speech corpus using the recordings of the Polish Film Chronicle

The presentation will describe how a speech corpus based on the Polish Film Chronicle, a collection of short historical news segments, was created during the CLARIN-PL project. This resource is an extremely useful tool for linguistic research, specifically in the context of historical speech and language. The years 1945–1960 were chosen for this purpose. The presentation will discuss various topics: from the legal issues of acquiring the resources, to the more technical aspects of adapting speech analysis tools to this rather uncommon domain.

Natural Language Processing Seminar 2018–2019

1 October 2018

Janusz S. Bień (University of Warsaw – prof. emeritus)

Electronic indexes to lexicographical resources

We will focus on the indexes to lexicographical resources available online in DjVu format. Such indexes can be browsed, searched, modified and created with the djview4poliqarp open source program; the origins and the history of the program will be briefly presented. Originally the index support was added to the program to handle the list of entries in the 19th century Linde's dictionary, but it can also be conveniently used for other resources, as will be demonstrated on selected examples. In particular, some new features introduced to the program in recent months will be presented publicly for the first time.

15 October 2018

Wojciech Jaworski, Szymon Rutkowski (University of Warsaw)

A multilayer rule based model of Polish inflection

The presentation will be devoted to the multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish; it does not use the concept of an inflection paradigm. The model consists of three layers of hand-made rules: an "orthographic-phonetic layer" converting a segment to a representation reflecting morphological patterns of the language, an "analytic layer" generating the lemma and determining the affix, and an "interpretation layer" giving a morphosyntactic interpretation based on detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with the function of guessing lemmas and morphosyntactic interpretations for non-dictionary forms (guesser). The second use of the model is generation of word forms based on the lemma and morphosyntactic interpretation. The presentation will also cover the issue of disambiguation of the results provided by the morphological analyzer. The demo version of the program is available on the Internet.

29 October 2018

Jakub Waszczuk (Heinrich-Heine-Universität Düsseldorf)

From morphosyntactic tagging to identification of verbal multiword expressions: a discriminative approach

The first part of the talk was dedicated to Concraft-pl 2.0, the new version of a morphosyntactic tagger for Polish based on conditional random fields. Concraft-pl 2.0 performs morphosyntactic segmentation as a by-product of disambiguation, which allows it to be used directly on the segmentation graphs provided by the analyser Morfeusz. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. During the second part, an approach to identifying verbal multiword expressions (VMWEs) based on dependency parsing results was presented. In this approach, VMWE identification is reduced to the problem of dependency tree labeling, where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree. The underlying labeling model can be seen as conditional random fields (as used in Concraft) adapted to tree structures. A system based on this approach ranked 1st in the closed track of the PARSEME shared task 2018.

5 November 2018

Jakub Kozakoszczak (Faculty of Modern Languages, University of Warsaw / Heinrich-Heine-Universität Düsseldorf)

Mornings to Wednesdays — semantics and normalization of Polish quasi-periodical temporal expressions

The standard interpretations of expressions like “Januarys” and “Fridays” in temporal representation and reasoning are slices of collections of 2nd order, e.g. all the sixth elements of day sequences of cardinality 7 aligned with calendar weeks. I will present results of the work on normalizing most frequent Polish quasi-periodical temporal expressions for online booking systems. On the linguistic side I will argue against synonymy of the kind “Fridays” = “sixth days of the weeks” and give semantic tests for rudimentary classification of quasi-periodicity. In the formal part I will propose an extension to existing formalisms covering intensional quasi-periodical operators “from”, “to”, “before” and “after” restricted to monotonic domains. In the implementation part I will demonstrate an algorithm for lazy generation of generalized intersection of collections.
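
To make the idea of lazily intersecting calendar collections more concrete, the sketch below intersects two infinite, ascending streams of dates without materialising either of them. This is not the speaker's algorithm: the abstract mentions a generalized intersection, whereas this is a plain intersection of ordered streams, and all function names and the start date are assumptions for illustration.

```python
from datetime import date, timedelta
from itertools import islice

def days_from(start):
    """Infinite ascending stream of consecutive days."""
    d = start
    while True:
        yield d
        d += timedelta(days=1)

def fridays(start):
    # date.weekday() numbers Monday as 0, so Friday is 4.
    return (d for d in days_from(start) if d.weekday() == 4)

def januarys(start):
    return (d for d in days_from(start) if d.month == 1)

def lazy_intersection(xs, ys):
    """Lazily intersect two ascending (possibly infinite) iterators."""
    x, y = next(xs, None), next(ys, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(xs, None), next(ys, None)
        elif x < y:
            x = next(xs, None)   # advance the stream that is behind
        else:
            y = next(ys, None)

start = date(2019, 1, 1)
january_fridays = lazy_intersection(fridays(start), januarys(start))
# Only as many dates are computed as are actually requested.
print(list(islice(january_fridays, 5)))
```

Each element is produced on demand, so a booking system could ask for the next few matching dates without enumerating the whole calendar.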

19 November 2018

Daniel Zeman (Institute of Formal and Applied Linguistics, Charles University in Prague)

Universal Dependencies and the Slavic Languages

I will present Universal Dependencies, a worldwide community effort aimed at providing multilingual corpora, annotated at the morphological and syntactic levels following unified annotation guidelines. I will discuss the concept of core arguments, one of the cornerstones of the UD framework. In the second part of the talk I will focus on some interesting problems and challenges of applying Universal Dependencies to the Slavic languages. I will discuss examples from 12 Slavic languages that are currently represented in UD and show that cross-linguistic consistency can still be improved.

3 December 2018

Ekaterina Lapshinova-Koltunski (Saarland University)

Analysis and Annotation of Coreference for Contrastive Linguistics and Translation Studies

In this talk, I will report on the ongoing work on coreference analysis in a multilingual context. I will present two approaches in the analysis of coreference and coreference-related phenomena: (1) top-down or theory-driven: here we start from some linguistic knowledge derived from the existing frameworks, define linguistic categories to analyse and create an annotated corpus that can be used either for further linguistic analysis or as training data for NLP applications; (2) bottom-up or data-driven: in this case, we start from a set of features of shallow character that we believe are discourse-related. We extract these structures from a huge amount of data and analyse them from a linguistic point of view trying to describe and explain the observed phenomena from the point of view of existing theories and grammars.

7 January 2019

Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw), Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford)

Enhanced Universal Dependencies

The aim of this talk is to present the two threads of our recent work on Universal Dependencies (UD), a standard for syntactically annotated corpora (http://universaldependencies.org/). The first thread is concerned with the development of a new UD treebank of Polish, one that makes extensive use of the enhanced level of representation made available in the current UD standard. The treebank is the result of conversion from an earlier ‘treebank’ of Polish, one that was annotated with constituency and functional structures as they are understood in Lexical Functional Grammar. We will outline the conversion procedure and present the resulting UD treebank of Polish. The second thread is concerned with various inconsistencies and deficiencies of UD that we identified in the process of developing the UD treebank of Polish. We will concentrate on two particularly problematic areas in UD, namely, on the core/oblique distinction, which aims to – but does not really – replace the infamous argument/adjunct dichotomy, and on coordination, a phenomenon problematic for all dependency approaches.

14 January 2019

Agata Savary (François Rabelais University Tours)

Literal occurrences of multiword expressions: quantitative and qualitative analyses

Multiword expressions (MWEs) such as “to pull strings” (to use one's influence), “to take part” or “to do in” (to kill) are word combinations which exhibit lexical, syntactic, and especially semantic idiosyncrasies. They pose special challenges to linguistic modeling and computational linguistics due to their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components, and from their syntactic structure, in a way deemed regular for the given language. Additionally, MWEs can have both idiomatic and literal occurrences. For instance “pulling strings” can be understood either as making use of one’s influence, or literally. Even if this phenomenon has been largely addressed in psycholinguistics, linguistics and natural language processing, the notion of a literal reading has rarely been formally defined or subject to quantitative analyses. I will propose a syntax-based definition of a literal reading of an MWE. I will also present the results of a quantitative and qualitative analysis of this phenomenon in Polish, as well as in 4 typologically distinct languages: Basque, German, Greek and Portuguese. This study, performed in a multilingual annotated corpus of the PARSEME network, shows that literal readings constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts.

21 January 2019

Marek Łaziński (University of Warsaw), Michał Woźniak (Jagiellonian University)

Aspect in dictionaries and corpora. What for and how aspect pairs should be tagged in corpora?

Corpora are generally tagged for grammatical categories, including verbal aspect value. They all choose between pf and ipf; some of them add a third value: bi-aspectual (not present in the National Corpus of Polish). However, no Slavic corpus tags the aspect value of a verb form in reference to an aspect partner. If we can mark aspect pairs in dictionaries, it should also be possible in corpora, provided that we extrapolate the aspect features of a lexeme to specific verb forms in specific uses. Retaining the existing morphological tagging including aspect value, two more aspect tags have been added: 1) morphological markers of aspect and 2) reference to a superlemma. The annotation of every verb form in the corpus thus has three parts: 1) the existing grammatical characteristics (TAKIPI), 2) the repeated or corrected aspect value (including bi-aspectual) and morphological markers, 3) reference to the aspect pair (superlemma). A corpus tagged for aspect pairs, even with alternative reference for every lexeme, opens new perspectives for research. The possibilities are especially rich in a parallel corpus with one Slavic and one aspectless language, such as the Mainz-Warsaw Corpus. In order to check the usefulness of our aspect pair tagging, a series of queries will be built that make it possible to compare grammatical profiles of suffixal and prefixal pf and ipf aspect partners.

11 February 2019

Anna Wróblewska (Applica / Warsaw University of Technology), Filip Graliński (Applica / Adam Mickiewicz University)

Text-based machine learning processes and their interpretability

How do we tackle text modeling challenges in business applications? We will present a prototype architecture for automation of processes in text-based work and a few use cases of machine learning models. The use cases concern emotion detection, abusive language recognition and more. We will also show our tool for explaining suspicious findings in datasets and in the models' behaviour.

28 February 2019

Jakub Dutkiewicz (Poznan University of Technology)

Empirical research on medical information retrieval

We discuss the results and evaluation procedures of the bioCADDIE 2016 challenge on searching precision medical data. Our good results are due to word embedding query expansion with appropriate weights. Information Retrieval (IR) evaluation is demanding because of the considerable effort required to judge over 10,000 documents. A simple sampling method was proposed over 10 years ago for estimating Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) despite incomplete judgments. For this method to work, the number of judged documents has to be relatively large. Such conditions were not fulfilled in the bioCADDIE 2016 challenge and TREC PM 2017 and 2018. The specificity of the bioCADDIE evaluation makes the post-challenge results incompatible with those judged during the contest. In bioCADDIE, for some questions there were no documents judged relevant. The results are strongly dependent on the cut-off rank. As a result, in the bioCADDIE challenge infAP is weakly correlated with infNDCG, and the error could be up to 0.15-0.20 in absolute value. We believe that the deviation of evaluation measures may override the primary role of the measure in such a case. We corroborate this claim by simulation of synthetic results. We propose a simulated environment with properties that mirror real systems. We implement a number of evaluation measures within the simulation and discuss the usefulness of the measures for a partially annotated collection of documents with regard to the collection size, the number of annotated documents and the proportion between the number of relevant and irrelevant documents. In particular we focus on the behavior of the aforementioned AP and NDCG and their inferred versions. Other studies suggest that infNDCG weakly correlates with other measures and therefore should not be selected as the most important measure.
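
For readers unfamiliar with the two measures, the sketch below computes standard Average Precision and NDCG for a single ranked list under complete judgments. It does not implement the inferred versions (infAP, infNDCG) discussed in the talk; the document identifiers, gain values and function names are invented for illustration.

```python
import math

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_ids, gain_by_id, k=None):
    """DCG of the ranking divided by the DCG of an ideal ranking, cut off at rank k."""
    gains = [gain_by_id.get(doc, 0) for doc in ranked_ids][:k]
    ideal = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy ranking and judgments, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4"]
gains = {"d1": 2, "d3": 1}          # graded relevance; unlisted documents count as 0
print(average_precision(ranking, set(gains)))
print(ndcg(ranking, gains), ndcg(ranking, gains, k=1))
```

In this toy case NDCG changes when the cut-off moves from the full list to rank 1, which gives a small-scale impression of the cut-off dependence mentioned in the abstract.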

21 March 2019

Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences)

Size optimisation of language models

During the seminar, the results of work on reducing the size of language models will be discussed. The author will review the literature on the size reduction of recurrent neural networks (in terms of language models). Then, the author's own implementations will be presented along with evaluation results on different Polish and English corpora.

25 March 2019

Łukasz Dębowski (Institute of Computer Science, Polish Academy of Sciences)

GPT-2 – Some remarks of an observer

GPT-2 is the latest neural statistical language model by the OpenAI team. A statistical language model is a distribution of probabilities on texts that can be used for automatic text generation. In essence, GPT-2 turned out to be a surprisingly good generator of semantically coherent texts of the length of several paragraphs, pushing the boundaries of what has seemed possible technically so far. Anticipating the use of GPT-2 to generate fake news, the OpenAI team decided to publish only a ten times reduced version of the model. In my talk, I will share some remarks about GPT-2.

8 April 2019

Agnieszka Wołk (Polish-Japanese Academy of Information Technology and Institute of Literary Research, Polish Academy of Sciences)

Language collocations in quantitative research

This presentation aims to reduce the enormous effort required to analyze phraseological writing competence by developing an automatic evaluation tool for texts. An attempt is made to measure both second language (L2) writing proficiency and text quality. We use the CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (n-grams) and calculates the t-score and related information. We used the Level 3 Corpus of Contemporary American English as a reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers. We also present how to use it for early depression detection and stylometry.
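
The t-score mentioned in the abstract can be illustrated with the usual corpus-linguistics formula t = (O − E) / √O, where O is the observed bigram frequency and E = f(w1)·f(w2)/N is the frequency expected if the two words were independent. The sketch below is a minimal version under that assumption; it is not the CollGram implementation, and the toy token list stands in for a real reference corpus.

```python
import math
from collections import Counter

def bigram_t_scores(tokens):
    """Return a t-score for every adjacent token pair in the reference tokens."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected frequency if w1 and w2 occurred independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed)
    return scores

# Toy reference "corpus", invented for illustration only.
reference = "strong tea and strong tea and weak tea".split()
scores = bigram_t_scores(reference)
for pair, t in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(t, 3))
```

Pairs that co-occur more often than chance predicts receive higher scores, so a learner text can be profiled by looking up the scores of its own bigrams in the reference table.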

15 April 2019

Alina Wróblewska, Piotr Rybak (Institute of Computer Science, Polish Academy of Sciences)

Dependency parsing of Polish

Dependency parsing is a crucial issue in various NLP tasks. The predicate-argument structure transparently encoded in dependency-based syntactic representations may support machine translation, question answering, sentiment analysis, etc. In the talk, we will present PDB – the largest dependency treebank for Polish, and COMBO – a language-independent neural system for part-of-speech tagging, morphological analysis, lemmatisation and dependency parsing.

13 May 2019

Piotr Niewiński, Maria Pszona, Alessandro Seganti, Helena Sobol (Samsung R&D Poland), Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Samsung R&D Poland in SemEval 2019 competition

The talk presents Samsung R&D Poland solutions that participated in the SemEval 2019 competition. Both ranked second in two different tasks of the competition.

1. Fact Checking in Community Question Answering Forums

We present our submission to SemEval 2019 Task 8 on Fact-Checking in Community Forums. The aim was to classify questions from the QatarLiving forum as OPINION, FACTUAL or SOCIALIZING. We will present our primary solution: Deeply Regularized Residual Neural Network (DRR NN) with Universal Sentence Encoder embeddings, which was ranked second in the official evaluation phase. Moreover, we will compare this solution with two contrastive models based on ensemble methods.

2. Linguistically enhanced deep learning offensive sentence classifier

How should offensive content be defined? What is a bad word? In our presentation we will discuss the problem of recognizing what is offensive and what is not in social media (Twitter etc.). Furthermore, we will present the system that we implemented to participate in SemEval 2019 Task 5 and Task 6 (where we took 2nd place in Task 6 Subtask C) and compare our results to other state-of-the-art approaches. We will show that our approach outperformed other models by adding linguistically based observations to the model features.

27 May 2019

Magdalena Zawisławska (University of Warsaw)

Synamet — Polish Corpus of Synesthetic Metaphors

The aim of the paper is to discuss the procedure of the identification of synesthetic metaphors and the annotation of metaphoric units (MUs) in the Synamet corpus, which was created within the framework of the NCN grant (UMO-2014/15/B/HS2/00182). The theoretical basis for the description of metaphors was the Conceptual Metaphor Theory (CMT) by Lakoff and Johnson combined with Fillmore's frame semantics. Lakoff and Johnson define a metaphor as a conceptual mapping from the source domain to the target domain, e.g. LOVE IS A JOURNEY. Because the concept of a domain is unclear, it has been replaced by a frame which, unlike a conceptual domain, links the semantic and linguistic levels (frames are activated by lexical units). The synesthetic metaphor in a narrower sense is defined as a mapping from one perceptual modality to a different perceptual modality, e.g. a bright sound (VISION → HEARING), and in a broader sense, it is defined as a description of non-perceptual phenomena with expressions referring primarily to sensory perceptions, e.g. rough character (TOUCH → PERSON). The Synamet project uses an even wider definition of synesthetic metaphor as any expression in which two different frames are activated and one of them is perceptual. Texts in the Synamet corpus come from blogs devoted to perfumes, wine, beer, music, or coffee, in which, due to the topics, the chance of finding synesthetic metaphors was the greatest. The paper presents the basic statistics of the corpus and atypical metaphorical units that required modification of the annotation procedure.

See the talks given between 2000 and 2015 and the current schedule.

Revision 102 as of 2019-08-30 11:17:39 (size: 98036, editor: MaciejOgrodniczuk) → Revision 104 as of 2019-09-19 14:11:01 (size: 98036, editor: MaciejOgrodniczuk)
No differences found!