Natural Language Processing Seminar 2015–2016

12 October 2015

Vincent Ng (University of Texas at Dallas)

Beyond OntoNotes Coreference  The talk delivered in English.

Recent years have seen considerable progress on the notoriously difficult task of coreference resolution owing in part to the availability of coreference-annotated corpora such as MUC, ACE, and OntoNotes. Coreference, however, is more than MUC/ACE/OntoNotes coreference: it encompasses many interesting cases of anaphora that are not covered in the extensively investigated MUC/ACE/OntoNotes entity coreference task. This talk examined several comparatively less-studied coreference tasks that were arguably no less challenging than the MUC/ACE/OntoNotes entity coreference task, including the Winograd Schema Challenge, zero anaphora resolution, and event coreference resolution.

26 October 2015

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish  The talk delivered in Polish.

The author presented the parser being developed within the CLARIN-PL project, its morphological pre-processing, the categorial grammar of Polish used by the parser and integrated with a valency dictionary, and the semantic graph formalism used for meaning representation. He also discussed the algorithms used by the parser and optimization strategies relating both to performance and to the concise representation of ambiguous syntactic and semantic parse trees.

16 November 2015

Izabela Gatkowska (Jagiellonian University in Kraków)

The Empirical Network of Lexical Links  The talk delivered in Polish.

The empirical network of lexical links is the result of an experiment that uses the human associative mechanism: after understanding a stimulus word, the participant says the first word that comes to mind. The study was conducted cyclically, i.e. response words obtained in the first cycle were used as stimuli in the second cycle, which made it possible to build a semantic network. This network differs both from networks derived from text corpora, e.g. WORTSCHATZ, and from networks constructed by hand, e.g. WordNet. The empirically obtained connections between words in the network have a direction and a strength. The set of incoming and outgoing connections in which a given word occurs forms a lexical node of the network (a subnet). The way the network characterizes meaning is illustrated with feedback connections, a specific kind of dependency between two words within a lexical node. A qualitative analysis using the lexical semantic relations known in linguistics, and employed for example in the WordNet dictionary, permits an interpretation of only approximately 25% of the feedback links. The remaining links may be interpreted by referring to the model of meaning description proposed in the FrameNet dictionary. A qualitative interpretation of all the links found in a lexical node may permit a comparative study of lexical network nodes constructed experimentally for different natural languages, and may also allow a separation of the empirical semantic models employed by the same set of links found between nodes in a given network.
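The structure described in this abstract can be illustrated with a minimal Python sketch. It assumes that the strength of a link is simply the number of participants who gave that response, and that a "feedback connection" is a pair of words linked in both directions; neither assumption is confirmed by the abstract. The function names and the English example words are hypothetical.

```python
from collections import Counter


def build_network(responses):
    """Build a directed, weighted network from (stimulus, response) pairs.

    Assumption: link strength = how many times the response was given.
    """
    return Counter(responses)


def lexical_node(network, word):
    """Return the incoming and outgoing links of one word (its subnet)."""
    incoming = {a: s for (a, b), s in network.items() if b == word}
    outgoing = {b: s for (a, b), s in network.items() if a == word}
    return incoming, outgoing


def feedback_links(network):
    """Return word pairs linked in both directions, with both strengths."""
    links = {}
    for (a, b), strength in network.items():
        # a < b keeps each pair once and skips self-links.
        if a < b and (b, a) in network:
            links[(a, b)] = (strength, network[(b, a)])
    return links


# Hypothetical answers: cycle-one responses reused as cycle-two stimuli.
responses = [
    ("home", "mother"), ("home", "mother"), ("home", "roof"),
    ("mother", "home"), ("mother", "child"),
]
network = build_network(responses)
print(lexical_node(network, "home"))
print(feedback_links(network))
```

Here "home" and "mother" form a feedback connection, because each was given as a response to the other, while "roof" and "child" only receive links.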

30 November 2015

Dora Montagna (Universidad Autónoma de Madrid)

Semantic representation of a polysemous verb in Spanish  The talk delivered in English.

The author presented a theoretical model of representation of meaning, based on Pustejovsky's theory of the Generative Lexicon. The proposal is intended as a base for automatic disambiguation, but also as a new model of lexicographic description. The model will be applied to a highly productive verb in Spanish, assuming the hypothesis of verbal underspecification in order to establish patterns of semantic behaviors.

7 December 2015

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences), Witold Kieraś (University of Warsaw)

Morphosyntactic tagging of Polish – state of the art and future perspectives  The talk delivered in Polish.

The presentation discussed the state of the art in automatic morphosyntactic tagging of Polish text, with a particular focus on the performance of publicly available tools that can be used in real applications. A qualitative and quantitative analysis of the errors made by the taggers was conducted, along with a discussion of the possible causes of these problems and their solutions. Tagging results for Polish were compared and contrasted with the results for other European languages.

8 December 2015

Salvador Pons Bordería (Universitat de València)

Discourse Markers from a pragmatic perspective: The role of discourse units in defining functions  The talk delivered in English.

One of the most disregarded aspects in the description of discourse markers is position. Notions such as "initial position" or "final position" are meaningless unless it can be specified with regard to what a DM is "initial" or "final". The presentation defended the idea that, for this question to be answered, appeal must be made to the notion of "discourse unit". Provided with a set of a) discourse units, and b) discourse positions, determining the function of a given DM is quasi-automatic.

11 January 2016

Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Terminology extraction from Polish data – program TermoPL  The talk delivered in Polish.

The presentation addressed the problems of terminology extraction from Polish domain corpora. The authors described the C-value method, which ranks term candidates based on a frequency measure and the number of term contexts. The method takes into account nested terms that may not appear by themselves in the data. This method yields several nested grammatical subphrases that are syntactically correct but semantically odd, such as 'USG jamy' ('USG of cavity'). The recognition of nested terms is supported by word connection strength, which makes it possible to eliminate truncated phrases from the top part of the term list. The talk concluded with a demo of the TermoPL tool.
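The ranking idea can be sketched in Python using the commonly published C-value formula: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the candidate's length. This is only an illustration; whether TermoPL uses exactly this variant is not stated in the abstract. Note that the plain formula gives one-word candidates a score of zero. The function name and the example frequencies are hypothetical.

```python
import math
from collections import defaultdict


def c_value(freq):
    """Score term candidates with the standard C-value formula.

    freq maps each candidate (a tuple of words) to its corpus frequency.
    """
    # For every candidate, collect the longer candidates that contain it.
    containers = defaultdict(set)
    for longer in freq:
        n = len(longer)
        for size in range(1, n):
            for start in range(n - size + 1):
                sub = longer[start:start + size]
                if sub in freq:
                    containers[sub].add(longer)

    scores = {}
    for term, f in freq.items():
        nests = containers.get(term, set())
        if nests:
            # Subtract the mean frequency of the containing candidates.
            f = f - sum(freq[t] for t in nests) / len(nests)
        scores[term] = math.log2(len(term)) * f
    return scores


# Hypothetical candidate frequencies.
freq = {
    ("soft", "contact", "lens"): 5,
    ("contact", "lens"): 8,
    ("lens",): 20,
}
scores = c_value(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

In this toy data "contact lens" occurs 8 times, 5 of them inside "soft contact lens", so only 3 occurrences count towards its score.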

25 January 2016

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish: integration with lexical resources, parsing  The talk delivered in Polish.

During the lecture the author presented the integration of the syntactic-semantic parser with SGJP, Polimorf, Słowosieć and Walenty, as well as preliminary observations concerning the impact that checking semantic preferences has on parsing. He also described the categorial formalism used for parsing and briefly presented how the parser works.

22 February 2016

Witold Dyrka (Wrocław University of Technology)

Language(s) of proteins? – premises, contributions and perspectives  The talk delivered in Polish.

In his talk the author presented arguments in favour of treating protein sequences, or higher protein structures, as sentences in some language(s). He then showed several interesting results (his own and others') of applying quantitative methods of text analysis and formal linguistic tools (such as probabilistic context-free grammars) to the analysis of proteins. Finally, he presented plans for his further work on "protein linguistics", which, he hoped, would inspire an interesting discussion.

22 February 2016

Linguistic Engineering Group (Institute of Computer Science, Polish Academy of Sciences)

Extended seminar  Series of short lectures in Polish presenting Linguistic Engineering Group research topics.

12:00–12:15: People, projects, tools

12:15–12:45: Morfeusz 2: analyzer and inflectional synthesizer for Polish

12:45–13:15: Toposław: Creating MWU lexicons

13:15–13:45: Lunch break

13:45–14:15: TermoPL: Terminology extraction from Polish data

14:15–14:45: Walenty: Valency dictionary of Polish

14:45–15:15: POLFIE: LFG grammar for Polish

7 March 2016

Zbigniew Bronk (Grammatical Dictionary of Polish team member)

JOD – a markup language for Polish declension  The talk delivered in Polish.

JOD, a markup language for Polish declension, was constructed in order to describe precisely the inflectional rules and schemes for nouns and adjectives in Polish. Its first application was the description of the inflection of surnames, taking into account the sex of the person or persons using the given surname. This model has been the basis for the "Automaton of declension of Polish surnames". The author presented the general idea of the language and the implementation of its interpreter, as well as the JOD editor and the website "Automaton of declension of Polish surnames".

21 March 2016

Bartosz Zaborowski, Aleksander Zabłocki (Institute of Computer Science, Polish Academy of Sciences)

Poliqarp2 on the home straight  The talk delivered in Polish.

In this talk the authors present Poliqarp 2, a linguistic data search engine on which they have been working for the last three years. They describe both technical aspects and features of interest from the user's point of view. They briefly recall the data model supported by the engine, the structure of the language supported by the new query engine, its expressive power, and the differences compared to the previous version. In particular, they focus on elements added or modified during the development of the project (support for the Składnica and LFG data models, post-processing, syntactic sugar). On the technical side, they briefly present the software architecture and some details of the index implementation. They also describe nontrivial decisions related to input data processing (of the National Corpus of Polish in particular). They end the talk by presenting the results of preliminary efficiency measurements.

4 April 2016

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Identification of opinion targets in Polish  The talk delivered in Polish.

The seminar summarised the results of a National Science Centre (NCN) grant that ended in January 2016. It presented three resources with labelled sentiments and opinion targets developed within the project: a bank of dependency trees created from a corpus of product reviews, a subset of the Składnica dependency treebank, and a collection of tweets. The seminar included a discussion of experiments on automated recognition of opinion targets. These involved two parsing methods, dependency and shallow, and a hybrid method in which the results of syntactic analysis were used by statistical models (e.g. CRF).

21 April 2016 (Thursday)

Magdalena Derwojedowa (University of Warsaw)

“Tem lepiej, ale jest to interes miljonowy i traktujemy go poważnie” – A thousand words a thousand times in 5 parts  The talk delivered in Polish.

The talk presented the 1M corpus of the project „Automatic morphological analysis of Polish texts from the 1830-1918 period with respect to the evolution of inflection and spelling” (DEC-2012/07/B/HS2/00570): the structure of the corpus, its stylistic, temporal and regional diversity, and the inflectional characteristics of the resource in comparison with the features described in Bajerowa's works.

9 May 2016

Daniel Janus (Rebased.pl)

From unstructured data to searchable metadata-rich corpus: Skyscraper, P4, Smyrna  The talk delivered in Polish.

The presentation described tools facilitating the construction of custom datasets, in particular corpora of texts. The author presented Skyscraper, a library for scraping structured data from entire websites, and Smyrna, a concordancer for Polish texts enriched with metadata. In addition, a dataset built using these tools was presented: the Polish Parliamentary Proceedings Processor (PPPP, or P4), which includes, inter alia, a continuously updated corpus of speeches in the Polish parliament. The presentation largely focused on the technical solutions used in the tools shown.

19 May 2016 (Thursday)

Kamil Kędzia, Konrad Krulikowski (University of Warsaw)

Generating paraphrases' templates for Polish using parallel corpus  The talk delivered in Polish.

Software for generating paraphrases in Polish was prepared within the CLARIN-PL project. The developers will demonstrate how it works on selected examples. They will also explain the method of Ganitkevitch et al. (2013), which allowed its authors to create the openly available Paraphrase Database (PPDB). Furthermore, they will discuss their enhancements to the method and their approach to the challenges specific to the Polish language. Additionally, they will demonstrate a way of measuring the quality of paraphrases.

23 May 2016

Damir Ćavar (Indiana University)

The Free Linguistic Environment  The talk delivered in English.

The Free Linguistic Environment (FLE) started as a project to develop an open and free platform for white-box modeling and grammar engineering, i.e. the development of natural language morphology, prosody, syntax, and semantic processing components that are based, for example, on theoretical frameworks such as two-level morphology, Lexical Functional Grammar (LFG), Glue Semantics, and similar. FLE provides a platform that makes use of some classical algorithms and also of new approaches based on Weighted Finite State Transducer models to enable probabilistic modeling and parsing at all linguistic levels. Currently its focus is on providing a platform that is compatible with LFG and an extended version of it, one that we call Probabilistic Lexical Functional Grammar (PLFG). This probabilistic modeling can apply to the c(onstituent)-structure component, i.e. a Context Free Grammar (CFG) backbone can be extended to a Probabilistic Context Free Grammar (PCFG). Probabilities in PLFG can also be associated with structural representations and corresponding f(unctional feature)-structures or semantic properties, i.e. structural and functional properties and their relations can be modeled using weights that can represent probabilities or other forms of complex scores or metrics. In addition to these extensions of the LFG framework, FLE also provides an open platform for experimenting with algorithms for semantic processing or analyses based on (probabilistic) lexical analyses, c- and f-structures, or similar representations. Its architecture is extensible to cope with different frameworks, e.g. dependency grammar, optimality-theory-based approaches, and many more.
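The step from a CFG backbone to a PCFG can be shown with a minimal Python sketch: every rule receives a probability, the probabilities of rules sharing a left-hand side sum to one, and the probability of a parse tree is the product of the probabilities of the rules it uses. The toy grammar, the tree encoding and the function names below are hypothetical and unrelated to FLE's actual data structures.

```python
import math

# Toy PCFG: each left-hand side maps to {right-hand side: probability}.
RULES = {
    "S": {("NP", "VP"): 1.0},
    "NP": {("she",): 0.5, ("fish",): 0.5},
    "VP": {("V", "NP"): 0.75, ("V",): 0.25},
    "V": {("eats",): 1.0},
}


def is_normalized(rules):
    """Check that the rule probabilities for every left-hand side sum to 1."""
    return all(math.isclose(sum(rhs.values()), 1.0) for rhs in rules.values())


def tree_probability(tree, rules=RULES):
    """Multiply the probabilities of all rules used in a parse tree.

    A tree is a tuple (label, child, ...); leaves are plain strings.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[label][rhs]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child, rules)
    return p


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(RULES), tree_probability(tree))
```

The plain CFG only says whether this tree is licensed; the weights additionally rank it against competing analyses of the same sentence.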

6 June 2016

Karol Opara (Systems Research Institute of the Polish Academy of Sciences)

Grammatical rhymes in Polish poetry – a quantitative analysis  The talk delivered in Polish.

Polish is a highly inflected language, and parts of speech in the same morphological form have common endings. This allows one to easily find a multitude of rhyming words known as grammatical rhymes. Their overuse is strongly discouraged in the contemporary Polish literary canon due to their alleged banality. The talk presented the results of computer-aided investigations into poets’ technical mastery based on estimating the share of grammatical rhymes in their verses. A method of automatic rhyme detection was discussed, as well as the extraction of statistical information from texts and a new “literary” criterion for choosing the sample size for statistical tests. Finally, a ranking of the technical mastery of various Polish poets was presented.
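The central statistic, the share of grammatical rhymes, can be sketched in Python under a strong simplification: rhyming pairs are assumed to be already detected and morphologically tagged, and a rhyme counts as grammatical when both words carry the same tag. The abstract does not describe the actual detection method, so the function name, the tags and the example pairs are all hypothetical.

```python
def grammatical_rhyme_share(rhyme_pairs):
    """Return the fraction of rhyme pairs whose two words share a tag.

    rhyme_pairs is a list of ((word, tag), (word, tag)) tuples.
    Assumption: identical tags stand in for "same morphological form".
    """
    if not rhyme_pairs:
        raise ValueError("at least one rhyme pair is required")
    grammatical = sum(1 for (_, t1), (_, t2) in rhyme_pairs if t1 == t2)
    return grammatical / len(rhyme_pairs)


# Hypothetical, simplified tags for four rhyming pairs.
pairs = [
    (("kochał", "praet:sg:m"), ("szlochał", "praet:sg:m")),     # same form
    (("śpiewali", "praet:pl:m"), ("kochali", "praet:pl:m")),    # same form
    (("róże", "subst:pl:nom"), ("może", "fin:sg:ter")),         # different
    (("drogi", "adj:sg:nom"), ("nogi", "subst:pl:nom")),        # different
]
share = grammatical_rhyme_share(pairs)
print(f"{share:.0%} of the rhymes are grammatical")
```

A lower share would then correspond to a higher position in a ranking of the kind mentioned in the abstract.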
