LUNA
Project factsheet
English name: Spoken Language UNderstanding in multilinguAl communication systems
Polish name: Rozumienie mowy w wielojęzycznych systemach komunikacji
Project type: A European (IST) Specific Targeted Research Project (contract number 033549)
Duration: 4 September 2006 ‒ 3 September 2009
Extended information: http://cordis.europa.eu/fetch?CALLER=PROJ_IST&ACTION=D&DOC=3&CAT=PROJ&QUERY=1197471353861&RCN=79467
Project Web page:
Principal investigator: Renato de Mori
Polish partners involved
Polish Japanese Institute of Information Technology (PI: Krzysztof Marasek)
Institute of Computer Science, Polish Academy of Sciences (PI: Agnieszka Mykowiecka)
Project type: Ministry of Science and Higher Education support for Polish participation in the project
Duration: 1 March 2008 ‒ 1 September 2009
Principal investigator: Małgorzata Marciniak
Institution:
Project description
The main objective of LUNA is to create a robust spoken language understanding (SLU) toolkit for multilingual dialogue services, able to carry out human-computer communication with a good degree of user satisfaction. The vision of LUNA is to improve current automated telephone systems by allowing easy human-machine interaction through spontaneous and unconstrained speech, replacing menu-driven voice recognition. The project aims to enhance the users’ experience by helping callers use voice services quickly and accurately.
From a technological point of view, the objectives of LUNA are to propose new methods, algorithms and tools for the fast development of robust SLU components for multilingual telephone services. To this end, LUNA will address a set of challenging scientific problems, focusing on five scientific objectives:
- Language Modelling for Speech Understanding;
- Semantic Modelling for Speech Understanding;
- Automatic Learning (including Active and On-Line Learning);
- Robustness issues for SLU;
- Multilingual portability of SLU components.
In particular, three steps will be considered for the SLU interpretation process: generation of semantic concept tags, semantic composition and context-sensitive validation. A protocol for semantic annotation using the same formalism as FrameNet has been established and an annotation manual has been written. It is used for annotating corpora in French, Italian and Polish.
New corpora of complex human-human dialogs have been acquired in Italian and Polish. They are transcribed and annotated in terms of semantic constituents and semantic structures. The French corpus MEDIA has been annotated in terms of semantic structures in which previously annotated semantic constituents fill the structure roles. Language-specific aspects of language modelling and understanding are investigated, especially for Polish.
Experiments with probabilistic semantic composition, the use of dialog constraints for SLU, the use of new confidence indicators, and the separation of in-domain and out-of-domain portions of a spoken message have been performed on French telephone corpora.
Different techniques for generating semantic hypotheses have been implemented and tested, using transducers, classifiers and machine translation techniques.
Kernel methods have been introduced for hypothesizing predicate/argument structures from parse trees. Other methods for hypothesizing instances of frame structures and for performing inference on them have been conceived.
Active learning has been investigated and applied to corpora in French.
The MEDIA corpus has been used to learn probabilities for the statistical modelling of dialogs, with the aim of understanding user intentions.
Reviews of the state of the art in SLU have been prepared and presented, together with the LUNA project, at workshops and conferences.
LUNA's research results will be validated on different application scenarios, targeted to dialogue-based telephone services of different complexity (e.g. from call routing with utterance classification to dialogue systems with complex semantic domains). The SLU models will be trained and applied to different multilingual spoken dialog systems in French, Italian and Polish. The language-independent components will be shared among the participants and then adapted to each particular language by means of language resources already available or collected within the project.
The highly qualified academic presence in the consortium ensures scientific excellence and credibility in carrying out this leading-edge research activity, while the project results will immediately become a competitive advantage for industrial partners who will be able to exploit them directly, introducing them to the speech technologies market.
Available resources
A description of the corpus annotation is available in the book „Anotowany korpus dialogów telefonicznych” (An annotated corpus of telephone dialogues, in Polish). A DVD containing the described corpora is attached to the book. A copy of the data is available under the 2-clause BSD licence:
Corpus editor (760 kB)
Corpus ontology (60 kB)
The data has been converted into TEI P5 format within the CESAR project:
LUNA.PL (977 MB)
LUNA-WOZ.PL (143 MB)
ODD files used to create RNG schemas for LUNA TEI P5 annotation (see http://www.tei-c.org/Support/Learn/odds.xml for more information)
Another version of the corpus, with MLF transcription, has been created by Aleksandra Wyszyńska at the AGH University of Science and Technology and is described in her BSc thesis (in Polish). The MLF data is available under a commercial licence; please contact Bartosz Ziółko.