Jornada de Seguimiento de Proyectos, 2010
Programa Nacional de Tecnologías Informáticas

BRAVO: Búsqueda de Respuestas Avanzada Multimodal y Multilingüe
TIN2007-67407-C03

Paloma Martínez Fernández* Universidad Carlos III de Madrid
José Miguel Goñi Menoyo** Universidad Politécnica de Madrid
Antonio Moreno Sandoval*** Universidad Autónoma de Madrid

Abstract

The project aims to create a multimodal (text and voice) and multilingual question answering platform that integrates the modules developed by the participating groups. The starting hypothesis is that the question answering task of current systems can be improved by working on the modules that make up the architecture of such a system: in particular, the multilingual IR modules, indexing, the speed of information access, the extraction and ranking of answers, and question analysis. We deal with web information, encyclopaedic resources, scientific documents and news. The work of linguists is therefore essential to develop and/or adapt appropriate resources, as well as to integrate lexical and software resources. We also aim to apply these techniques and this methodology to other areas, such as ontologies and information retrieval, named entities and voice interaction, investigating ways of adapting these tasks to new domains and languages.

Keywords: question answering, information retrieval, linguistic resources

1 Project Goals and resources

The aim of this project is to develop a platform for question answering on multimedia contents. This environment will allow the analysis of available techniques and methods in multilingual Information Retrieval (IR), in question answering (QA), in information and ontology extraction, and in automatic speech recognition for spontaneous speech. Moreover, the project places particular emphasis on Spanish, both as the query language and as the language of the document collections.
This objective also involves applying new techniques and enhancing current ones by defining and evaluating hybrid techniques. The scope of the project is not limited to textual objects: it extends to multimedia objects, which will be described using particular cases of the document representations used for textual objects. Partial objectives are:

• To create a multimodal (text and voice) and multilingual QA platform to access multimedia contents.
• To integrate in this platform the components for the different on-line and off-line phases of a QA system, advancing the state of the art in this field (information retrieval, answer extraction and ranking, and question analysis).
• To define, implement and evaluate the updates needed in an IR subsystem to integrate it in the QA platform. In particular, units smaller than documents (sentences, paragraphs, etc.) will be handled in order to locate the required information, together with the intelligent treatment of entities (named entity recognition) and the integration of lexical and semantic knowledge in query expansion.
• To evaluate the platform in international forums, mainly CLEF and TAC.
• To develop linguistic resources for Arabic and Japanese.
• To integrate linguistic resources that allow better processing of spontaneous speech, in order to adapt a speech recognizer to user queries.
• To design data models in specific domains in order to build ontologies using semiautomatic techniques.

To achieve these goals the project was assigned 8 EDP for the LABDA-UC3M subproject, 5.5 for the GSI-UPM subproject and 9 for the LLI-UAM subproject. The planning of each subproject is shown in Annex I.
2 Level of achievement

BRAVO has three subprojects. The first, BRAVO-BR (LABDA-UC3M), carries out the platform development tasks and the components related to information extraction, question analysis with temporal expressions, and speech recognition adapted to QA systems. The second, BRAVO-RI (GSI-UPM), is in charge of the multimedia information retrieval components (storage and access optimization, paragraph retrieval, integration of resources) as well as domain semantics modelled using metadata and ontologies for restricted application domains. The third, BRAVO-RL (LLI-UAM), provides the multilingual resources needed to extend the platform to non-Western languages. In particular, it develops resources and linguistic tools for Arabic and Japanese, as well as a speech corpus for processing Spanish questions. Figure 1 shows the participation of each team in the project.

2.1 Question Answering and Information Extraction

As the main result, we have developed a software platform that includes a QA system in which the original modules have been improved; a language-independent named entity recognizer, SPINDEL, that applies machine learning based on bootstrapping techniques [2, 3, 4]; and a temporal expression recognizer [20, 21, 22]. An evaluation module has been added with a twofold functionality. First, it tests the QA systems in different domains. Second, for voice input, a software tool, RET (Recognition Evaluation Tool), has been developed to test the output of commercial ASR systems. RET has been used in three scenarios: the queries to the QA system, the automatic transcription of audio from video files [27], [44], and a real-time captioning system used in the classroom for deaf students [9].
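The report does not say which measure RET uses to score recognizer output. A common choice for comparing an ASR transcript with a reference transcript is word error rate (WER), the word-level edit distance divided by the reference length. The following Python sketch illustrates that measure under this assumption; the function name and the whitespace tokenization are hypothetical and are not taken from RET.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: tokens are lower-cased, whitespace-separated words.
    This is an illustrative measure, not RET's documented behaviour.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word ("el") is dropped by the recognizer.
    print(word_error_rate("quién escribió el quijote", "quién escribió quijote"))
```

A lower value means the recognizer output is closer to the reference; the value can exceed 1.0 when the hypothesis contains many inserted words.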
Regarding the work on building our own ASR system (activities 1.5.2 to 1.5.5, see planning in Annex I), we have decided to work exclusively with commercial ASR systems (Via Voice and Dragon) in activities 1.5.1 and 1.5.6, because only one technical engineer has been recruited and he is working on integrating the modules developed in the project into the QA platform.

These developments have been tested in three domains: news collections (EFE), Wikipedia and scientific documentation from Medline (biomedical texts). In the biomedical domain, a prototype has been developed for drug name recognition and for the extraction of drug-drug interactions from the medical literature, using UMLS, dictionaries and the USAN rules for naming drugs. As a result, a corpus is available that was automatically annotated by the DrugNer system with generic drug names and other biomedical concepts and manually evaluated by a pharmacological expert. The corpus consists of 849 abstracts downloaded from PubMed and is available online. The system combines information obtained by the UMLS MetaMap Transfer (MMTx) program with the nomenclature rules recommended by the World Health Organization (WHO) International Nonproprietary Names (INNs) Program to identify and classify pharmaceutical substances [13, 14, 15, 16, 17, 18].

These techniques have been evaluated in several forums: CLEF 2008 and 2009 (track on Multiple Language Question Answering, QA@CLEF) [5, 6, 7], the Second Web People Search Evaluation [3, 26] and the Text Analysis Conference (TAC 2009) [2].

2.2 Multimedia Information Retrieval

During the first two years of the project, the GSI-UPM team has been working on the development and continuous improvement of the IDRA tool, as well as testing this and other available tools (some developed previously in the research group) by participating in international competitions on Information Retrieval and related disciplines.
Although several tools and indexing systems were available, the decision was taken to develop a new one. The new tool had to be open to different formats and provide functionality for evaluating new techniques related to multimedia information retrieval (in particular, text and image annotations). A prerequisite was that the parameters used to compute relevance and similarity among documents, news items, technical reports, or even simple image annotations could be easily changed for experiments. In addition to basic functionality, IDRA also offers advanced functions for managing and storing contents efficiently. Its design is flexible and it is very well documented, in order to facilitate its future enhancement.

Regarding the development of the IDRA tool [24, 39, 40], the first tasks were the review and adaptation of previously existing resources. Since then, the key points are:

(a) It is fully implemented in Java, using the most appropriate data structures for the management of indexes. As a result, indexing capacity is higher and response time for index queries is lower.
(b) Its interface offers new functionality: more text formats can be indexed; LUCENE is integrated for comparing results; the data and data structures stored after indexing can be viewed, browsed and managed; and results can be analysed using different evaluation metrics.
(c) The IDRA tool is distributed under a GPL 3.0 licence.

A set of activities related to sentiment analysis has been initiated [28]. A full review of the available resources, taking multilinguality into account, was completed, together with a comparative evaluation. Among them, SentiWordNet, WordNet-Affect, VerbNet and ConceptNet were analysed. Unfortunately, it was not possible to participate in the "SemEval Task on Affective Text" competition, which would have allowed us a more complete analysis.
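As an illustration of the prerequisite stated earlier in this section, that the parameters used to compute relevance be easy to change between experiments, the following minimal Python sketch keeps the ranking parameters in a single object that is passed to the scoring function. It is not IDRA code (IDRA is written in Java) and the report does not name IDRA's ranking function; Okapi BM25 is used here only as a well-known example, and the names RelevanceParams and bm25_scores are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RelevanceParams:
    """Tunable ranking parameters (hypothetical container, BM25 defaults)."""
    k1: float = 1.2   # term-frequency saturation
    b: float = 0.75   # strength of document-length normalization


def bm25_scores(query: str, docs: list[str],
                params: RelevanceParams = RelevanceParams()) -> list[float]:
    """Score every document against the query with Okapi BM25.

    Assumption: documents and queries are lower-cased and split on whitespace.
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - params.b + params.b * len(doc) / avg_len
            score += idf * tf * (params.k1 + 1) / (tf + params.k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    collection = [
        "medical image retrieval",
        "news about football",
        "image annotation for medical image collections",
    ]
    # Two experimental runs that differ only in the parameter object.
    print(bm25_scores("medical image", collection))
    print(bm25_scores("medical image", collection, RelevanceParams(k1=2.0, b=0.0)))
```

Keeping the parameters in one immutable object means an experiment is fully described by that object, so runs can be compared without touching the scoring code.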
LABDA-UC3M is working on a methodological approach that applies metadata of multimedia contents to improve web accessibility [7, 8, 10, 11].

Regarding evaluation activities, in 2008 the team participated in NTCIR-7, an international competition on information retrieval issues for Asian languages (as well as English). The task we took part in was multilingual sentiment analysis for Asian languages and English, for which we submitted a few experiments. We participated in the 2008 and 2009 CLEF editions, continuing our uninterrupted participation since the 2003 edition, and submitted several experiments in different tracks. In particular, the tasks for CLEF 2008 were ImageCLEFphoto, ImageCLEFmed, ImageCLEF Medical Image Annotation and VideoCLEF. In the CLEF 2009 edition, the tasks we participated in were ImageCLEFphoto and ImageCLEFmed [25, 29, 30, 31, 32, 33, 34, 35]. In the ImageCLEFmed tasks we tried to improve the retrieval of medical images from multilingually annotated, heterogeneous collections using semantic expansion techniques [36, 37, 42]. In ImageCLEFphoto the IDRA tool was used in some experiments (integrating text retrieval data and content-based image retrieval data), and in others clustering techniques were tried for ordering the results obtained by the queries.

2.3 Linguistic Resources

The main tasks of the LLI-UAM in BRAVO are:

(a) Creation of new multilingual resources in Arabic, Spanish and Japanese [52, 54].
(b) Design and annotation of a Spanish speech corpus of questions [9].
(c) Definition of a model for question classification.
(d) Addition of linguistic resources to improve the handling of spontaneous speech, in order to adapt a speech recognizer to the formulation of questions.

Of those goals, (a) is the most important and time-consuming effort for the subproject. In this task, the LLI-UAM has worked alone, without coordination with the other two subprojects.
On the other hand, for the last three tasks, LLI-UAM has worked in close collaboration with the LABDA-UC3M team, as those linguistic resources are basic for the training of the QA system [5, 6, 20].

As for the language resources (LR) developed, this is a list of current work:

• Improvement of a Spanish PoS tagger and phonological transcriber
• Development of an Arabic PoS tagger
• Development of a child corpus of Spanish [50]
• Development of an acoustic database of questions for Spanish and Arabic
• Development of a spontaneous speech corpus of Japanese
• Development of a basic audio lexicon of Japanese for didactic purposes

The most outstanding resources, in terms of innovation, are those devoted to Arabic (the tagger and the acoustic database), since few groups in the world are working on Arabic NLP and LR.

3 BRAVO mid-term results

3.1 Personnel in training

With respect to the training of human resources, the following doctoral dissertations have been completed or are in progress:

• César de Pablo Sánchez, “Semisupervised learning of patterns for answer extraction in QA systems” (July 2010), European mention, LABDA-UC3M.
• Isabel Segura Bedmar, "Application of information extraction techniques to pharmacological domain: extracting drug-drug interactions" (April 2010), European mention, LABDA-UC3M.
• Lourdes Moreno, “AWA, a methodological Framework specific of accessibility to develop web applications” (March 2010), LABDA-UC3M.
• Marta Garrote Salazar, “CHIEDE: corpus de habla infantil espontánea del español”, 2008, LLI-UAM.
• Ana González Ledesma, “Los marcadores del discurso en el corpus C-ORAL-ROM: anotación pragmática, estrategias computacionales de etiquetado y aplicaciones a otros campos”, 2010, LLI-UAM.
• Julio Villena, “Hybrid Models for Information Retrieval”, GSI-UPM, in progress.
• Sara Lana, “Cognitive models of feedback for Information Retrieval”, GSI-UPM, in progress.
• Mª Teresa Vicente Díez, “Reconocimiento expresiones temporales en castellano y su aplicación a la extracción de información”, LABDA-UC3M, in progress.

In addition, two new trainee researchers have joined the LLI-UAM team: Alicia González (FPU grant) and Leonardo Campillos (predoctoral contract funded by the Madrid Regional Government). Ten undergraduate students have also carried out their master's theses on topics within the project's research.

3.2 Coordination

Coordination among the three subprojects is reflected in the evaluation of the platform at the international CLEF forum as the MIRACLE team, which includes the three research teams plus the company DAEDALUS (EPO in the project proposal). CLEF participation (as shown in the publications section) has taken place in the “Multilingual Question Answering (QA@CLEF)” and “Cross-Language Image Retrieval (ImageCLEF)” tracks. Moreover, the three research groups belong to the MAVIR consortium (a network of excellence funded by the Madrid Regional Government), within which they have actively participated in several workshops, conferences and other projects. LABDA-UC3M organized the Spanish Conference on Natural Language Processing (SEPLN 2008), where researchers from the three groups took part in sessions on language technologies. LLI-UAM organized the VI Congreso Nacional de Lingüística General, with a session on multilingual natural language processing that included GSI-UPM researchers.

The joint research between the LLI-UAM and GSI-UPM teams goes back more than 15 years. Both groups have participated in several co-ordinated projects, as well as in joint publications and software development. The relation with LABDA-UC3M is more recent, but has been very intensive over the last five years: both teams participate jointly in BRAVO and in MAVIR. The UAM and the UC3M have exchanged researchers (Dr. Doaa Samy and Dr. Marta Garrote) for a few months, with excellent results in terms of scientific production.
The three groups in the project are again submitting a co-ordinated proposal to the next R&D call, as a final proof of the satisfactory research experience.

3.3 Collaboration with other national and international research groups

LLI-UAM has strengthened its relations with international groups closely related to the research interests of the project and to the previous connections of the LLI members:

Cairo University (CU): Dr. Samy is an Associate Professor of Spanish and Computational Linguistics. In 2009, with a grant from the AECID, a Spanish-Egyptian Workshop on NLP and LR for Spanish and Arabic was held in Cairo, co-organized with Dr. Moreno. The most important result of this international cooperation has been the signing of a research agreement between UAM and CU, promoted by Moreno and Samy respectively.

Tokyo University of Foreign Studies (TUFS): there is already a student/teacher exchange agreement between UAM and TUFS, with Kimura as the person responsible at UAM. In 2009-10 a UAM-Banco Santander grant for research with Asian institutions was received. During this period several visits to Tokyo have been scheduled to record spontaneous speech for our corpus of spoken Japanese.

Language Technology Lab at DFKI, Saarbrücken: Dr. Alcántara has been a post-doc visiting researcher for two years, working on different projects related to multimodal processing. The relation will be maintained in the future with the participation of T. Declerck in the next project proposal, as an external member of the LLI.

LABDA-UC3M: In the last two years (2008 and 2009) a considerable effort has been made to promote mobility, with the aim of exchanging knowledge with other relevant national and international research groups. The researchers taking part in this project have completed several stays: in 2008 Lourdes Moreno spent three months at DSIC at UPV under the supervision of Dr. Oscar Pastor, and Isabel Segura stayed in the Natural Language Engineering group under the supervision of Dr. Paolo Rosso; in 2009 César de Pablo and Isabel Segura spent 6 months at DFKI, Saarbrücken (Germany), with Thierry Declerck.

Finally, José Luis Martínez is finishing his PhD, entitled "Incorporating semantics in a software process development through Business Rules” (April 2010), with José Carlos González and Paloma Martínez as supervisors.

3.4 Technology transfer

The BRAVO project is of great importance to the company DAEDALUS because of its interest in QA technology and its integration in voice user interfaces. For this reason, the company has developed, in collaboration with the teams, a web QA system that works on the Spanish Wikipedia and is available online, and it has funded grants for three undergraduate students from UC3M to do their master's theses on this demonstrator. This has enabled DAEDALUS to follow the advances in the state of the art in QA technologies, as well as the application of ASR technology to this kind of application. If the results of the BRAVO project make it viable to define a product that could be commercialised, an agreement could be signed among the authors.

GSI-UPM and LABDA-UC3M work as university partners of DAEDALUS in BUSCAMEDIA (Hacia una adaptación semántica de medios digitales multirred-multiterminal), a CENIT-E project (CEN-20091026, 2009-2012).

As a result of the research in QA, the system "SQUASH: A Question Answering System for Spanish” was jointly developed by researchers from the LABDA-UC3M and LLI-UAM teams. It is part of the Technology Portfolio, Technical Services and R&D Networks promoted by the Fundación para el conocimiento madrid+d in 2008. SQUASH is a modular question answering system for the Spanish language. It enhances traditional search engine functionality by providing precise answers in real time to questions in natural language.
The usefulness of the results for society is related to the impact of language technologies on the Information Society. Language resources, the main line of work of the LLI-UAM, provide data for inferring knowledge and for training NLP systems. The multimodal (audio and text) and multilingual nature of the resources compiled during the project is a clear sign of the innovation of the research.

Figure 1: Linguistic and software modules and participants in the BRAVO project

Management of subproject 3
3.2.1. Development of Arabic resources
3.2.2. Development of Spanish resources
3.2.3. Development of Japanese resources
3.3.1. Study of domain and design of recordings collec.
3.3.2. Collection of subcorpus of read speech
3.3.3. Collection of subcorpus of spontaneous speech
3.3.4. Annotation of the corpus of questions
3.3.5. Splitting the corpus: training and evaluation
3.4.1. Model to classify textual questions in Spanish
3.4.2. Model to classify textual questions in Arabic
3.4.3.