Jornada de Seguimiento de Proyectos, 2010
Programa Nacional de Tecnologías Informáticas

BRAVO: Búsqueda de Respuestas Avanzada Multimodal y Multilingüe
TIN2007-67407-C03

Paloma Martínez Fernández*, Universidad Carlos III de Madrid
José Miguel Goñi Menoyo**, Universidad Politécnica de Madrid
Antonio Moreno Sandoval***, Universidad Autónoma de Madrid

Abstract

The project aims to create a multimodal (text and voice), multilingual question answering platform that integrates the modules developed by the participating groups. The starting hypothesis is that the question answering task of current systems can be improved by working on the modules that make up the architecture of such a system: in particular, the multilingual IR modules, enhanced indexing, faster information access, improved answer extraction and ranking, and question analysis. We deal with web information, encyclopaedic resources, scientific documents and news. Linguists' work is therefore essential to develop or adapt appropriate resources and to integrate lexical and software resources. We also aim to apply these techniques and this methodology to other areas, such as ontologies and information retrieval, named entities and voice interaction, investigating ways of adapting these tasks to new domains and languages.

Keywords: question answering, information retrieval, linguistic resources

1 Project Goals and resources

The aim of this project is to develop a platform for question answering on multimedia contents. This environment will allow the analysis of available techniques and methods in multilingual Information Retrieval (IR), question answering (QA), information and ontology extraction, and automatic recognition of spontaneous speech. Moreover, it is important to focus on Spanish, both as the query language and as the language of the document collections.
This objective also implies applying new techniques and enhancing current ones by defining and evaluating hybrid techniques. The scope of the project is not limited to textual objects; it extends to multimedia objects, which will be described using particular cases of the document representations used for textual objects. The partial objectives are:

• To create a multimodal (text and voice) and multilingual QA platform to access multimedia contents.
• To integrate in this platform the components for the different online and offline phases that QA systems have to perform, advancing the state of the art in this field (information retrieval, answer extraction and ranking, and question analysis).
• To define, implement and evaluate the updates needed to integrate an IR subsystem into the QA platform. In particular, units smaller than documents (sentences, paragraphs, etc.) will be considered in order to locate the required information, together with the intelligent treatment of entities (named entity recognition) and the integration of lexical and semantic knowledge in query expansion.
• To evaluate the platform in international forums, mainly CLEF, TAC and others.
• To develop linguistic resources for Arabic and Japanese.
• To integrate linguistic resources that allow better processing of spontaneous speech, in order to adjust a speech recognizer to user queries.
• To design data models in specific domains in order to build ontologies using semiautomatic techniques.

To achieve the goals stated above, the project was assigned 8 EDP from the LABDA-UC3M subproject, 5.5 from the GSI-UPM subproject and 9 from the LLI-UAM subproject. The planning of each subproject is shown in Annex I.

* Email: [email protected]
** Email: [email protected]
*** Email: [email protected]
2 Level of achievement

BRAVO has three subprojects. The first, BRAVO-BR (LABDA-UC3M), carries out the platform development tasks and the components related to information extraction, question analysis with temporal expressions, and speech recognition adapted to QA systems. The second, BRAVO-RI (GSI-UPM), is in charge of the multimedia information retrieval components (storage and access optimization, paragraph retrieval, integration of resources) as well as domain semantics modelled using metadata and ontologies for restricted application domains. The third, BRAVO-RL (LLI-UAM), provides the multilingual resources needed to extend the platform to non-Western languages; in particular, it develops resources and linguistic tools for Arabic and Japanese, as well as an oral corpus for processing Spanish questions. Figure 1 shows the participation of each team in the project.

2.1 Question Answering and Information Extraction

As the main result, we have developed a software platform that includes a QA system whose original modules have been improved; a language-independent named entity recognizer, SPINDEL, that applies machine learning based on bootstrapping techniques [2, 3, 4]; and a temporal expression recognizer [20, 21, 22]. An evaluation module with a twofold functionality has been added. First, it tests the QA systems in different domains. Second, for voice input, a software tool, RET (Recognition Evaluation Tool), has been developed to test the output of commercial ASR systems; it has been used in three scenarios: queries to the QA system, automatic transcription of audio from video files [27, 44], and a real-time captioning system used in the classroom for deaf students [9].
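To make concrete what testing the output of an ASR system can involve, the sketch below computes the word error rate (WER) between a reference transcript and a recognizer hypothesis. This report does not state which metric RET uses, so WER is an assumption here (it is a common measure for this purpose), and the function name `word_error_rate` and the example query are hypothetical, not part of RET.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenisation and case-insensitive comparison;
    a real evaluation tool may normalise text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical spoken query in which the recognizer drops one word.
    print(word_error_rate("quién escribió el quijote", "quién escribió quijote"))  # 0.25
```

A score of 0 means the hypothesis matches the reference word for word; values above 1 are possible when the recognizer inserts many spurious words.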
Regarding the work on building our own ASR system (activities 1.5.2 to 1.5.5, see the planning in Annex I), we decided to work exclusively with commercial ASR systems (Via Voice and Dragon) in activities 1.5.1 and 1.5.6, because only one technical engineer has been recruited and he is working on integrating the modules developed in the project into the QA platform.

These developments have been tested in three domains: news collections (EFE), Wikipedia and scientific documentation from Medline (biomedical texts). In the biomedical domain, a prototype has been developed for drug name recognition and for the extraction of drug-drug interactions from the medical literature, using UMLS, dictionaries and the USAN drug naming rules. As a result, a corpus is available that was automatically annotated by the DrugNer system with generic drug names and other biomedical concepts, and manually evaluated by a pharmacological expert. The corpus consists of 849 abstracts downloaded from PubMed and is available at http://basesdatos.uc3m.es/index.php?id=359. The system combines information obtained by the UMLS MetaMap Transfer (MMTx) program with the nomenclature rules recommended by the World Health Organization (WHO) International Nonproprietary Names (INNs) Program to identify and classify pharmaceutical substances [13, 14, 15, 16, 17, 18].

These techniques have been evaluated in several forums: the CLEF 2008 and 2009 (http://www.clefcampaign.org) track on Multiple Language Question Answering (QA@CLEF) [5, 6, 7], the Second Web People Search Evaluation [3, 26] and the Text Analysis Conference (TAC 2009) [2].

2.2 Multimedia Information Retrieval

During the first two years of the project, the GSI-UPM team has worked on the development and continuous improvement of the IDRA tool, and on testing this and other available tools (some developed previously in the research group) by participating in international competitions on information retrieval and related disciplines.
Although several tools and indexing systems were available, we decided to develop a new one. The new tool had to be open to different formats and offer functionality for evaluating new techniques related to multimedia information retrieval (in particular, text and image annotations). A prerequisite was that the parameters used to compute relevance and similarity among documents, news items, technical reports or even simple image annotations could easily be changed for experiments. In addition to basic functionality, IDRA also offers advanced functions for managing and storing contents efficiently. Its design is flexible and it is well documented, in order to facilitate future enhancement.

Regarding the development of IDRA [24, 39, 40], the first tasks were the review and adaptation of previously existing resources. Since then, the key points are: (a) It is fully implemented in Java, using the most appropriate data structures for index management, which yields greater indexing capacity and shorter response times for index queries; (b) Its interface offers new functionality: more text formats can be indexed; Lucene is integrated for comparing results; data and data structures stored after indexing can be viewed, browsed and managed; and results can be analysed using different evaluation metrics; (c) IDRA is distributed under the GPL 3.0 license (see http://sourceforge.net/projects/idraproject/).

A set of activities related to sentiment analysis has been initiated [28]. A full review of available resources, taking multilinguality into account, was completed, together with a comparative evaluation. The resources analysed include SentiWordNet, WordNet-Affect, VerbNet and ConceptNet. Unfortunately, it was not possible to participate in the "SemEval Task on Affective Text" competition, which would have allowed a more complete analysis.
LABDA-UC3M is working on a methodological approach that applies metadata of multimedia contents to improve web accessibility [7, 8, 10, 11].

Regarding evaluation activities, in 2008 the team participated in NTCIR-7, an international competition on information retrieval issues for Asian languages (as well as English). The task we took part in was multilingual sentiment analysis for Asian languages and English, for which we submitted a few experiments. We participated in the 2008 and 2009 editions of CLEF, continuing our uninterrupted participation since the 2003 edition, and submitted several experiments in different tracks. In particular, the tasks for CLEF 2008 were ImageCLEFphoto, ImageCLEFmed, ImageCLEF Medical Image Annotation and VideoCLEF. In the CLEF 2009 edition, the tasks we participated in were ImageCLEFphoto and ImageCLEFmed [25, 29, 30, 31, 32, 33, 34, 35]. In the ImageCLEFmed tasks we tried to improve the retrieval of medical images from multilingually annotated, heterogeneous collections using semantic expansion techniques [36, 37, 42]. In ImageCLEFphoto, the IDRA tool was used in some experiments (integrating text retrieval data and content-based image retrieval data), and in others clustering techniques were tried for ordering the results obtained for the queries.

2.3 Linguistic Resources

The main tasks of LLI-UAM in BRAVO are:
(a) Creation of new multilingual resources in Arabic, Spanish and Japanese [52, 54];
(b) Design and annotation of a Spanish speech corpus of questions [9];
(c) Definition of a model for question classification;
(d) Addition of linguistic resources to improve the handling of spontaneous speech, in order to adapt a speech recognizer to question formulation.

Of these goals, (a) is the most important and time-consuming for the subproject. In this task, LLI-UAM has worked alone, without coordination with the other two subprojects.
On the other hand, for the last three tasks, LLI-UAM has worked in close collaboration with the LABDA-UC3M team, as those linguistic resources are basic for training the QA system [5, 6, 20]. As for the linguistic resources (LR) developed, the current work comprises: improvement of a Spanish PoS tagger and phonological transcriber; development of an Arabic PoS tagger; development of a child corpus of Spanish [50]; development of an acoustic database of questions for Spanish and Arabic; development of a spontaneous speech corpus of Japanese; and development of a basic audio lexicon of Japanese for didactic purposes. The most outstanding resources, in terms of innovation, are those devoted to Arabic (the tagger and the acoustic database), since few groups in the world work on Arabic NLP and LR.

3 BRAVO mid-term results

3.1 Personnel in training

With respect to the training of human resources, several doctoral dissertations have been completed or are in progress:

• César de Pablo Sánchez, “Semisupervised learning of patterns for answer extraction in QA systems” (July 2010), European mention, LABDA-UC3M.
• Isabel Segura Bedmar, "Application of information extraction techniques to pharmacological domain: extracting drug-drug interactions" (April 2010), European mention, LABDA-UC3M.
• Lourdes Moreno, “AWA, a methodological Framework specific of accessibility to develop web applications” (March 2010), LABDA-UC3M.
• Marta Garrote Salazar, “CHIEDE: corpus de habla infantil espontánea del español”, 2008, LLI-UAM.
• Ana González Ledesma, “Los marcadores del discurso en el corpus C-ORAL-ROM: anotación pragmática, estrategias computacionales de etiquetado y aplicaciones a otros campos”, 2010, LLI-UAM.
• Julio Villena, “Hybrid Models for Information Retrieval”, GSI-UPM, in progress.
• Sara Lana, “Cognitive models of feedback for Information Retrieval”, GSI-UPM, in progress.
• Mª Teresa Vicente Díez, “Reconocimiento expresiones temporales en castellano y su aplicación a la extracción de información”, LABDA-UC3M, in progress.

In addition, two new researchers in training have joined the LLI-UAM team: Alicia González (FPU grant) and Leonardo Campillos (predoctoral contract funded by the Madrid Regional Government). Ten undergraduate students have also carried out their master's theses on topics related to the project.

3.2 Coordination

Coordination among the three subprojects is reflected in the evaluation of the platform at the international CLEF forum under the MIRACLE team, which includes the three research teams plus the company DAEDALUS (EPO in the project proposal). As shown in the publications section, CLEF participation has taken place in the “Multilingual Question Answering (QA@CLEF)” and “Cross-Language Image Retrieval (ImageCLEF)” tracks.

Moreover, the three research groups belong to the MAVIR consortium (a network of excellence funded by the Madrid Regional Government, www.mavir.net), where they have actively participated in several workshops, conferences and other projects. LABDA-UC3M organized the Spanish Conference on Natural Language Processing (SEPLN 2008) (http://basesdatos.uc3m.es/sepln2008/web/), where researchers from the three groups took part in sessions on language technologies. LLI-UAM organized the VIII Congreso de Lingüística General (http://elvira.lllf.uam.es/clg8/), with a session on multilingual natural language processing that included GSI-UPM researchers.

Joint research between the LLI-UAM and GSI-UPM teams goes back more than 15 years. Both groups have participated in several coordinated projects, as well as in joint publications and software development. The relation with LABDA-UC3M is more recent, but has been very intensive in the last five years: both teams participate jointly in BRAVO and in MAVIR. The UAM and the UC3M have exchanged researchers (Dr. Doaa Samy and Dr.
Marta Garrote) for a few months, with excellent results in terms of output. The three groups in the project are again submitting a coordinated proposal to the next R&D call, as final proof of the satisfactory research experience.

3.3 Collaboration with other national and international research groups

LLI-UAM has strengthened its relations with international groups closely related to the research interests of the project and to the previous connections of the LLI members:

Cairo University (CU): Dr. Samy is an Associate Professor of Spanish and Computational Linguistics. In 2009, with a grant from the AECID, a Spanish-Egyptian Workshop on NLP and LR for Spanish and Arabic was held in Cairo, co-organized with Dr. Moreno. The most important result of this international cooperation has been the signing of a research agreement between UAM and CU, promoted by Moreno and Samy respectively.

Tokyo University of Foreign Studies (TUFS): there is already a student/teacher exchange agreement between UAM and TUFS, with Kimura as the person responsible at UAM. In 2009-10 a grant from UAM-Banco Santander for research with Asian institutions was received. During this period several visits to Tokyo were scheduled to record spontaneous speech for our corpus of spoken Japanese.

Language Technology Lab at DFKI, Saarbrücken: Dr. Alcántara has been a visiting post-doctoral researcher for two years, working on different projects related to multimodal processing. The relation will be maintained in the future through the participation of T. Declerck in the next project proposal as an external member of the LLI.

LABDA-UC3M: In the last two years (2008 and 2009) considerable effort has been made to promote mobility, with the aim of exchanging knowledge with other relevant national and international research groups.
The researchers taking part in this project have made several research stays: in 2008 Lourdes Moreno spent three months at DSIC at UPV under the supervision of Dr. Oscar Pastor, and Isabel Segura visited the Natural Language Engineering group under the supervision of Dr. Paolo Rosso; in 2009 César de Pablo and Isabel Segura spent 6 months at DFKI, Saarbrücken (Germany), with Thierry Declerck. Finally, José Luis Martínez is finishing his PhD, titled "Incorporating semantics in a software process development through Business Rules” (April 2010), with José Carlos González and Paloma Martínez as supervisors.

3.4 Technology transfer

The BRAVO project is of great importance to the company DAEDALUS because of its interest in QA technology and its integration into voice user interfaces. For this reason, the company has developed, in collaboration with the teams, a web QA system working on the Spanish Wikipedia, called respond.es, which is available at http://miracle2.uc3m.es:8180/QAGWTInterface/, and has funded grants for three undergraduate students from UC3M to do their master's theses on this demonstrator. This has enabled DAEDALUS to follow advances in the state of the art in QA technologies, as well as the application of ASR technology to this kind of application. If the results of BRAVO make it viable to define a product that could be commercialised, an agreement could be signed among the authors.

GSI-UPM and LABDA-UC3M work as university partners of DAEDALUS in the CENIT-E project BUSCAMEDIA: Hacia una adaptación semántica de medios digitales multirred-multiterminal (CEN-20091026, 2009-2012).

As a result of the research in QA, the system "SQUASH: A Question Answering System for Spanish”, which is part of the Technology Portfolio, Technical Services and R&D Networks promoted by the Fundación para el Conocimiento madri+d in 2008, was jointly developed by researchers from the LABDA-UC3M and LLI-UAM teams. SQUASH is a modular question answering system for the Spanish language.
It enhances traditional search engine functionality by providing precise answers in real time to questions in natural language.

The usefulness of the results for society is related to the impact of language technologies on the Information Society. Language resources, the main line of work of LLI-UAM, provide data for inferring knowledge and for training NLP systems. The multimodal (audio and text) and multilingual nature of the resources compiled during the project is a clear sign of the innovation of the research. In addition to supporting NLP, some of those linguistic resources are also applied by the team's researchers in teaching spoken language, especially Spanish and Japanese. This latter application was not foreseen in the project proposal and is becoming a very active and productive line of work (a couple of books will soon be in print).

4 References

LABDA-UC3M Publications

[1] Castro, E., Castaño, L. and Martínez, P. Evaluation of a named entity recognition system over SNOMED CT. Simposio OpenHealth-Spain, Universidad de Alcalá, 29-30 April 2009.
[2] César de Pablo-Sanchez, Juan Perea, Isabel Segura-Bedmar, Paloma Martinez. The UC3M team at the Knowledge Base Population task. Text Analysis Conference (TAC 2009), November 2009.
[3] De Pablo Sánchez, C. and Martínez, P. UC3M at WePS2-AE: Acquiring Patterns for People Attribute Extraction from Webpages. 2nd Web People Search Evaluation Workshop, April 21st, Madrid, Spain. Co-located with the WWW2009 conference.
[4] De Pablo, C.; Martínez, P. (2009). Building a Graph of Names and Contextual Patterns for Name. ECIR 2009, LNCS 5478, Springer, 2009, pp. 530-537.
[5] De Pablo-Sánchez, C., Martínez-Fernández, J.L., González-Ledesma, A., Samy, D., Martínez, P., Moreno-Sandoval, A. and Al-Jumaily, H. Combining Wikipedia and Newswire Texts for Question Answering in Spanish. Advances in Multilingual and Multimodal Information Retrieval, CLEF 2007, Revised Selected Papers, LNCS 5152, pp. 352-355.
[6] Martínez-González, A., de Pablo-Sánchez, C., Polo-Bayo, C., Vicente-Díez, M.T., Martinez-Fernández, P., Martínez-Fernández, J.L. 2008. The MIRACLE Team at the CLEF 2008 Multilingual Question Answering Track. CLEF 2008, Revised Selected Papers. LNCS 5706, pp. 409-420.
[7] Moreno, L., Martínez, P. and Ruiz, B. “Disability Standards for Multimedia on the Web”. IEEE Multimedia, volume 15, issue 4, 2008, pp. 52-54. Moreno, L., Martínez, P. and Ruiz, B. “Guiding accessibility issues in the design of Websites”, SIGDOC'08, Sep 22-24, Lisboa, Portugal, 2008.
[8] Moreno, L., Martínez, P. and Ruiz, B. “Integrating HCI in a Web accessibility engineering approach”. 13th International Conference on Human-Computer Interaction, HCI 2009, 19-24 July 2009, San Diego, CA, USA.
[9] Moreno, J., Garrote, M., Martínez, P. and Martínez-Fernández, J.L. Some experiments in evaluating ASR systems applied to multimedia retrieval. 7th Workshop on Adaptive Multimedia Retrieval, 24-25 September, Madrid, 2009.
[10] Moreno, L., Martínez, P. and Ruiz-Mezcua, B. «A bridge to Web Accessibility from the Usability Heuristics». 5th annual Usability Symposium USAB 2009. Usability & HCI eInclusion, Springer LNCS 5889. November 09-10, 2009, Linz, Austria.
[11] Moreno, L., Martínez, P. and Ruiz-Mezcua, B. «Guías metodológicas para contenidos multimedia accesibles en la Web». Interacción 2009, X Congreso Internacional de Interacción Persona-Ordenador, 7-9 September 2009, Barcelona, Spain.
[12] Pérez-Lainez, R., Iglesias, A., de Pablo-Sanchez, C. ANONIMYTEXT: Anonimization of Unstructured Documents. KDIR 2009, November 2009.
[13] Segura-Bedmar, I., Crespo, M., de Pablo-Sánchez, C. (2009) Score-based approach for Anaphora Resolution in Drug-Drug Interactions Documents. 14th International Conference on Applications of Natural Language to Information Systems (NLDB 2009).
[14] Segura-Bedmar, I., Crespo, M., de Pablo-Sánchez, C., Martínez, P.
(2009) DrugNerAR: Linguistic Rule-Based Anaphora Resolver for Drug-Drug Interaction Extraction in Pharmacological Documents. ACM Third International Workshop on Data and Text Mining in Bioinformatics (DTMBIO 09), November 2009.
[15] Segura-Bedmar, I.; Martínez, P.; Samy, D. (2008). A preliminary approach to recognize generic drug names by combining UMLS resources and USAN naming conventions. BIONLP'08, Association for Computational Linguistics (ACL), Columbus, Ohio, 19 June 2008.
[16] Segura-Bedmar, I.; Martínez, P.; Samy, D. (2008). Detección de fármacos genéricos en textos biomédicos. Revista Española para el procesamiento del lenguaje natural, 40, 27-34.
[17] Segura-Bedmar, I.; Martínez, P.; Segura-Bedmar, M. (2008). Drug name recognition and classification in biomedical texts. Drug Discovery Today, 13 (17/18), 816-823.
[18] Segura-Bedmar, Isabel, Crespo, Mario, de Pablo-Sánchez, Cesar, Martínez, Paloma. (2010). Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. To appear in BMC Bioinformatics.
[19] Vicente-Díez, M.T. and Martínez, P. Aplicación de técnicas de extracción de información temporal a los sistemas de búsqueda de respuestas. Revista Procesamiento del Lenguaje Natural, N. 42 (March 2009), pp. 25-30.
[20] Vicente-Díez, M.T., de Pablo-Sánchez, C., Martinez-Fernández, P., Moreno, J. and Garrote, M. 2009. Are Passages Enough? The MIRACLE Team Participation at QA@CLEF2009. In Cross-Language Evaluation Forum (CLEF) 2009 Working Notes, in ECDL 2009 conference. Corfú, Greece, September 2009.
[21] Vicente-Díez, M.T., Martínez, P. 2009. Temporal Semantics Extraction for Improving Web Search. 8th International Workshop on Web Semantics (WebS'09), in Proceedings of the 20th DEXA 2009, Linz, Austria, 31 August - 4 September 2009.
[22] Vicente-Díez, M.T., Samy, D. and Martínez, P.
An empirical approach to a preliminary successful identification and resolution of temporal expressions in Spanish news corpora. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08). European Language Resources Association (ELRA). Marrakech, Morocco, 28-30 May 2008.

GSI-UPM Publications

[23] Ana García-Serrano and José Miguel Goñi-Menoyo. Applied Research in Linguistic Engineering: Resources and Tools. “Egyptian-Hispanic Meeting on Language Processing and Language Resources in Spanish and Arabic”, Cairo University, Egypt, 1-4 November 2009. Supported by AECID and Mavir Consortium.
[24] Ana García-Serrano, Xaro Benavent, Rubén Granados and José Miguel Goñi-Menoyo. Some Results Using Different Approaches to Merge Visual and Text-Based Features in CLEF’08 Photo Collection. Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.
[25] González-Cristóbal, José C.; Goñi-Menoyo, José M.; Villena-Román, Julio; and Lana-Serrano, Sara. (2008) MIRACLE Progress in Monolingual Information Retrieval at Ad-Hoc CLEF 2007. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 156-159.
[26] González-Cristóbal, José C.; Maté, Pablo; Vadillo, Laura; Sotomayor, Rocío; and Carrera, Álvaro. Learning by doing: A baseline approach to the clustering of web people search results. In Proceedings of the 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. Madrid, Spain, April 2009.
[27] Julio Villena-Román, Sara Lana-Serrano (September 2008) MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts. Working Notes for the CLEF 2008 Workshop.
[28] Julio Villena-Román, Sara Lana-Serrano and José C. González-Cristóbal (December 2008) MIRACLE at NTCIR-7 MOAT: First Experiments on Multilingual Opinion Analysis.
[29] Julio Villena-Román, Sara Lana-Serrano and José Carlos González-Cristóbal (2008) MIRACLE at ImageCLEFmed 2007: Merging Textual and Visual Strategies to Improve Medical Image Retrieval. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 593-596.
[30] Julio Villena-Román, Sara Lana-Serrano, José C. González-Cristóbal (2008) MIRACLE-GSI at ImageCLEFphoto 2008: Experiments on Semantic and Statistical Topic Expansion. Working Notes for the CLEF 2008 Workshop.
[31] Julio Villena-Román, Sara Lana-Serrano, José C. González-Cristóbal (2009) MIRACLE-GSI at ImageCLEFphoto 2009: Comparing Clustering vs. Classification for Result Reranking. Working Notes for the CLEF 2009.
[32] Julio Villena-Román, Sara Lana-Serrano, José Luis Martínez-Fernández and José Carlos González-Cristóbal. (2008) MIRACLE at ImageCLEFphoto 2007: Evaluation of Merging Strategies for Multilingual and Multimedia Information Retrieval. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 500-503.
[33] Lana-Serrano, Sara; Villena-Román, Julio; and González-Cristóbal, José-C. MIRACLE at ImageCLEFmed 2008: Semantic vs. Statistical Strategies for Topic Expansion. Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.
[34] Lana-Serrano, Sara; Villena-Román, Julio; and González-Cristóbal, José-C. MIRACLE-GSI at ImageCLEFphoto 2008: Different Strategies for Automatic Topic Expansion. Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.
[35] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José C.; and Goñi-Menoyo, José M.
(2008) MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of Geographical Information. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 786-793.
[36] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José C.; and Goñi-Menoyo, José M. (2008) MIRACLE at ImageCLEFanot 2007: Machine Learning Experiments on Medical Image Annotation. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 597-600.
[37] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José Carlos; and Goñi-Menoyo, José Miguel. (2009) MIRACLE at ImageCLEFannot 2008: Nearest Neighbour Classification of Image Feature Vectors for Medical Image Annotation. Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.
[38] R. Granados, X. Benavent, A. García-Serrano, J.M. Goñi. (2008) MIRACLE-FI at ImageCLEFphoto 2008: Experiences in merging Text-based and Content-based Retrievals. Working Notes for the CLEF 2008 Workshop.
[39] R. Granados, X. Benavent, R. Agerri, A. García-Serrano, J.M. Goñi, J. Gomar, E. De Ves, J. Domingo, G. Ayala. (2009) MIRACLE-FI at ImageCLEFphoto 2009. Working Notes for the CLEF 2009 Workshop.
[40] Rubén Granados Muñoz, Ana García Serrano, José M. Goñi Menoyo. La herramienta IDRA (Indexing and Retrieving Automatically). XXV Conferencia de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN’09). San Sebastián, 2009.
[41] Sara Lana-Serrano, Julio Villena-Román, José C. González-Cristóbal (2008) MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion. Working Notes for the CLEF 2008 Workshop.
[42] Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal, José Miguel Goñi-Menoyo.
(2008) MIRACLE at ImageCLEFannot 2008: Classification of Image Features for Medical Image Annotation. Working Notes for the CLEF 2008 Workshop.
[43] Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal. (2009) MIRACLE at ImageCLEFmed 2009: Reevaluating Strategies for Automatic Topic Expansion. Working Notes for the CLEF 2009 Workshop.
[44] Villena-Román, Julio; and Lana-Serrano, Sara. MIRACLE at VideoCLEF 2008: Topic Identification and Keyframe Extraction in Dual Language Videos. Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.

LLI-UAM Publications

[45] Alcántara, M. Introducción al análisis de estructuras lingüísticas en corpus. Aproximación semántica. Madrid: Servicio de Publicaciones de la UAM, 2007.
[46] Alcántara, M. "El análisis lingüístico en la transcripción automática de la lengua hablada, el Proyecto COAST". VIII Congreso de Lingüística General: El valor de la diversidad [meta]lingüística, Madrid, 2008.
[47] Alcántara, M. "La anotación del habla en corpus de vídeo". Revista de Procesamiento del Lenguaje Natural, 8, 2007.
[48] Alcántara, M. "Uso de corpus de habla espontánea en la enseñanza de la cortesía en español". In Nicolás, Carlota: Ricerche sul Corpus del parlato romanzo C-ORAL-ROM. Studi linguistici e applicazioni didattiche per l'insegnamento di L2. Firenze: Firenze University Press, 2007.
[49] Campillos, L. "Las expresiones causales en el corpus de habla espontánea C-ORAL-ROM". In Actas del 8º Congreso de Lingüística General, UAM, 25-28 June 2008.
[50] Garrote, M., Guirao, J.M. and Moreno, A. "Extracción de unidades distintivas en adultos y niños de un corpus de lengua oral espontánea". 8º Congreso de Lingüística General, UAM, June 2008.
[51] González-Ledesma, A. "Pragmatext, Annotating the Spanish C-ORAL-ROM Corpus with Pragmatic Knowledge". 4th Corpus Linguistics Conference, University of Birmingham, July 2007.
[52] González-Ledesma, A. and Samy, D. "Marcadores discursivos en árabe y español: un estudio computacional basado en corpus paralelos con anotación pragmática". 8º Congreso de Lingüística General, UAM, 25-28 June 2008.
[53] Gozalo, P. "Reflexiones sobre el futuro. Los datos del español no nativo". 8º Congreso de Lingüística General, UAM, 25-28 June 2008.
[54] Kimura, C. "The constancy and alteration in the respect language of Japanese". Panel titled "Re-creation of Identities in East Asia: Literature and Linguistics", 5th International Convention of Asia Scholars, Kuala Lumpur, 2007.
[55] Moreno Sandoval, A., Guirao, J.M. and Torre Toledano, D. "Herramientas de anotación de corpus de habla espontánea del Laboratorio de Lingüística Informática de la UAM". Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, No. 41, 2008.
[56] Moreno Sandoval, A., T. Toledano, D., De La Torre, R., Garrote, M. and Guirao, J.M. "Developing a Phonemic and Syllabic Frequency Inventory for Spontaneous Spoken Castilian Spanish and their Comparison to Text-Based Inventories". LREC 2008, Marrakech.
[57] Moreno Sandoval, A. (editor). Actas del VIII Congreso de Lingüística General: El valor de la diversidad [meta]lingüística. Madrid. CD-ROM. ISBN 978-84-96487-19-9, 2008.
[58] Samy, D. and González-Ledesma, A. "Pragmatic Annotation of Discourse Markers in a Multilingual Parallel Corpus (Arabic-Spanish-English)".
LREC 2008, Marrakech, May 2008.

Figure 1: Linguistic and software modules and participants in the BRAVO project

ANNEX I: PLANNING SUBPROJECT 1
Activities/Tasks
1.1.1 Project Management and Coordination
1.1.2 Subproject 1 Management
1.2 Platform for resources integration
1.3.1 Definition of the architecture
1.3.2 Adjustment of existing resources, modules and prototypes
1.3.3 Integration in the environment
1.4.1 Analysis of the state of the art in answer extraction
1.4.2 Analysis and implementation of a module for temporal exp..
1.4.3 Flexible answer extraction
1.4.4 Validation and ranking of answers
1.5.1 Evaluation of commercial and open solutions
1.5.2 Evaluation and implementation of a SAD
1.5.3 Module for feature extraction
1.5.4 Estimation of the acoustic model
1.5.5 Estimation of a language model
1.5.6 Validation of the system in several environments
1.6.1 Textual questions analysis module
1.6.2 Oral questions analysis module
1.6.3 Validation of prototype
1.7.1 Design and implementation of a probabilistic QA model
1.7.2 Integration and validation of the prototype
1.8.1 Participation in international evaluation forums
1.9.1 Web demonstrator of QA prototypes
1.9.2 Publication of research results

PLANNING SUBPROJECT 2
Activities/Tasks
2.1.1 Management of subproject 2
2.1.2 Coordination of subproject 2
2.2.1 Debugging and enlargement of existing linguistic …
2.2.2 Compilation of semantic resources based on ….
2.3.1 Debugging and enlargement of linguistic processing …
2.3.2 Development of entities recognition module
2.3.3 Development of semantic processing module
2.3.4 Modules for multimedia processing
2.4.1 Domain modeling
2.5.1 Analysis of current status of IR in QA systems
2.5.2 Integration/adaptation of specific improvements for QA

PLANNING SUBPROJECT 3
Activities/Tasks
3.1.1 Management of subproject 3
3.2.1 Development of Arabic resources
3.2.2 Development of Spanish resources
3.2.3 Development of Japanese resources
3.3.1 Study of domain and design of recordings collec.
3.3.2 Collection of subcorpus of read speech
3.3.3 Collection of subcorpus of spontaneous speech
3.3.4 Annotation of the corpus of questions
3.3.5 Splitting the corpus: training and evaluation
3.4.1 Model to classify textual questions in Spanish
3.4.2 Model to classify textual questions in Arabic
3.4.3 Grammars for oral questions in Spanish

[Gantt chart for the three subprojects: months scheduled for each task, in columns First year (*), Second year (*), Third year (*)]