Proceedings of the 3rd Argentine Congress of Bioinformatics and Computational Biology (3er Congreso Argentino de Bioinformática y Biología Computacional, 3CAB2C)
26, 27 and 28 September 2012
Facultad de Ingeniería – Universidad Nacional de Entre Ríos, Oro Verde, Entre Ríos

Organization

Scientific committee
Dr. Ariel Amadío, UNER
Dr. Ariel Chernomoretz, UBA
Dr. Diego Ferreiro, UBA
Dr. Morten Nielsen, UnSam
Dr. Sergio Pantano, Institut Pasteur, ROU
Dr. Alfredo Quevedo, UNER
Dr. Maximo Rivarola, INTA
Dr. Gustavo Vazquez, UNS

Organizing committee
Mst. Bioing. Rubén Acevedo
Mst. Bioing. Gerardo Gentiletti
Dra. Fernanda Izaguirre
Dr. Víctor Casco
Dr. Ariel Amadío
Bioing. Pedro Tomiozzo
Bioing. Yanina Atum
Bioing. Analía Cherníz
Dr. Fernán Agüero
Dr. Diego Ferreiro
Dra. Cristina Marino Buslje
Dr. César Martínez
Bioing. Roberto Leonarduzzi
Bioing. Iván Gareis

Lectures

Bioinformatics for data-driven biology
Dr. Mario Caccamo

Lecture abstract. Obtaining the sequence of the three billion bases of the human genome was hailed as one of the biggest achievements in the history of science. It involved a remarkable organisation of scientists, policy-makers and funding agencies from across the world. This feat marked the beginning of a new era in molecular biology characterised by a revolution in data generation. As in other areas of science, technological advances combined with the availability of high-performance computers have made it possible to produce, process and collect biological data at a rate that has transformed the life sciences. Today, any laboratory equipped with the latest sequencing technologies can generate in only a few days as much sequence as the Human Genome Project did in 10 years, and at a fraction of the cost. More excitingly, these new technologies have opened up possibilities for new applications such as the promise of personalised medicine, the study of environmental samples and the ability to apply more effective crop breeding methods.
As sequencing becomes cheaper and more accurate, we will soon be able to explore genetic information in real time and at the single-cell level. This unprecedented wealth of data, however, has come with new challenges. There is a growing gap between the capacity to generate genomic sequences and the ability to process and interpret the resulting data. The sheer volume of information requires new levels of software sophistication both to cope with the load and to analyse it effectively. If we are to realise the value of data-intensive biology we cannot rely on existing methodologies. One example is the novel hardware and software architectures, such as cloud computing platforms, that are emerging to cope with the demands of big data analyses. Although these solutions can help to close the "Next-Generation Gap" in molecular biology, they still don't provide the complexity needed to integrate and interpret data from multidisciplinary sources, let alone the ability to understand it. The breakthroughs, however, will be achieved by training and educating the next generation of scientists and professionals in a new paradigm of science driven by data. In this presentation I will explore the impact that this revolution has had on the use of informatics in biology and how this transformation is only the beginning of a very exciting future for the life sciences.

In silico characterization of intermolecular interactions in biological systems
Dr. Claudio Cavasotto

Lecture abstract. Today, computational simulation is an invaluable tool to study macromolecular association and enzymatic reactions, and to understand at a molecular level the relationship between structure, dynamics and function. Thus, it provides an efficient and insightful complement to experimental evaluation. At the core of these calculations lies the potential energy function, which describes the intermolecular interactions in the system.
The latest developments of our research group will be presented, focusing on the application of in silico methods to problems in the areas of structural biology, drug discovery, binding free energy calculation, and cheminformatics, namely:
I. the ligand-steered homology modelling method, where the interaction of known ligands with the receptor is used to shape and optimize the binding site through a stochastic global energy minimization, with the final goal of using the modelled structures in structure-based drug discovery;
II. the discovery of novel modulators of GPCRs and nuclear receptors through coarse-grained high-throughput docking followed by experimental evaluation;
III. the use of quantum mechanical (QM) methods to study biomacromolecular interaction; as a case study, the QM calculation of absolute and relative binding free energy of tetra-phosphopeptides to the SH2 domain of human LCK will be presented and compared to the failure of classical methods.
Current limitations of computational methods and future trends will also be discussed.

Mining the Schistosoma genome for new drug targets
Dr. Guilherme Correa-Oliveira

Lecture abstract. There are three important Schistosoma species parasitizing humans: Schistosoma mansoni, S. japonicum and S. haematobium. Together they chronically infect at least 200 million people, and more than 200,000 deaths are reported annually worldwide. Several efforts, including health education, sanitation, intermediate host control, and chemotherapy treatments, are among the strategies recommended by the WHO. However, in recent years infection prevalence and, in some regions, intensity have not declined. The drug of choice to treat schistosomiasis is oral praziquantel, which has been used for over 40 years.
Although praziquantel is an efficacious drug with some limitations, in recent years problems of resistance have arisen and no alternatives exist so far. To contribute to a solution, information from the recently sequenced genomes of these parasites was used to identify potential targets for the development of an alternative drug. Advances in structural and functional genomics, proteomics, genetics and molecular biology have substantially increased the amount of available data for schistosome research. Making full use of this information requires computational resources and skills that may not be readily available to most researchers. Integration of the large volumes of different data types in a user-friendly and easily available manner is of major importance to the community and one of the objectives of our group. A database containing genomic, gene annotation and functional data, SchistoDB (www.schistodb.net), was constructed using the GUS Schema. SchistoDB offers a variety of tools including BLAST, protein motif searches, keyword searches of pre-computed BLAST results, Gene Ontology assignments, protein family information and microarray probes. We have also produced SchistoCyc, the complete metabolic pathway prediction produced using the PathwayTools software. SchistoDB includes a list of drugs predicted to act on orthologues of S. mansoni according to KEGG DRUG, and links exist to TDR Targets.

We aimed at developing targets against two groups of proteins: histone modifying enzymes and protein kinases. Histone modifying enzymes (HMEs) play key roles in the regulation of chromatin modifications. Furthermore, aberrant epigenetic states are often associated with human diseases, leading to great interest in HMEs as therapeutic targets.
We have identified and characterized all enzymes involved in acetylation and methylation modifications, including histone acetyltransferases (HATs), deacetylases (HDACs), methyltransferases (HMTs) and demethylases (HDMs). We analyzed the predicted proteomes of the parasites in order to identify and classify the HMEs through computational approaches, mainly using HMM profiles. On average we identified 60 HMEs per species, with some variation among the three Schistosoma species. Of the identified enzymes, 24 were individually evaluated as therapeutic targets using RNA interference in cultured larval stages (schistosomula) to silence the corresponding genes. Although gene knockdown of up to 90% could be achieved, no phenotype could be observed after 7 days of dsRNA exposure. Loss of motility could be observed as a phenotype for two HDMs after 30 days of dsRNA exposure. In addition, in order to assess the role of genes in the presence of the host environment, under immunological pressure, knockdown parasites for four HMEs (HDAC8, KDM1/KDM2 and PRMT3) were tested in vivo. A significant reduction of worm burden (50%) could be observed in mice infected with HDAC8-knockdown parasites when compared with the non-specific control. Finally, egg count was significantly reduced in mouse livers for all tested HMEs. In conclusion, our work improved the functional annotation of over 20% of S. mansoni HAT and HDAC proteins. Parasites with reduced levels of HDAC8, KDM1/KDM2 and PRMT3 appear to show diminished oviposition and, for HDAC8, a reduced ability to survive in the host milieu, indicating that these enzymes could be good target candidates for drug development.

Since eukaryotic protein kinases (ePKs) are good chemical and medical targets for drug development, and an increasing number of ePK inhibitors have been approved for the treatment of different human diseases, they have become the focus of this study.
The ePKs were identified in S. mansoni, S. japonicum and S. haematobium by HMM searches and classified by group, family and subfamily by phylogeny. Most selected ePKs were activators/effectors of the MAPK signaling pathway, and key pathway proteins were chosen for experimental validation: SmRas, SmERK1, SmERK2, SmJNK, and SmCaMK2. RNAi was used to elucidate the functional role of MAPKs in signaling pathways. Although transcription was reduced, no phenotype was observed in culture. Therefore, mice were infected with the silenced schistosomula, and it was observed that SmJNK has an important role in transformation and survival of the parasites, as a low number of adult worms were recovered and the tegument of surviving worms was damaged. Moreover, SmERK1/SmERK2 expression was related to egg production, as mice infected with silenced schistosomula displayed significantly lower egg production and the recovered female worms had underdeveloped ovaries. Furthermore, it was shown that the c-fos transcription factor was overexpressed in parasites with low expression of SmERK1, SmJNK and SmCaMK2.

Two Universals in Genomics: Information Content and Species Abundance Diversity
Dr. Hernán Dopazo

Lecture abstract. In this talk we analyse two hypotheses: H1, that there is a common combinatorial structure of DNA along all diversity of life, and H2, that a common rule governing species abundance and diversity (SAD) exists in genomes.

H1. Our first hypothesis is that there is a random-like structure of DNA along all diversity of life. To test it, we define a complexity measure based on a classical method used in data compression and applicable to arbitrarily large sequences without introducing fragmentation. The method detects regularities due to repeats of any length, at any distance, and other structural correlations.
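As a rough illustration of the compression-based complexity idea described above, the following Python sketch compares the compressed-to-raw size ratio of a random-like sequence and a highly repetitive one. The abstract does not name the compressor; `zlib` (DEFLATE) is used here purely as a stand-in, and because its match window is limited to 32 KB it cannot capture repeats at arbitrary distance as the authors' method is said to do. The function name `compression_ratio` and the toy sequences are hypothetical.

```python
import random
import zlib


def compression_ratio(sequence: str) -> float:
    """Compressed size divided by raw size in bytes (lower means more regular)."""
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Toy stand-ins for genomes: one random-like, one built from a single motif.
random_like = "".join(rng.choice("ACGT") for _ in range(20_000))
repetitive = "ACGTTGCA" * 2_500  # same length as random_like

print(f"random-like: {compression_ratio(random_like):.3f}")
print(f"repetitive:  {compression_ratio(repetitive):.3f}")
```

With one byte per base and a four-letter alphabet, even an incompressible sequence would be expected to come out near a ratio of one quarter or somewhat above, so values are best read relative to that ceiling rather than to 1.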
As the main result we report that the ratio of genome complexity to size remained almost maximal and unchanged along six orders of magnitude in genome size, covering all biological diversity. We observe that complexity increases uniformly with genome size for phages, bacteria, unicellular eukaryotes, fungi, plants, and animals. Major deviations from maximal genome complexity correspond to polyploid species. We formulate two general hypotheses: almost maximal combinatorial structure of DNA sequence is a common characteristic of genomes throughout biological diversity; and increases in the combinatorial complexity of DNA only occur by mechanisms of genome amplification and subsequent accumulation of DNA sequence mutations, transpositions and/or deletions of genetic material. Our hypothesis can be falsified if a single recent polyploid genome with a random-like DNA structure is found, or if a non-polyploid genome shows a non-random DNA structure.

H2. Our second hypothesis is that there is a common rule governing species abundance and diversity (SAD) in genomics. To what extent does SAD reflect adaptive or stochastic outcomes? Ideal models for genomics would consider all diversity of elements populating eukaryote genomes. However, such a model does not exist. In ecology, the unified neutral theory of biodiversity (UNTB) assumes that interactions among trophically similar species are equivalent on an individual "per capita" basis. UNTB assumes that these individuals, regardless of the species, appear to be controlled by similar birth, death, dispersal, and speciation rates. Biodiversity composition therefore emerges randomly in the community. Here, taking advantage of the UNTB and the general framework posed by ecological genomics, we examine the relative SAD of genetic elements of 500 chromosomes in 30 eukaryote genomes. After ML adjustment of UNTB parameters and hypothesis testing we found that most chromosomes follow the relative SAD expected under UNTB.
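The neutral assumption described above (every individual, whatever its species, has the same chance of dying, reproducing or founding a new species) can be sketched as a small zero-sum simulation in Python. This is only an illustrative toy under stated assumptions, not the authors' fitting procedure: dispersal is omitted, and the function name `simulate_neutral_community` and all parameter values are hypothetical. In a genomic reading, "individuals" might stand for copies of genetic elements and "species" for element families, which is an interpretation rather than something the abstract spells out.

```python
import random
from collections import Counter


def simulate_neutral_community(size=200, speciation_rate=0.02, steps=20_000, seed=1):
    """Zero-sum neutral drift: one death and one replacement per step."""
    rng = random.Random(seed)
    community = [0] * size  # start with a single species, labelled 0
    next_label = 1
    for _ in range(steps):
        dead = rng.randrange(size)  # every individual is equally likely to die
        if rng.random() < speciation_rate:
            community[dead] = next_label  # the slot is taken by a brand-new species
            next_label += 1
        else:
            # Otherwise the slot is filled by a copy of a randomly chosen individual
            # (possibly the dead one itself, which leaves the community unchanged).
            community[dead] = community[rng.randrange(size)]
    return Counter(community)


abundances = simulate_neutral_community()
ranked = sorted(abundances.values(), reverse=True)  # rank-abundance curve
print(len(ranked), "species; most abundant counts:", ranked[:10])
```

The ranked counts form the species abundance distribution that a neutral fit would be compared against; no selection term appears anywhere in the update rule.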
While ecologists found natural selection an irrelevant component to explain relative SAD in forests, we found that the same simple neutral model fits the SAD of genetic elements in genomes. We suggest that the random-like structure and the observed SAD are universals in genomes along all diversity of life.

Predicting metabolic activity and crowd-sourcing promoter strength analysis
Dr. Pablo Meyer-Rojas

Lecture abstract. Although much is understood about the enzymatic cascades that underlie cellular biosynthesis, comparatively little is known about their cellular organization. We here show, via a detailed analysis of the localization of fluorescently tagged enzymes in bacteria, that biochemical reactions inside the cytoplasm are organized spatially following a rule where the first or the last enzymes are localized, and this localization is determined by the activity state of the cell's metabolic network. In the framework of DREAM6 (Dialogue for Reverse Engineering Assessments and Methods), a community effort to evaluate the status of the methodology for systems biology modeling, we presented a challenge where participants had to predict gene promoter expression from a previously unseen, experimentally generated data set. Twenty-one teams submitted results predicting the expression levels of 53 different promoters from ribosomal protein genes of the yeast S. cerevisiae. We here present the analysis of participants' predictions, providing a benchmark for assessment of methods predicting promoter activity.

Data mining in bioinformatics: an integrated approach based on computational intelligence
Dr. Diego Milone

Lecture abstract. Biology is in the middle of a data explosion.
The technical advances achieved by genomics, metabolomics, transcriptomics and proteomics technologies in recent years have significantly increased the amount of data that biologists can measure and analyze about different aspects of an organism. Moreover, *omics data sets pose several additional problems: they have inherent biological complexity and may contain significant amounts of noise as well as measurement artifacts. Extracting information from such databases therefore remains a challenge. It requires novel computational techniques and models to automatically perform data mining tasks such as integration of different data types, clustering and knowledge discovery, among others. This presentation is about a novel integrated computational intelligence approach for biological data mining that involves neural networks and evolutionary computation. We propose the use of self-organizing maps for the identification of coordinated pattern variations; a new training algorithm that can include a priori biological information to obtain more biologically meaningful clusters; a validation measure that can assess the biological significance of the clusters found; and finally, an evolutionary algorithm for the inference of unknown metabolic pathways involving the selected clusters.

Program

Half-hour slots from 9:00 to 19:30. Items are listed in chronological order for each day.

Wednesday 26/09/2012
Registration
Opening of 3CAB2C
Lecture: Dr. Mario Caccamo
Lunch
Oral session 1
Break
Lecture: Dr. Claudio Cavasotto
Poster session

Thursday 27/09/2012
Round table: Education in Bioinformatics
Oral session 2
Break
Lecture: Dr. Hernán Dopazo
Lunch
Oral session 3
Break
Lecture: Dr. Pablo Meyer-Rojas
Poster session

Friday 28/09/2012
Oral session 4
Break
Lecture: Dr. Guilherme Correa-Oliveira
Lunch
Lecture: Dr. Diego Milone
Closing of 3CAB2C
A2B2C assembly

The program may change in the coming days; any changes will be announced as soon as possible.

Papers by session

Wednesday 26

Oral Session 1
Wednesday 26, 14:30 to 16:30, Room 4
Chair: Ariel Chernomoretz
1) Design and virtual screening of new anti-HIV integrase inhibitors
M. A. Quevedo, M. C. Briñón, Departamento de Farmacia - Facultad de Ciencias Químicas - Universidad Nacional de Córdoba
2) Identification of binding motifs in large-scale peptide data sets using a Gibbs sampling approach
M. Nielsen, O. Lund, M. Andreatta, Technical University of Denmark
3) Comparing the Bonferroni and the Benjamini-Hochberg procedures
D. M. Kelmansky, S. Ferro, Instituto de Cálculo FCEN UBA, Instituto de Cálculo FCEN-UBA
4) Metabolic pathfinding based on genetic algorithms
M. Gerard, G. Stegmayer, D. Milone, Conicet

Thursday 27

Oral Session 2
Thursday 27, 10:00 to 12:00, Room 4
Chair: Elmer Fernandez
1) Assessing protein-disease association significance from candidate ranking lists
A. J. Berenstein, I. Ibañez, A. Chernomoretz, Universidad de Buenos Aires - Instituto Leloir, Universidad de Buenos Aires - Instituto Leloir
2) Reverse engineering HD-Zip transcriptional regulatory networks (Ft. Information Theory)
A. L. Arce, M. Capella, D. Ré, R. L. Chan, A. Chernomoretz, Instituto de Agrobiotecnología del Litoral, Fundación Instituto Leloir
3) Advantages of balanced classifier design on microarray data classification
M. Brun, I. Pagnuco, V.
Ballarin, Facultad de Ingeniería UNMdP, Facultad de Ingeniería UNMdP-CONICET
4) Conformational diversity and evolutionary rates in proteins
D. Zea, M. S. Fornasari, C. Marino Buslje, G. Parisi, Fundación Instituto Leloir, Universidad Nacional de Quilmes, SBG Universidad Nacional de Quilmes

Oral Session 3
Thursday 27, 14:30 to 16:30, Room 4
Chair: Ignacio Sanchez
1) Glycobioinformatics: Using solvent structure to predict and characterize protein carbohydrate complexes
M. Marti, University of Buenos Aires
2) Eukaryotic secretory pathway proteins avoid occluded N-glycosylation sequons
M. López Medus, G. E. Gómez, P. M. Couto, L. Landolfo, J. J. Caramelo, Fundación Instituto Leloir-Conicet-IIBBA, Fundación Instituto Leloir-Conicet-IIBBA, Departamento de Química Biológica-FCEN-UBA, Fundación Instituto Leloir
3) Dissecting relationships between sequence, structure and functions in the Ankyrin Repeat Protein Family
R. Gonzalo Parra, R. Espada, D. U. Ferreiro, Protein Physiology Lab, Dpto de Química Biológica, FCEyN-UBA and CONICET
4) Structure-Function Prediction of Highly Variable Sub-sequences of Protein Subfamilies
M. V. Revuelta, A. ten Have, IIB-CONICET-UNMdP

Friday 28

Oral Session 4
Friday 28, 10:00 to 12:00, Room 4
Chair: Fernán Agüero
1) The Comparisons of Sequences with the Nucleotide Database (NCBI) and the BLAST tool. What information can we obtain?
V. E. Firmenich, M. E. Fernández Feijóo, M. B. Espinosa, CONICET, ALS Group, UBA
2) Alternative models about the origin of Ribosome Inactivating Proteins genes
W. Lapadula, M. Juri Ayub, M. V. Sanchez-Puerta, Instituto de Biología Agrícola de Mendoza (IBAM), Lab. Biol. Mol. UNSL. IMIBIO-SL (CONICET)
3) Phylogenetic relationships of Rhinella arenarum beta-catenin. A developmental biology useful model
M. A. Hasenahuer, C. D. Galetto, V. H. Casco, M. F.
Izaguirre, Facultad de Ingeniería, UNER
4) HMMerCTTer: Tailor-made Decision Making for the Semi-automatic Clustering of large Protein Superfamilies
H. G. Bondino, I. A. Pagnuco, M. V. Revuelta, M. Brun, A. ten Have, IIB-CONICET-UNMdP, Laboratorio de Procesamiento Digital de Imagenes, FI-UNMdP, Advanta Semillas SAIC Centro de Investigación en Biotecnología

Wednesday 26

Poster Session 1
Wednesday 26, Room 3
1) High throughput pyrosequencing and bioinformatics of a multi-extreme environment
N. Rascován, S. Revale, E. Mancini, M. Vázquez, M. E. Farías, INDEAR, PROIMI
2) Non extensive statistics generalization of Jensen Shannon divergence for DNA sequence analysis
M. Ré, P. Lamberti, FRC - UTN; FaMAF - UNC, FaMAF - UNC
3) Distribution of bioactive peptides in NR
A. E. Nardo, M. C. Añón, G. Parisi, Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Roque Saenz Peña 182, Bernal B1876BXD, Argentina, Centro de Investigación y Desarrollo en Criotecnología de Alimentos (CIDCA, Universidad Nacional de La Plata, CONICET), La Plata 47 y 116 (1900), Argentina.
4) The relation between the divergence of sequence and structure in intrinsically disordered proteins
N. Palopoli, J. Glavina, I. E. Sánchez, Universidad de Buenos Aires
5) Design of a pipeline for de novo identification of cis-regulatory elements involved in transcriptional re-programming during tomato fruit development and ripening
T. Duffy, F. Carrari, Instituto Nacional de Tecnología Agropecuaria
6) Identification of putative subtelomeric regions in the genome of Toxoplasma gondii
S. Carmona, M. C. Dalmasso, S. Angel, F. Agüero, IIB-INTECH-UNSAM-CONICET
7) Phylogeny of fungal species of genus Aspergillus using ITS sequences
M. Cossio, G. Sioli, G.
Perona, INBIOMIS
8) On line comparison of sequences alignment and phylogenetic analysis of native Trichoderma sp from Misiones province
G. Sioli, L. Castrillo, M. Cossio, N. Amerio, M. I. Fonseca, L. Villalba, P. Zapata, INBIOMIS
9) Prediction of blood to liver coefficients for volatile organic compounds: a cheminformatics approach
D. Palomba, M. J. Martinez, I. Ponzoni, M. Díaz, G. E. Vazquez, A. Soto, Laboratory for Research and Development in Scientific Computing (LIDeCC), DCIC, UNS, Faculty of Computer Science, Dalhousie University, Halifax, Canada, Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET-UNS
10) Predicting Protein Function from Sequence and Structural Data: a Globin's Family Case
J. P. Bustamante, M. Marti, D. Estrin, INQUIMAE
11) Variations of ligand binding affinity upon protein conformational diversity
E. I. Juritz, A. Monzón, Quilmes National University
12) Effect of the O-glycosylation in the binding of Extensins to Peroxidases.
A. Aptekmann, J. Estevez, A. Nadra, UBA-QB, UBA-FBMC
13) Using Computer Simulations to Understand Enzyme Mechanisms: Application to Mycobacterium tuberculosis CYP121 Unusual Reaction
V. G. Dumas, L. Defelipe, A. Petruk, A. Turjanski, M. Martí, Departamento de Química Biologica e Inquimae-Conicet, Facultad de Ciencias Exactas y Naturales, UBA, Departamento de Química Inorgánica, Analítica y Química Física e Inquimae-Conicet, Facultad de Ciencias Exactas y Naturales, UBA
14) Theoretical studies of membranes at different thermotropic phases in salts solutions by molecular dynamics.
F. E. Herrera, M. D. L. M. Sales, D. E. Rodrigues, FBCB
15) Online modeling of Endoglucanases from Aspergillus genus using PHYRE2
M. Cossio, G. Sioli, G. Perona, INBIOMIS
16) Comparison of two homology based protein structure online software
M. Perona, M. Molina, M.
Cossio, INBIOMIS
17) Relative mobility of epitopes residues in immunogenic proteins
M. Astorga, S. Fernández Alberti, G. Parisi, Universidad Nacional de Quilmes, Universidad Nacional de La Plata, Universidad Nacional de La Plata
18) Identification of putative LxCxE motifs targeting the retinoblastoma protein in human viruses by structure- and sequence-based calculations
J. Glavina, L. B. Chemes, G. de Prat-Gay, I. E. Sánchez, Universidad de Buenos Aires, Fundación Instituto Leloir
19) Design of novel DNA-binding specificity in proteins from the "zinc finger" family
B. Basanta, A. Alibes, L. Serrano, A. Nadra, Centre for Genomic Regulation, Universidad de Buenos Aires
20) Diversity and evolution of retinoblastoma protein-binding LxCxE motifs in human proteins
L. B. Chemes, J. Glavina, I. Sanchez, G. de Prat-Gay, Fundación Instituto Leloir, Protein Physiology Lab, Universidad de Buenos Aires, Fundacion Instituto Leloir
21) Molecular Dynamics and Circular Dichroism Study of VBT:VBA Polymers (1:1 and 1:4). Structure and Dynamics comparison.
A. Fuselli, S. Garay, D. Martino, D. Rodrigues, Facultad de Bioquímica y Cs. Biológicas - UNL - INTEC (UNL-CONICET), Facultad de Bioquímica y Cs. Biológicas - UNL
22) Lateral pressure effects on structural properties of DPPC lipid bilayers in gel and LC phases: a molecular dynamics study
S. A. Garay, J. F. Quaranta, D. E. Rodrigues, Facultad de Bioquímica y Cs. Biológicas - UNL - INTEC (UNL-CONICET), Facultad de Bioquímica y Cs. Biológicas - UNL
23) Comparison of Classifier Design Algorithms on a Small Sample Microarray Data
I. Pagnuco, M. Brun, V. Ballarin, Facultad de Ingeniería UNMdP, Facultad de Ingeniería UNMdP-CONICET
24) Metagenomics and metatranscriptomics of soil microbial communities developing in bulk and rhizospheric soils of the Argentinean Pampa region.
N. Rascovan, B. Carbonetto, E. Mancini, M. Reinert, S.
Revale, M. Vazquez, Plataforma de Genómica y Bioinformática, INDEAR, Rosario, Santa Fe, Argentina.
25) Following the tracks of the Trypanosoma cruzi prenylome
E. Porta, G. Labadie, IQUIR-CONICET
26) Web-based gene-expression analysis using the plant biology analysis tools: GENEVESTIGATOR
M. G. Acosta, M. A. Ahumada, S. L. Lassaga, V. H. Casco, LAMAE - FI-UNER y Cátedra de Biología, FCA-UNER, LAMAE - FI-UNER, Cátedra de Biología, FCA-UNER, Cátedra de Genética y Mejoramiento Vegetal, FCA-UNER
27) DIGESuite: a Cytoscape plug-in for 2D-DIGE analysis
S. Taleisnik, J. Mishima, C. Fresno, M. Semrik, G. Ribero, G. Merino, L. Prato, A. Llera, E. Fernandez, BioScience Data Mining Group - Fac. Ing. - UCC, CONICET, Universidad Nacional de Villa María, Fundación Instituto Leloir, CONICET, Facultad de Ingeniería - UNER, Universidad Católica de Córdoba, CONICET, Universidad Católica de Córdoba
28) Strategies for gap-closure of Thermus sp. 2.9 genome
L. Navas, A. Amadío, R. Zandomeni, Instituto de Microbiología y Zoología Agrícola (IMyZA), Instituto Nacional de Tecnología Agropecuaria (INTA), Las Cabañas y de Los Reseros, Buenos Aires, Argentina, CONICET – EEA Rafaela, Instituto Nacional de Tecnología Agropecuaria (INTA)
29) Analysis of variability of Mal de Río Cuarto virus (MRCV) through haplotype networks
M. A. García, M. D. L. P. Giménez Pecci, J. B. Cabral, I. G. Laguna, F. Maurino, C. H. Vera, INTA IPAVE - CIAP, CONICET, INTA IPAVE - CIAP, UTN FRC
30) One vs One Artificial Neural Network strategy for gene expression multiclass classification
L. Remon, L. Juárez, D. Arab Cohen, C. Fresno, L. Prato, L. Villoria, E. Fernandez, Universidad Nacional de Villa Maria, Universidad Catolica de Cordoba, CONICET, Universidad Catolica de Cordoba
31) SVM Tree with Optimal Multiclass Partition applied to Gene expression signature classification
M. Pallarol, D. Arab Cohen, C. Fresno, L. Prato, E.
Fernandez, Universidad Nacional de Villa Maria, Universidad Catolica de Cordoba, CONICET, Universidad Catolica de Cordoba, Biomedical Data Mining Group UCC

Thursday 27

Poster Session 2
Thursday 27, Room 3
1) 25S-18S ribosomal nature of the not NOR-associated highly GC-rich heterochromatin of chili peppers (Capsicum-Solanaceae)
M. Grabiele, H. Debat, M. Scaldaferro, G. Seijo, D. Ducasse, E. Moscone, D. Martí, IBONE-UNNE-CONICET, IBS-UNaM-CONICET, IMBIV-UNC-CONICET, IFFIVE-INTA
2) FuL: A Logic processor to aid design and validate virological experiments
A. Kondrasky, D. Gutson, C. Areces, FuDePAN
3) 14-3-3 isoforms subfunctionalization revealed by systems biology analysis of cross-talk between phosphorylation and lysine acetylation
M. Uhart, D. Bustos, INTECH
4) Simulation of pesticide effect on thermo-dependent arthropod populations: fixed point iteration method
C. Bartó, J. Edelstein, E. Trumper, INTA, UN Córdoba
5) Honeybees colony virtual simulation, step 2
M. Migueles, L. Gende, L. Defeudis, P. Macri, M. Churio, M. Eguaras, L. Braunstein, Universidad Nacional de Mar del Plata, Universidad Nacional de Mar del Plata-CONICET
6) Unraveling the molecular basis of mammalian inner ear evolution: analysis of the outer hair cell cytoskeleton protein spectrin
F. Pisciottano, B. Elgoyhen, L. Franchini, INGEBI - CONICET
7) Characterization of long interspersed non-LTR elements in section Arachis
S. Samoluk, D. Carisimo, G. Robledo, G. Seijo, Instituto de Botánica del Nordeste, Instituto de Botánica del Nordeste, Facultad de Ciencias Exactas y Naturales y Agrimensura (Universidad Nacional del Nordeste)
8) Estimation of Species Richness in Microbial Communities
C. Santa Maria, M. Soria, UNLAM, Universidad de Buenos Aires. Facultad de Agronomía
9) Conformational diversity and evolutionary rates in proteins
D. Zea, M. S. Fornasari, C. Marino Buslje, G.
Parisi, Fundación Instituto Leloir, Universidad Nacional de Quilmes, SBG Universidad Nacional de Quilmes
10) CoDNaS database: The conformational diversity of proteins and its relationship with biological properties
A. Monzon, G. Parisi, E. Juritz, UNER-UNQ, CEI-UNQ
11) Attacking Mycobacterium tuberculosis in the dormant phase: A Combination of expression data with structural druggability and nitrosative stress sensitivity
L. G. Radusky, L. A. Defelipe, A. G. Turjanski, M. Martí, Departamento de Química Biológica - Universidad de Buenos Aires
12) INTA bioinformatic platform: An approach using ontology driven database and web interface to integrate and explore genomic data
S. Gonzalez, B. Clavijo, M. Rivarola, P. Fernandez, M. Farber, N. Paniego, INTA-CONICET, INTA-FIUBA, INTA
13) Computational Simulation of inclusion ways of Sulfamethoxazole and Sulfadiazine in Cyclodextrins
L. Erbes, UNER
14) How much information keeps the solvation structure of a Crystal Protein?
C. Modenutti, D. Gauto, L. Radusky, S. Hajos, M. Marti, University of Buenos Aires
15) Digitization Project in MACN: the importance of standard protocols to obtain high quality taxonomic information
P. Cossi, C. Zimicz, M. C. Luna, N. Andón, N. Cuadra, M. B. Bukowski Loináz, M. J. Ramírez, Museo Argentino de Ciencias Naturales "Bernardino Rivadavia"; Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Museo Argentino de Ciencias Naturales "Bernardino Rivadavia"
16) Computational, biochemical, and spectroscopic studies of the copper-containing nitrite reductase from the denitrifier Sinorhizobium meliloti 2011
M. C. Gómez, F. M. Ferroni, A. C. Rizzi, S. D. Dalosto, C. D. Brondino, INTEC, UNL
17) Software integration to bioimage management, processing and analysis
J. E. Diaz-Zamboni, L. Bugnon, E. V. Paravani, C. D. Galetto, J. F. Adur, V. Bessone, M. Bianchi, M. G. Acosta, S.
    J. Laugero, V. H. Casco, M. F. Izaguirre, Laboratorio de Microscopia Aplicada a Estudios Moleculares y Celulares - Facultad de Ingeniería - Universidad Nacional de Entre Ríos
18) Image analysis to control the roast level of the peanut
    I. Arévalo, S. Ojeda, FAMAF
19) Comparison of the ability to predict true linear B-cell epitopes by on-line available prediction programs
    J. G. Costa, P. L. Faccendini, S. S. Sferco, C. M. Lagier, I. S. Marcipar, IQUIR, Depto. de Química Analítica, Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario, Suipacha 531, Rosario; Laboratorio de Tecnología Inmunológica, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Paraje El Pozo, Santa Fe; Laboratorio de Tecnología Inmunológica, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Paraje El Pozo, Santa Fe; Departamento de Física, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Paraje El Pozo, Santa Fe; and INTEC (CONICET-UNL), Güemes 3450, Santa Fe
20) Relationship between divergence of using synonymous codons in host/virus and the presence of microRNA
    F. Riberi, L. Tardivo, L. Fazzi, G. Biset, D. Gutson, D. Rabinovich, Instituto Biomédico en Retrovirus y SIDA-INBIRS; Fundación para el Desarrollo de la Programación en Ácidos Nucleicos-FuDePAN; Universidad Nacional de Río Cuarto; Instituto Biomédico en Retrovirus y SIDA-INBIRS; Fundación para el Desarrollo de la Programación en Ácidos Nucleicos-FuDePAN
21) A pipeline for structural annotations in bacterial genomes
    E. Lanzarotti, L. Defelipe, L. Radusky, M. Marti, A. Turjanski, Departamento de Química Biológica, FCEN - UBA; Departamento de Química Biológica e INQUIMAE, FCEN - UBA
22) VISI: a computational program for antiviral strategies comparison
    D. Gutson, P. Oliva, L. Ramos, P. Pury, F. Herrero, D.
    Rabinovich, FuDePAN, FaMAF, Universidad Nacional de Córdoba
23) 1D model of the pulse wave along the systemic arteries
    C. E. Saavedra Fresia, F. E. Menzaque, National University of Tucuman, National University of Cordoba
24) Agi4x44.2c: a two-colour Agilent 4x44 Quality Control R library for large microarray projects
    G. Gonzalez, C. Fresno, G. Merino, A. Llera, O. Podhajcer, E. Fernandez, Laboratorio de Terapia Celular y Molecular, Instituto Leloir; Grupo de Minería de Datos en Biociencias, Facultad de Ingeniería, UNER
25) MSA2MI: A server to calculate and visualize mutual information in multiple sequence alignments
    F. L. Simonetti, M. Nielsen, C. Marino Buslje, Center for Biological Sequence Analysis, Fundación Instituto Leloir
26) GOboot: towards a robust SEA analysis
    C. Fresno, A. Llera, M. R. Girotti, M. P. Valacco, J. A. López, L. Zingaretti, L. Prato, O. L. Podhajcer, M. G. Balzarini, F. Prada, E. Fernandez, BioScience Data Mining Group - Fac. Ing. - UCC, CONICET, National Center for Cardiovascular Research, Madrid, Spain, Biometry Laboratory, National University of Córdoba, The Institute of Cancer Research, London, UK, Fundación Instituto Leloir, CONICET, of Technology, School of Engineering and Sciences, UADE, Instituto A.P. de Ciencias Básicas y Aplicadas, Universidad Nacional de Villa Maria, Universidad Catolica de Cordoba
27) Development of an algorithm to detect distant orthologous genes in baculoviridae family.
    J. Iserte, M. Garavaglia, S. Miele, M. Belaich, D. Ghiringhelli, Universidad Nacional de Quilmes
28) Computational prediction of the biological effects of mutations in OTC gene in Argentinian patients
    S. M. Silvera Ruiz, J. A. Arranz Amo, L. E. Laróvere, R.
    Dodelson de Kremer, Unitat Metabolopaties, Hospital Universitari Materno-Infantil Vall d’Hebron, CEMECO, Hospital de Niños de Córdoba
29) In silico prediction of cross-reactive epitopes of the major soybean allergen Gly m Bd 30K (P34) with bovine caseins and their analysis by immunochemical methods.
    A. Candreva, G. Parisi, G. Docena, S. Petruccelli, CIDCA UNLP, La Plata 47 y 116; Departamento de Ciencia y Tecnología, UNQ, Roque Saenz Pena 182, Bernal; LISIN, FCE, UNLP, La Plata, 47 y 115.
30) Evolutionary and structural analysis of procirsin, a typical plant aspartic proteinase zymogen
    D. Lufrano, S. Vairo Cavalli, G. Parisi, LiProVe - Facultad de Ciencias Exactas, UNLP; Departamento de Ciencia y Tecnología, UNQ
31) BiFe: a national EMBNet node hosting Argentine bioinformatics applications
    Embnet Node Argentina, Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
32) Construction of phylogenetic trees from Trichoderma sp using the program MEGA 5.10
    G. Sioli, L. Castrillo, M. Cossio, N. Amerio, L. Villalba, P. Zapata, INBIOMIS

Abstracts

Comparing the Bonferroni and the Benjamini-Hochberg procedures
Sebastián Ferro, Diana Kelmansky (1)
(1) Instituto de Cálculo, FCEN-UBA

The Bonferroni procedure for the correction of p-values in multiple tests, based on the control of the probability of at least one false positive, i.e. the Family Wise Error Rate (FWER), has been criticized as being too restrictive, resulting in low power in large-scale multiple testing situations such as those arising in genomic experiments, e.g. microarrays.
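As a point of reference for the two procedures compared in this abstract, their decision rules can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function names and the example p-values are hypothetical, the Bonferroni rule rejects a hypothesis when its p-value is at most level/m, and the Benjamini-Hochberg rule is the standard step-up procedure at FDR level q.

```python
from typing import List


def bonferroni_reject(pvalues: List[float], level: float = 0.05) -> List[bool]:
    """Reject H_i when p_i <= level / m (m = number of tests)."""
    m = len(pvalues)
    return [p <= level / m for p in pvalues]


def benjamini_hochberg_reject(pvalues: List[float], q: float = 0.05) -> List[bool]:
    """Step-up rule: find the largest rank k with p_(k) <= k * q / m
    and reject the k hypotheses with the smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni_reject(pvals, 0.05))
n_bh = sum(benjamini_hochberg_reject(pvals, 0.05))
print(n_bonferroni, n_bh)  # prints: 1 2
```

With these illustrative values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which is the kind of power difference the criticism above refers to.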
Given that Bonferroni also controls the Per Family Error Rate (PFER), i.e. the expected number of false positives, this disadvantage, which is not intrinsic to the procedure but due to the extreme restrictions of its application using the FWER, can be overcome. This work, following the proposal of Gordon, Glazko, Qiu and Yakovlev (2007), shows that it is possible to equalize the errors, i.e. to select a PFER for the Bonferroni approach that results in a given false discovery rate (FDR). Under error equalization, similar power levels are obtained for both procedures. However, the Bonferroni procedure is more stable than the Benjamini-Hochberg procedure regarding the variance of the total number of discoveries and of the number of true discoveries. In practice this means that estimates of the FDR being controlled are less reliable than those of the expected number of false positives. In addition to verifying the results of Gordon, Glazko, Qiu and Yakovlev (2007), the results are extended to real situations regarding the degree of correlation among gene expression levels. This extension was made possible by modifying the initially proposed algorithm, which reduced processing time and enabled better precision in the error equalization (Ferro 2011).

Key words: multiple testing, Bonferroni, FDR

References
A. Gordon, G. Glazko, X. Qiu, and A.Y. Yakovlev. Control of the mean number of false discoveries, Bonferroni, and stability of multiple testing. Annals of Applied Statistics, 1:179-190, 2007.
S. Ferro. "A more efficient and precise comparison of Bonferroni and Benjamini-Hochberg procedures". Tesis de Licenciatura en Cs. de la Computación. FCEN-UBA. 2011.

Glycobioinformatics: Using solvent structure to predict and characterize protein carbohydrate complexes
Marcelo A.
Martí
Departamento de Química Biológica e INQUIMAE-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón 2, Buenos Aires, C1428EHA, Argentina

Formation of protein-ligand complexes is a fundamental process in biochemistry. In silico methods that predict the structure of complexes, known as docking methods, are widely used and are an essential part of many rational drug development programs. The potential and reliability of any docking method lie in its capability to correctly predict the complex structure, taking as a starting point the separate structures of the protein and the ligand. Nevertheless, given the approximations involved in the underlying theory, the predictions are not always successful. Carbohydrate binding proteins are a large and diverse group of biomolecules displaying a wide variety of biological activities including cell recognition, communication and cell growth. In this context, understanding protein-carbohydrate interactions at the molecular level, with atomic resolution, is of fundamental importance for basic and applied glycobiology. A common but usually overlooked feature of carbohydrates is that their polar OH groups quite frequently bind to hydrophilic patches of the protein surface, resulting in significant solvent displacement and reorganization. Water molecules and carbohydrate OH groups can participate in similar hydrogen bonding networks when establishing contacts with protein surfaces. With this in mind, we sought to use this information to predict protein-carbohydrate complexes in silico with higher accuracy than conventional docking methods. Analyzing the solvent structure at the protein surface is not an easy task.
One of the most powerful methods for studying solvent structure is based on inhomogeneous fluid solvation theory (IFST), which allows several properties of the water molecules to be determined from a plain Molecular Dynamics (MD) simulation. Using this methodology, we were recently able to show that solvent structure and dynamics at the protein surfaces involved in carbohydrate binding are very different from those of the bulk solvent, allowing the identification of the so-called water sites (WS) or hydration sites. The WS correspond to well-defined regions adjacent to the protein surface where the probability of finding a water molecule is significantly higher than in the bulk solvent, and they can be further characterized thermodynamically using the IFST. In the present work, we used the characterization of the WS in the carbohydrate binding sites (CBS) of several carbohydrate binding proteins to modify the scoring function of the docking program Autodock, in order to determine the corresponding protein-ligand complexes in silico. Our results show that the modified function significantly improves the quality and accuracy of the results, both in terms of how closely the predicted complex structure resembles the real one (i.e. the one obtained by crystallography) and in the discrimination of true from false positives and negatives. The resulting solvent-structure-biased docking protocol is thus a powerful tool for the design and optimization of glycomimetic drugs and for the basic understanding of protein-carbohydrate interactions.

1. Carbohydrate-binding proteins: Dissecting ligand structures through solvent environment occupancy. Diego F. Gauto, Santiago Di Lella, Carlos M. A. Guardia, Darío A. Estrin and Marcelo A. Martí, J. Phys. Chem. B. 2009, 113(25):8717-8724.
2. Structural basis for ligand recognition in a mushroom lectin: solvent structure as specificity predictor.
Gauto DF, Di Lella S, Estrin DA, Monaco HL, Martí MA. Carbohydr Res. 2011, 346(7):939-48.
3. Characterization of the Galectin-1 Carbohydrate Recognition Domain in Terms of Solvent Occupancy. Di Lella S, Marti MA, Alvarez RM, Estrin DA, Ricci JC. J Phys Chem B. 2007, 111(25):7360-7366.

Using Computer Simulations to Understand Enzyme Mechanisms: Application to Mycobacterium tuberculosis CYP121 Unusual Reaction
Victoria G Dumas, Lucas A Defelipe, Ariel A Petruk, Adrian G Turjanski and Marcelo A Martí
Departamento de Química Biológica e INQUIMAE-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón 2, Buenos Aires, C1428EHA, Argentina

Understanding enzyme mechanisms at the molecular level provides invaluable information for protein inhibitor design, de novo protein engineering and general biochemistry. Computer simulation methods, mainly those based on Quantum Mechanics (QM) and classical Molecular Dynamics (MD), provide an extraordinary tool to study enzyme reaction mechanisms, since they make it possible to simulate the reaction and watch it happen in the computer [1,2]. Cytochromes of the P450 type (CYPs) are a large and ubiquitous family of heme proteins which usually catalyze oxidation (in most cases hydroxylation) of organic compounds. In mammals they are responsible for the metabolism of the majority of pharmacological compounds, and they are also studied as potential antimicrobial targets or as potential enzymes for biotechnological purposes. Among the 20 CYPs encoded by the Mycobacterium tuberculosis (Mt) genome, CYP121 was found to be essential for the viability of the bacilli, making it a potential target for antitubercular drug design. Interestingly, the mechanism by which CYP121 carries out its activity remains unknown.
There is evidence suggesting that this protein catalyzes the formation of a C-C bond between the two aromatic rings of the cyclopeptide cyclo(L-Tyr-L-Tyr) (cYY), resulting in a new chemical entity [3,4], a reaction which is quite unusual for CYPs. In this work, we have used a combination of classical molecular dynamics (MD) and hybrid quantum-classical (QM/MM) methodologies in order to elucidate the reaction mechanism carried out by this interesting and important protein. By means of classical simulations we could see the effect of the protein in restraining the movement of the ligand, allowing the two carbon atoms to be activated and positioned in sufficient proximity to allow covalent linkage. We used hybrid QM/MM methods to calculate the free energy profile of the reaction, showing that C-C bond formation involves a spin shift along the reaction, resulting in a moderate barrier due to spin crossing. Taken together, our results allow for a better understanding of this interesting enzyme and of the general reaction mechanism of CYP proteins.

References
1. Capece L, Lewis-Ballester A, Yeh S-R, Estrin DA & Marti MA (2012). Complete reaction mechanism of indoleamine 2,3-dioxygenase as revealed by QM/MM simulations. The Journal of Physical Chemistry B, 116(4), 1401-1413.
2. Lewis-Ballester A, Batabyal D, Egawa T, Lu C, Lin Y, Marti MA, Capece L, et al. (2009). Evidence for a ferryl intermediate in a heme-based dioxygenase. Proceedings of the National Academy of Sciences of the United States of America, 106(41), 17371-6.
3. Belin P, Le Du MH, Fielding A, Lequin O, Jacquet M, Charbonnier J-B, Lecoq A, et al. (2009). Identification and structural basis of the reaction catalyzed by CYP121, an essential cytochrome P450 in Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America, 106(18), 7426-7431.
4. McLean KJ, Carroll P, Lewis DG, Dunford AJ, Seward HE, Neeli R, Cheesman MR, et al. (2008).
Characterization of active site structure in CYP121. A cytochrome P450 essential for viability of Mycobacterium tuberculosis H37Rv. The Journal of Biological Chemistry, 283(48), 33406-16.

Design and virtual screening of new anti-HIV integrase inhibitors.
Mario Alfredo Quevedo, Margarita Cristina Briñón
Dpto. de Farmacia, Fac. de Ciencias Químicas, Universidad Nacional de Córdoba (UNC), Argentina.

Background information
The integrase (IN) of the human immunodeficiency virus (HIV) is a key enzyme catalyzing viral/host DNA integration. Its inhibition interrupts the viral life cycle, and it is therefore actively studied to design potent and selective anti-HIV drugs. Unfortunately, obtaining IN inhibitors assisted by computational methods has been hindered by the lack of a complete crystal structure of the functional IN intasome. Considering that the crystal structure of the closely related prototype foamy virus (PFV) intasome was obtained recently [1], this work deals with the design and screening of new IN inhibitors based on scaffold A (Fig. 1), a versatile lead for high-throughput chemical synthesis.

[Fig. 1: Scaffold A (chemical structure not reproduced)]

Material and methods
In a first stage, the screening methodology was validated using a set of 16 compounds structurally related to scaffold A, whose anti-HIV activities have been reported. The crystal structure of PFV (PDB: 3OYA) was used for molecular docking procedures. Ionization state and tautomer analyses were performed, after which an exhaustive rigid docking approach was applied based on generated conformer libraries. Ligand analyses, rigid docking and post-processing were performed using software packages developed by OpenEye Inc. In a second stage, a set of 1000 compounds (massive library) was created and subjected to molecular docking screening.
Results
Assay validation: A very good correlation between docking rank and antiviral potency was found, with compounds in the low, mid and high nanomolar IC50 ranges ranking first, second and third, respectively. Only one outlier (a false negative) was found, which was attributed to a chemical substitution pattern different from that of the rest of the compounds. Screening assays: out of the 1000 molecules screened in the massive library, 63 exhibited higher docking rankings than the most potent compound in the training set. The synthetic feasibility of these 63 compounds was assessed, and 12 were selected for further synthesis and anti-HIV evaluation.

Conclusions
The crystal structure of PFV seems adequate for the design and virtual screening of HIV integrase inhibitors, at least in chemical series exploring substitution on R (Fig. 1). Also, the high speed of the search method is compatible with the screening of a large number of compounds.

References
1. Hare S, Gupta SS, Valkov E, Engelman A, Cherepanov P: Retroviral intasome assembly and inhibition of DNA strand transfer. Nature 2010, 464:232-236.

Alternative models about the origin of Ribosome Inactivating Proteins genes
Lapadula Walter Jesús, Sanchez-Puerta M. Virginia and Juri Ayub Maximiliano
Lab. Biol. Mol. UNSL. IMIBIO-SL (CONICET)

Background
Ribosome inactivating proteins (RIPs) are N-glycosidases that depurinate a specific adenine residue in the conserved sarcin/ricin loop of 28S rRNA. The most widely studied examples of RIPs are ricin, a potent toxin of Ricinus communis, and Shiga toxins from enteric bacteria causing hemolytic uremic syndrome. RIP genes have been reported to be present in many plants and a few bacteria. In addition, there is biochemical evidence of RIP activity in a few fungal species.
The analysis of RIP phylogeny is problematic because the sequences are highly divergent and their distribution across species is patchy. It is currently assumed that RIP genes originated in plants and that prokaryotic RIPs were acquired by a single Horizontal Gene Transfer (HGT) event from plants to bacteria [1]. We have recently reported a phylogenetic analysis of RIP sequences [2]. In the present work, we performed exhaustive searches for novel RIP genes in genomic and EST databases. We found novel RIP-encoding sequences from bacteria and, more interestingly, eleven RIP genes in fungal WGS (whole-genome shotgun) data. These results suggest that the current view of RIP phylogeny should be revisited. Therefore, we performed sequence alignments using different algorithms (CLUSTALW, T-COFFEE, MAFFT) and new phylogeny inferences including the novel sequences. The resulting data were analyzed in the context of the phylogenetic relationships among species, in order to propose the most plausible hypothesis. Altogether, the data can be explained by at least two alternative models (see Figure 1):
I. RIP genes originated in plants and were acquired via HGT by fungi and bacteria. This model implies at least three independent HGT events.
II. RIP genes were present in the common ancestor of eukaryotes and bacteria, and were lost in several lineages through evolution. This model implies several loss events in different lineages: archaea, metazoans, many bacteria, etc.
The pros and cons of these models are discussed in this work.

Figure 1. Schematic representation of the tree of life showing the most relevant taxa according to reference [3]. Divergence times in millions of years ago (Ma) are shown based on references [4, 5]. Model I is shown by the appearance of RIP genes in plants (open circle) and three independent HGT events to Gram+ bacteria, Gram- bacteria and fungi (arrows).
Model II is shown by an earlier origin of RIP genes (black circle) and several independent losses in different lineages (grey circles).

References
1. Peumans WJ, Van Damme EJM: Evolution of plant ribosome inactivating proteins. In: Lord JM, Hartley MR (Eds.), Toxic Plant Proteins. 2010, 1–26.
2. Lapadula WJ, Sanchez-Puerta MV, Juri Ayub M: Convergent evolution led ribosome inactivating proteins to interact with ribosomal stalk. Toxicon 2011, 57:427-432.
3. Ciccarelli F, Doerks T, Von Mering C, Creevey C, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life. Science 2006, 311:1283-1286.
4. Feng DF, Cho G, Doolittle RF: Determining divergence times with a protein clock: Update and reevaluation. PNAS 1997, 94:13028–13033.
5. Battistuzzi FU, Feijao A, Hedges B: A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evolutionary Biology 2004, 4:44.

Comparison of Classifier Design Algorithms on a Small Sample Microarray Data
Inti Anabela Pagnuco (1,2), Marcel Brun (1,3), Virginia Ballarin (1)
(1) Lab. de Procesos y Medición de Señales, Fac. de Ingeniería, Univ. Nacional de Mar del Plata, Bs. As., Argentina
(2) CONICET
(3) Departamento de Matemáticas, Fac. de Ingeniería, Univ. Nacional de Mar del Plata

Introduction
Single Nucleotide Polymorphisms (SNPs) are one-base mutations in DNA sequences, usually producing phenotypic changes. Pattern Recognition techniques, including classification, feature selection and error estimation, are used to find the relationship between SNPs and changes in phenotype. In this work we compare three techniques used to design classifiers from samples: plug-in, neural networks (NN), and multi-resolution (MRS) [1], applied to the classification of cattle based on SNPs, using 2 sets of SNPs obtained by a previous feature selection algorithm [2].
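The comparison in this abstract relies on repeated random subsampling: draw n training samples, design a classifier, estimate its error on the samples left out, and average over many draws for increasing n. A minimal sketch of that protocol follows. Everything specific in it is an assumption made for illustration: the data are synthetic, the nearest-centroid classifier is only a stand-in for the plug-in, NN and MRS designs, and the loop stops at n = 130 so that some samples always remain for error estimation.

```python
import random
from statistics import mean, pvariance


def fit_centroids(X, y):
    """Stand-in classifier design: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def learning_curve(X, y, sizes, repeats=100, seed=0):
    """For each training size n, return (mean, variance) of the hold-out error."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    curve = {}
    for n in sizes:
        errors = []
        for _ in range(repeats):
            rng.shuffle(indices)
            train, held_out = indices[:n], indices[n:]
            y_train = [y[i] for i in train]
            if len(set(y_train)) < 2:
                continue  # skip draws that contain a single class
            model = fit_centroids([X[i] for i in train], y_train)
            wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
            errors.append(wrong / len(held_out))
        curve[n] = (mean(errors), pvariance(errors))
    return curve


# Hypothetical synthetic data: 145 samples, 8 features, two classes.
data_rng = random.Random(1)
y = [0] * 73 + [1] * 72
X = [[data_rng.gauss(label, 1.0) for _ in range(8)] for label in y]

curve = learning_curve(X, y, sizes=range(10, 145, 15))
for n, (err, var) in curve.items():
    print(n, round(err, 3), round(var, 5))
```

Plotting the mean error against n, with the variance as an error bar, gives a curve of the kind described in the Results of this abstract.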
Because an important aspect of designing classifiers for genomic studies, including data from microarrays, SNPs and other platforms, is the small number of training samples available, we propose to study how each method behaves as the number of samples becomes small. For this reason, the comparison was done by plotting the cross-validation error as a function of increasing numbers of training samples. These samples were selected randomly, each time, from the available samples.

Results
From the 145 samples, we first used n=10 samples for classifier design, then increased the number in steps of 15 samples until reaching 145 samples. Random samples of size n were obtained, a classifier was designed over these n samples, and its error was estimated on the samples not used for design. This process was repeated 100 times to average the results over the random sampling. The figure shows the results for the two sets of SNPs. The x-axis corresponds to the number of samples used to train the classifier, and the y-axis corresponds to the cross-validation error. The lines are red for NN, yellow for Plug-In, and blue for MRS. The small vertical lines indicate the variance of the estimated error over the 100 realizations. We can see that for large training sets the three methods have similar performance, but the NN performance suffers most from the reduction of training samples (values below 50 in both graphs).

Figure 1: Cross-validation error for (a) the first set of SNPs, and (b) the second set of SNPs.

Conclusion
In this work we observe that NN design may be a poor choice for SNP classification when the number of samples is small. Additional work was done on simulated data and other genotypic studies.

References
[1] U. Braga-Neto and E. Dougherty, Classification, Genomic Signal Processing and Statistics, EURASIP Book Series on Signal Processing and Communication, Hindawi Publishing Corporation, 2005.
[2] Gonzalez, Mariela A.; Brun, Marcel; Corva, Pablo M.; and Ballarin, Virginia. “Análisis de señales genómicas para la clasificación de razas bovinas”, CAI, 2009.

Advantages of balanced classifier design on microarray data classification
Marcel Brun (1,2), Inti Anabela Pagnuco (1,3), Virginia Ballarin (1)
(1) Grupo de Procesamiento Digital de Imágenes, Fac. de Ingeniería, Univ. Nacional de Mar del Plata
(2) Departamento de Matemáticas, Fac. de Ingeniería, Univ. Nacional de Mar del Plata
(3) CONICET

Introduction
The analysis of high-throughput genomic data is an important tool for disease diagnosis and prognosis, as well as in the study of gene-to-gene interaction networks. In this context, classifier design, feature selection and error estimation play a fundamental role in determining the effects of genotype on phenotype. These three tasks usually depend on one another since, for example, feature selection may be an intrinsic part of a classifier design system. The number of samples used to design a classifier is an important factor that affects the quality of the results. Moreover, the number of samples in each class may considerably affect the interpretability and usability of the results. In general, and specifically in genomic signal processing, obtaining a good classifier design is of the utmost importance, since classifiers may be used to determine medical treatment. In this work we analyzed the effect of sample imbalance (between positive and negative classes) on synthetic and real genomic data, compared to design using artificial balancing, extending the analysis done in [1] by applying it to several real and artificial datasets.
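The effect of imbalance on the three error measures used in this abstract can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, the labels encode the minority class as positive (1) as an assumption, and random undersampling is shown only as one common way of balancing, since the abstract does not specify how the artificial balancing was done.

```python
import random


def error_rates(y_true, y_pred):
    """Return (error, FPR, FNR) for binary labels: 1 = positive, 0 = negative."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = len(y_true) - negatives
    return (fp + fn) / len(y_true), fp / negatives, fn / positives


def balanced_indices(y, seed=0):
    """Random undersampling (an assumed balancing scheme): keep as many
    samples of each class as there are in the smaller class."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    k = min(len(pos), len(neg))
    return sorted(rng.sample(pos, k) + rng.sample(neg, k))


# An 850/150 split, with the 150 minority samples taken as positives.
y_true = [0] * 850 + [1] * 150

# A constant classifier that always predicts the majority class.
constant_pred = [0] * len(y_true)
print(error_rates(y_true, constant_pred))  # prints: (0.15, 0.0, 1.0)

# After undersampling, both classes contribute 150 samples to the design set.
design_set = balanced_indices(y_true)
print(len(design_set), sum(y_true[i] for i in design_set))  # prints: 300 150
```

The constant classifier reaches a low overall error while missing every positive sample, which is why the overall error alone can be misleading on unbalanced data.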
Results
We studied the classification error, false positive rate (FPR) and false negative rate (FNR) for several datasets, comparing standard design against balanced design, using binary classification with a multiresolution approach [2]. The error measures were computed using a cross-validation approach. For the analysis we used seven public datasets and three simulated datasets, the latter showing three different balances. Table 1 shows the results (Error, FPR and FNR) for both balanced (B) and unbalanced (NB) design. The last two rows show the number of positive/negative samples and the number of features. The large sample size of the synthetic experiments avoids issues related to small-sample error estimation. In the last column (a very unbalanced dataset) the error of the unbalanced design is very small because a constant classifier does an almost perfect job, as shown by the fact that unbalanced design produces 0% FPR and 100% FNR. Balanced design may increase the overall error, but it generates more realistic values of FPR and FNR. Similar considerations apply to the data obtained from genomic databases.

Table 1: Error, FPR and FNR for balanced (B) and unbalanced (NB) classifier design.

DataSet      Autism  Listeria  Diabetes  Muscle disease  Influenza  Kawasaki disease  Synthetic 1  Synthetic 2  Synthetic 3
Error B      0.09    0.367     0.2       0.164           0.090      0.274             0.492        0.446        0.355
FPR B        0.058   0.264     0.644     0.154           0.057      0.057             0.514        0.312        0.297
FNR B        0.115   0.618     0.014     0.178           0.214      0.214             0.469        0.700        0.678
Error NB     0.093   0.358     0.25      0.178           0.085      0.229             0.498        0.351        0.150
FPR NB       0.062   0.215     0.637     0.062           0.036      0.036             0.520        0            0
FNR NB       0.122   0.699     0.045     0.365           0.292      0.292             0.476        1            1
Class 1/2    7/8     82/34     4/9       21/14           73/18      44/8              500/500      650/350      850/150
# Variables  1000    123       1000      1000            1000       1000              8            8            8

Conclusion
In both synthetic and real data analysis we can see a better performance, regarding FPR and FNR, for balanced classifier design.
The experiments also illustrate the dangers of using the error alone as a quality measure.

References
[1] Brun, M.; Ballarin, V. Data balancing for Phenotype Classification based on SNPs. 2010.
[2] Brun, M.; Dougherty E.; Hirata Jr. R.; Barrera J. Design of optimal binary filters under joint multiresolution-envelope constraint. Pattern Recognition Letters, 2003.

Predicting Protein Function from Sequence and Structural Data: a Globin’s Family Case
Juan P Bustamante, Marcelo A Martí, Darío A Estrin
Department of Inorganic, Analytical and Physical Chemistry, INQUIMAE-CONICET. Faculty of Exact and Natural Sciences, University of Buenos Aires. Buenos Aires, Argentina.

Predicting function from sequence and/or structure is a key issue in bioinformatics. Although a broad functional assignment can be made by assigning a protein to a given family (or domain), determining a protein's particular function is not straightforward. Assuming that function is encoded in the structure through the chemical properties it determines, it is in principle possible to predict a putative function if the relevant properties can be computed. The globin family of heme proteins offers a large, diverse and thoroughly studied set of proteins whose function is tightly related to small-ligand (mainly O2, but also NO [1], CO and H2S) affinity and reactivity with the active-site heme. Globins with high oxygen affinity usually function as O2-redox related enzymes, moderate-affinity globins usually act as oxygen carriers, while low O2 affinity globins are NO or CO sensors. Ligand affinity is determined by the ratio between the association (kon) and dissociation (koff) rates.
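Written out, and assuming simple one-step reversible binding, this ratio takes the following form. The symbols K_A (association constant) and K_D (dissociation constant) are introduced here for illustration and do not appear in the abstract.

```latex
K_A = \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}}, \qquad
K_D = \frac{1}{K_A} = \frac{k_{\mathrm{off}}}{k_{\mathrm{on}}}
```

A high affinity (large K_A) can therefore arise from fast association, slow dissociation, or both, where k_on is the association rate and k_off the dissociation rate.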
The association rate is mainly related to the ligand migration process from the solvent to the protein active site, which is determined by the presence of internal tunnels and cavities and of residues acting as “gates”, while the dissociation rate is determined by the interactions between the protein and the bound ligand. During the last decade our group has developed several in silico methods to characterize both processes based solely on structural information [2,3,4], showing excellent agreement with the experimental data for several particular cases. This prompted us to extend our analysis to a whole family of proteins, in this case the truncated hemoglobins (trHbs). TrHbs are a distinct, widespread phylogenetic group of the globin family, divided into three groups: I, II and III (also labeled N, O and P) [5], for which about 1000 different sequences have been reported and at least one structure has been determined for each subgroup. In the present work, all possible different active-site and tunnel/cavity structures were built as trHb homology models, based on multiple sequence alignments for each group as determined using HMMs. For each possible structural type, the oxygen stabilization in the active site, related to koff, and the free energy profile for small-ligand migration along the tunnels, related to kon, were computed. With these data we were able to assign to each protein a putative oxygen affinity (high, moderate or low) and a relative ligand binding rate (fast or slow), which allow its putative function to be assigned. These results were finally combined with phylogenetic and molecular evolution analyses, together with literature-derived data about the lifestyle of each organism. Our results show that ligand affinity characteristics are randomly distributed among the phylogenetic groups, but are correlated with the organism's lifestyle.
Molecular evolution analysis also shows that small changes (even single-residue changes) may have a dramatic impact on affinity and therefore on protein function, being far more important than global structural changes. In summary, our results not only show that predicting specific functional properties from sequence/structure is possible, but also reveal interesting aspects of the evolutionary history of the globin family at the molecular level.

References
1. Milani M, Pesce A, Nardini M, Ouellet H, Ouellet Y, Dewilde S, Bocedi A, Ascenzi P, Guertin M, Moens L, Friedman JM, Wittenberg JB, Bolognesi M. Structural bases for heme binding and diatomic ligand recognition in truncated hemoglobins. Journal of Inorganic Biochemistry. 2005. 99:97-109.
2. Marti MA, Crespo A, Capece L, Boechi L, Bikiel DE, Scherlis DA, Estrin DA. Dioxygen affinity in heme proteins investigated by computer simulation. Journal of Inorganic Biochemistry. 2006. 100(4):761-70.
3. Capece L, Marti MA, Crespo A, Doctorovich F, Estrin DA. Heme protein oxygen affinity regulation exerted by proximal effects. Journal of the American Chemical Society. 2006. 128(38):12455-61.
4. Forti F, Boechi L, Estrin DA, Marti MA. Comparing and combining implicit ligand sampling with multiple steered molecular dynamics to study ligand migration processes in heme proteins. Journal of Computational Chemistry. 2011. doi:10.1002/jcc.21805.
5. Vuletich DA, Lecomte JTJ. A Phylogenetic and Structural Analysis of Truncated Hemoglobins. Journal of Molecular Evolution. 2006. 62:196–210.

Identification of binding motifs in large-scale peptide data sets using a Gibbs sampling approach
Morten Nielsen, Ole Lund and Massimo Andreatta
Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Lyngby, Denmark

Background
Proteins recognizing short peptide fragments drive a large part of cellular signaling.
Accurate description of the specificities of such peptide-binding receptors can in many cases provide insights into their function, and can be used, for instance, to design disease diagnostics, inhibitor compounds, and vaccines. Recent advances in high-throughput technologies for the generation of peptide data have made it both faster and cheaper to generate large libraries of peptides for the study of peptide-binding protein specificities. Interpretation of such large peptide data sets, however, is not trivial and requires computational methods capable of identifying subtle recurring patterns shared among particular sets of peptides. This task becomes more challenging when the data contain more than one pattern, and/or the motifs are found at different registers within distinct peptides. Several methods have been developed to address this problem [1-4], ranging from simple multiple sequence alignment methods to advanced motif identification methods, including artificial neural networks, hidden Markov models and Gibbs sampling. However, all these methods share the severe limitation of either dealing only with single specificities or requiring the input data to be pre-aligned to a common motif.
Results
Here, we present an algorithm based on Gibbs sampling that aims to go beyond these limitations. The method can simultaneously align and cluster peptide data sets containing an a priori unknown number of specificities. We apply the method to deconvolute binding motifs in a panel of peptide data sets with different degrees of complexity, spanning from the simplest case of pre-aligned fixed-length peptides to cases of unaligned peptide data sets of variable length. Example applications include mixtures of binders to different MHC class I and class II alleles, distinct classes of ligands for SH3 domains, and sub-specificities of the HLA-A*02:01 molecule. The results of the analysis for the SH3 domain peptide data are shown in Figure 1.
Figure 1.
Sequence motifs on SH3 domain binding data clustered in 1 to 3 clusters. a) Sequence motif of the data set aligned in one single cluster. b) Sequence motifs for SH3 domain data split in two clusters. The two groups are in strong agreement with the canonical class I (panel c, 1,892 peptides) and class II (panel b, 498 peptides) types of SH3 domain ligands. c) Sequence motifs when the data is split in 3 clusters. The clusters have sizes of respectively 1,606, 490 and 305 peptides. Data was taken from [3].
Conclusions
A Gibbs clustering algorithm was developed that allows the simultaneous identification of multiple subtle receptor motifs within peptide data sets. In benchmark calculations on data sets containing multiple binding motifs (both pre-aligned fixed-length peptides and unaligned peptides of variable length) the method consistently demonstrated high performance. The Gibbs clustering algorithm is available online as a web server at http://www.cbs.dtu.dk/services/GibbsCluster-1.0.
References
1. Nielsen, M., C. Lundegaard, and O. Lund, Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics, 2007. 8: p. 238.
2. Andreatta, M., et al., NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. PLoS ONE, 2011. 6(11): p. e26781.
3. Kim, T., et al., MUSI: an integrated system for identifying multiple specificity from very large peptide or nucleic acid data sets. Nucleic Acids Res, 2012. 40(6): p. e47.
4. Noguchi, H., et al., Hidden Markov model-based prediction of antigenic peptides that interact with MHC class II molecules. J Biosci Bioeng, 2002. 94(3): p. 264-70.

Metagenomics and metatranscriptomics of soil microbial communities developing in bulk and rhizospheric soils of the Argentinean Pampa region.
Nicolás Rascovan1, Belén Carbonetto1, Santiago Revale1, Estefanía Mancini1, Marina Reinert1 and Martin Vazquez1.
1 Plataforma de Genómica y Bioinformática, INDEAR, Rosario, Santa Fe, Argentina.
Background
Soils are among the most biologically diverse environments on earth, but more than 90% of their microorganisms have not been studied because they cannot be cultured in the laboratory. One strategy developed to overcome this issue is to study microbial communities by sequencing their genomic content, an approach called metagenomics. In the last few years a debate has arisen about the deleterious effect that intensive agriculture might have on the microbial communities of agricultural soils. The goal of the present work is to contribute to this debate by analyzing the microbial communities of Pampean soils from a metagenomic perspective.
Material and methods
DNA extracted from bulk soils of agricultural and non-agricultural sites, and DNA and RNA from rhizospheric soils under two different agricultural managements, were sequenced by high-throughput pyrosequencing using 16S rRNA amplicon sequencing (AS) and whole genome shotgun (WGS). Over 17 Gbp were obtained by WGS and more than 1 million sequences by AS. Amplicon sequences were analyzed using the QIIME software package and WGS data using the MG-RAST annotation and analysis tool. In addition, we used a custom-made analysis pipeline for the metatranscriptomic analysis.
Results and Discussion
From the comparison of metatranscriptomic (cDNA) and metagenomic (gDNA) data we found that the metabolically active microorganisms are significantly different from the total community, suggesting that the relevant microorganisms might be a subset of the whole community at a given time and condition. The custom-made analysis pipeline proved to be a useful tool for comparative metabolic analysis between cDNA and gDNA datasets. We could identify different highly and lowly expressed metabolisms (cDNA level vs. gDNA level) for each agricultural management.
We found that diversity at the taxonomic level is much higher than at the metabolic level, which probably means that there is metabolic redundancy among different species. Microbial communities were found to differ under different agricultural managements at both the metabolic and the taxonomic level, but those differences were not dramatic. We could also identify the species and metabolisms associated with each agricultural management, with each geographic region and with rhizospheric soil. This is the first study of soil microbial communities from a genomic perspective done in Argentina and a good reference for further work. We have started to unravel the mysteries hidden in the complex soil universe, but this is just the first step on a long journey that should also be followed by other scientists in our country.

High throughput pyrosequencing and bioinformatics of a multi-extreme environment.
Nicolás Rascovan1, Santiago Revale1, Estefanía Mancini1, Martin Vazquez1 and María Eugenia Farías2
1 Plataforma de Genómica y Bioinformática, INDEAR, Rosario, Santa Fe, Argentina. 2 Laboratorio de Investigaciones Microbiológicas de Lagunas Andinas (LIMLA), Planta Piloto de Procesos Industriales Microbiológicos (PROIMI), CCT, CONICET, Tucumán, Argentina
Background and Experimental Procedures
The advent of high-throughput DNA sequencing technologies, and particularly pyrosequencing, has opened the door to in-depth studies of environmental microbial communities through metagenomic approaches. In this study, we analyzed 34,549 16S rDNA sequences obtained by PCR and 454 sequencing from microbial communities developing in Laguna Diamante, a multi-extreme environment (pH = 10, high arsenic content and salinity, high altitude and therefore high UV radiation and low O2 pressure).
The sequences were clustered into Operational Taxonomic Units (OTUs) using the uclust algorithm at 0.8, 0.9 and 0.97 similarity, and representative sequences were taxonomically classified using the RDP classifier algorithm on the GreenGenes database. Public datasets from other environments were analyzed using the same procedures for comparison with the Diamante results. Bray-Curtis distances were calculated based on the taxonomic distribution at phylum level, and UPGMA trees were constructed to visualize relatedness between samples.
Results and Discussion
We found an extremely diverse microbial community developing in environmental conditions that most life on earth could not withstand. Taxonomic analysis showed a marked predominance of the phylum Proteobacteria (62% of all sequences), but Bacteroidetes (13%), Firmicutes (6%) and Verrucomicrobia (4%) were also considerably abundant (Figure 6). Moreover, although the primers used were not the best for amplifying Archaea, we could detect a high number of sequences from this group (4%). Most of the sequences (90%) could be classified at least to family level, suggesting that species existing in other environments have developed strategies to evade extreme conditions. Comparison with other environments has shown a closer relationship to very distant locations, such as the Baltic Sea, than to the geographically close Socompa stromatolite, located only 200 km away and under similar conditions. This is the first microbial characterization of Diamante Lake and, together with other works, an important step toward understanding the mechanisms of survival in adverse conditions such as those of ancient earth and of other planets like Mars.
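As an illustration of the Bray-Curtis comparison mentioned in the procedures above, the following is a minimal sketch, not the authors' pipeline (which relied on existing tools). It computes the Bray-Curtis dissimilarity between two phylum-level abundance profiles. The Diamante percentages are taken from the abstract; the second profile and all function and variable names are hypothetical and serve only as an example.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(sample_a) | set(sample_b)
    # Sum of absolute abundance differences over all taxa seen in either sample.
    difference = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    # Sum of all abundances in both samples.
    total = sum(sample_a.get(t, 0) + sample_b.get(t, 0) for t in taxa)
    if total == 0:
        raise ValueError("both samples are empty")
    return difference / total


# Phylum-level percentages reported in the abstract for Laguna Diamante.
diamante = {
    "Proteobacteria": 62,
    "Bacteroidetes": 13,
    "Firmicutes": 6,
    "Verrucomicrobia": 4,
}

# Hypothetical comparison sample, invented for illustration only.
other_site = {
    "Proteobacteria": 40,
    "Bacteroidetes": 20,
    "Firmicutes": 20,
    "Actinobacteria": 10,
}

distance = bray_curtis(diamante, other_site)  # 57 / 175, about 0.326
```

A matrix of such pairwise distances over all samples is the usual input for average-linkage (UPGMA) clustering.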

Digitization Project in MACN: the importance of standard protocols to obtain high quality taxonomic information
Cossi, Paula1,2; Zimicz, Carolina1,2; Luna, María Celeste1; Andón, Noelia N.1,2; Cuadra, Natalia1,2; Bukowski Loináz, María Belén1,2 and Ramírez, Martín J.1,2
1 Museo Argentino de Ciencias Naturales “Bernardino Rivadavia”, CABA, Buenos Aires, Argentina. 2 Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
Global access to biodiversity occurrence data is possible thanks to networks and portals, such as the GBIF and SNDB Data Portals, which integrate biodiversity data from heterogeneous sources. Therefore, the structure of databases with biological content is essential to ensure the accurate exchange of information. Most biological data from specimens housed in natural history museums are still not digitally recorded, or are registered in non-standard formats. The Digitization Project of the biological collections of the Argentine Museum of Natural Sciences “Bernardino Rivadavia” (MACN) began in 2008. Its main purpose is to convert primary data on paper into digital formats, implementing established data standards such as Darwin Core, so that the information is easily available to the scientific community. To achieve this objective, primary data contained in files or catalogues are captured using the application Aurora. Data fields in Aurora are mapped to the corresponding terms in Darwin Core, including taxonomic, temporal and geographic information, as well as collectors and other curatorial details. About 219,000 specimens have been digitized across all the collections of the Museum, including the Invertebrate, Entomology, Herpetology, Ornithology, Arachnology and Mastozoology National Collections, and the National Herbaria of Vascular and Cellular Plants. The digitization project also includes the recording of geo-referenced species data.
Localities are geo-referenced using the point-radius method, taking into account the precision and specificity of the locality description. About 40% of the specimen records have been geo-referenced at present. These data are validated using the DIVA-GIS programme, in order to identify and correct possible errors introduced during the geo-referencing process. The advantages of implementing standard protocols and field tools are fast and easy access to biological information and improved quality of the data housed in the different collections of the museum.

Analysis of variability of Mal de Río Cuarto virus (MRCV) through haplotype networks
Mario Alejandro García1, María de la Paz Giménez Pecci2, Juan Bautista Cabral1, Irma Graciela Laguna2,3, Fernanda Maurino2, Carlos Hugo J. Vera1
1 UTN FRC 2 INTA IPAVE – CIAP 3 CONICET
Analysis of variability through networks
Genetic variability of individuals of the same species can be studied through networks that represent the genetic distances between them [1]. We studied the case of Mal de Río Cuarto virus (MRCV), defining distance measures between the genome profiles of different individuals and creating a network of haplotypes. Topological properties of the network were analyzed. The network was explored in two dimensions, forming space-time environments with different levels of granularity and highlighting the existing profiles (Figure 1).
Figure 1. Network exploration in space and time dimensions
The profiles, called haplotypes (haploid genotypes), were obtained through electrophoretic analysis of the viral dsRNA genome segments, performed on samples from 8 host species, in 13 locations, over 13 seasons [2]. The electrophoretic profile of MRCV is represented by a binary string of length 18, which contains the ten known segments of the virus, some of which can be placed in different positions [3], and two extra genomic bands [4].
The distances between haplotypes were calculated as the Hamming distance plus three special functions that depend on existing knowledge about the virus. Finally, the exploration step led to the observation that, in the first crop years tested, the number of haplotypes and the distance between them were greater than in subsequent crop years. A variability indicator was calculated for each environment and compared with its expected value, confirming the observation made during the exploration and leading to the conclusion that virus variability decreased after the epidemic that occurred during the crop year 1996/97.
Conclusion
The use of networks in the KDD (Knowledge Discovery in Databases) process was very successful and highlighted behavior of the object of study that had not been evident so far. Although an AMOVA analysis [5] and a haplotype analysis by environments [3] had already been performed, the difference or distance between the profiles of each environment could be detected only with the haplotype networks. The main contribution of this case to the KDD process is the proposal of interactive exploration of networks, which turned out to be intuitive and easy to apply. In a human-centered process, where the creativity and experience of the analyst play a key role [6], the proposed process was able to offer a fresh perspective, complementary to the other KDD techniques.
Acknowledgements
UTN1219, FONCyT PICT 06-02486, PICT 143-02, INTA AEPV 214012, MinCyT PROTRI 2010.
References
1. Posada D., Crandall K. A.: Intraspecific gene genealogies: trees grafting into networks. Trends in Ecology and Evolution 2001, 16:37-45
2. Giménez Pecci M.P., Carpane P., Dagoberto E., Laguna I.G.: Variabilidad del perfil electroforético de los segmentos genómicos del virus causal del Mal de Río Cuarto del maíz en Argentina. XIII Congreso Latinoamericano de Fitopatología. VEP-4 2005, Pg.: 562
3.
Giménez Pecci M.P., Carpane P., Murua L., Bruno C., Balzarini M., Laguna I.G.: Variabilidad del Mal de Río Cuarto virus (MRCV) del maíz según frecuencia de haplotipos obtenidos desde perfiles electroforéticos de los segmentos genómicos. Actas de la Academia Nacional de Ciencias 14 2008, 99-107
4. Giménez Pecci M.P., Laguna I.G., García M.A., Carpane P.: Bandas extragenómicas en el perfil electroforético del dsRNA de Mal de Río Cuarto virus del maíz (Fijivirus, Reoviridae). Revista Argentina de Microbiología. Supl. 1 2007, Pg 108
5. Giménez Pecci M.P., Bruno C., Balzarini M., Laguna I.G.: Aplicación del análisis de la varianza molecular en datos de perfiles electroforéticos de segmentos genómicos del Mal de Río Cuarto virus (MRCV) del maíz (Zea mays L.) en Argentina. Actas de la Academia Nacional de Ciencias 13 2007, 141-152
6. Brachman R.J., Anand T.: The Process of Knowledge Discovery in Databases: A Human-Centered Approach. Advances in Knowledge Discovery and Data Mining, MIT Press 1996, 37-58

CoDNaS database: The conformational diversity of proteins and its relationship with biological properties
Alexander Monzón, Ezequiel Juritz and Gustavo Parisi
Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bs. As. Argentina
Background
The native state of a protein is better represented by an ensemble of conformers in equilibrium, which describes the conformational diversity or dynamism of the protein. Conformational diversity is key to understanding essential properties of proteins such as function, enzyme and antibody promiscuity, enzyme catalytic power, signal transduction, protein-protein recognition and the origin of new functions. Crystallographic structures of the same protein obtained under different conditions can be considered representative conformers of the protein native state.
This view is supported by the correlation found between the structural diversity determined by NMR experiments and that derived from different crystallographic structures.
Description
In order to study how the biological properties of proteins are associated with the extent of their conformational diversity, we developed a protein conformational database called CoDNaS (Conformational Diversity of the Native State; preliminary release at http://codnas.unq.edu.ar). For this purpose we collected the redundant set of crystallized structures from the PDB database and obtained 9474 monomeric and homo-oligomeric proteins (accounting for a total of 40565 structures), representing putative conformers of each protein. Using an all vs. all structural alignment between the conformers of each protein, we defined the extent of conformational diversity as the maximum RMSD registered. The average RMSD between conformers is 1.33 Å, with a maximum of 38 Å. By cross-linking our proteins with several databases we gathered a broad spectrum of biological and physicochemical information (such as taxonomy, GO terms, ligands, mutations, oligomeric state, etc.). With our practical definition of conformational diversity, it is easy to relate its extent to different parameters. For example, proteins crystallized in different conditions, such as bound/unbound states, mutant/wild-type states, or with variations in pH and temperature, give average conformational diversities of 1.09 Å, 5.11 Å, 2.64 Å and 5.11 Å respectively. Similar correlations have been obtained, allowing us to study how conformational diversity varies with protein function, sequence similarity, taxonomy and cellular location.
Conclusion
We think that the CoDNaS database is a useful tool for relating conformational diversity to different parameters and properties, allowing us to increase our knowledge of such an important feature of proteins.
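The definition of conformational diversity used above (the maximum RMSD over all pairs of conformers) can be sketched as follows. This is an illustrative sketch only and not the CoDNaS implementation: it assumes the conformers have already been structurally superimposed and that equivalent atoms appear in the same order, so the structural alignment step itself is omitted. All names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Assumes the two structures are already superimposed and that
    position i in both lists refers to the same atom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformers must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return sqrt(squared / len(coords_a))


def conformational_diversity(conformers):
    """Maximum pairwise RMSD over all conformers of one protein."""
    if len(conformers) < 2:
        raise ValueError("at least two conformers are required")
    return max(rmsd(x, y) for x, y in combinations(conformers, 2))


# Toy example with three two-atom "conformers" (coordinates in Å, invented).
conformer_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conformer_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # shifted 1 Å along z
conformer_3 = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]  # shifted 3 Å along z

diversity = conformational_diversity([conformer_1, conformer_2, conformer_3])
```

In the toy example the most distant pair is conformer_1 and conformer_3, so the diversity equals their RMSD of 3 Å.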

Non extensive statistics generalization of Jensen Shannon divergence for DNA sequence analysis
Miguel Ré1,2, Pedro Lamberti2
1 Facultad Regional Córdoba, Universidad Tecnológica Nacional, Maestro López y Cruz Roja Argentina, Ciudad Universitaria, 5010 Córdoba 2 Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Haya de la Torre y Medina Allende, Ciudad Universitaria, 5010 Córdoba
Jensen Shannon Divergence (JSD) is a symmetrized version of the Kullback-Leibler divergence [1]. JSD quantifies the difference between probability distributions. It has been widely applied to the analysis of symbolic sequences by comparing the symbol composition of different subsequences [2]. One advantage of JSD is that it does not require the symbolic sequence to be mapped to a numerical sequence, which is necessary, for instance, in spectral or correlation analyses. Different extensions of JSD have been proposed to improve the detection of sequence borders, in particular for DNA sequence analysis [3-4]. Since its original proposal [5], Tsallis entropy has been used to extend results and applications based on Boltzmann-Gibbs-Shannon entropy. Different Tsallis extensions of JSD have been suggested and their properties analyzed [6-7]. We present here possible extensions of JSD in the Tsallis entropy framework and consider the results obtained when they are applied to DNA sequence analysis.
1. Kullback S, Leibler R: On information and sufficiency. Ann. Math. Stat. 1951, 22: 79-86.
2. Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, Stanley H: Analysis of symbolic sequences using the Jensen-Shannon divergence. Phys. Rev. E 2002, 65: 041905 1-16. And references therein.
3. Arvey A, Azad R, Raval A, Lawrence J: Detection of genomic islands via segmental genome heterogeneity. Nucleic Acids Research 2009, 1-12.
4. Thakur V, Azad R, Ramaswamy R: Markov models of genome segmentation. Phys. Rev.
E 2007, 75: 011915 1-10.
5. Tsallis C: Possible Generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52: 479-487.
6. Martins A, Aguiar P, Figueiredo M: Nonextensive generalizations of the Jensen-Shannon Divergence. arXiv:0804.1653 2008, 1-7.
7. Lamberti P, Majtey A: Non-logarithmic Jensen-Shannon divergence. Phys. A 2003, 329: 81-90
Email: [email protected]

Improving the correlation between experimental and theoretical studies of the interaction between Zidovudine (and novel derivatives) and human serum albumin
Ileana del Rosario Tossolini1, María Cecilia Gómez1
1 Facultad de Ingeniería, Universidad Nacional de Entre Ríos (U.N.E.R), Oro Verde, Argentina
Despite the effectiveness of Zidovudine (AZT) in the treatment of the acquired immunodeficiency syndrome (AIDS), this drug has significant adverse effects, many of them associated with its low binding to plasma proteins, including human serum albumin (HSA). Hence, obtaining AZT prodrugs with higher affinity for HSA is a key strategy for increasing the effectiveness of the drug. In this case, the strategy used was chemical modification at the 5’-OH position of the molecule, attaching several amino acids to produce the AZT derivatives. HSA is present in the body in its pure form (HSAP) and complexed with fatty acids (HSAFA). Because the two species exhibit different biodistribution, these studies were carried out to determine the molecular aspects that lead to the different affinities of the AZT derivatives for both species of HSA. In a previous work, in order to design AZT derivatives with increased affinity for both species of HSA, molecular modeling, docking and molecular dynamics methodologies were applied based on the crystallographic structures of HSAP and HSAFA (PDB 1BM0 and PDB 3B9L, respectively). Molecular modeling techniques were used to find the optimized geometries of the ligands in their minimum energy conformations.
Then, the docking studies were performed on the HSA primary binding site. The energy calculations were applied to each complex trajectory obtained through molecular dynamics, using the MM_PBSA module of the AMBER10 package. Although the values obtained provide evidence of the molecular bases that could lead to the different affinities of the derivatives for both proteins, they do not correlate in the desired manner with the experimental affinity [1]. For this reason, the docking studies were performed again, changing the values of certain parameters. For example, the genetic algorithm was run 40 times in order to obtain more statistically significant results than the previous ones. In this manner, it was possible to validate some of the complexes obtained before, but new configurations were also found, indicating that it would be interesting to perform molecular dynamics on those complexes. Another aspect to take into account, with the aim of improving the energy calculations, is the dielectric constant. The predictions are quite sensitive to the solute dielectric constant, and this parameter should be carefully determined according to the charge of the protein/ligand binding interface [2]. On the basis of that analysis, further studies will focus on changing this parameter so as to obtain a more realistic correlation between the energetic values and the affinity constants. The energy calculations were performed using dielectric constants of 1 and 2, showing a better correlation with the latter value, but the calculations with that parameter still need to be improved.
References
1. Quevedo MA, Ribone SR, Moroni GN, Briñón MC: Binding to human serum albumin of zidovudine (AZT) and novel AZT derivatives. Experimental and theoretical analyses. Bioorg Med Chem 2008, 16:2779-2790.
2. Hou T, Wang J, Li Y, Wang W: Assessing the Performance of the MM/PBSA and MM/GBSA Methods.
The Accuracy of Binding Free Energy Calculations Based on Molecular Dynamics Simulations. J Chem Inf Model 2011, 51:69-82.

Following the tracks of the Trypanosoma cruzi prenylome
Exequiel Porta1, Guillermo Labadie1
1 IQUIR-CONICET (Rosario Chemical Institute), U.N. de Rosario, Suipacha 531, S2002LRK, Rosario, Argentina. Tel: 54-341-4370477 E-mail: [email protected].
Background
Chagas-Mazza disease is one of the world's major parasitic diseases; it affects the Americas and is caused by the protozoan parasite Trypanosoma cruzi. If the condition is not treated in time, the infection attacks the body's vital organs, causing disabling injuries and a slow deterioration leading to death. New chemotherapeutic agents and new targets are needed because current drugs are inefficient and only work in the acute stage. Prenylation refers to the posttranslational modification of proteins with isoprenyl anchors. These modifications are often involved in lipid-mediated membrane association of proteins, as well as in protein-protein interactions of important cellular proteins. In eukaryotes, three enzymes are known to catalyze the transfer of these lipids. Farnesyl transferase (FT) and geranylgeranyl transferase type 1 (GGT1) recognize the CAAX motif at the C-terminus of the protein substrate and attach a farnesyl (15-carbon polyisoprene) or a geranylgeranyl (20-carbon polyisoprene) group, respectively, to the thiol of the cysteine of this motif. The third enzyme, geranylgeranyl transferase type 2 (GGT2 or RabGGT), recognizes the complex of Rab GTPases with a specific Rab accessory protein (REP) and attaches one or two geranylgeranyl groups to cysteines in a more flexible motif. Building on the pharmaceutical industry's extensive search for farnesyltransferase inhibitors (FTase-i) as anticancer agents, the group of Prof. Gelb (Univ.
of Washington) conducted a study of these compounds as antiparasitic agents. They found that FTase inhibitors, which inhibit both the human and the parasite enzyme, have greater cytotoxicity toward the parasite. This finding validated the enzyme as a target for new chemotherapeutic agents, leading different groups to look for specific inhibitors of the parasitic enzyme. In recent years, specific enzyme inhibitors that show antimalarial and trypanocidal activity have been reported in the literature. This discovery raised expectations for the development of new drugs against that target. In contrast to what occurs in humans, very little is known about the role of protein prenylation in parasites. Fewer than 10 of these substrate proteins are known, most of them belonging to the trypanosomatids. This opens a fertile field of research in which the tools of biological and bioorganic chemistry can provide new points of view for the study of the parasitic "prenylome". To study and elucidate the parasitic prenylome, a bioinformatic study of the proteins that might be targets of isoprenylation in T. cruzi is needed. This provides a theoretical framework for the next step: developing the new chemical tools (bioorthogonal probes and fluorescent probes) needed for proteomics applications. The core software tools for the bioinformatic analysis are PrePS, PRENbase, BLAST and T-Coffee. All are freely available online.
Results and Conclusions
Analyzing the entire proteome of T. cruzi (19,906 proteins), a total of 135 proteins with the capacity to be prenylated were found. A large percentage of these proteins may perform functions vital to the parasite. This work will provide the theoretical framework for continuing to study the T. cruzi prenylome using the chemical and biological tools available for proteomics.
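To make the idea of scanning a proteome for prenylation candidates concrete, here is a deliberately simplified sketch that flags sequences ending in a CAAX box (cysteine, two aliphatic residues, any residue). It is not how PrePS or the other tools named above work; dedicated predictors use much richer models. The set of residues treated as "aliphatic", the function names and the example sequences are all assumptions made for illustration.

```python
import re


def has_caax_box(sequence, aliphatic="AVILM"):
    """Return True if the protein sequence ends in C-a-a-X.

    `aliphatic` is an assumed residue set for the two 'a' positions;
    the last position accepts any single-letter amino acid code.
    """
    pattern = re.compile(rf"C[{aliphatic}]{{2}}[A-Z]$")
    return bool(pattern.search(sequence.strip().upper()))


def prenylation_candidates(proteome):
    """Names of the sequences in a {name: sequence} dict that end in a CAAX box."""
    return sorted(name for name, seq in proteome.items() if has_caax_box(seq))


# Hypothetical toy proteome; names and sequences are invented.
toy_proteome = {
    "protein_1": "MKTAYIAGSGCVLS",  # ends in CVLS, matches the motif
    "protein_2": "MKTAYIAGSCVLSA",  # the cysteine is not at position -4
    "protein_3": "MSTNPKPQRCKKS",   # lysines are not in the aliphatic set
}

candidates = prenylation_candidates(toy_proteome)
```

A C-terminal pattern of this kind only captures FT and GGT1 substrates; Rab proteins modified by GGT2 do not follow the CAAX rule and would need a separate search.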

Attacking Mycobacterium tuberculosis in the dormant phase: A combination of expression data with structural druggability and nitrosative stress sensitivity
Leandro G. Radusky1,2*, Lucas A. Defelipe1,2, Marcelo A. Marti1,2, Adrian G. Turjanski1,2
* to whom correspondence should be sent: [email protected]
1. Departamento de Química Inorgánica, Analítica y Química Física/INQUIMAE-CONICET, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón 2, Buenos Aires, C1428EHA, Argentina. 2. Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón 2, Buenos Aires, C1428EHA, Argentina
It is estimated that one-third of the world population is infected with Mycobacterium tuberculosis (Mt), resulting in 1.8 million deaths worldwide (World Health Organization, 2011). The host immune response to tuberculosis (TB) infection relies on phagocytosis of the bacilli by macrophages, resulting in the formation of a granuloma which stops bacterial replication. Inside the granuloma the bacteria face a particularly stressful condition characterized by hypoxia, NO derived from inducible nitric oxide (NO) synthase, and nutrient deprivation. In response they switch to a non-replicative state, usually called the dormancy phase, in which they can remain hidden and alive for decades. Reactivation of latent Mt is a high risk factor for disease development, particularly in immunocompromised individuals. Common treatment of TB involves a long course of the front-line drugs isoniazid, rifampicin, pyrazinamide and ethambutol. However, the emergence of multidrug-resistant and extensively drug-resistant (MDR and XDR) Mt strains, and the negative drug-drug interactions with certain HIV (or other disease) treatments, show the urgent need for new anti-TB drugs.
In the present work we have performed a proteome-scale analysis of potential Mt drug targets specific to the dormant phase. To this end, for all Mt protein domains with an available structure, we first determined i) their sensitivity to RNOS, based on the amino acid composition of the active site, and ii) their pocket druggability, using fpocket [1] and different pocket properties. This information was then combined with essentiality [2-4], off-target and microarray-derived data [5] in a target prioritization pipeline. Using all the information cited above, we performed a weighted search with sensitivity to RNOS, druggability, essentiality, off-targeting against human targets and upregulation in RNOS conditions as selection criteria. Three new putative targets have been chosen to follow a virtual screening protocol (Table 1).
Table 1
Name | Uniprot | Sensitive to RNOS | Druggable | Essential | Offtarget | Upregulation in RNOS
N-acetyl-glutamate semialdehyde dehydrogenase | P63562 | Yes | Yes | Yes | Yes | Yes
Putative aminoglycoside phosphotransferase | Q7D606 | Yes | Yes | Yes | Yes | No
Possible umaA Mycolic Acid Synthase | Q6MX39 | Yes | Yes | Yes | Yes | Yes
Acknowledgements
This work was partially funded by ANPCyT PICT-2010-2805 awarded to AGT and Bunge y Born FBBEI9/10 (2011-2012) to MAM. LR is an ANPCyT Fellow. LAD is a CONICET Fellow.
References
1. Schmidtke P et al (2010); J Med Chem. 53(15):5858-67
2. Sassetti C.M., et al (2003); PNAS 100 (22) 12989-12994
3. Rengarajan J, et al (2005); PNAS 102(23):8327-32
4. Sassetti C.M, et al (2003); Mol. Microbiol. 48(1), 77-84
5. Voskuil, M.I., et al.
(2003); JEM 198(5):705-713

Computational, biochemical, and spectroscopic studies of the copper-containing nitrite reductase from the denitrifier Sinorhizobium meliloti 2011

María Cecilia Gómez1, Felix Martín Ferroni1, Alberto Claudio Rizzi1, Sergio Daniel Dalosto2 and Carlos Dante Brondino1
1 Departamento de Física, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Santa Fe, Argentina, S3000ZAA.
2 INTEC, Santa Fe, Argentina, S3000ZAA.

Nitrite reductases are enzymes that catalyze the reduction of nitrite to NO in the denitrification pathway of the biogeochemical nitrogen cycle [1]. In denitrifying bacteria, this reaction can be catalyzed by two nitrite reductases, one containing a cd1 heme and the other containing copper. Copper-containing nitrite reductases (hereafter Nir) have a homotrimeric structure (~40 kDa/monomer) with two copper atoms per monomer, one of type 1 (T1Cu, also called blue copper) and the other of type 2 (T2Cu, also called normal copper) (Fig. 1). Nirs have been classified into two groups according to the UV-vis properties of their T1 centers. Blue Nirs exhibit a very intense absorption band at ~590 nm, whereas green Nirs present two intense absorption bands at ~460 and 600 nm. The coordination around both copper centers is shown in Fig. 1b. T1Cu is an electron transfer center, whereas T2Cu is the catalytic center. The proposed reaction mechanism, which involves a pseudoazurin (Paz) as external electron donor, is schematized in Fig. 1a.

Fig. 1. a) Schematic 3D structure of Nir. b) Coordination around T1Cu and T2Cu.

We recently overexpressed and purified the copper-containing nitrite reductase from the denitrifier Sinorhizobium meliloti 2011 (SmNir) [2]. Sinorhizobium meliloti 2011 is a rhizobial organism that lives symbiotically in the root nodules of legumes, which are widely used in agriculture because of their ability to take up dinitrogen from the atmosphere.
We present and discuss the biochemical and spectroscopic properties of SmNir together with the computational structural model predicted from its amino acid sequence. We also report computational studies that describe the interaction of both types of copper atoms with their ligands using a classical force field and classical molecular dynamics. The structure of Nir from Alcaligenes faecalis (pdb accession number, 1SNRB), which shows a high percentage of identity to SmNir, was used as the model. The force field was obtained using the combination of quantum mechanics (QM) and molecular mechanics (MM) methods known as QM/MM methods [3]. This approach allowed us to model the active site adequately at the QM level of theory, and the rest of the system with MM. A total of seven residues, two copper atoms and one water molecule were treated with QM, and the rest, including some water molecules from the solvent, with the Amber force field. We discuss the theoretical model in terms of the experimental results.

Acknowledgment
We thank FONCYT, CONICET, and CAID-UNL for financial support.

References
1. Zumft W: Cell biology and molecular basis of denitrification. Microbiol Mol Biol Rev 1997, 61:533-616.
2. Ferroni FM, Guerrero SA, Rizzi AC, Brondino CD: Overexpression, purification, and biochemical and spectroscopic characterization of copper-containing nitrite reductase from Sinorhizobium meliloti 2011. Study of the interaction of the catalytic copper center with nitrite and NO. J. Inorg. Biochem 2012, 114:8-14.
3. Vreven T, Byun KS, Komaromi I, Dapprich S, Montgomery JA, Morokuma K, Frisch MJ: Combining quantum mechanics methods with molecular mechanics methods in ONIOM. J Chem Theory Comput 2006, 2:815-826.

Truncated normal regression models for soil-water characteristic curves

Carolina C. M. Paraíba1, Carlos A. R. Diniz2, Aline H. N.
Maia3
1 PhD Student at Departamento de Estatística, UFSCar, São Carlos, São Paulo, CP 676, Brazil
2 Departamento de Estatística, UFSCar, São Carlos, São Paulo, CP 676, Brazil
3 EMBRAPA - Meio Ambiente, Jaguariúna, São Paulo, CP 69, Brazil

Background
A soil-water characteristic curve (SWCC) is a useful graphical tool that describes the amount of water remaining in the soil (volumetric water content) as a function of the soil water tension (matric potential). SWCCs are important for studying the relationship between soil and water, a physical phenomenon that affects soil use for many different purposes. One common use of these curves is to determine the unsaturated hydraulic conductivity indirectly, using statistical pore-size distribution models [1]. A SWCC is usually estimated by nonlinear regression models fitted to data sets obtained from laboratory experiments or from pedotransfer functions.

Methods
When constructed from laboratory data, the curve is fitted to pairs (θ, ψ) obtained by applying different tensions, ψ, to a given soil sample and observing the water content, θ, remaining in the sample after each tension level is applied. Thus, retention curves relate a response variable, θ, to a regressor variable, ψ. However, given the nature of SWCC data, the observed water content at any matric potential is known to be no less than the residual soil-water content, θr, and no more than the saturated soil-water content, θs, a phenomenon known in statistics as truncation. The most widely used method for estimating the parameters of a SWCC is nonlinear least squares. Although well established, the usual least squares procedures can be highly biased in the presence of truncation, which can seriously affect the estimated curve and any prediction based on it.
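The truncation described above can be made concrete with a small sketch of a truncated normal log-likelihood, in which each observed θ is assumed to be normally distributed around the curve but restricted to the interval [θr, θs]. The van Genuchten curve shape, the parameter names and the function names below are assumptions chosen for illustration only; the abstract does not state which curve or parameterization the authors used.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def van_genuchten(psi, theta_r, theta_s, alpha, n):
    """Assumed curve shape: water content at tension psi >= 0 (requires n > 1)."""
    m = 1.0 - 1.0 / n
    return theta_r + (theta_s - theta_r) / (1.0 + (alpha * psi) ** n) ** m

def truncated_normal_loglik(params, psi_values, theta_values):
    """Log-likelihood of observations truncated to [theta_r, theta_s]."""
    theta_r, theta_s, alpha, n, sigma = params
    total = 0.0
    for psi, theta in zip(psi_values, theta_values):
        if not (theta_r <= theta <= theta_s):
            return float("-inf")  # observation outside the truncation bounds
        mu = van_genuchten(psi, theta_r, theta_s, alpha, n)
        z = (theta - mu) / sigma
        # Probability mass of the untruncated normal inside the bounds.
        mass = normal_cdf((theta_s - mu) / sigma) - normal_cdf((theta_r - mu) / sigma)
        total += -0.5 * z * z - LOG_SQRT_2PI - math.log(sigma) - math.log(mass)
    return total
```

Maximizing this function over the parameters (for example with a general-purpose numerical optimizer) would give maximum likelihood estimates under these assumptions. The only difference from an ordinary normal log-likelihood is the `math.log(mass)` term, which renormalizes the density over the admissible range.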
As argued in [2], it is important to account for truncation in regression analysis, since the usual LS estimators can be biased, inefficient, and inconsistent. In the present paper, we propose an alternative approach for estimating SWCCs based on nonlinear truncated normal regression models, assuming normal experimental errors and taking into account the truncated nature of the observed data. The parameters of the curve are estimated by the maximum likelihood method.

Results
Simulation studies are provided to assess the quality of the estimates for the proposed regression model. A real data set is analyzed using the proposed methodology. We also provide a comparison between the proposed methodology and the usual nonlinear least squares procedure.

References
1. Cornelis WM, Khlosi M, Hartmann R, van Meirvenne M, de Vos B: Comparison of unimodal analytical expressions for the soil-water retention curve. Soil Science Society of America Journal 2005, 69:1902-1911.
2. Maddala GS: Limited dependent and qualitative variables in econometrics. Cambridge: New York, 1983.

Reverse engineering HD-Zip transcriptional regulatory networks (Ft. Information Theory)

Agustín L. Arce1, Matías Capella1, Delfina A. Ré1, Raquel L. Chan1, Ariel Chernomoretz2,3
1 Instituto de Agrobiotecnología del Litoral, Universidad Nacional del Litoral, CONICET, Ciudad Universitaria, 3000, Santa Fe, Argentina
2 Grupo Biología de Sistemas Integrativa, Fundación Instituto Leloir, C.A.B.A., Argentina, C1405BWE
3 Departamento de Física, Facultad de Ciencias Exactas y Naturales (or IFIBA), Universidad de Buenos Aires, C.A.B.A., Argentina, C1428EHA

Background
HD-Zip proteins constitute a family of plant transcription factors (TFs). It has been reported that proteins belonging to subfamilies I and II are mainly involved in responses to environmental stimuli, particularly abiotic stresses.
However, the regulatory networks in which these TFs participate are largely unknown. In this work the transcriptional regulatory networks of Arabidopsis were reverse engineered from public large-scale transcriptomic assays using algorithms based on information theory. The results were analyzed from a functional and evolutionary point of view, focusing on HD-Zip I and II TFs.

Materials and methods
The program ARACNE was used for the network reconstruction. The filtered data consisted of 9618 genes and 269 microarrays obtained under different abiotic stress treatments (AtGenExpress project, http://www.weigelworld.org/resources/microarray/AtGenExpress/). As a result, sets of potential direct targets (named modules of transcriptional activity, MTAs) were obtained for each of the 831 TFs.

Results
The study of the MTAs of the 25 HD-Zip I and II TFs revealed many novel functional characteristics. A distinctive pattern of expression was found for subsets of genes in roots and shoots for most of the MTAs (e.g., MTA of AtHB1; Figure 1A). The overlap of the MTAs was more significant for some phylogenetically closely related genes (Figure 1B), suggesting a degree of functional redundancy in these cases. The expression correlation of the genes in each MTA under different stresses uncovered a potential, previously unknown role for most TFs in the heat response (e.g., AtHB12 MTA; Figure 1C). A de novo motif discovery approach on the promoters of MTA genes, supported by conditional mutual information, allowed the recognition of a potential interplay with GBF TFs. Functional studies with GO terms on MTA genes resulted in the association of the TFs with known and new pathways and functions. Arabidopsis HD-Zip mutants and overexpressors preliminarily confirmed the predicted regulatory role of some TFs on selected genes. Other preliminary experimental data also support their role in the heat response.

Figure 1. A. Gene expression heatmap of AtHB1 MTA. B.
Heatmap of pairwise MTA overlap comparisons considering all TFs. C. Genes with correlated expression in the AtHB12 MTA under different stresses. D. Representation of statistically enriched GO terms associated with genes of the AtHB12 MTA.

Software integration to bioimage management, processing and analysis

J. E. Diaz Zamboni1, L. Bugnon1, E. Paravani1, C. Galetto1, J. Adur1, V. Bessone1, M. Bianchi1, M. G. Acosta1, S. Laugero1, V. H. Casco1 and M. F. Izaguirre1
1 Laboratorio de Microscopia Aplicada a Estudios Moleculares y Celulares (LAMAE), Facultad de Ingeniería, Universidad Nacional de Entre Ríos

Abstract
Over the last two decades, technological advances in basic research equipment for cell and molecular biology have been astonishing. Among all the instruments developed to support research, bioimaging systems are essential tools. Bioimages can be obtained from a wide range of equipment and associated techniques, among which we can highlight microscopy images (of all types and modes), electrophoresis gels, hybridization membranes and microarrays. Almost all available instrumentation comes from a very competitive industry, which produces equipment highly protected by patents and proprietary software that stores bioimages in several file formats. However, the vast majority of these systems do not allow the entire experiment metadata to be stored, nor do they use standardized, open formats that would let users import the data from other software. This lack of standardization makes the applications more complex to use and mutually incompatible, so users tend to move their data across multiple applications and file formats. Consequently, valuable information is lost in the conversion and/or migration.
LAMAE's members, as users and developers, have addressed the problem of bioimage administration, processing and analysis by working on a systematic software integration. The solution is focused on the use of standard image file formats and free software. We searched the available options for a standard format for easy sharing and for a system to manage bioimage data and metadata efficiently. OME-XML and OME-TIFF were selected to standardize the information; both formats were created for optical microscopy techniques. The first is a text file that encodes the image as text, maximizing portability and ease of metadata reading [1,2,3]. The OME-TIFF format is an open standard based on the TIFF format, in which the metadata defined in the OME-XML schema are stored in the header of the TIFF file. It offers faster access to image data [4]. Both OME-XML and OME-TIFF headers are extensible, with a structure that makes them appropriate for other types of bioimages. We selected the free server OMERO for the management of bioimages and installed it on a desktop computer. Client applications can run on any computer and access the server through the local network. ImageJ was selected for image processing and analysis because of its powerful features: support for multiple image formats, development tools, access to the OMERO server and cross-platform functionality [5]. Another software integration activity was the implementation of a tool to export images from an optical sectioning microscope to the OME-TIFF file format [6,7]. Our future activities on software integration will be related to the development of two imaging systems: a photodocumentation system for electrophoresis gels and a digitization system for scanning electron microscopy. In both cases the images will be stored on the OMERO server, and we will study the OME-TIFF images obtained with this equipment with a view to the portability of image files.
References
1. J. R. Swedlow, I. Goldberg, E. Brauner, and P. K. Sorger. “Informatics and quantitative analysis in biological imaging”. Science 300 (2003) 100-102.
2. I. Goldberg, C. Allan, J.-M. Burel, D. Creager, A. Falconi, H. Hochheiser, J. Johnston, J. Mellen, P. K. Sorger, and J. R. Swedlow. “The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging”. Genome Biol. (2005) 6 R47.
3. Open Microscopy Environment. http://www.ome-xml.org/
4. Bioformats. http://www.loci.wisc.edu/ome/ome-tiff.html
5. ImageJ. http://rsbweb.nih.gov/ij/
6. J. E. Diaz-Zamboni. “Software para usuarios de microscopios de desconvolución digital”. Tesis de grado. Facultad de Ingeniería, Universidad Nacional de Entre Ríos, (2004).
7. J. E. Diaz-Zamboni, J. F. Adur, D. Osella and V. H. Casco. “Software para usuarios de microscopia de desconvolución digital”. XV Congreso Argentino de Bioingeniería, (2005).

Image analysis to control the roast level of the peanut

Ignacio Arévalo1, Silvia Ojeda1
1 FAMAF, Córdoba, Argentina

Motivation
This work deals with automatic control applied to an agro-industrial sector. It responds to the increasing interest of the peanut industry in improving yield and product quality. Specifically, an online automatic methodology to distinguish the different roast levels in the peanut roasting process is introduced. This helps to detect failures in the roasting process so that corrections can be applied if needed. The proposed method improves on the methodology recently presented by Palma, Ojeda and Modesti [3] and is based on a novel algorithm that uses information provided by optical sensors installed near the oven where the roasting process is carried out.

Materials
We have a database of 3900 color images of skinless peanut kernels and bulk peanuts.
These were taken in a simulated environment and cover a variety of roast levels.

Conclusion
We present a novel algorithm for the automatic control of peanut roasting. The method is based on image processing techniques and applies computer technologies. The new online automatic methodology makes it possible to properly distinguish the desired roast level.

Keyword
Bulk peanut, image processing, computer networks, roast level of peanut.

References
1. BATAL, A.; DALE, N.; CAFÉ, M. Nutrient composition of peanut meal. Journal of Applied Poultry Research, v. 14, p. 254-257, 2005.
2. SANDERS, T. H. Effects of variety and maturity on lipid class composition of peanut oil. Journal of the American Oil Chemists' Society. v. 57, n. 1, p. 8-11, 2007.
3. PALMA, J. J., OJEDA, S. M., MODESTI, M. Procesamiento de imágenes industriales: una aplicación al control del tostado del maní. IJIE Vol 3, Nº2 2011.

Eukaryotic secretory pathway proteins avoid occluded N-glycosylation sequons

Máximo López Medus, Gabriela Elena Gómez, Lucas Landolfo, Julio Javier Caramelo*
Fundación Instituto Leloir. IIBBA. Conicet.
* Departamento de Química Biológica de Buenos Aires.

Abstract
N-glycosylation is one of the most abundant and drastic post-translational modifications. About 25% of eukaryotic proteins are N-glycosylated when they enter the secretory pathway. N-glycans are important for the conformational maturation of glycoproteins and fulfill vital roles in several molecular recognition processes. This modification takes place on the side chain of Asn residues within the context Asn-X-Ser/Thr (the N-glycosylation sequon), where X cannot be Pro. Even though all known N-glycans are located on the protein surface, N-glycosylation takes place before any major protein folding event, when proteins display an extended conformation.
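The sequon definition above (Asn, then any residue except Pro, then Ser or Thr) maps directly onto a regular expression. The following is a minimal sketch for locating candidate sequons in a one-letter amino acid sequence; the function name and the example sequence are hypothetical, and the code says nothing about whether a site is actually occupied or solvent exposed.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr. The lookahead keeps the
# match one residue long, so overlapping sequons (e.g. "NNTS") are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return (0-based Asn index, tripeptide) pairs for each N-X-S/T sequon."""
    seq = sequence.upper()
    return [(m.start(), seq[m.start():m.start() + 3]) for m in SEQUON.finditer(seq)]

# Hypothetical sequence: "NPS" is skipped because X is Pro.
print(find_sequons("MNGTANPSNNTS"))
```

In an analysis like the one described in this abstract, the Asn indices returned here would be the positions whose surface exposure is then evaluated.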
Because N-glycosylation precedes folding, sequons that are normally buried in the protein structure could become occupied, which in turn would seriously impair the folding process. There are two scenarios that avoid this situation: (1) secretory pathway proteins avoid occluded N-glycosylation sequons, or (2) occluded sequons are not occupied. To answer this question, we classified the entries of the Protein Data Bank according to whether or not the proteins belong to the eukaryotic secretory pathway. Next, we analyzed the surface exposure of Asn residues within the sequon context using the MSMS program. We found that secretory pathway proteins avoid occluded N-glycosylation sequons. Compared with non-secretory pathway proteins, Asn-X-Thr and Asn-X-Ser sequons are 6 and 3 times less frequent in secretory pathway proteins, respectively. This strong bias is highly specific, since it is absent in all of the remaining Asn-X-Y combinations. To generalize this result, we analyzed the solvent exposure of the first residue in the 400 Y1-X-Y2 combinations. Interestingly, we found that only N-glycosylation sequons display such a strong disparity between secretory and non-secretory pathway proteins.

25S-18S ribosomal nature of the non-NOR-associated highly GC-rich heterochromatin of chili peppers (Capsicum-Solanaceae)

Mauro Grabiele1, Humberto Debat2, Marisel Scaldaferro3, Guillermo Seijo4, Daniel Ducasse2, Eduardo Moscone3, Dardo Martí1
1 Instituto de Biología Subtropical (IBS-UNaM), Posadas, Misiones, 3300, Argentina
2 Instituto de Fitopatología y Fisiología Vegetal (IFFIVE-INTA), Córdoba, 5119, Argentina
3 Instituto Multidisciplinario de Biología Vegetal (IMBIV-UNC), Córdoba, 5000, Argentina
4 Instituto de Botánica del Nordeste (IBONE-UNNE), Corrientes, 3400, Argentina

Background
Highly GC-rich heterochromatin (Het) is a universal component of the genome of chili peppers [1].
Fluorescent in situ hybridization (FISH) patterns obtained with Arabidopsis- and wheat-derived ribosomal DNA (rDNA) probes, comprising rRNA genes and the intergenic spacer (IGS) with its repetitive block segments, revealed an a priori co-localization of rDNA and Het regions in Capsicum, which prompted a further characterization of the 25S-18S rDNA unit in the genus [2] and the present work.

Material and methods
To reveal the unambiguous nature of the Het of chili peppers, combined bioinformatics (database searching, sequence alignments, primary and secondary structure analysis), molecular cytogenetics (FISH) and molecular biology (PCR amplification, restriction enzyme assays, cloning and sequencing) approaches were carried out in 8 taxa representative of the major lineages of Capsicum.

Results
A definite FISH co-localization pattern of Capsicum-derived rDNA gene (18S, 25S, 5.8S) and spacer (ITS, IGS) probes with Het is exclusive to chili peppers based on x=12. The FISH pattern of the pCp200/33 probe, a mutated IGS element likely affecting rDNA transcription, mimics that of rDNA/Het, excluding the active NORs.

Figure 1. Alignment of homologous regions of the 25S-18S rDNA-derived probes used in FISH of Capsicum (left) and double FISH in Capsicum pubescens (right); blue: DAPI-stained chromatin; green signals: 18S rDNA probe (Cf18S-17); red signals: pCp200/33 probe. Arrowheads point out the absence of red signals in active NORs. Scale bar is 10 µm.

Conclusion
The highly GC-rich Het of chili peppers based on x=12 is formed by tandemly repeated megasatellite DNA sequences derived from the entire 25S-18S rDNA unit (7.8 kbp); its origin, expression and evolution are strongly related to its inherent double nature, heterochromatic and ribosomal. Its absence from the Het constitution of taxa with x=13 has evolutionary relevance.

References
1.
Moscone EA, Scaldaferro MA, Grabiele M, Cecchini NM, Sánchez García Y, Jarret R, Daviña JR, Ducasse DA, Barboza GE, Ehrendorfer F: The evolution of chili peppers (Capsicum - Solanaceae): a cytogenetic perspective. Acta Hort 2007, 745:137-169.
2. Grabiele M, Debat HJ, Moscone EA, Ducasse DA: 25S-18S rDNA IGS of Capsicum: molecular structure and comparison. Pl Syst Evol 2012, 298:313-321.

Computational Simulation of inclusion ways of Sulfamethoxazole and Sulfadiazine in Cyclodextrins

Erbes, Luciana A.1
1 Universidad Nacional de Entre Ríos, Facultad de Ingeniería, Oro Verde, Entre Ríos, Argentina

Introduction
Sulfonamide-cyclodextrin complexes motivated this research, first because of their widespread use in the pharmaceutical field and second because of the lack of studies on this subject. Sulfamethoxazole (SMX) and Sulfadiazine (SDZ) are the sulfonamides selected and, among the cyclodextrins (CD), β-cyclodextrin (β-CD) was chosen. Theoretical studies of each complex were carried out using a set of methodologies (molecular modelling, docking and dynamics) implemented in software systems.

Results
Molecular Modelling
The ligands, SMX and SDZ, were designed using the software Gabedit. The receptors, β-CD, Hydroxypropyl-β-CD (HP-β-CD, with 3 and 4 hydroxypropyl groups) and Methyl-β-CD (M-β-CD), were built by modifying a crystallographic structure similar to β-CD.

Molecular Docking
The location of each ligand in the complexes was predicted using Amber software. In each case, criteria were defined to select the appropriate conformations to be analyzed by molecular dynamics. In every complex (SMX-β-CD, SDZ-β-CD, SMX-HP-β-CD, SDZ-HP-β-CD, SMX-M-β-CD and SDZ-M-β-CD), the result of the conformational search was, in general, one cluster with a minimum energy.
Molecular Dynamics
The molecular dynamics stages (minimization, heating, equilibration, and production) were applied to the receptors and the complexes in an explicit solvent environment. In each stage, energy and temperature analyses were carried out and compared. Finally, 10 ns were obtained for every receptor and 10 ns for each complex.

Analyses
VMD software, LigPlot software and Amber scripts (H bonds, distances, nearest waters) were used to verify the location of the ligands in their receptors and the movements and orientation of the hydroxypropyl groups.

Conclusions
In most of the complexes the ligand is oriented in the same way, with the benzene nucleus bearing the amino group located toward the wide side of the CD. Only in SMX-HP-β-CD (4 hydroxypropyl groups) and SDZ-HP-β-CD (3 hydroxypropyl groups) is the orientation reversed. Comparisons among the complexes also yielded some information on solubility.

Phylogenetic relationships of Rhinella arenarum β-catenin. A developmental biology useful model

Hasenahuer M. A.; Galetto C. D.; Casco, V. H. & Izaguirre, M. F.
Laboratorio de Microscopia Aplicada a Estudios Moleculares y Celulares, Facultad de Ingeniería (Bioingeniería-Bioinformática), Universidad Nacional de Entre Ríos. Ruta 11, Km 10, Oro Verde, Entre Ríos, Argentina.
Corresponding author: M. F. Izaguirre: [email protected]

Background
Rhinella arenarum is a South American toad, widely distributed across Argentina, Uruguay, Bolivia and southern Brazil. This species has been extensively used in developmental biology studies in Argentina for more than 50 years. Recently, we were able to isolate and sequence a 539 bp fragment of R. arenarum β-catenin cDNA (1). β-catenin is a vertebrate cytoplasmic protein that, like the Drosophila armadillo product, has two main functions: linking the cadherin cell-adhesion molecules to the cytoskeleton, and mediating the Wnt/Wingless signalling pathway.
Thus β-catenin regulates gene expression by direct interaction with transcription factors belonging to the Tcf/LEF family. It provides molecular mechanisms for signal transduction from cell-adhesion components or Wnt protein to the nucleus, thereby controlling numerous cell events, such as growth and development (2, 3). The study of β-catenin function in non-traditional animal models adds evidence for understanding its evolutionary role. Therefore, sequence logos of the complete gene and a phylogenetic analysis were undertaken. Numerous metazoan and non-metazoan gene and protein sequences of β-catenin and β-catenin-like proteins were analyzed.

Materials and methods
cDNA sequence logos were obtained from metazoan (Homo sapiens, Macaca mulatta, Bos taurus, Gallus gallus, Mus musculus, Xenopus laevis, Xenopus tropicalis, Rhinella arenarum, Anolis carolinensis, Danio rerio, Drosophila melanogaster, Pediculus humanus, Caenorhabditis elegans, Trichoplax adhaerens, Amphimedon queenslandica) and non-metazoan (Arabidopsis thaliana, Volvox carteri, Dictyostelium discoideum) species. Using the same species, a phylogenetic analysis of the protein sequences of β-catenin and β-catenin-like proteins was performed with PhyML 3.0, with aLRT-SH and bootstrap branch supports.

Results and Conclusions
The logos and protein phylogenetic trees obtained revealed a close evolutionary relationship of the Rhinella arenarum β-catenin homolog with the homologous genes of other vertebrates, especially the amphibian ones, and an interesting relationship with those of non-metazoan species. This supports the hypothesis that β-catenin predates metazoan life, integrating information from genomic, cytoskeletal, plasma membrane and environmental sources.

Acknowledgements
The present work was supported by PID-UNER 6088-1.

Comparison of the ability to predict true linear B-cell epitopes by on-line available prediction programs

J. Gabriel Costa1, Pablo L. Faccendini2, Silvano J.
Sferco3, Claudia M. Lagier2, Iván S. Marcipar1
1 Laboratorio de Tecnología Inmunológica, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral. Paraje El Pozo. Santa Fe, Argentina.
2 IQUIR, Depto. de Química Analítica, Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario. Suipacha 531. Rosario, Argentina.
3 Departamento de Física, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Paraje El Pozo. Santa Fe, Argentina; and INTEC (CONICET-UNL), Güemes 3450, Santa Fe, Argentina.

Background: Several experimental methods have been developed to identify B epitopes from the proteins of infectious microorganisms. However, these methodologies are time-consuming and quite expensive. Our work deals with the use of prediction programs to identify linear B-cell epitopes that are useful for developing immunoassays. We therefore tested 5 free, on-line prediction methods (AAPPred, ABCpred, Bcepred, BepiPred and Antigenic) that are widely used for predicting linear epitopes and take the primary structure of the protein as their only input. Each program uses a very different algorithm.

Methods and Results: To compare the quality of the prediction methods we used their positive predictive value (PPV), i.e. the proportion of predicted epitopes that are true, experimentally confirmed epitopes, relative to all of the epitopes predicted. Eleven proteins that had been fully mapped experimentally by highly reliable epitope detection techniques were studied. Each program was run and the predicted epitopes were compared with the 65 true epitopes displayed in the proteins. Since the aim was to identify useful predicted linear epitopes, no assumed true-negative set was used. The confidence intervals of the PPV were calculated at the 90% level for each prediction procedure. The best PPVs were obtained with AAPPred and ABCpred, 69.1% and 62.8% respectively.
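The PPV defined above is a simple proportion, and a 90% confidence interval for it can be sketched as follows. The abstract does not say which interval method the authors used; the Wilson score interval below is an assumption chosen for illustration, and the counts in the usage line are hypothetical rather than taken from the study.

```python
import math

def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed epitopes among all predicted ones."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no predicted epitopes")
    return true_positives / predicted

def wilson_interval(successes, trials, z=1.645):
    """Wilson score interval; z = 1.645 gives roughly 90% two-sided coverage."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half

# Hypothetical counts: 30 confirmed epitopes out of 40 predictions.
low, high = wilson_interval(30, 40)
print(ppv(30, 10), low, high)
```

Because only predicted regions enter the calculation, this measure needs no true-negative set, which matches the design choice stated in the abstract.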
We also statistically evaluated the differences between these PPV values, treating the data as paired. This allowed us to determine, with 90% certainty, which programs produced a PPV different from that of another program. Then, to monitor the prediction efficiency of the programs, we compared the PPV for epitope identification with that obtained when regions of the molecule under study were selected at random. Our results indicate that only 2 of the programs studied, AAPPred and ABCpred, predicted epitopes with a statistically significantly higher PPV than a random procedure. We also analyzed whether the epitopes predicted by the consensus of several programs were more reliable than those predicted by each program alone or by a partial consensus. However, we observed that considering as true epitopes only the regions on which several programs agree does not improve the PPV with respect to the results produced by each program individually.

Conclusion: We conclude that AAPPred and ABCpred yield the best results, compared with the other programs and with a random prediction procedure. We also ascertained that considering the consensus epitopes predicted by several programs does not improve the positive predictive value.

Relative mobility of epitopes residues in immunogenic proteins

Marcos Astorga1, Sebastián Fernández Alberti1,2, Gustavo Parisi1,2
1 Universidad Nacional de La Plata, Argentina
2 Universidad Nacional de Quilmes, Argentina

Background
The antigen-antibody interaction is based on the recognition by the antibody of particular antigen residues called epitopes. Understanding the general features that characterize epitopes may contribute to the design of specific drugs that hinder the formation of the antigen-antibody complex.
From this point of view, knowledge of the main physicochemical, structural, dynamical and evolutionary properties of epitopes will allow us to obtain common features that can be used to develop new methods of epitope recognition.

Methods
We analyzed the flexibility and dynamic properties of epitope residues in 15 groups of antigen-antibody complexes. To do so, we performed vibrational normal mode analysis using coarse-grained elastic network models.

Results and conclusion
Our preliminary results reveal a significant decrease in epitope flexibility once the antigen-antibody complexes are formed. Similar behaviour is observed for the relative mobilities of the epitopes in the low-frequency normal modes. These results provide complementary information that can be used in combination with the physicochemical and structural characterization of epitope sites in order to identify potential epitopes.

Relationship between divergence of using synonymous codons in host-virus and the presence of microRNA

Franco Riberi1, Laura Tardivo1, Lucia Fazzi2, Guillermo Biset3, Daniel Gutson3, Daniel Rabinovich2,3
1 Department of Computing Sciences, Universidad Nacional de Río Cuarto, Río Cuarto, Córdoba, Argentina
2 Instituto Biomédico en Retrovirus y SIDA-INBIRS, Buenos Aires, Argentina
3 Fundación para el Desarrollo de la Programación en Ácidos Nucleicos-FuDePAN, Córdoba, Argentina

Background
MicroRNAs (miRNAs) are small RNAs that regulate the expression of mRNA in cells. They can interfere with virus replication. For this to happen, the miRNA must recognize target sites in the genome, and a pairing must occur between the miRNA and a fragment of viral mRNA. This recognition is more likely if the fragment is not paired (masked) in the secondary structure of the viral mRNA [1].
It is known that the genomes of some human viruses show a bias in the use of synonymous codons (different codons that encode the same amino acid) compared with the host, even though this would make their replication less efficient [2].

Goal
The aim of the study is to determine whether this bias could be the result of evolutionary pressure exerted by miRNAs. To achieve this goal, massive comparisons must be made (on the order of 10e7) between the recognition of the natural virus genome and that of the "humanized" genome. The latter may be obtained by replacing codons in the viral genome so as to achieve a codon usage ratio similar to that of the host.

Materials and Methods
For each miRNA, the software to be developed will perform a parallel "sweep" over the natural and the humanized virus sequences. For each possible genome site, the program should determine the number of recognized nucleotides and whether the site is available or masked by the secondary structure. By comparing results at homologous sites, it can be determined whether miRNAs have a differential effect on the different target mRNAs (natural and humanized). The program will be coded in the C++ programming language and licensed under the GPLv3 software licence.

Results
For each miRNA and genome, a table should be produced that records, for each position, the matching mRNA score in both the original and the humanized sequence. Table 1 shows the structure of the table to be generated.

Position   Original sequence                        Humanized sequence
           Matching∗    Masked+      XYZ            Matching†    Masked‡      XYZ
1          aaTTg CacA   aaTTg Maca   AaTTg Xaca     ttAAC Gtct   ttMAC MtcM   ttYAC YtcX
...        ...          ...          ...            ...          ...          ...
N          ...          ...          ...            ...          ...          ...

Position   Score original sequence                  Score humanized sequence
           %const=1∗  cFold∗  %const=1+  cFold+     %const=1†  cFold†  %const=1‡  cFold‡
1          0.44       0.45    0.22       0.24       0.55       0.54    0.11       0.21
...        ...        ...     ...        ...        ...        ...     ...        ...
N          ...        ...     ...        ...        ...        ...     ...        ...

Table 1: Table structure to generate, where (cFold constAT = 1.25) && (cFold constGC = 0.95).
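The "sweep" described in Materials and Methods can be illustrated with a minimal Python sketch. It is not the planned C++ implementation of the program: the function names, the RNA alphabet, and the scoring (plain Watson-Crick pair counting with antiparallel alignment, no G-U wobble and no secondary-structure masking) are all assumptions made for illustration.

```python
# Hypothetical sketch: names and scoring are illustrative assumptions,
# not the design of the software described in the abstract.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def count_pairs(mirna, site):
    """Count Watson-Crick pairs between a miRNA and an mRNA site (both 5'->3').

    Pairing is antiparallel, so the miRNA is compared with the reversed site.
    """
    return sum(COMPLEMENT.get(m) == s for m, s in zip(mirna, reversed(site)))


def sweep(mirna, mrna):
    """Return the pair count for every window of len(mirna) along the mRNA."""
    n = len(mirna)
    return [count_pairs(mirna, mrna[i:i + n]) for i in range(len(mrna) - n + 1)]


def compare(mirna, natural, humanized):
    """Pair the per-position scores of the natural and humanized sequences."""
    return list(zip(sweep(mirna, natural), sweep(mirna, humanized)))
```

Calling `compare(mirna, natural, humanized)` yields one (original, humanized) score pair per position, which corresponds loosely to one row of the table structure above. A masking flag per position would have to come from a separate secondary-structure prediction step.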
It will also be analyzed whether the results favor the hypothesis of miRNA selective pressure as a cause of bias in codon usage.

Conclusions and Perspectives
The above shows that software can be developed as a tool for massive comparisons of the interactions between miRNAs and alternative target mRNAs. This tool will be part of a software product called RNAemo. It will contribute to the development of tools to compare the possible effects of host miRNAs on viruses intentionally introduced for gene therapy of cancers or genetic diseases. Further studies will include an estimate of the free energy of the binding reaction between miRNA and mRNA [3].

References
[1] Gareth M. Jenkins and Edward C. Holmes. "The extent of codon usage bias in human RNA viruses and its evolutionary origin", 2003.
[2] Ulrike Muckstein, Hakim Tafer. "Thermodynamics of RNA-RNA Interaction". Institute for Theoretical Chemistry, University of Vienna.
[3] Zuker, Michael. "Computational Methods for RNA Secondary Structure". June 8, 2006.

Distribution of bioactive peptides in NR

Agustina Nardo1, Cristina Añón2 and Gustavo Parisi1
1 Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Roque Saenz Pena 182, Bernal B1876BXD, Argentina
2 Centro de Investigación y Desarrollo en Criotecnología de Alimentos (CIDCA), Universidad Nacional de La Plata, La Plata 47 y 116 (1900), Argentina

Background
Bioactive peptides (BP) are short sequences (3-20 residues) that can be encrypted in food protein sequences. BP modulate the biological activity of several human enzymes, playing key roles in different metabolic processes such as regulating blood pressure, stimulating or suppressing the action of the immune system, modulating the activity of the nervous system, and inhibiting the development of bacteria and fungi, among others.
The detection of BP in proteins is an important issue in food biotechnology for the development of functional foods. To study the relationship of sequence and structure with BP biological activity, in this work we study the structural and evolutionary occurrence of well-characterized BP in the universe of known proteins.

Material and Methods
Using the BioPep database, we retrieved 1,662 BP more than 5 residues long. Sequence similarity searches against a non-redundant database were performed to find exact occurrences of each BP. The resulting dataset contains 80,523 sequences. In order to characterize the sequences with at least one occurrence of a known BP, and to characterize the distribution of peptides, we made a structural assignment using BLAST searches over the CATH database. From these searches we obtained 55,407 proteins. For each of these proteins, a template was selected from the CATH database searches, and a reference structure characterizing the homologous superfamily of each protein was also retrieved. This reference structure represents the conserved fold for all the retrieved proteins in the same homologous superfamily. Using the sequence alignment between the template and the reference structure, we mapped the occurrence of each BP in order to explore its structural and functional distribution in the different structural families found. Phylogenetic trees were also obtained using the PhyML maximum likelihood approach. The statistical significance of the BP occurrence was evaluated using the Slim package.

Conclusion
The distribution of BP was not homogeneous, showing a great variety of organisms and phylogenetic distributions. The distribution of BP in the structural space showed, however, a relatively small number of structural families. Interestingly, for certain activities, hot spots were found in the different folds.
We also found that several BP are associated with regions of structural and functional importance, as indicated by their high relative sequence conservation in the phylogenetic analysis. We think that the occurrence of BP hot spots associated with different activities and folds could contribute to the development of new tools to find BP.

Unraveling the molecular basis of mammalian inner ear evolution: analysis of the outer hair cell cytoskeleton protein spectrin

Francisco Pisciottano, Belén Elgoyhen, Lucía Franchini
Instituto de Investigaciones en Ingeniería Genética y Biología Molecular (INGEBI), Buenos Aires, Argentina

In our laboratory we study the genetic basis underlying the evolution of the particular functional capacities of the mammalian inner ear. During the evolution of mammals, the inner ear went through many important changes that made it different from the hearing organ of other vertebrates and endowed mammals with hearing capacities unique in the animal kingdom. Among these changes is the origin of a unique cell type, the outer hair cell (OHC), which shows a novel mechanism known as somatic electromotility. This mechanical amplification mechanism is an active cochlear amplifier that can increase hearing sensitivity and frequency selectivity, and it depends on changes in OHC length mediated by the motor protein prestin. These length changes are possible due to the particular characteristics of the OHC lateral wall, which has a submembranous lateral system known as the cortical lattice. This protein-based skeleton consists of circumferential actin filaments that are cross-linked by spectrin filaments. In the cortical lattice of mammalian OHCs, alphaII-spectrin is found in association with betaV-spectrin, which indirectly interacts with prestin [1].
We found in previous work that prestin shows strong signatures of positive selection in the mammalian lineage [2]. Using maximum likelihood methods to test models of positive selection, we aim to reveal which other proteins were involved in shaping the morphological and functional particularities of the mammalian inner ear. Our present results suggest that betaV-spectrin has accompanied prestin's evolutionary trend in the lineage leading to mammals. Moreover, the positively selected sites of betaV-spectrin group in clusters that are distributed non-randomly along the protein, spanning specific spectrin domains. Among the domains that accumulate positively selected amino acids are those mediating interaction with alphaII-spectrin for dimerization and with the adaptor protein ankyrin, which mediates the attachment of integral membrane proteins to the spectrin-actin based membrane skeleton. Our work continues to delineate the genetic bases underlying the evolution of the inner ear in mammals.

References
1. Legendre K, Safieddine S, Küssel-Andermann P, Petit C, El-Amraoui A: αII-βV spectrin bridges the plasma membrane and cortical lattice in the lateral wall of the auditory outer hair cells. J Cell Sci. 2008, 121:3347-3356.
2. Franchini LF, Elgoyhen AB: Adaptive evolution in mammalian proteins involved in cochlear outer hair cell electromotility. Mol Phylogenet Evol 2006, 41:622-635.

A pipeline for structural annotations in bacterial genomes

Lanzarotti Esteban1,2, Defelipe Lucas1,2, Radusky Leandro1,2, Marti Marcelo1,2, Turjanski Adrián1,2
1 Departamento de Química Biológica, FCEN - UBA, Buenos Aires, Argentina
2 INQUIMAE, CONICET-UBA, Buenos Aires, Argentina

In the last 10 years, much work has gone into developing software tools for predicting structural features from a protein amino acid sequence, such as secondary structure [1], intrinsic disorder [2] and tertiary structure [3].
Much effort has also been spent on improving DNA sequencing technologies, making it possible to obtain many portions of bacterial DNA without long waits [4]. In this work we present a pipeline that produces annotations of structural properties for sequenced bacterial proteins, builds structural models using homology modeling techniques, and assesses the models using two quality measures.

1. Rost B. Review: protein secondary structure prediction continues to rise. J Struct Biol. 2001 May-Jun;134(2-3):204-18.
2. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK. Predicting intrinsic disorder in proteins: an overview. Cell Res. 2009 Aug;19(8):929-49.
3. Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291-325.
4. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387-402.

Identification of putative LxCxE motifs targeting the retinoblastoma protein in human viruses by structure- and sequence-based calculations

Juliana Glavina1, Lucía B. Chemes2, Gonzalo de Prat-Gay2, Ignacio E. Sánchez1
1 Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires.
2 Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and IIBBA-CONICET.

Introduction
Many protein functions can be described in terms of "linear sequence motifs" of less than five function-determining residues. The LxCxE motif interacts with the retinoblastoma tumor suppressor (Rb), which plays a key role in cell cycle progression. The LxCxE motif has been identified in several proteins from RNA and DNA viruses, suggesting that it may be present in other viral proteins.
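As a point of reference for what a linear motif such as LxCxE looks like in practice, the following Python sketch lists every L-x-C-x-E occurrence in a protein sequence, where x is any residue. It is only a plain pattern match: it does not estimate binding affinity, and the function name and the example fragment used afterwards are illustrative assumptions.

```python
import re

# A lookahead is used so that overlapping occurrences are also reported.
LXCXE = re.compile(r"(?=(L.C.E))")


def find_lxcxe(sequence):
    """Return (0-based start, matched 5-mer) for every L-x-C-x-E occurrence."""
    return [(m.start(), m.group(1)) for m in LXCXE.finditer(sequence.upper())]
```

For the illustrative fragment `"MDLYCYEQLND"`, `find_lxcxe` reports a single candidate, `LYCYE`, starting at index 2. A pattern match of this kind only flags candidate sites; ranking them by predicted affinity requires a scoring scheme.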
We have developed a method to predict the affinity of a sequence stretch for the retinoblastoma protein using a combination of structure- and sequence-based calculations.

Methods
Structure-based calculations used FoldX, an empirical force field for the prediction of the stability of proteins and protein complexes [1]. We used the LxCxE-Rb complex structure to compute a first position-specific scoring matrix. Sequence-based calculations used molecular information theory, which makes use of residue statistics in an alignment of known binding motifs [2]. We used over 200 sequences of LxCxE motifs from the papillomavirus E7 protein to compute a second position-specific scoring matrix. Finally, we used the new algorithm to scan all known sequences from human viruses.

Conclusions
The combination of structure-based and sequence-based calculations is able to reproduce quantitative and semi-quantitative binding experiments from the literature and to identify known instances of the LxCxE motif as well as novel putative LxCxE motifs. We discuss the list in the light of the structural and functional properties of the protein containing each motif.

References
1. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 2002, 320:369-387.
2. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J Mol Biol 1986, 188:415-431.

Acknowledgements
We acknowledge funding from Agencia Nacional de Promoción Científica y Tecnológica (PICT 20101052 to I.E.S.), Consejo Nacional de Investigaciones Científicas y Técnicas (postdoctoral fellowship to L.B.C.; G.d.P.G. and I.E.S. are CONICET career investigators) and Instituto Nacional del Cáncer (graduate fellowship to J.G.).

VISI: a computational program for antiviral strategies comparison

Leandro E.
Ramos2, Pablo M. Oliva2, Francisco Herrero2, Daniel Gutson1, Daniel Rabinovich1,3, Pedro A. Pury2
1 FuDePAN: Fundación para el Desarrollo de la Programación en Ácidos Nucleicos, Córdoba, Argentina, X5002AOO, Duarte Quirós 1752 7A
2 FaMAF, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina, X5000HUA
3 CNRS: Centro Nacional de Referencia para el SIDA, Facultad de Medicina, UBA, Buenos Aires, Argentina, C1121ABG

Background
The growing number of antiretrovirals and the phenomenon of resistance make it desirable to develop programs that help in choosing among therapy options before proceeding to in vitro or in vivo trials. For this task, software that shows the evolution of the infection over time under different therapies would be particularly useful. To address this issue, we present the Virus Simulator (ViSi) project [http://visi.googlecode.com]. It models in silico the temporal evolution of the cellular and viral populations involved in HIV infection. The parameters of interaction among cells and virions are completely configurable, to represent both the action of antiretroviral drugs and the development of drug-resistant strains. In particular, the actions of reverse transcriptase inhibitors (RTI) and protease inhibitors (PI) are explicitly considered. The system is composed of an extensible simulation kernel, developed by synthesizing and upgrading known mathematical models [1,2], and a plugin-based extension model. These plugins are specifically designed to consider combinations and sequences of antiretroviral applications. Thus, the system allows the simulation and testing of several drug therapies used in AIDS treatments.

Conclusion
The main features of the system described above are:
• Modular kernel to test different mathematical models of HIV infection.
• Completely configurable to set simulation parameters.
• Extensible through plug-ins to simulate drug effects on infection.
• Programmable interface to test therapies consisting of combinations and sequences of antiretrovirals.

References
1. Denise E. Kirschner and F. G. Webb: Understanding drug resistance for monotherapy treatment of HIV infection. Bull. Math. Biol. 1997, 59:763–786.
2. Alan S. Perelson and Patrick W. Nelson: Mathematical analysis of HIV-1 dynamics in vivo. SIAM Rev. 1999, 41:3-44.

FuL

Alejandro Kondrasky1, Daniel Gutson1, Carlos Areces1,2
1 FuDePAN: Fundación para el Desarrollo de la Programación en Ácidos Nucleicos, X5002AOO, Córdoba, Argentina.
2 FaMAF: Universidad Nacional de Córdoba, Ciudad Universitaria, X5000HUA, Córdoba, Argentina.

Background
The body of knowledge in biology, particularly in virology and immunology, is increasing in volume and complexity. It would therefore be useful to have this knowledge represented in a formal language inside a knowledge base. Different methodologies for analysis and manipulation could then be developed, allowing validity checks to be performed on conclusions obtained in experiments. FuDePAN's Logic processor (FuL) (http://ful.googlecode.com) is being developed to organize, interpret, verify and explore knowledge in molecular biology, applied to virology and immunology in particular. This will help find inconsistencies and automatically derive new information. Its main function will be the verification, by means of queries, of conclusions drawn from experimental results.
Our initial test case will be the following conclusion obtained from experiments done by FuDePAN:
• Validate the conclusions obtained in the Junin experiment about the effects of temperature change on the virus secondary structure: corroborate the line of reasoning that includes the predicted effects of the febrile state on the Junin RNA secondary structure, in which it is hypothesized that a temperature increase reduces the production of nucleoproteins because the hairpin loop in the intergenic region presents dissimilar characteristics when compared on the two ambisense genome strings at increased temperature.

FuL has a plug-in architecture, which simplifies the inclusion of new kinds of reasoning services. An API will be provided that defines the way in which knowledge flows between the plug-ins and FuL's core reasoning engine. An SDK composed of the libraries and tools required for building plug-ins will also be made available. The kernel of the tool will be composed of a planner that can handle PDDL (Planning Domain Definition Language) input, and a knowledge manager that will be the interface between the planner and the plug-ins registered in a given session. Via an XML file, it will be possible to register the plug-ins that FuL will use in that session and to configure different session parameters. FuL will include a semantic reasoner for Description Logics (DL) as one of the plug-ins, and we will also provide a knowledge representation language for the virology domain based on DL. This language will allow the development of an ontology of virology knowledge that will be available for querying during a FuL session.

References
1. Franz Baader, Deborah L. McGuinness, Daniele Nardi, Peter F. Patel-Schneider: The Description Logic Handbook: Theory, implementation, and applications.
2. The Seventh International Planning Competition: Description of Participant Planners of the Deterministic Track, 2011.
www.plg.inf.uc3m.es/ipc2011-deterministic/ParticipatingPlanners
3. Daniel Gutson, Agustín March, Maximiliano Combina, Daniel Rabinovich: Prediction of consequences of the febrile status on the RNA secondary structure of the Junín Virus, 2006. www.fudepan.org.ar/node/71

The relation between the divergence of sequence and structure in intrinsically disordered proteins

Nicolás Palopoli1, Juliana Glavina1, Ignacio Enrique Sánchez1
1 Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina

Introduction
It has long been accepted as a general rule that the structural dissimilarity between two globular proteins increases as their sequences depart from one another [1]. In recent years we have become aware that most proteomes have a significant percentage of proteins with intrinsically disordered regions [2]. Very little information is available on evolutionary sequence-structure relationships in these regions, although they are known to display specific sequence patterns and to show a non-random structure in spite of being very flexible. Here we present a computational assessment of the interplay between sequence and structure in intrinsically disordered regions of proteins.

Methods
We have focused our studies on the Papillomavirus E7 protein family. These proteins usually display an N-terminal disordered domain (E7N) and a C-terminal globular domain (E7C), allowing for a fair comparison of the relationship between sequence patterns and different structural descriptors. We represent the degree of sequence conservation through the information content of each site. The intrinsically disordered regions, possible binding segments, secondary structure and tendency to aggregate were predicted for all E7 sequences using the one-dimensional models of the disordered domain IUPred [3], ANCHOR [4] and Tango [5].
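The information content of an alignment site, mentioned in the Methods above, can be sketched in Python as follows. This is a minimal illustration and not necessarily the calculation used by the authors: it assumes a 20-letter amino acid alphabet with a uniform background, ignores gap characters, and applies no small-sample correction. The function name is hypothetical.

```python
import math
from collections import Counter


def information_content(alignment, alphabet_size=20):
    """Per-column information in bits: log2(alphabet_size) minus Shannon entropy.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    """
    max_bits = math.log2(alphabet_size)
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column made only of gaps carries no information.
            result.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        result.append(max_bits - entropy)
    return result
```

Under these assumptions, a fully conserved column scores log2(20), about 4.32 bits, and a column split evenly between two residues scores one bit less.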
The average solvent accessible surface area of a residue, backbone dihedral angle propensities and radius of gyration were predicted using ensembles of structural models based on local structural propensities, as implemented in ProtSA [6] and Flexible-meccano [7].

Results
We have calculated the relationship between the degree of sequence divergence and the variability in different structural parameters for every pair of E7 proteins in our dataset and for each position in their alignment. We have assessed the similarities of E7N and E7C by determining how well the observed sequence patterns and structural features in each domain correlate with different descriptors of their degree of disorder. It seems that sequence conservation in E7 is not highly dependent on the degree of disorder. In contrast, any two E7N domains show much larger differences in disorder than the corresponding globular E7C domains at the same level of sequence identity, in agreement with previous results obtained through in silico simulated mutations [8].

Conclusion
We have been able to describe evolutionary sequence-structure relationships for intrinsically disordered regions of a model protein. These relationships seem to differ from those in globular domains, and also between regions of a disordered domain with different functional properties.

Acknowledgements
N.P. is a postdoctoral fellow in the PhasIbeAm project. J.G. is the recipient of an Instituto Nacional del Cáncer graduate fellowship. I.E.S. is a CONICET researcher.

References
1 Chothia C, Lesk AM. EMBO J. 1986, 5(4):823-6.
2 Oldfield CJ, Cheng Y, Cortese MS, Brown CJ, Uversky VN, Dunker AK. Biochemistry. 2005, 44(6):1989-2000.
3 Dosztányi Z, Csizmok V, Tompa P, Simon I. Bioinformatics. 2005, 21(16):3433-4.
4 Dosztányi Z, Mészáros B, Simon I. Bioinformatics. 2009, 25(20):2745-6.
5 Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L. Nat Biotechnol. 2004, 22(10):1302-6.
6 Estrada J, Bernadó P, Blackledge M, Sancho J. BMC Bioinformatics.
2009, 10:104.
7 Ozenne V, Bauer F, Salmon L, Huang JR, Jensen MR, Segard S, Bernadó P, Charavay C, Blackledge M. Bioinformatics. 2012, 28(11):1463-70.
8 Schaefer C, Schlessinger A, Rost B. Bioinformatics. 2010, 26(5):625-31.

1D model of the pulse wave along the systemic arteries

Saavedra Fresia Cecilia E.1, Menzaque Fernando E.2
1 Faculty of Exact Sciences and Technology, National University of Tucuman, Tucuman, Argentina
1 Faculty of Biochemistry, Chemistry and Pharmacy, National University of Tucuman, Tucuman, Argentina
2 Faculty of Astronomy, Mathematics and Physics, National University of Cordoba, Cordoba, Argentina

Materials and methods
Blood is the fluid that circulates throughout the body via the circulatory system, which consists of the heart and blood vessels. The blood follows two complementary circuits, the pulmonary circulation and the systemic circulation. In the systemic circulation, arteries are responsible for carrying oxygenated blood and nutrients to the other parts of the body: organs, tissues and muscles. The heart pumps blood throughout the body in consecutive stages. First the atria fill and then contract, the valves open and blood enters the ventricles. When full, the ventricles contract and push blood into the arteries, which are thick and elastic vessels. Each ventricular contraction causes a distension of the initial portion of the aorta, which propagates downstream as a wave along the systemic arteries. The aim of this work is to propose a one-dimensional simplification of the three-dimensional model describing the behavior of the pulse wave. The one-dimensional model applies the Navier-Stokes equations for a Newtonian fluid to an elastic artery; these describe the motion of the fluid in a given artery, the movement of its walls and the interaction between the fluid and the walls (momentum and continuity equations).
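For orientation, one commonly used form of the one-dimensional continuity and momentum equations for flow in an elastic vessel is written out below. The abstract does not give its equations, so this is an assumed textbook form, not necessarily the authors' exact model: it takes a flat velocity profile and neglects friction, and the symbols A (cross-sectional area), Q (volumetric flow), p (pressure), ρ (blood density), x (axial position) and t (time) are introduced here for illustration.

```latex
\begin{aligned}
\frac{\partial A}{\partial t} + \frac{\partial Q}{\partial x} &= 0, \\
\frac{\partial Q}{\partial t}
  + \frac{\partial}{\partial x}\!\left(\frac{Q^{2}}{A}\right)
  + \frac{A}{\rho}\,\frac{\partial p}{\partial x} &= 0 .
\end{aligned}
```

The system is closed by a wall law relating p to A, which encodes the elasticity of the artery; its specific form is not stated in the abstract.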
It is further assumed that the arteries have a circular cross-section, that the flow is axisymmetric, that the velocity profile is flat, and that the larger arteries form a binary tree of vessels containing an incompressible and frictionless Newtonian fluid. To solve the proposed nonlinear model, the two-step Lax-Wendroff finite difference method was used.

Conclusion
The computational results show that the 1D model is a feasible means of determining the flow in large arteries.

Reference
1. John, LK; Li, J., Dynamics Of The Vascular System, World Scientific Publishing Co. Pte. Ltd., 2004.
2. Keener, J.; Sneyd, J., Mathematical Physiology, Springer-Verlag, New York 1998.
3. Ottesen, J.; Olufsen, M.; Larsen, J., Applied Mathematical Models in Human Physiology, Society for Industrial and Applied Mathematics 2004.

One vs One Artificial Neural Network strategy for gene expression multiclass classification

Remón L1, Juárez L1, Arab Cohen D1, Fresno C1,2, Prato LB3, Villoria LN3, Fernandez EA1,2
1 BioScience Data Mining Group, Facultad de Ingeniería, Universidad Católica de Córdoba
2 CONICET.
3 Instituto A.P. de Ciencias Básicas y Aplicadas, Universidad Nacional de Villa Maria

Molecular signatures are sets of genes that could be used to diagnose or classify the disease status of subjects. Because a large number of samples is needed and/or the classes have overlapping characteristics in the feature space, building a successful diagnostic tool is still an aspiration [1]. Artificial Neural Networks (ANN) have not been used extensively in gene expression signature classification, basically because of the "curse of dimensionality" problem, where the number of variables (genes) is greater than the number of samples (subjects). ANNs usually solve multiclass problems by setting up a large structure with at most as many output neurons as there are classes in the domain.
This implies adjusting a great number of weights, which in essence requires many samples for the algorithm to converge [2]. By means of a "divide and conquer" strategy, one can split a complex problem into several "easier" problems. One of these strategies is One vs One classification through binary classifiers. For K>2 classes, this implies solving K(K-1)/2 binary classification problems. Here we present some preliminary results on solving multiclass gene expression signature classification through K(K-1)/2 binary ANNs with a voting scheme for class prediction, called OVONN. The proposed methodology was tested on 3 gene expression databases preprocessed as in [6]. For each database, those genes showing a standard deviation greater than 95 percent were selected as predictor variables. Table 1 shows the performance of OVONN compared to the traditional ANN approach with as many output nodes as classes. The models were cross-validated by a Leave One Out by Class strategy, and the number of hidden units was optimized in each case.

Table 1: Percentage prediction error statistics

Data set         Model    Min     1Q      Median   Mean    3Q      Max
NCI60 [4]        ANN      12.5    25.0    25.0     42.5    75.0    75.0
NCI60 [4]        OVONN    0.0     0.0     12.5     10.0    12.5    25.0
9 Tumors [5]     ANN      25.0    25.0    25.0     27.1    25.0    37.5
9 Tumors [5]     OVONN    0.0     15.6    25       22.9    34.38   37.5
11 Tumors [5]    ANN      0.0     13.6    27.3     22.4    34.1    36.4
11 Tumors [5]    OVONN    0.0     2.3     9.1      7.6     9.1     18.2

1Q and 3Q: first and third quartile.

Table 1 shows that the traditional approach suffers strongly from the curse of dimensionality. Our approach, in contrast, outperformed it, requiring fewer samples to reach a stable solution with good performance. The solutions reached by OVONN were very stable across training data sets. The proposed approach could bring ANNs back into the classification arena, providing a new competitive classification tool. An R library will be made available soon.

Keywords: multi-class classification, ANN, OVO.

References
1- Parker, Joel S. et al.
Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes, Journal of Clinical Oncology, 27(8):1160–1167
2- Ou G, Murphey L. Multi-class pattern classification using neural networks, Pattern Recognition, doi:10.1016/j.patcog.2006.04.041
3- Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, Meltzer P: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7(6):673–679.
4- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 2001, 98:10787–10792.
5- Dudoit S, Fridlyand J, Speed TP: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 2002, 97(457):77–87.
6- Tapia E, Ornella L, Bulacio P, Angelone L, Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics, http://dx.doi.org/10.1186/1471-2105-12-59

SVM Tree with Optimal Multiclass Partition applied to Gene expression signature classification

Pallarol M1, Arab Cohen D1, Fresno C1,2, Prato LB3, Fernandez EA1,2
1 BioScience Data Mining Group, Facultad de Ingeniería, Universidad Católica de Córdoba
2 CONICET.
3 Instituto A.P. de Ciencias Básicas y Aplicadas, Universidad Nacional de Villa Maria.

Abstract. Gene expression signatures are currently used to guide cancer therapy [1]. In many situations, they are expected to successfully diagnose several disease types. However, this is not usually possible, either because a large number of samples is needed or because the classes have overlapping characteristics in the feature space.
One of the main tools used for multiclass classification problems is the Support Vector Machine (SVM), under the well-known OVO and OVA strategies and, more recently, the tree-based approach. Most tree-based SVM classifiers split the multi-class space into several binary partitions, mostly by means of clustering-like algorithms. One of the main drawbacks of this approach is that the natural class structure is not taken into account. Furthermore, the same SVM parameterization is used for all partitions in the above-mentioned strategies. Here, we applied SVMTOCP (SVM tree optimal classification partition) [2], a new splitting methodology for K>2 multi-class problems. It builds a two-class problem for each node in the tree by looking for the input class combinations that produce the best SVM performance at that specific tree node. For node "i", this implies solving

$L_i = \eta \cdot \dfrac{K_i!}{r!\,(K_i - r)!}$ (1)

binary problems, where η=1 (0.5) for K odd (even) and r=[K/2]. Once the best solution is found at node "i", r classes are passed to the child nodes and the process is repeated until a leaf is reached. Although the training phase is expensive in time and computation, the proposed approach always produces a balanced tree and preserves the original class structure. The latter property is very important from a Data Mining point of view, because the solution reached makes it possible to identify which class combinations yield soft or hard margin solutions (tree nodes can have different kernel parameters) and automatically identifies which input classes are the most difficult to split. These are very important properties for data analysts who need to extract hidden knowledge from a multivariate database. The SVMTOCP and SVM OVO strategies were compared over three gene expression databases to classify tumor samples.
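To make the size of the search at each node concrete, the count in equation (1) can be evaluated with a few lines of Python. The sketch reads the equation as η times the binomial coefficient "K_i choose r", with r taken as the integer part of K_i/2; that reading of the typeset formula, and the function name, are assumptions.

```python
import math


def binary_problems_at_node(k):
    """Number of candidate two-class splits at a node holding k classes.

    Reads equation (1) as eta * C(k, r) with r = k // 2, where eta = 1 for
    odd k and eta = 0.5 for even k (each even split is otherwise counted twice).
    """
    r = k // 2
    n = math.comb(k, r)
    # C(k, k/2) is always even for even k >= 2, so integer halving is exact.
    return n if k % 2 else n // 2
```

Under this reading, a node holding 8 classes requires 35 candidate binary problems and a node holding 9 classes requires 126, which illustrates why the training phase is described as computationally expensive.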
In all cases SVMTOCP achieves many more “Hard Margin” (HM) solutions and fewer support vectors (SV) than the usual OVO approach, with no statistical difference in performance. Solutions with fewer SVs and hard margins suggest a more robust classification strategy that needs fewer samples to reach efficient solutions. These are desirable properties for genomic applications, where samples are scarce.

Table 1: Characteristics of the data sets used and classification performance for both strategies

                                        SVMTOCP               SVM OVO
DB             Instances  #Classes    %PE    %HMs   %SV     %PE   %HMs   %SV
NCI60 [4]      61         8 (5,9)     **25   **43   **78    17    0      95
9 Tumors [5]   58         8 (6,9)     **0    **36   **87    15    0      96
SCBR [6]       63         4 (8,23)    0      **33   **58    0     0      67

** p<0.01. #Classes (minimum, maximum samples per class), %HMs: % of Hard Margin solutions, %SV: % of support vectors used to build the solution, %PE: % Prediction Error. The data sets were preprocessed as in [6], and the genes showing higher variance were selected.

Keywords: multi-class classification, SVM, Binary Tree.

References
1- Parker, Joel S. et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes, Journal of Clinical Oncology, 27(8):1160–1167
2- Arab Cohen, D, Fernandez EA. SVMTOCP: A binary tree base SVM approach through optimal multi-class binarization. In 17th Iberoamerican Congress on Pattern Recognition CIARP 2012. Eds: León LA, Déniz, Mejail ME, Jacobo J.
3- Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, Meltzer P: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7(6):673–679.
4- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 2001, 98:10787–10792.
5- Dudoit S, Fridlyand J, Speed TP: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 2002, 97(457):77–87.
6- Tapia E, Ornella L, Bulacio P, Angelone L, Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics, http://dx.doi.org/10.1186/1471-2105-12-59

Web-based gene-expression analysis using the plant biology analysis tools: GENEVESTIGATOR

María Gabriela Acosta2; Miguel Ángel Ahumada1; Sergio Luis Lassaga3; Víctor Hugo Casco1,2
1 Cátedra de Biología, FCA-UNER, Ruta 11 km 10, Oro Verde, Argentina.
2 LAMAE - FI-UNER, Ruta 11 km 10½, Oro Verde, Argentina.
3 Cátedra de Genética y Mejoramiento Vegetal, FCA-UNER, Ruta 11 km 10, Oro Verde, Argentina.

Background
The GENEVESTIGATOR microarray database and expression meta-analysis engine was developed to perform gene-expression analysis. Its results are derived from thousands of systematically annotated and normalized microarray experiments [1]. To search for genes encoding receptor kinases capable of activating AT5G01830, an armadillo-repeat (ARM-repeat) protein, we used this bioinformatics tool to find protein kinases co-expressed with AT5G01830. Additionally, we analyzed the expression of AT5G01830 in floral tissue under different abiotic stresses using the GENEVESTIGATOR perturbations tool, which displays the response of genes to a wide variety of conditions.

Results
In the present report, we used the ATH1 array (22k array, ss7176) to visualize gene expression across arrays from a pre-selected set of experiments. The main goal was to show the expression intensity of a gene list across the selected arrays.
The anatomy tool, under the condition search toolset, made it possible to quickly find out how strongly AT5G01830 is expressed in different tissues and under different stress conditions (hormonal and saline). We also used the co-expression tool to find genes whose expression profile is closest to that of our target gene, AT5G01830 (black spot in Figure 1). Using this tool, we detected one protein kinase (white spot 1 in Figure 1), out of five candidates present in the Arabidopsis genome, as a possible activator of putative E3-ubiquitin ligases such as AT5G01830. Using GENEVESTIGATOR, we did not detect expression of the gene at the floral stage of development under normal growth conditions, in agreement with our sqRT-PCR results.

Figure 1

Conclusions
GENEVESTIGATOR is a high-performance search engine for gene expression. Its tools are highly efficient at detecting gene expression, helping to confirm or simplify molecular biology experiments. In this study, it reduced the number of gene targets encoding protein kinases from five to one, lowering reagent costs and working time.

Reference
1. Zimmermann P, Hirsch-Hoffmann M, Hennig L and Gruissem W: GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol 2004, 136: 2621–2632.

Agi4x44.2c: a two-colour Agilent 4x44 Quality Control R library for large microarray projects

González GA1, Fresno C1,2, Merino G4, Llera A2,3, Podhajcer O2,3, Fernández EA1,2
1 Grupo de Minería de Datos en Biociencias, Facultad de Ingeniería, Univ. Cat. de Córdoba
2 CONICET
3 Laboratorio de Terapia Celular y Molecular, Instituto Leloir
4 Facultad de Ingeniería, UNER

Microarrays remain the most widely accepted tool in transcriptome studies that require the analysis of a large number of samples [1,2], and in several prospective international efforts currently under way, particularly in cancer.
One of the biggest challenges of these large-scale studies lies in the simultaneous evaluation of hundreds of arrays for quality control (QC), so as to discard, semi-automatically, those that do not meet the minimum quality requirements. Currently, the most widely used tools for quality control of two-colour Agilent 4x44 microarrays are Feature Extraction (FE) [3] and QC Chart Tool [5], both developed by Agilent. FE creates a PDF report for each array containing several quality measures; users must then manually review each report to find problematic arrays, which is very time consuming. QC Chart Tool, on the other hand, only shows line graphs of the FE quality metrics and does not allow users to explore intensity distributions, spatial patterns, etc. Here we present Agi4x44.2c, a new QC R library [4] that overcomes the limitations mentioned above. It facilitates global quality control by allowing users to quickly compare all arrays at a glance. Furthermore, unlike other Bioconductor packages (such as Agi4x44PreProcess [6] and arrayQualityMetrics [7]), Agi4x44.2c includes QC tools specific to the two-colour Agilent 4x44 platform, for a more complete and comprehensive analysis.

Figure 1. Some of the plots created by Agi4x44.2c. (a) False color image of raw intensities in the green channel. Abnormal patterns can be seen in the first 3 chips. Boxplot (b) and M vs A plot (d) show that the fourth array is different from the other three. Metrics plot (c) shows a summary of Agilent quality metrics from PDF reports. In this case we can see that the first three chips (F1-F3) are problematic.

References
[1] Yu J, Yu J, Cordero KE, et al. A transcriptional fingerprint of estrogen in human breast cancer predicts patient survival. Neoplasia. 2008;10(1):79–88.
[2] Curtis C, Shah OP, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012. DOI: 10.1038.
[3] Agilent. Agilent Feature Extraction Reference Guide, 2007.
[4] R Development Core Team.
R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria, 2005. ISBN 3-900051-07-0.
[5] Agilent QC Chart Tool. http://www.genomics.agilent.com/files/Manual/G4460-90022_QC_Chart_User.pdf
[6] Lopez-Romero P. Agi4x44PreProcess: PreProcessing of Agilent 4x44 Array Data. 2011. http://www.bioconductor.org/packages/release/bioc/html/Agi4x44PreProcess.html
[7] Kauffmann A, Huber W. Quality assessment with arrayQualityMetrics. 2009. http://bioc.ism.ac.jp/2.4/bioc/vignettes/arrayQualityMetrics/inst/doc/arrayQualityMetrics.pdf

DIGESuite: a Cytoscape plug-in for 2D-DIGE analysis

Talesnik T1, Mishima JM1, Fresno C1,2, Semrik M1, Ribero G3, Merino G5, Laura B. Prato3, Llera AS2,4, Fernández EA1,2
1 Grupo de Minería de Datos en Biociencias, Facultad de Ingeniería, Univ. Católica de Córdoba
2 CONICET, Argentina
3 Instituto Académico Pedagógico de Ciencias Básicas y Aplicadas, Universidad Nacional de Villa María
4 Laboratorio de Terapia Molecular y Celular, Fundación Instituto Leloir, Buenos Aires.
5 Facultad de Ingeniería, UNER

Background: biomedical companies usually offer proprietary black-box software associated with their machinery. In this context, the user cannot check whether the data meet the model assumptions, nor apply alternative approaches. Furthermore, user interfaces are very restricted and do not allow the user to extend the analysis. The GE Healthcare software Decyder® for two-dimensional difference gel electrophoresis (2D-DIGE) [1] is no exception; with this technology a global view of the state of a proteome can be obtained by examining up to three labeled samples on a two-dimensional gel. The aim of this technology is to detect spots (of a priori unknown proteins) that show a statistically significant expression difference between experimental conditions.
In this context, it is crucial to include proper visualization during pre-processing steps such as spot filtering and normalization, prior to differential expression analysis. To overcome these limitations we propose DIGESuite, a plug-in for 2D-DIGE protein expression analysis that extends the well-known and flexible bioinformatics visualization tool Cytoscape [2].

Figure 1: DIGESuite screenshots. a) plug-in control panel, b) gels, c) spot boxplots, d) linear-mixed model specification, e) differentially expressed spot selection, f) R console

Methods: the plug-in uses a client-server topology, where Cytoscape provides the graphical front end and the statistical engine R [3] works as the back end. Decyder® raw/normalized volume data files are displayed for each gel image using Cytoscape capabilities (see Figure 1). The user can easily filter problematic spots (saturated or dusty ones), check the distribution of protein spots using boxplots and use normalization alternatives such as two-stage linear mixed models [4]. If necessary, the user can even open an R terminal to adjust the data at will. Once the data are normalized, a user-friendly interface lets the user specify the linear or mixed model for differential expression analysis, instead of the Decyder® one/two-way ANOVA. Differentially expressed spots are automatically highlighted in the gel, according to the user's significance threshold (raw/adjusted p-values and/or fold-change). The plug-in also provides a complete report of the processing steps applied, as well as the location of the spots to pick for MS protein identification. Furthermore, additional Cytoscape plug-ins can be included in the analysis according to the user’s needs.

Conclusion: as far as we know, there is no freely available tool that allows the analysis of protein data in a consistent and flexible manner.
DIGESuite can be used after Decyder® image analysis has been carried out, allowing flexible filtering, normalization, differential expression analysis and spot information for MS protein identification. This tool can also be used jointly with other Cytoscape plug-ins to further extend protein expression analysis.

References
[1] Viswanathan S, Unlü M, Minden JS: Two-dimensional difference gel electrophoresis. Nat Protoc. 2006, 3:1351-8.
[2] Shannon P, et al.: Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003, 13: 2498-2504
[3] R Development Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2009, ISBN 3-900051-07-0
[4] Fernández EA, et al.: Improving 2D-DIGE protein expression analysis by two-stage linear mixed models: assessing experimental effects in a melanoma cell study. Bioinformatics 2008, (23):2706-2712

Characterization of long interspersed non-LTR elements in section Arachis

Sebastián Samoluk1, Diego Carisimo1, Germán Robledo1,2, Guillermo Seijo1,2
1 Instituto de Botánica del Nordeste, Corrientes, Corrientes, CP 3400, Argentina
2 Facultad de Ciencias Exactas y Naturales y Agrimensura (Universidad Nacional del Nordeste), Corrientes, Corrientes, CP 3400, Argentina

Abstract
Section Arachis (genus Arachis, Leguminosae) is composed of 29 wild diploid species belonging to five different genomes (A, B, D, F and K) and two allotetraploid species (AABB). Experiments based on molecular mapping and genome in situ hybridization suggested that changes in the repetitive fractions may have been a main force leading to genomic differentiation in Arachis.
To test this hypothesis, degenerate primers were designed to isolate and characterize a conserved region of the reverse transcriptase gene from long interspersed non-LTR elements (LINEs) from eight species representing five different genomes of Arachis. The 37 isolated clones showed the conserved amino acid motifs characteristic of the reverse transcriptase of LINEs. These sequences were compared by the pairwise method and a Neighbour-Joining tree was constructed using the program MEGA, version 5 [1]. Even though the nucleotide alignment showed high interspecific nucleotide divergence, the deduced amino acid sequences showed high percentages of similarity. Nineteen sequences had stop codons, and frameshifts had to be introduced in the reading frame of some sequences to optimize the alignment. Amino acid sequences from other angiosperms and gymnosperms with homology to the reverse transcriptase of the LINEs isolated from Arachis were recovered from public databases using the BLASTx tool [2] and incorporated into the tree. The topology of the tree showed that the sequences isolated from Arachis grouped into a single cluster without species-specific subclusters. On the other hand, all the sequences recovered from public databases (which included some from legume species) grouped into another, well-separated cluster. The sequences in the latter cluster had much deeper branches than those observed in the cluster of Arachis sequences.

Conclusion
From these results, we concluded that the diversification of LINEs is relatively recent in Arachis and that it may have occurred before the differentiation of the genomic groups present in section Arachis.

References
1. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S (2011) MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol. Biol. Evol.
28: 2731-2739
2. Altschul, S.F., W Gish, W Miller, E Myers & D J Lipman (1990) "Basic local alignment search tool" J. Mol. Biol. 215:403-410

MSA2MI: A server to calculate and visualize mutual information in multiple sequence alignments

Franco Simonetti1, Morten Nielsen2 and Cristina Marino Buslje1
1 Fundación Instituto Leloir. CABA. Argentina.
2 Center for Biological Sequence Analysis. DTU. Denmark.

Background
Multiple Sequence Alignments (MSAs) of homologous proteins carry at least two levels of information. One is given by the amino acid frequencies observed at each position of the MSA, and the other is given by the relationship between two or more positions. The first is known as conservation and the second can be studied in terms of co-variation between positions. The extent of the mutual co-evolutionary relationship between two positions in a protein family can be estimated using Mutual Information (MI) [1]. An algorithm was developed for this task by Marino Buslje et al. [2] and is here made publicly available as a web tool.

Results
We present a web toolkit that allows users to calculate and visualize the MI between residues in an MSA. The web service was developed using PHP on the server side, with JavaScript and Flash on the client side. The pipeline was implemented as modules, making the addition of new features easy. The main task is to calculate the MI between all pairs of columns in the MSA. The output is displayed as an MI network using Cytoscape Web [3], where each node corresponds to a column in the MSA and edges between nodes represent significant MI values [2] (Figure 1). Several parameters can be set in order to calculate and present the data. For example, if the structure of the protein is known, structural data can be displayed by adding the PDB numbering scheme to the nodes and distance information to the edges (Figure 1).
Also, node coloring can be set to match different attributes, such as conservation value (Figure 1). Additionally, by clicking each node the relative frequency of different amino acids for this position is shown. Results can be downloaded for further user manipulation, which include MI and conservation data in raw format and network files to load on Cytoscape's desktop version. Conclusions This web toolkit allows the study of protein families through a simple and interactive interface, utilizing sequence based data such as conservation, coevolution and amino acid composition and capable of mapping structural data when available. Available at: www.leloir.org.ar/MSA2MI Figure 1. Mutual Information network rendered using Cytoscape Web. Node color represents the conservation value (red to blue, higher to lower score). Mutual Information edges are shown as solid lines while distance edges are shown as dashed lines. The right panel displays information about the nodes and edges selected. Filters can be applied to desired value. The network layout can be modified changing distance and MI threshold values and its output exported in different file formats. [1] L. C. Martin, G. B. Gloor, S. D. Dunn, and L. M. Wahl, “Using information theory to search for co-evolving residues in proteins.,” Bioinformatics (Oxford, England), vol. 21, no. 22, pp. 4116-24, Nov. 2005. [2] C. M. Buslje, J. Santos, J. M. Delfino, and M. Nielsen, “Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information.,” Bioinformatics (Oxford, England), vol. 25, no. 9, pp. 1125-31, May 2009. [3] C. T. Lopes, M. Franz, F. Kazi, S. L. Donaldson, Q. Morris, and G. D. Bader, “Cytoscape Web: an interactive web-based network browser.,” Bioinformatics (Oxford, England), vol. 26, no. 18, pp. 2347-8, Sep. 2010. 
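The core quantity in the MSA2MI abstract above, the MI between two alignment columns, can be sketched as follows. This is a minimal illustration of plain MI computed from observed frequencies; it does not include the corrections for phylogeny, low counts and redundancy described in [2], it treats gap characters as ordinary symbols, and the function name and toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column_mi(col_a: str, col_b: str) -> float:
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b hold one residue per sequence, in the same sequence order.
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Toy alignment (hypothetical): one string per sequence.
msa = ["AKLD", "AKLE", "GRLD", "GRLE"]
columns = ["".join(seq[i] for seq in msa) for i in range(len(msa[0]))]
print(column_mi(columns[0], columns[1]))  # columns 0 and 1 co-vary perfectly
print(column_mi(columns[0], columns[3]))  # columns 0 and 3 vary independently
```

In a network view such as the one described above, only column pairs whose score passes a significance threshold would be drawn as edges; the threshold itself is a parameter of the tool and is not reproduced here.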
14-3-3 isoforms subfunctionalization revealed by systems biology analysis of cross-talk between phosphorylation and lysine acetylation

Marina Uhart, Diego M Bustos
Laboratorio de Biología Estructural y Celular de Modificaciones post-traducción. INTECH, Int. Marino Km 8.2 - Chascomus - Argentina

Advances in quantitative mass spectrometry-based proteomics now enable the system-wide characterization of signaling events at the level of post-translational modifications, protein-protein interactions and changes in protein expression. The 14-3-3 proteins interact with more than 800 different proteins, in part as a result of their specific phospho-serine/phospho-threonine binding activity (RSXpS/TXP, RXXXpS/TXP and pS/T-X(1-2)-COOH). The family is composed of 2 paralogs in yeast, 7 in mammals, and up to 15 in plants. Upon binding to 14-3-3, the stability, subcellular localization and/or catalytic activity of the ligands are modified. 14-3-3 can hide intrinsic localization motifs, prevent molecular interactions and/or modulate the accessibility of a target protein to modifying enzymes such as kinases, phosphatases or proteases. The extraordinarily high sequence conservation between 14-3-3 protein isoforms poses a significant technological challenge to researchers working with this family. A systems-level approach is necessary to map the components of the 14-3-3 network and to understand their functions. We used different databases to create a PPI (protein-protein interaction) network for 14-3-3 signaling in human cells. We also added kinases and their substrates published in the HPRD database for human cells, including information about phosphorylation and Lys acetylation sites. Finally, we transformed this undirected network of ~5000 nodes into a directed one, obtaining a complete, high-resolution representation of the 14-3-3 binding partners and their modifications.
Using a computational systems approach we found that the networks of each isoform are statistically different (Jaccard index < 0.25) and built from different sets of 3-node motifs (p < 0.005), with different structural stability. A feed-forward loop motif (# 7, SSS=1) is present in the gamma, zeta and eta networks. This motif has been detected within the transcription-regulation networks of E. coli and S. cerevisiae. At the level of signal transduction networks, this motif could represent the scaffold function, where a protein (in this case 14-3-3) facilitates the interaction between two other proteins (one of which regulates the other). Another feature that differs between the isoform-specific networks is the intrinsic disorder content (p = 2.044e-09, Kruskal-Wallis test), promoting distinct levels of wired interactome. This difference in the percentage of disorder is reflected in the size, number and co-appearance of domains and domain clubs in each partner of the 14-3-3 network isoforms, suggesting their participation in different signaling pathways. It was remarkable to find that Tyr was the most phosphorylatable amino acid in domains of 14-3-3 epsilon partners. This, together with the over-representation of SH3 and Tyr_Kinase domains, suggests that epsilon could be involved in growth factor receptor signaling pathways. Finally, we found that within zeta's network the number of acetylated partners is significantly higher (Fisher exact test) than in each of the other isoforms, with p values from 1.65e-10 for sigma, the least similar, to 0.0024 for gamma, the most similar to the zeta isoform. The number of acetylated Lys is not proportional to the number of domains (or the number of amino acids in domains). In the case of the zeta isoform, the domains of its partners contain more modified Lys than those of all other 14-3-3 paralogs.
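The Jaccard index used above to compare isoform networks is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the partner sets are made-up placeholders rather than data from this study, and whether the compared sets were nodes or edges is not stated in the abstract.

```python
def jaccard_index(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical binding partners of two isoforms (placeholder identifiers).
partners_x = {"P1", "P2", "P3", "P4", "P5"}
partners_y = {"P5", "P6", "P7", "P8"}
print(jaccard_index(partners_x, partners_y))
```

With one shared partner out of eight distinct ones, the index is 0.125, which would fall below the 0.25 level quoted above.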
Also, an analysis of the subcellular localization of those zeta partners that are acetylated (48%) shows that 42% are mainly nuclear, containing 60% of all Nuclear Localization Signals present in partners of this isoform (p = 1.288e-06, Fisher exact test). Lys acetylation correlates with pTyr but not with pSer or pThr, suggesting crosstalk between these two kinds of PTM. Our analysis also shows a clear subfunctionalization among members of the 14-3-3 family through differential PTMs.

Theoretical studies of membranes at different thermotropic phases in salt solutions by molecular dynamics

Fernando E. Herrera1, M. de los Milagros Sales1, Daniel E. Rodrigues1,2
1 Área de Modelado Molecular, Laboratorio de Biomembranas, Departamento de Física, Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral, Santa Fe, Argentina.
2 INTEC, (UNL+CONICET), Argentina.

Biological membranes are very complex systems, since their structural and dynamic characteristics are affected by conditions such as temperature or the ionic composition of the aqueous buffers around them. The temperature determines the thermotropic phase and ordering of the lipids, such as the ordered Gel (G) or the Liquid Crystalline (LC) phase. The ionic concentration, on the other hand, affects membrane fluidity. Therefore, theoretical studies of the interplay between these two factors are necessary to understand the molecular mechanisms of their interactions. Molecular dynamics has proven to be a reliable tool to study biomembranes in detail. In this context, we performed molecular dynamics simulations of hydrated DPPC (dipalmitoylphosphatidylcholine) bilayers in the Gel (T=22°C) and Liquid Crystalline (T=50°C) phases, at different ionic concentrations of NaCl, in order to rationalize the effect of ionic strength on different thermotropic phases of the same system.
In this work, we developed several tools to analyze in detail the structural and dynamical properties affected by the ionic concentration (area per lipid, atomic density profiles, 2D maps of thickness fluctuations, ion depth profiles, ion solvation depth profiles, diffusion coefficients, etc.). The results, which agree with previous reports, show that ionic absorption and the effects of the ions on many membrane properties depend primarily on the phase of the membrane. The area per lipid and the diffusion coefficients in the LC phase are reduced when the ionic concentration increases, while they remain unchanged in the G phase. Additionally, the bilayer thickness in the LC phase was found to increase with the salt concentration. Furthermore, the absorbed Na ions interact mainly with the carbonyl oxygens in both phases (see Figure 1). This work emphasizes that salt concentration and temperature are important factors to take into account in the design of any kind of experiment.

Figure 1: Lipid interactions in both thermotropic phases.

The Comparisons of Sequences with the Nucleotide Database (NCBI) and the BLAST tool. What information can we obtain?

Victoria E. Firmenich, M. Eugenia Fernández Feijóo and María B. Espinosa.
Fundación PROSAMA, CONICET. Paysandú 752, C1405ANH. Ciudad Autónoma de Buenos Aires, Argentina.

The National Center for Biotechnology Information (NCBI, USA) provides an online database that comprises nucleotide sequences from the DNA of genes of microbes, plants, humans and several species used as reference. We conducted sequence analysis using the Basic Local Alignment Search Tool (BLAST). The BLAST tool allows the analysis of a nucleotide sequence obtained from PCR products. The data obtained by sequencing specific amplifications are analysed with BLAST. We performed studies of AMELX and Vkorc1 using this tool.
The genomic DNA used as template was prepared from tissue samples of small mammal species (Akodon azarae, Lagostomus maximus and Mus musculus). In this work we describe this methodology, which is useful for assessing nucleotide sequences. After sequencing a PCR amplification product, the sequence should be stored in a .doc file: …TTAGGTTAGGGCTAAG…. This file of letters representing the four DNA bases is the nucleotide "query" used to find matches in the official database. Then we choose an organism against which to perform the sequence alignment. So far, the organisms whose genomes are known and have been assembled in the NCBI databases for BLAST are human, rodents, flowering plants (Arabidopsis thaliana), rice (Oryza sativa) and some others (Pan troglodytes, Danio rerio, Gallus gallus, Drosophila melanogaster, Apis mellifera and Bos taurus). The sequence of the PCR products for the amelogenin gene was obtained from DNA templates of A. azarae and L. maximus. The Amel sequences from Mus musculus and human were used for sequence analysis because those are the closest organisms in which the Amel gene has been described. The BLAST algorithm allowed us to determine that females of both species share an intronic sequence with the human amelogenin gene (M55418); the minimum identity was 73% in sequences 200 base pairs long. The Vkorc1 sequences obtained for the three exons from wild-type mice were compared with the NCBI Reference Sequence NT_039433.8, corresponding to Mus musculus strain C57BL/6J chromosome 7. Sequence lengths ranged from 168 to 311 base pairs. A maximum of 6 gaps was found, in 3% of the sequences. The identity between the sequence of the vitamin K epoxide reductase complex from Mus musculus strain C57BL/6J and that of the wild type ranged from 87 to 100%. No mutation in the Vkorc1 sequence was found in any of the 20 wild rodents analysed. These analyses show that the Vkorc1 exon sequence is well conserved in wild mice from the Buenos Aires population under study.
BLAST allowed us to study the amelogenin gene in species such as Akodon azarae and Lagostomus maximus, in which Amel was unknown, and also to study Vkorc1 from local wild populations of Mus musculus.

GOboot: towards a robust SEA analysis

Cristóbal Fresno1, Andrea S Llera2,3, María R Girotti2,3,a, María P Valacco2,3, Juan A López4, Laura Zingaretti5, Laura Prato5, Osvaldo Podhajcer2,3, Mónica G Balzarini2,6, Federico Prada7 and Elmer A Fernández1,2
1 BioScience Data Mining Group, Catholic University of Córdoba
2 CONICET
3 Laboratory of Molecular and Cellular Therapy, Leloir Institute, Buenos Aires
4 National Center for Cardiovascular Research, Madrid, Spain
5 Instituto A.P. de Ciencias Básicas y Aplicadas, Universidad Nacional de Villa Maria
6 Biometry Laboratory, National University of Córdoba
7 Institute of Technology, School of Engineering and Sciences, UADE, Buenos Aires
a Present address: The Institute of Cancer Research, London, UK

Keywords: Gene Ontology, background selection, Genomics, Proteomics.

Background
Set enrichment analysis (SEA) is the traditional approach for Gene Ontology (GO) analysis, owing to its long track record and its availability in commercial and public tools/websites [1-2]. In the GO structure, terms are statistically evaluated one at a time, and a term is declared enriched if the observed proportion of differentially expressed proteins/genes differs from the expected one when compared against a background reference (BR). The appropriate BR is difficult to devise and GO results tend to depend on it. In this sense, terms turn out enriched or not depending on the BR used. Here, a new method is presented to evaluate the enrichment robustness of nodes by means of bootstrap perturbations of the BR used. Thus, each node will have a “power score”: highly stable nodes are candidates to be explored, while spuriously enriched terms are left out of the analysis.
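The bootstrap "power score" idea can be sketched as follows, taking the score as the fraction of perturbed backgrounds in which a term remains enriched. This is only an illustrative sketch and not the GOboot implementation: it assumes a one-sided hypergeometric test as the enrichment test, resamples the BR with replacement while keeping duplicates so that its length is unchanged, and uses hypothetical function names, gene identifiers and thresholds.

```python
import random
from math import comb


def hypergeom_upper_tail(k: int, n_total: int, n_term: int, n_drawn: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(n_total, n_term, n_drawn)."""
    if k <= 0:
        return 1.0
    upper = min(n_term, n_drawn)
    if k > upper:
        return 0.0
    denom = comb(n_total, n_drawn)
    num = sum(comb(n_term, i) * comb(n_total - n_term, n_drawn - i)
              for i in range(k, upper + 1))
    return num / denom


def power_score(background, de_genes, term_genes, n_boot=500, alpha=0.05, seed=0):
    """Fraction of bootstrapped backgrounds in which the term is enriched."""
    rng = random.Random(seed)
    background = list(background)
    de_genes, term_genes = set(de_genes), set(term_genes)
    hits = 0
    for _ in range(n_boot):
        # Assumption: resample with replacement and keep duplicates, so the
        # simulated background has the same length as the original one.
        sample = rng.choices(background, k=len(background))
        n_term = sum(g in term_genes for g in sample)
        n_de = sum(g in de_genes for g in sample)
        k = sum(g in de_genes and g in term_genes for g in sample)
        if hypergeom_upper_tail(k, len(sample), n_term, n_de) <= alpha:
            hits += 1
    return hits / n_boot


# Hypothetical toy data: 200 background genes, 20 of them differentially expressed.
background = [f"g{i}" for i in range(200)]
de = background[:15] + background[100:105]
strong_term = background[:20]         # 15 of its 20 genes are DE
unrelated_term = background[150:170]  # none of its genes are DE
print(power_score(background, de, strong_term))
print(power_score(background, de, unrelated_term))
```

A term whose enrichment survives most perturbations gets a score close to 1, whereas a term with no differentially expressed members scores 0; borderline terms land in between, which is the instability the method is meant to expose.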
Methods
A resampling technique was implemented to provide a stability (power) measure of SEA, in order to evaluate the effectiveness of a given BR in identifying truly enriched terms. Simulated BRs were generated by bootstrapping a BR, keeping each simulated BR as close as possible in length to the original one (so as to introduce only small perturbations in the length of both the GO members and the BR). The power value was calculated as the percentage of times a term is enriched over a large number of simulated BRs. In this sense, higher power implies greater stability of the term. DAVID [3] was the tool chosen to test SEA in one proteomic experiment (Girotti et al., unpublished) and three microarray experiments freely available at Gene Expression Omnibus [4-6] under different BRs: the genome of the species (BR-I), the chip gene list (BR-II, if possible) and a user-defined reference (BR-III [7]). BR-III was the reference used for power calculation (although the method is not restricted to it), as it is considered the one that fulfills the statistical assumptions. Boxplots of the enriched terms of the main GO category (Biological Process) were plotted, using a Venn-diagram color pattern to contrast enrichment with typical BR selections (BR-I or BR-II).

Results
Figure 1 shows that the power boxplots of all enriched nodes (in white) are above 40% for most of the datasets. Almost all nodes found with BR-III reached power values above 50%. Meanwhile, nodes that appeared enriched when bootstrapping BR-III and had previously been found with BR-I, or shared by BR-I and BR-II, showed power values below 40% in all cases. This suggests that the enriched nodes found with BR-III were highly consistent and potentially meaningful. These enriched terms were validated against the literature.

Figure 1: Biological process power boxplots of bootstrapped enriched nodes, coded with the overlapping source of the full BR length (BR-I to BR-III).
Notice that “Joint” boxplot (in white) corresponds to the boxplot of all bootstrapped enriched nodes. BR-II BR-III Discussion By means of stability analysis it was shown that non-consensus nodes identified only with BR-I and/or BR-II are unstable, suggesting spurious enrichment. On the contrary, enriched terms found by BR-III showed high power suggesting more “confidence” (robustness) making these terms good candidates for further exploration. We found that “robust” terms where biologically relevant to the experimental setting [7]. In this context, the proposed tool provided additional information (power values) addressing ontology exploration and new unseen terms blurred by the traditional approaches, to assist researchers in ontology analysis. References *1+ P. Khatri, S. Drăghici, Ontological analysis of gene expression data: current tools, limitations, and open problems, Bioinformatics, 21, 3587-3595 (2005) [2] D. Wei Huang et al. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res., 37:1-13 (2009) [3] I. Rivals, L. Personnaz, L. Taing, M-C. Potier, Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics, 23, 401-407 (2007) [4] L. M. Packer et al. Gene expression profiling in melanoma identifies novel downstream effectors of p14ARF, Int. J. Cancer, 121, 784-790 (2007) [5] A. Spira et al. Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc. Natl. Acad. Sci. U. S. A., 101, 10143-10148 (2004) [6] S. McGrath-Morrow et al. Impaired lung homeostasis in neonatal mice exposed to cigarette smoke. Am. J. Respir. Cell. Mol. Biol., 38, 393-400 (2008) [7] C. Fresno, A. S. Llera, M. R. Girotti, M. P. Valacco, J. A. López, O. L. Podhajcer, M. G. Balzarini, F. Prada, E. A. Fernández, The Multi-Reference Contrast Method: facilitating set enrichment analysis, Comput. Biol. Med. 
42, 188-194 (2012)

Honeybees colony virtual simulation, step 2
Mario Migueles (1), Liesel Gende (2,3,4), Leonardo Defeudis (2), Pablo Macri (1,4), María Churio (3,4), Martín Eguaras (2,4), Lidia Braunstein (1)
(1) Instituto de Investigaciones Físicas de Mar del Plata (IFIMAR), Departamento de Física, FCEyN, Universidad Nacional de Mar del Plata, Mar del Plata, Buenos Aires, Argentina, 7600. (2) Laboratorio de Artrópodos, Departamento de Biología, FCEyN, Universidad Nacional de Mar del Plata, Mar del Plata, Buenos Aires, Argentina, 7600. (3) Departamento de Química, FCEyN, Universidad Nacional de Mar del Plata, Mar del Plata, Buenos Aires, Argentina, 7600. (4) CONICET, Buenos Aires, Argentina, C1033AAJ.
In eusocial insect colonies, such as those of honeybees, many tens of thousands of workers can live together as a regulated superorganism. These colonies are characterized by division of labour: the specialization of individual workers for particular tasks. Honeybees have developed collective food acquisition methods to provide themselves with nutrients. They split the food gathering task into a variety of subtasks performed by different individuals. Foragers search for food sources, collect food, and transport it to the nest, where it is processed and stored by other groups of workers. The aim of this work was to develop software, called BeEp, that establishes a causal relationship between honeybee age, task performance, population and food balance. We describe a novel multi-agent model (MAMS) that focuses on the dynamic task selection of honeybees. The behavior of the complete system was reproduced directly by simulating the actions of the individuals. Our simulation was intended to model all the important aspects of a bee's life inside the hive. We assume differentiation among castes (workers, drones, queen). The queen is the only one in charge of laying eggs in empty cells. Worker bees are physiologically and morphologically identical; we emphasize their differentiation according to age and, therefore, their activity in the colony. This includes individual development from egg to adult, and adults performing tasks such as brood tending (nursing), storage of nectar and pollen (storing) and collection of nectar and pollen (foraging). The software also considers the transformation of nectar into honey. Adult bees of all ages satisfy their energy demands by consuming stored nectar (honey) or by being fed by other adults. The larvae (brood) must be fed by nurse bees. We have simulated a honeybee colony of 2000 individuals in 4 frames for a period of 365 days, maintaining nutrition and population balance. The software consists of multiple parallel programs that run synchronized: GoReporter details the statistics of the simulation, and BeTV allows a simulation running on a computer with better operational characteristics to be followed from another computer. BeEp also generates an event monitor, which shows the progress of the simulation step by step. The software can run a colony simulation on multiple computers in parallel (clusters), which dramatically reduces simulation times. In parallel, we carried out experiments with real honeybees in mini colonies to validate the simulation. We plan to use our model for additional studies such as beekeeping epidemics, pollination and honey production.
Reference
1. Schmickl T, Crailsheim K: TaskSelSim: a model of the self-organization of the division of labour in honeybees. Mathematical and Computer Modelling of Dynamical Systems 2008, 14(2):101-125.
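The age-driven task allocation described in the honeybee abstract above can be illustrated with a toy agent loop. This is a deliberately simplified sketch and not BeEp: the only biological constant used is the roughly 21-day egg-to-adult development of a worker, while the lengths of the nursing and storing periods, the lifespan, and the constant egg-laying rate are placeholder assumptions chosen for the example. The names `task_for_age` and `simulate` are hypothetical.

```python
from collections import Counter

BROOD_DAYS = 21  # approximate egg-to-adult development time of a worker


def task_for_age(age_days, nurse_days=10, store_days=10):
    """Map an individual's age to its activity (placeholder thresholds)."""
    if age_days < BROOD_DAYS:
        return "brood"
    adult_age = age_days - BROOD_DAYS
    if adult_age < nurse_days:
        return "nursing"
    if adult_age < nurse_days + store_days:
        return "storing"
    return "foraging"


def simulate(days, eggs_per_day, lifespan_days=60):
    """Advance the colony day by day and return the final task census."""
    ages = []  # one entry per living individual
    for _ in range(days):
        # Every individual ages one day; those reaching the lifespan die.
        ages = [a + 1 for a in ages if a + 1 < lifespan_days]
        # The queen lays a fixed number of eggs per day (assumption).
        ages.extend([0] * eggs_per_day)
    return Counter(task_for_age(a) for a in ages)


census = simulate(days=100, eggs_per_day=10)
print(dict(census))
```

In a fuller model each agent would also carry energy and food state, so that nutrition balance feeds back on task choice; the sketch only shows how age alone can drive the division of labour.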

HMMerCTTer: Tailor-made Decision Making for the Semi-automatic Clustering of Large Protein Superfamilies
Hernán Gabriel Bondino (1,3), Inti Anabela Pagnuco (2), María Victoria Revuelta (1), Marcel Brun (2) and Arjen ten Have (1)
(1) Laboratorio de Biología Comparativa en Solanáceas, IIB-CONICET-UNMdP, Mar del Plata (7600). (2) Laboratorio de Procesamiento Digital de Imagenes, FI-UNMdP, Mar del Plata (7600). (3) Advanta Semillas SAIC, Centro de Investigación en Biotecnología, Balcarce (7620).
Keywords: Expert System, Structure-Function Prediction, Function Assignation, Protein Family, Protein Superfamily
Background The sheer number of protein sequences derived from public genome sequences provides many opportunities, but also challenges, to biologists. Many protein superfamilies appear to consist of various, sometimes unknown, subfamilies that are often difficult to distinguish. Computational analyses play an important role in what is referred to as function assignation, but they typically require specific biological knowledge, insight into the available biocomputational tools and heavy computation of large phylogenies. We set out to develop a tool for the bioinformatics layman that, based on a training set of high-quality expert annotation, automatically clusters superfamily protein sequences into subfamilies. We developed an automatic but user-supervised procedure that results in a high-quality clustering, cluster-specific HMMer profiles and corresponding cut-off threshold values for reliable sequence identification and clustering. Hence, we refer to this new tool as HMMer Cut-off Threshold Tool, or HMMerCTTer.
Results HMMerCTTer depends on an expert-provided training set that consists of a phylogeny and the underlying Multiple Sequence Alignment (MSA). First, HMMerCTTer assigns monophyletic clusters using a ranking algorithm based on the Silhouette Index with weight correction.
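As a rough illustration of the Silhouette Index mentioned above, the sketch below computes the standard (unweighted) silhouette width from a matrix of pairwise distances, such as the leaf-to-leaf distances of a tree. It is an assumption-laden example: the weight correction and the ranking step used by HMMerCTTer are not described in the abstract and are therefore not reproduced, singleton clusters are given a width of 0 by convention, and at least two clusters are required. The function name `silhouette_by_cluster` is hypothetical.

```python
def silhouette_by_cluster(dist, labels):
    """Mean silhouette width of each cluster.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each item, in the same order as dist
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = {lab: [] for lab in clusters}
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths[lab].append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths[lab].append((b - a) / max(a, b))
    return {lab: sum(w) / len(w) for lab, w in widths.items()}


# Four leaves forming two tight, well separated pairs (toy distances).
toy_dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
print(silhouette_by_cluster(toy_dist, ["A", "A", "B", "B"]))
```

Values close to 1 indicate compact, well separated clusters; values near 0 or below suggest that a candidate cluster should be split or merged.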
The Silhouette Index measures the compactness and separation of clusters based on the distances provided by the tree. Then, a HMMer profile is built for each Silhouette-qualified cluster using the user-provided MSA. Each cluster-specific HMMer profile will, theoretically, identify sequences belonging to the same cluster with a high alignment score, whereas sequences from other clusters will have significantly lower scores. Sudden drops in alignment scores are thus indicative of cut-off thresholds. This does, however, depend on the quality of the tree and the corresponding MSA, and also on the variation and conservation observed within and among the different subfamilies of the superfamily. Hence, the procedure is supervised by the biological expert in order to optimize both sensitivity and specificity. In a second step, the sensitivity and specificity of the HMMer profiles are tested using either the ungapped sequences from the training set or the corresponding complete proteomes. Based on graphically represented data, the user either accepts clusters or asks for an iterative refinement. For instance, large clusters with a high Silhouette Index but a nevertheless non-discriminative HMMer profile can be re-analysed by passing only the cluster's subtree through the ranking algorithm again. This results in smaller and more specific profiles. Another refinement deals with clusters that are considered too small, using an iteration through the HMMer profiling loop. A third refinement is a manual override of the clustering provided by the ranking algorithm, in order to enable paraphyletic clustering. The idea of HMMerCTTer was first applied to our recently published plant-ACD superfamily study. In this study, 29 custom-defined HMMer profiles were constructed and manually selected based on a phylogeny of 406 sequences derived from seven complete plant proteomes. The generated profiles were used to screen 17 complete plant proteomes
and yielded a single false positive (829 sequences, 17 complete proteomes), whereas all true positives were detected (training set, 7 complete proteomes). The automated HMMerCTTer identified a slightly higher number of clusters than the manual procedure, but the HMMer profiles generated reliable cut-offs. Hence, the same collection and clustering of 829 sequences would have been obtained if HMMerCTTer had been applied instead of a time-consuming expert analysis.
Conclusions HMMerCTTer provides biologists with an easy and powerful tool for the reliable classification of the subfamilies of superfamilies. Since Nature provides us with infinite scenarios of superfamilies, more benchmarking will be required in order to further improve HMMerCTTer and to test its general applicability and limits. We are currently testing HMMerCTTer on the highly complex superfamilies of aspartic proteases, polygalacturonases and phospholipases C. For the future, we foresee the development of a HMMerCTTer-based tool for the supervised annotation of complete proteomes.

Structure-Function Prediction of Highly Variable Sub-sequences of Protein Subfamilies
María Victoria Revuelta, Arjen ten Have
Laboratorio de Biología Comparativa en Solanáceas, IIB-CONICET-UNMdP, Mar del Plata (7600)
Keywords: Structure-Function Prediction, Aspartic Proteinase, Protein Family, Protein Superfamily, Subfamily or Specificity Determining Sub-sequence
Background Protein families consist of homologous, often functionally related, proteins that have a similar 3D structure. A key aspect of protein families is that they contain paralogues, which allows for functional diversification and the evolution of subfamilies. One of the aims of Structure-Function Prediction studies is the identification of Subfamily or Specificity Determining Positions (SDPs): sites or residues specific for certain functional aspects or subfamily classification.
The identification of SDPs is a hot topic in Bioinformatics and can be achieved by various methods based on either evolutionary tracing (ET) or mutual information (MI), both of which depend on multiple sequence alignments (MSAs) and homology. Interestingly, MSAs also identify sub-sequences that are not conserved throughout the complete superfamily and, hence, are not truly homologous. Current ET or MI SDP identification methods do not identify these Subfamily or Specificity Determining Sub-sequences (SDSs), some of which could be very important for protein function. We set out to develop a methodology for the identification and subsequent analysis of SDSs using A1 Aspartic Proteinases (APs) as a case study. APs form a well-studied protein family with a number of well-described, functionally important loops such as the Nepenthesin-specific loop and the Plant Specific Insert. The analysis will be used for functional prediction, but also as the foundation of a more general SDS identification and analysis procedure.
Results A multiple sequence alignment of 710 AP sequences from 107 completely sequenced eukaryotic genomes was constructed based on known hallmarks and available structural information. Non-homologous or otherwise poorly aligned sub-sequences were removed and a phylogenetic tree was constructed. The tree shows the existence of eleven different AP subfamilies, whereas the MSA trimming identified 12 stretches with high variability. Six of these were described by Metcalf & Fusek (1993) as variable loops that cover the binding cleft, are rather mobile or distorted in structures and are supposedly involved in substrate specificity. The other six SDSs are more remote from the binding cleft but also appear to be solvent exposed. Once identified, the SDSs require bio-computational analysis. The sub-sequences were analyzed for length, subfamily conservation and sequence characteristics.
The length of each of the 12 highly variable sub-sequences was determined using a Perl script and analyzed in R in order to find significant differences between subfamilies. Subfamily conservation was analysed by realignment of the 12 SDS regions for the 11 identified subfamilies. Reliable alignments were obtained for some but not all of the 131 datasets. Comparison of reliable cluster-specific SDS alignments was hampered by low information content. All sequences were analysed using a number of bio-computational methods in order to detect putative physicochemical and/or biological fingerprints.
Conclusion MSA trimming software can be used for the identification of SDSs. Ten out of the 12 SDSs identified in the AP superfamily show statistically significant differences throughout the superfamily classification. A number of SDS-cluster alignments are reliable, which suggests that these SDSs are functionally constrained within certain subfamilies. Other SDS-cluster alignments are not reliable and require a tree-guided iterative alignment optimization, which is currently being developed. Comparison of SDSs is hampered by the lack of clear homology, and alternative strategies are being developed for comparative analysis. Most SDSs are relatively hydrophilic, confirming that SDSs are solvent exposed. A number of Prosite patterns with a high probability of occurrence were identified and will be statistically analysed.
Reference
Metcalf P, Fusek M: EMBO J 1993, 12(4):1293-1302

Molecular Dynamics and Circular Dichroism Study of VBT:VBA Polymers (1:1 and 1:4). Structure and Dynamics comparison.
Sergio A. Garay (1), Antonela Fuselli (1), Debora Martino (1,2), Daniel E. Rodrigues (1,2)
(1) Facultad de Bioquímica y Ciencias Biológicas, Universidad Nacional del Litoral. (2) INTEC (UNL-CONICET), Santa Fe, Argentina.
A novel class of environmentally benign, non-toxic and recyclable materials based on vinylbenzyl thymine (VBT) and ionically charged vinylbenzyl triethylammonium chloride (VBA) monomers was studied. These compounds were bio-inspired by the interactions between nitrogenous bases that occur in DNA degenerative processes and by the possibility of reversing them with specific enzymes from living organisms. The hydrophilic nature of VBA allows us to work without organic solvents in the polymerization process. The technological applications of these polymers have preceded the basic studies needed to understand their behavior and improve their performance. We present a study of the influence of the VBT:VBA co-polymerization molar ratio on the short-range structure adopted by the polymer chains. We carried out several Molecular Dynamics simulations of these polymers as well as Circular Dichroism (CD) experiments. We ran simulations of 32 monomers of VBT:VBA (1:1) and 35 monomers of VBT:VBA (1:4) in explicit water (SPC model). The 1:1 polymer showed 75 % of its monomers in a helix conformation, while the 1:4 polymer showed only 54 %. We detected thymine stacking between residues (i, i+4) in the 1:1 polymer and (i, i+5) in the 1:4 polymer. The number of residues per helix turn was 3.6 and 4.0 for the 1:1 and 1:4 stoichiometries, respectively. The helix structure of the 1:4 polymer was interrupted by longer unstructured segments than that of the 1:1 polymer, and its backbone also showed more undulations. The 1:4 polymer also showed a larger number of nearby water molecules, a larger solvent accessible surface (SAS) and more hydrogen bonds in its vicinity than the 1:1 polymer, indicating that the unstructured segments of the 1:4 polymer allow more water to come close to it.
In all cases we found at least one pair of stacked thymines inside the helix segments, which could partly explain the helix stability. In summary, we conclude that thymine stacking would be responsible (at least in part) for the helix structure of the polymer, helping to lower the SAS of the hydrophobic moiety. The lack of CD signal was in agreement with the simulation results.

Lateral pressure effects on structural properties of DPPC lipid bilayers in gel and LC phases: a molecular dynamics study
A. Sergio Garay (1), Juan F. Quaranta (1), Daniel E. Rodrigues (1,2)
(1) Área de Modelado Molecular, Lab. de Biomembranas, Dpto. de Física, Fac. de Bioquímica y Cs. Biológicas. (2) INTEC (UNL+CONICET). Argentina.
Cell membranes contain hundreds of lipid species and proteins arranged in heterogeneous domains. It is now known that this compositional and morphological heterogeneity is central to their functions of substance trafficking and protein interactions. It is therefore necessary to rationalize how the lateral pressure boundary conditions affect the structure, ordering and dynamics of the lipid domains. We performed Molecular Dynamics (MD) simulations of hydrated DPPC lipid bilayers in the Gel (G, T=22°C) and Liquid-crystalline (LC, T=50°C) phases at several lateral pressure values (constant surface tension ensemble, ST) to evaluate their influence on the structural and ordering properties. For both phases the MD simulations were performed on a bilayer of 480 lipids, at ST values of 14 and 28 dyn/cm, the former being the value that reproduces the experimental NMR deuterium order parameter profiles of the LC phase. One of the relevant structural properties is the area per lipid: Area[G, ST=14 dyn/cm] = (44.7 +/- 0.4) Å^2; Area[G, ST=28 dyn/cm] = (48.3 +/- 0.2) Å^2; Area[LC, ST=14 dyn/cm] = (61.5 +/- 0.7) Å^2; Area[LC, ST=28 dyn/cm] = (75.0 +/- 0.3) Å^2.
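The area-per-lipid values above can be put on a common scale by computing the relative change in area when the surface tension is doubled from 14 to 28 dyn/cm. The snippet below is only a worked-arithmetic sketch using the four mean values quoted in the abstract (uncertainties are ignored), and the helper name `relative_area_change` is hypothetical.

```python
# Mean area per lipid in Å^2, keyed by (phase, surface tension in dyn/cm),
# copied from the values reported in the abstract.
AREA_PER_LIPID = {
    ("G", 14): 44.7,
    ("G", 28): 48.3,
    ("LC", 14): 61.5,
    ("LC", 28): 75.0,
}


def relative_area_change(phase):
    """Percent increase in area per lipid when ST goes from 14 to 28 dyn/cm."""
    low = AREA_PER_LIPID[(phase, 14)]
    high = AREA_PER_LIPID[(phase, 28)]
    return 100.0 * (high - low) / low


for phase in ("G", "LC"):
    print(f"{phase}: {relative_area_change(phase):.1f} % increase in area")
# G:  (48.3 - 44.7) / 44.7 is about 8.1 %
# LC: (75.0 - 61.5) / 61.5 is about 22.0 %
```

On this scale the liquid-crystalline bilayer expands by roughly 22 %, against roughly 8 % for the gel phase, for the same change in surface tension.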
The results show that the LC phase is much more sensitive to the change in lateral pressure than the ordered G phase. The hydration of the lipid polar groups is known to be a relevant contribution to the interface energetics. The numbers of H-bonds from water to the carbonyl O per lipid are: HB[G, ST=14 dyn/cm] = 5.14; HB[G, ST=28 dyn/cm] = 5.19; HB[LC, ST=14 dyn/cm] = 5.69; HB[LC, ST=28 dyn/cm] = 6.09. The larger change occurs in the LC phase. The change in the LC-phase area explains this behavior, and also explains why the number of water molecules whose orientational potential is perturbed by the interface is more sensitive in this case. We have also analyzed the changes in the order parameter profiles, the number of water molecules bridging lipids, and the lateral pressure profiles across the bilayer to untangle the contributions from different lipid regions.

Assessing protein-disease association significance from candidate ranking lists
Ariel Berenstein (1,2), Irene Ibañez (1,2), Ariel Chernomoretz (1,2)
(1) Departamento de Física, Universidad de Buenos Aires, Bs. As., Argentina. (2) Laboratorio de Bioinformática, Instituto Leloir, Bs. As., Argentina.
Background There has been a lot of recent interest in applying complex network theory to human health research in order to predict new disease/gene-product associations. Most research programs of this type assume that proteins associated with the same disease have an increased tendency to interact with each other. Accordingly, most gene prioritization methods involve the use of already known gene-disease associations, a complex network of interacting proteins that encodes physical or functional relationships between them, and some kind of information propagation technique used to rank candidate proteins in terms of their degree of association with disease-related seeds.
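One widely used information propagation scheme of the kind described above is the random walk with restart, in which a walker starts from the disease-related seeds and, at each step, either moves to a random neighbour or jumps back to a seed. The sketch below is a minimal pure-Python illustration on a toy network, not the authors' implementation: the restart probability of 0.3, the convergence tolerance, the protein names and the function name `random_walk_with_restart` are all assumptions made for the example, and every node is assumed to have at least one neighbour.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at the seeds.

    adj   -- dict mapping each node to the list of its neighbours (undirected)
    seeds -- set of nodes already associated with the disease
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: probability mass sent back to the seeds.
        nxt = {v: restart * p0[v] for v in nodes}
        # Propagation term: each node shares its mass equally among neighbours.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy interaction network (hypothetical proteins A to E), seed protein A.
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy case the better connected neighbour C outranks B although both are adjacent to the seed, a small example of how a node's connectivity can influence its score independently of any disease association.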
Usually, top-ranked proteins are considered new candidates, but this procedure takes into account neither the statistical significance of the proposed gene-disease association nor the effect of the topological structure of the protein-protein interaction network used.
Materials and methods We considered genes and proteins associated with Alzheimer's disease as reported by the DisGeNET database [1]. Protein-protein interactions were inferred from the Human Interaction Network (HIN) [2]. Three different protein candidate prioritization methods were analyzed (Functional Flow, Random Walk with Restart and NetRank algorithms) [3-6]. We implemented a bootstrapping technique that takes the topological structure of the network into account to assign statistical significance to the observed scores, and corrected the corresponding p-values with a multiple hypothesis testing technique (FDR).
Results We show that predictions based on candidate ranking lists obtained by this type of algorithm can be highly biased by the underlying network topology. We show how and when a bootstrapping technique that takes into account the local connectivity pattern of each node should be used to alleviate this issue.
Conclusions In this work we highlight the importance of adequately taking the network connectivity pattern into consideration in gene prioritization procedures. Looking at several topological quantities, we have analysed the induced topological bias and quantified the performance of bootstrapping techniques that aim to alleviate it. We found that different algorithms are affected differently by this bias. This observation can be explained in terms of the information propagation scheme implemented in each algorithm.
References
1. Bauer-Mehren et al. DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics 26: p2924, 2010.
2. Ceriani et al. Automated Network Analysis Identifies Core Pathways in Glioblastoma. PLoS One, 5(2): e8918, 2010.
3. Guney et al.
Toward PWAS: discovering pathways associated with human disorders. BMC Bioinformatics 12(Suppl 11):A12, 2011.
4. Nabieva et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21, p302, 2005.
5. Kohler et al. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet, 82, p949, 2008.
6. Chen et al. Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics, 10, p73, 2009.

Estimation of Species Richness in Microbial Communities
Cristóbal Santa María (1), Marcelo Soria (2)
(1) UNLAM, San Justo, Argentina. (2) FAUBA, Buenos Aires, Argentina.
Introduction Data mining concepts combined with statistical estimation can be applied in metagenomics to infer species richness in microbial communities. There are several statistical estimators that infer the species richness of a community from a sample [1]. Despite the usually large volume of data, a paradox occurs: richness estimators perform poorly and underestimate species richness. The reason lies in the low frequency of statistically rare species, that is, species with one or just a few members in the population. These rare species can sometimes constitute the major part of the community. As a result, a sample of sequences drawn from the population can contain thousands of reads and millions of bases and still be insufficient for an adequate estimate. To improve richness estimation we introduce here an algorithm for species counting, called ARE, based on an intelligent data analysis approach that combines simulation and machine learning. We test ARE on a real-world sample of 16S rRNA sequences.
Material and Methods The analyses shown in this work were performed with the gene coding for the 16S rRNA, which enables a phylogenetic evaluation of the similarities and differences between microorganisms [2]. The sequences were aligned against a reference and filtered by size and relative position in the overall alignment. The remaining sequences were clustered into OTUs by similarity using the Jukes-Cantor distance. The similarity threshold was chosen so that every cluster corresponds approximately to a different species. The ARE algorithm starts with a population model based on the richness and distribution of an initial sample, and then improves the estimate by successively adding individuals selected by a simulation process. This simulation takes into account the species abundance in the initial sample to estimate the probability that the next individual is a member of a species already recorded in the sample or a member of a new one. This probability is calculated as the quotient between the number of species with only one member and the total number of individuals in the sample, as suggested by Alan Turing and demonstrated later by Good [3]. The quotient at any given iteration $i$ is $\hat{T}_i = \dfrac{\mathrm{singletons}_i}{n_i}$, where $\mathrm{singletons}_i$ is the number of species with only one member and $n_i$ is the total number of individuals at that iteration. At every step $i$ the algorithm updates the number of species present in the sample depending on whether the simulated individual belongs to a new species or not. In this way, ARE "learns" the distribution along the simulation process. In this context, the probability of finding a new species tends to zero as the number of simulated individuals increases and, as a consequence, the number of recorded species also increases. To evaluate ARE we analyzed eight samples from a metagenomic survey of the coastal line of a hypersaline lake (NCBI Short Read Archive accession number SRX008158).
Results The simulations were halted after reaching a threshold of singleton frequency or a given number of simulated individuals. The number of resulting species is the richness estimate for the population. The ARE estimates were higher than those of other current estimators. Figure 1 compares the ARE predictions to those of Chao and ACE, two common non-parametric estimators.
Figure 1: Comparison of Estimates. Estimated richness (vertical axis, 0 to 11000) for samples S85 to S92 (horizontal axis); series: CHAO, ACE, ARE.
To test the goodness of fit of the estimator we created a simulated population that follows a Fisher's log-series distribution [4]. A sample was drawn from this population and used to estimate population richness using ARE, which confirmed the improvement in performance obtained with this new estimator.
Conclusions The results obtained with the metagenomic data indicate that ARE yields better estimates of richness, while the results from the simulated population confirm these improvements, at least from a statistical point of view.
References
1. Chao, A. and Lee, S. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association, 1992, 87(417):210-217.
2. Schloss, P. and Handelsman, J. Toward a census of bacteria in soil. PLoS Computational Biology, 2006, 2(7):e92.
3. Good, I. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 1953, 40(3/4):237-264.
4. Fisher, R., Corbet, S. and Williams, C. The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. The Journal of Animal Ecology, 1943, 12(1):42-58.

Dissecting relationships between sequence, structure and functions in the Ankyrin Repeat Protein Family
R. Gonzalo Parra*, Rocío Espada and Diego U. Ferreiro
Protein Physiology Lab, Dpto de Química Biológica, FCEyN-UBA and CONICET. Buenos Aires, Argentina.
Repeat proteins are made up of tandem arrays of similar stretches of 20 to 40 amino acids that usually fold into elongated structures stabilized mainly by local interactions. Due to their apparently simple architecture, these proteins constitute useful models to dissect relationships between sequences, structures and functions. The Ankyrin Repeat Protein family (ARPs) is widely distributed in nature. A canonical ankyrin repeat consists of a 33-amino-acid motif that usually folds into a beta-hairpin-helix-loop-helix upon interaction with its nearest neighbours. Their biological function is attributed to mediating specific protein-protein interactions, with a versatility of recognition comparable to that of antibodies. Thus, their function (or lack of it) plays crucial roles in the development of various pathological processes and in bacterial or viral infections. We have built a relational database to statistically characterize ARP architecture at various levels of description. We have collected, curated and catalogued all available ARP sequences, structures and functional data that delineate the general properties of this protein family. Hidden Markov Models (HMMs) derived from Multiple Sequence Alignments (MSAs) are usually used to detect repeats in protein sequences. This methodology has many disadvantages, as it fails to detect repeats (or parts of them) with a high degree of divergence. We developed a robust scheme to perform structural alignments and detect symmetries in order to define the repeating units within a repeat array and between natural protein pairs. The derived metrics were compared in terms of sequence and structural similarity. We detected subgroups within the ARP family that appear to correspond to known functional classes. We found that the most common methods used to characterize globular protein domains are insufficient to capture essential characteristics of the ARP family.
We hypothesize that this is due to strong evolutionary divergence in sequences that tolerate insertions and rearrangements within a repeating array. We show that the divergent regions can usually be mapped to binding sites. We postulate that the functional constraints imposed by specific binding conflict with robust folding of these proteins, and that these signals could be used to inform energetic terms in folding dynamics models.

Design of novel DNA-binding specificity in proteins from the “zinc finger” family
Benjamin Basanta1, Andreu Alibes2, Luis Serrano2, Alejandro Nadra1
1 Structural Biochemistry Group, Biologic Chemistry Department, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, C1428EGA Buenos Aires, Argentina
2 Biologic Systems Design Group, Systems Biology Program, Centre for Genomic Regulation, 08003 Barcelona, Spain
Protein-DNA interactions play a central role in cellular development, modulating essential processes such as gene expression, the cell cycle and chromatin structure. The “zinc finger” structural motif is the most abundant DNA-binding protein domain in mammalian genomes. Each of these domains binds a zinc ion that provides high structural stability, making the domain highly tolerant to mutations and easily evolvable [1]. Developing novel DNA-binding proteins that recognize new specific sequences is a great technological challenge with potential applications in many fields, from basic research to synthetic biology and gene therapy. Currently, there is only one example of a protein successfully redesigned to bind a DNA sequence different from the wild-type target [2] [3]. On the other hand, the naturally occurring repertoire of zinc fingers that bind a specific sequence is limited [4] [5] [6].
With the aim of developing new protein-DNA-binding-site pairs, we propose the use of the FoldX software [7], which allows modeling and prediction of protein-DNA interactions based on energy landscape calculations [8]. In this work we present a strategy for the computational redesign of binding interfaces and its experimental validation in a yeast one-hybrid system.
1. Tokuriki N, Tawfik DS: Stability effects of mutations and protein evolvability. Curr Opin Struct Biol. 2009 5:596-604.
2. Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ Jr, Stoddard BL, Baker D: Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 2006, 7093:656-9.
3. Ulge UY, Baker DA, Monnat RJ Jr: Comprehensive computational design of mCreI homing endonuclease cleavage specificity for genome engineering. Nucleic Acids Res. 2011, 1:1-10.
4. Maeder ML, Thibodeau-Beganny S, Osiak A, Wright DA, Anthony RM, Eichtinger M, Jiang T, Foley JE, Winfrey RJ, Townsend JA, Unger-Wallace E, Sander JD, Müller-Lerch F, Fu F, Pearlberg J, Göbel C, Dassie JP, Pruett-Miller SM, Porteus MH, Sgroi DC, Iafrate AJ, Dobbs D, McCray PB Jr, Cathomen T, Voytas DF, Joung JK: Rapid open-source engineering of customized zinc-finger nucleases for highly efficient gene modification. Mol. Cell. 2008 2:294-301.
5. Bhakta MS, Segal DJ: The generation of zinc finger proteins by modular assembly. Methods Mol. Biol. 2010 649:3-30.
6. Sander JD, Maeder ML, Reyon D, Voytas DF, Joung JK, Dobbs D: ZiFiT (Zinc Finger Targeter): an updated zinc finger engineering tool. Nucleic Acids Res. 2010 (Web Server issue):W462-8.
7. Schymkowitz J: The FoldX web server: an online force field. Nucleic Acids Res. 2005, 33(Web Server issue):W382-8.
8. Alibés A, Nadra AD, De Masi F, Bulyk ML, Serrano L, Stricher F: Using protein design algorithms to understand the molecular basis of disease caused by protein-DNA interactions: the Pax6 example. Nucleic Acids Res. 2010 21:7422-31.

Simulation of pesticide effect on thermo-dependent arthropod populations: fixed point iteration method
Carlos A. Bartó 1, Julio D. Edelstein 1,2, Eduardo V. Trumper 1,2
1 Informatics Dept., Exact, Physics and Natural Sciences, National University of Cordoba, Argentina
2 Entomology, Agricultural Experimental Station (INTA) Manfredi, Cordoba, Argentina
Agricultural crops are often damaged by arthropods, which triggers the application of pesticides for plant protection with the aim of reducing pest populations. Pest resurgence can occur, inducing frequent pesticide applications with the consequent environmental risks of pest resistance, killing of non-target populations, air, water and soil pollution, epidemiological consequences and increased costs of crop production, among other effects. Few pesticide models oriented to the effect on the pest population can be found in the specialized literature. In the present work, the effect of pesticides on the number of individuals in a population (the state variable, for example larvae) is defined. The simulated organisms develop as a function of environmental temperature, assuming normally distributed developmental rates and an instantaneous response of metabolism to the environment. The extended von Foerster (eVF) equation is used to numerically solve the partial differential equations describing the change in population abundance over time and physiological age. The software ARTROPOB ® (2012), designed to implement simulation models of stage-structured population dynamics, was used, and the pesticide module was added to it. The evolution of multiple larval stages under pesticide application is computed by fixed point iteration. The state variable is calculated by integrating the eVF fluxes and is then reduced proportionally by a survival coefficient.
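The fixed point scheme named in this abstract can be illustrated with a generic loop that repeats an update until two successive estimates are close enough or an iteration limit is reached. The sketch below is only an illustration under stated assumptions: the function names, the tolerance values and the toy update rule (`toy_mortality_update`, with a hypothetical survival coefficient of 0.6) are not taken from ARTROPOB or from the eVF model.

```python
def fixed_point(update, initial, tolerance=1e-8, max_iterations=100):
    """Iterate x <- update(x) until successive estimates differ by < tolerance.

    Returns (estimate, iterations_used, converged).
    """
    estimate = initial
    for iteration in range(1, max_iterations + 1):
        new_estimate = update(estimate)
        distance = abs(new_estimate - estimate)  # distance between iterates
        estimate = new_estimate
        if distance < tolerance:
            return estimate, iteration, True
    return estimate, max_iterations, False


def toy_mortality_update(mortality, survival=0.6, inflow=100.0):
    """Hypothetical update rule, NOT the eVF model.

    A fraction (1 - survival) of the incoming individuals plus half of the
    previously estimated mortality is counted as killed. The map is a
    contraction, so the iteration converges.
    """
    return (1.0 - survival) * (inflow + 0.5 * mortality)


if __name__ == "__main__":
    value, steps, converged = fixed_point(toy_mortality_update, initial=0.0)
    print(value, steps, converged)
```

Bounding both the iteration count and the distance between iterates mirrors the two controls a fixed point solver usually exposes; which norm to use for a vector-valued state is a separate modelling choice.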
The convergence procedure was controlled by tolerance parameters that limit the number of iterations and the distance from the estimate (Fig. 1). A norm distance per generation and larval instar was calculated as the maximum absolute value of the mortality at a given time minus the mortality one discrete time step earlier. Beyond a simple chemical pesticide effect, other kinds of control tactics could also be simulated, such as biological ones based on density-dependent and frequency-dependent pathogens, predators with a satiation function or parasitoids with a learning behaviour.
Figure 1: Iterative calculation of the integrated mortality and the norms in the fixed point method applied to pesticide effect estimation in population models. (Plot axes and series: Iteration; n-previous, n-present, Norm; Total mortality, Instantaneous distance, Integral Output.)
Although a real pesticide effect cannot be predicted on the basis of real circumstances, this theoretical tool allows the analysis of tactic application. Modeling the pest management process allows researchers to estimate the effects on virtual pest populations and to evaluate the optimal timing of pesticide application.

Strategies for gap-closure of Thermus sp. 2.9 genome.
Laura Navas1, Ariel Amadío2, Rubén Zandomeni3
1,3 Instituto de Microbiología y Zoología Agrícola (IMyZA), Instituto Nacional de Tecnología Agropecuaria (INTA), Las Cabañas y de Los Reseros, Buenos Aires, Argentina
2 CONICET – EEA Rafaela, Instituto Nacional de Tecnología Agropecuaria (INTA)
Extremophile organisms are of great interest because of their potential as sources of proteins for biotechnological applications. A thermophilic bacterium was isolated from a hot water spring in Salta, Argentina. Phylogenetic analysis indicated that it belongs to the Thermus genus.
DNA sequencing was performed using Roche 454 technology to obtain the complete genome sequence. Two hundred and fifteen thousand unpaired reads were obtained, totaling 81,238,046 bp and providing approximately 35- to 40-fold coverage of the genome (estimated size 2 Mbp). Reads were assembled de novo using Newbler (v2.3), which generated 137 contigs larger than 500 nucleotides with an N50 of 39,906 bp. The G+C content of the genome was 66.7%. Different bioinformatics strategies were used to predict the collinearity between contigs in order to finish the genome. First, synteny with two species of the Thermus genus was analyzed and compared to contigs from isolate 2.9. A second strategy consisted of generating an optical map of the Thermus sp. 2.9 genome (OpGen, Sanger Institute) using the restriction enzyme NheI. This allowed the restriction pattern of the whole genome to be compared with the patterns generated in silico for each contig. Finally, a fosmid library (Epicentre) was generated with an insert size of 30-40 kb, and the ends of 150 clones were sequenced. All these approaches allowed the generation of scaffolds to order the contigs. As a result of these strategies, 95 joins were predicted. Thirty-two of them were confirmed by PCR and sequencing of the amplified products. The average size of the sequenced gaps was ~1200 bp. Currently, we have 10 scaffolds that cover 98% of the genome. Following this strategy we were able to join several contigs and order many of them. However, it is clear that obtaining one scaffold (or ideally one contig) is particularly complex for genomes with high GC content. To increase the information and obtain a finished genome, we are currently planning a mate-paired run with an insert size of ~8 kb, aiming not only to join the 10 scaffolds but also to resolve repetitive sequences in the remaining contigs.
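The two assembly statistics quoted above, fold coverage and N50, follow from simple arithmetic. The sketch below is a minimal illustration: the function names are hypothetical, the contig lengths in the usage example are invented toy values (the real contig lengths are not given in the abstract), and N50 is computed with the common definition (the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the assembly size).

```python
def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_assembly = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_assembly:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Figures from the abstract: 81,238,046 bp of reads, ~2 Mbp genome.
    print(round(fold_coverage(81_238_046, 2_000_000), 1))
    # Toy contig lengths, for illustration only.
    print(n50([10, 8, 6, 4, 2]))
```

With the abstract's figures the coverage works out to about 40.6-fold, at the upper end of the quoted 35- to 40-fold range, which depends on the assumed genome size.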
Key words: Genome finishing, Gap closure, Scaffolding, Thermophilic bacteria
Acknowledgments
We thank Matthew Dunn and all of Team 63 from the Wellcome Trust Sanger Institute for the generation of the optical map.

COMPUTATIONAL PREDICTION OF THE BIOLOGICAL EFFECTS OF MUTATIONS IN OTC GENE IN ARGENTINIAN PATIENTS
Silene Silvera Ruiz1, Antonio Arranz Amo2, Laura Laróvere1, Raquel Dodelson de Kremer1
1 Centro de Estudio de las Metabolopatías Congénitas, Hospital de Niños de Córdoba, Fac. De Cs. Médicas, UNC, Córdoba, Argentina
2 Unitat de Metabolopaties, Hospital Universitari Materno-Infantil Vall d'Hebron, 08035 Barcelona, Spain
Summary
Ornithine transcarbamylase deficiency (OTCD) is the most common inherited disorder of the urea cycle and is transmitted as an X-linked trait. Defects in the OTC gene cause a block in ureagenesis. Males with mutations leading to complete OTCD develop hyperammonemic coma in the first week of life, which carries approximately 50% mortality and universal morbidity among survivors [1–3]. In males with mutations resulting in partial OTC deficiency and in approximately 15% of female heterozygotes, hyperammonemic crisis occurs later in childhood and carries 10% mortality and significant morbidity [4]. OTCD results from mutations in the OTC gene, which encodes a 354-residue polypeptide. The complete repertoire of OTCD-causing mutations is estimated at 560 mutations, including 290 mSNCs. Since disease-causing mSNCs represent <20% of the 2064 possible OTC mSNCs, simple approaches are essential for discriminating between causative and trivial mSNCs [5]. Observation of the OTC structure appears to be a simple approach for such discrimination, comparing favourably in our sample with four formalized structure-based and/or sequence-based in silico assessment methods and supporting the causation of deficiency by the given mutations.
The aim of this work was to validate five mSNCs (c.386G>A, c.452T>G, c.533C>T, c.622G>A and c.829C>T) found in Argentinian patients of our centre, and to correlate the degree of pathogenicity of each one with the phenotype and clinical data. The five patients were diagnosed biochemically and molecularly; the series includes affected males and symptomatic female carriers with mild and severe forms. Multiple sequence alignment was performed with CLUSTALW2 (http://www.ebi.ac.uk/Tools/msa/clustalw2/). The OTC mSNCs were evaluated using bioinformatics tools from public databases and web-based software programs. The conservation score of the affected residue was calculated with two tools: PolyPhen (http://genetics.bwh.harvard.edu/pph), for which scores range from 0.000 (most probably benign) to 0.999 (most probably damaging), and SIFT (http://blocks.fhcrc.org/sift/SIFT.html), for which scores below 0.05 indicate substitutions predicted to be not tolerated. We also used PoPMusic (http://babylone.ulb.ac.be/popmusic/), which evaluates the changes in stability of a given protein under single-site mutations on the basis of the protein's structure (Table 1).
Table 1

In silico prediction of cross-reactive epitopes of the major soybean allergen Gly m Bd 30K with bovine caseins and their analysis by immunochemical methods.
Candreva, Ángela1,2, Parisi, Gustavo3, Docena Guillermo2 and Petruccelli Silvana1.
1 CIDCA UNLP, La Plata 47 y 116. 2 LISIN, FCE, UNLP, La Plata, 47 y 115. 3 Departamento de Ciencia y Tecnología, UNQ, Roque Saenz Pena 182, Bernal.
Background
Cow’s milk allergy (CMA) is the main food allergy in Argentina. The most commonly used nutritional substitutes are soy-based formulas; however, 40% of patients do not tolerate soybean milk. The molecular bases of these reactions are not fully understood.
Our group has shown that the major soybean storage proteins, 11S and 7S, share cross-reactive epitopes with bovine caseins. The aim of this work was to predict potential cross-reactive epitopes in the major soy allergen P34. Although P34 is a minor seed component, it is considered a major soybean allergen. Servers currently available on the internet are not able to predict cross-reactivity between soybean proteins and cow’s milk proteins (CMP). Since the relevance of cross-reactive epitopes between soybean and CMP has been confirmed by our group through in vitro immunochemical studies and in vivo using a mouse model of CMA, the performance of in silico prediction methods needs to be improved. Our objective was to develop a computational strategy to predict epitopes common to P34 and bovine caseins and then to compare the in silico results with immunochemical analysis.
Material and Methods
Analysis of P34: bovine casein allergenic epitopes were obtained from the IEDB database and aligned with the P34 protein; the results were plotted in a graph built from the accumulation of consensus amino acids. P34 homology modeling, solvent accessibility and discontinuous epitopes: to build 3D models we used homology modeling with the sequence of the P34 protein (gi 195957142) as target. With this sequence we searched the Protein Data Bank (PDB) to obtain putative templates. Using the resulting model, we obtained the solvent-exposed positions with the program DSSP. In addition, the modeled P34 structure was used to analyze whether the cross-reactive epitopes identified by the sequence analysis were predicted as B-cell epitopes by the Discotope server. Immunochemical analysis using overlapping synthetic peptides: to test our prediction, the entire protein sequence of P34 was synthesized as linear 15-mer overlapping peptides with five-residue shifts, immobilized on paper.
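The peptide layout described above (15-mers shifted by five residues) can be sketched as a sliding window over the protein sequence. This is a minimal illustration: the function name and the toy sequence are hypothetical, and appending one extra final window so that the C-terminus is always covered is an assumption, since the abstract does not say how the end of the sequence was handled.

```python
def overlapping_peptides(sequence, length=15, shift=5):
    """Return (1-based start, peptide) pairs tiling the sequence.

    Windows start every `shift` residues. If the last regular window does
    not reach the C-terminus, one extra window is added (assumption).
    """
    if len(sequence) <= length:
        return [(1, sequence)]
    last_start = len(sequence) - length
    starts = list(range(0, last_start + 1, shift))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(start + 1, sequence[start:start + length]) for start in starts]


if __name__ == "__main__":
    # Hypothetical 27-residue sequence, NOT the P34 sequence.
    toy_sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGH"
    for start, peptide in overlapping_peptides(toy_sequence):
        print(start, peptide)
```

With a five-residue shift, consecutive 15-mers share ten residues, so a linear epitope of up to eleven residues falls entirely within at least one peptide.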
Recognition of the synthetic peptides was assayed with different primary antibodies: pools of IgE-reactive sera from allergic patients and mouse monoclonal antibodies (mAbs) specific for α-, β- and κ-casein.
Results
The sequence analyses detected two main regions with potential cross-reactive epitopes, region A and region B. The six 15-mer peptides in region A with the highest scores were recognized by the two different pools of IgE patient sera and by the 3 casein-specific mAbs. Only one region B peptide was recognized by the 3 mAbs. The predicted peptides were on the surface of the molecule, exposed to the solvent.
Conclusion
The in silico methods used in this work allowed us to predict cross-reactive epitopes between P34 and bovine caseins, which were confirmed by experimental analysis.

Coevolution and Contact Networks within Superfolds
Martin Banchero1, Elin Teppa1 and Cristina Marino Buslje1
1 Fundación Instituto Leloir
Amino acid changes due to mutation do not occur randomly; function and structure impose constraints on different positions. There are compensatory mutations, in which a mutation at a certain position induces a coordinated mutation at one or more other positions elsewhere in the protein. These coevolving mutations are of key interest because they identify residues that interact within the protein and are engaged in a particular function, for example catalysis, structure stabilization, protein-protein and substrate interaction, or allosteric regulation. It has long been suggested that correlated mutations can be exploited to infer spatial contacts within the tertiary protein structure [1]. On the other hand, folds are not equally adopted by proteins; instead, 40% of the proteins in the PDB adopt 0.1% of the possible folds.
Those highly populated groups of folds are called Superfolds and adopt very regular architectures (e.g., TIM barrel fold, αβ-barrel; Rossmann fold, three-layer αβ-sandwich; αβ-plait, two-layer αβ-sandwich) [2]. In this work we first analyzed the relationship between Mutual Information (MI), interpreted as a measure of coevolution [3], and contact distance in different families within a superfamily (defined as a Pfam clan [4]) belonging to a superfold. Secondly, we try to uncover which MI relationships are common to families of the same clan because of the common fold and superfamily function, and which ones are specific to each family (see Figure 1). With this approach we aim to identify superfamily MI (i.e., those MI relationships due to common superfamily function and/or fold) and family-specific MI. Understanding the similarities and differences between families of the same kind could be the foundation of more precise annotation methods.
Figure 1: MI and distance map of three families of the same clan (Tim barrel fold CL0160: PF01280, PF01717 and PF8267) and, for comparison, a family of a different clan and fold (globin-like fold CL0090: PF00042). Blue dots: top 20% MI; green dots: distance <8Å; red dots: top 20% MI and distance <8Å.
References:
1. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, et al. (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108: E1293-E1301.
2. Orengo CA, Thornton JM (2005) Protein families and their evolution-a structural perspective. Annu Rev Biochem 74: 867-900.
3. Buslje CM, Santos J, Delfino JM, Nielsen M (2009) Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics 25: 1125-1131.
4. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al.
(2012) The Pfam protein families database. Nucleic Acids Research 40: D290-D301.

Evolutionary and structural analysis of procirsin, a typical plant aspartic proteinase zymogen
Daniela Lufrano1, Sandra E. Vairo Cavalli1, Gustavo Parisi2
1 Laboratorio de Investigación de Proteínas Vegetales (LIPROVE), Departamento de Ciencias Biológicas, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, C.C. 711, 1900 La Plata, Argentina.
2 Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, B1876BXD, Argentina
In plants, aspartic proteases (APs, EC 3.4.23) appear to be the second-largest class of proteases. The A1 family is the best studied and the largest group, and in these organisms it is classified into typical, nucellin-like and atypical proteases [1]. Typical plant APs are synthesized as single-chain preproenzymes characterized by the presence of the plant-specific insert (PSI), a domain of approximately 100 amino acids that is absent in APs from other sources (viruses, bacteria, yeast, fungi and animals). The preproenzymes are subsequently processed into single- or two-chain mature forms from which the PSI domain is removed. The prosegment and the first residues of the N-terminal portion of the AP precursors have been described to play a critical role in blocking the catalytic aspartates and thus preventing autoactivation. In particular, residue Arg 7 of the propeptide in the precursor of barley's typical AP (prophytepsin) is reported to form an ionic interaction with Glu 171 and Asp 178 of the mature protein and, together with other hydrophobic interactions and hydrogen bonds, to anchor the propeptide in such a way that Lys 11 and Tyr 13 of the N-terminal region interact with the active site, inhibiting the activity of prophytepsin. However, the precursor of a typical AP from flowers of Cirsium vulgare (Savi) Ten.
(Asteraceae), called procirsin and obtained by heterologous expression, was shown to be active at acidic pH [2]. In order to find possible differences that explain the activity of recombinant procirsin, we performed a phylogenetic analysis of procirsin and built a structural model using the Modeller program, which was further evaluated with the DOPE potential and the Prosa II server (score -8.5). We also estimated the variation of the net charge of the propeptides of procirsin and prophytepsin as a function of pH. Our analysis shows that procirsin shares a cluster with APs from diverse organisms, in which its closest homolog is cyprosin from C. cardunculus (98% sequence similarity). According to the structural model and the evolutionary analysis, all the residues described as important for biological function, as well as Arg 7p, Lys 11 and Tyr 13, are conserved in procirsin in comparison with all the sequences of the cluster. The larger positive charge predicted at acidic pH for the prosegment of procirsin, compared with that of prophytepsin, could alter the correct localization of the propeptide, preventing the interaction of Lys/Tyr with the catalytic residues and rendering procirsin active at low pH. We propose that as pH increases, the charge of the prosegment decreases, allowing it to adopt the correct conformation to inhibit the proenzyme.
Acknowledgements
We would like to acknowledge the financial support of ANPCyT, Argentina (PICT 02224) and the University of La Plata (Project X-576). G. Parisi and S.E. Vairo Cavalli are members of the CONICET Research Career Program; D. Lufrano holds a CONICET fellowship.
References
1. Faro C, Gal S: Aspartic proteinase content of the Arabidopsis genome. Current protein & peptide science 2005, 6:493-500.
2. Lufrano D, Faro R, Castanheira P, et al: Molecular cloning and characterization of procirsin, an active aspartic protease precursor from Cirsium vulgare (Asteraceae). Phytochemistry 2012, in press.
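The procirsin abstract estimates propeptide net charge as a function of pH without stating the method. One common way to make such an estimate is to apply the Henderson-Hasselbalch equation to each ionizable group, as sketched below. Everything specific here is an assumption: the pKa values are approximate textbook-style figures (published tables differ), the function name is hypothetical, and the toy peptide is invented and is not the procirsin or prophytepsin propeptide.

```python
# Approximate pKa values (assumed; the abstract does not state which table
# or program was used).
PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 2.3


def net_charge(sequence, ph):
    """Estimate peptide net charge at a given pH (Henderson-Hasselbalch)."""

    def protonated_base(pka):
        # Fraction of a basic group carrying a +1 charge.
        return 1.0 / (1.0 + 10.0 ** (ph - pka))

    def deprotonated_acid(pka):
        # Fraction of an acidic group carrying a -1 charge.
        return 1.0 / (1.0 + 10.0 ** (pka - ph))

    charge = protonated_base(PKA_N_TERMINUS) - deprotonated_acid(PKA_C_TERMINUS)
    for residue in sequence.upper():
        if residue in PKA_BASIC:
            charge += protonated_base(PKA_BASIC[residue])
        elif residue in PKA_ACIDIC:
            charge -= deprotonated_acid(PKA_ACIDIC[residue])
    return charge


if __name__ == "__main__":
    toy_peptide = "MKRSLAEDKHY"  # hypothetical sequence, for illustration only
    for ph in (3.0, 5.0, 7.0):
        print(ph, round(net_charge(toy_peptide, ph), 2))
```

In this model the net charge can only decrease as pH rises, which is the qualitative behaviour the abstract relies on; the model ignores the local environment of each residue, so the absolute values are rough.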

BiFe: a national EMBNet node hosting Argentine bioinformatics applications
EMBNet node Argentina1
1 Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires (Argentina).
Background
EMBNet [http://www.embnet.org] is a science-based group of collaborating nodes that provides bioinformatics services to the molecular biology community [1]. BiFe (Bioinformática Federal [http://www.embnet.qb.fcen.uba.ar]) is the Argentine node of EMBNet. Our goal is to help the Argentine bioinformatics community [2] bring their newly developed applications online. BiFe is located at the Protein Physiology Lab, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina. The EMBNet node manager is Dr. Ignacio E. Sánchez, and the EMBNet staff includes Dr. Adrián G. Turjanski and MSc. Leandro G. Radusky (Departamento de Química Inorgánica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina).
Materials and methods
We have built our application servers using an open source general-purpose toolkit, which is publicly available for download together with a tutorial. Its features include a fully customizable input form with file uploading and automatic ftp file retrieval, dynamic application loading, and a fully customizable output form including text, tables, graphics and protein structure representation using Jmol [3]. Argentine academic researchers interested in having their bioinformatics applications hosted by BiFe may contact us for further details. We will put our resources at your service, free of charge. We do not expect anything in return in terms of authorship or any other form of scientific credit.
Results and conclusions
The site groups bioinformatics applications developed in Argentina.
Some of them are hosted by BiFe, such as the Frustratometer [4], an application to localize frustration in proteins, and BeEP, a tool to validate protein models through evolutionary information. The site also displays links to applications and databases that have been developed in Argentina and are hosted elsewhere.
Acknowledgements
We acknowledge funding from the Ministerio Argentino de Ciencia, Tecnología e Innovación Productiva. Ignacio E. Sánchez and Adrián G. Turjanski are researchers from Consejo Nacional de Investigaciones Científicas y Técnicas.
References
1. D'Elia D, Gisel A, Eriksson NE, Kossida S, Mattila K, Klucar L, Bongcam-Rudloff E: The 20th anniversary of EMBnet: 20 years of bioinformatics for the Life Sciences community. BMC Bioinformatics 2009, 10 Suppl 6:S1.
2. Bassi S, Gonzalez V, Parisi G: Computational biology in Argentina. PLoS Comput Biol 2007, 3(12):e257.
3. Herraez A: Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Educ 2006, 34(4):255-261.
4. Jenik M, Parra RG, Radusky LG, Turjanski A, Wolynes PG, Ferreiro DU: Protein frustratometer: a tool to localize energetic frustration in protein molecules. Nucleic acids research 2012, 40(W1):W348-W351.

Design of a pipeline for de novo identification of cis-regulatory elements involved in transcriptional re-programming during tomato fruit development and ripening
Tomas Duffy1, Fernando Carrari1
1 Instituto Nacional de Tecnología Agropecuaria, Hurlingham, Argentina.
Background. During the development and ripening of tomato (Solanum lycopersicum) fruits, extensive reprogramming of the gene transcriptional network occurs via the interaction of transcription factors with cis-regulatory elements (CREs). One of the mechanisms explaining the coordinated expression of genes involved in a common biological process is the CRE composition of the promoters of the genes involved.
In this direction, computational methods that identify over-represented DNA motifs in the promoters of co-expressed genes are time- and cost-effective complements for the large-scale discovery of putative cis-regulatory elements (pCREs). In this work we applied a combination of software tools and in-house scripts to analyze microarray experiments available in public databases, producing clusters of co-expressed genes in which to search for over-represented pCREs. Similarity with previously described CREs, positional preference and co-occurrence were analyzed.
Materials and Methods. Fifty-four two-color TOM1 tomato microarray experiments, covering 9 time points during tomato fruit development and ripening, were downloaded from the Tomato Functional Genomics Database and normalized with the Limma R package, and probe summarization was carried out with the WGCNA R package. Clusters of co-expressed genes were generated using the *omeSOM software, which is based on self-organizing maps. Promoters (1500 bp upstream of translation start sites) of co-expressed genes were fetched using in-house Perl/BioPerl scripts. pCREs were searched in these sequences using three different tools: MEME, Weeder and MotifSampler. All statistically over-represented pCREs were clustered to eliminate redundant motifs using the Gimmemotifs software. Non-redundant pCREs (nr-pCREs) were compared to previously described CREs present in the PLACE database using STAMP. In order to evaluate positional preference, all nr-pCREs were mapped onto all analyzed promoters using FIMO. The statistical significance of motif co-occurrence was calculated with the cumulative hypergeometric distribution function using a combination of R and Perl scripts. A network of co-occurring pCREs was built using Cytoscape.
Results and Discussion. We identified de novo 410 nr-pCREs, which showed a strong positional preference for the first 400 bp upstream of the translation start site.
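The co-occurrence test mentioned in the methods can be illustrated with the upper tail of the hypergeometric distribution. The sketch below is written in Python rather than the R and Perl scripts the authors used, and the framing is an assumption: the population is taken to be the set of analyzed promoters, and the question is how likely it is to see at least the observed number of promoters carrying both motifs if the two motifs were distributed independently. The function name and the numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(both, promoters, with_motif_a, with_motif_b):
    """P(X >= both) for X ~ Hypergeometric(promoters, with_motif_a, with_motif_b).

    X counts promoters containing motif A among the with_motif_b promoters
    that contain motif B, drawn without replacement from all promoters.
    """
    largest_overlap = min(with_motif_a, with_motif_b)
    total_ways = comb(promoters, with_motif_b)
    favourable = sum(
        comb(with_motif_a, k) * comb(promoters - with_motif_a, with_motif_b - k)
        for k in range(max(both, 0), largest_overlap + 1)
    )
    return favourable / total_ways


if __name__ == "__main__":
    # Invented counts: 1000 promoters, motif A in 50, motif B in 40, both in 10.
    print(hypergeom_upper_tail(10, 1000, 50, 40))
```

Because many motif pairs are tested, p-values from a test like this would normally be corrected for multiple comparisons before building a network; the abstract does not say which correction, if any, was applied.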
Two hundred and fifty-three of them showed high similarity to previously described plant CREs. We generated a network of statistically co-

Diversity and evolution of retinoblastoma protein-binding LxCxE motifs in human proteins
Lucía B. Chemes1, Juliana Glavina2, Gonzalo de Prat-Gay1 and Ignacio E. Sánchez2
1 Protein Structure-Function and Engineering Laboratory. Fundación Instituto Leloir and IIBBA-CONICET.
2 Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires.
Introduction
The retinoblastoma tumor suppressor protein (Rb) plays a central role in eukaryotic cell cycle control, differentiation and the regulation of chromatin structure. Rb is the hub of a large protein interaction network. The retinoblastoma-binding LxCxE linear motif mediates a high-affinity interaction between a conserved surface patch in Rb and one third (approximately 30) of the human cellular Rb targets [1], and also between Rb and several oncoproteins from human viruses. In the present work we study the occurrence and evolution of LxCxE motifs in human proteins, using bioinformatics tools to identify this linear motif and to analyze its variability in homologous non-human proteins.
Methods
We use available linear motif databases and bibliographic searches to compile a database of human Rb target proteins harboring the LxCxE motif. We annotate the structural context of the motif using the Protein Data Bank structure database [2] and the IUPRED predictor of intrinsic disorder [3]. We also characterize the sequence context of the motif using sequence logos [4] and by searching for known associated motifs [5]. For a subset of targets, we search for the LxCxE motif in homologous proteins from eukaryotic and prokaryotic organisms and analyze the evolution of the motif.
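The core pattern of the motif, L-x-C-x-E with x standing for any residue, can be located in a protein sequence with a regular expression. The sketch below is only an illustration of such a search, not the authors' pipeline: the function name is hypothetical, the example sequence is invented, and real motif definitions in linear motif databases usually add constraints on the flanking residues that are not modelled here.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
LXCXE_PATTERN = re.compile(r"(?=(L.C.E))")


def find_lxcxe(sequence):
    """Return (1-based position, matched residues) for each L-x-C-x-E match."""
    return [
        (match.start() + 1, match.group(1))
        for match in LXCXE_PATTERN.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    toy_sequence = "MDLYCYEQLNLACAE"  # hypothetical sequence, for illustration
    print(find_lxcxe(toy_sequence))
```

A plain pattern match like this flags candidates only; whether a site is accessible for Rb binding depends on its structural context, which is what the disorder and structure annotations in the abstract address.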
Results and conclusions
We report that the LxCxE motifs of human Rb target proteins can be found both within disordered regions and within globular domains. When present in a globular domain, the motif can occur in various secondary structure elements, suggesting that conformational transitions must take place to allow Rb binding. We find variability in the linear motifs associated with the LxCxE motif, and conservation patterns that differ from those of the known instances in viral proteins. We discuss the results for LxCxE motifs in the human proteome in the light of the information available on the Rb-LxCxE interaction and on the known features of viral LxCxE motifs. Based on these data, we suggest that host and viral LxCxE motifs may differ in their evolution and functional properties.
References
[1] Dick FA. Cell Div. 2007 Sep 13;2:26.
[2] http://www.rcsb.org/pdb/home/home.do
[3] Dosztányi Z, Csizmók V, Tompa P and Simon I. Bioinformatics (2005) 21, 3433-3434.
[4] Schneider TD, Stephens RM. 1990. Nucleic Acids Res. 18:6097-6100.
[5] Chemes LB, Glavina J, Faivovich J, de Prat-Gay G, Sánchez IE. J Mol Biol. 2012. In press. DOI: http://dx.doi.org/10.1016/j.jmb.2012.05.036.
Acknowledgements
We acknowledge funding from Agencia Nacional de Promoción Científica y Tecnológica (PICT 2010-1052 to I.E.S.), Consejo Nacional de Investigaciones Científicas y Técnicas (postdoctoral fellowship to L.B.C.; G.d.P.G. and I.E.S. are CONICET career investigators) and Instituto Nacional del Cáncer (graduate fellowship to J.G.).

Identification of putative subtelomeric regions in the genome of Toxoplasma gondii
Santiago J. Carmona1, Maria C. Dalmasso2, Sergio O.
Angel2, Fernán Agüero1
1 Laboratorio de Genómica y Bioinformática, 2 Laboratorio de Parasitología Molecular, IIB-INTECH, Universidad de San Martín-CONICET, Buenos Aires, Argentina

Background
Most eukaryotic chromosome ends are formed by telomeric repeats and subtelomeric regions, also called Telomeric Associated Sequences (TAS). TAS are patchworks of genes interspersed with repeated elements, and while these domains present similar arrangements in different species, their sequences are highly divergent. In addition, these regions have a particular nucleosomal composition and bind specific factors, thereby producing a special kind of heterochromatin. In the currently available draft of the T. gondii genome, telomeres are not completely assembled, and chromosome ends have not yet been analyzed. Here we discuss some findings regarding T. gondii chromosome ends.

Results
All-vs-all pairwise sequence comparison of T. gondii chromosomes revealed the presence of a conserved region of approximately 25 to 30 Kb at the ends of 9 of the 14 chromosomes in the parasite strain ME49, defined here as TgTAS-like. Sequence similarity among these regions is ~70% on average; they are highly conserved in other strains but are unique to Toxoplasma, with no detectable similarity in other Apicomplexan parasites. The internal structure of these TgTAS-like sequences consists of 3 repetitive regions separated by high-complexity sequences that are depleted of genes, with the exception of one gene at their 3' end. To analyze potential compositional bias along the chromosomes we performed a correspondence analysis (CA) of the trinucleotide composition observed in sliding windows of lengths 1 to 100 Kb. The analysis showed a strong bias, with only 2 dimensions (the first and second principal coordinates of the CA) explaining most of the trinucleotide bias (>60%).
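The input to a correspondence analysis of this kind is a table of trinucleotide frequencies per window. The following is a minimal sketch of how such a table could be built; it assumes overlapping trinucleotides and a fixed step between windows, neither of which is stated in the abstract, and the function names are hypothetical. The correspondence analysis itself is not shown.

```python
from collections import Counter
from itertools import product

# The 64 possible trinucleotides, in a fixed column order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_frequencies(window):
    """Relative frequency of each overlapping trinucleotide in one window."""
    counts = Counter(window[i:i + 3] for i in range(len(window) - 2))
    # Trinucleotides containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[t] for t in TRINUCLEOTIDES)
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


def sliding_window_composition(sequence, window_size, step):
    """Return (window start, frequency vector) for each window along the sequence."""
    seq = sequence.upper()
    rows = []
    for start in range(0, len(seq) - window_size + 1, step):
        window = seq[start:start + window_size]
        rows.append((start, trinucleotide_frequencies(window)))
    return rows


# Toy example; a real run would use windows of 1 to 100 Kb on chromosome sequences.
rows = sliding_window_composition("ACGTACGTACGT", window_size=6, step=3)
```

Each row of the resulting table corresponds to one genomic window, so the first principal coordinate of the analysis can be mapped back to chromosome positions.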
TgTAS-like regions showed the highest trinucleotide compositional bias on the first principal coordinate (1PC) when using a window size of 30 Kb. This compositional bias is similar to that observed in other genomic fragments, such as those containing centromeric sequences (Figure 1). We also found that 1PC is negatively correlated with gene density in the genome (Pearson's correlation coefficient -0.445, p-value < 10^-16); i.e., genomic fragments with low 1PC values are gene-rich, while high 1PC values are associated with gene-depleted regions such as TgTAS-like regions and centromeres. Finally, ChIP-qPCR experiments showed that nucleosomes associated with TgTAS-like sequences are enriched in silencing epigenetic markers such as histone H4 monomethylated at K20 and the histone variant H2AX.

Conclusions
We identified a region of ~30 Kb, termed TgTAS-like, that is present in most of the Toxoplasma chromosomes. These regions form a specialized heterochromatin characterized by: i) a particular trinucleotide composition, ii) a special arrangement containing three satellite families, iii) depletion of coding sequences, and iv) enrichment in nucleosomes containing heterochromatin-like histone markers. Interestingly, these features allowed us to identify similar regions, not necessarily subtelomeric, that might be functionally similar.

Prediction of blood to liver coefficients for volatile organic compounds: a cheminformatics approach

Damián Palomba1,2, María Jimena Martinez1, Ignacio Ponzoni1,2, Mónica Díaz1,2, Gustavo Vazquez1, Axel J. Soto3
1 Laboratory for Research and Development in Scientific Computing (LIDeCC), DCIC, UNS, Av.
Alem 1250, Bahía Blanca, Argentina
2 Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET-UNS, La Carrindanga km. 7, Bahía Blanca, Argentina
3 Faculty of Computer Science, Dalhousie University, Halifax, Canada

Background
Volatile organic compounds (VOCs) are organic chemical compounds whose composition makes it possible for them to evaporate under normal indoor atmospheric conditions of temperature and pressure. VOCs are of concern as both indoor and outdoor air pollutants because many of them are known or suspected to cause chronic adverse health effects in exposed populations. For this reason, blood-to-tissue partition coefficients are important in environmental, toxicological and pharmacokinetic modeling. Although some prediction models have been developed in the past [1], their predictive capacity can still be improved. In this work we propose a new prediction model based on a QSPR (Quantitative Structure-Activity/Property Relationship) modeling technique.

Materials and methods
The data set is composed of 122 volatile organic compounds; 438 descriptors were calculated using the Dragon software. The compounds and their respective blood-liver partition coefficients (logPLiver) were extracted from reference [1]. We employed the interface and routines provided by the machine learning tool Weka. To generate the model, we divided the data set into a training set and an external validation test set with 83 % and 17 % of the compounds, respectively. To select the most relevant descriptors we employed 5-fold cross-validation with in-fold feature selection (M5P implementation) over the training set. Table 1 shows the performance metrics for each fold. Since a different set of attributes may be selected in each fold of the cross-validation, a consensus scheme was employed. As a result, the following relevant descriptors were selected: SIC2, S1K, X5A, SIC1, AAC, X4Av, H-046, nN, MSD, X2sol.
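The abstract does not state which consensus rule was applied across folds. As an illustration only, the sketch below assumes a simple vote: a descriptor is kept if it was selected in at least a given number of folds. The function name, the threshold and the per-fold selections are hypothetical; only the descriptor names come from the text.

```python
from collections import Counter


def consensus_descriptors(selected_per_fold, min_folds):
    """Keep descriptors chosen in at least `min_folds` of the cross-validation folds."""
    # set(fold) guards against a descriptor being listed twice in one fold.
    votes = Counter(d for fold in selected_per_fold for d in set(fold))
    return sorted(d for d, n in votes.items() if n >= min_folds)


# Hypothetical per-fold selections (not the authors' data).
folds = [
    ["SIC2", "S1K", "X5A"],
    ["S1K", "X5A", "MSD"],
    ["S1K", "SIC2", "nN"],
]
consensus = consensus_descriptors(folds, min_folds=2)
```

Raising `min_folds` gives a smaller, more stable descriptor set, while lowering it keeps descriptors that were only informative for part of the training data.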
Table 1: R2 and RMSE values for each fold of the feature selection process
        Fold 1   Fold 2   Fold 3   Fold 4   Fold 5
R2      0.821    0.74     0.71     0.70     0.74
RMSE    0.182    0.212    0.241    0.268    0.214

Results and conclusions
The final model was developed using a decision tree (WREP implementation) and validated using the external test set. We obtained a model that uses 5 descriptors (H-046, S1K, X2sol, MSD and SIC1), with performance metrics of R2 = 0.83 and RMSE = 0.18. Compared with the results reported in [1], we observe a significant improvement in prediction performance.

Acknowledgements
This work is kindly supported by grants PGI 24/ZN15 and PGI 24/ZN16 (Universidad Nacional del Sur) and PIP112-2009-0100322 (CONICET - National Research Council of Argentina).

References
1. Abraham MH, Ibrahim A, Acree WE Jr: Air to liver partition coefficients for volatile organic compounds and blood to liver partition coefficients for volatile organic compounds and drugs. Eur J Med Chem 2007, 42: 743-751.

How much information keeps the solvation structure of a Crystal Protein?

Carlos Modenutti1,2, Diego F. Gauto1, Leandro Radusky1, Silvia Hajos2 and Marcelo A. Marti1
1 Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón II, C1428EHA Ciudad Autónoma de Buenos Aires, Argentina.
2 Departamento de Microbiología, Inmunología y Biotecnología, Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junin 954, C1113AAD Ciudad Autónoma de Buenos Aires, Argentina.

Background
Interactions between carbohydrates and proteins mediate numerous important biological functions, such as signal transduction, cell adhesion, host-pathogen recognition, and the immune response (1).
In our previous work (1,2), which combined MD simulations with statistical analysis, we showed that the properties of the water molecules close to the surface of the carbohydrate recognition domains (CRD) of various lectins resemble the structure of the lectin-carbohydrate complex. Specifically, we defined the so-called water sites (WS) as regions of space close to the protein surface where the probability of finding a water molecule is higher than in bulk solvent, and computed several thermodynamic and structural properties. Saraboji K. et al. found a correlation between the positions of the WS and crystallographic waters, suggesting that the CRD in Galectin-3 is preorganized to recognize a sugar-like framework of oxygens (3). To check whether this is an exclusive property of lectins or a pattern common to proteins capable of binding carbohydrates, we created a database (DB) from a set of proteins obtained from the Protein Data Bank.

Results
We analyzed crystallographic structures of apo-proteins (AP) vs protein-carbohydrate complexes (PCC) to check the ability of crystallographic waters to predict the positions of OH groups. We found a direct correlation between the positions of crystallographic water molecules in the AP and the OH groups of the ligand in the complex structure.

Conclusion
This study shows that the water molecule positions obtained from crystallographic structures correlate strongly with the OH group positions in the complex, and appear to be a powerful tool for glycomimetic drug design.

References
1. Gauto, D. F., Di Lella, S., Guardia, C. M., Estrin, D. A. & Marti, M. A. J Phys Chem B. 2009 Jun 25;113(25):8717-24.
2. Di Lella, S., Marti, M. A., Alvarez, R. M. S., Estrin, D. A. & Díaz Ricci, J. C. J Phys Chem B. 2007 Jun 28;111(25):7360-6. Epub 2007 May 25.
3. Saraboji K., Håkansson M., Genheden, Diehl C., Qvist J., Weininger U., Nilsson U., Leffler H., Ryde U., Akke M., and Logan D. Biochemistry. 2012 Jan 10;51(1):296-306. Epub 2011 Dec 7.
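One simple way to quantify the agreement between apo-structure waters and ligand hydroxyl groups is to measure, for each ligand OH oxygen, the distance to the nearest crystallographic water. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes the apo and complex structures are already superposed in the same coordinate frame, and the 1.5 Å cutoff, the function names and the coordinates are hypothetical.

```python
import math


def nearest_water_distances(hydroxyl_oxygens, apo_waters):
    """Distance from each ligand OH oxygen to its closest apo-structure water oxygen."""
    return [min(math.dist(o, w) for w in apo_waters) for o in hydroxyl_oxygens]


def fraction_matched(hydroxyl_oxygens, apo_waters, cutoff=1.5):
    """Fraction of ligand OH oxygens with an apo water within `cutoff` angstroms."""
    distances = nearest_water_distances(hydroxyl_oxygens, apo_waters)
    return sum(d <= cutoff for d in distances) / len(distances)


# Hypothetical coordinates in angstroms, assumed to be in a common frame.
ligand_oh = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
waters = [(0.0, 0.0, 1.0), (9.0, 0.0, 0.0)]
matched = fraction_matched(ligand_oh, waters)
```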
Online modeling of Endoglucanases from Aspergillus genus using PHYRE2

Manuel Cossio1, Gastón Sioli1, Griselda Perona1, Lorena Castrillo1 and Pedro Zapata1
1 INBIOMIS, Posadas, Misiones, Argentina, 3300

Background
Cellulolytic enzymes are generally induced as multienzyme systems and have been divided, according to the region of the cellulose fiber they cleave, into three classes: endoglucanases, cellobiohydrolases and β-glucosidases. Endoglucanases are extracellular enzymes that degrade cellulose to lower molecular weight sugars and can be found in wood-decomposing fungi [1]. Researchers are now focusing on this field because of the potential application of these enzymes in bioethanol production. Protein models provide useful information about structure, global interactions with membrane complexes and other proteins, the number and location of hydrophobic and hydrophilic residues, and molecular weight, among other things. All of this would allow us to make inferences about the kinetic and enzymatic properties of endoglucanases using only their amino acid sequence as real data.

Materials and Methods
To construct 3D models of endoglucanases, the amino acid sequences of three species belonging to the genus Aspergillus were obtained from the NCBI database. They were processed with the "protein homology/analogy recognition engine v 2.0" (PHYRE 2) online software, obtaining a specific model for each protein sequence [2]. These models were analyzed and compared in order to identify structural differences between the enzymes.

Results
Species        Total residues   Residues modeled*   % α helix   % β strand   % disordered
A. fumigatus   373              61 %                20.21       32.95        46.84
A. niger       356              71 %                7.12        39.15        53.73
A. oryzae      333              92 %                38.83       16.96        44.21
* >90% confidence

Conclusions
3D modeling with PHYRE 2 is a good strategy for inferring structure and protein composition from amino acid sequences in FASTA format.

References
1. Knowles: Cellulase families and their genes. Tibtech 1987, 5: 255-261.
2. Kelley, Sternberg: Protein structure prediction on the web: a case study using the Phyre server. Nature Protocols 2009, 4: 363-371.

INTA bioinformatic platform: An approach using ontology driven database and web interface to integrate and explore genomic data.

Sergio Gonzalez1,3*, Bernardo Clavijo1*, Máximo Rivarola1,2, Paula Fernandez1,2, Marisa Farber1,2 and Norma B Paniego1,2
1 Instituto Nacional de Tecnología Agropecuaria/Instituto de Biotecnología, Hurlingham, Argentina, 2 CONICET, Argentina. 3 Facultad de Ingeniería/UBA, Buenos Aires, Argentina.
*Contributed equally

Background
During the last few years, as the availability, affordability and magnitude of genomics and genetics research have increased, so has the need to provide accurate and reliable access to the resulting data and to combined analyses of genomes. One approach is to combine the outputs of different software tools and merge the results, so that the reliability of the merged output can be checked by visual analysis. Today, more than 1,000 genomes have been completely sequenced; moreover, high-throughput sequencing (Next-Generation Sequencing, NGS) technologies underscore the importance of computational methods for annotating and mining the vast amount of genomic data. In summary, no off-the-shelf solution exists for the assembly, gene prediction, genome annotation and merged-data presentation needed to interpret and take full advantage of all genomic features. The need to invest large resources in custom bioinformatics support for any genome sequencing project remains a major challenge to fully understanding an organism's genome.
Results
In this work, we present an approach that uses an ontology-driven database to store, visualize, analyze and share this information, including information associated with each feature represented in the database, for example SNPs associated with a gene feature or information from transcriptomic assays. To accomplish this, on the one hand we developed ATGC, which uses Chado (Generic Model Organism Database, http://gmod.org), an ontology-driven relational database schema implemented in PostgreSQL. One of the main goals of ATGC is to facilitate the exploration and visualization of the data. The main development effort went into exploiting GO annotation and analyzing the annotated genes, allowing users to move through the GO-DAG structure. This approach allows navigation between the different classes of genes available in different projects. On the other hand, we combined functional annotation with other genomic analyses (such as transcriptomics) using the genome browser GBrowse (gmod.org/wiki/GBrowse) to facilitate the visualization of all data in its genomic context.

Conclusion
We developed a user-friendly, flexible platform to organize genomic data. The integration of functional annotation, information associated with each feature and a system based on genomic coordinates enables easy exploration to generate users' hypotheses. Finally, we plan to optimize the collection management features, allowing users to create and manipulate lists of features by different criteria, and even to connect the database to complementary platforms for data processing and analysis such as Galaxy, providing a means to perform data manipulation and storage.
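Moving through a GO-style directed acyclic graph amounts to following parent links from a term and collecting the genes annotated anywhere below a chosen term. The sketch below illustrates the idea on an in-memory toy graph; it is not the ATGC implementation and does not use the Chado schema, and all term identifiers, gene names and function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_under(term, parents, annotations):
    """Genes annotated to `term` itself or to any of its descendant terms."""
    return sorted(
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    )


# Hypothetical toy ontology: T3 -> T2 -> T1 and T4 -> T1 (child -> parent).
parents = {"T3": ["T2"], "T2": ["T1"], "T4": ["T1"]}
annotations = {"geneA": ["T3"], "geneB": ["T4"], "geneC": ["T1"]}
under_t2 = genes_under("T2", parents, annotations)
```

In a relational setting the same upward walk would typically be done with a recursive query or a precomputed closure table rather than in application code.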
Phylogeny of fungal species of genus Aspergillus using ITS sequences

Manuel Cossio1, Gastón Sioli1, Griselda Perona1, Pedro Zapata1
1 INBIOMIS, Miguel Lanús, Posadas, Misiones, Argentina, 3304

Background
A method to identify Aspergillus at the species level and differentiate it from other true pathogenic and opportunistic molds was developed using the 18S and 28S rRNA genes as primer binding sites [1]. The Internal Transcribed Spacer (ITS) region has been widely used in molecular characterization because of its relatively high variability and ease of PCR amplification, and its sequence is valuable for classification in fungi because of its appropriate evolutionary rate.

Materials and methods
The ITS sequences of six species of the genus Aspergillus were obtained from the NCBI nucleotide database. They were aligned with CLUSTAL X v 2.0 and then edited with BioEdit v 7.1. The edited sequences were processed with MEGA 5.1 to build the phylogenetic tree using the maximum likelihood method [2].

Results
Phylogenetic tree of ITS sequences from the following fungal species:
1. A. versicolor
2. A. unguis
3. A. flavipes
4. A. fumigatus
5. A. oryzae
6. A. flavus

Conclusions
ITS sequences can be used to construct maximum likelihood phylogenetic trees.

References
1. Henry, Iwen: Identification of Aspergillus species using Internal Transcribed Spacer regions 1 and 2. J Clin Microbiol. 2000, 38:1510–1515.
2. Kishino H, Hasegawa M: Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 1989, 29:170-179.

Comparison of two homology-based protein structure online software tools.
María Perona1, María Molina1, Manuel Cossio1, Pedro Zapata1
1 INBIOMIS, Posadas, Misiones, Argentina, 3300

Background
Laccases are oxidative enzymes that have received special attention from researchers in recent decades because of their ability to oxidize both phenolic and nonphenolic lignin-related compounds as well as highly recalcitrant environmental pollutants. These properties make them very useful in biotechnological processes such as the detoxification of industrial effluents, mostly from the paper and pulp, textile and petrochemical industries. Structure analysis is a very important step in understanding enzyme kinetics, and for that reason online software tools capable of building 3D structures are considered very useful for understanding enzymatic behavior.

Materials and Methods
The amino acid sequence of the laccase of Trametes sanguinea was obtained from the NCBI online database. To generate the protein 3D structure, the sequence was run with the "protein homology/analogy recognition engine v 2.0" (PHYRE 2) and SWISS-MODEL online software [1] [2]. The protein structure models generated by both programs were analyzed and compared with each other to determine their similarities and differences.

Results
Software      Interactive 3D model   Residues modelled   Confidence     % α helix/β strand   Quaternary structure   Ligand information
Phyre 2       YES                    100%                >90%           YES                  Not reported           Not reported
SWISS-MODEL   YES                    100%                Not reported   Not reported         YES                    YES

Conclusions
Both software tools were efficient at producing 3D models from the laccase amino acid sequence.

References
1. Arnold, Bordoli: The SWISS-MODEL Workspace: A web-based environment for protein structure homology modelling. Bioinformatics 2006, 22:195-201.
2. Kelley, Sternberg: Protein structure prediction on the web: a case study using the Phyre server. Nature Protocols 2009, 4: 363-371.
Construction of phylogenetic trees from Trichoderma sp using the program MEGA 5.10

Gastón Sioli, Lorena Castrillo, Manuel Cossio, Natalia Amerio and Pedro Zapata
INBIOMIS, Posadas, Misiones, 3300, Argentina

Background
MEGA 5.10 [1] is used to create phylogenetic trees from protein or nucleic acid sequence data, in order to analyze the similarities and the degree of relatedness between the sequences. This work describes, step by step, the alignment of sequences, the estimation of the tree by maximum likelihood and the drawing of the tree. The analysis of these sequences allows a comparison between different genetic strains and is a tool that supplements the molecular identification of isolates to species level.

Materials and Methods
Six sequences from different strains belonging to the genus Trichoderma were obtained from the NCBI database, and the alignment was performed with the Clustal X 2.1 program. The aligned sequences were edited using BioEdit 7.1.3.0, and dendrograms based on the Maximum Likelihood method were obtained with the MEGA 5.10 program. Of the five sequences, 1 corresponds to T. koningiopsis, 1 to T. pleuroticola, 1 to T. hamatum, 1 to T. harzianum and 1 to T. brevicompactum.

Results
Phylogenetic tree of ITS sequences from the following fungal species:
1. T. koningiopsis
2. T. pleuroticola
3. T. harzianum
4. T. hamatum
5. T. brevicompactum

Conclusions
Analysis with bioinformatics programs such as MEGA 5.10 is useful for inferring phylogenetic relationships among different strains of Trichoderma sp, showing both the presence of homologies and the evolutionary distances.

References
1. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S: MEGA 5 Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution 2011, 28: 2731-2739.
Online comparison of sequence alignment and phylogenetic analysis of native Trichoderma sp from Misiones province

Gastón Sioli, Lorena Castrillo, Manuel Cossio, Natalia Amerio, María Isabel Fonseca and Pedro Zapata
INBIOMIS, Posadas, Misiones, 3300, Argentina

Background
Species of the genus Trichoderma [1] are of great biotechnological interest because they perform well as biological control agents, soil bioremediators, growth promoters and enzyme producers. The diverse biotechnological applications of Trichoderma make accurate strain identification essential, as well as knowledge of their phylogenetic relationships. Molecular species classification is carried out with online tools that perform an analysis based on barcodes or nucleotide sequences, using bioinformatics software available online. These tools can be used to construct evolutionary trees or dendrograms that reflect homologies and genetic relationships based on the principle of minimum evolution or maximum parsimony.

Materials and Methods
Fifteen native Trichoderma sp isolates from Misiones province were characterized molecularly by amplification of the internal transcribed spacer regions ITS1 and ITS2 of ribosomal DNA. For identification to species level, the sequences were analyzed using three widely recognized databases: Fungal Barcoding, TrichOKEY and NCBI. To construct phylogenetic trees, the sequences were aligned using the Clustal X 2.1 program. The aligned sequences were edited using BioEdit 7.1.3.0, and dendrograms were obtained with the MEGA 5.10 program [2] using the Maximum Likelihood, Neighbor-Joining, Minimum Evolution and Maximum Parsimony methods. In addition, dendrograms were produced with the T.N.T. program [3] using the Bootstrap and Jackknife methods.

Results
According to the bioinformatic analysis, of the 15 native Trichoderma sp isolates studied, 6 correspond to T. harzianum, 6 to T.
koningiopsis, 1 to T. hamatum, 1 to T. brevicompactum and 1 to T. pleuroticola. In addition, similarity relations between strains could be established from the dendrograms constructed with the MEGA 5.10 and TNT programs.

Conclusions
The dendrograms of native Trichoderma sp strains show similar topologies, with minimal differences between the parsimony analyses performed.

References
1. Druzhinina IS, Kopchinskiy AG, Kubicek CP: The first 100 Trichoderma species characterized by molecular data. Mycoscience 2006, 47:55–64.
2. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S: MEGA 5 Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution 2011, 28: 2731-2739.
3. Goloboff P, Farris JS, Nixon K: Review of TNT – Tree Analysis Using New Technology Version 1.0. Cladistics 2004, 20: 378-383.

Variations of ligand binding affinity upon protein conformational diversity

Ezequiel Juritz and Alexander Monzón
Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bs. As., Argentina

Background
As our previous studies show (Juritz, Maria Silvina Fornasari, et al. 2012; Juritz, Palopoli, et al. 2012), protein conformational diversity should be taken into account when applying computational biology methods that require a protein structure as input. We support the idea that, to obtain more accurate results, computational biology methods should treat the native state of a protein as a population of structural conformers in equilibrium and not as a rigid arrangement of atoms. A protein crystallographic structure should be considered an instance of the protein's structural dynamism, and different crystallographic structures of the same protein can be considered different structural conformers.
Description
In the present work we evaluate how different conformers of the same protein behave when a given ligand is docked to them. The docking method used was AutoDock Vina (Trott and Olson 2010), and the protein structures were obtained from the CoDNaS database (Conformational Diversity of the Native State) (Monzón, Juritz and Parisi, in preparation). The CoDNaS database contains a redundant collection of crystallographic structures of 9,474 proteins, for a total of 40,565 structures representing putative conformers of each corresponding protein. We performed docking with both natural and non-natural ligands for each protein.

Conclusion
Our preliminary results indicate that the difference in binding affinity between conformers is significant, reaching values greater than 5.0 kcal/mol. Interestingly, the variation in binding affinity does not correlate with the RMSD between the structural conformers.

Juritz, Ezequiel Iván, Maria Silvina Fornasari, et al. 2012. "On the effect of protein conformation diversity in discriminating among neutral and disease related single amino acid substitutions." BMC Genomics 13(Suppl 4): S5. http://www.biomedcentral.com/1471-2164/13/S4/S5 (Accessed June 28, 2012).
Juritz, Ezequiel Iván, Nicolás Palopoli, et al. 2012. "Protein conformational diversity modulates protein divergence." Mol Biol Evol.
Trott, O., and A. J. Olson. 2010. "AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading." J Comput Chem. 31(2): 455-61.
Metabolic pathfinding based on genetic algorithms

Matias Gerard1,2, Georgina Stegmayer1 and Diego Milone2
1 CIDISI-UTN-FRSF, CONICET, Lavaise 610 - Santa Fe (Argentina)
2 SINC(I)-FICH-UNL, CONICET, Ciudad Universitaria - Santa Fe (Argentina)

Background
Metabolic pathway searching consists of finding a set of reactions that transforms one compound into another. Several search methods based on classical algorithms such as breadth-first search (BFS) [1] and depth-first search (DFS) [2] exist for this task. However, there are problems in which a very high number of solutions must be explored, making classical methods practically inapplicable. Genetic algorithms use stochastic search to explore multiple points of the search space at the same time.

Material and methods
We propose a genetic algorithm (EAMP) that performs a two-end metabolic pathway search, and compare its performance with two classical search algorithms. To achieve this, the chromosomes were built by attaching a reaction to each gene, so that the left-to-right sequence of genes encodes a metabolic pathway. An initialization strategy was proposed that builds variable-size chromosomes with a partially conserved sequentiality of reactions. The crossover and mutation operators were designed to promote the building of a reaction chain. The fitness function was built to consider the validity of the reaction sequence, the presence of the compounds to be related and the occurrence of repeated reactions.

Results
EAMP was studied for several mutation rates and different initialization strategies. Results indicate that the minimum searching time was reached with a mutation rate of 0.04 and the initialization strategy with variable initial chromosome size. A comparison of EAMP with BFS and DFS is shown in Figure 1. Boxplots correspond to the searching time and number of reactions of 120 pathways found with each algorithm.
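For reference, the BFS baseline used in this comparison can be sketched as a shortest-path search over compounds, where each step applies one reaction. This is an illustrative sketch of classical BFS only, not the EAMP genetic algorithm; the reaction table, its identifiers and the function name are hypothetical and do not correspond to real pathway database entries.

```python
from collections import deque


def bfs_pathway(reactions, source, target):
    """Shortest list of reaction ids transforming `source` into `target`, or None.

    `reactions` maps a reaction id to a (substrates, products) pair of sets.
    """
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        compound, path = queue.popleft()
        if compound == target:
            return path
        for reaction_id, (substrates, products) in reactions.items():
            if compound in substrates:
                for product in products:
                    if product not in visited:
                        visited.add(product)
                        queue.append((product, path + [reaction_id]))
    return None


# Hypothetical toy network: A -> B -> C (short route) and A -> D -> E -> C (long route).
toy_reactions = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"A"}, {"D"}),
    "R4": ({"D"}, {"E"}),
    "R5": ({"E"}, {"C"}),
}
shortest = bfs_pathway(toy_reactions, "A", "C")
```

Because BFS expands compounds level by level, it returns a pathway with the fewest reactions, which matches the behaviour reported for BFS in the results; it never returns the longer alternative route.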
DFS performs the search in the minimum time but produces solutions with the maximum number of reactions allowed. BFS finds the shortest pathways but takes more time than EAMP. The genetic algorithm performs the search in a time intermediate between BFS and DFS, and finds not only the shortest pathways but also solutions with a greater number of reactions linking the two compounds.

[Figure 1, panel B: network of compound (C) and reaction (R) identifiers spanning Fructose and Mannose Metabolism, Pyruvate Metabolism, and Glycine, Serine and Threonine Metabolism.]
Figure 1. (A) Boxplots of searching time and number of reactions for the EAMP, BFS and DFS algorithms. Searching times and numbers of reactions are shown in white and gray, respectively. Time is plotted on a logarithmic scale. (B) Metabolic pathway linking C01019 and C00037. The pathway found is shown as a bold line.

Conclusions
The proposed genetic algorithm found metabolic pathways in times intermediate between those required by BFS and DFS. Moreover, it builds metabolic pathways of variable size, including both shortest pathways and longer solutions. This is interesting from a biological viewpoint because pathways longer than the shortest path could provide relevant information about alternative metabolic pathways.

References
1. Ogata, H., Goto, S., Fujibuchi, W., Kanehisa, M.: Computation with the KEGG pathway database. BioSystems 47 (1998) 119–128
2. Faust, K., Croes, D., van Helden, J.: Metabolic Pathfinding Using RPAIR Annotation.
Journal of Molecular Biology 388(2) (2009) 390–414

Conformational diversity and evolutionary rates in proteins

Diego Javier Zea1, Maria Silvina Fornasari1, Cristina Marino Buslje2 & Gustavo Parisi1
1 Structural Bioinformatics Group, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
2 Structural Bioinformatics Unit, Fundación Instituto Leloir, Capital Federal, Buenos Aires, Argentina

Introduction
Several factors have been associated with the modulation of the evolutionary rate. Gene expression level shows one of the strongest and most consistent correlations between genomic data and evolutionary rate [1]. Recent findings indicate that structural-functional features and translation rates could make comparable contributions to explaining evolutionary rates [2]. Most of these studies have described the native state of a protein with a single structure. However, it is well established that the native state of a protein is better represented by an ensemble of different conformers in dynamic equilibrium [3]. In this work we study how the presence of conformational diversity could influence the rate of evolution.

Methods
To study this relationship we used a major update of the PCDB database (Protein Conformational Data Base) [4]. This database contains almost 8000 proteins with different degrees of structural diversity, measured as the maximum root-mean-square deviation (RMSD) found between the different conformers of each protein. The RMSD was normalized for structural alignment length [5]. Each of these proteins was linked to OMA [6] to estimate dN (number of non-synonymous substitutions per non-synonymous site) as a measure of evolutionary rate using PAML 4 [7]. We used the CATH database [8] to analyze the domains in a given protein. Domains were selected for further dN estimation using Clustal Omega [9] protein alignments.
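The conformational diversity measure described above (maximum RMSD between conformers, normalized by alignment length) can be sketched as follows. This is a minimal illustration under stated assumptions: the conformers are taken to be already superposed and to list equivalent alpha carbons in the same order, and the normalization is assumed to be the RMSD100 form associated with reference [5], RMSD / (1 + ln(sqrt(N/100))). Function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered, pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def normalized_rmsd(value, n_aligned):
    """Assumed length normalization: RMSD / (1 + ln(sqrt(N / 100)))."""
    # The denominator is only positive for alignments longer than about 14 residues.
    return value / (1.0 + math.log(math.sqrt(n_aligned / 100.0)))


def max_pairwise_rmsd(conformers):
    """Largest RMSD over all pairs of conformers of one protein."""
    return max(rmsd(a, b) for a, b in combinations(conformers, 2))


# Hypothetical two-atom conformers, shifted along z.
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
conf_c = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
diversity = max_pairwise_rmsd([conf_a, conf_b, conf_c])
```

With this normalization an alignment of exactly 100 residues leaves the RMSD unchanged, which makes values from proteins of different sizes comparable on a common scale.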
Results and Conclusions
We found a negative correlation between dN (for orthologs between mouse and human) and the maximum alpha-carbon RMSD between protein conformers (Spearman rank correlation, rho = -0.34, p-value < 0.05) for single-domain proteins in humans. This rho was tested with a bootstrap; the 95% confidence interval ranges from -0.5 to -0.1. Our results indicate that conformational diversity has an important role in modulating protein evolutionary rates. We think that our findings could have important implications for the understanding of the protein evolution process.

References
1. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH: Why highly expressed proteins evolve slowly. Proceedings of the National Academy of Sciences of the United States of America 2005, 102:14338-43.
2. Wolf MY, Wolf YI, Koonin EV: Comparable contributions of structural-functional constraints and expression level to the rate of protein sequence evolution. Biology direct 2008, 3:40.
3. Tsai C-J, Ma B, Nussinov R: Folding and binding cascades: shifts in energy landscapes. 1999, 96:9970-9972.
4. Juritz EI, Alberti SF, Parisi GD: PCDB: a database of protein conformational diversity. Nucleic acids research 2011, 39:D475-9.
5. Carugo O: A normalized root-mean-square distance for comparing protein three-dimensional structures. 2001:1470-1473.
6. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C: OMA 2011: orthology inference among 1000 complete genomes. Nucleic acids research 2011, 39:D289-94.
7. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution 2007, 24:1586-91.
8. Orengo CA, Pearl FM, Bray JE, Todd AE, Martin AC, Lo Conte L, Thornton JM: The CATH Database provides insights into protein structure/function relationships. Nucleic acids research 1999, 27:275-9.
9.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 2011, 7:539.

Effect of O-glycosylation on the binding of extensins to peroxidases
A. Aptekmann (a,b), J.S. Salter (b), J. Estevez (b), A. Nadra (a)
(a) Departamento de Química Biológica FCEN, UBA
(b) IFIBYNE CONICET-UBA

The classical plant peroxidases (PERs), which contain a heme group, are involved in a wide range of roles such as lignification, hormonal signaling, development and ROS protection. In Arabidopsis thaliana, 73 apoplastic type III PERs have been described (1). Recently we discovered that some PER mutants, such as PER73, have a phenotype similar to that of mutants for certain cell wall glycoproteins, the extensins (EXTs), suggesting some degree of substrate specificity (2). We hypothesize that some PERs, including PER73, catalyze the crosslinking of O-glycoproteins (in particular EXTs) and that this process is influenced by the O-glycosylation status of those EXTs. To test this hypothesis, we set out to model PERs and their possible EXT ligands. In the present work we describe how we obtained the structure of PER73 bound to different EXTs with distinct O-glycosylations. The structures were modelled by homology, using as templates PER2 and horseradish peroxidase (PDB IDs: 1PA2 and 1H57) and a collagen structure (of the polyproline kind), followed by an energy minimization, after which the resulting structures were used for docking. As ligands for this docking we used EXTs with O-glycosylations like those found in wild-type Arabidopsis and under-O-glycosylated variants similar to those found in experiments with mutants of the O-glycosylation pathway (3).
The EXTs identified by this analysis as the most likely putative substrates will be evaluated in vivo and in vitro.

(1) Computational analyses and annotations of the Arabidopsis peroxidase gene family.
(2) "Modelado de la peroxidasa 73 y su especificidad por extensinas". 2011. SAB. A.A. Aptekmann, J.S. Salter, J.M. Velazquez, J.M. Estevez, A.D. Nadra.
(3) "Essential role of O-glycosylated plant cell wall extensins for polarized root hair growth". 2011. Science 332, 1401-1403. S.M. Velasquez et al., Nadra & J.M. Estevez.

Plant small heat shock proteins during different stress conditions other than heat. Comparative analysis between Arabidopsis thaliana and Solanum lycopersicum
Débora Pamela Arce1, Martin Damián Ré2, Silvana Beatriz Boggio2 and Estela Marta Valle2
1 Facultad Regional San Nicolás, Universidad Tecnológica Nacional, San Nicolás, 2900, Buenos Aires, Argentina
2 Instituto de Biología Molecular y Celular de Rosario (IBR-CONICET), Universidad Nacional de Rosario, S2002LRK, Rosario, Argentina

Background
Small heat shock proteins (sHSPs) are chaperones that play an important role in abiotic and biotic stress tolerance. The special importance of sHSPs in plants is suggested by their unusual abundance and by their presence in different cellular compartments [1]. Furthermore, some sHSPs are also expressed during certain stages of development [2]. In this study Arabidopsis thaliana and Solanum lycopersicum were used as model plants. Previous findings allowed us to identify heat shock protein (HSP), heat shock factor (Hsf) and sHSP genes up-regulated under oxidative stress mediated by methyl viologen (MV) in Arabidopsis [3]. In addition, tomato sHSPs were induced in red fruit compared to green fruit [4]. These results led us to adopt an in silico strategy for analyzing the regulation of sHSP gene expression.
Three sHSP promoter sequences, from mitochondrial LeHsp23.8-M, cytosolic LeHsp17.7-CI and cytosolic LeHsp17.4-CII, were analyzed.

Materials and Methods
In the present work we screened the following databases: Sol Genomics Network [http://www.sgn.cornell.edu/], Tomato EST Database [http://ted.bti.cornell.edu/], NCBI Database [http://www.ncbi.nlm.nih.gov/] and The Arabidopsis Information Resource (TAIR) [http://arabidopsis.org/]. Sequence analyses were performed using the bioinformatics tools BLASTn and T-Coffee [5]. For promoter analysis, 1.9 kb upstream regions of sHsp genes were identified by BLASTn in the NCBI database and in the Sol Genomics Network (SGN Combined WGS-BAC-unigenes set). Heat shock elements (HSEs) were searched for in the PLACE database [http://www.dna.affrc.go.jp/PLACE/index.html]. Further analysis to identify conserved motifs was performed using the expectation maximization method MEME [http://meme.sdsc.edu/meme4/intro.html], the PlantCARE database [http://bioinformatics.psb.ugent.be/webtools/plantcare/html/] and the MatInspector program [6].

Results
The analysis of the 1.9 kb promoter region of the sHsp genes using bioinformatics tools (see Materials and Methods) showed that heat shock elements or HSEs (CCAAT Box) were present in the three sequences analyzed. Other motifs detected were a sequence related to the abscisic acid response element (ABRE) and ethylene responsive elements (ERE).

Conclusions
These results indicate that LeHsp23.8-M, LeHsp17.7-CI and LeHsp17.4-CII could be involved in processes other than heat stress that are mediated by plant hormones (abscisic acid or ethylene). These promoter sequences could be used to generate transgenic tomato plants in order to evaluate the LeHsp23.8-M, LeHsp17.7-CI and LeHsp17.4-CII gene expression patterns under environmental stress conditions or at different developmental stages.

References
1. Wang W, Vinocur B, Shoseyov O, Altman A: Role of plant heat-shock proteins and molecular chaperones in the abiotic stress response.
Trends Plant Sci 2004, 9:244-252.
2. Prasinos C, Kampis K, Samakovli D, Hatzopoulos P: Tight regulation of expression of two Arabidopsis cytosolic Hsp90 genes during embryo development. J Exp Bot 2005, 56:633-644.
3. Scarpeci TE, Zanor MI, Carrillo N, Mueller-Roeber B, Valle EM: Generation of superoxide anion in chloroplasts of Arabidopsis thaliana during active photosynthesis: a focus on rapidly induced genes. Plant Mol Biol 2008, 66(4):361-378.
4. Re MD, Arce DP, Boggio SB: Expresión de sHsps luego de la conservación en frío de tomates (cv. MICRO-TOM). V Jornadas argentinas de Biología y Tecnología Postcosecha, 2009.
5. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol 2000, 302:205-217.
6. Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 2005, 21(13):2933-42.

GO Function predictions by True Path Rule
Pilar Bulacio∗1,2, Flavio Spetale1, Laura Angelone1,2 and Elizabeth Tapia1,2
1 Cifasis-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina
2 Facultad de Cs. Exactas e Ingeniería, Universidad Nacional de Rosario, Riobamba 245 Bis, Rosario, Argentina
Email: Pilar Bulacio∗ - [email protected]; ∗ Corresponding author

Hierarchical classification background
Protein function prediction is an important problem in bioinformatics research. Useful tools have been developed to identify similar sequences in annotation databases. When no similar sequences can be found, however, carefully designed data mining techniques may provide important clues for protein function prediction. In particular, hierarchical classification methods like the True Path Rule (TPR) [1] can take into account the relationships among protein functions defined in the Gene Ontology (GO).
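As an illustration of how a true-path-rule style consensus can be computed over a class hierarchy, the sketch below implements only the bottom-up step, in the averaging form commonly given for TPR ensembles [1]: the consensus of a node is the mean of its own local probability and the consensus values of its children that exceed a threshold. The toy hierarchy, the node names, the threshold of 0.5 and the function name are assumptions made for this example; the top-down propagation of negative decisions is left out.

```python
from typing import Dict, List


def tpr_consensus(
    children: Dict[str, List[str]],
    local_p: Dict[str, float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Bottom-up consensus probabilities for every node of a hierarchy.

    children maps a node to its direct children; local_p holds the local
    classifier output p for each node. Only children whose consensus is
    above the threshold contribute to the consensus of their parent.
    """
    consensus: Dict[str, float] = {}

    def visit(node: str) -> float:
        if node in consensus:  # memoized, so shared children are visited once
            return consensus[node]
        child_scores = [visit(child) for child in children.get(node, [])]
        positive = [score for score in child_scores if score > threshold]
        consensus[node] = (local_p[node] + sum(positive)) / (1 + len(positive))
        return consensus[node]

    for node in local_p:
        visit(node)
    return consensus


# Hypothetical three-level chain: root -> mid -> leaf.
hierarchy = {"root": ["mid"], "mid": ["leaf"], "leaf": []}

# Strong local evidence at the leaf lifts its ancestors.
strong = tpr_consensus(hierarchy, {"root": 0.6, "mid": 0.45, "leaf": 0.8})

# Weak local evidence at the leaf does not reach the root.
weak = tpr_consensus(hierarchy, {"root": 0.6, "mid": 0.45, "leaf": 0.52})
```

In the first call the intermediate node ends above the threshold although its own local probability is below it; in the second call it stays below the threshold and the root keeps its local value.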
This GO structure has an influence at two points: i) in the design of the training sets for machine-learned classifiers; and ii) in the global function prediction, because a sequence may belong to multiple classes. Fig. 1 and Fig. 2 show a simplified TPR analysis on GO with Arabidopsis data. The consensus probability p′ represents the membership of sample x in the GO nodes. Positive p′ values are shown in blue. The starting point for applying TPR is the set of local probabilities p; positive p values are shown in bold.

Figure 1: Node GO:06139 with p = 0.8 entails x belongs to six structured nodes. [GO subgraph annotated with the local probability p and the consensus probability p′ of each node.]

Figure 2: Node GO:06139 with p = 0.55 entails x belongs to two structured nodes. [Same GO subgraph annotated with p and p′.]

Results and Conclusions
Focusing on global function predictions with TPR on the GO taxonomy with Arabidopsis data, each node ni estimates the probability that a sample x belongs to the class ci. Positive local predictions for a GO node then propagate from bottom to top (influencing its ancestors), while negative ones propagate to its descendants, to produce the global consensus probability p′. Note that the strength of the local evidence (p in GO:06139) may change the consensus probability p′ and, therefore, the positive paths (see Fig. 1, Fig. 2).

References
1.
Valentini G: True Path Rule Hierarchical Ensembles for Genome-Wide Gene Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011, 8:832–847.

Fitting a mathematical expression to the effect of two herbicides (paraquat and glyphosate) on the population dynamics of Beijerinckia mobilis in soils planted with soybean in Tucumán, Argentina
Alberto Manlla, Melisa Apud Reinhold and Gladys Contino
Facultad de Agronomía y Zootecnia, Universidad Nacional de Tucumán, 4000 San Miguel de Tucumán, Argentina

Background
The activity of rhizosphere microorganisms converts atmospheric nitrogen into forms of nitrogen usable by plants (NH2, NH4 or NO3). Among the free-living nitrogen fixers, the least known genus is Beijerinckia, with B. mobilis as a native species. Agricultural practices in soybean crops in the region are characterized by an intense demand for pesticides. The most widely used herbicides are paraquat and glyphosate.

Materials and methods
The variables "time since herbicide application" and "most probable number (MPN) of microorganisms" were recorded in soil samples from 200 cm² plots, to which the herbicides were assigned at random, in a soybean crop in 2010. Regression analysis was applied to these variables.

Results
The experimental data, transformed to a logarithmic scale, are summarized in a scatter diagram (Figure 1), from which the mathematical models that best fit the trial can be inferred and the elements that characterize these relationships can be determined analytically (Table 1).
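The elements reported in Table 1 (y-intercept, discriminant and vertex) follow directly from the coefficients of the fitted parabolas shown in Figure 1. The sketch below recomputes them; the coefficients are taken from the figure, while the function and type names are hypothetical and introduced only for this example.

```python
import math
from typing import NamedTuple, Tuple


class ParabolaElements(NamedTuple):
    y_intercept: float
    discriminant: float
    roots: Tuple[float, ...]
    vertex_x: float
    vertex_y: float


def parabola_elements(a: float, b: float, c: float) -> ParabolaElements:
    """Characteristic elements of y = a*x**2 + b*x + c (a must be non-zero)."""
    if a == 0:
        raise ValueError("a must be non-zero for a parabola")
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        roots: Tuple[float, ...] = ()  # no real roots
    elif discriminant == 0:
        roots = (-b / (2 * a),)
    else:
        sq = math.sqrt(discriminant)
        roots = tuple(sorted(((-b - sq) / (2 * a), (-b + sq) / (2 * a))))
    vertex_x = -b / (2 * a)
    vertex_y = a * vertex_x ** 2 + b * vertex_x + c
    return ParabolaElements(c, discriminant, roots, vertex_x, vertex_y)


# Coefficients (a, b, c) of the fitted curves, x = log10(hours), y = log10(MPN).
FITS = {
    "paraquat": (2.335, -7.123, 8.535),
    "glyphosate": (1.738, -5.702, 6.731),
}

results = {name: parabola_elements(*coeffs) for name, coeffs in FITS.items()}
```

Since both parabolas open upwards (a > 0), the vertex is the minimum of the fitted curve: the lowest fitted population level and the log-time at which it occurs.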
Figure 1: Representation of the transformed variables. Horizontal axis: Log10 (time in hours); vertical axis: Log10 (MPN, most probable number). Fitted curves: paraquat, y = 2.335x² - 7.123x + 8.535 (R² = 0.860); glyphosate, y = 1.738x² - 5.702x + 6.731 (R² = 0.718).

Table 1: Analytical determination of the main elements

                                   Paraquat    Glyphosate
a) Y-intercept: value of y when x = 0, i.e. where the parabola crosses the y axis, point (0, c)
   c                               8.535       6.731
b) Roots or zeros of the function: there are no real roots because the discriminant (b² - 4ac) is negative (< 0)
   b² - 4ac                        -28.98      -14.28
   b²                              50.74       32.51
   4ac                             79.72       46.79
c) Extremum: coordinates of the vertex of the parabola
   x = -b / 2a                     1.525       1.640
   y(x)                            3.103       2.054
   quadratic term                  5.43        4.68
   linear term                     10.86       9.35
   constant term                   8.535       6.731

Conclusions
The herbicides affected B. mobilis in a similar way but with different intensity. Paraquat had the less harmful effect, allowing the population to recover to its original levels. Among other factors, microbial growth depends on the onset of herbicide degradation (by-products of this process may serve as nutrients or stimulate the recovery of the population), on the weather (720 hours after herbicide application, rain and low temperatures might have mitigated the harmful effects on the microorganisms) and on the soil (rain infiltrated the trial plots with difficulty, so surface runoff might have attenuated the herbicide effect).

References
1. Mayz Figueroa J. Fijación Biológica de Nitrógeno. Revista científica UDO Agrícola. Vol 4, pp. 1-20. Universidad de Oriente. Maturín, Estado Monagas 2004.
2. Olivares JP. Fijación Biológica de Nitrógeno. Estación Experimental del Zaidin, CSIC, Granada, España 2008.
3. Lourival Larini. Toxicología Dos Praguicidas. Editora Manole Ltda. San Pablo. Brasil. 1999.

Alphabetical index
Acosta, M. G., 13, 15 Adur, J. F., 15 Agüero, F., 11 Ahumada, M.
A., 13 Alibes, A., 12 Amadío, A., 13 Amerio, N., 11, 16 Andón, N., 15 Andreatta, M., 8 Angel, S., 11 Añón, M. C., 11 Aptekmann, A., 12 Arab Cohen, D., 13 Arce, A. L., 9 Areces, C., 14 Arévalo, I., 15 Arranz Amo, J. A., 16 Astorga, M., 12 Ballarin, V., 9, 13 Balzarini, M. G., 16 Bartó, C., 14 Basanta, B., 12 Belaich, M., 16 Berenstein, A. J., 9 Bessone, V., 15 Bianchi, M., 15 Biset, G., 15 Bondino, H. G., 10 Braunstein, L., 14 Briñón, M. C., 8 Brondino, C. D., 15 Brun, M., 9, 10, 13 Bugnon, L., 15 Bukowski Loináz, M. B., 15 Bustamante, J. P., 11 Bustos, D., 14 Cabral, J. B., 13 Candreva, A., 16 Capella, M., 9 Caramelo, J. J., 9 Carbonetto, B., 13 Carisimo, D., 14 Carmona, S., 11 Carrari, F., 11 Casco, V. H., 10, 13, 15 Castrillo, L., 11, 16 Chan, R. L., 9 Chemes, L. B., 12 Chernomoretz, A., 9 Churio, M., 14 Clavijo, B., 15 Cossi, P., 15 Cossio, M., 11, 12, 16 Costa, J. G., 15 Couto, P. M., 9 Cuadra, N., 15 Dalmasso, M. C., 11 Dalosto, S. D., 15 de Prat-Gay, G., 12 Debat, H., 14 Defelipe, L., 12, 16 Defelipe, L. A., 15 Defeudis, L., 14 Díaz, M., 11 Diaz-Zamboni, J. E., 15 Docena, G., 16 Dodelson de Kremer, R., 16 Ducasse, D., 14 Duffy, T., 11 Dumas, V. G., 12 Edelstein, J., 14 Eguaras, M., 14 Elgoyhen, B., 14 Embnet Node Argentina, 16 Erbes, L., 15 Espada, R., 9 Espinosa, M. B., 10 Estevez, J., 12 Estrin, D., 11 Faccendini, P. L., 15 Farber, M., 15 Farías, M. E., 11 Fazzi, L., 15 Fernández Alberti, S., 12 Fernández Feijóo, M. E., 10 Fernandez, E., 13, 16 Fernandez, P., 15 Ferreiro, D. U., 9 Ferro, S., 8 Ferroni, F. M., 15 Firmenich, V. E., 10 Fonseca, M. I., 11 Fornasari, M. S., 9, 14 Franchini, L., 14 Fresno, C., 13, 16 Fuselli, A., 12 Galetto, C. D., 10, 15 Garavaglia, M., 16 Garay, S., 12 Garay, S. A., 12 García, M. A., 13 Gauto, D., 15 Gende, L., 14 Gerard, M., 8 Ghiringhelli, D., 16 Giménez Pecci, M. D. L. P., 13 Girotti, M. R., 16 Glavina, J., 11, 12 Gómez, G. E., 9 Gómez, M.
C., 15 Gonzalez, G., 16 Gonzalez, S., 15 Gonzalo Parra, R., 9 Grabiele, M., 14 Gutson, D., 14–16 Hajos, S., 15 Hasenahuer, M. A., 10 Herrera, F. E., 12 Herrero, F., 16 Ibañez, I., 9 Iserte, J., 16 Izaguirre, M. F., 10, 15 Juárez, L., 13 Juri Ayub, M., 10 Juritz, E., 14 Juritz, E. I., 12 Kelmansky, D. M., 8 Kondrasky, A., 14 Labadie, G., 13 Lagier, C. M., 15 Laguna, I. G., 13 Lamberti, P., 11 Landolfo, L., 9 Lanzarotti, E., 16 Lapadula, W., 10 Laróvere, L. E., 16 Lassaga, S. L., 13 Laugero, S. J., 15 Llera, A., 13, 16 López Medus, M., 9 López, J. A., 16 Lufrano, D., 16 Luna, M. C., 15 Lund, O., 8 Macri, P., 14 Mancini, E., 11, 13 Marcipar, I. S., 15 Marino Buslje, C., 9, 14, 16 Martí, D., 14 Martí, M., 12, 15 Marti, M., 9, 11, 15, 16 Martinez, M. J., 11 Martino, D., 12 Maurino, F., 13 Menzaque, F. E., 16 Merino, G., 13, 16 Miele, S., 16 Migueles, M., 14 Milone, D., 8 Mishima, J., 13 Modenutti, C., 15 Molina, M., 12 Monzón, A., 12 Monzon, A., 14 Moscone, E., 14 Nadra, A., 12 Nardo, A. E., 11 Navas, L., 13 Nielsen, M., 8, 16 Ojeda, S., 15 Oliva, P., 16 Pagnuco, I., 9, 13 Pagnuco, I. A., 10 Pallarol, M., 13 Palomba, D., 11 Palopoli, N., 11 Paniego, N., 15 Paravani, E. V., 15 Parisi, G., 9, 11, 12, 14, 16 Perona, G., 11, 12 Perona, M., 12 Petruccelli, S., 16 Petruk, A., 12 Pisciottano, F., 14 Podhajcer, O., 16 Podhajcer, O. L., 16 Ponzoni, I., 11 Porta, E., 13 Prada, F., 16 Prato, L., 13, 16 Pury, P., 16 Quaranta, J. F., 12 Quevedo, M. A., 8 Rabinovich, D., 15, 16 Radusky, L., 15, 16 Radusky, L. G., 15 Ramírez, M. J., 15 Ramos, L., 16 Rascován, N., 11 Rascovan, N., 13 Ré, D., 9 Ré, M., 11 Reinert, M., 13 Remon, L., 13 Revale, S., 11, 13 Revuelta, M. V., 9, 10 Riberi, F., 15 Ribero, G., 13 Rivarola, M., 15 Rizzi, A. C., 15 Robledo, G., 14 Rodrigues, D., 12 Rodrigues, D. E., 12 Saavedra Fresia, C. E., 16 Sales, M. D. L. M., 12 Samoluk, S., 14 Sanchez, I., 12 Sánchez, I. E., 11, 12 Sanchez-Puerta, M.
V., 10 Santa Maria, C., 14 Scaldaferro, M., 14 Seijo, G., 14 Semrik, M., 13 Serrano, L., 12 Sferco, S. S., 15 Silvera Ruiz, S. M., 16 Simonetti, F. L., 16 Sioli, G., 11, 12, 16 Soria, M., 14 Soto, A., 11 Stegmayer, G., 8 Taleisnik, S., 13 Tardivo, L., 15 ten Have, A., 9, 10 Trumper, E., 14 Turjanski, A., 12, 16 Turjanski, A. G., 15 Uhart, M., 14 Vairo Cavalli, S., 16 Valacco, M. P., 16 Vazquez, G. E., 11 Vázquez, M., 11 Vazquez, M., 13 Vera, C. H., 13 Villalba, L., 11, 16 Villoria, L., 13 Zandomeni, R., 13 Zapata, P., 11, 16 Zea, D., 9, 14 Zimicz, C., 15 Zingaretti, L., 16