- What is EVEX?
- How are gene/protein names identified in text?
- How are gene/protein names normalized?
- Which kind of interactions are extracted by EVEX?
- How are interactions extracted from text?
- What is the canonical generalization?
- What are the HomoloGene and Ensembl (Genomes) generalizations?
- Is the data in EVEX up-to-date?
- How should EVEX be cited?
- Can we download or export the EVEX data?
What is EVEX?
EVEX (short for EVent EXtraction) is a text mining resource built on top of PubMed abstracts and PubMed Central full-text articles. It contains more than 76 million automatically extracted gene/protein names and over 40 million bio-molecular events pertaining to them. We present both direct and indirect associations between genes and proteins, enabling explorative browsing of the information contained in the literature. The dataset has further been annotated with unique gene identifiers and enriched with gene families from Ensembl and HomoloGene, largely resolving gene name ambiguity and providing homology-based generalizations of the data.
How are gene/protein names identified in text?
The BANNER system of Leaman and Gonzalez (2008) is used for named entity recognition (NER). It is a machine-learning system based on conditional random fields, and has been released under the Common Public License 1.0.
How are gene/protein names normalized?
The GenNorm system of Wei and Kao (2011) is used for named entity normalization (NEN). It is an integrative method for cross-species gene normalization and achieved top performance on the BioCreative III challenge.
Which kind of interactions are extracted by EVEX?
The EVEX resource contains data on various bio-molecular events such as phosphorylation, transcription, localization, binding, protein catabolism, gene expression, (DNA) methylation, ubiquitination and (positive/negative) regulation. Furthermore, indirect associations include co-regulators and common binding partners.
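As a rough illustration of what an indirect association such as "common binding partners" can mean, the sketch below derives shared partners from a list of pairwise binding events. This is one plausible reading, not the EVEX implementation; the function name, the input format and the protein labels P1 to P4 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def common_binding_partners(binding_pairs):
    """Map each pair of proteins to the set of partners they both bind.

    `binding_pairs` is an iterable of (protein_a, protein_b) tuples, an
    assumed input format chosen for illustration only.
    """
    partners = defaultdict(set)
    for a, b in binding_pairs:
        # Binding is treated as symmetric here.
        partners[a].add(b)
        partners[b].add(a)

    shared = {}
    for x, y in combinations(sorted(partners), 2):
        common = (partners[x] & partners[y]) - {x, y}
        if common:
            shared[(x, y)] = common
    return shared

# Hypothetical binding events between placeholder proteins.
pairs = [("P1", "P3"), ("P2", "P3"), ("P2", "P4")]
shared = common_binding_partners(pairs)
print(shared)
```

In this toy input, P1 and P2 are indirectly associated because both bind P3, even though no direct event links them.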
How are interactions extracted from text?
The interactions in EVEX are extracted with the Turku Event Extraction System. This machine-learning framework is a generalized biomedical relation extraction tool consisting mainly of SVM components. It was the best-performing system of the BioNLP'09 Shared Task and delivered state-of-the-art predictions in almost all subchallenges of the BioNLP'11 Shared Task.
What is the canonical generalization?
To account for lexical variations of gene/protein symbols, we have defined a canonical form for each symbol, removing non-alphanumerical characters and ignoring capitals. As a result, all variants such as "Esr-1", "Esr 1" and "ESR1" are mapped to the same canonical string "esr1". The canonical generalization subsequently fetches and aggregates all occurrences of the same symbol and its events in text, creating meaningful summaries across articles.
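The canonical form described above can be sketched in a few lines. This is a minimal illustration, assuming ASCII-only symbols (a Greek letter, for instance, would simply be dropped); the function names are hypothetical and not taken from the EVEX code base.

```python
import re
from collections import defaultdict

def canonical_form(symbol: str) -> str:
    """Lower-case the symbol and drop every non-alphanumeric character.

    Assumption: only ASCII letters and digits are kept.
    """
    return re.sub(r"[^a-z0-9]", "", symbol.lower())

def group_by_canonical(mentions):
    """Group surface forms that share the same canonical string."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[canonical_form(mention)].append(mention)
    return dict(groups)

mentions = ["Esr-1", "Esr 1", "ESR1", "TP53", "tp-53"]
groups = group_by_canonical(mentions)
print(groups)
```

Grouping by the canonical string is what allows occurrences of the same symbol to be aggregated across articles regardless of spelling variation.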
What are the HomoloGene and Ensembl (Genomes) generalizations?
Building on top of the gene normalization output and the canonical generalizations, the family generalizations aggregate multiple members of gene families and the frequent synonyms describing them. These generalizations provide a high-level overview of common interactions between gene families and enable homology-based hypotheses for newly characterized genes/proteins. Currently, EVEX provides three different ways to define gene families: through HomoloGene (eukaryotes), Ensembl (vertebrates) or Ensembl Genomes (metazoa, plants, protists, fungi, bacteria). For example, the HomoloGene family of "esr1" contains genes from human, mouse, rat, chicken and zebrafish, and common synonyms include "Esr-1", "ERalpha", "NR3A1" and "estrogen receptor 1".
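To make the idea of a family generalization concrete, the sketch below lifts gene-level events to family-level counts. It is only an illustration under stated assumptions: the gene identifiers, family labels, event tuples and the function name are all hypothetical, and the real resource derives its families from HomoloGene or Ensembl rather than from a hand-written dictionary.

```python
from collections import Counter

# Hypothetical gene-to-family mapping (placeholder identifiers).
GENE_TO_FAMILY = {
    "A_human": "famA",
    "A_mouse": "famA",
    "B_human": "famB",
    "B_rat": "famB",
}

def generalize_events(events, gene_to_family):
    """Count events per (type, source family, target family).

    `events` holds (event_type, source_gene, target_gene) tuples, an
    assumed format. Events with an unmapped gene are skipped.
    """
    counts = Counter()
    for event_type, source, target in events:
        src_family = gene_to_family.get(source)
        tgt_family = gene_to_family.get(target)
        if src_family is None or tgt_family is None:
            continue
        counts[(event_type, src_family, tgt_family)] += 1
    return counts

# Made-up events between placeholder genes from different species.
events = [
    ("regulation", "A_human", "B_human"),
    ("regulation", "A_mouse", "B_rat"),
    ("binding", "A_human", "unmapped_gene"),
]
family_counts = generalize_events(events, GENE_TO_FAMILY)
print(family_counts)
```

Evidence found in different species ends up in the same family-level entry, which is the basis for homology-based hypotheses about less well-studied genes.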
Is the data in EVEX up-to-date?
Yes, we invest a lot of time in keeping the resource up-to-date. Since its original release in 2011, EVEX has more than doubled in size by processing newly published PubMed abstracts and additionally covering PubMed Central Open Access articles. We will keep updating the resource as new biomedical articles are being published.
How should EVEX be cited?
Please refer to our latest publication: Van Landeghem S, Björne J, Wei C-H, Hakala K, Pyysalo S, Ananiadou S, Kao H-Y, Lu Z, Salakoski T, Van de Peer Y, Ginter F (2013). Large-scale event extraction from literature with multi-level gene normalization. PLoS One 8:e55814.
- The original dataset, distributed as plain text files: Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T (2010). Scaling up Biomedical Event Extraction to the Entire PubMed. Proceedings of BioNLP 2010, pp 28-36.
- The MySQL database and canonical or gene family generalizations: Van Landeghem S, Ginter F, Van de Peer Y, Salakoski T (2011). EVEX: A PubMed-scale resource for homology-based generalization of text mining predictions. Proceedings of BioNLP 2011, pp 28-37.
- The website: Van Landeghem S, Hakala K, Rönnqvist S, Salakoski T, Van de Peer Y, Ginter F (2012). Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology and Indirect Associations. Advances in Bioinformatics, special issue Literature-Mining Solutions for Life Science Research, ID 582765.
- The plugin for Cytoscape: Hakala K, Van Landeghem S, Kaewphan S, Salakoski T, Van de Peer Y, Ginter F (2012). CyEVEX: Literature-scale network integration and visualisation through Cytoscape. SMBM Systems demo (in press).
Can we download the EVEX data?
Yes! Please visit the download section for the latest EVEX data.
A Cytoscape plugin called CyEVEX is also available. CyEVEX enables straightforward integration of textual data with any kind of network analysis.
Older versions of the EVEX resource:
- The original dataset, in Shared Task format & XML: 2010 release
- The corresponding MySQL database, enriched with canonical and gene family generalizations: 2011 release