About the ExTRI project

The ExTRI resource is a set of TF-TG relationships generated by applying a text-mining approach to MedLine abstracts (see the publication by Miguel Vazquez, Martin Krallinger, Florian Leitner, Martin Kuiper, Alfonso Valencia and Astrid Laegreid (to appear soon)).

Motivation

The regulation of gene transcription by transcription factors is a fundamentally important biological process, yet the relations between transcription factors (TF) and their target genes (TG) are still only sparsely covered in databases. Understanding the human transcriptional gene regulatory network (TF-TG) is essential not only to characterize differences between cell types, tissues, and developmental processes but also to identify regulatory alterations associated with diseases.

We designed a text-mining approach dedicated to detecting mammalian TF-TG relationships, based on a high-quality curated training set specifically designed to detect mentions of direct regulation of a TG by a TF. The results can be used as a ‘triage’ set by curators, or for network analysis and modelling, causal reasoning or knowledge graph mining approaches.

Objectives

ExTRI aims to provide an exhaustive collection of putative transcription regulation interactions (TRIs) extracted from PubMed abstracts, with the following objectives:

  • To provide TRI information of sufficient quality for use in computational modelling or knowledge graph mining approaches that support causal reasoning.
  • To serve as a base for building productive TRI curation stacks.
  • To serve as a reference for critical examination of the distribution of TRI information across literature and current knowledge bases.

Description of the work

The ExTRI resource is a knowledge graph collection of TF-TG relationships. It was created by text-mining-assisted, large-scale extraction of information on regulatory interactions between specific DNA-binding transcription factors and their target genes from MedLine abstracts published up to 2015:

  • The text-mining approach was designed to detect mentions of a human DNA-binding transcription factor (TF) together with a target gene (TG), in a context that suggests direct regulation of gene expression at the transcription level.
  • All mentions from human, mouse and rat were normalized to human identifiers.
  • This resulted in approximately 40,000 different Transcription Regulation Interactions (TRIs) across almost 100,000 sentences obtained from around 34,000 abstracts (including 53,000 unique sentences that were scored as ‘high-confidence’).
  • The TRIs in our results feature almost a thousand TFs (15% fewer when only high-confidence TRIs are considered) and about 5,600 TGs (20% fewer when only high-confidence TRIs are considered).
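
As an illustration of how summary figures like those above could be derived from sentence-level results, here is a minimal Python sketch. It is not the ExTRI pipeline: the `Mention` record layout, its field names, the `summarize` helper and the toy data are all assumptions made up for this example, and the real ExTRI files may be organised differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One sentence-level TF-TG mention (hypothetical record layout)."""
    pmid: str              # abstract identifier
    sentence: str          # sentence text in which the pair was detected
    tf: str                # transcription factor, normalized to a human identifier
    tg: str                # target gene, normalized to a human identifier
    high_confidence: bool  # whether the sentence was scored as high-confidence


def summarize(mentions):
    """Count distinct abstracts, sentences, TRIs, TFs and TGs in a set of mentions."""
    return {
        "abstracts": len({m.pmid for m in mentions}),
        "sentences": len({(m.pmid, m.sentence) for m in mentions}),
        "tris": len({(m.tf, m.tg) for m in mentions}),
        "tfs": len({m.tf for m in mentions}),
        "tgs": len({m.tg for m in mentions}),
    }


# Toy data with placeholder names, invented purely for illustration.
mentions = [
    Mention("pmid1", "TF_A activates TG_X transcription.", "TF_A", "TG_X", True),
    Mention("pmid1", "TF_A binds the TG_Y promoter.", "TF_A", "TG_Y", True),
    Mention("pmid2", "TF_A induces TG_X expression.", "TF_A", "TG_X", False),
    Mention("pmid3", "TF_B may regulate TG_Z.", "TF_B", "TG_Z", False),
]

all_counts = summarize(mentions)
hc_counts = summarize([m for m in mentions if m.high_confidence])

print("all mentions:   ", all_counts)
print("high-confidence:", hc_counts)
```

In this sketch a TRI is taken to be a unique (TF, TG) pair, so the same interaction reported in several sentences or abstracts is counted once, and the high-confidence figures are obtained by filtering the mentions before counting.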