Supplementary ExTRI project resources
Description of the Gold Standard
The manual curation followed four sequential steps. Sentences complying with steps 1 through 4 are positive TRI-sentences.
1) Identification of sentences describing gene regulation interactions (GRI): the sentence describes a regulatory interaction between a regulator and a target; the regulatory interaction is not indirect; and the semantic connection between regulator and target places the interaction within the context of transcription regulation.
2) Identification of regulator and target (the regulator is a protein; the target refers to a gene, not to a gene family or a collection of genes). Sentences that fulfil the selection criteria in 1) and 2) are gene regulation sentences (GRI), which include the subclass of transcription regulation sentences (TRI).
3) Normalization of the regulator to a TF (TFClass).
4) Normalization of the target to a gene.
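The classification rule implied by the four steps can be sketched in Python as follows. This is only an illustration, not the procedure used by the curators (curation was manual): the class `CuratedSentence`, its field names, and the two helper functions are hypothetical, and the identifiers in the usage example are placeholders rather than real TF or gene identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CuratedSentence:
    """Hypothetical record holding a curator's decisions for one sentence."""
    text: str
    step1_regulation_described: bool   # step 1: direct regulatory interaction in a transcription context
    step2_partners_identified: bool    # step 2: regulator is a protein, target is a gene
    tf_id: Optional[str] = None        # step 3: normalized TF, None if normalization failed
    gene_id: Optional[str] = None      # step 4: normalized target gene, None if normalization failed


def is_gri_sentence(s: CuratedSentence) -> bool:
    """Sentences fulfilling steps 1 and 2 are gene regulation sentences."""
    return s.step1_regulation_described and s.step2_partners_identified


def is_positive_tri_sentence(s: CuratedSentence) -> bool:
    """Sentences complying with all four steps are positive TRI sentences."""
    return is_gri_sentence(s) and s.tf_id is not None and s.gene_id is not None


# Usage with placeholder values (not taken from the corpus).
example = CuratedSentence(
    text="<sentence text>",
    step1_regulation_described=True,
    step2_partners_identified=True,
    tf_id="TF_PLACEHOLDER",
    gene_id="GENE_PLACEHOLDER",
)
print(is_gri_sentence(example), is_positive_tri_sentence(example))
```

Keeping the outcome of each step as a separate field makes it possible to tell apart sentences that fail early (no regulation described) from those that fail only at normalization.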
Each sentence in the gold standard (GS) was labelled independently by at least two domain experts. The annotations were compared and, in case of disagreement, decisions were discussed against the curation guidelines until consensus was reached. The GS is available as Supplementary Table 4.
The curation process was designed to capture only binary regulator-target pairs; regulatory complex dependencies (e.g., a complex that requires the presence of several [co-]factors) were not annotated as such. Furthermore, TRI descriptions that were not expressed in a single sentence, i.e. pairs that can only be established from the context of more than one sentence of a publication, were not included in the corpus.
The final GS comprises 658 unique positive sentences from 189 different abstracts and specifies 916 TF relations between 208 different TFs and 208 different target genes (TGs); the full corpus is given in Supplementary Table 4. These positive and negative TRI sentences, selected and annotated with a list of binary interaction partners, are the central resource upon which the literature information extraction in the current work was built.
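Summary counts of this kind can be recomputed from a tabular export of the corpus. The Python sketch below shows one way to do so, under stated assumptions: the `Row` layout and its field names are hypothetical and do not reflect the actual columns of Supplementary Table 4, a relation is assumed to be a unique (sentence, TF, TG) triple, and the rows in the usage example are toy data.

```python
from typing import Dict, Iterable, NamedTuple


class Row(NamedTuple):
    """Hypothetical one-relation-per-row layout of the annotated corpus."""
    sentence_id: str
    abstract_id: str
    tf: str
    tg: str
    positive: bool


def summarize(rows: Iterable[Row]) -> Dict[str, int]:
    """Count unique positive sentences, abstracts, relations, TFs and TGs."""
    pos = [r for r in rows if r.positive]
    return {
        "sentences": len({r.sentence_id for r in pos}),
        "abstracts": len({r.abstract_id for r in pos}),
        # Assumption: a relation is a unique (sentence, TF, TG) triple.
        "relations": len({(r.sentence_id, r.tf, r.tg) for r in pos}),
        "tfs": len({r.tf for r in pos}),
        "tgs": len({r.tg for r in pos}),
    }


# Toy rows for illustration only; not taken from the corpus.
toy = [
    Row("s1", "a1", "TF_A", "GENE_X", True),
    Row("s1", "a1", "TF_A", "GENE_Y", True),
    Row("s2", "a1", "TF_B", "GENE_X", True),
    Row("s3", "a2", "TF_A", "GENE_Z", False),
]
print(summarize(toy))
```

Negative sentences are excluded before counting because the figures above describe the positive part of the corpus only.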