About the T-box riboswitch annotation database

The T-box riboswitch annotation database (TBDB) is a database that attempts to annotate structural and genetic features of T-box leader sequences. The goal of this database is to enrich information available about T-box riboswitch sequences in order to facilitate research in this area. While >15,000 T-box riboswitch sequences have been discovered by genome mining, only a handful have been experimentally characterized. We hope that the information contained in this database will decrease the barrier to entry into this field.

Feature prediction

The predictions contained in the TBDB were generated using the methods described in the accompanying bioRxiv preprint. All code used to generate the database is available on our GitHub. In summary, our pipeline performed feature prediction in two steps. First, INFERNAL was used to predict secondary structure; the secondary structure was then searched for conserved features including Stem I, the specifier loop, and the antiterminator. The codon and discriminator were extracted from the positions of these motifs. Thermodynamic calculations (minimum free energy, MFE) on antiterminator and terminator folds were performed using ViennaRNA. The NCBI accession numbers of the input sequences were used to gather various annotations, including taxonomy and downstream gene ontology. tRNAscan-SE was used to generate a list of tRNAs for each organism. The most likely codon within each specifier loop was chosen based on its position within the loop, with additional refinement using the tRNA discriminator base and downstream gene ontology (where present). Alternative codon frames, where found, are also presented.
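To illustrate the second step, the sketch below parses a secondary structure into base pairs and locates hairpin loops, which is a typical starting point for finding stems. It is a minimal sketch and not the TBDB pipeline code: it assumes the INFERNAL structure line has already been converted to plain dot-bracket notation, and the function names pair_table and hairpin_loops are hypothetical.

```python
def pair_table(structure):
    """Map every paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return pairs


def hairpin_loops(structure):
    """Return (first, last) indices of each unpaired run closed by one base pair."""
    pairs = pair_table(structure)
    loops = []
    for i, j in pairs.items():
        # Keep only the 5' side of each pair, and require an unpaired interior.
        if i < j and j - i > 1 and all(k not in pairs for k in range(i + 1, j)):
            loops.append((i + 1, j - 1))
    return sorted(loops)


if __name__ == "__main__":
    # Toy structure with two hairpins (0-based indices).
    print(hairpin_loops("((...))..((....))"))
```

Feature-specific rules (for example, which loop is the specifier loop) would be layered on top of a parse like this one; those rules are not reproduced here.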

Benchmarking feature prediction

The structurally-annotated dataset found in Vitreschak et al., 2008 was used to validate the accuracy of our feature prediction pipeline. From the 698 initial sequences:

T-box riboswitches (n = 698)

694 sequences had a T-box detected by INFERNAL (99.5%)
621 scored high enough to make a feature prediction (89.5%)

Specifiers (n = 621)

589 specifiers were predicted correctly (94.8%)
1 specifier was off by -1 (0.2%)
9 specifiers were off by +1 (1.4%)
22 specifiers were otherwise incorrect (3.5%)
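The four specifier categories above can be expressed as a simple comparison between predicted and reference positions. The following is a minimal sketch, assuming the offset is computed as predicted position minus reference position in nucleotides (the sign convention is our assumption); the function names are hypothetical and the input values are toy data, not the benchmark set.

```python
from collections import Counter


def specifier_outcome(predicted_start, reference_start):
    """Classify one prediction by its offset from the reference position."""
    offset = predicted_start - reference_start  # assumed sign convention
    if offset == 0:
        return "correct"
    if offset == -1:
        return "off by -1"
    if offset == 1:
        return "off by +1"
    return "otherwise incorrect"


def tally(position_pairs):
    """Count outcomes and report each as (count, percent of total)."""
    counts = Counter(specifier_outcome(p, r) for p, r in position_pairs)
    total = sum(counts.values())
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


if __name__ == "__main__":
    # Toy (predicted, reference) start positions.
    print(tally([(10, 10), (11, 10), (9, 10), (15, 10)]))
```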

Discriminator Base (n = 619)

619 discriminator bases were predicted correctly (100%)

T-box riboswitch classification and regulatory type

We used two different covariance models to build this database: the Rfam class I T-box riboswitch model (RF00230), and our own translational class II model derived from ileS leader sequences. Qualitatively, class I T-box riboswitches tend to have a larger Stem I structure, while class II T-box riboswitches have a shorter one. In our database, T-box riboswitches detected by the RF00230 covariance model are mostly class I transcriptional T-box riboswitches, and T-box riboswitches detected by our translational model will be class II translational T-box riboswitches. However, there may be instances where the RF00230 model predicts the folding of what are actually class II T-box riboswitches because of shared structural features (in particular, the antiterminator/antisequestrator motif). Additionally, there are other classes of T-box riboswitches (such as S. aureus ileS T-box riboswitches, e.g. RMB1LX9O) for which we do not currently have robust covariance models, but which are sometimes detected by the RF00230 covariance model.

We have attempted to classify the T-box riboswitches in the database by type of regulation. T-box riboswitches predicted using RF00230 are classified as transcriptional if we were able to identify a downstream terminator hairpin, and as unknown if we were not. T-box riboswitches predicted using our ileS translational model are classified as translational. In total, we have 20,396 putative class I transcriptional T-box riboswitches, 1,012 putative class II translational T-box riboswitches, and 2,128 T-box riboswitches of unknown regulatory type.
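The classification rule in the preceding paragraph amounts to a small decision function. The sketch below is illustrative only: the model labels "RF00230" and "class_II_ileS" and the function name regulatory_type are hypothetical names chosen for this example, not identifiers from the TBDB code.

```python
def regulatory_type(model, has_terminator):
    """Assign a regulatory type from the detecting model and terminator status."""
    if model == "class_II_ileS":
        # Hits from the ileS translational model are labeled translational.
        return "translational"
    if model == "RF00230":
        # RF00230 hits need an identified downstream terminator hairpin.
        return "transcriptional" if has_terminator else "unknown"
    raise ValueError(f"unrecognized model label: {model!r}")


if __name__ == "__main__":
    for model, term in [("RF00230", True), ("RF00230", False), ("class_II_ileS", False)]:
        print(model, term, "->", regulatory_type(model, term))
```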

Class I model (RF00230)

Will fold mostly canonical class I transcriptional T-box riboswitches

Sometimes will also fold canonical class I translational T-box riboswitches (terminator usually not found here)

Sometimes will fold T-box riboswitches that are actually class II translational (will have a poor INFERNAL output score)

Class II model (TBDB Ile Translational)

Will fold mostly class II Ile translational T-box riboswitches

Sometimes will also fold canonical class I T-box riboswitches (will have a poor INFERNAL output score)

Handling complex cases

The RF00230 model does not produce secondary structures with more than one antiterminator at a time. This means that complex T-box leader sequences (such as any partially-double T-box riboswitches) are not currently included in the TBDB. As model outputs, these cases would either be truncated after the first antiterminator/terminator (i.e. missing their second 'half'), or they would have the first antiterminator/terminator pair not labeled (i.e. first half not shown as a structural loop). The same problem could occur with double T-box riboswitches (tandemly arranged complete T-box riboswitches), where either one of the two T-box riboswitches would be absent, or the first T-box's antiterminator/terminator and the second T-box's Stem I would be mischaracterized. As we continue to build new covariance models for finding new T-box riboswitches, we will also improve existing models to handle complex cases.

RNA folding thermodynamics

T-box riboswitch switching depends on the relative stability of the antiterminator and terminator folds. We evaluated both using the thermodynamic methods provided in the ViennaRNA package. The antiterminator structure output by INFERNAL was optimized with appropriate constraints using RNAfold. Similarly, the terminator structure was found with RNALfold by searching for hairpins between the T-box bulge (5'-UGGN-3') and the downstream poly-U region. Both structures and their folding free energies (∆G) are included in the database.
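As an illustration of how the terminator search region might be delimited, the sketch below locates the first 5'-UGGN-3' motif and the first poly-U run after it, and returns the interval between them. It is a simplified sketch under stated assumptions: the minimum poly-U length of four is our assumption, the function name terminator_window is hypothetical, and the folding itself (done with RNALfold in the TBDB) is not shown. In the real pipeline the bulge position would come from the INFERNAL alignment rather than from a plain motif search.

```python
import re

TBOX_BULGE = re.compile(r"UGG[ACGU]")  # the 5'-UGGN-3' motif
POLY_U = re.compile(r"U{4,}")          # assumption: at least four consecutive U


def terminator_window(seq, search_from=0):
    """Return (start, end) of the region between the bulge and the poly-U run.

    Indices are 0-based and end-exclusive; returns None if either motif is missing.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    bulge = TBOX_BULGE.search(rna, search_from)
    if bulge is None:
        return None
    tract = POLY_U.search(rna, bulge.end())
    if tract is None:
        return None
    return bulge.end(), tract.start()


if __name__ == "__main__":
    toy = "AAUGGACCGCGGAAACCGCGGUUUUUUAA"  # toy sequence, not a real leader
    window = terminator_window(toy)
    print(window, toy[window[0]:window[1]])
```

The returned interval is the candidate region that a local folding tool would then scan for a stable hairpin.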

tRNA pairing prediction

In nature, T-box riboswitch logic is controlled by cognate tRNAs that Watson-Crick base pair with the T-box riboswitch specifier loop and anti-acceptor arm sequences (T-box region). However, other tertiary interactions are thought to play an important role in deciding if a specific tRNA can control T-box riboswitch logic. In particular, structural features in certain T-box riboswitch Stem I and Stem II regions are thought to interact with tRNA in a sequence-specific manner. In order to facilitate discovery of functional T-box riboswitch leaders, we used tRNAscan-SE to identify all tRNAs from T-box riboswitch hosts that could pair with each T-box riboswitch. Host tRNA identification was performed for all complete sequence records. For partial sequence records, tRNA identification was attempted and we report matching tRNAs if any were identified.
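The codon-anticodon part of this matching can be sketched as a reverse-complement check. The example below is a minimal sketch that considers strict Watson-Crick pairing only; it ignores wobble pairing, modified bases, and the tertiary interactions mentioned above. The function names and the toy tRNA list are hypothetical and do not represent tRNAscan-SE output.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Normalize a sequence to uppercase RNA letters."""
    return seq.upper().replace("T", "U")


def reverse_complement(seq):
    """Return the strict Watson-Crick reverse complement, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(to_rna(seq)))


def pairs_with_specifier(codon, anticodon):
    """True if the anticodon fully pairs with the specifier codon."""
    return reverse_complement(codon) == to_rna(anticodon)


def matching_trnas(codon, host_trnas):
    """Filter a {name: anticodon} mapping down to tRNAs that pair with the codon."""
    return [name for name, ac in host_trnas.items() if pairs_with_specifier(codon, ac)]


if __name__ == "__main__":
    toy_host = {"tRNA-Ile-GAU": "GAU", "tRNA-Leu-UAG": "UAG"}  # toy data
    print(matching_trnas("AUC", toy_host))
```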

Data sources

Input sequences for building the TBDB were obtained from previously published datasets. Structurally annotated datasets from Vitreschak et al. were used for validation. Each T-box riboswitch was assigned a unique ID based on its structure in order to de-duplicate entries shared between data sources.

The Rfam 14.0 database (14106 T-box riboswitches, predicted using INFERNAL)
GeCont3 (4491 T-box riboswitches)
Vitreschak et al., 2008 (698 T-box riboswitches, structurally annotated)
Weinreb et al., 2016
Abreu-Goodger and Merino, 2005 (RiBex web server)
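One way to de-duplicate entries shared between sources like these is to derive an identifier from the content of each record, so that identical entries collapse to the same key. The sketch below is purely illustrative: the actual TBDB ID scheme is not described here, and the choice of SHA-256, base32 encoding, an eight-character ID, and hashing the sequence together with the structure are all our assumptions. The function and field names are hypothetical.

```python
import base64
import hashlib


def structure_id(sequence, structure, length=8):
    """Derive a short, deterministic ID from a sequence and its structure."""
    key = f"{sequence.upper()}|{structure}".encode("ascii")
    digest = hashlib.sha256(key).digest()
    return base64.b32encode(digest).decode("ascii")[:length]


def deduplicate(records):
    """Keep the first record seen for each content-derived ID."""
    unique = {}
    for rec in records:
        unique.setdefault(structure_id(rec["sequence"], rec["structure"]), rec)
    return unique


if __name__ == "__main__":
    # Toy records: the first two are the same entry reported by two sources.
    records = [
        {"source": "A", "sequence": "GGGAAACCC", "structure": "(((...)))"},
        {"source": "B", "sequence": "gggaaaccc", "structure": "(((...)))"},
        {"source": "B", "sequence": "GGCAAAGCC", "structure": "(((...)))"},
    ]
    for uid, rec in deduplicate(records).items():
        print(uid, rec["source"], rec["sequence"])
```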