CompaRNA - on-line benchmarks of RNA structure prediction methods

Laboratory of Bioinformatics and Protein Engineering

RNA secondary structures used for benchmarking RNA 2D prediction methods + predictions generated by methods tested by CompaRNA

PDB 2.8 MB

The dataset of RNA structures extracted by CompaRNA from the PDB database consists of previously unknown RNAs. To remove redundant RNA sequences, cd-hit-est was used. Filtering compared all aligned sequence pairs using a 90% sequence identity cutoff, with the minimal alignment coverage for the longer sequence set to 70%.
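The redundancy filter described above could be driven from a script along the following lines. This is only a sketch: the exact command line used by CompaRNA is not given here, the file names are hypothetical, and it assumes the 90% identity cutoff maps to cd-hit-est's `-c` option and the 70% coverage of the longer sequence maps to `-aL`.

```python
def build_cdhit_est_command(input_fasta, output_fasta,
                            identity=0.90, coverage_long=0.70):
    """Build an argument list for cd-hit-est.

    Assumption: -c is the sequence identity threshold and -aL is the
    alignment coverage required for the longer sequence. The function
    name and defaults are illustrative, not CompaRNA's own code.
    """
    return [
        "cd-hit-est",
        "-i", input_fasta,             # input FASTA with all RNA sequences
        "-o", output_fasta,            # output FASTA with representatives
        "-c", f"{identity:.2f}",       # 90% sequence identity cutoff
        "-aL", f"{coverage_long:.2f}"  # 70% coverage of the longer sequence
    ]


# Hypothetical file names, used only for illustration.
cmd = build_cdhit_est_command("pdb_rnas.fasta", "pdb_rnas_nr.fasta")
```

If cd-hit-est is installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`.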

For every new RNA released by the PDB, the RNAView program is used to extract information about the secondary structure. If there is more than one model in the PDB file, the 100% consensus of secondary structure over all models is used as a reference. Secondary structures are extracted from the PDB files according to two definitions: "standard" and "extended". The standard RNA base pair definition follows the Leontis and Westhof classification, i.e. the canonical A-U, G-C and wobble G-U pairs that belong to the cis Watson-Crick/Watson-Crick geometry are considered the secondary structure. In contrast, the "extended" secondary structure definition includes all interacting bases in both cis and trans orientations.
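The two ideas in this paragraph, the "standard" pair definition and the 100% consensus over models, can be illustrated with a small sketch. It assumes that base pairs have already been parsed into `(i, j)` index tuples per model and that each pair carries a geometry label; the label `"cWW"` for cis Watson-Crick/Watson-Crick and both function names are hypothetical and do not reflect RNAView's actual output format.

```python
# Canonical pairs of the "standard" definition: A-U, G-C and wobble G-U.
CANONICAL = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def is_standard_pair(base_i, base_j, geometry):
    """True for A-U, G-C or G-U pairs labelled cis Watson-Crick/Watson-Crick.

    Assumption: geometry is a string and "cWW" marks cis WC/WC pairs.
    """
    pair = frozenset((base_i.upper(), base_j.upper()))
    return geometry == "cWW" and pair in CANONICAL


def consensus_pairs(models):
    """Keep only base pairs (i, j) present in every model (100% consensus)."""
    if not models:
        return set()
    return set.intersection(*(set(m) for m in models))


# Three hypothetical models of the same RNA.
models = [
    {(1, 10), (2, 9), (3, 8)},
    {(1, 10), (2, 9)},
    {(1, 10), (2, 9), (4, 7)},
]
reference = consensus_pairs(models)  # pairs shared by all three models
```

Under the "extended" definition the geometry check would be dropped, so that every interacting pair reported for a model is kept before the consensus is taken.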

The following criteria are then used to select RNAs valid for benchmarking RNA secondary structure prediction methods:

Length of at least 20 nt
Continuity of an RNA structure (no backbone breaks in RNA 3D structure)
Defined secondary structure
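The three criteria above amount to a simple filter, sketched below. The `RnaEntry` record and its fields are hypothetical, and "defined secondary structure" is interpreted here as "at least one base pair", which is an assumption rather than something stated on this page.

```python
from dataclasses import dataclass, field


@dataclass
class RnaEntry:
    """Hypothetical record for one RNA chain taken from a 3D structure."""
    sequence: str
    has_backbone_break: bool
    base_pairs: set = field(default_factory=set)


def is_valid_for_benchmark(entry, min_length=20):
    """Apply the three selection criteria to a single entry."""
    long_enough = len(entry.sequence) >= min_length   # at least 20 nt
    continuous = not entry.has_backbone_break         # no backbone breaks
    has_structure = len(entry.base_pairs) > 0         # assumed meaning of "defined"
    return long_enough and continuous and has_structure


# Hypothetical entries used only for illustration.
ok = RnaEntry("GGGAAACUUCGGUUUCCCAAAA", False, {(1, 18), (2, 17)})
too_short = RnaEntry("GGGAAACCC", False, {(1, 9)})
broken = RnaEntry("GGGAAACUUCGGUUUCCCAAAA", True, {(1, 18)})
unpaired = RnaEntry("A" * 25, False)
```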

RNAstrand 163.3 MB

RNAstrand stores experimentally solved RNA secondary structures, which are not necessarily extracted from 3D structures. The entire RNAstrand dataset, containing 4666 RNA sequences and secondary structures, was downloaded. The filtering procedure was exactly the same as for the PDB dataset. The only difference is that no reference 3D structures were used, so only one base pair definition was applied. The final RNAstrand dataset consists of 1987 RNAs.