ATtRACT is A daTabase of experimentally validated RNA binding proteins and AssoCiated moTifs. Although the web application can run several typical analyses, we believe that the core of ATtRACT is the database itself.
The ATtRACT database can be queried through three types of search.

Search Tools

  • Search specific entries of the database
  • Search motifs
  • Search sequences

In addition, it is possible to discover patterns that occur repeatedly in a set of sequences and to compare them with the motifs present in the ATtRACT database.


Search tools
    Database Search
Users can search for information about specific entries of the database by typing or choosing one or a combination of the following options:

  • Official gene name, e.g. "SRSF1", or synonym, e.g. "SFRS1"
  • Gene ID, e.g. "ENSG00000136450"
  • Minimum length of the motif or Maximum length of the motif
  • Type of experiment (multiple choice allowed)
  • Organism (multiple choice allowed)
  • Domain (multiple choice allowed)

Searches can be restricted by combining queries. For example, all human motifs belonging to the "PCBP2" gene and ranging from 6 to 8 nucleotides can be retrieved by entering "PCBP2" as the gene name, selecting "Homo sapiens" under the "Organism" checkbox, and typing "6" in the "MINimum length of motif" field and "8" in the "MAXimum length of motif" field.
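
The way combined criteria narrow a search can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not ATtRACT's implementation: the `Entry` class, the `search` function and the toy records are hypothetical, and the motifs shown are placeholders rather than database content.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    gene_name: str
    organism: str
    motif: str


def search(entries, gene_name=None, organism=None, min_len=None, max_len=None):
    """Return the entries that satisfy every criterion that is not None."""
    hits = []
    for e in entries:
        if gene_name is not None and e.gene_name.upper() != gene_name.upper():
            continue
        if organism is not None and e.organism.lower() != organism.lower():
            continue
        if min_len is not None and len(e.motif) < min_len:
            continue
        if max_len is not None and len(e.motif) > max_len:
            continue
        hits.append(e)
    return hits


# Toy records for illustration only; they are not taken from ATtRACT.
entries = [
    Entry("PCBP2", "Homo sapiens", "CCCUCC"),      # length 6
    Entry("PCBP2", "Homo sapiens", "CCUCCCUUUC"),  # length 10
    Entry("PCBP2", "Mus musculus", "CCCUCCC"),     # length 7
    Entry("SRSF1", "Homo sapiens", "GGAGAAG"),     # length 7
]

hits = search(entries, gene_name="PCBP2", organism="Homo sapiens",
              min_len=6, max_len=8)
for hit in hits:
    print(hit.gene_name, hit.organism, hit.motif)
```

Each criterion that is left empty is simply skipped, which mirrors the "one or a combination of" behaviour of the search form.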

Gene ID Search

Users can search for an entry in ATtRACT by gene ID. The following table lists, for each organism, the database from which the gene ID is taken:

Organism                   Database                      ID
Homo sapiens               Ensembl                       ENSG...
Mus musculus               Ensembl and PDB               Ensembl: ENSMUSG...; PDB: PDBID_CHAIN, e.g. 3IVK_H
Drosophila melanogaster    Ensembl                       FBGN...
Saccharomyces cerevisiae   Ensembl                       YLR...
Caenorhabditis elegans     Ensembl                       WBGENE...
Bos taurus                 Ensembl                       ENSBTAG...
Bombyx mori                Ensembl Metazoa               BGIBMGA...
Aspergillus nidulans       Ensembl Fungi                 CADANIAG...
Danio rerio                Ensembl                       ENSDARG...
Naegleria gruberi          Ensembl Protists              lmjf...
Plasmodium falciparum      Ensembl Protists              MAL...
Pongo abelii               Ensembl                       ENSPPYG...
Schizosaccharomyces pombe  Ensembl Fungi                 SPAC...
Tetraodon nigroviridis     Ensembl                       ENSTNIG...
Thalassiosira pseudonana   Ensembl Protists              THAPS...
Gallus gallus              Ensembl                       ENSGALG...
Xenopus tropicalis         Ensembl                       ENSXETG...
Xenopus laevis             Xenbase                       XB-GENE-...
Chaetomium thermophilum    European Nucleotide Database  GL...
Mesocricetus auratus       European Nucleotide Database  ML...
Oryzias latipes            European Nucleotide Database  DQ...
Vanderwaltozyma polyspora  European Nucleotide Database  DS...
Zea mays                   European Nucleotide Database  FJ...

For Arabidopsis thaliana, Cricetulus griseus, Leishmania major, Nematostella vectensis, Neurospora crassa, Ostreococcus tauri, Physcomitrella patens, Phytophthora ramorum, Rhizopus oryzae, Schistosoma mansoni, Trichomonas vaginalis and Trypanosoma brucei no database is available, and the ID is the same as the gene name.
Description of the database IDs present in ATtRACT

 

Motif Search
Symbol    Description    Bases represented
A         Adenine        [A]
C         Cytosine       [C]
G         Guanine        [G]
T         Thymine        [T]
U         Uracil         [U]
W         Weak           [A,T]
S         Strong         [C,G]
M         aMino          [A,C]
K         Keto           [G,T]
R         puRine         [A,G]
Y         pYrimidine     [C,T]
B         Not A          [C,G,T]
D         Not C          [A,G,T]
H         Not G          [A,C,T]
V         Not T          [A,C,G]
N or X    aNy base       [A,C,G,T]
Tab 1: IUPAC code table
Users can submit a sequence of 4 to 12 nucleotides and check whether it corresponds to a specific motif. No mismatches are allowed, but users can take advantage of the IUPAC ambiguity notation; a table with all the accepted symbols is provided (Tab 1). For example, the input sequence "cgacgra" performs an exact match of both "cgacgaa" and "cgacgga" against all the entries in the database.
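
The expansion of an ambiguous query can be sketched as follows. The mapping is copied from Tab 1; the function names `expand` and `iupac_to_regex` are hypothetical, and the code only illustrates the matching rule, not how ATtRACT performs the search internally.

```python
import re
from itertools import product

# Symbol -> bases represented, as listed in Tab 1.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
}


def expand(query):
    """List every unambiguous sequence encoded by an IUPAC query."""
    choices = [IUPAC[symbol] for symbol in query.upper()]
    return ["".join(bases) for bases in product(*choices)]


def iupac_to_regex(query):
    """Compile the query into a regular expression with one class per symbol."""
    parts = []
    for symbol in query.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


print(expand("cgacgra"))
pattern = iupac_to_regex("cgacgra")
print(bool(pattern.fullmatch("CGACGGA")))
print(bool(pattern.fullmatch("CGACGCA")))
```

Because no mismatches are allowed, a motif matches only if it is one of the expanded sequences. A symbol that is not in the table raises a `KeyError`, which is one simple way to reject invalid input.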

Users can restrict the search by selecting a specific organism. Results are provided in table format. For a detailed description of the fields, refer to the Search Results section.
 

  Results

All query results are displayed as tables. A file containing the search results can be retrieved by clicking the drop-down Download menu at the top of the page and choosing the preferred format (CSV or TSV text format).

Users can choose how many items to display by selecting the corresponding number from the drop-down menu.

Users can further filter the entries of the table through a full-text search using the search input box.

Users can also copy the results to the clipboard or print them.







The table can be sorted by clicking on a column header.
The table headers and an explanation of each field follow:

Header         Description
Gene name      The official gene name
Gene ID        The official gene ID
Organism       Organism in which the motif has been assessed
Motif          Sequence of the motif
Len            Length of the motif
Pubmed         A link to the PubMed ID. Clicking it opens the reference that experimentally supports the binding.
Experiment     Type of experiment used to assess the motif
Domain         Domain present in the RBP
Offset         Distance, in nucleotides, from the beginning of the sequence
Go Terms       All the GO terms associated with the RBP
LOGO           A graphical representation of the sequence profile
Quality score  A numerical representation of the affinity between the RBP and its binding sites
Description of the table headers

The GO terms associated with an RNA binding protein can be inspected by clicking the corresponding button in the Go Terms column. A pop-up window will appear with all the associated GO terms in table format.

 


Scan sequence

Scan a sequence or a set of sequences
Users can upload a TXT file containing RNA/DNA sequence(s) in FASTA or multi-FASTA format and scan the sequence(s) for the presence of motifs.

Example: FASTA format

">Header
AGTGAATTATTTGAACCAGATCGCATTACAGTGATGTTCCTTAATTGTGATGTGTATCGAAGTGTGAGTAGATGTTAGAATG..."

Example: multi-FASTA format

">Header_1
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGT...
>Header_2
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTATGATT..."
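
As a rough sketch of how such a file can be read before scanning, the following Python function splits FASTA or multi-FASTA text into (header, sequence) pairs. The function name `parse_fasta` and the short sequences in the usage example are hypothetical; the upload form may apply additional validation that is not described here.

```python
def parse_fasta(text):
    """Split FASTA / multi-FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them and normalise the case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened, made-up sequences used only to show the call.
example = ">Header_1\nTAATG\nTTTG\n>Header_2\ncacaaag\n"
for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```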

Users can restrict the search by selecting a specific organism and/or by limiting the search space to motifs of a certain length.
The Burrows-Wheeler transform (BWT) is used to speed up the search.
The BWT makes it possible to:
  • count the occurrences of a pattern in one or more strings
  • locate the offset of a motif in one or more strings
in a very efficient manner.
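
The counting and locating operations listed above can be illustrated with a small BWT-based index (an FM-index with backward search). This is a didactic sketch under simplifying assumptions: the suffix array is built naively, the occurrence table is stored in full, and the names `build_fm_index` and `locate` are hypothetical. It is not ATtRACT's implementation, whose details are not described here.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, first-column offsets, occurrence table)."""
    text += "$"  # sentinel, sorts before A, C, G, T and U
    # Naive suffix array: fine for a demo, too slow for long sequences.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, first, occ


def locate(pattern, index):
    """Return the sorted 0-based offsets of every exact occurrence of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for c in reversed(pattern):  # backward search: last character first
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("ACGACGTACG")
offsets = locate("ACG", index)
print(len(offsets), offsets)
```

The number of occurrences is the width of the final interval, so counting needs one step per pattern character regardless of the text length; the offsets are then read from the suffix array.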

Results are provided in table format and graphical format.







Table description

As in the Search Results section, users can choose how many items to display by selecting the corresponding number from the drop-down menu. Users can filter the entries of the table through a full-text search using the search input box, copy the results to the clipboard, or print them. The table can be sorted by clicking on a column header.

Descriptions of the table headers follow:
Header        Description
Gene name     The official gene name
Gene ID       The official gene ID
Organism      Organism in which the motif has been assessed
Motif         Consensus sequence of the motif
Len           Length of the motif
Pubmed        A link to the PubMed ID. Clicking it opens the reference that experimentally supports the binding.
Experiment    Type of experiment used to assess the motif
Domain        Domain present in the RNA binding protein
Offset        Distance, in nucleotides, from the beginning of the sequence
Go Terms      All the GO terms associated with the RNA binding protein
Exon250       The log-odds ratio of this specific motif belonging to an exon plus 250 nucleotides upstream and 250 downstream (for further details, see Scoring Function)
CDS           The log-odds ratio of this specific motif belonging to a coding sequence (for further details, see Scoring Function)
Intron        The log-odds ratio of this specific motif belonging to an intron sequence (for further details, see Scoring Function)

The GO terms associated with an RNA binding protein can be inspected by clicking the corresponding button in the Go Terms column; a pop-up window will appear with all the associated GO terms. A full-text search across all the fields of the table can be performed by filling in the corresponding form.

Users can download:

  1. A file containing all the analyzed sequences. It can be retrieved by clicking the drop-down Download menu in the green stripe at the top of the page and choosing the preferred format (CSV or TSV text format). The block for each analyzed sequence contains:
    • The FASTA header
    • The nucleotide sequence
    • The results in tabular format. For more information about the table headers, refer to this link

  2. A file containing the analysis of a specific sequence. It can be retrieved by clicking the drop-down Download menu in the blue stripe and choosing the preferred format (CSV or TSV text format). As before, the file contains:
    • The FASTA header
    • The nucleotide sequence
    • The results in tabular format. For more information about the table headers, refer to this link
Graph Description
A graph is provided to visualize the results. Its purpose is to highlight the peaks where motifs are concentrated. The x axis represents the position along the sequence; each bin represents one nucleotide. The y axis represents the number of motifs starting at that position. Clicking a point on the figure opens a pop-up window with a table containing the following fields:
Header       Description
Gene name    The official gene name
Organism     The organism in which the motif has been assessed
Motif        The sequence of the motif starting at this point
Clicking outside the point closes the pop-up window.
The mouse wheel can be used to zoom in and out of the figure.
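
The quantity plotted on the y axis can be derived from the scan results as sketched below. The function name `motif_start_profile` is hypothetical, the hits are toy values, and offsets are assumed to be 0-based, which may differ from the convention used in the ATtRACT tables.

```python
from collections import Counter


def motif_start_profile(sequence_length, hits):
    """Count, for every position of the sequence, the motifs starting there.

    hits is an iterable of (offset, motif) pairs with 0-based offsets.
    """
    counts = Counter(offset for offset, _motif in hits)
    return [counts.get(pos, 0) for pos in range(sequence_length)]


# Toy hits on a 6-nucleotide sequence such as "ACGACG".
hits = [(0, "ACG"), (3, "ACG"), (3, "AC")]
profile = motif_start_profile(6, hits)
print(profile)
```

Peaks in the resulting list correspond to positions where several motifs start.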
Scoring Function
Let M = [m_1, m_2, m_3, ..., m_n] be the set of all the motifs in the database.
Let S1 = [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,n}], where s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,n} are the sequences in the considered genome representing an exon plus 250 nucleotides upstream and downstream.
Let S2 = [s_{2,1}, s_{2,2}, s_{2,3}, ..., s_{2,n}], where s_{2,1}, s_{2,2}, s_{2,3}, ..., s_{2,n} are the sequences in the considered genome representing a coding sequence.
Let S3 = [s_{3,1}, s_{3,2}, s_{3,3}, ..., s_{3,n}], where s_{3,1}, s_{3,2}, s_{3,3}, ..., s_{3,n} are the sequences in the considered genome representing an intron.
Let CS1 = [cm_1, cm_2, cm_3, ..., cm_n], where cm_1, cm_2, cm_3, ..., cm_n are the occurrences of motifs m_1, m_2, m_3, ..., m_n in S1.
Let CS2 = [cm_1, cm_2, cm_3, ..., cm_n], where cm_1, cm_2, cm_3, ..., cm_n are the occurrences of motifs m_1, m_2, m_3, ..., m_n in S2.
Let CS3 = [cm_1, cm_2, cm_3, ..., cm_n], where cm_1, cm_2, cm_3, ..., cm_n are the occurrences of motifs m_1, m_2, m_3, ..., m_n in S3.

Let s be an input sequence of length l_s and m_x ∈ M a motif of length l_m found with multiplicity t in the input sequence. The log-odds ratio is computed as:

LogOdd_{S1} = log(Obs / Exp_{S1})    for computing the score for sequences belonging to set S1

LogOdd_{S2} = log(Obs / Exp_{S2})    for computing the score for sequences belonging to set S2

LogOdd_{S3} = log(Obs / Exp_{S3})    for computing the score for sequences belonging to set S3

where Obs is defined as:

[equation image for Obs not preserved]

and Exp is defined as:

[equation image for Exp_{S1} not preserved], which depends on the length of the i-th sequence in S1

[equation image for Exp_{S2} not preserved], which depends on the length of the i-th sequence in S2

[equation image for Exp_{S3} not preserved], which depends on the length of the i-th sequence in S3
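
The exact expressions for Obs and Exp are not legible in this text, so the following Python sketch should be read as one plausible interpretation of the definitions above, not as ATtRACT's formula. It assumes that Obs is the multiplicity t divided by the number of windows of length l_m in the input sequence, that Exp is the motif count in a background set divided by the total number of such windows in that set, and that the natural logarithm is used. The function name `log_odds` and all numbers are hypothetical.

```python
import math


def log_odds(t, ls, lm, background_count, background_lengths):
    """Log-odds of a motif's observed frequency versus a background frequency.

    Assumed definitions (not confirmed by the source):
      Obs = t / (ls - lm + 1)
      Exp = background_count / sum(l_i - lm + 1 over background sequences)
    """
    windows = ls - lm + 1
    if windows <= 0:
        raise ValueError("the motif is longer than the input sequence")
    obs = t / windows
    # Background sequences shorter than the motif contribute no windows.
    bg_windows = sum(max(length - lm + 1, 0) for length in background_lengths)
    exp = background_count / bg_windows
    return math.log(obs / exp)


# Arbitrary numbers: a 6-mer found twice in a 105-nt input sequence,
# and 10 times in a background set made of two 505-nt sequences.
score = log_odds(t=2, ls=105, lm=6, background_count=10,
                 background_lengths=[505, 505])
print(score)
```

Under these assumptions a positive score means the motif is more frequent in the input sequence than in the chosen background set (exon plus flanks, coding sequence, or intron), and a negative score means it is less frequent.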

 


De Novo motif discovery
With De Novo motif discovery, users can discover patterns that occur repeatedly in a set of sequences.
To check whether the newly discovered motifs resemble experimentally validated motifs, they are compared with the ATtRACT database or a subset of it. A new pipeline was developed for this task and is shown in the figure:



Two different tools are integrated with ATtRACT:
Tools     Description
MEME      MEME analyzes the input sequences for similarities among them and produces as many motifs as requested. MEME uses an extension of the expectation maximization (EM) algorithm to fit a statistical model that automatically finds relationships between possibly related unaligned sequences.
Tomtom    Tomtom analyzes the MEME output to assess whether a newly discovered motif resembles any motif in the ATtRACT database.
Description of the input fields
The input fields required for the De Novo motif analysis are described below:

  • Upload a multi-FASTA file
    • Description: the multi-FASTA file of putatively related sequences
    • Mandatory: yes, if the "Upload MEME txt Output" field is empty
  • Upload MEME txt Output
    • Description: upload the output of your own MEME analysis and compare it with Tomtom
    • Mandatory: yes, if the "Upload a multi-FASTA file" field is empty
  • Model
    • Description: three types of model are available:
        one motif per sequence: each sequence in the dataset contains exactly one occurrence of the motif.
        zero or one per sequence: each sequence may contain at most one occurrence of each motif.
        any number of repetitions: each sequence may contain any number of non-overlapping occurrences of the motif.
    • Mandatory: yes (ignored if a MEME txt output is uploaded)
  • Maximum number of motifs
    • Description: MEME stops when the selected number of distinct motifs in the training set is reached or when none can be found with an E-value < 10 (default)
    • Mandatory: no (ignored if a MEME txt output is uploaded)
  • MINimum length of motif
    • Description: lower bound for the motif length [default = 4]
    • Mandatory: no (ignored if a MEME txt output is uploaded)
  • MAXimum length of motif
    • Description: upper bound for the motif length [default = 14]
    • Mandatory: no (ignored if a MEME txt output is uploaded)
  • Evalue
    • Description: MEME and Tomtom stop if the motif E-value is greater than this threshold [default = 10]
    • Mandatory: no
  • Upload your own MEME db
    • Description: allows uploading a subset of motifs extracted from the ATtRACT database. For further information refer to this link
    • Mandatory: no
Generation of database of known motifs
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The E-value depends strongly on the size of the database, so it is preferable to compare the de novo motifs with a database containing only motifs from the same or related species. For this reason, ATtRACT lets users build their own database of known motifs.
Input Field                Description
MINimum length of motif    Lower bound for the motif length
MAXimum length of motif    Upper bound for the motif length
Experiment                 Select the types of experiment
Organism                   Select the organisms
Domain                     Select specific domains
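
The dependence of the E-value on database size can be made concrete with a small calculation. The sketch assumes that the E-value of a match is its p-value multiplied by the number of target motifs searched, which is how the MEME Suite documentation describes Tomtom's estimate; the function name and all numbers are hypothetical.

```python
def e_value_from_p_value(p_value, n_target_motifs):
    """Expected number of chance matches: p-value scaled by the database size."""
    return p_value * n_target_motifs


p = 0.001        # arbitrary p-value for one query/target comparison
full_db = 1000   # hypothetical number of motifs in an unrestricted database
subset_db = 100  # hypothetical number of motifs after restricting to related species

e_full = e_value_from_p_value(p, full_db)
e_subset = e_value_from_p_value(p, subset_db)
print(e_full, e_subset)
```

With the same p-value, the match against the smaller, species-restricted database has a tenfold lower E-value, so it passes a given E-value threshold more easily.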
Results
Users can view two types of output. At the top of the page are the de novo motifs discovered by MEME, ordered by E-value. MEME uses an objective function on motifs to select the "best" motif; the objective function is based on the statistical significance of the log-likelihood ratio (LLR) of the occurrences of the motif. The E-value is assigned by MEME and is an estimate of the number of motifs (with the same width and number of occurrences) that would have an equal or higher log-likelihood ratio if the training set sequences had been generated randomly.
Users can download the results from the drop-down menu. The following file formats are available:
  • TSV format
  • CSV format
Then the significant matches discovered by Tomtom appear. The output is organized as follows:
Field          Description
Motif [num]    Motif identified by MEME
Summary        Provides information about the best alignment between the de novo motif and one of the entries of the ATtRACT database
Alignment      Provides an alignment figure between the motif present in the ATtRACT database (top) and the de novo motif (bottom)
A description of the Summary fields follows:
Field                Description
Gene name            The official gene name
Gene id              The official gene ID
Organism             Type of organism
Reported sequence    Reported sequence as annotated in the experiments
Experiments          Type of experiments
Family               Family of the RNA binding protein
Sequence length      Length of the motif
Offset               Offset between the two motifs in the alignment
P-value              The minimal p-value over all possible offsets
Tomtom E-value       The number of times the given query would be expected to match a target as well as or better than the observed match in a randomized target database of the given size
Q-value              The minimal false discovery rate at which the observed similarity would be deemed significant
Overlap              Number of nucleotides that overlap
 


Download
Users can download one or more specific fields from the database. The available fields are:
  • Gene name
  • Organism
  • Sequence
  • Length of the sequence
  • Experiments
  • Pubmed ID
  • Domain
  • Go Terms
The Gene ID field is always included. For a description of the fields, refer to this link
 


Statistics
The following statistics are produced:
  1. Motif length distribution
  2. Organism related length distribution
  3. Organism distribution
  4. Experiment distribution
  5. Domain distribution

More insight here