
ATtRACT is A daTabase of experimentally validated RNA binding proteins and AssoCiated moTifs. Although the web application supports several types of analysis, we believe that the core of ATtRACT is the database itself.
The ATtRACT database can be consulted through three types of search.

Search Tools

Search specific entries of the database
Search motifs
Search sequences

In addition, it is possible to discover patterns that occur repeatedly in a set of sequences and to compare them with the motifs present in the ATtRACT database.

Search tools

Database Search

Users can search for information about a specific entry of the database by typing or choosing one or a combination of the following options:

Official gene name, e.g. "SRSF1", or synonym, e.g. "SFRS1"
Gene ID, e.g. "ENSG00000136450"
Minimum length of the motif or Maximum length of the motif
Type of experiment (multiple choice allowed)
Organism (multiple choice allowed)
Domain (multiple choice allowed)

Searches can be restricted by combining queries. For example, all human motifs belonging to the "PCBP2" gene and ranging from 6 to 8 nucleotides can be retrieved by entering "PCBP2" in the gene name field, selecting "Homo sapiens" under the "organism" checkbox, and typing "6" in the "MINimum length of motif" field and "8" in the "MAXimum length of motif" field.
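The way combined criteria narrow a search can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the code behind the web form: the `MotifEntry` class, the `search` function and the toy records (including their motif strings) are assumptions made up for this example and are not real ATtRACT entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifEntry:
    """Hypothetical, simplified database record."""
    gene_name: str
    organism: str
    motif: str


def search(entries, gene_name=None, organism=None, min_len=None, max_len=None):
    """Return the entries that satisfy every criterion that is not None."""
    hits = []
    for entry in entries:
        if gene_name is not None and entry.gene_name.upper() != gene_name.upper():
            continue
        if organism is not None and entry.organism.lower() != organism.lower():
            continue
        if min_len is not None and len(entry.motif) < min_len:
            continue
        if max_len is not None and len(entry.motif) > max_len:
            continue
        hits.append(entry)
    return hits


# Invented records for illustration only; the motifs are NOT taken from ATtRACT.
toy_entries = [
    MotifEntry("PCBP2", "Homo sapiens", "CCCUCCC"),
    MotifEntry("PCBP2", "Homo sapiens", "CCCC"),
    MotifEntry("PCBP2", "Mus musculus", "CCUCCC"),
    MotifEntry("SRSF1", "Homo sapiens", "GGAGGAA"),
]

hits = search(toy_entries, gene_name="PCBP2", organism="Homo sapiens",
              min_len=6, max_len=8)
print([entry.motif for entry in hits])
```

Each criterion that is left empty is simply skipped, which mirrors the way the form lets users fill in one field or any combination of fields.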

Gene ID Search

Users can search for an entry in ATtRACT using its gene ID. The following table lists, for each organism, the database from which the gene ID is taken:

Description of the Database ID present in ATtRACT
Organism	Database	ID
Homo sapiens	Ensembl	ENSG....
Mus musculus	Ensembl and PDB	Ensembl: ENSMUSG... PDB: PDBID_CHAIN, e.g. 3IVK_H
Drosophila melanogaster	Ensembl	FBGN...
Saccharomyces cerevisiae	Ensembl	YLR...
Caenorhabditis elegans	Ensembl	WBGENE...
Bos taurus	Ensembl	ENSBTAG...
Bombyx mori	Ensembl Metazoa	BGIBMGA...
Aspergillus nidulans	Ensembl Fungi	CADANIAG...
Danio rerio	Ensembl	ENSDARG...
Naegleria gruberi	Ensembl Protists	lmjf...
Plasmodium falciparum	Ensembl Protists	MAL...
Pongo abelii	Ensembl	ENSPPYG...
Schizosaccharomyces pombe	Ensembl Fungi	SPAC...
Tetraodon nigroviridis	Ensembl	ENSTNIG...
Thalassiosira pseudonana	Ensembl Protists	THAPS...
Gallus gallus	Ensembl	ENSGALG...
Xenopus tropicalis	Ensembl	ENSXETG...
Xenopus laevis	Xenbase	XB-GENE-...
Chaetomium thermophilum	European nucleotide database	GL...
Mesocricetus auratus	European nucleotide database	ML...
Oryzias latipes	European nucleotide database	DQ...
Vanderwaltozyma polyspora	European nucleotide database	DS...
Zea mays	European nucleotide database	FJ...
Arabidopsis thaliana, Cricetulus griseus, Leishmania major, Nematostella vectensis, Neurospora crassa, Ostreococcus tauri, Physcomitrella patens, Phytophthora ramorum, Rhizopus oryzae, Schistosoma mansoni, Trichomonas vaginalis, Trypanosoma brucei	Database not available	Same as gene name
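Because some prefixes in the table are contained in others (ENSG... is also the start of ENSGALG...), a script that guesses the organism from a gene ID has to prefer the longest matching prefix. The sketch below shows one way to do that; the `ID_PREFIXES` mapping covers only a subset of the table, and `guess_organism` is a hypothetical helper written for this example, not part of ATtRACT.

```python
# Subset of the ID prefixes listed in the table above, upper-cased.
# Only rows whose ID column is a clear organism-specific prefix are included.
ID_PREFIXES = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSBTAG": "Bos taurus",
    "ENSDARG": "Danio rerio",
    "ENSPPYG": "Pongo abelii",
    "ENSTNIG": "Tetraodon nigroviridis",
    "ENSGALG": "Gallus gallus",
    "ENSXETG": "Xenopus tropicalis",
    "FBGN": "Drosophila melanogaster",
    "WBGENE": "Caenorhabditis elegans",
}


def guess_organism(gene_id):
    """Return the organism whose prefix is the longest match, or None."""
    gene_id = gene_id.upper()
    matches = [prefix for prefix in ID_PREFIXES if gene_id.startswith(prefix)]
    if not matches:
        return None
    # "ENSGALG..." matches both "ENSG" and "ENSGALG"; keep the longest one.
    return ID_PREFIXES[max(matches, key=len)]


print(guess_organism("ENSG00000136450"))  # gene ID used as an example earlier
print(guess_organism("ENSGALG..."))       # placeholder exactly as in the table
```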

Motif Search

Tab 1: IUPAC code table
Symbol	Description	Bases represented
A	Adenine	[A]
C	Cytosine	[C]
G	Guanine	[G]
T	Thymine	[T]
U	Uracil	[U]
W	Weak	[A,T]
S	Strong	[C,G]
M	aMino	[A,C]
K	Keto	[G,T]
R	puRine	[A,G]
Y	pYrimidine	[C,T]
B	Not A	[C,G,T]
D	Not C	[A,G,T]
H	Not G	[A,C,T]
V	Not T	[A,C,G]
N or X	aNy base	[A,C,G,T]

Users can submit a sequence of 4 to 12 nucleotides and check whether it matches a known motif. No mismatches are allowed, but users can take advantage of the IUPAC ambiguity notation; a table with all the accepted symbols is provided (Tab 1). For example, if the input sequence is "cgacgra", the search performs an exact match of both "cgacgaa" and "cgacgga" against all the entries in the database.

The search is limited to 81 possible combinations. For example, "NNNNAG" is not allowed, because under the IUPAC code it expands to 4×4×4×4 = 256 possible motifs.
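The expansion of an ambiguous query and the combination limit can be illustrated with a short Python sketch. The symbol table is the one given in Tab 1; the function names and the way the limits are enforced are assumptions for illustration and do not describe the server's actual implementation.

```python
from itertools import product

# Bases represented by each symbol, as listed in Tab 1.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT", "X": "ACGT",
}
MAX_COMBINATIONS = 81       # limit stated in the text
MIN_LEN, MAX_LEN = 4, 12    # accepted query lengths stated in the text


def count_combinations(pattern):
    """Number of concrete sequences an IUPAC pattern stands for."""
    total = 1
    for symbol in pattern.upper():
        total *= len(IUPAC[symbol])
    return total


def expand(pattern):
    """List every concrete sequence matched by the pattern, within the limits."""
    if not MIN_LEN <= len(pattern) <= MAX_LEN:
        raise ValueError(f"query length must be {MIN_LEN} to {MAX_LEN} nucleotides")
    n = count_combinations(pattern)
    if n > MAX_COMBINATIONS:
        raise ValueError(f"{pattern!r} expands to {n} motifs (limit {MAX_COMBINATIONS})")
    choices = (IUPAC[symbol] for symbol in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


print(expand("cgacgra"))             # the two sequences from the example above
print(count_combinations("NNNNAG"))  # 4 * 4 * 4 * 4 * 1 * 1
```

Counting before expanding keeps the check cheap: the number of combinations is just the product of the number of bases each symbol represents.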

Users can restrict the search by selecting a specific organism. Results are provided in table format. For a detailed description of the fields, refer to the Search Results section.

Results

All query results are displayed as tables. A file containing the search results can be retrieved by clicking the drop-down Download menu at the top of the page and choosing the preferred format (csv or tsv text format).

Users can choose how many items to display by selecting the corresponding number from the drop-down menu.

Users can further filter the entries of the table through a full-text search using the search input box.

Users can also copy the results to the clipboard or print them.

The table can be sorted by clicking on a column header.
The table headers are explained below:

Description of the table headers
Header	Description
Gene name	The official gene name is reported
Gene ID	The official gene ID is reported
Organism	Organism in which the motif has been assessed
Motif	Sequence of the motif
Len	Length of the motif
Pubmed	A link to the PubMed ID is provided. Users can view the reference that experimentally supports the binding by clicking on it.
Experiment	Type of experiment used to assess the motif
Domain	Domain present in the RBP
Offset	Distance measured in nucleotides from the beginning of the sequence
Go Terms	All the GO terms associated with the RBP are provided
LOGO	A graphical representation of the sequence profile
Quality score	A numerical representation of affinity between RBP and binding sites

The GO terms associated with an RNA binding protein can be inspected by clicking the corresponding button in the Go Terms column. A popup window will appear listing all the associated GO terms in table format.

Scan sequence

Scan a sequence or a set of sequences

Users can upload a TXT file containing RNA/DNA sequence(s) in FASTA or multi-FASTA format and scan the sequence(s) for the presence of motifs.

Example: FASTA format

">Header
AGTGAATTATTTGAACCAGATCGCATTACAGTGATGTTCCTTAATTGTGATGTGTATCGAAGTGTGAGTAGATGTTAGAATG..."

Example: multi-FASTA format

">Header_1
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGT...
>Header_2
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTATGATT..."

Users can restrict searches by selecting a specific organism and/or by restricting the search space to motifs of a certain length.
The Burrows-Wheeler transform (BWT) is used to speed up the search.
The BWT makes it possible to:

count the occurrences of a pattern in one or more strings
locate the offset of a motif in one or more strings

in a very efficient manner.
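The two operations above are commonly implemented with backward search over a BWT-based index. The following is a minimal, unoptimized sketch of that general technique for a single string; it is not the ATtRACT implementation, and the function names, the naive suffix sort and the full occurrence table are simplifying assumptions that would not scale to large inputs.

```python
def build_index(text):
    """Build a toy BWT index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; sorts before every nucleotide letter
    # Naive suffix array construction (fine for short sequences only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, first, occ


def locate(index, pattern):
    """Return the sorted start offsets of pattern, using backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def count(index, pattern):
    """Number of occurrences of pattern (overlapping hits included)."""
    return len(locate(index, pattern))


index = build_index("ACGTACGTAC")
print(count(index, "ACG"), locate(index, "ACG"))
```

The index is built once per input sequence; after that, each motif is counted in a number of steps proportional to the motif length rather than to the sequence length, which is what makes scanning against many motifs fast.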

The total number of nucleotides in the input is limited to 20000.
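Before uploading, the FASTA or multi-FASTA input can be checked locally against the size limit. The sketch below is one possible pre-check, not the server's parser; `parse_fasta` and `check_size` are hypothetical helpers, and the short sequences in the usage example are placeholders.

```python
MAX_NUCLEOTIDES = 20000  # limit stated above


def parse_fasta(text):
    """Parse FASTA or multi-FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_size(records, limit=MAX_NUCLEOTIDES):
    """Return the total number of nucleotides, or raise if it exceeds the limit."""
    total = sum(len(sequence) for _, sequence in records)
    if total > limit:
        raise ValueError(f"input has {total} nucleotides (limit {limit})")
    return total


records = parse_fasta(">Header_1\nTAATG\nTTTG\n>Header_2\nCACAA\n")
print(records, check_size(records))
```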

Results are provided in table format and graphical format.

Table description

As in the Search Results section, users can choose how many items to display by selecting the corresponding number from the drop-down menu. Users can filter the entries of the table through a full-text search using the search input box, copy the results to the clipboard or print them. The table can be sorted by clicking on a column header.

Descriptions of the headers of the table follow:

Header	Description
Gene name	The official gene name is reported
Gene ID	The official gene ID is reported
Organism	Organism in which the motif has been assessed
Motif	Consensus sequence of the motif
Len	Length of the motif
Pubmed	A link to the PubMed ID is provided. Users can view the reference that experimentally supports the binding by clicking on it.
Experiment	Type of experiment used to assess the motif
Domain	Domain present in the RNA binding protein
Offset	Distance in nucleotides from the beginning of the sequence
Go Terms	All the GO terms associated with the RNA binding protein are provided
Exon250	The log-odds ratio of this specific motif belonging to an exon plus 250 nucleotides upstream and 250 downstream (for further details see the Scoring Function section)
CDS	The log-odds ratio of this specific motif belonging to a coding sequence (for further details see the Scoring Function section)
Intron	The log-odds ratio of this specific motif belonging to an intron sequence (for further details see the Scoring Function section)

Users can download:

A file containing all the analyzed sequences. It can be retrieved by clicking the drop-down Download menu in the green stripe at the top of the page and choosing the preferred format (csv or tsv text format). For each analyzed sequence the file contains:
- The FASTA header
- The nucleotide sequence
- The results in tabular format. For more information about the table headers refer to this link

A file containing the analysis of a specific sequence. It can be retrieved by clicking the drop-down Download menu in the blue stripe and choosing the preferred format (csv or tsv text format). As before, the file contains:
- The FASTA header
- The nucleotide sequence
- The results in tabular format. For more information about the table headers refer to this link

Graph Description

A graphical view of the results is also provided. The purpose of the graph is to highlight peaks, i.e. positions where motifs are concentrated.

The x axis represents the position along the sequence; each bin represents one nucleotide. The y axis represents the number of motifs starting at that position. Clicking a point in the figure opens a popup window showing a table with the following fields:

Header	Description
Gene name	The official gene name is reported
Organism	The organism where the motif has been assessed.
Motif	The sequence of the motif starting in this point

Clicking outside the point closes the popup window.
The mouse wheel can be used to zoom in and out of the figure.
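The quantity plotted on the y axis, the number of motifs starting at each nucleotide, can be reproduced with a simple count. This is a sketch under the assumption that motifs are matched as exact substrings; `motif_start_counts` is a hypothetical helper and the sequence and motifs below are toy values, not ATtRACT data.

```python
def motif_start_counts(sequence, motifs):
    """Return a list where item i is the number of motif hits starting at position i."""
    counts = [0] * len(sequence)
    for motif in motifs:
        if not motif:
            continue  # an empty motif would match everywhere
        start = sequence.find(motif)
        while start != -1:
            counts[start] += 1
            # Resume one position later so overlapping hits are counted too.
            start = sequence.find(motif, start + 1)
    return counts


# Toy input for illustration only.
counts = motif_start_counts("ACGACGA", ["ACG", "CGA"])
print(counts)
```

Peaks in this list correspond to the peaks in the graph: positions where several motifs begin at the same nucleotide.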

Scoring Function

Let M = [m₁, m₂, m₃, ..., m_n] be the set of all the motifs in the database.
Let S¹ = [s¹₁, s¹₂, s¹₃, ..., s¹_n], where s¹₁, s¹₂, s¹₃, ..., s¹_n are the sequences in the considered genome representing an exon plus 250 nucleotides upstream and downstream.
Let S² = [s²₁, s²₂, s²₃, ..., s²_n], where s²₁, s²₂, s²₃, ..., s²_n are the sequences in the considered genome representing a coding sequence.
Let S³ = [s³₁, s³₂, s³₃, ..., s³_n], where s³₁, s³₂, s³₃, ..., s³_n are the sequences in the considered genome representing an intron.
Let C^S1 = [c_m₁, c_m₂, c_m₃, ..., c_{m_n}], where c_m₁, c_m₂, c_m₃, ..., c_{m_n} are the occurrences of motifs m₁, m₂, m₃, ..., m_n in S¹.
Let C^S2 = [c_m₁, c_m₂, c_m₃, ..., c_{m_n}], where c_m₁, c_m₂, c_m₃, ..., c_{m_n} are the occurrences of motifs m₁, m₂, m₃, ..., m_n in S².
Let C^S3 = [c_m₁, c_m₂, c_m₃, ..., c_{m_n}], where c_m₁, c_m₂, c_m₃, ..., c_{m_n} are the occurrences of motifs m₁, m₂, m₃, ..., m_n in S³.

Let s be an input sequence of length l_s and m_x ∈ M a motif of length l_m found with multiplicity t in the input sequence. The log-odds ratio is computed as:

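[Equation image not recovered: log-odds score for sequences belonging to set S¹]

[Equation image not recovered: log-odds score for sequences belonging to set S²]

[Equation image not recovered: log-odds score for sequences belonging to set S³]

where Obs is defined as:

[Equation image not recovered: definition of Obs]

and Exp is defined, for each set, as:

[Equation image not recovered: Exp for S¹], where [symbol not recovered] is the length of the i-th sequence in S¹

[Equation image not recovered: Exp for S²], where [symbol not recovered] is the length of the i-th sequence in S²

[Equation image not recovered: Exp for S³], where [symbol not recovered] is the length of the i-th sequence in S³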

The scoring function is available for Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Xenopus tropicalis.

De Novo motif discovery

With De Novo motif discovery, users can discover patterns that occur repeatedly in a set of sequences.
To assess whether the newly discovered motifs resemble experimentally validated motifs, they are compared with the ATtRACT database or a subset of it. For this purpose a new pipeline was developed, shown in the figure:

Two different tools are integrated with ATtRACT:

Tools	Description
MEME	MEME analyzes the input sequences for similarities among them and outputs as many motifs as requested. It uses an extension of the expectation maximization (EM) algorithm to fit a statistical model that automatically finds relationships among possibly related, unaligned sequences.
Tomtom	Tomtom analyzes the MEME output to assess whether a newly discovered motif resembles any motif in the ATtRACT database.

Description of the input fields

The input fields required for the De Novo motif analysis are described below:

Input Field	Description	Mandatory
Upload a multi-FASTA file	The multi-FASTA file of putatively related sequences	Yes, if the Upload MEME txt Output field is empty
Upload MEME txt Output	Upload the output of your own MEME analysis and compare it with Tomtom	Yes, if the Upload a multi-FASTA file field is empty
Model	Three different types of model are available. One motif per sequence: each sequence in the dataset contains exactly one occurrence of the motif. Zero or one per sequence: each sequence may contain at most one occurrence of each motif. Any number of repetitions: each sequence may contain any number of non-overlapping occurrences of the motif.	Yes (ignored if the Upload MEME txt Output field is used)
Maximum number of motifs	MEME will stop when the selected number of distinct motifs in the training set is reached or when none can be found with E-value < 10 (default)	No (ignored if the Upload MEME txt Output field is used)
MINimum length of motif	Lower bound for the motif length [default = 4]	No (ignored if the Upload MEME txt Output field is used)
MAXimum length of motif	Upper bound for the motif length [default = 14]	No (ignored if the Upload MEME txt Output field is used)
Evalue	MEME and Tomtom stop if the motif E-value is greater than this threshold [default = 10]	No
Upload your own MEME db	Allows uploading a subset of motifs extracted from the ATtRACT database. For further information refer to this link	No

Generation of database of known motifs

The Expect value (E-value) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size, so it depends strongly on the size of the database. It is therefore preferable to compare the De Novo motifs with a database containing only motifs belonging to the same or to related species. For this reason, ATtRACT lets users build their own database of known motifs.
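The dependence on database size can be made concrete with a small calculation. Tomtom reports its E-value as the match p-value multiplied by the number of target motifs, so the same p-value yields a smaller E-value against a smaller, more focused database. The function name and all numbers below are hypothetical and chosen only to illustrate the effect.

```python
def tomtom_style_evalue(p_value, n_targets):
    """E-value as p-value times the number of target motifs in the database."""
    return p_value * n_targets


p_value = 0.001      # hypothetical match p-value
full_db = 1500       # hypothetical size of a full motif database
species_db = 100     # hypothetical size of a species-restricted subset

print(tomtom_style_evalue(p_value, full_db))     # about 1.5 expected chance hits
print(tomtom_style_evalue(p_value, species_db))  # about 0.1 expected chance hits
```

With the same match quality, the hit against the restricted database is easier to interpret as significant, which is the motivation for building a custom database of known motifs.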

Input Field	Description
MINimum length of motif	Lower bound for a motif length
MAXimum length of motif	Upper bound for a motif length
Experiment	Select type of experiments
Organism	Select type of organisms
Domain	Select specific domains

To be effective, the generated database must contain at least 50 position weight matrices.
If this minimum is not reached, users can make the selection less restrictive, e.g. by selecting additional species or by widening the range between the MINimum length of the motif and the MAXimum length of the motif.

Results

Two different types of output are shown. At the top of the page are the De Novo motifs discovered by MEME, ordered by their E-value. MEME uses an objective function on motifs to select the "best" motif. The objective function is based on the statistical significance of the log-likelihood ratio (LLR) of the occurrences of the motif. The E-value assigned by MEME is an estimate of the number of motifs (with the same width and number of occurrences) that would have an equal or higher log-likelihood ratio if the training set sequences had been generated randomly.

Users can download the results from the drop-down menu button. The following file formats are available:

TSV format
CSV format

Next, the significant matches discovered by Tomtom are listed. The output is organized as follows:

Field	Description
Motif [num]	Motif identified by MEME
Summary	Information about the best alignment between the De Novo motif and one of the entries of the ATtRACT database (fields described below)
Alignment	An alignment figure between the motif present in the ATtRACT database (top) and the De Novo motif (bottom)

Description of the summary fields:

Field	Description
Gene name	The official gene name is reported
Gene id	The official gene ID is reported
Organism	Type of organism
Reported sequence	Reported sequence as annotated in the experiments
Experiments	Type of experiments
Family	Family of the RNA binding protein
Sequence length	Length of the motif
Offset	Offset between the two motifs in the alignment
P-value	The minimal p-value over all possible offsets
Tomtom E-value	The expected number of times that the given query would match a target as well as or better than the observed match in a randomized target database of the given size
Q-value	The minimal false discovery rate at which the observed similarity would be deemed significant
Overlap	Number of nucleotides that overlap

The total number of nucleotides in the input for MEME analyses is limited to 3000.

Download

Users can download one or more specific fields from the database. The available fields are:

Gene name
Organism
Sequence
Length of the sequence
Experiments
Pubmed ID
Domain
Go Terms

The Gene ID field is always present. For a description of the fields refer to this link

Statistics

The following statistics are produced:

Motif length distribution
Organism related length distribution
Organism distribution
Experiment distribution
Domain Distribution

More insight here