ASTRAL
The ASTRAL Compendium for Sequence and Structure Analysis
Authors: The ASTRAL database was created by John-Marc Chandonia, Degui Zhi, Gary Hon, Loredana Lo Conte, Nigel Walker, Patrice Koehl, Michael Levitt, and Steven E. Brenner.
References:
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL compendium in 2004. Nucleic Acids Research 32:D189-D192 (2004).
- Chandonia JM, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. ASTRAL compendium enhancements. Nucleic Acids Research 30:260-263 (2002).
- Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research 28:254-256 (2000).
The ASTRAL compendium provides databases and tools useful for analyzing protein structures and their sequences. It is partially derived from, and augments, the SCOP: Structural Classification of Proteins database. Most of the resources provided here depend upon the coordinate files maintained and distributed by the Protein Data Bank.
Most commonly requested files (from current version):
- ASTRAL SCOP 1.73 genetic domain sequence subsets, based on PDB SEQRES records, with less than 40% identity to each other: download sequences (2.6 MB)
- ASTRAL SCOP 1.73 genetic domain sequence subsets, based on PDB SEQRES records, with less than 95% identity to each other: download sequences (4.1 MB)
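The downloaded sequence subsets are plain FASTA-style text, so they can be read with a few lines of code. The following is a minimal sketch, not an official ASTRAL tool. It assumes each header line has the form `>sid sccs rest-of-description`, where `sid` is a SCOP domain identifier and `sccs` is a dotted SCOP classification string such as `a.1.1.1`; check this against the file you actually download. The names `DomainRecord`, `parse_astral_fasta`, and `count_by_class` are hypothetical, and the sample header is invented for illustration.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class DomainRecord(NamedTuple):
    """One sequence entry; field names are illustrative, not an ASTRAL API."""
    sid: str          # domain identifier (first token of the header)
    sccs: str         # dotted classification string (second token, assumed)
    description: str  # remainder of the header line
    sequence: str     # residues joined across wrapped lines


def _make_record(header: str, chunks: List[str]) -> DomainRecord:
    # Split the header into at most three whitespace-separated parts.
    parts = header.split(None, 2)
    sid = parts[0] if parts else ""
    sccs = parts[1] if len(parts) > 1 else ""
    description = parts[2] if len(parts) > 2 else ""
    return DomainRecord(sid, sccs, description, "".join(chunks))


def parse_astral_fasta(lines: Iterable[str]) -> Iterator[DomainRecord]:
    """Yield one DomainRecord per '>' header found in the input lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def count_by_class(records: Iterable[DomainRecord]) -> Dict[str, int]:
    """Count records by the first component of the classification string."""
    counts: Dict[str, int] = {}
    for rec in records:
        cls = rec.sccs.split(".")[0] if rec.sccs else "?"
        counts[cls] = counts.get(cls, 0) + 1
    return counts


if __name__ == "__main__":
    # Invented two-entry sample; real files are opened with open(path).
    sample = [
        ">d0xyza_ a.1.1.1 (A:) Example protein {Example organism}",
        "mkvlaagtw",
        "ghklmn",
        ">d0xyzb_ b.1.1.1 (B:) Another example {Example organism}",
        "acdefg",
    ]
    print(count_by_class(parse_astral_fasta(sample)))
```

To use it on a real download, pass an open file handle instead of the sample list, for example `parse_astral_fasta(open(path))`, where `path` is wherever you saved the file.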
Current Version: 1.73, released December 10, 2007. ASTEROIDS 1.73 fixed version released Mar 17, 2007, with weekly updates.
Bug fix on March 11, 2008: See the 1.73 release notes for details.
Free Software
- PAST (PDB Archival Snapshot Toolkit): Perl tools to efficiently store and archive multiple datestamped snapshots of the PDB. Version 1.3, released 8 October 2007, is needed to interoperate with new PDB standards that took effect in July with the release of their remediated files.
- MakeRAF: Java software to create RAF maps from PDB files is included in the StrBio Java class libraries. As of 1.73, this software has been deprecated and replaced by xml2raf, which we will release under an open source license.
Older Versions:
Documentation:
- The primary sources of ASTRAL documentation are the three references listed above. Figure 1 from the 2004 NAR paper gives a brief overview of how ASTRAL is currently created:
Data flow in ASTRAL:
Primary data sources are shown in green.
Primary ASTRAL databases are shown in light yellow.
Less commonly used resources are shown in darker yellow.
Resources added more recently are outlined in light blue/grey.
Using the RAF maps, four complete sequence sets are created for every domain in the first seven classes of the SCOP database.
Two sets (the genetic domain sets) include the genetic domain sequences described in the 2002 NAR paper, and the other two (the original-style sequence sets) use the prior method of splitting each multi-chain domain into multiple sequences.
For each of these methodologies, one complete sequence set is derived from sequences in the PDB ATOM records, and another from sequences in the SEQRES records.
The SEQRES sets (for both genetic domain and original-style methods) are used to derive representative subsets.
Each set is fully compared against itself using BLAST, and subsets are created using three similarity criteria and various thresholds.
Representatives are chosen according to AEROSPACI scores.
PDB chain sequence sets are derived from the SEQRES records of every PDB chain in SCOP; selected subsets are created at sequence identity thresholds of 90-100%.
PDB-style files are derived from the RAF maps and SCOP domain definitions.
At each new release of ASTRAL, all non-redundant sequences from each SCOP family and superfamily are aligned using MAFFT.
A hidden Markov model (HMM) is created from the multiple sequence alignment for each family and superfamily using HMMER.
These HMMs, and BLAST, are used to predict domains in the sequences of newly released PDB entries on a weekly basis.
HMMs from the Pfam-A database are also used to predict domains in regions of the sequences not identified by HMMs derived from ASTRAL.
Unassigned regions of at least 20 consecutive residues are also predicted to be potential domains.
The predicted domains (ASTEROIDS) are available in a single file. Optionally, they are also available integrated into representative subsets selected according to two similarity criteria (BLAST E-value and % identity) at various thresholds.
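The rule that unassigned regions of at least 20 consecutive residues become potential domains can be illustrated with a short sketch. This is not the ASTEROIDS implementation. It assumes that assigned domains are given as 1-based inclusive `(start, end)` residue ranges on a single sequence, and the function name `unassigned_regions` is hypothetical.

```python
from typing import List, Optional, Tuple

# Minimum run of unassigned residues, taken from the description above.
MIN_UNASSIGNED_LENGTH = 20


def unassigned_regions(
    seq_length: int,
    assigned: List[Tuple[int, int]],
    min_length: int = MIN_UNASSIGNED_LENGTH,
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive ranges of uncovered runs >= min_length."""
    # Mark every residue covered by any assigned range.
    covered = [False] * seq_length
    for start, end in assigned:
        for pos in range(max(start, 1), min(end, seq_length) + 1):
            covered[pos - 1] = True

    regions: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    for i, is_covered in enumerate(covered, start=1):
        if not is_covered and run_start is None:
            run_start = i  # an uncovered run begins here
        elif is_covered and run_start is not None:
            if i - run_start >= min_length:
                regions.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and seq_length - run_start + 1 >= min_length:
        regions.append((run_start, seq_length))
    return regions


if __name__ == "__main__":
    # Residues 1-20 and 81-100 are uncovered and long enough;
    # residues 61-70 are uncovered but too short.
    print(unassigned_regions(100, [(21, 60), (71, 80)]))
```

Overlapping predictions are handled by the coverage array, so ranges from the ASTRAL-derived HMMs, BLAST, and Pfam-A could be passed in together in any order.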
- Documentation describing more recent updates (since the 2004 paper) can be found in the 1.65 release notes, 1.67 release notes, 1.69 release notes, 1.71 release notes and 1.73 release notes.
- For help, email the Brenner Computational Genomics Research Group at astral@compbio.berkeley.edu.
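The representative-subset step in the data-flow description (subsets in which sequences have less than a given percent identity to each other, with representatives chosen by AEROSPACI score) can be sketched as a greedy pass. This illustrates the general technique and is not ASTRAL's actual code. It assumes that a higher score marks a preferred representative and that pairwise percent identities have already been computed; the function name `select_representatives` and the input layout are hypothetical.

```python
from typing import Dict, FrozenSet, List


def select_representatives(
    scores: Dict[str, float],
    identities: Dict[FrozenSet[str], float],
    threshold: float,
) -> List[str]:
    """Greedily pick sequences so every chosen pair is below threshold.

    scores:     sequence id -> quality score (higher assumed better)
    identities: frozenset({id1, id2}) -> percent identity; missing pairs
                are treated as 0 (no detected similarity)
    threshold:  maximum allowed percent identity, exclusive
    """
    chosen: List[str] = []
    # Visit best-scoring sequences first; break ties by id for determinism.
    for sid in sorted(scores, key=lambda s: (-scores[s], s)):
        is_distinct = all(
            identities.get(frozenset((sid, rep)), 0.0) < threshold
            for rep in chosen
        )
        if is_distinct:
            chosen.append(sid)
    return chosen


if __name__ == "__main__":
    # Invented scores and identities for three hypothetical domains.
    scores = {"dA": 0.9, "dB": 0.8, "dC": 0.5}
    identities = {
        frozenset(("dA", "dB")): 97.0,
        frozenset(("dA", "dC")): 30.0,
        frozenset(("dB", "dC")): 45.0,
    }
    print(select_representatives(scores, identities, threshold=95.0))
    print(select_representatives(scores, identities, threshold=25.0))
```

A greedy pass of this kind always keeps the best-scoring sequence of each similar group, but it does not guarantee the largest possible subset.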