RAF Sequence Maps
The Protein Data Bank now provides XML files
that include a mapping between the PDB-format
records SEQRES (representing the sequence of the molecule used in
an experiment) and ATOM (representing the atoms experimentally
observed). These XML files (along with the chemical dictionary) also
provide information on the original identity of most residues prior
to any post-translational modifications.
Starting with ASTRAL 1.73, ASTRAL RAF sequence maps are generated
from the XML files.
The RAF maps summarize
the SEQRES-to-ATOM relationship in a form that
can be rapidly parsed in most computer languages. Errors
in the mappings are corrected manually, with human interpretation of
the original PDB file serving as the final arbiter in case of
difficulties or discrepancies in machine translation.
A summary of these edits to RAF lists affected PDB chains, along
with residue identifiers, 3-letter codes, and the translated
To download the
RAF sequence maps used to generate ASTRAL 1.75,
(Warning: 147 MB file)
Description of Format
The RAF file contains one line per PDB chain. Each line contains
two parts: the header and the body.
Here is an example of the current header format:
101m_ 0.02 38 010301 111011    0  153 
^1    ^2   ^3 ^4     ^5     ^6   ^7   ^8
1. PDB+chain ID. A '_' for the chain ID indicates a blank chain ID.
The chain ID is case sensitive. Most chains currently in the PDB have an
upper case chain ID.
2. Version number of the RAF format, currently 0.02. See below for a
description of changes in the format from 0.01 to 0.02.
3. Header length (i.e. the body starts in position 39, counting
from 1). The header length will always be constant for every
entry from a given version of RAF; however, this length
may change in future versions.
4. PDB datestamp (last modification time of PDB file).
5. Set of 1-bit flags (if set: 1->mapped, 2->active, 3->checked,
4->manually edited, 5->ok, 6->one-to-one mapping).
NOTE: one-to-one mapping only means that there is a one-to-one
mapping between SEQRES and ATOM sequences. It does not mean the
sequences are the same, only that all the residues are seen.
6. First non-blank residue identifier (PDB format, 4 characters plus 1 for the insertion code).
7. Last non-blank residue identifier (PDB format, 4 characters plus 1 for the insertion code).
8. Body starts here.
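Because the header is fixed-width, it can be read with plain string slices. The following Python sketch is illustrative only and is not an official ASTRAL parser: the function name parse_raf_header, the dictionary keys, and the flag names are hypothetical, and the column offsets are inferred from the example header for format 0.02 (header length 38). The flags are assumed to be listed left to right in the order 1 to 6 given above, and the two body fields appended to the example line are made up for demonstration.

```python
# Flag names are illustrative labels for the six 1-bit flags, in order.
FLAG_NAMES = ["mapped", "active", "checked", "manually_edited", "ok", "one_to_one"]

def parse_raf_header(line):
    """Parse the fixed-width header of one RAF 0.02 line (offsets assumed)."""
    header_len = int(line[11:13])            # field 3: header length, e.g. 38
    header = line[:header_len].ljust(header_len)
    flags = header[21:27]                    # field 5: six 1-bit flags
    return {
        "pdb_id": header[0:4],
        "chain_id": header[4:5],             # '_' means a blank chain ID
        "version": header[6:10],
        "header_len": header_len,
        "datestamp": header[14:20],
        "flags": {name: bit == "1" for name, bit in zip(FLAG_NAMES, flags)},
        "first_resid": header[28:33].strip(),  # 4 characters + insertion code
        "last_resid": header[33:38].strip(),
        "body": line[header_len:],
    }

# Header from the example above, built piece by piece to make the
# assumed column widths explicit; the body fields are hypothetical.
example = "101m_ 0.02 38 010301 111011 " + "   0 " + " 153 " + "   0 mm" + "   1 vv"
info = parse_raf_header(example)
print(info["pdb_id"], info["chain_id"], info["first_resid"], info["last_resid"])
print(info["flags"])
```

Reading the header length from the line itself, rather than hard-coding 38, follows the note that this length may change in future versions.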
The body contains one field per residue in the protein. Each field is of
fixed length, 7 characters. Here is an example containing 6 residues:
B .a 1 rr M .i 3Acc 5 de 6A t.
---- residue identifier (B|M|E if missing, 4 ch)
_ insertion code (' ' if missing)
- aa one-letter code from ATOM ('.' if missing)
- aa one-letter code from SEQRES ('.' if missing)
The meaning of the characters in each field is as follows:
- First 4 characters - residue identifier, or B|M|E if missing.
These are normally derived from ATOM records, except in the
case of bibliographic entries.
Warning: these identifiers do not always monotonically increase.
B|M|E: if a residue has no corresponding ATOM record, instead of
simply having a blank field for the corresponding resid,
we use B(egin), M(iddle), E(nd) for a missing residue at
the beginning, in the middle, and at the end of a chain
respectively. This should be useful to limit searches when
scanning for a particular residue in either direction.
For bibliographic entries, no ATOM records are present,
so consecutive residue identifiers beginning with '1' are assigned.
- 5th character - insertion code, or ' ' if missing.
- 6th character - amino acid one-letter code from ATOM records, or '.' if missing.
- 7th character - amino acid one-letter code from SEQRES records, or '.' if missing.
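The per-character layout described above lends itself to splitting the body into 7-character fields. Here is a minimal Python sketch under stated assumptions: the names Residue and parse_raf_body are hypothetical, and the sample string re-spaces the six example residues into 7-character columns with right-justified identifiers. That spacing, and the absence of an insertion code on the last residue, are assumptions.

```python
from typing import NamedTuple

class Residue(NamedTuple):
    resid: str    # residue number, or "B"/"M"/"E" when no ATOM record exists
    icode: str    # insertion code, "" if none
    atom: str     # one-letter code from ATOM, "." if missing
    seqres: str   # one-letter code from SEQRES, "." if missing

def parse_raf_body(body, width=7):
    """Split a RAF body string into fixed-width residue fields."""
    if len(body) % width != 0:
        raise ValueError("body length is not a multiple of the field width")
    residues = []
    for i in range(0, len(body), width):
        field = body[i:i + width]
        residues.append(
            Residue(field[0:4].strip(), field[4].strip(), field[5], field[6])
        )
    return residues

# Six residues in 7-character columns (the exact spacing is an assumption).
body = "   B .a" "   1 rr" "   M .i" "   3Acc" "   5 de" "   6 t."
for res in parse_raf_body(body):
    print(res)
```

When reading from a file, strip only the trailing newline before slicing, since spaces inside the fields are significant.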
In the above example, the protein sequence from the SEQRES records is
ALA ARG ILE CYS GLU, and the protein sequence listed in the ATOM records
is ARG 1, CYS 3, ASP 5, THR 6. The CYS has insertion code 'A'.
There are no ATOM records corresponding
to the SEQRES records for the ALA or the ILE, so the residue identifier
is replaced by a B (in the case of ALA at the beginning of the chain) or
an M (in the case of ILE, which comes after the first identified residue).
ASP 5 is mysteriously mutated to a GLU in the SEQRES records, and THR 6
is missing from the SEQRES records.
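The reading of the example given above can be reproduced mechanically. The sketch below is a hedged illustration, not part of ASTRAL: the function name summarize_body and its return layout are hypothetical, and the sample body uses an assumed 7-character spacing of the six residues. It collects the SEQRES sequence, the observed ATOM residues, the residues with no ATOM record, and the positions where the two records disagree.

```python
def summarize_body(body, width=7):
    """Return (seqres_seq, atom_residues, unobserved, mismatches) for a RAF body."""
    seqres, atom, unobserved, mismatches = [], [], [], []
    for i in range(0, len(body), width):
        field = body[i:i + width]
        # Identifier plus insertion code, e.g. "3A"; "B", "M" or "E" if missing.
        resid = field[0:5].strip()
        atom_aa, seqres_aa = field[5], field[6]
        if seqres_aa != ".":
            seqres.append(seqres_aa)
        if atom_aa != ".":
            atom.append((resid, atom_aa))
        else:
            unobserved.append((resid, seqres_aa))
        if "." not in (atom_aa, seqres_aa) and atom_aa != seqres_aa:
            mismatches.append((resid, atom_aa, seqres_aa))
    return "".join(seqres), atom, unobserved, mismatches

# Six residues in 7-character columns (the exact spacing is an assumption).
body = "   B .a" "   1 rr" "   M .i" "   3Acc" "   5 de" "   6 t."
seqres_seq, atom_residues, unobserved, mismatches = summarize_body(body)
print(seqres_seq)      # arice
print(atom_residues)   # [('1', 'r'), ('3A', 'c'), ('5', 'd'), ('6', 't')]
print(unobserved)      # [('B', 'a'), ('M', 'i')]
print(mismatches)      # [('5', 'd', 'e')]
```

The one-letter codes are kept in lower case here, matching the example fields.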
Changes to the format from 0.01 to 0.02
- The PDB datestamp field now reflects the modification time of the PDB file, rather than the date the file was obtained from the PDB.
- Some bibliographic PDB entries (entries with sequences but not coordinates)
have been added to the RAF. The PDB code for these domains begins
with '0'. Because residue identifiers of
residues in the RAF file are normally based on the ATOM records,
and the bibliographic PDB entries have no ATOM records, each residue
in the RAF entries for bibliographic chains has been numbered starting