Primary Sequences and the PDB Format

Each PDB-formatted file includes SEQRES records, which list the primary sequence of each polymeric molecule present in the entry. This sequence information is also available as a FASTA download. The listing gives the sequence of each chain of linear, covalently linked standard or modified amino acids or nucleotides. It may also include other residues that are linked to the standard backbone of the polymer. Chemical components or groups covalently linked to side chains (in peptides) or to sugars and/or bases (in nucleic acid polymers) are not listed here.

Here is an example from PDB entry 2dgc, which includes a protein chain and a DNA chain:

SEQRES   1 B   19   DT  DG  DG  DA  DG  DA  DT  DG  DA  DC  DG  DT  DC         
SEQRES   2 B   19   DA  DT  DC  DT  DC  DC                                     
SEQRES   1 A   63  MET ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS         
SEQRES   2 A   63  ARG ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA         
SEQRES   3 A   63  ARG LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL         
SEQRES   4 A   63  GLU GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU         
SEQRES   5 A   63  VAL ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
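The records above can be read programmatically. The following is a minimal sketch, assuming the fixed-column layout of the PDB format (chain identifier in column 12, residue names in columns 20-70); the function name `parse_seqres` is illustrative and not part of any library. For brevity, only the DNA chain from the example is used as input.

```python
from collections import defaultdict


def parse_seqres(lines):
    """Collect residue names per chain from SEQRES records.

    Assumes the fixed-column PDB layout: chain ID in column 12,
    residue names in columns 20-70.
    """
    chains = defaultdict(list)
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]                      # column 12
        chains[chain_id].extend(line[19:70].split())  # columns 20-70
    return dict(chains)


# The two SEQRES lines for chain B of entry 2dgc, as shown above.
example = [
    "SEQRES   1 B   19   DT  DG  DG  DA  DG  DA  DT  DG  DA  DC  DG  DT  DC",
    "SEQRES   2 B   19   DA  DT  DC  DT  DC  DC",
]

seqs = parse_seqres(example)
print(len(seqs["B"]))  # prints 19, matching the residue count declared in the records
```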

In many cases, the coordinates presented in the ATOM records of a PDB file do not exactly match the sequence in the SEQRES records. The ends of chains and mobile loops are often not observed in crystallographic experiments, so no coordinates for them are included as ATOM records in the file. However, these amino acids will often be included in the SEQRES records, since that portion of the chain was present during the experiment. In these cases, a "REMARK 465" entry is included in the header of the PDB file to identify each missing residue. For all PDB entries, the file https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz notes regions of the molecule that have not been observed (i.e., residues that exist in the originally studied molecule, as shown in the SEQRES records, but not in the observed structure/coordinates).
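Missing residues can be collected from the REMARK 465 lines with a few lines of code. This is only a sketch: it assumes that each data row ends with a residue name, a one-character chain identifier, and a residue number (optionally followed by an insertion code), separated by whitespace, and that all other REMARK 465 lines are explanatory text. The function name `parse_remark_465` is hypothetical.

```python
import re

# Residue number, possibly negative, with an optional insertion code letter.
_SEQNUM = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def parse_remark_465(lines):
    """Return (chain_id, res_name, seq_num, insertion_code) for each missing residue.

    Assumption: data rows end with "RES C SSSEQ[I]" separated by whitespace;
    rows that do not fit this pattern are treated as header text and skipped.
    """
    missing = []
    for line in lines:
        if not line.startswith("REMARK 465"):
            continue
        tokens = line[10:].split()
        if len(tokens) < 3:
            continue
        res_name, chain_id, seq_field = tokens[-3:]
        match = _SEQNUM.match(seq_field)
        if match is None or len(chain_id) != 1:
            continue  # explanatory or column-header line
        missing.append((chain_id, res_name, int(match.group(1)), match.group(2)))
    return missing
```

Whitespace splitting keeps the sketch short, but it would fail for entries with a blank chain identifier; a column-based parser would be more robust for those.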

You may also notice some differences from sequences in other databases. For example, a researcher may change or mutate particular residues to see the effect this has on the overall structure, or on a particular portion of it. The DBREF record provides cross-reference links between PDB sequences (as they appear in the SEQRES records) and the corresponding database sequence. The SEQADV record identifies differences between the sequence in the SEQRES records of the PDB entry and the sequence database entry given in DBREF.

Also, structural biologists often work with fragments of macromolecules, because they are more amenable to study than the full macromolecule. Thus, the SEQRES and ATOM records may include only a portion of the molecule, not the whole protein. Residue numbering adds a further complication. In some cases, researchers number the ATOM records according to the numbering of the whole protein, while in other cases they number the chain according to the fragment. Residue numbers may be negative, zero, or positive.

Amino Acid and Nucleotide Nomenclature

In the SEQRES records, the standard 3-character code is used for standard amino acids, and standard nucleotides are specified by 1 or 2 characters:

Standard (L-) Amino Acids

ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR, PYL (pyrrolysine)*, SEC (selenocysteine)*

D-Amino Acids (present in the PDB Archive)

DAL ('ALA'), DSN ('SER'), DCY ('CYS'), DPR ('PRO'), DVA ('VAL'), DTH ('THR'), DLE ('LEU'), DIL ('ILE'), DSG ('ASN'), DAS ('ASP'), MED ('MET'), DGN ('GLN'), DGL ('GLU'), DLY ('LYS'), DHI ('HIS'), DPN ('PHE'), DAR ('ARG'), DTY ('TYR'), DTR ('TRP')

Deoxyribonucleotides

DA, DC, DG, DT, DI

Ribonucleotides

A, C, G, U, I

* SEC and PYL are considered standard amino acids, as announced by the wwPDB.

Other codes are used for modified amino acids (such as MSE for selenomethionine) and for modified nucleotides (such as CBR for bromocytosine).
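As an illustration of how these codes relate to a one-letter sequence, the sketch below converts a list of SEQRES residue names into a string using the conventional one-letter amino acid and nucleotide symbols. The table `THREE_TO_ONE` and the function `to_one_letter` are names made up for this example. Any code outside the standard lists above (including modified residues such as MSE and the D-amino acids) is written as "X" here, which is a simplification and may differ from how the FASTA download represents such residues.

```python
# Conventional one-letter symbols for the standard residue codes listed above.
THREE_TO_ONE = {
    "ALA": "A", "CYS": "C", "ASP": "D", "GLU": "E", "PHE": "F",
    "GLY": "G", "HIS": "H", "ILE": "I", "LYS": "K", "LEU": "L",
    "MET": "M", "ASN": "N", "PRO": "P", "GLN": "Q", "ARG": "R",
    "SER": "S", "THR": "T", "VAL": "V", "TRP": "W", "TYR": "Y",
    "PYL": "O", "SEC": "U",
    # Deoxyribonucleotides
    "DA": "A", "DC": "C", "DG": "G", "DT": "T", "DI": "I",
    # Ribonucleotides
    "A": "A", "C": "C", "G": "G", "U": "U", "I": "I",
}


def to_one_letter(residues, unknown="X"):
    """Join residue codes into a one-letter string; unlisted codes become `unknown`."""
    return "".join(THREE_TO_ONE.get(name, unknown) for name in residues)


print(to_one_letter(["MET", "ILE", "VAL", "PRO", "GLU"]))  # prints MIVPE
print(to_one_letter(["DT", "DG", "DG", "DA"]))             # prints TGGA
```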

Several additional records are included in the PDB format to define modifications as they appear in the ATOM records:

Record Name   Describes
MODRES        Modifications to standard residues
HET           Nonstandard residues (as well as ligands, ions and water)
HETNAM        Full chemical name of the residue
HETSYN        Synonyms for the residue
FORMUL        Chemical formula of the residue

As an example, here are the records that describe HYP (hydroxyproline, a modified version of PRO, or proline) in the ATOM records for collagen entry 1cag:

MODRES 1CAG HYP A    2  PRO  4-HYDROXYPROLINE
HET    HYP  A   2       8
HETNAM     HYP 4-HYDROXYPROLINE
HETSYN     HYP HYDROXYPROLINE
FORMUL   1  HYP    30(C5 H9 N O3)
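The MODRES line above can be split into its fields by column position. The sketch below assumes the fixed-column layout of the PDB format for MODRES (entry ID in columns 8-11, residue name in 13-15, chain ID in 17, residue number in 19-22, insertion code in 23, standard residue name in 25-27, description from column 30). The function name `parse_modres` and the dictionary keys are chosen for this example only.

```python
def parse_modres(line):
    """Split a MODRES record into fields, assuming the fixed-column PDB layout."""
    return {
        "id_code": line[7:11].strip(),    # columns 8-11
        "res_name": line[12:15].strip(),  # columns 13-15: modified residue
        "chain_id": line[16],             # column 17
        "seq_num": int(line[18:22]),      # columns 19-22
        "i_code": line[22:23].strip(),    # column 23: insertion code, often blank
        "std_res": line[24:27].strip(),   # columns 25-27: parent standard residue
        "comment": line[29:70].strip(),   # columns 30-70: description
    }


record = parse_modres("MODRES 1CAG HYP A    2  PRO  4-HYDROXYPROLINE")
print(record["res_name"], "->", record["std_res"])  # prints HYP -> PRO
```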

Complete file format documentation is available at http://www.wwpdb.org/documentation/file-format.
