The Protein Data Bank (PDB) is a repository of experimentally determined
three-dimensional structures of biological macromolecules (mostly proteins and
nucleic acids). The structures it contains are themselves very useful for
answering biological questions, or for asking even more questions. In this
series of blog posts, I will show with examples how the associated metadata
(mostly structure annotations) can also answer interesting questions.
The PDB already provides statistics on some of its metadata, but
these are very general in scope. The PDB in Europe (PDBe) provides
programmatic access to the database through the PDBe API. By collecting
appropriate metadata from the database, one can get much finer insight, for
example specific to a particular field of structural biology.
In this first post, I will introduce the PDBe API. I will show how to formulate
queries with the PDB metadata and how to retrieve the corresponding data using R.
PDBe API documentation
Directly relevant documentation
The documentation of the PDBe search feature is very brief: it is
simply a list of all searchable keywords and how they relate to structure
annotations. Luckily, it is accompanied by an interactive tool that greatly
helps with building and debugging queries (scroll all the way to the bottom of
the list to find the tool). Constructing a query with this tool and clicking on
the “RunCall” button displays the URL that replays this same query at
will (which we will use from R scripts) and can also display the results in
The PDBe also provides a tutorial on how to use the search API. But
this is demonstrated using Python in Jupyter notebooks, and I am not familiar
with either, so this didn’t help me much. It might help you.
When accessing a properly constructed URL, the API will respond by sending
results matching the query in JSON format.
To provide programmatic access to the database, the PDBe uses software called
Solr. The Solr documentation is also helpful to figure out
how to formulate queries. The PDBe API documentation also points to this page
of the Solr wiki.
There is an R package to use Solr from R, but I feel that building
queries directly in the URL of the API call is easier. I have not tried this R
package.
An example query and its result
Let’s try the simplest way to retrieve information from the PDB: using a unique
PDB accession code, for example 6D6V, a recent structure of
telomerase. Here is a possible URL for that query:
In the above URL, q= indicates that we request information on the database
entry with the unique PDB accession code (pdb_id) 6d6v and fl= indicates
which fields of the results we want. Requesting the pdb_id field will confirm
that the request worked, and requesting title will illustrate the example.
Last, wt=json will make sure the API sends results in JSON format, which is
very easy to import into R using the jsonlite package.
Now, let’s download the result and look at it:
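Here is a minimal sketch of how this query could be assembled and downloaded from R. The base URL is an assumption about the PDBe search endpoint (check it against the PDBe API documentation), and run_online is a hypothetical switch that keeps the script from contacting the API unless asked to. Only the q=, fl= and wt= parameters come from the description above.

```r
# Assumption: base URL of the PDBe search endpoint (not stated in the text).
base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select?"

# Assemble the three parameters described above into one query URL.
query_url <- paste0(
  base_url,
  "q=pdb_id:6d6v",     # query: the entry with accession code 6d6v
  "&fl=pdb_id,title",  # fields to return for each result
  "&wt=json"           # response format
)
print(query_url)

# Hypothetical switch: set to TRUE to actually contact the API.
run_online <- FALSE
if (run_online) {
  result <- jsonlite::fromJSON(query_url)  # parse the JSON into nested lists
  str(result)                              # compact overview of the structure
}
```

With run_online set to TRUE, str(result) should print a structure like the following one.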
## List of 2
## $ responseHeader:List of 3
## ..$ status: int 0
## ..$ QTime : int 1
## ..$ params:List of 3
## .. ..$ q : chr "pdb_id:6d6v"
## .. ..$ fl: chr "pdb_id,title"
## .. ..$ wt: chr "json"
## $ response :List of 3
## ..$ numFound: int 8
## ..$ start : int 0
## ..$ docs :'data.frame': 8 obs. of 2 variables:
## .. ..$ pdb_id: chr [1:8] "6d6v" "6d6v" "6d6v" "6d6v" ...
## .. ..$ title : chr [1:8] "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" ...
The resulting R object is a list with two elements, which are themselves lists:
responseHeader recapitulates the query parameters,
response contains the results.
The response list contains 3 items:
numFound: the number of results,
start: the index of the first result returned (Solr’s pagination offset; 0 here),
docs: a data frame in which each column is a field requested in the query
(fl=) and each row is a macromolecule matching the query.
Let’s print the data frame (I convert it to a tibble because it prints
more compactly):
## # A tibble: 8 x 2
## pdb_id title
## <chr> <chr>
## 1 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 2 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 3 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 4 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 5 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 6 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 7 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
## 8 6d6v CryoEM structure of Tetrahymena telomerase with telomeric DNA at …
We got the correct PDB accession code, and the correct title, but repeated
8 times. This is surprising, because PDB accession
codes are unique. Why didn’t we get only one row in this result table?
Understanding the data format
One difficulty I had was understanding the format of the results returned by the
API. I could not find any formal documentation anywhere, but there are hints in
the tutorial’s Jupyter notebooks:
PDBe Solr instance serves documents based on polymeric entities in PDB entries, i.e. each document indexed by Solr represents polymeric molecules of type protein, sugar, DNA, RNA or DNA/RNA hybrid. This is why for entry 2qk9 we get 3 documents in the response, each representing the protein, RNA and DNA molecule in that entry.
This means that the number of results returned is not the number of PDB
entries matching the query (which is how the search bar on the website works),
and this is a little confusing. Instead, each result is a biological
macromolecule. This makes certain questions easier to answer, when we’re
interested in a particular property of a biological macromolecule. But it also
makes other (simpler) questions less straightforward: to determine the number
of PDB entries matching a query, one has to explicitly count the number of
unique PDB accession codes found in the returned results (because each result
is a macromolecule, the PDB accession code of any structure containing more
than one macromolecule will be duplicated across results).
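As a small illustration of that counting step, here is a sketch that works offline. The docs data frame is a hand-made stand-in for result$response$docs, rebuilt from the 6d6v output shown earlier (eight rows with the same accession code). The variable names are only for illustration.

```r
# Stand-in for result$response$docs, rebuilt from the output shown earlier:
# eight documents (one per macromolecule), all belonging to entry 6d6v.
docs <- data.frame(
  pdb_id = rep("6d6v", 8),
  stringsAsFactors = FALSE
)

n_documents <- nrow(docs)                   # macromolecules matching the query
n_entries   <- length(unique(docs$pdb_id))  # distinct PDB entries among them

cat(n_documents, "documents from", n_entries, "PDB entry\n")
```

The same two lines apply unchanged to a real result with many different accession codes.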
This other paragraph of the tutorial is also relevant:
Fields in PDBe’s entity-based Solr document cover a wide range of properties, such as entry’s experimental details, details of deposition and primary publication, entity’s taxonomy, entry’s quality, entity’s cross references to UniProt and popular domain databases, biological assembly, etc. They are documented here: http://wwwdev.ebi.ac.uk/pdbe/api/doc/search.html
It is also useful now to understand a little more about Solr querying. Solr has a rich and complex query syntax, described at http://wiki.apache.org/solr/CommonQueryParameters and elsewhere.
The fields of immediate relevance to us in this tutorial are:
q - the query itself. There is a lot of flexibility in describing a query, e.g. fields, wildcards, case-insensitivity, logical operators, ranges, etc.
rows - number of results returned by Solr. Needs to be explicitly set in mysolr because it defaults to 10. Useful if only part of results are desired.
fl - fields returned in each document. This is useful to reduce the size of response.
We will only receive the fields (fl=) we explicitly requested, which
significantly reduces the size of the returned JSON data. We should also note
that by default only 10 results will be returned: if we want all of them, we
need to request more rows (simply choose a very large number to get all
possible results; at the time of writing this blog post, there is a total of
146093 PDB entries).
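A sketch of how the rows parameter could be added to a query URL follows. The base URL is an assumption about the PDBe search endpoint, and build_query is a hypothetical helper, not part of any package. Note the use of format(): pasting a large number directly can produce scientific notation (1e+06), which would not be a valid row count in the URL.

```r
# Assumption: base URL of the PDBe search endpoint (not stated in the text).
base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select?"

# Hypothetical helper: build a query URL with an explicit row limit.
build_query <- function(q, fl, rows = 10) {
  paste0(
    base_url,
    "q=", q,
    "&fl=", paste(fl, collapse = ","),
    # format() avoids scientific notation such as "1e+06" for large limits
    "&rows=", format(rows, scientific = FALSE),
    "&wt=json"
  )
}

# Default limit: at most 10 results are returned.
print(build_query("pdb_id:6d6v", c("pdb_id", "title")))

# A very large limit, to retrieve every matching result.
print(build_query("pdb_id:6d6v", c("pdb_id", "title"), rows = 1000000))
```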
A note on crystal structures
There is one important thing to consider in the case of crystal structures. They
are defined by their asymmetric unit: the smallest part of the crystal that can
reconstruct the entire crystal by first applying space group symmetry operators
to reconstruct the unit cell, then applying translations of the unit cell in all
three directions to reconstruct the entire crystal. But the relevant biological
assembly may or may not coincide with the asymmetric unit, as explained
here. It is therefore important to check that the API returns only the
content of the biological assembly, and not of the asymmetric unit, because
structures with more than one copy of a molecule in the asymmetric unit should
not end up as duplicates in the results. Using the interactive query
builder and the haemoglobin examples (PDB accession
codes 2HHB, 1OUT and 1HV4), one can verify that these three queries return
the same number of macromolecules (the two haemoglobin chains, alpha and
beta). We can also do that from R:
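One way this check could be written is sketched below. The base URL is an assumption about the PDBe search endpoint, count_documents is a hypothetical helper, and run_online is a hypothetical switch that prevents the script from contacting the API unless asked to. Only pdb_id is requested, since we only care about the number of results (numFound).

```r
# Assumption: base URL of the PDBe search endpoint (not stated in the text).
base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select?"

pdb_ids <- c("2hhb", "1out", "1hv4")

# One query URL per entry; only pdb_id is requested to keep responses small.
urls <- paste0(base_url, "q=pdb_id:", pdb_ids, "&fl=pdb_id&wt=json")
names(urls) <- pdb_ids

# Hypothetical helper: numFound is the number of results for one query.
count_documents <- function(url) {
  jsonlite::fromJSON(url)$response$numFound
}

# Hypothetical switch: set to TRUE to actually contact the API.
run_online <- FALSE
if (run_online) {
  print(sapply(urls, count_documents))  # one count per accession code
}
```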
In all cases, we got two results, corresponding to the two haemoglobin chains.
This means the API returns results based on the biological assembly, so we won’t
need to take into account multiple copies of the same molecule in the asymmetric
unit of crystal structures (of course, assuming annotations in the PDB are
correct).
In future blog posts in this series, I will show how to use PDB metadata to
address questions more subtle than those answered by global PDB statistics.