The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of biological macromolecules (mostly proteins and nucleic acids). The structures it contains are themselves very useful for answering biological questions, or for asking even more questions. In this series of blog posts, I will show with examples how the associated metadata (mostly structure annotations) can also answer interesting questions.
The PDB already provides statistics on some of its metadata, but these are very general in scope. The PDB in Europe (PDBe) provides programmatic access to the database through the PDBe API. By collecting appropriate metadata from the database, one can get much finer insight, for example specific to a particular field of structural biology.
In this first post, I will introduce the PDBe API. I will show how to formulate queries with the PDB metadata and how to retrieve the corresponding data using R.
PDBe API documentation
Directly relevant documentation
The documentation of the PDBe search feature is very brief: it is simply a list of all searchable keywords and how they relate to structure annotations. Luckily, it is accompanied by an interactive tool that greatly helps building and debugging queries (scroll all the way to the bottom of the list to find the tool). Constructing a query with this tool, and clicking on the “RunCall” button, will display the URL that will replay this same query at will (which we will use from R scripts) and can also display the results in JSON format.
The PDBe also provides a tutorial on how to use the search API. But this is demonstrated using Python in Jupyter notebooks, and I am not familiar with either, so this didn’t help me much. It might help you.
When accessing a properly constructed URL, the API will respond by sending results matching the query in JSON format.
To provide programmatic access to the database, the PDBe uses software called Solr. The Solr documentation is also helpful for figuring out how to formulate queries. The PDBe API documentation also points to this page of the Solr wiki.
There is an R package for using Solr from R, which I have not tried yet: I feel that building queries directly in the URL of the API call is easier.
An example query and its result
query <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:6d6v&fl=pdb_id,title&wt=json'
In the above URL,
q=pdb_id:6d6v indicates that we request information on the database entry with the unique PDB accession code 6d6v,
fl=pdb_id,title indicates which fields of the results we want. Requesting the pdb_id field will confirm that the request worked, and requesting title will illustrate the example.
wt=json will make sure the API sends results in JSON format, which is very easy to import in R using the jsonlite package.
Now, let’s download the result and look at it:
result <- jsonlite::fromJSON(query)
str(result)
## List of 2
##  $ responseHeader:List of 3
##   ..$ status: int 0
##   ..$ QTime : int 1
##   ..$ params:List of 3
##   .. ..$ q : chr "pdb_id:6d6v"
##   .. ..$ fl: chr "pdb_id,title"
##   .. ..$ wt: chr "json"
##  $ response :List of 3
##   ..$ numFound: int 8
##   ..$ start : int 0
##   ..$ docs :'data.frame': 8 obs. of 2 variables:
##   .. ..$ pdb_id: chr [1:8] "6d6v" "6d6v" "6d6v" "6d6v" ...
##   .. ..$ title : chr [1:8] "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" ...
The resulting R object is a list with two elements, which are themselves lists:
responseHeader recapitulates the query parameters,
response contains the results.
The response list contains 3 items:
numFound: the number of results,
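start: the offset of the first returned result within the full list of matches (0 here, meaning the results start from the first match),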
docs: a data frame in which each column is a field requested in the query (fl=) and each row is a macromolecule matching the query.
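To make this structure concrete, here is a minimal sketch that rebuilds by hand a list with the same shape as the one printed by str() above, then pulls out each part. The object result_mock and its values are stand-ins typed in for illustration (the title is shortened); this is not a live API response.

```r
# Hand-built stand-in with the same shape as the parsed API response
# (assumption: values typed in for illustration, title shortened).
result_mock <- list(
  responseHeader = list(
    status = 0L,
    QTime  = 1L,
    params = list(q = "pdb_id:6d6v", fl = "pdb_id,title", wt = "json")
  ),
  response = list(
    numFound = 8L,
    start    = 0L,
    docs     = data.frame(
      pdb_id = rep("6d6v", 8),
      title  = rep("CryoEM structure of Tetrahymena telomerase ...", 8),
      stringsAsFactors = FALSE
    )
  )
)

# The query string echoed back in the response header
result_mock$responseHeader$params$q

# Number of results found, and the data frame holding them
n_found <- result_mock$response$numFound
docs    <- result_mock$response$docs

# One row per result, one column per requested field
dim(docs)
names(docs)
```

The same accessors (result$response$numFound, result$response$docs) apply unchanged to the real object returned by jsonlite::fromJSON().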
Let’s print the data frame (I convert it to a tibble because it prints more neatly):
## # A tibble: 8 x 2
##   pdb_id title
## * <chr>  <chr>
## 1 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 2 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 3 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 4 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 5 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 6 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 7 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 8 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
We got the correct PDB accession code, and the correct title, but repeated 8 times. This is surprising, because PDB accession codes are unique. Why didn’t we get only one row in this result table?
Understanding the data format
One difficulty I had was to understand the format of the results returned by the API. I could not find any formal documentation anywhere, but there are hints in the tutorial’s Jupyter notebooks:
PDBe Solr instance serves documents based on polymeric entities in PDB entries, i.e. each document indexed by Solr represents polymeric molecules of type protein, sugar, DNA, RNA or DNA/RNA hybrid. This is why for entry 2qk9 we get 3 documents in the response, each representing the protein, RNA and DNA molecule in that entry.
This means that the number of results returned is not the number of PDB entries matching the query (unlike the search bar on the website, which counts entries), which is a little bit confusing. Instead, each result is a biological macromolecule. This makes answering certain questions easier, when we’re interested in a particular property of a biological macromolecule. But this also makes answering other (simpler) questions less straightforward: to determine the number of PDB entries matching a query, one has to explicitly count the number of unique PDB accession codes found in the returned results (because each result is a macromolecule, the accession code of any structure containing more than one macromolecule is repeated across results).
This explains the above result: there are 8 different macromolecules in the structure of the telomerase complex.
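Counting entries rather than macromolecules can be sketched in a few lines of base R. The data frame below is a made-up stand-in for result$response$docs, mixing the two entries mentioned in this post (8 rows for 6d6v, 3 rows for 2qk9); it is not the output of a real query.

```r
# Stand-in for result$response$docs (assumption: made-up rows combining the
# two entries quoted in this post; the title column is left out for brevity).
docs <- data.frame(
  pdb_id = c(rep("6d6v", 8), rep("2qk9", 3)),
  stringsAsFactors = FALSE
)

# Number of results = number of macromolecules
n_molecules <- nrow(docs)

# Number of PDB entries = number of distinct accession codes
n_entries <- length(unique(docs$pdb_id))

# Number of macromolecules in each entry
molecules_per_entry <- table(docs$pdb_id)

n_molecules
n_entries
molecules_per_entry
```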
Selecting desired information
This other paragraph of the tutorial is also relevant:
Fields in PDBe’s entity-based Solr document cover a wide range of properties, such as entry’s experimental details, details of deposition and primary publication, entity’s taxonomy, entry’s quality, entity’s cross references to UniProt and popular domain databases, biological assembly, etc. They are documented here: http://wwwdev.ebi.ac.uk/pdbe/api/doc/search.html It is also useful now to understand a little more about Solr querying. Solr has a rich and complex query syntax, described at http://wiki.apache.org/solr/CommonQueryParameters and elsewhere.
The fields of immediate relevance to us in this tutorial are:
q - the query itself. There is a lot of flexibility in describing a query, e.g. fields, wildcards, case-insensitivity, logical operators, ranges, etc.
rows - number of results returned by Solr. Needs to be explicitly set in mysolr because it defaults to 10. Useful if only part of results are desired.
fl - fields returned in each document. This is useful to reduce the size of response.
We will only receive the fields (fl=) we explicitly requested, which significantly reduces the size of the returned JSON data. We should also note that by default only 10 results will be returned: if we want all of them, we need to request more rows (simply choose a very large number to get all possible results; at the time of writing this blog post, there are 146093 PDB entries in total, and because each result is a macromolecule rather than an entry, a broad query can return more results than that).
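Putting q, fl and rows together, a query URL can be assembled programmatically instead of being typed by hand. The sketch below is one possible way to do it: build_pdbe_query() and is_complete() are hypothetical helpers written for this post, not functions from any package, and the value given to rows is an arbitrary large number.

```r
# Hypothetical helper (not from any package): assemble a PDBe search URL
# from the Solr parameters discussed above.
# Note: values are pasted as-is; a q containing spaces or special characters
# would need URL-encoding first (e.g. with utils::URLencode()).
build_pdbe_query <- function(q, fl, rows = NULL) {
  base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select"
  params <- c(
    q  = q,
    fl = paste(fl, collapse = ","),  # fields to return, comma-separated
    # rows is only added when requested; avoid scientific notation (1e+06)
    if (!is.null(rows)) c(rows = format(rows, scientific = FALSE)),
    wt = "json"
  )
  paste0(base_url, "?", paste0(names(params), "=", params, collapse = "&"))
}

# Hypothetical check: TRUE when the response holds every matching result,
# i.e. nothing was cut off by the rows limit.
is_complete <- function(result) {
  nrow(result$response$docs) == result$response$numFound
}

# Same query as earlier in this post, but asking for (up to) a million rows
query_all <- build_pdbe_query("pdb_id:6d6v", c("pdb_id", "title"),
                              rows = 1000000)
query_all

# With network access, one would then run:
# result <- jsonlite::fromJSON(query_all)
# is_complete(result)
```

Comparing numFound with the number of rows actually received is a cheap way to notice that the rows value was too small.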
A note on crystal structures
There is one important thing to consider in the case of crystal structures. They
are defined by their asymmetric unit: the smallest part of the crystal that can
reconstruct the entire crystal by first applying space group symmetry operators
to reconstruct the unit cell, then applying translations of the unit cell in all
three directions to reconstruct the entire crystal. But the relevant biological
assembly may or may not coincide with the asymmetric unit, as explained
here. It is therefore important to check that the API returns only the
content of the biological assembly, and not of the asymmetric unit, because
structures with more than one copy of a molecule in the asymmetric unit should
not end up as duplicates in the results. Using the interactive query
builder and the haemoglobin examples (PDB accession codes 2HHB, 1OUT and
1HV4), one can verify that these three queries return
the same number of macromolecules (the two haemoglobin chains, alpha and
beta). We can also do that from R:
# Checking PDB 2HHB (4 chains in asymmetric unit)
hb1 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:2hhb&fl=pdb_id,title&wt=json'
hb1_result <- jsonlite::fromJSON(hb1)
hb1_result$response$numFound
## [1] 2
# Checking PDB 1OUT (2 chains in asymmetric unit)
hb2 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:1out&fl=pdb_id,title&wt=json'
hb2_result <- jsonlite::fromJSON(hb2)
hb2_result$response$numFound
## [1] 2
# Checking PDB 1HV4 (8 chains in asymmetric unit)
hb3 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:1hv4&fl=pdb_id,title&wt=json'
hb3_result <- jsonlite::fromJSON(hb3)
hb3_result$response$numFound
## [1] 2
In all cases, we got two results, corresponding to the two haemoglobin chains. This means the API returns results based on the biological assembly, so we won’t need to take into account multiple copies of the same molecule in the asymmetric unit of crystal structures (of course, assuming annotations in the PDB are correct).
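The three checks follow the same pattern, so they can be folded into a small function applied to a vector of accession codes. This is only a sketch: count_results() is a hypothetical helper, and fake_fetch() is an offline stand-in that always answers "2 results" so the code can be tried without a network connection. It does not prove anything about the real API.

```r
# Hypothetical helper: number of results the API reports for one PDB entry.
# `fetch` is any function turning a URL into a parsed response; the default
# is jsonlite::fromJSON, as used in this post (requires network access).
count_results <- function(pdb_id, fetch = jsonlite::fromJSON) {
  url <- paste0(
    "https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:",
    tolower(pdb_id), "&fl=pdb_id&wt=json"
  )
  fetch(url)$response$numFound
}

# Offline stand-in for the API (assumption: always reports 2 results).
fake_fetch <- function(url) list(response = list(numFound = 2L))

ids <- c("2HHB", "1OUT", "1HV4")

# One count per accession code, returned as a named integer vector
counts <- vapply(ids, count_results, integer(1), fetch = fake_fetch)
counts
```

With a network connection, dropping the fetch = fake_fetch argument sends the three real queries instead.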
In future blog posts in this series, I will show how to use PDB metadata to address questions more subtle than those answered by global PDB statistics.