Using metadata from the PDB

Introduction

The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of biological macromolecules (mostly proteins and nucleic acids). The structures it contains are themselves very useful for answering biological questions, or for asking even more questions. In this series of blog posts, I will show with examples how the associated metadata (mostly structure annotations) can also answer interesting questions.

The PDB already provides statistics on some of its metadata, but these are very general in scope. The PDB in Europe (PDBe) provides programmatic access to the database through the PDBe API. By collecting appropriate metadata from the database, one can get much finer insight, for example into a particular field of structural biology.

In this first post, I will introduce the PDBe API. I will show how to formulate queries with the PDB metadata and how to retrieve the corresponding data using R.

PDBe API documentation

Directly relevant documentation

The documentation of the PDBe search feature is very brief: it is simply a list of all searchable keywords and how they relate to structure annotations. Luckily, it is accompanied by an interactive tool that greatly helps with building and debugging queries (scroll all the way to the bottom of the list to find the tool). Constructing a query with this tool and clicking the “RunCall” button displays the URL that replays the same query (which we will use from R scripts), and can also display the results in JSON format.

The PDBe also provides a tutorial on how to use the search API. But this is demonstrated using Python in Jupyter notebooks, and I am not familiar with either, so this didn’t help me much. It might help you.

When accessing a properly constructed URL, the API will respond by sending results matching the query in JSON format.

Broader documentation

To provide programmatic access to the database, the PDBe uses software called Solr. The Solr documentation is also helpful for figuring out how to formulate queries. The PDBe API documentation also points to this page of the Solr wiki.

There is an R package to use Solr from R, but I have not tried it yet: building queries directly in the URL of the API call seems easier to me.

An example query and its result

Let’s try the simplest way to retrieve information from the PDB: using a unique PDB accession code, for example 6D6V, a recent structure of telomerase. Here is a possible URL for that query:

query <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:6d6v&fl=pdb_id,title&wt=json'

In the above URL, q= indicates that we request information on the database entry with the unique PDB accession code (pdb_id) 6d6v and fl= indicates which fields of the results we want. Requesting the pdb_id field will confirm that the request worked, and requesting title will illustrate the example. Last, wt=json will make sure the API sends results in JSON format, which is very easy to import in R using the jsonlite package.
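As a minimal sketch, the same URL could also be assembled from its three parts, which makes it easier to change the query or the requested fields later. The helper name build_pdbe_query is my own invention for illustration (it is not part of any package), and it assumes the query contains no characters that need URL encoding.

```r
# Hypothetical helper (not from any package): assemble a PDBe search URL
# from the query (q), the requested fields (fl) and the output format (wt).
build_pdbe_query <- function(q, fields, wt = "json") {
  base_url <- "https://www.ebi.ac.uk/pdbe/search/pdb/select"
  paste0(
    base_url,
    "?q=", q,
    "&fl=", paste(fields, collapse = ","),  # fields are comma-separated
    "&wt=", wt
  )
}

# Rebuild the URL used in this example
query <- build_pdbe_query("pdb_id:6d6v", c("pdb_id", "title"))
query
```

Queries containing spaces or other special characters would presumably need to be encoded first, for example with utils::URLencode().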

Now, let’s download the result and look at it:

result <- jsonlite::fromJSON(query)
str(result)
## List of 2
##  $ responseHeader:List of 3
##   ..$ status: int 0
##   ..$ QTime : int 1
##   ..$ params:List of 3
##   .. ..$ q : chr "pdb_id:6d6v"
##   .. ..$ fl: chr "pdb_id,title"
##   .. ..$ wt: chr "json"
##  $ response      :List of 3
##   ..$ numFound: int 8
##   ..$ start   : int 0
##   ..$ docs    :'data.frame': 8 obs. of  2 variables:
##   .. ..$ pdb_id: chr [1:8] "6d6v" "6d6v" "6d6v" "6d6v" ...
##   .. ..$ title : chr [1:8] "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" "CryoEM structure of Tetrahymena telomerase with telomeric DNA at 4.8 Angstrom resolution" ...

The resulting R object is a list with two elements, which are themselves lists:

  1. responseHeader recapitulates the query parameters,
  2. response contains the results.

The response list contains 3 items:

  1. numFound: the number of results,
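  2. start: the offset of the first returned result within the full set of matches (0 here, meaning the results start from the first match; this is Solr’s paging parameter),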
  3. docs: a data frame in which each column is a field requested in the query (fl=) and each row is a macromolecule matching the query.
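To make this structure concrete, here is a small sketch of how the individual pieces can be pulled out of such an object. It uses a hand-built mock with the same shape as the output above, so that it runs without network access; the title is a shortened placeholder, and the variable names are my own.

```r
# Mock of the object returned by jsonlite::fromJSON(), built by hand so the
# example runs offline. Titles are shortened placeholders.
mock_result <- list(
  responseHeader = list(
    status = 0L,
    QTime  = 1L,
    params = list(q = "pdb_id:6d6v", fl = "pdb_id,title", wt = "json")
  ),
  response = list(
    numFound = 8L,
    start    = 0L,
    docs     = data.frame(
      pdb_id = rep("6d6v", 8),
      title  = rep("CryoEM structure of Tetrahymena telomerase ...", 8),
      stringsAsFactors = FALSE
    )
  )
)

# The query as recorded in the response header
mock_result$responseHeader$params$q

# Number of matches reported by the server, and number of rows received
n_found    <- mock_result$response$numFound
n_received <- nrow(mock_result$response$docs)
c(found = n_found, received = n_received)
```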

Let’s print the data frame (I convert it to a tibble because it prints more nicely):

tibble::as_tibble(result$response$docs)
## # A tibble: 8 x 2
##   pdb_id title                                                            
## * <chr>  <chr>                                                            
## 1 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 2 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 3 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 4 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 5 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 6 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 7 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…
## 8 6d6v   CryoEM structure of Tetrahymena telomerase with telomeric DNA at…

We got the correct PDB accession code, and the correct title, but repeated 8 times. This is surprising, because PDB accession codes are unique. Why didn’t we get only one row in this result table?

Understanding the data format

One difficulty I had was to understand the format of the results returned by the API. I could not find any formal documentation anywhere, but there are hints in the tutorial’s Jupyter notebooks:

PDBe Solr instance serves documents based on polymeric entities in PDB entries, i.e. each document indexed by Solr represents polymeric molecules of type protein, sugar, DNA, RNA or DNA/RNA hybrid. This is why for entry 2qk9 we get 3 documents in the response, each representing the protein, RNA and DNA molecule in that entry.

This means that the number of results returned is not the number of PDB entries matching the query (which is what the search bar on the website reports), and this is a little confusing. Instead, each result is a biological macromolecule. This makes answering certain questions easier, when we’re interested in a particular property of a biological macromolecule. But it also makes answering other (simpler) questions less straightforward: to determine the number of PDB entries matching a query, one has to explicitly count the number of unique PDB accession codes found in the returned results (because each result is a macromolecule, the accession code of any structure containing more than one macromolecule will be duplicated across results).

This explains the above result: there are 8 different macromolecules in the structure of the telomerase complex.
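Counting entries rather than macromolecules can be sketched as follows. The data frame is a mock standing in for result$response$docs; it combines the 8 rows we got for 6d6v with the 3 documents the tutorial mentions for 2qk9, and the variable names are my own.

```r
# Mock docs data frame: one row per macromolecule, as returned by the API.
# 8 rows for 6d6v (see above) and 3 rows for 2qk9 (from the tutorial quote).
docs <- data.frame(
  pdb_id = c(rep("6d6v", 8), rep("2qk9", 3)),
  stringsAsFactors = FALSE
)

# Number of results (macromolecules) vs. number of distinct PDB entries
n_macromolecules <- nrow(docs)
n_entries        <- length(unique(docs$pdb_id))

# Number of macromolecules per entry
entity_counts <- table(docs$pdb_id)
entity_counts
```

Here n_macromolecules is 11 while n_entries is 2, which is the difference between what the API counts and what the website search bar counts.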

Selecting desired information

This other paragraph of the tutorial is also relevant:

Fields in PDBe’s entity-based Solr document cover a wide range of properties, such as entry’s experimental details, details of deposition and primary publication, entity’s taxonomy, entry’s quality, entity’s cross references to UniProt and popular domain databases, biological assembly, etc. They are documented here: http://wwwdev.ebi.ac.uk/pdbe/api/doc/search.html It is also useful now to understand a little more about Solr querying. Solr has a rich and complex query syntax, described at http://wiki.apache.org/solr/CommonQueryParameters and elsewhere.

The fields of immediate relevance to us in this tutorial are:

q - the query itself. There is a lot of flexibility in describing a query, e.g. fields, wildcards, case-insensitivity, logical operators, ranges, etc.

rows - number of results returned by Solr. Needs to be explicitly set in mysolr because it defaults to 10. Useful if only part of results are desired.

fl - fields returned in each document. This is useful to reduce the size of response.

We will only receive the fields (fl=) we explicitly requested, which significantly reduces the size of the returned JSON data. We should also note that by default only 10 results will be returned: if we want all of them, we need to request more rows (simply choose a very large number to get all possible results; at the time of writing this blog post, there is a total of 146093 PDB entries).
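A possible way to handle this from R is sketched below: one helper appends rows= to an existing query URL, and another checks whether a result was cut short by comparing the number of rows received with numFound. Both helper names are hypothetical, the value of one million is just an arbitrary "very large number", and the mock result uses made-up numbers.

```r
# Hypothetical helper: append an explicit rows= parameter to a query URL.
# format(..., scientific = FALSE) avoids values such as "1e+06" in the URL.
with_rows <- function(url, rows) {
  paste0(url, "&rows=", format(rows, scientific = FALSE))
}

# Hypothetical helper: TRUE when fewer rows were received than the server found.
# NROW() also copes with an empty docs element.
is_truncated <- function(result) {
  NROW(result$response$docs) < result$response$numFound
}

query <- "https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:6d6v&fl=pdb_id,title&wt=json"
query_all <- with_rows(query, 1e6)
query_all

# Mock result with made-up numbers: 25 matches, only the default 10 rows returned.
# "xxxx" is a placeholder, not a real accession code.
mock_truncated <- list(response = list(
  numFound = 25L,
  docs     = data.frame(pdb_id = rep("xxxx", 10), stringsAsFactors = FALSE)
))
is_truncated(mock_truncated)
```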

A note on crystal structures

There is one important thing to consider in the case of crystal structures. They are defined by their asymmetric unit: the smallest part of the crystal that can reconstruct the entire crystal by first applying space group symmetry operators to reconstruct the unit cell, then applying translations of the unit cell in all three directions to reconstruct the entire crystal. But the relevant biological assembly may or may not coincide with the asymmetric unit, as explained here. It is therefore important to check that the API returns only the content of the biological assembly, and not of the asymmetric unit, because structures with more than one copy of a molecule in the asymmetric unit should not end up as duplicates in the results. Using the interactive query builder and the haemoglobin examples (PDB accession codes 2HHB, 1OUT and 1HV4), one can verify that these three queries return the same number of macromolecules (the two haemoglobin chains, alpha and beta). We can also do that from R:

# Checking PDB 2HHB (4 chains in asymmetric unit)
hb1 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:2hhb&fl=pdb_id,title&wt=json'
hb1_result <- jsonlite::fromJSON(hb1)
hb1_result$response$numFound
## [1] 2
# Checking PDB 1OUT (2 chains in asymmetric unit)
hb2 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:1out&fl=pdb_id,title&wt=json'
hb2_result <- jsonlite::fromJSON(hb2)
hb2_result$response$numFound
## [1] 2
# Checking PDB 1HV4 (8 chains in asymmetric unit)
hb3 <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:1hv4&fl=pdb_id,title&wt=json'
hb3_result <- jsonlite::fromJSON(hb3)
hb3_result$response$numFound
## [1] 2

In all cases, we got two results, corresponding to the two haemoglobin chains. This means the API returns results based on the biological assembly, so we won’t need to take into account multiple copies of the same molecule in the asymmetric unit of crystal structures (of course, assuming annotations in the PDB are correct).
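Repeating the same three lines for each accession code is error-prone, so the check could also be wrapped in a small function and applied to a vector of codes. This is only a sketch: pdb_id_url and count_results are names I made up, and the fetch argument is there so that the function downloading the JSON can be swapped out (it defaults to jsonlite::fromJSON, which needs network access).

```r
# Hypothetical helper: URL of the query for a single PDB accession code.
pdb_id_url <- function(pdb_id) {
  paste0(
    "https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:",
    tolower(pdb_id),
    "&fl=pdb_id,title&wt=json"
  )
}

# Hypothetical helper: number of results (macromolecules) for one accession code.
# `fetch` is the function that downloads and parses the JSON response.
count_results <- function(pdb_id, fetch = jsonlite::fromJSON) {
  fetch(pdb_id_url(pdb_id))$response$numFound
}

# The three haemoglobin entries checked above
ids <- c("2HHB", "1OUT", "1HV4")
vapply(ids, pdb_id_url, character(1))
```

With network access and jsonlite installed, vapply(ids, count_results, integer(1)) should then return one count per accession code, named by code.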

Next plans

In future blog posts in this series, I will show how to use PDB metadata to address questions more subtle than those answered by global PDB statistics.