DNA length in crystal structures

Introduction

A few months ago, a colleague trying to crystallize a protein/DNA complex asked for my input about the length of the DNA molecule to use to make this complex. It is known that the length and type of ends (blunt or cohesive with different numbers of overhanging nucleotides) of a DNA molecule influences the crystallization propensity of a protein/DNA complex (see for instance Hollis, 2007). His approach was to look up a few crystal structures of protein/DNA complexes in the PDB to get a sense of the typical length of DNA in such structures. It is a valid approach, but is nonetheless only anecdotal evidence if it’s based on a small number of randomly chosen protein/DNA complexes. Using programmatic access to the PDB metadata, as I described in a previous post, we can answer such a question on the basis of all deposited crystal structures of protein/DNA complexes. This will give a much finer answer in the form of a distribution of DNA lengths, instead of a guess from a few structures.

Required packages

We will need the following packages:

library(magrittr)
library(here)
library(curl)
library(jsonlite)
library(tibble)
library(stringr)
library(dplyr)
library(ggplot2)
library(plotly)

Getting data

Building a query

To answer our question, we need to retrieve all molecules matching the following criteria:

  • the molecule is DNA,
  • the the molecule is found in a structure of a protein/DNA complex,
  • the molecule is found in a crystal structure (cryoEM and NMR structures won’t help us make choices for a crystallization project).

This translates as follows in terms of a search query for the PDBe API:

q=molecule_type:"DNA" AND assembly_composition:"DNA/protein complex" AND experimental_method:"X-ray diffraction"

Additionally, we need:

  • the sequence of the macromolecule (this is required to answer our question, as we will compute the length from the sequence),
  • the PDB accession codes (simply to check which complex contains any given DNA molecule we will find in the results, out of curiosity),
  • all results (not only the first 10),
  • and we need results in JSON format.

This translates as follows:

fl=pdb_id,molecule_sequence&rows=1000000&wt=json

Putting it together with the base URL of the search API, this gives the following complete URL:

pdb_query <- 'https://www.ebi.ac.uk/pdbe/search/pdb/select?q=molecule_type:%22DNA%22%20AND%20assembly_composition:%22DNA/protein%20complex%22%20AND%20experimental_method:%22X-ray%20diffraction%22&fl=pdb_id,molecule_sequence&rows=1000000&wt=json'

Quoting the spaces in “DNA/protein complex” and “X-ray diffraction” is not sufficient to make the query work properly. Doing so returns an HTTP error 505, while using the URL as it is written above works fine (copying the URL provided by the PDBe API query builder should give the correct escape characters).

Retrieving data

To avoid downloading a new dataset everytime I rebuild this blog, I will store it and retrieve it only if the file doesn’t exist:

dna_length_dataset <- here("static", "datasets", "dna-length.json")
if (!file.exists(dna_length_dataset)) {
    curl_download(url = pdb_query, destfile = dna_length_dataset)
}
pdb_data <- fromJSON(dna_length_dataset)

If you try to run the code presented in this post, you will likely get slightly different results as new strutures are deposited in the PDB. To get the same results as in this post, use the dataset saved at the time I last ran this code (11 Dec 2018).

Cleaning data

As explained in the first post in this series, each result is a macromolecule from the biological assembly (i.e. without crystallographic duplicates). This is convenient in this case: we received the field molecule_type, and the relevant data is already stored in a table, so we can very easily compute the length of each DNA sequence and store it in a new column in the same table:

cleaned_data <- pdb_data$response$docs %>%
    as.tibble() %>%
    rowwise() %>%
    mutate(dna_length = str_length(molecule_sequence)) %>%
    select(pdb_id, dna_length, dna_sequence = molecule_sequence) %>%
    ungroup()

Answering our question

Summary and initial observations

We now have a table of DNA sequences found in crystal structures of protein/DNA complexes, along with their length:

cleaned_data
## # A tibble: 8,101 x 3
##    pdb_id dna_length dna_sequence                                         
##    <chr>       <int> <chr>                                                
##  1 2dnj            8 GCGATCGC                                             
##  2 3rn2           20 CCATCAAAGATCTTTGATGG                                 
##  3 3mfi           15 TAATGAGGGGAGGAC                                      
##  4 4e68           18 TGCATTTCCCGTAAATCT                                   
##  5 1ej9           23 CAAAAATTTTTCTGAGTCTTTTT                              
##  6 4j9n            9 TAGCGTCAG                                            
##  7 4j9n           12 CATTCTGACGCT                                         
##  8 5av8          147 ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCAT…
##  9 3u6d           16 TGCGTCGGGAXCTACC                                     
## 10 2o19            8 CGCAACTT                                             
## # ... with 8,091 more rows

From this, we can first determine the minimal, maximal, median and average DNA length found in deposited crystal structures of protein/DNA complexes:

summary(cleaned_data$dna_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   11.00   14.00   19.56   19.00 1122.00

The minimal length comes as a surprise: a single base pair? It might be an annotation mistake, calling a DNA molecule 1 bp long what might actually be a nucleotide cofactor, and calling a protein/DNA complex what might actually be an enzyme with such a cofactor. There is 1 DNA molecule of this length in our current dataset:

cleaned_data %>% filter(dna_length == min(.$dna_length))
## # A tibble: 1 x 3
##   pdb_id dna_length dna_sequence
##   <chr>       <int> <chr>       
## 1 1mvm            1 A

Turns out it is not an annotation error. PDB entry 1MVM is a protein/DNA complex.

The maximal length is also a little bit surprising, considering that longer DNA molecules are more flexible, and that flexibility tends to hinder crystallization. There are 2 DNA molecules of this length in our current dataset:

cleaned_data %>% filter(dna_length == max(.$dna_length))
## # A tibble: 2 x 3
##   pdb_id dna_length dna_sequence                                          
##   <chr>       <int> <chr>                                                 
## 1 6hkt         1122 ATCGCTGTTCAATACATGCACAGGATGTATATATCTGACACGTGCCTGGAGAC…
## 2 6hkt         1122 ATCACCCTATACGCGGCCGCCCTGGAGAATCCCGGTGCCGAGGCCGCTCAATT…

If I had to guess which structure this is, I would say the one of a nucleosome array. 6HKT is indeed a nucleosome array: 6 nucleosomes bound by linker histone H1.

DNA length distribution

Back to our question: we want a distribution, therefore our answer is best expressed by a histogram showing the number of deposited crystal structures of protein/DNA complexes as a function of the DNA length found in these structures.

The histogram below is interactive, you can zoom in on a region of interest:

final_plot <- ggplot(data = cleaned_data) +
    geom_histogram(mapping = aes(x = dna_length), binwidth = 1) +
    theme_bw() +
    xlab("DNA length (bp)") +
    ylab("Crystal structures of protein/DNA complexes")
ggplotly(final_plot)

Most of the distribution resides in a shorter length range, between 0 and 150 bp. The spike at around 147 bp comes from nucleosome structures (mono-nucleosomes, not arrays). It is impressive to see that there are enough of them to stand out significantly in the entire distribution.

Distribution in the 0-50 bp range

Nucleosomes are fascinating, but are admittedly peculiar structures: when designing a piece of DNA for crystallization of a protein/DNA complex other than a nucleosome complex, one should not be biased by the length of nucleosomal DNA. Which means we can further zoom in between 0 and 50 bp and get a clearer picture answering our initial question (median DNA length depicted by a vertical red line):

ggplot(data = cleaned_data, aes(x = dna_length)) +
    geom_histogram(binwidth = 1, color = "black", fill = "white") +
    geom_vline(xintercept = median(cleaned_data$dna_length), color = "red") +
    xlim(c(0, 50)) +
    theme_bw() +
    xlab("DNA length (bp)") +
    ylab("Crystal structures of protein/DNA complexes")

The most common DNA length seems to be 16 bp, or 12 bp if we consider the 16 bp spike an outlier in the distribution.

We can also filter out everything longer than 50 bp and recalculate a less skewed average length:

cleaned_data %>%
    filter(dna_length <= 50) %>%
    select(dna_length) %>%
    summary()
##    dna_length   
##  Min.   : 1.00  
##  1st Qu.:10.00  
##  Median :14.00  
##  Mean   :15.14  
##  3rd Qu.:18.00  
##  Max.   :50.00

Follow-up questions

This quick analysis suggests at least the following questions:

  1. What is the diversity of structures with one given DNA length? Around 147 bp in length, no doubt all structures contain a nucleosome. What about spikes in the distribution like those at 5 bp and 35 bp? Are they many related structures? (same DNA sequence, different variants of the binding protein?).
  2. How does the distribution compare between structures solved by crystallography and cryoEM? My guess here is that the distribution of cryoEM structures across DNA length might be centered on a much longer length, possibly on nucleosomal DNA length (i.e. around 147 bp).
  3. How do the crystallography and cryoEM distributions compare to the entire PDB distribution? Do they recapitulate the trend over the entire PDB, or are there enough NMR structures to significantly shape the global distribution as well?
  4. DNA has also been studied in isolation (without any protein bound): what do the 4 distributions (crystallography, NMR, cryoEM and global) look like? My guess here is that there probably isn’t any cryoEM structure of naked DNA, and there is also probably a large number of NMR structures of short sequences.
  5. What do the equivalent distributions look like for protein/RNA complexes? For isolated RNA structures? How much do ribosome structures skew these distributions?

I might cover some of them in future blog posts.

References

Hollis T (2007) Crystallization of protein-DNA complexes. Methods Mol. Biol. 363: 225–237 https://doi.org/10.1007/978-1-59745-209-0_11