A few months ago, a colleague trying to crystallize a protein/DNA complex asked
for my input about the length of the DNA molecule to use to make this complex.
It is known that the length and type of ends (blunt or cohesive with different
numbers of overhanging nucleotides) of a DNA molecule influence the
crystallization propensity of a protein/DNA complex (see for instance Hollis, 2007). His approach was to look up a
few crystal structures of protein/DNA complexes in the PDB to get a sense
of the typical length of DNA in such structures. It is a valid approach, but is
nonetheless only anecdotal evidence if it’s based on a small number of randomly
chosen protein/DNA complexes. Using programmatic access to the PDB metadata, as
I described in a previous post, we can answer such a
question on the basis of all deposited crystal structures of protein/DNA
complexes. This will give a much finer answer in the form of a distribution of
DNA lengths, instead of a guess from a few structures.
Quoting the spaces in “DNA/protein complex” and “X-ray diffraction” is not
sufficient to make the query work properly. Doing so returns an HTTP error 505,
while using the URL as it is written above works fine (copying the URL provided
by the PDBe API query builder should give the correct escape characters).
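As a minimal sketch of what the escaping does, base R can percent-encode the spaces itself. The two search terms below are taken from the text, but how they are assembled into a full PDBe query is not shown here, so treat this only as an illustration of the encoding step:

```r
# Percent-encode the spaces in the two search terms mentioned above.
# Only the encoding is illustrated; the full query URL is not reconstructed.
terms <- c("DNA/protein complex", "X-ray diffraction")
escaped <- vapply(terms, utils::URLencode, character(1), USE.NAMES = FALSE)
print(escaped)

# With reserved = TRUE, the slash is escaped as well.
strict <- utils::URLencode("DNA/protein complex", reserved = TRUE)
print(strict)
```

Whether the slash also needs escaping depends on where the term sits in the URL, which is one more reason to copy the string produced by the query builder.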
To avoid downloading a new dataset every time I rebuild this blog, I will store
it and retrieve it only if the file doesn’t exist:
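The caching logic can be sketched as follows. This is an illustration and not necessarily the exact code used for this post: `load_cached` is a hypothetical helper, and I assume here that the dataset is saved as a CSV file.

```r
# Hypothetical helper: download the dataset only if no local copy exists,
# then read the local copy. Assumes the data is stored as CSV.
# `fetch` defaults to utils::download.file and can be swapped for testing.
load_cached <- function(url, path, fetch = utils::download.file) {
  if (!file.exists(path)) {
    fetch(url, path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}
```

Deleting the local file is then enough to force a fresh download on the next build.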
As explained in the first post in this series, each result
is a macromolecule from the biological assembly (i.e. without crystallographic
duplicates). This is convenient in this case: we received the field
molecule_type, and the relevant data is already stored in a table, so we can
very easily compute the length of each DNA sequence and store it in a new column
in the same table:
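On a toy table, the length computation could look like the sketch below. The column name `molecule_sequence` and the three example rows are assumptions made for illustration; only `molecule_type`, `pdb_id` and `dna_length` appear in this post.

```r
# Toy data standing in for the downloaded table (made-up entries).
molecules <- data.frame(
  pdb_id = c("1abc", "1abc", "2xyz"),
  molecule_type = c("DNA", "Protein", "DNA"),
  molecule_sequence = c("ACGTACGTAC", "MKV", "GATTACA"),
  stringsAsFactors = FALSE
)

# Keep only the DNA molecules, then count the characters in each sequence.
dna <- molecules[molecules$molecule_type == "DNA", ]
dna$dna_length <- nchar(dna$molecule_sequence)
print(dna[, c("pdb_id", "dna_length")])
```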
From this, we can first determine the minimal, maximal, median and average DNA
length found in deposited crystal structures of protein/DNA complexes:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 11.00 14.00 19.54 19.00 1122.00
The minimal length comes as a surprise: a single base pair? It might be an
annotation mistake, calling a DNA molecule 1 bp long what might actually be a
nucleotide cofactor, and calling a protein/DNA complex what might actually be an
enzyme with such a cofactor. There is one
DNA molecule of this length in our current dataset:
## # A tibble: 1 x 3
## pdb_id dna_length dna_sequence
## <chr> <int> <chr>
## 1 1mvm 1 A
Turns out it is not an annotation error. PDB entry 1MVM is a
The maximal length is also a little bit surprising, considering that longer DNA
molecules are more flexible, and that flexibility tends to hinder crystallization.
Here are the DNA molecules of this length in our current dataset:
If I had to guess which structure this is, I would say the one of a nucleosome
array. 6HKT is indeed a nucleosome array: 6 nucleosomes bound by
linker histone H1.
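Looking up the entries at both extremes of the distribution can be sketched like this, again on made-up data (the PDB identifiers below are placeholders, not real lookups):

```r
# Toy table of DNA lengths; identifiers are placeholders.
dna <- data.frame(
  pdb_id = c("1aaa", "2bbb", "3ccc", "4ddd"),
  dna_length = c(1L, 14L, 147L, 14L),
  stringsAsFactors = FALSE
)

# Rows holding the shortest and the longest DNA molecule.
shortest <- dna[dna$dna_length == min(dna$dna_length), ]
longest <- dna[dna$dna_length == max(dna$dna_length), ]
print(shortest)
print(longest)
```

Filtering on equality with the minimum or maximum returns every tied entry, which matters when several chains of the same structure share the extreme length.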
DNA length distribution
Back to our question: we want a distribution, therefore our answer is best
expressed by a histogram showing the number of deposited crystal structures of
protein/DNA complexes as a function of the DNA length found in these structures.
The histogram below is interactive: you can zoom in on a region of interest.
Most of the distribution resides in a shorter length range, between 0 and 150
bp. The spike at around 147 bp comes from nucleosome structures
(mono-nucleosomes, not arrays). It is impressive to see that there are enough of
them to stand out significantly in the entire distribution.
Distribution in the 0-50 bp range
Nucleosomes are fascinating, but are admittedly peculiar structures: when
designing a piece of DNA for crystallization of a protein/DNA complex other than
a nucleosome complex, one should not be biased by the length of nucleosomal DNA.
This means we can further zoom in between 0 and 50 bp and get a clearer picture
answering our initial question (median DNA length depicted by a vertical red
line):
ggplot(data = cleaned_data, aes(x = dna_length)) +
  geom_histogram(binwidth = 1, color = "black", fill = "white") +
  geom_vline(xintercept = median(cleaned_data$dna_length), color = "red") +
  xlim(c(0, 50)) +
  theme_bw() +
  xlab("DNA length (bp)") +
  ylab("Crystal structures of protein/DNA complexes")
The most common DNA length seems to be 16 bp, or 12 bp if we consider the 16 bp
spike an outlier in the distribution.
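Instead of reading the most common length off the plot, it can be computed by tabulating the lengths. A minimal sketch on a made-up vector of lengths (not the real dataset):

```r
# Made-up DNA lengths, for illustration only.
dna_length <- c(12, 16, 16, 12, 16, 14, 10, 16, 12)

# Count how many times each length occurs, then rank the counts.
counts <- table(dna_length)
ranked <- sort(counts, decreasing = TRUE)
most_common <- as.integer(names(counts)[which.max(counts)])
print(ranked)
print(most_common)
```

The ranked table also shows the runner-up, which is the useful number if the top spike turns out to be dominated by a family of closely related structures.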
We can also filter out everything longer than 50 bp and recalculate a less skewed
summary:
## Min. : 1.00
## 1st Qu.:10.00
## Median :14.00
## Mean :15.15
## 3rd Qu.:18.00
## Max. :50.00
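The effect of this filter on the mean is easy to reproduce on a small made-up vector: a few very long molecules pull the mean up, while the median barely moves.

```r
# Made-up lengths including two long outliers.
dna_length <- c(1, 10, 14, 18, 50, 147, 1122)

# Keep only lengths up to 50 bp and compare the summaries.
short <- dna_length[dna_length <= 50]
print(summary(dna_length))
print(summary(short))
```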
This quick analysis suggests at least the following questions:
- What is the diversity of structures with one given DNA length? Around 147 bp
  in length, no doubt all structures contain a nucleosome. What about spikes in
  the distribution like those at 5 bp and 35 bp? Are they many related
  structures (same DNA sequence, different variants of the binding protein)?
- How does the distribution compare between structures solved by
  crystallography and cryoEM? My guess here is that the distribution of cryoEM
  structures across DNA length might be centered on a much longer length,
  possibly on nucleosomal DNA length (i.e. around 147 bp).
- How do the crystallography and cryoEM distributions compare to the entire PDB
  distribution? Do they recapitulate the trend over the entire PDB, or are
  there enough NMR structures to significantly shape the global distribution as
  well?
- DNA has also been studied in isolation (without any protein bound): what do
  the 4 distributions (crystallography, NMR, cryoEM and global) look like? My
  guess here is that there probably isn’t any cryoEM structure of naked DNA,
  and there is also probably a large number of NMR structures of short
  DNA molecules.
- What do the equivalent distributions look like for protein/RNA complexes? For
  isolated RNA structures? How much do ribosome structures skew these
  distributions?