A simple DNA scrambling function in R

Many DNA-binding proteins recognize a specific sequence, although they often bind DNA somewhat weakly regardless of its sequence. For the in vitro study of such sequence-specific protein/DNA interactions, an important control to prove that binding is specific is to show that there is no binding (or weaker binding by orders of magnitude) to a random sequence.

The ideal random sequence for this control is a scrambled version of the recognition sequence (i.e. the same base composition, but occurring in a completely different order). When faced with this simple problem of scrambling a DNA sequence, I tried to find an online tool to do that. A simple search leads to a tool offered by GenScript (a gene synthesis company), but upon closer inspection it turns out that the scrambled sequences it generates are not completely random; they are in fact biased in order to make good siRNA controls, to minimize possible matches to naturally occurring sequences.

In the case of an in vitro binding experiment using purified proteins and DNA, it doesn’t matter that the scrambled control might resemble another naturally occurring sequence. So, instead of spending more time searching for an online tool, I figured it shouldn’t be too difficult to write my own. In the rest of this post, I will explain how I did that with R.

Utility functions

From other packages

We will need a few functions from the stringr package, as well as the pipe operator from magrittr (for convenience).

import::here(.from = stringr, str_c, str_count, str_length)
import::here(.from = magrittr, `%>%`)

Base composition

Before we can scramble a DNA sequence, we need a function that calculates its base composition, because we will need to check that the original binding motif and the scrambled sequence have the same base composition.

base_composition <- function(dna_seq) {
    dna_bases <- c(A = "A", T = "T", G = "G", C = "C")
    composition <- str_count(dna_seq, dna_bases)
    names(composition) <- names(dna_bases)

This function is very easy to write because R has built-in named vectors. This means that after we count the occurrences of each base in an input sequence, we can transfer the names of the dna_bases vector to the composition vector we just built, and printing that vector will report a count for each base.

GC content

With this base_composition function in hand, we can very easily write a function to calculate the GC content of a sequence.

gc_content <- function(dna_seq) {
    composition <- base_composition(dna_seq)
    gc_dna <- 100 * (composition["G"] + composition["C"]) / str_length(dna_seq)
    gc_dna <- round(gc_dna, digits = 2)
    names(gc_dna) <- NULL

“Exploding” a DNA sequence

We will need one last utility function to turn a DNA sequence into a vector of single bases.

explode <- function(dna_seq) {
    substring(text = dna_seq,
              first = seq(1, str_length(dna_seq), 1),
              last  = seq(1, str_length(dna_seq), 1))

DNA sequence scrambling function

Finally, we can write the sequence scrambling function. The way it works is:

  1. turn a DNA sequence into a vector of single bases,
  2. shuffle the elements of this vector by drawing all of them one by one, randomly and without replacing them,
  3. concatenate the elements of this new vector into a new string.

This is where the pipe notation makes it much easier to read.

scramble_dna <- function(dna_seq) {
    dna_seq %>%
        explode() %>%
        sample(x = .,
               size = str_length(dna_seq),
               replace = FALSE) %>%
        str_c(collapse = "")


We can now use this function on an example sequence, say the EcoRI recognition motif GAATTC.

ecori <- "GAATTC"
## A T G C 
## 2 2 1 1
## [1] 33.33
ecori_scrambled <- scramble_dna(ecori)

# The scrambled sequence will obviously differ from the original one
all.equal(ecori, ecori_scrambled)
## [1] "1 string mismatch"
# But it will still have the same base composition and GC content
    ecori %>% base_composition(),
    ecori_scrambled %>% base_composition()
## [1] TRUE
    ecori %>% gc_content(),
    ecori_scrambled %>% gc_content()
## [1] TRUE

Possible improvements

This was easy to write, but could be better. One possible improvement would be to vectorize this function. As it is now, it will only work on a single DNA sequence. It could be nice to make it accept a vector of DNA sequences and return a vector of corresponding scrambled sequence.

An important and simple improvement would be to include an initial check of the alphabet of each string. If a string contains letters that are not A, C, G or T (for DNA) or A, C, G or U (for RNA), then it’s not a nucleic acid sequence and trying to compute base composition and GC content is irrelevant. The Biostrings package, part of Bioconductor, provides these alphabets in Biostrings::DNA_BASES and Biostrings::RNA_BASES.

Performance seems reasonnable with respect to sequence length: I tried with a ~15 kb long sequence, and base_composition, gc_content and scramble_dna all evaluated instantly (i.e. system.time reported 0 for all measured evaluation times). I don’t know how a vectorized version would perform with respect to the number of sequences to process.

Practical considerations

I could have looked whether Bioconductor offers such functionality (it definitely has advanced sequence manipulation tools), but this homemade version is totally sufficient for my typical need and took me very little time to write.

Some DNA-binding proteins recognize very short motifs, or derive specificity from very few base pairs within a larger consensus motif. As a result, it might be difficult to completely disrupt binding by simply shuffling the recognition sequence. One might need several attempts at shuffling a sequence before finding one that doesn’t match any feature (or as little as possible) of the recognition sequence. This can be checked by aligning these sequences, for example using Clustal Omega. One can also check that the resulting scrambled sequence doesn’t match any entry of databases such as HOCOMOCO or JASPAR.

Replicate this post

You can replicate this post by running the Rmd source with RStudio and R. The easiest way to replicate this post is to clone the entire git repository.

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Stockholm            
##  date     2021-02-14                  
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  blogdown      1.1     2021-01-19 [1] CRAN (R 4.0.3)
##  bookdown      0.21    2020-10-13 [1] CRAN (R 4.0.3)
##  cli           2.3.0   2021-01-31 [1] CRAN (R 4.0.2)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  htmltools 2021-01-22 [1] CRAN (R 4.0.2)
##  import        1.2.0   2020-09-24 [1] CRAN (R 4.0.2)
##  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
##  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
## [1] /Users/guillaume/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library