Here is a quick note to show how to calculate the theoretical refractive index increment (dn/dc) of a protein based on its sequence and on published dn/dc values of individual amino acids. This number can be used to determine protein concentration from measurements of relative refractive index, and is used in certain biophysical experiments such as SEC-MALLS. Very often, a consensus value for proteins is used, but it is actually not very difficult to calculate a value for each specific protein sequence of interest.
Getting dn/dc of individual amino acids
We will use dn/dc values of individual amino acids reported by Zhao et al (2011), using the version of the article provided by PubMed Central (PMC) because PMC’s normalized formatting will make it easier to automatically extract dn/dc values from its table 1 and will make this approach more easily reusable for other similar data extraction problems. The simple approach described here is an approximation and neglects the effects of buffer composition and temperature.
We will need the following functions and values from external R packages (some of them are quite big, so this approach of importing only the necessary bits is more convenient than loading everything, for the purpose of a blog post):
import::here(.from = magrittr, `%>%`) import::here(.from = xml2, read_html) import::here(.from = rvest, html_table) import::here(.from = tibble, as_tibble) import::here(.from = dplyr, select, arrange, mutate) # Biostrings comes from the Bioconductor repository. import::here(.from = Biostrings, AMINO_ACID_CODE, AA_STANDARD, readAAStringSet, alphabetFrequency)
Let’s begin by extracting the dn/dc values from the article’s table 1, and building a table that we can easily use from R. Let’s also convert amino acid names from 3-letter codes to 1-letter codes. This will be more convenient because protein sequences are more often found with the 1-letter code notation.
article <- read_html("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149238/") dndc_table <- article %>% html_table() %>% .[] %>% as_tibble() %>% select(aa = `Amino acid`, dndc_value = `dn/dc(ml/g)‡`) %>% arrange(aa) %>% mutate(amino_acid = names(AMINO_ACID_CODE[AMINO_ACID_CODE == aa])) %>% select(amino_acid, dndc_value)
At this point, we have collected the two relevant columns from the article’s table into a data frame easy to manipulate with R:
## # A tibble: 20 x 2 ## amino_acid dndc_value ## <chr> <dbl> ## 1 A 0.167 ## 2 R 0.206 ## 3 N 0.192 ## 4 D 0.197 ## 5 C 0.206 ## 6 Q 0.186 ## 7 E 0.183 ## 8 G 0.175 ## 9 H 0.219 ## 10 I 0.179 ## 11 L 0.173 ## 12 K 0.181 ## 13 M 0.204 ## 14 F 0.244 ## 15 P 0.165 ## 16 S 0.17 ## 17 T 0.172 ## 18 W 0.277 ## 19 Y 0.24 ## 20 V 0.172
Because R offers vectorized operations, we can simplify further downstream use of these numbers by transforming this table into a named vector:
dndc <- dndc_table$dndc_value names(dndc) <- dndc_table$amino_acid dndc
## A R N D C Q E G H I L K ## 0.167 0.206 0.192 0.197 0.206 0.186 0.183 0.175 0.219 0.179 0.173 0.181 ## M F P S T W Y V ## 0.204 0.244 0.165 0.170 0.172 0.277 0.240 0.172
Calculating dn/dc of a protein
The dn/dc value of a protein is the average of all its amino acids’ dn/dc values. One way to calculate it is to first calculate the amino acid composition of the given protein sequence as the frequency of each amino acid, then multiplying each amino acid’s dn/dc value by its frequency in this particular sequence, and finally adding up the frequency-weighted dn/dc values.
Here is an example, using the sequences of human TRF2 (one of the two
proteins I studied during my PhD). First, we can retrieve the sequence
Biostrings::readAAStringSet() from Bioconductor returns
a list of sequences, so in this case a single-element list; to make subsequent
operations more convenient we can extract this single element.
trf2 <- "https://www.uniprot.org/uniprot/Q15554.fasta" %>% readAAStringSet() %>% .[] trf2
## 542-letter "AAString" instance ## seq: MAAGAGTAGPASGPGVVRDPAASQPRKRPGREGG...GNWAAISKNYPFVNRTAVMIKDRWRTMKRLGMN
Then we can calculate dn/dc. R’s vectorized operations and the
that we constructed in the previous paragraph will make this easy to express.
Biostrings::alphabetFrequency() computes a count of occurrences or frequency
of each amino acid in a protein sequence.
# Calculate amino acid composition as frequencies. trf2_aa_comp <- alphabetFrequency(trf2, as.prob = TRUE) # Calculate dn/dc values. # We need to subset trf2_aa_composition to keep only frequency # values of standard amino acids, because we only have dn/dc # values for these ones. trf2_dndc <- sum(trf2_aa_comp[AA_STANDARD] * dndc) trf2_dndc
##  0.1836863
This is pretty close to the consensus value of 0.185 for unmodified proteins that is often used by default when a dn/dc value is needed.
Now that we can automatically calculate a dn/dc value from a protein sequence,
we can determine the distributions of dn/dc across entire proteomes (like in
figures 2, 4 and 6 and table 2 of the article). A starting point for such an
analysis is the reference proteomes provided by Uniprot and the
Biostrings::readAAStringSet(). The rest mostly consists of figuring out how to
operate on a large list of sequences (
AAStringSet) instead of a single
purrr package will probably help.
Another improvement would be to read the article more carefully, especially the equations, and implement a more accurate calculation of dn/dc.
Zhao H, Brown PH & Schuck P (2011) On the Distribution of Protein Refractive Index Increments. Biophysical Journal 100: 2309–2317 https://doi.org/10.1016/j.bpj.2011.03.004