Idaho Spuds: May 2008

Thursday, May 15, 2008

The Design and Characterization of Two Proteins with 88% Sequence Identity But Different Structure

Biological structure and function of proteins are linked to sequence, but to what degree do minute changes in sequence affect structure? In 1994, Trevor Creamer and George Rose proposed the Paracelsus challenge, which sought to change the conformation of a protein while retaining at least 50% of its sequence. Three years later, Dalal et al. transformed the beta-1 domain of streptococcal protein G from mainly beta-sheets into four alpha-helices while retaining 50% of the original sequence, thereby meeting the Paracelsus challenge [1]. More recently, Alexander et al. from the University of Maryland looked at the behavior of two streptococcal protein G domains and iteratively mutated them until their sequences were 88% identical (differing in only 7 amino acids) while keeping very different folds and functions [2].

The protein studied was protein G, a cell wall protein from Streptococcus bacteria. The protein is composed of two separate domains: the first domain (GA) can bind to human serum albumin (HSA) found in blood, and the second (GB) can bind to a region of IgG, a human antibody also found in blood. The two domains have different folds: 3-alpha and alpha/beta, respectively. Each of the domains has a positive free energy of unfolding (delta G), indicating that the folded state is more stable than the unfolded state. These two domains (GA and GB) originated as PSD1 and GB1, each containing 56 amino acids. In the GA domain, only 47 of the amino acids were structured; the other 9 were disordered.

In the image below, amino acids shown in blue are identities and those shown in red are nonidentities. Spheres indicate mutations introduced in each design cycle.

Altering GA within the structured 47 amino acids can change the equilibrium of states from the GA 3-alpha fold to the alpha/beta fold of the GB domain. The goal of the project was to investigate the number of amino acids that could be changed in each protein while retaining biological function.

The first step was to introduce both the IgG-binding and the HSA-binding epitopes (the parts of a molecule that antibodies bind to) into both proteins. The HSA binding site in GA is composed of 7 amino acids, while the IgG binding site is composed of 4 amino acids in the central helix, one amino acid in the beta-3 strand, and the main chain contacts. The latent IgG binding site was introduced into GA through 3 mutations, while the HSA binding site was introduced into GB with 5 mutations; the mutants were named GA30 and GB30, respectively, to reflect their 30% sequence identity.

Several methods were used to characterize the proteins. The secondary structures of both proteins were determined using circular dichroism (CD), a form of spectroscopy that measures the differential absorption of left- and right-circularly polarized light [3]. Other methods included thermal denaturation to determine conformational stability, gel filtration to confirm monomeric behavior, and affinity chromatography to determine binding affinities to IgG and HSA. The addition of the latent binding sites did not significantly alter the thermodynamics of the unfolding reaction for either protein, although GB30 was less stable than the original, with a delta-G 3 kcal/mol lower. There was no variation in the heat capacity of the proteins, which is correlated with the solvent-exposed area, meaning the hydrophobic cores of the proteins were not disturbed. Affinity chromatography showed that both GA30 and GB30 bound to IgG and HSA in a similar fashion to GA and GB.

Next, changes were made in the remaining 39 non-identity residues using random mutagenesis and phage display. One method for performing random mutagenesis is error-prone PCR, in which adding Mn2+ or raising the Mg2+ concentration creates mutagenic conditions that cause random errors during DNA amplification [4]. Phage display relies on bacteriophages that display an encoded protein on their surface, which makes it possible to select for function. Phage carrying a functional protein can be captured using immobilized antigens, and the DNA encoding the selected protein is located within the phage [5]. Mutations to the remaining 39 residues were sorted into three categories: (i) mutations tolerated independently of mutations to other residues, (ii) mutations tolerated only when additional mutations are made, and (iii) mutations found to be rare.
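As a toy illustration of what random mutagenesis produces, the sketch below copies a DNA template while replacing each base with a different one at a fixed per-base error rate. It is not a model of the chemistry described in [4]; the template sequence, the 5% error rate, and the function names are all made up for the example.

```python
import random

BASES = "ACGT"

def mutate(dna, error_rate, rng):
    """Copy dna, replacing each base with a different base with
    probability error_rate (independently at every position)."""
    out = []
    for base in dna:
        if rng.random() < error_rate:
            # Pick one of the three other bases so the position really changes.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def count_differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)                  # fixed seed so the run is repeatable
template = "ATGGCTACCGATTTGCAAGGT"      # hypothetical 21-base template
library = [mutate(template, 0.05, rng) for _ in range(5)]

for variant in library:
    print(variant, count_differences(template, variant))
```

In a real experiment the resulting library would then go through a selection step such as phage display, which keeps only the variants that still bind their target.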

Mutations were made to 19 residues in GA30 and 8 residues in GB30, bringing the pair to 77% sequence identity. These new proteins were less stable (their free energy of unfolding decreased), but both retained a delta-G greater than 4 kcal/mol at 25 degrees C. The proteins did not show a decrease in binding affinity, and both remained monomeric.

The image below shows the stability curves for GA and GB.

The proteins were brought to 88% sequence identity by changing two sites in GA77 and four sites in GB77. GA88 and GB88 share 49 identical residues: nine residues that were initially the same, 16 mutations in GA, 17 mutations in GB, and the addition of seven residues to GA. The CD spectra remained close to those of the original proteins. The stability of both proteins decreased further, to 4 kcal/mol for GA88 and 2 kcal/mol for GB88 at 25 degrees C. Both GA88 and GB88 retained their binding specificity to HSA and IgG, respectively, but GA88 had a lower binding affinity than GA77. Both proteins remained monomeric at this level of identity.
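The 88% figure follows directly from the tally in this paragraph. The short sketch below redoes that arithmetic for a 56-residue domain and adds a generic percent-identity helper; the helper and the two 8-letter toy sequences are illustrative only and are not the real GA88 and GB88 sequences.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions with the same residue.
    Assumes the two sequences are already aligned and of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

LENGTH = 56                      # residues per domain, from the post
identical = 9 + 16 + 17 + 7      # tally of identical positions, from the post
different = LENGTH - identical

print(identical, "identical,", different, "different")
print(round(100.0 * identical / LENGTH, 1), "% identity")

# Toy check of the helper: 7 of 8 positions match.
print(percent_identity("ACDEFGHK", "ACDEFGHR"))
```

The tally gives 49 identical positions out of 56, or 87.5%, which rounds to the 88% in the title and leaves the 7 differing residues mentioned in the introduction.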

Mutating the seven non-identical residues can shift the protein from a 3-alpha fold that is 99.9% populated to an alpha/beta fold that is 97% populated. This study has implications for researchers working on computational protein folding prediction because proteins may be able to exist in one of many stable folded states as a result of mutations. Second, this study showed that a few mutations can alter the function of a protein given the inclusion of latent binding epitopes, which may not affect the function of the native state.
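The fold populations quoted here line up with the unfolding free energies reported for the 88% proteins (4 and 2 kcal/mol at 25 degrees C) if one assumes a simple two-state equilibrium between folded and unfolded protein. That assumption, the function name, and the choice of gas constant units are mine; the sketch only shows the conversion, not how the authors measured stability.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol*K)

def fraction_folded(delta_g_unfold, temp_c=25.0):
    """Folded fraction for a two-state system, given the free energy of
    unfolding in kcal/mol (positive means the folded state is favored)."""
    rt = R_KCAL * (temp_c + 273.15)
    k_fold = math.exp(delta_g_unfold / rt)   # ratio [folded] / [unfolded]
    return k_fold / (1.0 + k_fold)

for name, dg in [("GA88", 4.0), ("GB88", 2.0)]:
    print(name, round(100.0 * fraction_folded(dg), 1), "% folded")
```

Under this assumption, 4 kcal/mol corresponds to roughly 99.9% folded and 2 kcal/mol to roughly 97% folded, and a delta-G of zero would mean a 50/50 mixture.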

References

[1] Davidson AR. A folding space odyssey. Proc Natl Acad Sci U S A. 2008 Feb 26;105(8):2759-60. Epub 2008 Feb 19.
[2] Alexander PA, He Y, Chen Y, Orban J, Bryan PN. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci U S A. 2007 Jul 17;104(29):11963-8.
[3] "Circular dichroism." Wikipedia, The Free Encyclopedia. 11 Mar 2008, 19:51 UTC. Wikimedia Foundation, Inc. 16 Mar 2008 .
[4] Pritchard L, et al. A general model of error-prone PCR. J Theor Biol. 2005 Jun 21;234(4):497-509.
[5] Sidhu SS, Koide S. Phage display for engineering and analyzing protein interaction interfaces. Curr Opin Struct Biol. 2007 Aug;17(4):481-7. Epub 2007 Sep 17.

Alexander, P.A., He, Y., Chen, Y., Orban, J., Bryan, P.N. (2007). The design and characterization of two proteins with 88% sequence identity but different structure and function. Proceedings of the National Academy of Sciences, 104(29), 11963-11968. DOI: 10.1073/pnas.0700922104

Friday, May 02, 2008

Copy Number Variations in Human Disease

In 2006, Redon et al. examined the 270 individuals from HapMap to study the importance of copy number variations (CNVs) to genetic diseases and to provide a resource for other researchers. The term CNV refers to differences between individuals in the number of copies of a segment of genomic DNA. These copy number variations were found to cover 12% of the human genome. About 15% of the genes in the Online Mendelian Inheritance in Man (OMIM) morbid map (285 out of 1,961) overlapped with CNV regions, adding to the potential relevance of CNVs to human disease [Redon 2006].

Several groups have turned their focus to understanding the role of amplifications or deletions, which characterize CNVs, in diseases such as mental disease and cancer. Over 50 regions carrying unique amplifications or deletions were found in a cohort of 60 cancer patients [Lucito 2007]. Redon et al. chose the same samples that make up the HapMap dataset to ease the integration and comparison of CNV and SNP information.

CNV associations have been compared with known associations determined from SNP data. Sutrala et al. looked at 15 major candidate genes for schizophrenia, including dysbindin (DTNBP1), neuregulin (NRG1), RGS4 and DISC1, and found no copy number variations at these loci. This is consistent with previous studies in the area, suggesting that the two types of variation act independently. This observation is further corroborated by a model of complex traits showing that CNVs accounted for 18% of gene expression variation, while SNPs accounted for 84% [Stranger 2007]. The study by Stranger et al. showed that the phenotypic variations explained by the two types of variation were largely mutually exclusive in the HapMap samples of lymphoblastoid cell lines.

This new understanding of the potential importance of CNVs brings new challenges for analysts because of the lack of genotyped CNV data and the lack of methods for analyzing CNV data at the genome-wide scale. Both Stranger et al. and McCarroll and Altshuler point out that the low resolution of CNV data and potential issues with measurement precision could lead to inaccurate analyses. In 2007, McCarroll and Altshuler stated that, of the 1,500 CNVs that have been identified, researchers have genotyped only 70 at the quality necessary to carry out linkage disequilibrium studies. A recent paper by Ionita-Laza et al. presents a new method for family-based association tests (FBATs) that circumvents the need for genotypes and relies on intensity values instead. Family-based association tests are a key method used by geneticists to determine the association between genetic markers and disease, because many of the issues with the selection of controls can be avoided by using data from family members. Ionita-Laza et al. extended the FBAT statistic in such a way that the average intensity within a family corresponds to Mendelian transmissions. Using their method, Ionita-Laza et al. identified a potential association between the SNP rs2240832, which lies in a known CNV region on chromosome 7, and asthma in a sample of 400 parent/child trios [Ionita-Laza 2008].
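The published statistic is not reproduced here. What follows is only a simplified sketch of the general idea behind an intensity-based trio test, under two assumptions of mine: that with no association an affected child's expected intensity is the mean of the two parents' intensities, and that the variance can be estimated empirically from the per-family contributions. All names and intensity values are invented.

```python
import math

def trio_intensity_z(trios):
    """trios: list of (father, mother, child) probe intensities for
    families with an affected child. Returns a Z-like score that is
    large and positive when children tend to exceed the parental mean."""
    # Per-family deviation of the child from the mean of its parents.
    deviations = [child - (father + mother) / 2.0
                  for father, mother, child in trios]
    u = sum(deviations)                        # score statistic
    variance = sum(d * d for d in deviations)  # empirical variance estimate
    if variance == 0.0:
        return 0.0                             # no information in the data
    return u / math.sqrt(variance)

# Hypothetical intensities: every affected child sits above its parents.
excess = [(1.0, 1.0, 1.2), (0.9, 1.1, 1.2), (1.1, 1.1, 1.3), (1.0, 1.2, 1.3)]
# Hypothetical intensities: deviations cancel out across families.
balanced = [(1.0, 1.0, 1.5), (1.0, 1.0, 0.5)]

print(round(trio_intensity_z(excess), 2))
print(round(trio_intensity_z(balanced), 2))
```

Because each child is compared only with its own parents, the score does not depend on an external control group, which is the property of family-based tests highlighted above.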

Another problem with CNV data stems from the procedure used to collect it, array comparative genomic hybridization (array CGH). The method compares the relative hybridization intensities of labeled test and reference genomic DNA to determine raw intensity values [Pinkel and Albertson 2005]. Conrad et al. show that the frequency of CNV deletions increases as the size of the deletions becomes smaller. This presents a problem for current array CGH methods because the resolution afforded by the method is low, approximately 5–10 Mb [Conrad 2006].
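A common way to work with the test-versus-reference intensities described above is to take the log2 ratio at each probe and flag probes that depart from zero. The sketch below does this on invented intensities; the cutoffs of +0.3 and -0.3 are illustrative assumptions, not values taken from any of the cited papers.

```python
import math

def log2_ratios(test, reference):
    """log2(test / reference) for each probe; 0 means equal copy number."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def call(ratio, gain=0.3, loss=-0.3):
    """Label one probe using hypothetical gain and loss cutoffs."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"

# Made-up intensities for five probes along a chromosome.
test_dna = [200.0, 150.0, 100.0, 50.0, 105.0]
ref_dna = [100.0, 100.0, 100.0, 100.0, 100.0]

for ratio in log2_ratios(test_dna, ref_dna):
    print(round(ratio, 2), call(ratio))
```

For a diploid region, one extra copy gives a ratio of log2(3/2), about 0.58, and a single-copy deletion gives log2(1/2), exactly -1. A deletion smaller than the spacing between probes leaves no probe with a shifted ratio, which is the resolution problem raised in this paragraph.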

Research appears to be moving toward the integration of CNVs and SNPs for the purpose of understanding the role of variation in human disease. But fundamental issues of data resolution and quality must be resolved before this type of integration can take place. This is part of the much larger problem of classifying and integrating information on sequence and structure variation in genomes. These variations include the SNPs and CNVs discussed here, but also inversions, translocations, and deletions of varying scales which, currently, must be identified using different methods. Lee et al. point out that the small number of findings linking CNVs to phenotypic differences has a negative effect on the expanded use of higher-resolution array CGH for clinical purposes, which in turn affects the accumulation of CNV data that may prove useful.

[1] Redon R, et al. Global variation in copy number in the human genome. Nature. 2006 Nov 23;444(7118):444-54.
[2] Lucito R, et al. Copy-number variants in patients with a strong family history of pancreatic cancer. Cancer Biol Ther. 2007 Oct;6(10):1592-9. Epub 2007 Jul 12.
[3] Sutrala SR, et al. Gene copy number variation in schizophrenia. Am J Med Genet B Neuropsychiatr Genet. 2007 Dec 28 [Epub ahead of print].
[4] Stranger BE, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007 Feb 9;315(5813):848-53.
[5] McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet. 2007 Jul;39(7 Suppl):S37-42.
[6] Ionita-Laza I, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol. 2008 Apr;32(3):273-84.
[7] Pinkel D, Albertson DG. Comparative genomic hybridization. Annu Rev Genomics Hum Genet. 2005;6:331-54.
[8] Scherer SW, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet. 2007 Jul;39(7 Suppl):S7-15.
[9] Conrad DF, et al. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet. 2006 Jan;38(1):75-81. Epub 2005 Dec 4.
[10] Lee C, et al. Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat Genet. 2007 Jul;39(7 Suppl):S48-54.

Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W., Cho, E.K., Dallaire, S., Freeman, J.L., González, J.R., Gratacòs, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J.R., Marshall, C.R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M.J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D.F., Estivill, X., Tyler-Smith, C., Carter, N.P., Aburatani, H., Lee, C., Jones, K.W., Scherer, S.W., Hurles, M.E. (2006). Global variation in copy number in the human genome. Nature, 444(7118), 444-454. DOI: 10.1038/nature05329