Chapter 2 Basic Aspects of Cellular and Molecular Biology
The 20th century saw pulmonary medicine blossom with scientific advances. Although physiology remains at the core of this specialty, pulmonologist investigators are now at the leading edge of research in cell and molecular biology. These laboratory-based disciplines provide tools to study lung disease both in entire populations and at the level of individual proteins. Molecular biology encompasses both genetics and structural biology, which underpin cell biology, from which physiology emerges. Genetics helps to identify the alleles of genes that increase the risk of disease; structural and cell biology then aim to provide the mechanism. It is through an understanding of disease mechanisms that this century will see its major clinical breakthroughs. A working knowledge of these disciplines is therefore essential for today’s pulmonologist.
Genetic factors play an important role in diseases that affect the airways (asthma, chronic obstructive pulmonary disease [COPD], cystic fibrosis, primary ciliary dyskinesia), parenchyma (pulmonary fibrosis, Birt-Hogg-Dubé syndrome, tuberous sclerosis), and vasculature (hereditary hemorrhagic telangiectasia) of the lung (Table 2-1). Such conditions include simple monogenic disorders such as Kartagener syndrome and α1-antitrypsin deficiency, in which mutations of critical genes are sufficient to induce well-defined disease phenotypes. By contrast, many other disease processes affecting the lung are complex genetic traits in which inheritance subtly affects pathogenesis. This group of entities includes COPD, asthma, and idiopathic pulmonary fibrosis. Extending current understanding of the genetic basis of pulmonary conditions will be essential to provide new insights into their underlying pathophysiology, to make predictions about outcome, and to develop novel therapeutic strategies.
Identification of single-gene defects in families that show the same phenotype is now relatively straightforward, owing to completion of the human genome project and improvements in DNA sequencing. Consequently, the past 20 years have seen rapid progress in elucidation of the genetic basis of disease. This rate of progress can be appreciated by a consideration of the many years required to identify the gene associated with cystic fibrosis. Dorothy Hansine Andersen first defined the condition in 1938 when she described cystic fibrosis of the pancreas in association with lung and intestinal disease. Only later was it recognized to be a recessive condition. The sweat test that is used to diagnose the condition was developed after the detection of abnormal sweat electrolytes by Paul di Sant’Agnese in 1952. The search for the cystic fibrosis gene started in the early 1980s, and the gene was localized to chromosome 7 in 1985 through recognition of linkage with the highly polymorphic gene paraoxonase in many populations. This achievement was followed by the identification of additional markers more closely linked to the cystic fibrosis locus, MET and D7S8, allowing prenatal diagnosis of the disorder and eventually leading directly to the mapping of the causative gene in 1989 by teams headed by Lap-Chi Tsui, Francis Collins, and Jack Riordan. This gene was called the cystic fibrosis transmembrane conductance regulator (CFTR), and now more than 1000 different mutations have been identified that cause cystic fibrosis.
By contrast, what once took many groups a decade to complete can today be undertaken in a single laboratory in days. For example, modern exome sequencing enables all 180,000 exons encoded by the human genome to be characterized in an individual patient or an entire kindred. Although the exome equates to only 1% of the genome, or about 30 megabases, it is thought to contain 85% of the mutations responsible for mendelian disorders. This technology was recently used to identify the causative gene of Miller syndrome, a rare disorder that manifests with cleft palate, absent digits, and ocular anomalies. The entire exomes of four persons so affected were sequenced, allowing mutations to be identified in the causative gene encoding dihydroorotate dehydrogenase (DHODH).
The major challenges now are therefore no longer the single-gene disorders but complex genetic diseases such as cancer, COPD, asthma, and interstitial lung disease. These diseases are the result of interactions between multiple genes and environmental factors. Consequently, the diseases cluster within families but do not show a clear pattern of inheritance.
Many single-gene disorders have been linked with respiratory disease (see Table 2-1). They are perhaps best typified by the autosomal recessive condition α1-antitrypsin deficiency. This condition shows a clear genotype-phenotype correlation, and current understanding of its molecular basis has provided new insights into the pathogenesis of disease. α1-Antitrypsin is the archetypal member of the serine proteinase inhibitor (“serpin”) superfamily. It is synthesized in the liver and secreted into the plasma, where it is the most abundant circulating proteinase inhibitor. Most people of North European descent carry the normal M allele, but 1 in 25 carries the Z variant (Glu342Lys), which results in plasma α1-antitrypsin levels in the homozygote that are 10% to 15% of those seen with the normal M allele. The Z mutation causes the accumulation of α1-antitrypsin in the rough endoplasmic reticulum of the liver, predisposing the homozygote to the development of juvenile hepatitis, cirrhosis, and hepatocellular carcinoma. The greatly reduced circulating levels of α1-antitrypsin are unable to protect the lungs against proteolytic damage by neutrophil elastase, predisposing the Z homozygote to the development of early-onset emphysema.
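The carrier figure quoted above can be converted into an approximate frequency of Z homozygotes. The following is a rough sketch only. It assumes Hardy-Weinberg equilibrium, a simple two-allele (M and Z) model, and that "1 in 25" refers to the heterozygote frequency; none of these assumptions is stated in the text, and the function name is hypothetical.

```python
import math

def z_allele_frequency(carrier_fraction: float) -> float:
    """Solve 2*q*(1 - q) = carrier_fraction for the smaller root q.

    Assumes Hardy-Weinberg proportions with two alleles (M and Z),
    so heterozygotes occur at frequency 2*q*(1 - q).
    """
    # 2q - 2q^2 = c  ->  q = (1 - sqrt(1 - 2c)) / 2
    return (1.0 - math.sqrt(1.0 - 2.0 * carrier_fraction)) / 2.0

# "1 in 25" is taken from the text; reading it as the MZ carrier
# frequency is an assumption made for this illustration.
carrier_fraction = 1 / 25
q = z_allele_frequency(carrier_fraction)
homozygote_fraction = q ** 2  # expected ZZ frequency under Hardy-Weinberg

print(f"Z allele frequency q is about {q:.4f}")
print(f"ZZ homozygotes: about 1 in {1 / homozygote_fraction:.0f}")
```

Under these assumptions the Z allele frequency is close to 2%, and homozygotes are expected at roughly 1 in 2,400 people. Real populations depart from these idealized proportions, so the figure is an order-of-magnitude guide rather than a prevalence estimate.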
The structure of α1-antitrypsin is based on a dominant β-pleated sheet A and nine α-helices (Figure 2-1). This scaffold supports an exposed mobile reactive loop that presents a peptide sequence as a pseudosubstrate for the target proteinase. After docking, the proteinase is inactivated by a mousetrap-type action that swings it from the top to the bottom of the serpin in association with the insertion of an extra strand into β-sheet A (see Figure 2-1). This six-stranded protein bound to its target enzyme is then recognized by hepatic receptors and cleared from the circulation. The structure of α1-antitrypsin is central to its role as an effective antiproteinase but also renders it liable to undergo conformational change in association with disease. The Z mutation is at residue P17 (17 residues proximal to the key P1 amino acid that defines the inhibitory specificity of α1-antitrypsin) at the head of a strand of β-sheet A and the base of the mobile reactive loop (see Figure 2-1). The mutation opens β-sheet A, thereby favoring the insertion of the reactive loop of a second α1-antitrypsin molecule to form a dimer (see Figure 2-1). This dimer can then extend to form polymers that tangle in the endoplasmic reticulum of the liver to form the inclusion bodies resulting in liver disease. Support for this pathomechanism comes from the demonstration that Z α1-antitrypsin formed chains of polymers when incubated under physiologic conditions. The rate was accelerated by raising the temperature to 41° C and could be blocked by peptides that compete with the loop for annealing to β-sheet A. The role of polymerization in vivo was clarified by the finding of α1-antitrypsin polymers in inclusion bodies from the livers of Z α1-antitrypsin homozygotes (see Figure 2-1).
Figure 2-1 The molecular basis of α1-antitrypsin deficiency. α1-Antitrypsin may be considered to act by a mousetrap mechanism. A, After docking (left), the target proteinase (gray) is inactivated by movement from the upper to the lower pole of the protein (right). This is associated with insertion of the reactive loop (red) as an extra strand into β-sheet A (green). The mousetrap mechanism may be triggered spontaneously by point mutations in association with disease. The Z mutation (Glu342Lys) of α1-antitrypsin is at the head of a strand of β-sheet A (green) and the base of the reactive loop. B, Mutations in this region can destabilize β-sheet A to allow the insertion of a reactive loop of a second molecule (middle). This dimer then extends to form long chains of polymers (right). Each molecule of α1-antitrypsin in the polymer is shown in a different color. It is these polymers that tangle in the endoplasmic reticulum to cause inclusions resulting in liver disease. C, An inclusion body (arrow) from the liver of a patient with α1-antitrypsin deficiency (left). The inclusions are composed of chains of molecules of α1-antitrypsin (right).
(Modified from Gooptu B, Lomas DA: Conformational pathology of the serpins—themes, variations and therapeutic strategies, Annu Rev Biochem 78:147–176, 2009.)
Although many α1-antitrypsin deficiency variants have been described, only three other mutants of α1-antitrypsin have similarly been associated with plasma deficiency and hepatic inclusions: α1-antitrypsin Siiyama (Ser53Phe), α1-antitrypsin Mmalton (Phe52 deleted), and α1-antitrypsin King’s (His334Asp). All of these mutants lie in the shutter domain that controls opening of β-sheet A. They destabilize the molecule to allow the formation of loop-sheet polymers in vivo. Further investigations have shown that polymerization also underlies the mild plasma deficiency of the S (Glu264Val) and I (Arg39Cys) variants of α1-antitrypsin. The point mutations that are responsible for these variants have less effect on β-sheet A than does the Z variant. Thus, the associated rate of polymer formation is much slower than that for Z α1-antitrypsin, which results in less retention of protein within hepatocytes, milder plasma deficiency, and the lack of a clinical phenotype. However, if a mild, slowly polymerizing I or S variant of α1-antitrypsin is inherited with a rapidly polymerizing Z variant, then the two can interact to form heteropolymers within hepatocytes. These polymers underlie the inclusions that cause cirrhosis.
Emphysema associated with α1-antitrypsin deficiency results from lack of protection against proteolytic attack in the lungs associated with reduced levels of circulating proteinase inhibitor. This is particularly the case with individuals who smoke tobacco. The Z α1-antitrypsin that does escape from the liver into the circulation is less efficient in protecting the tissues from enzyme damage and, like M α1-antitrypsin, may be inactivated by oxidation of the P1 methionine residue. The demonstration that Z α1-antitrypsin can undergo a spontaneous conformational transition in association with liver disease raised the possibility that this might also occur within the lung. Indeed, polymers have been detected in bronchoalveolar lavage fluid in patients with Z α1-antitrypsin deficiency. This observation may have important implications for the pathogenesis of disease, because polymerization obscures the reactive loop of α1-antitrypsin, rendering the protein inactive as an inhibitor of proteolytic enzymes. Thus, the spontaneous polymerization of α1-antitrypsin within the lung will exacerbate the already reduced antiproteinase screen, thereby increasing the susceptibility of the tissues to proteolytic attack and increasing the rate of progression of emphysema. Finally, the α1-antitrypsin polymers themselves are inflammatory for neutrophils, which will also increase the proteolytic load in the lung. Recent data suggest that cigarette smoke can induce the intrapulmonary polymerization of Z α1-antitrypsin, thereby exacerbating the lung damage associated with smoking.
One approach to looking for genes associated with complex genetic disorders is the association study, which compares genetic variation between cases and controls (i.e., persons without disease) matched for various factors. The genetic variation commonly used in such studies is the single-nucleotide polymorphism (SNP), a single-base DNA sequence variation found approximately every 300 base pairs across the genome. These studies have been undertaken in patients with COPD matched with control subjects who do not have COPD but who are the same age and have the same smoking history and the same ethnic background. The early studies typically were small (100 to 150 cases plus controls) and often were confounded by failure to match cases and controls carefully. To increase the likelihood of finding a disease-associated gene, such studies frequently included SNPs in multiple genes in the same cohort. However, such multiple comparisons can produce false-positive results. If enough genes are studied, a variant will arise purely by chance that erroneously appears to be associated with the disease being studied. Careful statistical analysis is necessary to avoid this problem.
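The multiple-comparisons problem can be made concrete with a short calculation. The sketch below assumes that every variant tested is truly unassociated with disease and that the tests are independent, which is a simplification because nearby SNPs are correlated. The Bonferroni correction shown is one common remedy and is not necessarily the method used in the studies described; the function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests
    independent tests when no true association exists."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha."""
    return alpha / n_tests

# Illustrative numbers of variants tested in one cohort (hypothetical).
for n in (1, 10, 50, 1_000_000):
    fwer = familywise_error_rate(n)
    threshold = bonferroni_threshold(n)
    print(f"{n:>9} tests: P(at least one false positive) = {fwer:.3f}, "
          f"Bonferroni threshold = {threshold:.1e}")
```

With only 10 independent tests at the conventional 0.05 level, the chance of at least one spurious "hit" is already about 40%. For a million tests the corrected per-test threshold falls to 5 × 10⁻⁸.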
The analysis was made more complex in COPD by the inherent complexity of the disease phenotype—a heterogeneous mix of airway disease and emphysema. Indeed, larger family-based studies have shown the independent clustering of the airway disease and emphysema components of COPD within families. This finding suggests that different genetic factors predispose to each of these components of the phenotype. The only way to overcome the inherent variation in COPD is to focus on groups of patients with well-characterized disease components or to undertake studies with large sample sizes and then to replicate any positive findings in other cohorts. This is now the case with candidate gene studies, and good evidence has emerged to show that heterozygosity for α1-antitrypsin deficiency (phenotype PiMZ) and polymorphisms in genes involved in oxidative stress—those encoding microsomal epoxide hydrolase (EPHX1), glutathione S-transferase (GST-P1 and GST-M1), heme oxygenase (HMOX1), and superoxide dismutase 3 (SOD3)—are associated with an increased risk of COPD (Figure 2-2). More recently, a minor allele of an SNP in the matrix metalloprotease-12 gene (MMP12) has been shown to protect against COPD in adult smokers.
Figure 2-2 Genes implicated in chronic obstructive pulmonary disease (COPD). When smoke enters the alveolus, many of its constituent compounds are absorbed. Some of these are detoxified by an array of enzymes; those that escape detoxification cause local damage and inflammation. The influx and activation of inflammatory cells lead to the liberation of proteases that attack the extracellular matrix, primarily elastin. The effect of these proteases is attenuated by endogenous antiprotease activities, whereas growth factor signals are thought to modulate the repair and remodeling of the extracellular matrix.
The limitation of association studies using a candidate gene approach is that they are by definition restricted to pathways already recognized to be associated with the disease—in the case of COPD, the proteinase-antiproteinase balance, the oxidative stress pathway, and the integrity of the extracellular matrix (see later). Consequently, this approach lacks the capacity to identify unanticipated players and is thus restricted to hypothesis testing, rather than hypothesis generation.
In recent years, the collection of large cohorts of patients combined with technologic advances has allowed unbiased genome-wide association studies of many patients with lung disease. It is currently possible to use microarrays to assay up to a million different SNPs in the genome of a single patient. The variation in SNPs is then compared between cases and controls. The largest study was undertaken in a cohort from Bergen, Norway, and then replicated in the International COPD Genetics Network, the National Emphysema Treatment Trial with controls from the Normative Aging Study, and finally in the Boston Early Onset COPD cohort. Top hits from this analysis were SNPs in the α-nicotinic acetylcholine receptor CHRNA3/5 and the hedgehog-interacting protein HHIP. The first of these (rs8034191), in the α-nicotinic acetylcholine receptor, also was identified in three genome-wide association studies of lung cancer and is thought to be important in peripheral vascular disease and nicotine addiction. It is possible that this SNP functions as a marker for an addiction gene. People who carry this SNP may require more cigarettes to satisfy nicotine addiction, may inhale more deeply, and may find it more difficult to withdraw from cigarette smoking. If this hypothesis is correct, the disease-associated allele of this gene would account for 12% of the population risk for COPD.
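The comparison of SNP variation between cases and controls can be illustrated for a single SNP with a simple allelic test. The sketch below applies a Pearson chi-square test (1 degree of freedom) to a 2 × 2 table of allele counts. The counts are invented for illustration, the function name is hypothetical, and published genome-wide studies generally use more elaborate models that adjust for covariates and population structure.

```python
import math

def allelic_association(case_minor: int, case_major: int,
                        control_minor: int, control_major: int):
    """Return (odds_ratio, chi2, p_value) for a 2x2 allele-count table.

    Uses the Pearson chi-square statistic with 1 degree of freedom,
    without continuity correction.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    # For 1 degree of freedom, the upper-tail probability of chi2
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value

# Hypothetical data: 1000 cases and 1000 controls, two alleles each.
odds_ratio, chi2, p_value = allelic_association(
    case_minor=800, case_major=1200,
    control_minor=700, control_major=1300,
)
print(f"odds ratio = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

For these made-up counts the odds ratio is about 1.24 and the p-value is about 0.001. That would look convincing in a single-SNP study but is far from significant once a million SNPs have been tested, which is why genome-wide studies need very large cohorts and independent replication.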
In interpreting any association study, two caveats must be considered. First, many genetic association studies report false-positive findings owing to a failure to appreciate the prior probability of an association and the power of the study to detect a meaningful effect. When the prior probability of an association is low, that is to say when there is little functional or epidemiological data to support an association, the number of subjects required to guard against a false-positive result increases. Consequently, the identification of a genetic association in a single study must always be treated with caution. Clearly, in the case of the α-nicotinic acetylcholine receptor, it is possible to construct very plausible models for its potential role in COPD, so the prior probability is not low. Moreover, those studies in which it was identified were well powered. The second caveat, however, relates to the phenomenon of linkage disequilibrium. The combination of more than one genetic variant or allele is called a haplotype. Some haplotypes occur in the population more often than one would expect by random association of alleles. This can be caused by, but is not restricted to, the inheritance of blocks of adjacent genes on a chromosome. Clearly, nearby genes are less likely to be separated by recombination during gametogenesis than are more distantly spaced genes. Each population of humans has its own characteristic set of common haplotypes. In this light, a disease-associated SNP can more accurately be viewed as a marker of the haplotype that is associated with the disease under study, and the causative gene must be identified within that haplotype. Indeed, in the case of disease-associated SNPs in the α-nicotinic acetylcholine receptor, there does appear to be linkage disequilibrium with SNPs in the gene encoding iron-responsive element binding protein 2 (IREB2). IREB2 was identified from expression analysis in lung tissue from persons with COPD and then confirmed in three separate COPD cohorts. IREB2 is localized to the human epithelial cell surface and may play a role in protecting against epithelial damage from oxidative stress.
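Linkage disequilibrium, the non-random association of alleles described above, is usually quantified with the coefficient D and its normalized forms D′ and r². The sketch below computes all three for two biallelic loci from allele and haplotype frequencies. The frequencies are hypothetical and are not taken from the CHRNA3/5 or IREB2 data; the function name is likewise an assumption.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1
           and allele B at locus 2
    p_a  : frequency of allele A
    p_b  : frequency of allele B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if alleles associated at random.
    d = p_ab - p_a * p_b
    # The largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared

# Hypothetical example: two SNPs with allele frequencies of 0.3 each
# whose alleles travel together on 25% of chromosomes.
d, d_prime, r_squared = linkage_disequilibrium(p_ab=0.25, p_a=0.3, p_b=0.3)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r_squared:.3f}")
```

Random association would give a haplotype frequency of 0.3 × 0.3 = 0.09, so an observed value of 0.25 indicates strong disequilibrium. A high r² means that genotyping one SNP largely predicts the other, which is why a marker SNP can flag a causative variant elsewhere on the same haplotype.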
It is, of course, possible that a haplotype identified in large genome-wide association studies may contain multiple disease-associated genes, so each one needs individual validation. Indeed, many diseases appear to involve the interaction of multiple disease-associated alleles, each with relatively small contributions when studied individually. The chances of identifying an allele that imparts a small relative risk for developing a disease are improved both by increasing the numbers of cases studied (increased power) and by carefully selecting cases of the same phenotype. With diseases such as COPD that are likely to represent the final common pathway of many forms of lung damage, this consideration is particularly important.
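The way prior probability and power jointly determine how far a "significant" association can be trusted may be sketched with a simple Bayesian calculation, often called the false-positive report probability. This is an illustrative framing, not the analysis used in the studies discussed, and the values of alpha, power, and prior below are hypothetical.

```python
def false_positive_report_probability(alpha: float, power: float,
                                      prior: float) -> float:
    """Probability that a statistically significant association is
    in fact a false positive.

    alpha : significance threshold (type I error rate)
    power : probability of detecting a true association
    prior : probability, before the study, that the association is real
    """
    false_positives = alpha * (1.0 - prior)  # null variants that reach significance
    true_positives = power * prior           # real associations that are detected
    return false_positives / (false_positives + true_positives)

# Hypothetical scenarios as (label, alpha, power, prior).
scenarios = [
    ("plausible candidate, well powered", 0.05, 0.8, 0.1),
    ("plausible candidate, underpowered", 0.05, 0.2, 0.1),
    ("arbitrary SNP, well powered", 0.05, 0.8, 0.001),
]
for label, alpha, power, prior in scenarios:
    fprp = false_positive_report_probability(alpha, power, prior)
    print(f"{label}: {fprp:.2f}")
```

With a plausible candidate and good power, about a third of significant results would still be false under these assumptions. With low power the figure rises to roughly two thirds, and for an arbitrary SNP with a prior of 1 in 1,000 almost every hit at the 0.05 level is spurious. This is the reasoning behind demanding larger cohorts, stricter thresholds, and replication.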
The analysis of still larger numbers of patients with COPD has identified a disease-associated SNP in FAM13A. The role of this gene in disease is unclear, but expression has been associated with hypoxia. FAM13A also has been associated with lung function in a second independent study. A detailed analysis of these genes in well-characterized cohorts showed that SNPs in the α-nicotinic acetylcholine receptor are associated with smoking intensity, airflow obstruction, and emphysema, and SNPs in the hedgehog-interacting protein are associated with systemic features of COPD (low body mass index) and exacerbations, whereas SNPs in FAM13A are associated with airflow obstruction.