Abstract
Childhood disease results from the interplay of genetic, environmental, and developmental factors. The rapid development of DNA technologies has unmasked the genetic diversity of individuals, almost all of which is not functionally significant, making it difficult to identify which variants might be contributing to phenotype. This chapter reviews genetic terminology and provides an overview of the strengths and limitations of currently available genetic testing and interpretation.
Keywords
exome sequencing, genome sequencing, genetic association studies, phenotype, genetic variation
Childhood disease results from the interplay of genetic, environmental, and developmental factors ( Fig. 3.1 ). The rapid development of DNA technologies has permitted the identification of the molecular basis of more than 5500 disorders caused by functional variation in more than 3500 genes (Online Mendelian Inheritance in Man, http://omim.org/statistics/geneMap, accessed June 28, 2016), including more than 100 monogenic lung diseases that contribute to a significant burden of pediatric pulmonary conditions. Cystic fibrosis resulting from disruption of CFTR is the most common of these monogenic pulmonary disorders, affecting 1 of 3000 births in Northern European populations. In contrast, asthma is the most common respiratory disease of childhood, affecting more than 300 million individuals worldwide. Asthma is heritable, but each of the multiple candidate genes identified to date individually explains only a small proportion of this heritability.
The availability of high-throughput sequencing has permitted extensive investigation into the “gene” arm of the gene × environment × development interaction, especially for complex traits like asthma, bronchopulmonary dysplasia (BPD), or neonatal respiratory distress syndrome (RDS) where evidence of heritability is present, but for which identified genetic variation explains only a small proportion of disease risk. For example, disparate outcomes of monozygotic twins suggest that genetic factors account for 50%–80% of the risk for BPD and approximately 50%–60% of the risk for neonatal RDS, but variation in identified genes accounts for only approximately 10% of this risk in RDS. Aside from the fact that only a limited number of genes or individuals have been studied, other sources of this “missing heritability” include environmental and developmental interactions, some of which are epigenetic. BPD, primarily a condition of premature infants, is a classic example of how developmental factors are integral to the mechanisms of disease. An exquisitely coordinated repertoire of genes interacts to form a lung that is structurally and functionally mature to exchange gas once birth occurs, but premature birth at a time when the lung is structurally and functionally immature has the potential to disrupt this cascade of normal development. Variation in developmentally expressed genes that are critical for lung maturation can contribute to abnormal lung development after premature birth and lung injury but would otherwise be silent in babies born at term.
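Twin-based heritability figures such as those quoted above are commonly derived by comparing trait similarity in monozygotic and dizygotic twin pairs. The following is a minimal sketch, assuming Falconer's formula (which this chapter does not itself state) and purely hypothetical correlation values; the function name is illustrative.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Estimate heritability as h^2 = 2 * (r_MZ - r_DZ), clipped to [0, 1].

    r_mz and r_dz are the trait correlations (or concordance-derived
    correlations) in monozygotic and dizygotic twin pairs.
    """
    h2 = 2.0 * (r_mz - r_dz)
    return max(0.0, min(1.0, h2))


# Hypothetical correlations, chosen only to illustrate the arithmetic.
estimate = falconer_heritability(r_mz=0.70, r_dz=0.40)
print(f"Estimated heritability: {estimate:.2f}")
```

The estimate rests on strong assumptions (for example, equally shared environments in both kinds of twin pair), which is one reason reported ranges are wide.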
Virtually all known mechanisms of genetic or genomic variation can cause pediatric lung diseases. A basic understanding of the types of variation is crucial to appreciating the advantages and limitations to various methods of genetic testing in current clinical use, as well as interpreting a rapidly expanding body of literature on the genetic basis of both monogenic and complex lung disease. Here we review some genetic terminology ( Table 3.1 ) and current strategies for genetic testing, but the reader may also wish to consult several excellent textbooks or review articles for more detail. The reader is also referred to the Human Genome Organization (HUGO) guidelines for gene nomenclature ( www.genenames.org ).
BASIC TERMINOLOGY | |
DNA | Double helix–shaped molecule composed of nucleotides that contains code for protein synthesis |
Nucleotide | Basic molecules of DNA and RNA composed of sugars, phosphates, and nitrogenous bases: adenine, cytosine, guanine, thymine [A, C, G, T] for DNA; adenine, cytosine, guanine, uracil [A, C, G, U] for RNA |
Gene | Sequence of DNA nucleotides that specifies amino acid order for a specific protein |
RNA | Typically single-stranded molecule composed of nucleotides that is transcribed from DNA and is essential for protein synthesis (messenger, transfer, and ribosomal RNA [mRNA, tRNA, rRNA]) |
Exon | The “coding” portion of a gene—the sequence that is transcribed into a messenger RNA for subsequent translation into a protein |
Intron | The “noncoding” portion of a gene—important in regulation of gene expression |
Complementary DNA (cDNA) | DNA sequence that is derived from reverse transcription of messenger RNA, commonly used to denote the location of a variant within a gene (e.g., c.456 G > T) |
Transcription | Process of copying DNA into RNA |
Translation | Process of synthesizing protein from RNA |
Exome | Portion of the genome formed by the exons or protein coding regions (~1% of human genome) |
Genome | Full genetic complement of an individual organism (3 billion base pairs in human) |
Genotype | Specific combination of alleles for a particular gene or locus |
Phenotype | Observable characteristic of an individual/organism resulting from interaction of genotype with environment |
Pleiotropy | One gene affects more than one phenotypic trait |
Polygenic trait | More than one gene contributes to a phenotypic trait |
Allele | Alternative form of a gene that may occur at a given gene locus |
Homozygous | Having two identical alleles at a gene locus |
Heterozygous | Having two different alleles at a gene locus |
Dominant allele | Allele that is phenotypically expressed whether or not the other allele is identical |
Recessive allele | Allele that is phenotypically expressed when the other allele is identical, but whose expression is masked in the presence of a dominant allele |
Haplotype | Group of variants on a single chromosome that are inherited together |
Linkage disequilibrium (LD) | Nonrandom association of alleles at different genetic loci |
TagSNP | Single nucleotide polymorphism in a region of the genome with high linkage disequilibrium that is informative of additional variants at other genomic positions |
Recombination | Rearrangement of genetic material resulting from crossing over of chromosomes during meiosis |
Germline variant | Variant in DNA sequence transmitted by egg or sperm, presumed to be present in all nucleated cells |
Somatic variant | Variant in DNA sequence arising in a specific tissue, presumed to be present in all cells derived from original progenitor cell |
Monozygotic twins | Twins resulting from a single zygote (single egg, single sperm) that separates in early development |
Dizygotic twins | Twins resulting from two separate zygotes (two eggs, two sperm) |
TYPES OF GENETIC VARIATION | |
Single nucleotide variant (SNV) | Variation in DNA sequence occurring at a single nucleotide position; variants with a frequency >1% are commonly called single nucleotide polymorphisms (SNPs) |
Synonymous variant | A single nucleotide variant that does not result in a change in the amino acid at that location |
Nonsynonymous variant | A single nucleotide variant that results in a change in the amino acid at that location; also called a missense variant |
Nonsense variant | Single nucleotide change that introduces a premature stop codon, which results in an unstable messenger RNA that is unable to be translated or, if translated, a truncated and often nonfunctional protein product |
Copy number variant (CNV) | Deletions or amplifications (e.g., duplications, triplications) of chromosomal segments; can arise during meiosis or somatic divisions |
Insertion/deletion “indels” | Insertions: extra nucleotides are inserted into the DNA sequence. Deletions: nucleotides are deleted from the DNA sequence |
In-frame indel | Insertion or deletion of multiples of three nucleotides that add or delete amino acids without disrupting the remainder of the amino acid sequence |
Frameshift indel | Insertion or deletion of nucleotides that disrupt the reading frame and the remainder of the amino acid sequence |
Mutation | Any variation in the nucleotide sequence of a gene; however, it typically connotes a deleterious effect. All of the aforementioned variants are technically “mutations.” The terminology is migrating to the following classification scheme. |
VARIANT CLASSIFICATION | |
Pathogenic | Very strong or strong evidence that a variant is disease causing |
Likely pathogenic | Strong to moderate evidence that a variant is disease causing |
Variant of uncertain significance (VUS) | DNA changes with too little information known to classify functionality |
Likely benign | Strong to supporting evidence that a variant is not disease causing |
Benign | Strong evidence that a variant is not disease causing |
Types of Genetic Variation
Genetic variation comes in many forms, and single nucleotide variants (SNVs), small insertions/deletions (indels), exonic deletions, and trinucleotide repeats have all demonstrated roles in the pathogenesis of lung disease ( Fig. 3.2 ). SNVs are single nucleotide changes in the genome and can occur in coding and noncoding regions. Variants in coding regions can have relatively little effect on protein function (e.g., no change in amino acid sequence [synonymous] or a conservative amino acid substitution), can result in a truncated or nonfunctional protein (e.g., nonsense SNVs, frameshift indels), or can have intermediate effects (nonconservative missense variants, splice site variants, and in-frame indels). SNVs that are present in greater than 1% of the population (minor allele frequency > 0.01) are referred to as single nucleotide polymorphisms (SNPs), and these “common” variants have enabled research methodologies, including genome-wide association studies (GWAS). SNVs present in less than 1% of the population are termed “rare” variants, and, although most are still of modest effect size, they are more likely than SNPs to have large effect sizes. The term “mutation,” often used interchangeably with SNVs and SNPs, has a more negative connotation and may even imply a “disease-causing change,” whereas the term “polymorphism” has a more neutral connotation and indicates a more common variant. Indels may affect one or more nucleotides and can be “in-frame,” meaning they occur in multiples of three nucleotides and add or delete amino acids in that region without disrupting the remainder of the amino acid sequence, or they can cause a “frameshift,” meaning they disrupt the reading frame and all subsequent amino acid sequence. In-frame indels generally result in less disruption to the protein, but, as evidenced by the delF508 mutation in CFTR, even the deletion of three bases and a single amino acid can be deleterious. Indels that shift the reading frame are frequently pathogenic.
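The categories described above can be made concrete with a short sketch. It assumes the standard genetic code, looks only at a single codon or an indel length, and uses illustrative function names; it is not a substitute for a clinical annotation tool, which must also consider transcripts, splice sites, and strand.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a coding SNV by comparing reference and altered codons."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "nonsynonymous (missense)"


def classify_indel(length_change: int) -> str:
    """Classify an indel by its net change in nucleotides (nonzero)."""
    if length_change == 0:
        raise ValueError("an indel must add or remove at least one nucleotide")
    return "in-frame" if length_change % 3 == 0 else "frameshift"


def frequency_class(minor_allele_frequency: float) -> str:
    """Apply the 1% convention separating common SNPs from rare variants."""
    return "common (SNP)" if minor_allele_frequency > 0.01 else "rare"


print(classify_snv("GAA", "GAG"))  # glutamate codon to glutamate codon
print(classify_snv("GAA", "TAA"))  # glutamate codon to stop codon
print(classify_indel(-3))          # a three-base deletion, as in delF508
print(frequency_class(0.002))
```

As the delF508 example in the text shows, the label returned here describes the molecular consequence only; an in-frame change can still be deleterious.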
Although the vast majority of monogenic lung disease is caused by SNVs, there are important examples of less common genetic variation causing lung disease. For example, the absence of dozens or hundreds of base pairs encompassing one or more exons (so-called exonic deletions) can cause serious disease. For the pulmonologist the prime examples are Duchenne and Becker muscular dystrophies. Methods for assaying SNVs, including Sanger sequencing, will not reliably detect exonic deletions (see Table 3.3 later in this chapter), and methods such as multiplex ligation-dependent probe amplification (MLPA) must be used when there is high clinical suspicion for these disorders. Disorders such as myotonic dystrophy result from expansion of three-nucleotide repeats (so-called trinucleotide repeats) in coding regions, which results in unstable and abnormal proteins. Methods for assaying SNVs will not reliably detect trinucleotide repeat expansions, and methods such as polymerase chain reaction (PCR) must be used when there is high clinical suspicion for these disorders.
In addition to single gene variation, copy number variation (deletions, duplications) may occur at the whole chromosome level (e.g., trisomy 13, 18, 21), where pulmonary effects are largely of secondary importance as compared with other organ dysfunction, but also at the microscopic or submicroscopic level, affecting only a portion of a single chromosome or as a translocation involving two or more chromosomes. Autosomal dominant disorders may be caused by deletions, autosomal recessive conditions may be unmasked by deletion of the “normal” allele, and translocations with breakpoints within a given gene may cause similar effects.
Technologies to Identify Genetic Variation
The Human Genome Project, with initial sequencing completed in 2001, has led to the development of multiple technologies of relevance to pediatric lung disease from both clinical and research perspectives. These technologies are being increasingly integrated into clinical medicine, often blurring the lines between clinical diagnostic evaluation and research. Thus it is important to understand both the technologies in current clinical use, as well as those being used on a research basis, to interpret an emerging body of literature and to anticipate the maturation of such technologies for clinical application (see Table 3.2 ).
Historically, chromosomal or cytogenetic variation was assayed using the classical karyotype or fluorescence in situ hybridization (FISH); however, array-based technologies have become more prevalent in both research and clinical care. Chromosomal microarray (CMA) consists of hundreds of thousands to millions of DNA probes (as oligonucleotides with or without SNPs) spotted onto a solid surface. DNA (or RNA) is hybridized to these probes, and hybridization is measured by a fluorescent reporter. CMA can assess for copy number variation (microdeletions, microduplications, and unbalanced translocations) and (when SNPs are used) regions of excessive homozygosity, suggesting increased risk for recessive or imprinting disorders. Although often used as a first-line tool for genetic analysis, CMA will not detect small changes in the sequence of single genes including point mutations, tiny duplications or deletions within a single gene (e.g., between array probes), or balanced chromosomal rearrangements (translocations, inversions). Microarray can also be used to assess RNA expression as a research tool (e.g., comparing RNA transcripts for an SNV predicted to alter exon-intron splicing).
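One way to picture how probe fluorescence translates into a copy number call is the log2 ratio of a patient signal to a reference signal. The sketch below is illustrative only: the comparison against a diploid reference, the threshold values, and the probe names are all assumptions, and clinical pipelines add normalization and segmentation across neighboring probes.

```python
import math


def log2_ratio(test_signal: float, reference_signal: float) -> float:
    """Return log2(test / reference) for one probe's fluorescence signal."""
    return math.log2(test_signal / reference_signal)


def call_copy_state(ratio: float, loss_threshold: float = -0.3,
                    gain_threshold: float = 0.3) -> str:
    """Call loss, gain, or normal from a log2 ratio (thresholds are assumed)."""
    if ratio <= loss_threshold:
        return "loss"
    if ratio >= gain_threshold:
        return "gain"
    return "normal"


# Idealized signals proportional to copy number, against a two-copy reference.
probes = {
    "probe_1": (2.0, 2.0),  # two copies in patient
    "probe_2": (1.0, 2.0),  # one copy (heterozygous deletion)
    "probe_3": (3.0, 2.0),  # three copies (duplication)
}
for name, (test, reference) in probes.items():
    ratio = log2_ratio(test, reference)
    print(name, round(ratio, 2), call_copy_state(ratio))
```

A balanced translocation leaves every ratio at zero, which is why, as noted above, CMA cannot detect balanced rearrangements.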
The International HapMap Project demonstrated that SNPs at different positions in the genome are frequently inherited together, a phenomenon known as linkage disequilibrium. Although millions of SNPs are present in the human genome, one can infer many of them through linkage disequilibrium by genotyping only a subset, the so-called “tagSNPs.” The use of tagSNPs facilitated the rise of GWAS using arrays that contain up to 2 million previously identified variants.
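The strength of linkage disequilibrium between two biallelic sites is often summarized by the statistic r², computed from haplotype frequencies. The following minimal sketch assumes phased haplotype counts are available and uses hypothetical counts; the function name is illustrative.

```python
def ld_r_squared(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> float:
    """Compute r^2 between two biallelic loci from phased haplotype counts.

    D = p_AB - p_A * p_B, and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    d = p_AB - p_A * p_B
    return d * d / denominator


# Hypothetical counts: alleles always travel together (complete LD) ...
print(ld_r_squared(50, 0, 0, 50))
# ... versus alleles combined at random (no LD).
print(ld_r_squared(25, 25, 25, 25))
```

When r² is close to 1, genotyping one site is nearly as informative as genotyping both, which is the property that makes a tagSNP useful.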
The Human Genome Project also fostered significant advancement in sequencing technologies. Classically, sequencing of single genes was done by Sanger sequencing. Highly reliable and able to generate reads of greater than 500 base pairs, Sanger sequencing is still used to validate limited numbers of selected variants generated by high-throughput sequencing technologies; however, it does not scale well to sequence multiple genes. “Next-generation” sequencing technologies allow rapid sequencing of hundreds of genes or even whole exomes or genomes simultaneously by combining microarray technology and parallel sequencing of short reads (50–400 base pairs) with computational techniques to align the fragments to a reference sequence.
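The computational step of aligning short reads to a reference can be illustrated with a deliberately naive sketch: slide the read along the reference and keep the placement with the fewest mismatches. The sequences are made up, the mismatch limit is an assumption, and production aligners use indexed data structures and handle indels, which this example does not.

```python
from typing import List, Optional, Tuple

Mismatch = Tuple[int, str, str]  # (reference position, reference base, read base)


def align_read(reference: str, read: str,
               max_mismatches: int = 1) -> Optional[Tuple[int, List[Mismatch]]]:
    """Return (start, mismatches) for the best ungapped placement, or None."""
    best: Optional[Tuple[int, List[Mismatch]]] = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + offset, ref_base, read_base)
            for offset, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) > max_mismatches:
            continue
        # Keep the earliest placement with the fewest mismatches.
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGGCT"  # hypothetical reference sequence
read = "GCATGACGAT"                 # hypothetical read carrying one difference
print(align_read(reference, read))
```

Each mismatch reported at the best placement is a candidate SNV; in practice many overlapping reads must agree before a variant is called.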
Whole exome sequencing (WES) targets the exonic (protein coding) regions of all approximately 20,000 genes in the genome simultaneously and has become a widely used clinical diagnostic tool to identify rare sequence variants in patients with a phenotype suspected to be due to disruption of a single gene. However, exons comprise only approximately 1% of the genome. Whole genome sequencing (WGS) has therefore emerged as a diagnostic tool and offers several advantages over exome sequencing, including detection of structural variation and variation in nonexonic regulatory, intronic, and intergenic regions. In silico programs to interpret structural and noncoding variation are emerging but lag behind those used for coding variation; currently, genome sequencing is not widely available and is significantly more expensive than exome sequencing. Clinical exome and genome sequencing is estimated to identify a causative variant in approximately 25% of trios (affected child plus both parents), but this depends largely on patient selection based on phenotype, family history, and suspected inheritance pattern. Clinical exome and genome sequencing can lead to the identification of previously reported variants or of suspected pathogenic variant(s) in a gene known to be associated with human disease. Alternatively, exome or genome sequencing may identify suspected pathogenic variants in a gene not previously associated with a human phenotype, in which case additional clinical or laboratory investigation may be needed to establish pathogenicity. Positive results from clinical exome or genome sequencing are highly accurate, but false-negative results can occur, depending on the quality of data from a specific genomic region. In addition, the clinician must consider how well the putative variant(s) explains the patient’s phenotype in terms of clinical assessment and interpretation of existing medical literature.
In some individuals, more than one candidate gene may be identified or, alternatively, no candidate variants may be identified. Reasons for negative findings include lack of coverage of a genomic region, limitations of variant prediction algorithms, polygenic inheritance, epigenetic mechanisms, or lack of an underlying genetic etiology. Reanalysis of sequence data may be of use as variant prediction algorithms mature, additional patients with similar phenotypes are reported, and the functions of additional genes are characterized.
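Sequencing a trio helps narrow the candidate list because each variant in the child can be sorted by how it was (or was not) inherited. The sketch below assumes genotypes are coded as the number of alternate alleles (0, 1, or 2) at an autosomal site; the variant identifiers and category labels are hypothetical.

```python
def inheritance_pattern(child: int, mother: int, father: int) -> str:
    """Sort a child's variant by comparing alternate-allele counts in a trio."""
    if child == 0:
        return "not carried by child"
    if mother == 0 and father == 0:
        return "candidate de novo"
    if child == 2:
        if mother >= 1 and father >= 1:
            return "homozygous, one allele from each parent"
        # One parent lacks the allele: consider a deletion of the other
        # allele, uniparental disomy, or a genotyping artifact.
        return "unexpected homozygosity"
    return "heterozygous, inherited"


# Hypothetical trio genotypes: (child, mother, father).
trio_variants = {
    "variant_1": (1, 0, 0),
    "variant_2": (2, 1, 1),
    "variant_3": (1, 1, 0),
    "variant_4": (2, 1, 0),
}
for variant_id, genotypes in trio_variants.items():
    print(variant_id, inheritance_pattern(*genotypes))
```

These labels only prioritize variants for review; whether any of them explains the phenotype still depends on the clinical and functional evidence discussed in the text.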
Interpretation of Genetic Variation
Because each individual has millions of genetic variants, the vast majority of which are functionally insignificant, the principal challenge of sequencing is interpretation rather than sequence generation. For previously described SNVs, many databases exist to assist in interpretation or classification, examples of which are listed in Table 3.2 . Some programs, such as ANNOVAR, efficiently combine the results of multiple in silico prediction algorithms. However, such databases are not infallible, and variant classification is very dynamic (i.e., a variant classified as “benign” or “of unknown significance” might be reclassified as additional information accumulates). ClinVar provides the most current cataloging of these variants and also allows for conflicting interpretations of the same variant ( http://www.ncbi.nlm.nih.gov/clinvar/ ). Furthermore, because sequencing has the potential of identifying novel variants, both the clinician and the researcher must have a basic understanding of in silico prediction methods and their limitations. Wu and Jiang provide an overview of available tools for this purpose and caution that, although specific algorithms may outperform others under specific conditions, best results are typically obtained by using multiple algorithms and integrating knowledge over all available domains. In addition to in silico prediction algorithms, functional studies and identification of variants in other diseased (or nondiseased) individuals can also aid in variant interpretation. Given the limitations of in silico prediction, there is increasing demand for biologic confirmation of these predictions.
Although a detailed exploration of the various methods of developing model systems is beyond the scope of this chapter, special attention should be paid to the emerging technology of CRISPR/Cas9, which permits targeted genome editing and is being applied to existing model organisms to test functionality suggested by in silico prediction tools.
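The recommendation to combine several in silico algorithms rather than rely on one can be sketched as a simple agreement rule. This is an assumption-laden illustration: the tool names, labels, and agreement threshold are hypothetical, and it does not reproduce how ANNOVAR or any specific program integrates predictions.

```python
from collections import Counter
from typing import Dict


def consensus_call(predictions: Dict[str, str],
                   min_agreement: float = 0.75) -> str:
    """Return the majority label if enough tools agree, else 'conflicting'."""
    if not predictions:
        return "no prediction"
    label, count = Counter(predictions.values()).most_common(1)[0]
    if count / len(predictions) >= min_agreement:
        return label
    return "conflicting"


# Hypothetical outputs from four unnamed prediction tools for one variant.
variant_predictions = {
    "tool_a": "damaging",
    "tool_b": "damaging",
    "tool_c": "damaging",
    "tool_d": "tolerated",
}
print(consensus_call(variant_predictions))
```

A "conflicting" result is itself informative: it flags variants for which functional studies or additional case reports are most needed.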
If one takes a genomic approach and obtains WES/WGS, most of the genes sequenced and variants identified will have nothing to do with the clinical phenotype. Medically relevant variants in such genes are termed secondary findings and are an area of significant controversy. As the availability of genomic testing spreads and nongeneticist clinicians are able to order such testing, they must be aware of the potential for secondary findings and provide adequate counseling on this possibility. The American College of Medical Genetics and Genomics has a position statement on secondary findings that can guide clinicians.