“Our true progress must continue to be based on research done painstakingly and accurately and on experience honestly and wisely interpreted.” John W. Kirklin, M.D., 1967
Section I: Generating knowledge from information, data, and analyses
Introducing the chapter
Cardiac surgery is among the most quantitative subspecialties in medicine. It started that way and continues to be that way. Studies in cardiac surgery reveal a complex, multifactorial, multilayered, multidimensional interplay among patient characteristics, variability of the heart disease, effect of the disease on the patient, conduct of the cardiac surgery, and response of the patient to treatment in the short and long term. Because cardiac surgeons were “data collectors” from the beginning of the subspecialty, it is understandable that efforts to improve quality and appropriateness of medical care while containing its costs found cardiac surgical results an easy target. The dawn of medical report cards made it evident that multiple factors influencing outcome must be taken into account to make fair comparisons. This scrutiny of results, often by the media, reveals that variability in performing technical details of operations, coupled with environmental factors often not under direct control of cardiac surgeons, contributes to differences in results.
Propensity toward data collection in cardiac surgery was reinforced in the 1970s and early 1980s by challenges from cardiologists to demonstrate not simply symptomatic improvement from cardiac surgery, but longer survival and better long-term quality of life (appropriateness). This resulted in one of the first large-scale, government-funded registries and an in-depth research database ( Box 7.1 ) of patients with ischemic heart disease, as well as a rather small, narrowly focused randomized trial (Coronary Artery Surgery Study ). It stimulated subsequent establishment by the Society of Thoracic Surgeons (STS) of what is now the largest nongovernmental registry of cardiac surgical data.
• BOX 7.1
Types of Databases
Lauer and Blackstone have reviewed various types and purposes of databases in cardiology and cardiac surgery. Among these are the following, each containing a different fraction of the information on the longitudinal healthcare of individual patients, and each constructed for differing purposes, although they at times overlap.
Registry
A database consisting of core data elements on every patient in a defined population.
These are generally classified as quality registries and as such generally do not require patient consent. Examples include national registries for all cardiac surgery performed at a hospital for which core Society of Thoracic Surgeons (STS) or EuroSCORE variables are abstracted. The implication is that a registry is an ongoing activity that is broad, but contains limited data content.
Research database
A database consisting of in-depth data about a defined subset of patients. A research database, in contrast to a registry, is narrow and deep. Williams and McCrindle have called such databases “academic databases” because they usually are constructed by those in academic institutions to facilitate clinical research.
Even with such a database, an individual study may use a fraction of the variables and then must add a number of new variables that are relevant to particular studies. Often the fixed structure of the database is not such that these additional variables can easily be assimilated, so they are often entered into ancillary databases. These ancillary databases may not be available as a resource to subsequent investigators (see Section II ). Data warehouses or, better, semantic data integration are key to making the entire research database accessible.
Administrative database
A database consisting of demographic variables, diagnostic codes, and procedural codes that are available electronically, generally from billing systems. Administrative databases are used by outcomes or health services research professionals for quality assessment.
National database
A database consisting of completely deidentified data, generally of limited scope but usually containing meaningful medical variables, including patient demography, past history, present condition, some laboratory and diagnostic data, procedure, and outcomes. National (and international) databases are intended to be used for general quality assessment; medical health quality improvement; government activities at a regional, national, or international level; and public consumption.
Thus, it is important for all in the subspecialty of cardiac surgery, not just those engaged in bench, translational, or clinical research, to (1) understand how information generated from observations made during patient care is transformed into data suitable for analysis, (2) appreciate at a high level what constitutes appropriate analyses of those data, (3) effectively evaluate inferences drawn from those analyses, (4) translate and apply new knowledge to better care for individual patients, and (5) disseminate that knowledge.
It is our desire that the reader realize these goals and not conclude prematurely that this chapter is simply a treatise on data science, biostatistics, outcomes research, epidemiology, biomathematics, or bioinformatics.
Thus, to lead from information to new knowledge, the vision is to bring together quantitative needs in structural biology, biochemistry, molecular biology, and genomics at the microscopic level and in medical, health services, health economics, and social systems disciplines at the macroscopic level, with analytic tools from computer science, mathematics, statistics, physics, and other quantitative disciplines. This vision transcends the current restrictiveness of traditional biostatistics in analysis of clinical information. This is why we emphasize in this chapter that the material is not simply for surgeons, their clinical research team, and consulting and collaborating biostatisticians, but also for a wider audience of professionals in a variety of quantitative and qualitative disciplines.
Who should read this chapter
This chapter should be read in whole or in part by (1) all cardiac surgeons, to improve their comprehension of the medical literature and hone their skills in its critical appraisal; (2) young surgeons interested in becoming clinical investigators, who need instruction on how to pursue successful research (see “ Technique for Successful Clinical Research ” later in this section); (3) mature surgeon-investigators and other similar medical professionals and their collaborating statisticians, mathematicians, and data and computer scientists, who will benefit from some of the philosophical ideas included in this section, and particularly from the discussion of emerging analytic methods for generating new knowledge; and (4) data managers and data scientists of larger clinical research groups, who need to appreciate their pivotal role in successful research, particularly as described in Sections I, II, and III of this chapter and Appendix 7A .
The potential obstacle for all will be language. For the surgeon, the language of statistics, mathematics, and computer science may pose a daunting obstacle of symbols, numbers, and algorithms. For collaborating statisticians, mathematicians, and computer scientists, the Greek and Latin language of medicine is equally daunting. For most, the language of implementation and dissemination science, behavioral science, and econometrics is unique. This chapter attempts to surmount the language barrier by translating ideas, philosophy, and unfamiliar concepts into words while introducing only sufficient statistics, mathematics, and algorithms to be useful for the collaborating scientist.
Because this chapter is intended for a mixed audience, it focuses on the most common points of intersection between cardiac surgery and quantitative and qualitative science, with the goal of establishing sufficient common ground for effective and efficient communication and collaboration. As such, it is not a substitute for statistical texts or academic courses, nor a substitute for the surgeon-investigator to establish a collaborative relationship with biostatisticians, nor is it intended to equip surgeons with sufficient statistical expertise to conduct highly sophisticated data analyses themselves.
How this chapter has evolved
At least three factors have contributed to the evolution of this chapter from edition to edition of this textbook: increasing importance of computers in analyzing clinical data, introduction of new and increasingly appropriate and applicable methods for analyzing those data, and growing importance of nontraditional machine learning methods for drawing useful and important inferences from medical data with fewer assumptions. In Edition 4, we introduced collaborators in the fields of artificial intelligence, ontology, and machine learning. Revision was also strongly influenced by the Institute of Medicine’s (IOM; now the National Academy of Medicine) Learning Health System initiative and the comparative effectiveness emphases of the Academy and NIH.
In this edition, we split off Quality Assurance as a separate chapter ( Chapter 8 ); amplify machine learning, causal inference, and graphical representation of mechanistic hypotheses as directed acyclic graphs; and cast patient-specific predictions (precision medicine) into the formal next-generation propensity arena of virtual twins. We hint at the emerging fields of implementation and dissemination science with their frameworks and mixed methods in the context of social determinants of health. The latter affect not only access to cardiac diagnosis and treatment, but also long-term outcomes of patients.
How the chapter is organized
The organizational basis for this chapter is the Newtonian inductive method of discovery. It begins with information about a microcosm of medicine, proceeds to translation of information into data and analysis of those data, and ends with new knowledge about a small aspect of nature. This organizational basis emphasizes the phrase “Let the data speak for themselves.” It is that philosophy that dictates, for example, placing “Indications for Operation” after, not before, presentation of surgical results throughout this book.
Information.
In health care, information is a collection of material, workflow documentation, and recorded observations (see Section II ). This information is largely in electronic (computer) format.
Data.
Data consist of organized values for variables, usually expressed symbolically (e.g., numerically) by means of a controlled vocabulary (see Section III ). Characterization of data includes descriptive statistics that summarize the data and express their variability.
Analysis.
Analysis is a process, using a large repertoire of methods, by which data are explored, important findings are revealed and unimportant ones suppressed, and relations are clarified and quantified (see Section IV ).
Knowledge.
Knowledge is the synthesis of information, data, and analyses, arrived at by applying them within study designs that yield evidence for evidence-based medicine based on average treatment effects and evidence for precision medicine based on individual treatment effects (see Section V ). New knowledge may take the form of clinical inferences , which are summarizing statements that synthesize information, data, and analyses, drawn with varying degrees of confidence that they are true. It may also include speculations , which are statements suggested by the data or by reasoning, often about mechanisms, without direct supportive data. Ideally, it also includes new hypotheses , which are testable statements suggested by reasoning, inferences, or speculations from the information, data, and analyses.
Knowledge dissemination.
To be useful, both to benefit science incrementally and to contribute to advancement of health and health care, knowledge must be disseminated. The clearest examples of this are scientific publications and presentations (see Section VI ). That is, to be most useful, what has been found must be made public. In this edition, we only hint at the rapidly evolving field of implementation and dissemination science and its barriers (see Section VII ), but this science is crucial for applying knowledge to routine clinical practice. The lag between discovery and its integration into routine clinical practice is estimated to be 17 years, and this needs to be shortened.
How to read this chapter
Unlike most chapters in this book, whose various sections and parts can be read somewhat randomly and in isolation, Section I of this chapter should be read in its entirety before embarking on other sections. It identifies the mindset of the authors; defends the rationale for emphasizing surgical success and failure; contrasts philosophies, concepts, and ideas that shape both how we think about the results of research and how we do research; lays out a technique for successful clinical research that parallels the surgical technique portions of other chapters; and for collaborating scientists engaged in analyzing clinical data, lays the foundation for our recommendations concerning data analysis.
Much of the material in this introductory section is amplified in later portions of the chapter, and we provide cross-references to these to avoid redundancy.
Driving forces of new knowledge generation
Many forces drive the generation of new knowledge in cardiac surgery, including the business economics of healthcare, need for innovation, clinical research, surgical success and failure, and awareness of medical error.
Economics
The economics of health and healthcare are driving changes in practice toward what is hoped to be less expensive, more efficient, yet higher quality care. Interesting methods for testing the validity of these claims have become available in the form of cluster randomized trials. In such trials (e.g., a trial introducing a change in physician behavior), patients are not randomized; physicians are. (Patients form the cluster being cared for by each physician.) This leads to inefficient studies that nevertheless can be effective with proper design and a large enough pool of physicians. It is a study design in which the unit of randomization (physician) is not the unit of analysis (individual patient outcome). Such trials appear to require rethinking of traditional medical ethics.
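To make the statistical inefficiency concrete, the sketch below applies the standard design-effect correction: when outcomes of patients cared for by the same physician are correlated, the effective number of independent observations shrinks by a factor of 1 + (m − 1) × ICC, where m is the average cluster size and ICC the intracluster correlation. The numbers are assumed purely for illustration, not taken from any particular trial.

```python
# A minimal sketch of the design-effect calculation for a cluster randomized trial.
# All numbers below are assumed for illustration only.
def effective_sample_size(n_patients: int, cluster_size: float, icc: float) -> float:
    """Effective number of independent observations under cluster randomization."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_patients / design_effect

# 2000 patients clustered under 100 physicians (about 20 patients each),
# with a modest intracluster correlation of 0.05:
print(effective_sample_size(2000, 20, 0.05))  # ~1026 "effective" patients
```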
Innovation
Just when it seems that cardiac surgery has matured, innovation intervenes and occurs at several levels. It includes new devices; new procedures; existing procedures performed on new groups of patients; simplifying and codifying seemingly disparate anatomy, physiology, or operative techniques; standardizing procedures to make them teachable and reproducible; and introducing new concepts of patient care. Many of these innovations apply beyond the field of cardiac surgery.
Yet, innovation is often at odds with cost reduction and is perceived as being at odds with traditional research. In all areas of science, however, injection of innovation is the enthalpy that prevents entropy, stimulating yet more research and development and more innovation. Without it, cardiac surgery would be unable to adapt to changes in managing ischemic heart disease, potential reversal of the atherosclerotic process, percutaneous approaches to valvular and congenital heart disease, and other changes directed toward less invasive therapy. What is controversial is (1) when and if it is appropriate to subject innovation to formal clinical trial and (2) the ethics of innovation in surgery, for which standardization is difficult.
Reducing the unknown
New knowledge in cardiac surgery has been driven by the quest to fill voids of the unknown, whether by clinical research or laboratory research. This has included research to clarify normal and abnormal physiology, such as the abnormal state of the body supported on cardiopulmonary bypass.
Clinical research has historically followed one of two broad designs: nonrandomized studies of observational patient cohorts (“clinical practice”) and randomized clinical trials. Increasing emphasis, however, is being placed on translational research, bringing basic research findings to the bedside and back. John Kirklin called this the “excitement at the interface of disciplines.” Part and parcel of the incremental risk factor concept (see “ Incremental Risk Factor Concept ” in Section IV) is that it is an essential link in a feedback loop that starts with surgical failure, proceeds to identifying risk factors, draws inferences about specific gaps in knowledge that need to be addressed by basic science, generates fundamental knowledge by the basic scientists, and ends by bringing these full circle to the clinical arena, testing and assessing the value of the new knowledge generated for improving medical care.
Surgical success and failure
Results of operative intervention in heart disease, particularly surgical failure, drive much of the new knowledge generated by clinical research. In the late 1970s and early 1980s, a useful concept arose about surgical failures: in the absence of natural disaster or sabotage, there are two principal causes of failure of cardiac operations to provide a desired outcome: (1) lack of scientific progress and (2) human error.
Human error
Increased awareness of medical error is driving the generation of new knowledge, just as it is driving increasing regulatory pressure and medicolegal litigation. Kirklin’s group at the University of Alabama at Birmingham (UAB) was one of the first to publish information about human error in cardiac surgery and place it into the context of cognitive sciences, human factors, and safety research. This interface of disciplines is essential for facilitating substantial reduction in injury from medical errors.
Surgical failure
Surgical failure is a strong stimulant of clinical research aimed at making scientific progress. With increasing requirements for reporting both outcomes and process measures in the United States (with “pay for performance”), there is now also an economic stimulus to reduce human error (see Chapter 8 , Quality Assurance). The term “human error” carries negative connotations that make it difficult to discuss error in the positive, objective way needed for root-cause analyses of surgical failures. It is too often equated with negligence or malpractice, and almost inevitably leads to blaming persons on the “sharp end” (caregivers), with little consideration of the decision-making, organizational structures, infrastructures, or other factors that are remote in time and distance (the “blunt end”).
Human error
Richardson in 1912 recognized the need to eliminate “preventable disaster from surgery.” Human error as a cause of surgical failure is not difficult to find, particularly if one is careful to include errors of diagnosis, delay in therapy, inappropriate operations, omissions of therapy, and breaches of protocol.
When we initially delved into what was known about human error in the era before Tenerife in the Canary Islands (1977), Three Mile Island (1979), Bhopal (1984), Challenger (1986), and Chernobyl (1986), events that contributed enormously to knowledge of the nature of human error, we learned two lessons from the investigation of occupational and mining injuries. First, successful investigation of the role of the human element in injury depends on establishing an environment of nonculpable error . An atmosphere of blame impedes investigating, understanding, and preventing error. How foreign this is to the culture of medicine! We take responsibility for whatever happens to our patients as a philosophical commitment. Yet cardiac operations are performed in a complex and imperfect environment in which every individual performs imperfectly at times. It is too easy when things go wrong to look for someone to blame. Blame by 20/20 hindsight allows many root causes to be overlooked.
Second, we learned that errors of omission exceed errors of commission . This is exactly what we found in ventricular septal defect (VSD) repair in the 1960s and 1970s ( Table 7.1 ), suggesting that the cardiac surgical environment is not so different from that of a gold mine, and we can learn from that literature.
TABLE 7.1
Management Errors Associated with 30 Hospital Deaths Following Repair of Ventricular Septal Defect (UAB, 1967 to 1979; n = 312)
Data from Rizzoli and colleagues.
| Error | Number |
|---|---|
| Operation despite severe pulmonary arterial disease | 3 |
| Undiagnosed and overlooked ventricular septal defects | 8 |
| Despite heart block, no permanent pacing electrodes inserted at operation | 1 |
| Clots in oxygenator and heat exchanger (and circle of Willis) | 1 |
| Extubation without reintubation within a few hours of operation in seriously ill infants | 4 |
| Self-extubation without reintubation in the face of low cardiac output | 1 |
| Transfusion of packed red blood cells of wrong type | 1 |
These two lessons reinforced some surgical practices and stimulated introduction of others that were valuable in the early stages of growth of the UAB cardiac surgery program: using hand signals for passing instruments, minimizing distractions, replying simply to every command, reading aloud the protocol for the operation as it proceeds, standardizing apparently disparate operations or portions thereof, and focusing morbidity conferences candidly on human error and lack of knowledge to prevent the same failure in the future. To amplify, these practices were enunciated as a “culture of clarity”—in today’s terms, a culture of transparency—by the late Robert Karp, MD, the end result of which is a reproducible and successful surgical endeavor. In the operating room, each individual on the surgical team is relaxed but alert:
• Hand signals serve to inform assistants and the scrub nurse or technician of anticipated needs for a relatively small number of frequently used instruments or maneuvers.
• Spoken communication is reserved for those out of the field of sight (i.e., the anesthesiologist and perfusionist). When verbalized, “commands” are acknowledged with a simple reply.
• Anticipated deviations from the usual are presented in the preprocedure “huddle.” Unanticipated deviations are acknowledged to all concerned as soon as possible.
• Successful routines are codified. These include chronology for anticoagulation and its reversal, myocardial management routines, and protocols controlled by the surgeon for commencing and weaning the patient from cardiopulmonary bypass.
• Technical intuitive concepts are articulated. For example, some think the VSD in tetralogy of Fallot is a circular hole. Thus, closing such a hole would simply involve running a suture circumferentially to secure a patch. Kirklin and Karp were able to describe the suture line as having four different areas of transition in three dimensions and precisely articulated names for those transitions. Each had a defined anatomic relationship to neighboring structures, so the hole became infinitely more interesting!
• Discussion of surgical failure is planned for a time when distractions are minimal. The stated goal is improvement, measurable in terms of reproducibility and surgical success. The philosophy is that events do not simply occur but have antecedent associations, so-called root-cause analysis. An attempt is made to determine if errors can be avoided and if scientific knowledge exists or does not exist to prevent future failure.
A major portion of the remainder of this chapter addresses acquisition and description of this new knowledge.
Categories of human error.
Slips are failures in execution of actions and are commonly associated with attention failures ( Box 7.2 ). Some external stimulus interrupts a sequence of actions or in some other way intrudes into them such that attention is redirected. In that instance, the intended action is not taken. Lapses are failures of memory. A step in the plan is omitted, one’s place in a sequence of actions is lost, or the reason for what one is doing is forgotten. Mistakes relate to plans and so take two familiar forms: (1) misapplication to the immediate situation of a good plan (rule) appropriate for a different and more usual situation and (2) application of a wrong plan (rule).
• BOX 7.2
Human Error
Largely based on definitions by James Reason, these terms are used in a technical fashion by cognitive scientists studying human error.
Error
Failure of a planned sequence of mental or physical activities to achieve its intended outcome in the absence of intervention of a chance agency to which the outcome can be attributed.
Slip
Failure in the execution of an intended action sequence, whether or not the plan guiding it was adequate to achieve its purpose.
Lapse
Failure in the storage stage (memory) of an action sequence, whether or not the plan guiding it was adequate to achieve its purpose.
Mistake
Failure or deficiency of judgment or inference involved in selecting an objective or specifying the means of achieving it, regardless of whether actions directed by these decisions run according to plan.
Slips and lapses constitute active errors . They occur at the physician–patient interface. Mistakes, in addition, constitute many latent errors . These are indirect errors that relate to performance by leaders, decision makers, managers, certifying boards, environmental services, and a host of activities that share a common trait: planning, decisions, ideas, and philosophy removed in time and space from the immediate healthcare environment in which the error occurred. These are a category of errors over which the surgeon caring for a patient has little or no control or chance of modifying because latent errors are embedded in the system. It is claimed by students of human error in other contexts that the greatest chance of preventing adverse outcomes from human error is in discovering and neutralizing latent error.
Inevitability of human error.
If one considers all the possibilities for error in daily life, what is remarkable is that so few are made. We are surrounded with unimaginable complexity, yet we cope nearly perfectly because our minds simplify complex information. Think of how remarkably accident-free our early-morning commutes to the hospital are while driving complex machines in complex traffic patterns while we plan our day and listen to the news, weather report, and commercials.
When this cognitive strategy fails, it does so in only a few stereotypical ways. Because of this, models have been developed, based largely on observation of human error, that mimic human behavior by incorporating a fallible information-handling device (our minds) that operates correctly nearly always, but is occasionally wrong. Central to the theory on which these models are based is that our minds can remarkably simplify complex information. Exceedingly rare imperfect performance is theorized to be the price we pay for being able to cope this way with complexity. The mechanisms of human error are purported to stem from three aspects of “fallible machines”: downregulation, upregulation, and primitive mechanisms of information retrieval. In the text that follows, we borrow heavily from the human factors work of James Reason.
Downregulation.
We call this habit formation, skill development, and “good hands.” Most activities of life, and certainly those of a skillful surgeon, need to become automatic. If we had to think about every motion involved in performing an operation, the task would become nearly impossible to accomplish accurately. It would not be executed smoothly and would be error prone. It is hard to quantify surgical skill. It starts with a baseline of necessary sensory-motor eye-hand coordination that is likely innate. It becomes optimized by aggregation of correct “moves” and steps as well as by observation. It is refined by repetition of correct actions, implying identification of satisfactory and unsatisfactory immediate results (feedback). Then comes individual reflection and codification of moves and steps by hard analysis. Finally, motor skills are mastered by a synthesis of cognition and motor memory. The resulting automaticity and reproducibility of a skillful surgeon make a complex operation appear effortless, graceful, and flawless. However, automaticity renders errors inevitable.
Skill-based errors occur in the setting of routine activity. They occur when attention is diverted (distraction or preoccupation) or when a situation changes and is not detected in a timely fashion. They also occur as a result of overattention. Skill-based errors are ones that only skilled experts can make—beautiful execution of the wrong thing (slip) or failure to follow a complex sequence of actions (lapse). Skill-based errors tend to be easily detected and corrected.
Rule-based errors occur during routine problem-solving activities. Goals of training programs are to produce not only skillful surgeons but also expert problem solvers. Indeed, an expert may be defined as an individual with a wide repertoire of stored problem-solving plans or rules. Inevitable errors that occur take the form of either inappropriate application of a good rule or application of a bad rule.
Upregulation.
Our mind focuses conscious attention on the problem or activity with which we are confronted and filters out distracting information. The price we pay for this powerful ability is susceptibility to both data loss and information overload. This aspect of the mind is also what permits distractions or preoccupations to capture the attention of the surgeon, who would otherwise be focused on the routine tasks at hand. In problem solving, there may be inappropriate matching of the patient’s actual condition to routine rules for a somewhat different set of circumstances. Some of the mismatches undoubtedly result from the display of vast quantities of undigested monitoring information about the patient’s condition. Errors of information overload need to be addressed by more intelligent computer-based assimilation and display of data.
Primitive mechanisms of information storage and retrieval.
The mind seems to possess an unlimited capacity for information storage and a blinding speed of information retrieval unparalleled as yet by computers. In computer systems, there is often a trade-off between storage capacity and speed of retrieval. Not so for the mind. The brain achieves this, apparently, not by storing facts but by storing models and theories—abstractions—about these facts. Furthermore, the information is stored in finite packets along with other, often unrelated, information. (Many people use the latter phenomenon to recall names by associating them with more familiar objects, such as animals.) The implications for error are that our mental image may diverge importantly from reality.
The mind’s search strategy for information achieves remarkable speed by having apparently just two tools for fetching information. First, it matches patterns. Opportunity for error arises because our interpretation of the present and anticipation of the future are shaped by patterns or regularities of the past. Second, if pattern matching produces multiple items, it prioritizes these by choosing the one that has been retrieved most often. This mechanism gives rise to rule-based errors in a less frequently occurring setting.
Conscious mind.
When automatic skills and stored rules are of no help, we must consciously think. Unlike the automaticity we just described, the conscious mind is of limited capacity but possesses powerful computational and reasoning tools, all those attributes we ascribe to the thought process. However, it is a serial, slow, and laborious process that gives rise to knowledge-based errors . Unlike stereotypical skill- and rule-based errors, knowledge-based errors are less predictable. Furthermore, there are far fewer opportunities in life for “thinking” than for automatic processes, and therefore the ratio of errors to opportunity is higher. Errors take the form of confirmation bias, confusion of association with causality, inappropriate selectivity, overconfidence, and difficulties in assimilating temporal processes.
The unusual ordering of material presented in the clinical chapters of this book was chosen by its original authors to provide a framework for thinking with the conscious mind about heart disease and its surgical therapy that would assist in preventing knowledge-based errors. For example, an algorithm (protocol, recipe) for successfully managing mitral valve regurgitation is based on knowledge of morphology, etiology, and detailed mechanisms of the regurgitation; preoperative clinical, physiologic, and imaging findings; natural history of the disease if left untreated; technical details of the operation; postoperative management; both early and long-term results of operation; and from all these considerations, the indications for operation and type of operation. Lack of adequate knowledge results in inappropriate use of a robot for mitral valve repair, too many mitral valve replacements, or suboptimal timing of operation.
Reducing errors.
We have presented this cognitive model in part because it suggests constructive steps for reducing human error and, thus, surgical failure.
It affirms the necessity for intense apprentice-type training that leads to automatization of surgical skills and problem-solving rules. It equally suggests the value of simulators for acquiring such skills. It supports creating an environment that minimizes or masks potential distractions. It supports a system that discovers errors and allows recovery from them before injury occurs. This requires a well-trained team in which each individual is familiar with the operative protocol and is alert to any departures from it. In this regard, de Leval and colleagues’ findings are sobering. Major errors were often recognized and corrected by the surgical team, but minor ones were not, and a higher number of minor errors was strongly associated with adverse outcomes. It is also sobering that self-reporting of intraoperative errors was of no value. Must there be a human factors professional at the elbow of every surgeon?
James Reason suggested that “cognitive prostheses” may be of value, some of which are being advocated in medicine. For example, there is much that computers can do to reduce medication errors. A prime target is knowledge-based errors. Reducing these errors may be achievable through computer-based artificial intelligence and novel modes of assembling, processing, and displaying information for the human mind. Finally, if latent errors are the root cause of many active errors, analysis and correction at the system level will be required. A cardiac surgery program may fail, for example, from latent errors traceable to management of the blood bank, postoperative care practices, ventilation systems, and even complex administrative decisions at the level of hospitals, universities with which they may be associated, and national health system policies and regulations within which they operate.
Lack of scientific progress
A practical consequence of categorizing surgical failures into two causes is that they fit the programmatic paradigm of “research and development”: discovery on the one hand and application of knowledge to prevent failures on the other. The quest to reduce injury from medical errors that has just been described is what we might term “development.” Thus, lack of scientific progress is gradually reduced by generating new knowledge (research), and human error is reduced in frequency and consequences by implementing available knowledge (development), a process as vital in cardiac surgery as it is in the transportation and manufacturing sectors.
Philosophy
Clinical research in cardiac surgery consists largely of patient-oriented investigations motivated by a quest for new knowledge to improve surgical results—that is, to increase survival early and in the long term; to reduce surgical complications; to enhance quality of life; and to extend appropriate operations to more patients, such as high-risk subsets. This inferential activity, aimed at improving clinical results, is in contrast to pure description of experiences. Its motivation also contrasts with those aspects of “outcomes assessment” motivated by regulation or punishment, institutional promotion or protection, quality assessment by outlier identification, and negative aspects of cost justification or containment. These coexisting motivations have stimulated us to identify, articulate, and contrast philosophies that underlie serious clinical research. It is these philosophies that inform our approach to analysis of clinical experiences.
Deduction versus induction
“Let the data speak for themselves.”
Arguably, Sir Isaac Newton’s greatest contribution to science was a novel intellectual tool: a method for investigating the nature of natural phenomena. His contemporaries considered his method not only a powerful scientific ally, but also a new way of philosophizing applicable to many other areas of human knowledge. His method had two strictly ordered aspects that for the first time were systematically expressed: a first, and extensive, phase of data analysis whereby observations of some small portion of a natural phenomenon are examined and dissected, followed by a second, less emphasized, phase of synthesis whereby possible causes are inferred and a small portion of nature revealed from the observations and analyses. This was the beginning of the inductive method in science: valuing first and foremost the observations made about a phenomenon, then “letting the data speak for themselves” in suggesting possible natural mechanisms.
This represented the antithesis of the deductive method of investigation that had been successful in the development of mathematics and logic (the basis for ontology-based artificial intelligence reasoning today). The deductive method begins with what is believed to be the nature of the universe (referred to by Newton as “hypothesis”), from which logical predictions are deduced and tested against observations. If the observations deviate from logic, the data are suspect, not the principles behind the deductions. The data do not speak for themselves. Newton realized that it was impossible to have complete knowledge of the universe. Therefore, a new methodology was necessary to examine just portions of nature, with less emphasis on synthesizing the whole. The idea was heralded as liberating in nearly all fields of science.
As the 18th century unfolded, the new method rapidly divided such diverse fields as religion into those based on deduction (fundamentalism) and those based on induction (liberalism). This philosophical dichotomy continues to shape the scientific, social, economic, and political climate of the 21st century.
Determinism versus empiricism
Determinism is the philosophy that everything—events, acts, diseases, decisions—is an inevitable consequence of causal antecedents. If disease and patients’ response to disease and its treatment were clearly deterministic and inferences deductive, there would be little need to analyze clinical data to discover their general patterns. Great strides are being made in linking causal mechanisms to predictable clinical response (see Section VII ). Yet many areas of cardiovascular medicine remain nondeterministic and incompletely understood. In particular, the relation between a specific patient’s response to complex therapy such as a cardiac operation and known mechanisms of disease appears to be predictable only in a probabilistic sense. For these patients, therapy is based on empirical recognition of general patterns of disease progression and observed response to therapy.
Generating new knowledge from clinical experiences consists, then, of inductive inference about the nature of disease and its treatment from analyses of ongoing empirical observations of clinical experience that take into account variability, uncertainty, and relationships among surrogate variables for causal mechanisms. Indeed, human error and its opposite—success—may be thought of as human performance variability.
Collectivism versus individualism
To better convey how new knowledge is acquired from observing clinical experiences, we look back to the 17th century to encounter the proverbial dichotomy between collectivism and individualism, so-called lumpers and splitters or forests and trees.
In 1603, during one of its worst plague epidemics, the City of London began prospective collection of weekly records of christenings and burials. In modern language, this was an administrative registry (see Box 7.1 ). Those “who constantly took in the weekly bills of mortality made little use of them, than to look at the foot, how the burials increased or decreased; and among the casualties, what has happened rare, and extraordinary, in the week current,” complained John Graunt. Unlike those who stopped at counting and relating anecdotal information, Graunt believed the data could be analyzed in a way that would yield useful inferences about the nature and possible control of the plague.
His ultimate success might be attributed in part to his being an investigator at the interface of disciplines. By profession he was a haberdasher, so Graunt translated merchandise inventory dynamics into terms of human population dynamics. He described the birth rate (rate of goods received) and death rate (rate of goods sold); he then calculated the population currently alive (the inventory).
Graunt then made a giant intellectual leap. In modern terms, he assumed that every person (any item on the shelf) was interchangeable with any other (collectivism). By assuming—no matter how politically, sociologically, or factually incorrect—that people are interchangeable, he achieved an understanding of the general nature of the birth-life-death process in the absence of dealing with specific named individuals (individualism). He attempted to discover, as it were, the general nature of the forest at the expense of ignoring characteristics of the individual trees.
Graunt then identified general factors associated with variability of these rates (risk factors, in modern terminology; see “ Multivariable Analysis ” in Section IV). From the City of London Bills of Mortality, he found that the death rate was higher when ships from foreign ports docked in the more densely populated areas of the city, and in households harboring domestic animals. Based on these observations, he made inferences about the nature of the plague—what it was and what it was not—and formulated recommendations for stopping its spread: avoid night air brought in from foreign ships (which we now know was not night air but rats), flee to the country (social distancing), separate people from animal vectors, and quarantine infected individuals. Although some of these inferences were incorrect, the recommendations were effective in stopping the plague for 200 years, until its cause and mechanism of spread were identified.
Lessons based on this therapeutic triumph of clinical investigation conducted 300 years ago include the following: (1) empirical identification of patterns of disease can suggest fruitful directions for research and eliminate some hypothesized causal mechanisms, (2) recommendations based on empirical observations may be effective until causal mechanisms and treatments are discovered, and (3) new knowledge is often generated by overview (synthesis), as well as by study of the fate of individual patients.
When generating new knowledge about the nature of heart disease and its treatment, it is important both to examine groups of patients (the forest) and to investigate individual therapeutic successes and failures (the trees). This is similar to the wave–particle duality of quantum physics, in which physical matter and energy can be thought of as discrete particles on the microhierarchical plane (individualism, splitting, trees), and as waves (field theory) on the macrohierarchical plane (collectivism, lumping, forests). Both views give valuable insights into nature, but they cannot be viewed simultaneously. Statistical methods emphasizing optimum discrimination for identifying individual patients at risk tend to apply to the former, whereas those emphasizing probabilities and general inferences tend to apply to the latter.
Continuity versus discontinuity in nature
To discover relationships between outcomes and items that differ in value from patient to patient (called variables ), a challenge immediately arises: Many of the variables related to outcome are measured either on an ordered clinical scale (ordinal variables), such as New York Heart Association functional class, or on a more or less granular and unlimited scale (continuous variables), such as age. Three hundred years after Graunt, the Framingham Heart Disease Epidemiology Study investigators were faced with this frustrating problem. Many of the variables associated with development of heart disease were continuously distributed ones, such as age, blood pressure, and cholesterol level. To examine the relationship of such variables to development of heart disease, it was then accepted practice to categorize continuous variables coarsely for constructing cross-tabulation tables. Valuable information was lost this way. The investigators recognized that a 59-year-old’s risk of developing heart disease is far closer to a 60-year-old’s than a coarse grouping of patients into the sixth versus the seventh decade of life can express. They therefore insisted on examining the entire spectrum of continuous variables rather than subclassifying the information.
What they embraced is a key concept in the history of ideas—namely, continuity in nature . The idea has emerged in mathematics, science, philosophy, history, and theology. In our view, the common practice of stratifying age and other continuous variables into a few discrete categories is lamentable because it loses the power of continuity (some statisticians call this “borrowing power”). Focus on small, presumed homogeneous groups of patients also loses the power inherent in a wide spectrum of related but heterogeneous cases. After all, any trend observed over an ever-narrower framework looks more and more like no trend at all! Like the Framingham investigators, we therefore embrace continuity in nature unless it can be demonstrated that doing so is not valid, useful, or beneficial. (Machine learning methods that use splitting rules may seem to stumble at this point, but repetition of analyses over hundreds or thousands of bootstrapped data samples followed by averaging achieves a close approximation to continuity in nature [see Section IV ].)
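The parenthetical point about splitting rules can be illustrated with a small simulation. The sketch below uses a toy dataset constructed for illustration (not any real cohort): a crude piecewise-constant estimate of event risk versus age is an intentionally discontinuous step function, but averaging such fits over many bootstrap resamples yields a curve that approaches the smooth underlying relation, illustrating how bagging recovers an approximation to continuity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "truth": event risk rises smoothly with age.
age = rng.uniform(40, 80, 2000)
event = rng.binomial(1, 1 / (1 + np.exp(-(-6 + 0.08 * age))))
grid = np.linspace(40, 80, 200)

def binned_fit_predict(x, y, grid, n_bins=4):
    """Crude piecewise-constant (binned) risk estimate evaluated on a grid."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    gidx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, n_bins - 1)
    return means[gidx]

single = binned_fit_predict(age, event, grid)     # a coarse step function

# Averaging the same discontinuous estimator over bootstrap resamples
# yields a nearly smooth curve (approximate continuity).
bagged = np.zeros_like(grid)
B = 500
for _ in range(B):
    i = rng.integers(0, len(age), len(age))
    bagged += binned_fit_predict(age[i], event[i], grid)
bagged /= B
```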
Single versus multiple dimensionality
The second problem the Framingham investigators addressed was the need to consider multiple variables simultaneously. Univariable (one variable at a time) statistics are attractive because they are simple to understand. However, most clinical problems are multifactorial. At the same time, clinical data contain enormous redundancies that somehow need to be taken into account (e.g., height, weight, body surface area, and body mass index are highly correlated and relate to the conceptual variable “body size”).
Cornfield came to the rescue of the Framingham investigators with a new methodology called multivariable logistic regression (see “ Logistic Regression Analysis ” in Section IV). It permitted multiple factors to be examined simultaneously, took into account redundancy of information among variables (collinearity), and identified a parsimonious set of variables for which the investigators coined the term “factors of risk” or risk factors (see “ Parsimony Versus Complexity ” later in this section and “ Multivariable Analysis ” in Section IV).
Various forms of multivariable analysis, in addition to logistic regression analysis, are available to clinical investigators. Their common theme is to identify patterns of relationships between outcome and a number of variables considered simultaneously. These are not cause–effect relations, but associations with underlying causal mechanisms (see discussion of surrogates under “ Multivariable Analyses ” in Section IV). The relationships that are found may well be spurious, fortuitous, hard to interpret, and confusing because of the degree of correlation among variables. For example, women may be at higher risk of mortality after certain cardiac procedures, yet female sex may not itself be a “risk factor” but rather a marker for other, more general variables related to risk in both women and men, such as body size. In this instance, it is simultaneously true that (1) being female is not per se a risk factor, but (2) women are at higher risk by virtue of the fact that, on average, they are smaller than men.
This means that a close collaboration must exist between statistical experts and surgeons, particularly in organizing variables for analysis and interpreting findings from analyses.
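The female-sex example above can be made concrete with a small simulation (a toy dataset constructed for illustration; it assumes the statsmodels package is available). Risk is generated to depend only on body size, yet a univariable analysis makes female sex look like a risk factor; adding body size to the model removes most of the apparent effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

# Simulated cohort: women are smaller on average; risk depends only on body size.
female = rng.binomial(1, 0.4, n)
bsa = rng.normal(1.9 - 0.25 * female, 0.15, n)            # body surface area (m^2)
death = rng.binomial(1, 1 / (1 + np.exp(-(2.0 - 2.5 * bsa))))

# Univariable model: female sex appears to carry risk ...
uni = sm.Logit(death, sm.add_constant(female.astype(float))).fit(disp=0)

# ... but after adjusting for body size, its coefficient shrinks toward zero.
X = sm.add_constant(np.column_stack([female, bsa]).astype(float))
multi = sm.Logit(death, X).fit(disp=0)
print(uni.params, multi.params)
```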
Linearity versus nonlinearity
Risk factor methodology introduced another complexity besides increased dimensionality. The logistic equation is a symmetric S-shaped curve that expresses the relationship between a scale of risk, called logit units , and a corresponding scale of absolute probability of experiencing an event ( Fig. 7.1 ).
Fig. 7.1 Fundamental logistic relation of a scale of risk (logit units) to absolute probability of an event. (A) The logistic relation: risk factors translated into logit units are depicted along the horizontal axis, and probability of the outcome event along the vertical axis. The logistic equation is inset, where exp is the natural exponential function. (B) Relation between cardiac index and probability of hospital death in cardiac failure, determined by logistic regression analysis of data obtained in the intensive care unit (UAB). Cardiac index (L · min⁻¹ · m⁻²) is plotted along the horizontal axis. z describes the transformation of cardiac index to logit units, where Ln is the natural logarithm. If the data were replotted with the transformation to logit units along the horizontal axis, the depiction would reflect some portion of the curve in A.
The nonlinear relationship between risk factors and probability of outcome makes medical sense. Imagine a risk factor with a logit unit coefficient of 1.0 (representing an odds ratio of 2.7; Box 7.3 and see Fig. 7.1 ). If all other things position a patient far to the left on the logit scale, a 1-logit-unit increase in risk results in a trivial increase in the probability of experiencing an event. But as other factors move a patient closer to the center of the scale (0 logit units, corresponding to a 50% probability of an event), a 1-logit-unit increase in risk makes a huge difference. This is consistent with the medical perception that some patients experiencing the same disease, trauma, or complication respond quite differently. Some are medically robust because they are far to the left (low-risk region) on the logit curve before the event occurred. Others are medically fragile because their age or comorbid conditions place them close to the center of the logit curve. For the latter, a 1-logit-unit increase in risk can be “the straw that breaks the camel’s back.” It is this kind of relation that makes it hard to demonstrate, for example, the benefit of bilateral internal thoracic artery grafting in relatively young adults followed for even a couple of decades, but easy in patients who have other risk factors. The same has been demonstrated for risk of operation in patients with aortic regurgitation and low ejection fraction.
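The dependence of absolute risk on position along the logit scale is easy to verify numerically. The short sketch below (with baseline logit values assumed only for illustration) applies the inverse logistic transformation and shows that the same 1-logit-unit increase produces a trivial absolute change for a patient far to the left of the curve but a large change for a patient near its center.

```python
import numpy as np

def logit_to_prob(z):
    """Inverse logistic transformation: probability = 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))

# The same 1-logit-unit increase (odds ratio ~2.7) at different starting points:
for baseline in (-5.0, -2.0, 0.0):
    p0 = logit_to_prob(baseline)
    p1 = logit_to_prob(baseline + 1.0)
    print(f"baseline logit {baseline:+.0f}: {p0:.3f} -> {p1:.3f} "
          f"(absolute increase {p1 - p0:.3f})")
```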
• BOX 7.3
Expressions of Relative Risk
Proportion
Consider two groups of patients, A and B. Mortality in group A is 10 of 40 patients (25%); in B, it is 5 of 50 patients (10%). For the sake of illustrating the various ways these proportions (see later Box 7.11 ), 0.25 and 0.10, can be expressed relative to one another, designate a as the number of deaths (10) in A and b as the number alive (30). The total in A is a + b (40) patients, n A . Designate c as the number of deaths (5) in B and d as the number alive (45). The total in B is c + d (50) patients, n B . Designate P A as the proportion of deaths in A, a/(a + b ) or a/n A , and P B as the proportion in B, c/(c + d ) or c/n B .
Relative risk (risk ratio)
Relative risk is the ratio of two probabilities. In the previous example, relative risk of A compared with B is P A / P B = [ a /( a + b )]/[ c /( c + d )] = 0.25/0.10 = 2.5. Equivalently, one could reverse the proportions, P B / P A = 0.10/0.25 = 0.4. If P A were to exactly equal P B , relative risk would be unity (1.0). Another way to express relative risk when comparing two treatments is relative risk reduction , which is 1 minus the relative risk formed with the lower-risk group in the numerator (here, 1 − 0.4 = 0.6). This is mathematically identical to dividing the absolute difference in proportions by the higher of the two: ( P A − P B )/ P A .
Odds and gambler’s odds
The odds of an event is the number of events divided by the number of non-events. In the previous example, the odds of death in A is a/b = 10/30 = 0.33; in B, it is c/d = 5/45 = 0.11. The mathematical interrelation of the probability ( P ) of an event and the odds ( O ) is: O = P /(1 − P ) and P = O /(1 + O ). A probability of 0.1 is an odds of 0.11, but a probability of 0.5 is an odds of 1, of 0.8 an odds of 4, of 0.9 an odds of 9, and of 1.0 an odds of infinity. Often, it is interesting to examine the odds of the complement (1 − P ) of a proportion, (1 − P )/ P , which is gambler’s odds . Thus, a P value of .05 is equivalent to an odds of .053 and a gambler’s odds of 19:1. A P value of .01 has a gambler’s odds of 99:1, and a P value of .2 has a gambler’s odds of 4:1.
Odds ratio and log odds
The odds ratio is the ratio of odds. In the previous example, the odds ratio of A compared with B is (a/b)/(c/d) = ad/bc , which is either (10/30)/(5/45) = 3 or (10 · 45)/(30 · 5) = 3.
Note that the logit transformation used in logistic regression is Ln[ P /(1 − P )]. For A, P A /(1 − P A ) is a/b , the odds of A. Thus, Ln[ P /(1 − P )] is the log odds. Logistic regression can then be thought of as an analysis of log odds. Exponentiation of a logistic coefficient for a dichotomous (yes/no) risk factor from such an analysis re-expresses it in terms of the odds ratio for those with versus those without the risk factor (see later Box 7.17 ).
When the probability of an event is low, say less than 10%, relative risk (RR) and the odds ratio (OR) are numerically nearly the same. The mathematical relation is RR = [(1 − P A )/(1 − P B )] · OR. In the previous example, the relative risk was 2.5 but the odds ratio was 3, and the disparity between the two increases as the probability of the event increases.
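The arithmetic in this box can be checked in a few lines of code. The sketch below simply recomputes the quantities from the worked example (10 deaths among 40 in group A, 5 among 50 in group B) and verifies the stated relation between relative risk and the odds ratio.

```python
# Worked example from this box: group A, 10 deaths of 40; group B, 5 deaths of 50.
a, b = 10, 30                    # deaths and survivors in A
c, d = 5, 45                     # deaths and survivors in B
p_a, p_b = a / (a + b), c / (c + d)       # 0.25 and 0.10

rr = p_a / p_b                   # relative risk = 2.5
odds_a, odds_b = a / b, c / d    # 0.33 and 0.11
odds_ratio = odds_a / odds_b     # = a*d / (b*c) = 3.0

# Relation between the two ratios: RR = [(1 - P_A) / (1 - P_B)] * OR
assert abs(rr - (1 - p_a) / (1 - p_b) * odds_ratio) < 1e-12
```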
Relative risk is easier for most physicians to grasp because it is simply the ratio of proportions. It is unusual to encounter a physician without an epidemiology background who understands the odds ratio.
Expressing relative risk and odds ratios
Both relative risk and odds ratios are expressed on a scale of 0 to infinity. However, all odds ratios less than 1 are squeezed into the range 0 to 1, in contrast to those greater than 1, which are spread out from 1 to infinity. It is thus difficult to visualize that an odds ratio of 4 is equivalent to one of 0.25 if a linear scale is used. We recommend that a scale be chosen to express these quantities with equal distance above and below 1.0. This can be achieved, for example, by using a logarithmic or logit presentation scale.
Risk difference (absolute risk reduction) and number to treat
The risk difference is the difference between two proportions. In the previous example, P A − P B is the risk difference. In many situations, risk difference is more meaningful than risk ratios (either relative risk or the odds ratio). Consider a low probability situation with a risk of 0.5% and another with a risk of 1%. Relative risk is 2. Yet risk difference is only 0.5%. In contrast, consider a higher-probability situation in which one probability is 50% and the other 25%. Relative risk is still 2, but risk difference is 25%. These represent the proverbial statement that “twice nothing is still nothing.” They reflect the relation between the logit scale and absolute probability (see Fig. 7.1 A), recalling that the logit scale is one of log odds.
An alternative way to express a difference in probabilities, when the difference is arranged to be positive (e.g., P_A − P_B) and thus expresses absolute risk reduction, is as its inverse, 1/(P_A − P_B). This expression of absolute risk reduction is called number needed to treat. It is useful in many comparisons in which it is meaningful to answer the question, "How many patients must be treated by B (compared with A) to prevent one event (death)?" In our example, absolute risk reduction is 25% − 10% = 15%, and number needed to treat is 1/0.15 = 6.7. Number needed to treat is particularly valuable for thinking about risks and benefits of different treatment strategies. If it is large, one may question whether switching treatments is worth the risk; if it is small, the benefit of doing so becomes more compelling.
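These summary measures can be collected in a few lines. The following sketch (ours, for illustration only) uses the running example (a = 10, b = 30, c = 5, d = 45) and verifies the relation between relative risk and the odds ratio given earlier.

```python
a, b = 10, 30   # treatment A: deaths, survivors
c, d = 5, 45    # treatment B: deaths, survivors

p_a = a / (a + b)                              # 0.25
p_b = c / (c + d)                              # 0.10

relative_risk = p_a / p_b                      # 2.5
odds_ratio = (a * d) / (b * c)                 # 3.0
# RR and OR are linked by RR = [(1 - P_A)/(1 - P_B)] * OR
assert abs(relative_risk - (1 - p_a) / (1 - p_b) * odds_ratio) < 1e-12

absolute_risk_reduction = p_a - p_b            # 0.15
number_needed_to_treat = 1 / absolute_risk_reduction   # ~6.7 patients

print(relative_risk, odds_ratio, absolute_risk_reduction, number_needed_to_treat)
```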
Hazard ratio
In time-related analyses, it is convenient to express the model of risk factors in terms of a log-linear function (see later Box 7.17 and "Cox Proportional Hazards Regression" in Section IV): Ln[λ(t)] = Ln[λ_0(t)] + β_1·x_1 + … + β_k·x_k, where Ln is the natural logarithm, λ(t) is the hazard function, and λ_0(t) is the underlying (baseline) hazard function. The regression coefficients, β, for a dichotomous risk factor thus represent the logarithm of the ratio of hazard functions, and their exponentials are hazard ratios. Hazard ratios, as well as relative risk and the odds ratio, can be misleading in magnitude (large ratios, small risk differences) in some settings. Hazard comparisons, just like survival comparisons, often are more meaningfully and simply expressed as differences.
This type of sensible, nonlinear medical relation makes us want to deal with absolute risk rather than relative risk or risk ratios (see Box 7.3 ). Relative risk is simply a translation of the scale of risk, without regard to location on that scale. Absolute risk integrates this with the totality of other risk factors.
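As an illustration of how a Cox regression coefficient is re-expressed as a hazard ratio, here is a sketch (ours, not from the chapter) using simulated follow-up data and the lifelines package. The column names, the simulated doubling of hazard, and the assumption of no censoring are all invented for illustration, and a reasonably recent lifelines version exposing CoxPHFitter and its params_ attribute is assumed.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)

# Hypothetical follow-up data in which a binary risk factor doubles the hazard.
n = 500
risk_factor = rng.integers(0, 2, n)
years = rng.exponential(scale=10 / 2.0 ** risk_factor)   # shorter survival if factor present
death = np.ones(n)                                        # assume no censoring, for simplicity

df = pd.DataFrame({"years": years, "death": death, "risk_factor": risk_factor})

cph = CoxPHFitter()
cph.fit(df, duration_col="years", event_col="death")

beta = cph.params_["risk_factor"]   # regression coefficient = log hazard ratio
print(np.exp(beta))                 # hazard ratio, roughly 2
```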
Raw data versus models of data
Importantly, the Framingham investigators did not stop at risk factor identification. Because logistic regression generates an equation based on raw data, it can be solved for a given set of values for risk factors. The investigators devised a cardboard slide rule for use by laypersons to determine their predicted risk of developing heart disease within the next 5 years.
Whenever possible and appropriate, results of clinical data analyses should be expressed in the form of mathematical models that become equations. These can be solved after “plugging in” values for an individual patient’s risk factors to estimate absolute risk and its confidence limits. Equations are compact and portable, so that with the ubiquitous computer, they can be used to advise individual patients (see “ Knowledge for Clinical Decision-Making ” in Section V).
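For example, a fitted logistic model can be "solved" for an individual patient in a few lines. The coefficients below are purely hypothetical, invented for illustration; they are not from any published risk model.

```python
import math

# Hypothetical logistic risk model (coefficients invented for illustration only):
# log odds of death = b0 + b1*age (years) + b2*ejection fraction (%) + b3*diabetes (0/1)
b0, b1, b2, b3 = -8.0, 0.06, -0.03, 0.5

def predicted_risk(age, ejection_fraction, diabetes):
    """Solve the equation for one patient's risk factors, returning absolute risk."""
    log_odds = b0 + b1 * age + b2 * ejection_fraction + b3 * diabetes
    odds = math.exp(log_odds)
    return odds / (1 + odds)        # P = O/(1 + O)

print(predicted_risk(age=72, ejection_fraction=35, diabetes=1))   # about 0.014
```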
It can be argued that equations do not represent raw data. But in most cases, are we really interested in raw data? Archeologists are interested in the past, but the objective of most clinical investigation is not to predict the past, but to draw inferences based on observations of the past that can be used in treating future patients. Thus, one might argue that equations derived from raw data about the past are more useful than raw, undigested data.
But what about use of machine learning algorithms that may excel at prediction, but may be opaque? These may not be as compact in form as an equation, but we would include them in this notion of models of data (see “ Machine Learning for Multivariable Analysis ” in Section IV).
Nihilism versus predictability
One of the important advantages of generating equations and algorithms is that they can be used to predict future results for either groups of patients or individual patients. We recognize that when speaking of individual patients, we are referring to a prediction concerning the probability of events for that patient; we generally cannot predict exactly who will experience an event or when an event will occur. Indeed, whenever we apply what we have learned from clinical experience to a new patient, we are predicting. This motivated us to develop statistical tools that yield patient-specific estimates of absolute risk as an integral byproduct. These were intended to be used for formal or informal comparison of predicted risks and benefits among alternative therapeutic strategies (see “ Clinical Studies with Nonrandomly Assigned Treatment ” later in this section).
Of course, the nihilist will say, “You can’t predict.” However, in a prospective study of 3720 patients in Leuven, Belgium, we generated evidence that predictions from multivariable equations are generally reliable (see details under “ Residual Risk ” in Section V). We compared observed survival, obtained at subsequent follow-up, with prospectively predicted survival. The correspondence was excellent in 92% of patients. However, it was poor in the rest ( Fig. 7.2 and Table 7.2 ). A time-related analysis of residual risk identified circumstances leading to poor prediction and revealed the limitations of quantitative predictions: (1) when patients have important rare conditions that have not been considered in the analysis, risk is underestimated; (2) when large data sets rich in clinically relevant variables are the basis for prediction equations, prediction should be suspect in only a small proportion of patients with unaccounted-for conditions. Except for these limitations, multivariable equations and algorithms appear capable of adjusting well for different case mixes.
Predicted and observed survival after coronary artery bypass grafting, illustrating both ability to predict from multivariable equations and pitfalls in doing so. (A) Observed overall survival among prospectively studied patients (n = 3720) compared with predicted survival. Each circle represents an observed death, positioned at time of death along the horizontal axis and according to the Kaplan-Meier life-table method along the vertical axis; vertical bars are 70% confidence limits (CL). Solid line and its 70% CLs represent predicted survival. Notice systematic overestimation of survival (underestimation of risk). Number of predicted deaths = 213 (5.7%) and observed deaths = 243 (6.5%); P = .03. (B) Patients stratified by presence (open squares) and absence (circles) of rare unaccounted-for risk factors (malignancy, preoperative dialysis, atrial fibrillation, ventricular tachycardia, or aortic regurgitation). Otherwise, format is as in A. Note excellent correspondence of predicted survival to observed survival in patients without these factors, and substantial underestimation of risk in patients with them.
TABLE 7.2
Predicted and Observed Number of Deaths after Primary Isolated Coronary Artery Bypass Grafting
Data from Sergeant and colleagues, July 1987 to 1992; n = 3720.
| Rare Risk Factors | n | Observed Deaths (No.) | Observed Deaths (%) | Predicted Deaths (No.) | Predicted Deaths (%) | P |
|---|---|---|---|---|---|---|
| No | 3428 | 186 | 5.4 | 191 | 5.6 | .7 |
| Yes | 292 | 57 | 20 | 22 | 7.5 | <.0001 |
The amount of data necessary to generate new knowledge is much larger than that needed to use the knowledge in a predictive way. To generate new knowledge, data should be rich both in relevant variables and in variables eventually found not to be relevant. But for prediction, one needs to collect only those variables used in the equation or algorithm unless one is interested in investigating reasons for prediction error.
Blunt instruments versus fine dissecting instruments
A related use of predictive equations and algorithms is in comparing alternative therapies. Some would argue that the only believable comparisons are those based on randomized trials, and that documented clinical experiences are irrelevant and misleading. However, many randomized trials are homogeneous and focused and are analyzed by blunt instruments, such as the average treatment effect. On the other hand, real-world clinical experience involves patient selection that is difficult to quantify, may be a single-institution experience with limited generality except to other institutions of the same variety, is not formalized unless there is prospective gathering of clinical information into registries, and is less disciplined. Nevertheless, analyses of clinical experiences can yield a fine dissecting instrument in the form of equations or algorithms that are useful across the spectrum of heart disease for comparing alternative treatments and therefore for advising patients (see "Clinical Studies with Nonrandomly Assigned Treatment" later in this section).
Parsimony versus complexity
Although clinical data analysis methods and results may seem complex at times, an important philosophy behind such analysis is parsimony (simplicity). We have discussed two reasons for this previously. One is that clinical data contain inherent redundancy, and one purpose of multivariable analysis is to identify that redundancy and thus simplify dimensionality. A second reason is that assimilation of new knowledge is incomplete unless one can extract the essence of the information. Thus, clinical inferences are often even more digested and simpler than the multivariable analyses.
We must admit that simplicity is a virtue based on philosophical, not scientific, grounds. The concept was introduced by William of Ockham in the early 14th century as a concept of beauty—beauty of ideas and theories. Nevertheless, it is pervasive in science.
There are dangers associated with parsimony and beauty, however. The human brain appears to assimilate information in the form of models, not actual data (see “ Human Error ” earlier in this section). Thus, new ideas, innovations, breakthroughs, and new interpretations of the same data often hinge on discarding past paradigms (“thinking outside the box”). There are other dangers in striving for simplicity. We may miss important relations because our threshold for detecting them is too high. We may reduce complex clinical questions to simple but inadequate questions that we know how to answer.
For analyses whose primary purpose is comparison, it is important, when sufficient data are available ( Box 7.4 ), to account for “everything known.” In this way the residual variability attributed to the comparison is most likely to be correct.
• BOX 7.4
Sufficient Data
A common misconception is that the larger the study group (called the sample because it is a sample of all patients past, present, and future [see later Box 7.11 ]), the larger the amount of data available for analysis. However, in studies of outcome events, the effective sample size for analysis is proportional to the number of events that have occurred, not the size of the study group. Thus, a study of 200 patients experiencing 10 events has an effective sample size of 10, not 200.
Ability to detect differences in outcome is coupled with effective sample size. A statistical quantification of the ability to detect a difference is the power of a study. A few aspects of power that affect multivariable analyses of events are mentioned.
Many variables in a data set represent subgroups of patients, and some of them may be few in number. If a single patient in a small subgroup experiences an event, multivariable analysis may identify that subgroup as one at high risk, when in fact the variable represents only a specific patient, not a common denominator of risk (see “ Incremental Risk Factor Concept ” in Section IV). The purpose of a multivariable analysis is to identify general risk factors, not individual patients experiencing events!
Thus, more than one event needs to be associated with every variable considered in the analysis. The rule of thumb in multivariable analysis is that the ratio of events to risk factors identified should be about 10:1. For us, sufficient data means at least five events associated with every variable. This strategy could result in identifying up to one factor per five events. We get nervous at this extreme, but in small studies we are sometimes close to that ratio. However, bear in mind that variables may be highly correlated and subgroups overlap, so in the course of analysis, the number of unexplained events in a subgroup may effectively fall below five, which is insufficient data.
Thus, there is both an upper limit of risk factors that can be identified by multivariable analysis and a lower limit of events to allow a variable to be considered in the analysis. Sufficient data implies having enough events available to test for all relevant risk factors.
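A trivial calculation makes the rule of thumb operational (a sketch, using the 10:1 and 5:1 ratios described above):

```python
def risk_factor_ceiling(n_events, events_per_variable=10):
    """Rule-of-thumb ceiling on the number of risk factors identifiable from n_events."""
    return n_events // events_per_variable

# A 1000-patient study with 40 deaths supports at most about 4 risk factors at the
# usual 10:1 rule, or about 8 at the 5:1 floor described above.
print(risk_factor_ceiling(40), risk_factor_ceiling(40, events_per_variable=5))
```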
New knowledge versus selling shoes
The philosophies described so far focus on the challenge of generating new knowledge from clinical experiences. However, other uses are made of clinical data.
Clinical data may be used as a form of advertising (“selling shoes”). Innovation stems less from purposefulness than from aesthetically motivated curiosity, frustration with the status quo, sheer genius, fortuitous timing, favorable circumstances, and keen intuition. With innovation comes the need to promote. However, promotional records of achievement should not be confused with serious study of safety, clinical effectiveness, and long-range appropriateness.
Closely related to promotion and innovation is proprietary information related to its commercialization. At present, the philosophies of scientific investigation and business are irreconcilable. One thrives on open dissemination of information, the other on proprietary information offering a competitive advantage. In an era of dwindling public resources for research and increasing commercial funding, we may be seeing increasing conflict between open scientific inquiry and commercial interests.
Past versus future
Is there, then, a future for quantitative analysis of the results of therapy, as there was in the developmental phase of cardiac surgery? Kirklin and Barratt-Boyes wrote in their preface to the second edition of this book:
The second edition reflects data and outcomes from an era of largely unregulated medical care, and similar data may be impossible to gather and freely analyze when care is largely regulated. This is not intended as an opinion as to the advantages or disadvantages of regulation of healthcare; indeed, as regulation proceeds, the data in this book, along with other data, should be helpful in establishing priorities and guidelines.
As already noted in all editions of this book, the last section of each clinical chapter is indications for operation. In the future, regulations of policymakers may need to be added to other variables determining indications, including patient preference.
On the horizon is the promise that medicine will become decreasingly empirical and more deterministic. However, as long as treatment of heart disease requires complex procedures, and as long as most are palliative in the life history of chronic disease, there will be a need to understand more fully the nature of the disease, its treatment, and its optimal management. This will require adoption of approaches to data that are inescapably philosophical.
Clinical research
In response to the American Medical Association’s Resolution 309 (I-98), a Clinical Research Summit and subsequently an ongoing Clinical Research Roundtable of the Academy of Medicine in the United States have sought to define and reenergize clinical research. The most important aspects of the definition of clinical research are that (1) it is but one component of medical and health research aimed at producing new knowledge; (2) the knowledge produced should be valuable for understanding the nature of disease, its treatment, and prevention; and (3) it embraces a wide spectrum of types of research. Here we highlight broad examples of that spectrum commonly found in cardiac surgery publications.
Descriptive studies: Techniques, case reports, and case series
The majority of publications in cardiac surgery clinical research fall into the general category of descriptive studies.
Techniques.
Every aspect of cardiac surgery has been described in thousands of publications about technical details. These may be techniques of preoperative evaluation, such as imaging, preoperative and intraoperative aspects of anesthesia, technical details of operations, details of postoperative management, and patient recovery. Many describe new devices and their properties and even failures, and some even describe novel methods of analysis.
The common denominator of these publications is transparent disclosure of techniques for their use by all in the field. Advancements by one surgeon or group that are not shared are a loss to the entire field of cardiac surgery.
Case reports.
The impact factor of most journals is not enhanced by individual case reports or very short case series because they are not commonly cited. However, particularly in congenital heart disease, these reports often advance the field. They may initially appear as oddities, but may later fit into gaps of knowledge, or their successful repair may advance the field in unexpected ways.
Case series.
Whether a short case series, a thousand-case series, or a national series of hundreds of thousands of cases, what are often classified as "just" descriptive studies have advanced the field of cardiac surgery tremendously. Hemodynamics of prosthetic valves, experience with the radial artery as a conduit for coronary bypass grafting, national experience with ablation for atrial fibrillation, Medicare costs, heart failure rehospitalizations after certain operations, and quality of life after aneurysm repair are all examples of descriptive case series. In "Technique for Successful Clinical Research" that follows, we will describe how to design effective descriptive studies that answer important, relevant questions and are both hypothesis driven and hypothesis generating.
Clinical studies with nonrandomly assigned treatment
Multivariable analyses.
In contrast to studies that describe a series of cases are observational studies that compare outcomes of different treatments. The fundamental objection to using observational clinical data for comparing treatments is that many uncontrolled variables affect outcome. Thus, attributing outcome differences to just one factor—the alternative treatment—stretches credibility. Even a cursory glance at the characteristics of patients treated one way versus another usually reveals that they are different groups. This should be expected because treatment has been selected by experts who believe they know what is best for a given patient. The accusation that one is comparing apples and oranges is well justified!
Indeed, a consistent message since Graunt is that risk factors for outcomes from analyses of clinical experience (and these include treatment differences) are associations, not causal relations. Multivariable adjustment for differences in outcome is valuable, and methods for multivariable analysis will be detailed in Section IV under "Multivariable Analysis." It must be pointed out, however, that if alternative therapies are being analyzed by multivariable analysis, these analyses are not guaranteed to be effective in eliminating selection bias as the genesis of a difference in outcome (a form of confounding).
Case-control studies.
Over the years, a number of attempts have been made to move "association" toward "causality." One such method is the case-control study. The method seems logical and straightforward in concept. Patients in one treatment group (cases) are matched with one or more patients in the other treatment group (controls) according to variables such as age, sex, and ventricular function. However, case matching is rarely easy in practice. How closely matched must the pair of patients be in age? How close in ejection fraction? "We don't have anyone to match this patient in both age and ejection fraction!" The more variables that must be matched, the more difficult it is to find a match in all specified characteristics. Yet matching on only a few variables may not protect well against apples-and-oranges comparisons. Diabolically, selection factor effects, which case matching is intended to reduce, may increase bias when unmatched cases are simply eliminated.
Comparative effectiveness studies.
During the 1980s, federal support for complex clinical trials in heart disease was abundant. Perhaps as a result, few of us noticed the important advances being made in statistical methods for valid, nonrandomized comparisons. One example was the seminal 1983 Biometrika paper “The Central Role of the Propensity Score in Observational Studies for Causal Effects” by Rosenbaum and Rubin. In the 1990s, as the funding climate changed, interest in methods for making nonrandomized comparisons accelerated. This interest has increased as comparative effectiveness research has taken on greater importance, and the concept of a Learning Health System has been advocated by the National Academy of Medicine in the United States. Studies using propensity scores for comparison of outcomes are detailed in Section V under “Comparisons Based on the Propensity Score.”
Virtual twin studies.
More recently, formal statistical methods for conceptually treating patients as their own controls—something that has been done in cardiac surgery since the 1970s—have become known as virtual twin methodology. The method borrows from propensity score methods the concept of identifying patients empirically eligible to be treated by two or more therapies, but both refines and importantly extends those methods such that patients are matched to themselves, first with the treatment received and then with the counterfactual treatment. Clinical research using virtual twins is detailed in Section V under "Virtual Twins and Causal Analysis."
Clinical trials with randomly assigned treatment
A randomized clinical trial is a scientific experiment in human subjects wherein allocation of therapies and alternatives for individual patients is not under control of the clinician, but rather is assigned at random, eliminating selection bias and thus permitting direct comparison of treatment outcomes. “Therapy” may include a placebo (placebo-controlled trials) or current versus new therapy. In the hierarchy of clinical research study designs, the randomized trial generates the most secure information about average treatment differences. Randomized trials are detailed in Section V under “Clinical Trials with Randomly Assigned Treatments.”
Meta-analysis
Meta-analysis combines or integrates the results of multiple independently conducted clinical trials, observational clinical studies, or sets of individual patient data that are deemed combinable with respect to a common research question, then analyzes them statistically. Meta-analysis studies are detailed in Section V under “Meta-Analysis.”
Technique for successful clinical research
Marbán and Braunwald, in reflecting on training the clinician-investigator, provide guiding principles for successful clinical research. Among these are:
• Choose the right project.
• Embrace the unknown.
• Use state-of-the-art approaches.
• Do not become the slave of a single technique.
• Never underestimate the power of the written or spoken word.
In this subsection, we emphasize these principles and suggest ways to operationalize them.
A deliberate plan is needed to carry a study successfully from inception to publication. Many such plans have been proposed, but they share important commonalities. Here we outline such a plan for the study of a clinical question for which clinical experience (an observational case series or comparative cohorts) will provide the data. This plan appears as a linear workflow, but in reality most research efforts do not proceed linearly but rather iteratively, with each step becoming more refined and usually more focused right up to the last revision of the manuscript. As is true of most workflows, there are mileposts at which there must be deliverables, whether a written proposal, data, analyses, tables and graphs, a manuscript, or page proofs. This plan is articulated in a research proposal.
Research proposal
Good science requires a plan. This may be seen as a bureaucratic necessity for Institutional Review Board (IRB) oversight, but for effective, efficient studies, be they for quality assurance investigations or any type of clinical study, a formal plan in the form of a proposal is essential. In the words of Daniel Burnham, “Make no little plans; they have no magic to stir men’s blood.”
Getting started.
A proposal first serves to clarify and bring into focus the question or set of related questions being asked. A common mistake is to ask questions that are unfocused, or uninteresting, or overworked, or that do not target a clear gap in knowledge or area of importance. Marbán and Braunwald say “Ask a bold question…about which you can feel passionate.”
Topical question.
It is not possible to ask a researchable question without knowing something about the topic of one’s curiosity. Thus, as a preliminary to the research question, there is the topical question that relates to what is already known about the topic of interest and what gaps there are in knowledge related to that topic. Help in this regard may come from clinicians and mentors, but there is no substitute for a literature search and review. Indeed, some research mentors encourage this to be done as a systematic review (or “scoping” review), which can result in a state-of-the-topic review manuscript that ends with a discussion of current gaps or limitations in knowledge, disparate findings that require reconciling, assertions from data that require verification, underpowered observations that need more robust study, and so on. From this literature study, you should be able to determine if your initial question has been adequately answered and no further study is needed, or to generate your research question.
Research question.
What is a good research question on your topic of interest? Browner and colleagues describe the characteristics of a good research question and study plan as meeting those described by the acronym FINER. F stands for feasibility: "adequate number of subjects, adequate technical expertise, affordable time and money, manageable scope, and fundable." Many IRBs exempt from review the simple counting of cases that meet eligibility requirements, done to determine whether a study is feasible. I stands for interesting: "getting the answer intrigues investigators and their colleagues." The National Institutes of Health (NIH) would call this significance, one of the graded criteria for grant proposals. N stands for novel: "provides new findings, confirms, refutes, or extends previous findings, may lead to innovations in concepts of health and disease, medical practice, or methodologies for research." NIH calls this innovation. E stands for ethical: "a study that the IRB will approve." As an aside, we note that there continues to be research fraud, and it is nearly impossible for reviewers to identify. R stands for relevant: "likely to have significant impacts on scientific knowledge, clinical practice, or health policy; may influence directions of future research." NIH calls this overall impact, and it is the ultimate score given a proposal for funding.
A further suggestion we offer is that a good research question should have three parts: the population, the treatment, and the outcome. For example, one might offer the topical question “What should we do for ischemic mitral disease?” Given the knowledge gap from the literature review, this topical question can be transformed into a good investigable question: For patients with moderate ischemic mitral regurgitation (the population), does addition of an undersized rigid anuloplasty ring (the treatment comparison) result in more complete reverse left ventricular remodeling (the outcome)?
Study title.
Once you have a good question, you have what you need to provide a good title for your proposal. The title should be a statement that relates directly to your primary research question. Some like to then extract from the title a short name for the study (this is especially true of clinical trials), such as MATADORS: Multidisciplinary study of Ascending Tissue characteristics And hemodynamics for the Development of novel aORtic Stentgrafts.
Mechanistic hypothesis.
Given the research question, a hypothesis that is focused not on statistical matters but on the “why” is important, because it leads directly to the endpoint, and the endpoint to the analysis or its refinement. The mechanistic hypothesis for the question about mitral valve anuloplasty might be “By stabilizing the mitral anular size with a rigid anuloplasty ring, regurgitation across the mitral valve is limited and stress on the left ventricular myocardium reduced, resulting in a decrease in systolic ventricular size (reverse remodeling).” There may be secondary hypotheses that lead to other endpoints, such as time-related mortality.
Identify the study group.
The next step is to clearly define the inclusion and exclusion criteria for the study group . A common mistake is to define this group too narrowly, such that cases “fall through the cracks” or an insufficient spectrum is stipulated. The inclusive dates should be considered carefully. Readers will be suspicious if the dates are “strange”; did you stop just before a series of deaths? Whole years or half years dispel these suspicions. Similarly, suspicion arises when a study consists of a “nice” number of patients, such as “the first 100 or 1000 repairs.”
In defining the study group, particular care should be taken to include the denominator. For example, a study question may relate to postoperative neurologic events, but it is also important to have a denominator to put these events into context. Or one may wish to evaluate a new surgical technique but will be unable to compare it with the standard technique without a comparison group. A study of only numerators is the true definition of a retrospective study; if the denominator is included, it is a prospective or cohort study (Box 7.5). The inclusion and exclusion criteria should allow one to perform an initial search for patients as a feasibility study. Are there too few patients to answer the study question? A common failing is forgetting that if an outcome event is the endpoint, the effective sample size is the number of events observed (see Box 7.4). A study may have 1000 patients, but if only 10 events are observed, one cannot find multiple risk factors for those events. In cardiothoracic surgery, the number of events is often a very small fraction of the denominator, and such data are known as imbalanced. This is particularly important for nonparametric machine learning methods and in applying traditional goodness-of-fit statistics (like the C-statistic) to parametric models.
• BOX 7.5
Retrospective, Prospective
When clinical data are used for research, some term this retrospective research (e.g., the National Institutes of Health). Epidemiologists also perform what they call retrospective studies that bear no resemblance to typical clinical studies. Thus, confusion has been introduced by use of both the words retrospective and prospective to designate, interchangeably, two antithetical types of clinical study. The confusion is perpetuated by institutional review boards and government agencies that believe one (prospective), but not the other (retrospective), constitutes "research" on human subjects. The confusion can be eliminated by differentiating between (1) the temporal direction of study design and (2) the temporal direction of data collection for a study, as did Feinstein.
Temporal direction of study design
The temporal pursuit of patients may be forward. That is, a cohort (group) of patients is defined at some common time zero, such as operation, and this group is followed for outcomes. Some call this a cohort study. It is the most typical type of study in cardiac surgery: A group of patients is operated on and outcome is assessed. Statisticians have called this a prospective clinical study design; it moves from a defined time zero forward (which is what the word prospective means).
In contrast, temporal pursuit of patients may be backward. Generally in such a study, an outcome event occurs, such as death from a communicable disease. Starting from this event (generally, a group of such events), the study proceeds backward to attempt to ascertain its cause. Feinstein suggests calling such a study a “trohoc” study ( cohort spelled backwards). For years, many epidemiologists called this a retrospective clinical study design because of its backward temporal direction of study.
Temporal direction of data collection
Increasingly, retrospective is used to designate the temporal aspect of collecting data from existing clinical records for either a cohort or trohoc study. If charts or radiographs of past patients in a cohort study must be reviewed or echocardiographic features measured, the data collection is retrospective. Feinstein has coined the term “retrolective” for this to avoid use of the word retrospective because of the previously well-understood meaning of the latter in study design. If registry data are collected concurrently with patient care, this process is surely prospective data collection. Feinstein suggests calling such data collection “prolective” data collection.
The deliverable for this feasibility investigation is a CONSORT-style diagram ( http://www.consort-statement.org/consort-statement/flow-diagram ). For observational studies, there is a related STROBE guideline, and all these are part of the EQUATOR (Enhancing the QUAlity And Transparency Of health Research) network ( www.equator-network.org ).
Identify endpoints.
Endpoints (results, outcomes) should be linked one-to-one to the study questions and accompanying hypotheses. For example, the first subquestion of a comparative effectiveness study may be "How do these treatment groups differ?", with the mechanistic hypothesis that experienced, knowledgeable clinicians have selected one treatment for some patients and another treatment for others, but that for a subset of patients systematic selection has not occurred, creating among them virtual equipoise. The endpoint for that hypothesis is the treatment received. For hypotheses about effectiveness of the treatment, endpoints would be those about short-, medium-, or long-term effectiveness of treatment. For a hypothesis related to differences in safety of the therapies, perioperative complications are likely among the endpoints. Whatever the endpoint, it must be clearly defined in a reproducible fashion. Generally, every event endpoint should be accompanied by its date of occurrence. A common failing is that for repeated endpoints (e.g., thromboembolism, assessments of functional status, rehospitalizations, echocardiographic assessment of valve gradient or grade of regurgitation), only the first or most recent occurrence or assessment is recorded. The latter is particularly egregious because every patient has a different interval from, say, surgery to that last observation, so the data at last observation are not interpretable. Instead, record every instance, every assessment. Techniques to analyze repeated endpoints are available (see "Longitudinal Outcomes" in Section IV).
Identify covariables.
Careful attention must be paid to the covariables that will be studied. They should be pertinent to the study question (purpose, objective, hypothesis). A common failing is to collect values for too many variables such that quality of data collection for important variables suffers. This error usually arises in a reasonable and understandable way. The surgeon-investigator reasons that because the patient medical records must be reviewed, a number of other variables “may as well” be abstracted while there. Or realizing the full complexity of the clinical setting, the surgeon-investigator feels compelled to collect information on all possible ramifications of the study, even if some of it is peripheral to the study’s focus. John Kirklin called this “the Christmas tree effect,” meaning adding ornament upon ornament until they dominate what once was “just” a fine tree. There needs to be a balance between so sparse a set of variables that little can be done by way of risk factor identification or balancing characteristics of the group, and so rich a set of variables that the study flounders or insufficient care is given to the quality and completeness of relevant variables.
Variables from electronic sources.
What variables can be obtained from electronic sources? Some aspects of these data may need to be verified. Of vital importance is determining the units of measurement for values from electronic sources. For example, in one source, height may be in inches and in another in centimeters!
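A small sketch of the kind of unit harmonization implied here (the function and the unit labels are our invention):

```python
# Hypothetical harmonization of height abstracted from two electronic sources,
# one recording inches and the other centimeters (field names are invented).
def height_cm(value, source_units):
    """Convert a height value to centimeters, refusing silently mixed units."""
    if source_units == "cm":
        return float(value)
    if source_units == "in":
        return float(value) * 2.54
    raise ValueError(f"Unknown unit of measurement: {source_units!r}")

print(height_cm(70, "in"), height_cm(178, "cm"))   # 177.8 178.0
```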
Variables specific to study that need to be collected.
For many studies, at least some values for variables needed are not available electronically. This requires developing a database for their acquisition. Note that for successful data analysis, the vocabulary for these variables must be controlled, meaning that all possible values (including “unknown”) for each variable must be explicitly specified at the outset (no “free text”). These will become “picklists” for data entry.
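One simple way to enforce such a controlled vocabulary is to make the data dictionary executable, as in this sketch (variable names and permitted values are hypothetical examples, not a standard):

```python
# Sketch of a study-specific data dictionary with a controlled vocabulary.
DATA_DICTIONARY = {
    "mitral_regurgitation_grade": {"none", "trivial", "mild", "moderate", "severe", "unknown"},
    "preoperative_dialysis": {"yes", "no", "unknown"},
}

def validate(variable, value):
    """Reject free text: every value must come from the picklist for its variable."""
    allowed = DATA_DICTIONARY[variable]
    if value not in allowed:
        raise ValueError(f"{value!r} is not a permitted value for {variable}; "
                         f"choose from {sorted(allowed)}")
    return value

validate("mitral_regurgitation_grade", "moderate")   # passes
# validate("preoperative_dialysis", "maybe")         # would raise ValueError
```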
Propose data analysis plan.
A data analysis plan should be linked one-to-one with the study questions, their accompanying hypotheses, and their accompanying endpoints. For example, if the first subquestion relates to how patients are receiving one therapy versus another, the first objective in the analysis plan would be comparing characteristics of the two groups of patients, for example, by a table or figure of standardized differences. The second step would be to develop a parsimonious model of these differences that would then be augmented into a nonparsimonious model, followed by generating a balancing or propensity score. This would be followed by 1:1 or weighted matching and comparison of characteristics of the resulting matched groups. This, in turn, would lead to comparisons of endpoints, and finally study of the remaining unmatched cases (the “oranges”).
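The first steps of such a plan can be sketched in code. The following illustration (ours; the variable names and the simulated selection mechanism are invented) computes standardized differences between treatment groups and fits a logistic model as the propensity score, using pandas and scikit-learn; matching and endpoint comparison would follow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical cohort: one row per patient; 'treated' is the therapy received.
n = 400
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "ejection_fraction": rng.normal(50, 12, n),
    "diabetes": rng.integers(0, 2, n),
})
# Simulated selection: treatment choice depends on patient characteristics.
logit = -4 + 0.05 * df["age"] - 0.02 * df["ejection_fraction"] + 0.4 * df["diabetes"]
df["treated"] = rng.random(n) < 1 / (1 + np.exp(-logit))

def standardized_difference(x_treated, x_control):
    """Difference in means in pooled standard-deviation units (not a P value)."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

covariates = ["age", "ejection_fraction", "diabetes"]
for c in covariates:
    d = standardized_difference(df.loc[df["treated"], c], df.loc[~df["treated"], c])
    print(f"{c}: standardized difference = {d:.2f}")

# Propensity (balancing) score: modeled probability of receiving the treatment.
model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["propensity"] = model.predict_proba(df[covariates])[:, 1]
# Next steps (not shown): 1:1 or weighted matching on the propensity score,
# re-checking standardized differences in the matched groups, then comparing endpoints.
```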
Appreciate limitations and anticipated problems.
Every study has limitations and anticipated problems. These can be identified by a brief but serious investigation of the state of all the above. If any appear insurmountable or present fatal flaws that preclude later publication, the study should be abandoned. There are always more questions than can be addressed in cardiac surgery, so not being able to answer some specific research question is not an excuse to abandon the search for new knowledge!
Sketch shell tables and figures.
Every aspect of the data analysis plan is likely to generate perhaps some simple answers, but more often tables and figures. Laying these out with the proposal will be helpful to the statisticians analyzing the data and often identifies items missing from the analysis plan—those analyses, tables, and figures that “connect the dots.”
Establish a timetable.
Develop a timetable for data abstraction, data set generation, data analysis and reporting, possible meeting abstract and deadline, and all deliverables at various mileposts in the study. If the timetable is beyond that tolerable, abandon the study. It is rare for a study to be completed in a year from start to finish. This emphasizes both the bottlenecks of research and the need for lifelong commitment. Although abstract deadlines often drive the timetable, this is a poor milepost (see “ Presentation ” in Section V).
Final deliverable.
The final deliverable is a completed research proposal (or protocol) that is ready for review by collaborating investigators, data managers, and statisticians, followed by revision and iterative refinement of the proposal. The better the proposal, the better the research and the easier for statisticians to analyze. The protocol will often need to be approved by a research committee for its scientific merit and funding, and by the institutional review (ethics) board.
The proposal should be a living document. It is likely to be updated throughout the course of a study, and we advocate online tracking of each study, with periodic updates of the protocol as one of the tasks in project management.
Section II: Information
In this chapter, we define information as a collection of facts. The medical record is one such collection of facts about the health care of a patient. In it, observations are recorded (clinical documentation) for communication among healthcare professionals (and coders for billing for care) and for workflow (e.g., plan of care, orders). However, perhaps as much as 90% of the information communicated in the care of a patient is never recorded. The attitude of health insurers—“if it is not recorded, it did not happen”—represents a sobering lack of appreciation for the way information about patient care is used and communicated. However, it is also an indictment of the way medical practice is documented. Much is left out of written records, and many reports are poorly organized or incomplete. If important clinical observations are not recorded during patient care in a clear, complete, and well-organized (structured) fashion, information gathering for clinical research and a learning health system is impeded.
Computer-based patient record
In 1991—now more than three decades ago—the Institute of Medicine (now the National Academy of Medicine) recognized the need not only for computerizing the paper medical record, but also for devising a radically different way to record, store, communicate, and use clinical information. They coined the term "computer-based patient record," or CPR, and distinguished it from the electronic medical record (EMR) by the fact that it would contain values for variables using a highly controlled vocabulary rather than free text (natural language).
For the cardiac surgical group interested in conducting serious and efficient clinical research, a CPR with a few specific characteristics could enormously facilitate clinical studies. Additionally, it could transform the results of this research into dynamic, patient-specific, strategic decision-support tools to enhance patient care. Although clearly elusive, and therefore theoretical, the nature of such a system can be described.
First and foremost, the CPR must consist of values for variables, selected from a controlled vocabulary. This format for recording information is necessary because analysis now and in the foreseeable future must use information that is formatted in a highly structured, precisely defined fashion, not uncontrolled natural language. Extracting structured information from natural language is a formidable challenge and one that should be unnecessary. Second, the CPR must accommodate time as a fundamental attribute. This includes specific time (date:time stamps), inexact time (about 5 years ago), duration (how long an event lasted, including inexact duration), sequence (second myocardial infarction [MI], before, after), and repetition (number of times, such as three MIs). Third, the CPR must store information in a fashion that permits retrieval not only at the individual patient level but also at the group level, according to specified inclusion and exclusion criteria. Fourth, the CPR will ideally incorporate mechanisms for using results of clinical studies in a patient-specific fashion for decision support in the broadest sense of the term, such as patient management algorithms and patient-specific predictions of outcome from equations or algorithms developed by research (see "Decision-Making Based on Individual Effect of Therapies" in Section V).
There are many other requirements for CPRs, from human-user interfaces, to administrative and financial functions, to healthcare workflow, to human error avoidance systems, to quality assurance (see Chapter 8 ) that are beyond the scope of the clinical research theme in this section. However, certain ideas about how medical information could be stored to facilitate clinical research follow.
Ontology
If medical information is to be gathered and stored as values for variables, a medical vocabulary and organizing syntax must be available. A technical term for this is ontology. In Greek philosophy, ontology meant “the nature of things.” Specifically, it meant what actually is (reality), not what is perceived (see “ Human Error ” in Section I) or known (epistemology). In medicine of the 17th and 18th centuries, however, it came to mean a view of disease as real, distinct, classifiable, definable entities. This idea was adopted by computer science to embrace with a single term everything that formally specifies the concepts and relationships that can exist for some subject, such as medicine. An ontology permits sharing of information, such as a vocabulary of medicine (terms, phrases), variables, definitions of variables, synonyms, all possible values for variables, classification and relationships of variables (e.g., in terms of anatomy, disease, healthcare delivery), semantics, syntax, and other attributes and relationships.
An ontology for all of medicine does not yet exist. Efforts to develop a unified medical language, such as the Unified Medical Language System (UMLS) of the National Library of Medicine, are well underway and becoming increasingly formalized linguistically as ontologies.
Ontology is familiar to clinical researchers, who must always have a controlled vocabulary for values for variables, well-defined variables, and explicit interrelations among variables. Without these, there is no way to accurately interpret analyses or relate results to the findings of other investigators. However, a clinical study is a microscopic view of medicine; scaling up to all of medicine is daunting.
Perhaps, then, the simplest way of thinking about an ontology for the researcher is as data dictionaries and their organizational structure, and some mechanism to develop and maintain them. These attributes have collectively been called metadata (data about data) or a knowledge base , and metadata-base or knowledge-base management systems , respectively.
Information (data) model
An information (data) model is a specification of the arrangement of the most granular piece of information according to specific relationships and the organization of all of these into sets of related information. The objective of an information model is to decrease entropy—that is, to decrease the degree of disorder in the information and thereby increase efficiency of information storage and retrieval (performance). Here we describe briefly two such information models, an old model still in use today and a new model from among many that may be less familiar.
Relational information model.
In medicine, information is multidimensional. A given value for a variable may carry with it time, who or what machine generated the value, the context of obtaining the value ("documentation"), format or units of measurement, and a host of attributes and relationships—indeed, ontology—that give the value meaning within the context of healthcare delivery. Simply storing a set of values is insufficient. At a minimum, a comprehensive data dictionary must be developed, and ideally the database structure will contain metadata as to who entered the data, completeness of the data, and who can access and use the data. When data come to analysis, information relevant to the values may importantly affect the analysis.
In relational database technology, variables are arranged as columns of a table, sets of columns are organized as a table, individual patients are in rows, and a set of interrelated tables constitute the database ( Fig. 7.3 A). Retrieval of data from such a model is often accompanied by the structured query language (SQL).
Comparison of relational information model with a semistructured one presented as a directed acyclic graph. (A) Relational. Tables are related by ID and source. Note that second table is many-to-one; that is, many postoperative echocardiograms were performed on one patient. (B) Semistructured.
(From Jonathan Borden, www.jonathanborden-md.com .)
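A toy example of the relational arrangement and SQL retrieval described above, using Python's built-in sqlite3 module (the schema is invented for illustration, echoing the many-to-one echocardiogram relationship of Fig. 7.3 A):

```python
import sqlite3

# Toy relational layout: a patients table and a many-to-one table of postoperative
# echocardiograms, related by patient ID.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (id INTEGER PRIMARY KEY, birth_year INTEGER, operation TEXT);
CREATE TABLE echocardiograms (id INTEGER, study_date TEXT, mr_grade TEXT,
                              FOREIGN KEY (id) REFERENCES patients (id));
INSERT INTO patients VALUES (1, 1952, 'mitral repair');
INSERT INTO echocardiograms VALUES (1, '2019-06-01', 'mild'), (1, '2021-06-15', 'moderate');
""")

# Structured query language (SQL) retrieval across the related tables.
rows = con.execute("""
    SELECT p.id, p.operation, e.study_date, e.mr_grade
    FROM patients AS p JOIN echocardiograms AS e ON e.id = p.id
    ORDER BY e.study_date
""").fetchall()
print(rows)
```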
Popularity of the relational model among clinical researchers stems from its simplicity in handling a microscopic corner of medical information. As soon as a new topic is addressed or new variables must be collected, the typical behavior of the research team is to generate a new specific database. Rarely do these multiple, independent, and to some extent redundant databases communicate with one another across studies. Thus, simplicity can work against more complex or comprehensive future studies.
Semistructured information model.
A different kind of information model emerged from an important conference at UAB of leaders in the development of several different types of database as part of a CPR project. After review of the strengths and limitations of various information models, a novel approach was suggested by Kirklin and then formalized. He proposed that all information that provided context and meaning to a value for a variable be packaged together. He envisioned that such a complex data element should be able to reside as an independent self-sufficient entity.
This idea has several meritorious implications. First, an electronic container for a collection of complex data elements could consist of a highly stable, totally generic repository for a CPR because it would be required to possess no knowledge of content of any data element. It could therefore manage important information storage and retrieval functions, implement data encryption for privacy and confidentiality, store knowledge bases used to construct the complex data elements and retrieve them, maintain audit trails, and perform all those functions of database management systems that are independent of data content. The second implication is that as medical knowledge increases, new entries would be made in the knowledge-based dictionaries. These would be updated, not the database structure. Not only would this ease database maintenance, but it would also enforce documentation in the knowledge base. The third implication, and the one most important for clinical research, is that no a priori limitations would be placed on relations; they could be of any dimensionality considered useful at the time data elements were retrieved for analysis. Thus, the electronic container is a single variable value-pair augmented with contextual documentation and capable of being modified as new or more knowledge accrues ( Fig. 7.3 B).
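One way to picture such a complex data element is as a self-describing package in which the value travels with the context needed to interpret it, as in this sketch (the field names are our invention, not a standard):

```python
import json

# A single value packaged with the context that gives it meaning.
complex_data_element = {
    "variable": "left_ventricular_ejection_fraction",
    "value": 42,
    "units": "%",
    "method": "transthoracic echocardiogram",
    "observed_at": "2021-06-15T09:30:00",
    "recorded_by": "echo laboratory",
    "dictionary_version": "2021-02",
}

# Because the element is self-describing, a generic repository can store and return
# it without knowing anything about ejection fractions.
print(json.dumps(complex_data_element, indent=2))
```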
Essential characteristics of such an information repository would be:
• Self-documentation at the level of individual values for a variable (complex data element)
• Self-reporting at the time of data element retrieval
• Potential self-displaying in a human-computer interface
• Self-organizing
The latter is an important attribute for future implementation of what might be called “artificial intelligence” features of a CPR. These may be as simple as self-generation of alerts, solution of multivariable equations or algorithms for decision support at the individual patient level, or intelligent data mining for undiscovered relations within the information.
About 1995, at the time these ideas were being developed at UAB, similar thinking was going on among computer scientists at Stanford University and the University of Pennsylvania, arising from different stimuli. They termed such an information model, in which complex data elements carry with them all attributes intended for self-documentation, self-reporting, and self-organization, semistructured data. This phrase meant that the data elements were fully structured, but no necessary relation of one data element to another was presupposed. The culmination of these efforts was a database for storing complex data elements called Lore and a novel query language for retrieving complex data elements called Lorel.
In the 1990s, it was recognized that the information structure suggested by Kirklin and the University of Pennsylvania and Stanford computer scientists could be conceptualized as a directed acyclic graph. At that time, another entity was also rapidly coming into existence with similar properties, but of global proportions: the World Wide Web (WWW, or simply the Web). A Web page is analogous to a complex data element, with an essential feature being that it is self-describing, so it can be retrieved. The Web is the infrastructure for these pages. It has no need to be aware of Web page content. The subject matter has no bounds. Not surprisingly, then, the tools developed for retrieving semistructured data were quickly adapted to what has become known as search engines for the Web. Like Dr. Kirklin’s vision of complex data elements, information retrieved by a search engine can become related in ways never envisioned by the person generating it, because full structure is imposed only at the time of retrieval, not at the time of storage.
Thus, at Cleveland Clinic, investigators both harnessed and developed Semantic Web tools for data storage and manipulation, in part within the framework of the World Wide Web Consortium (W3C) and in part through ontologies built by Douglas Lenat of Cycorp.
An information model was built on a graph-based data model known as the Resource Description Framework (RDF), as well as a framework for describing conceptual models of RDF data in a particular domain, known as the Web Ontology Language (OWL). It also includes a standard query language called SPARQL.
RDF captures meaning as a collection of triples consisting of components analogous to those of an elementary sentence in natural language: subject, verb, object. Typically, terms in these sentences are resources identified by Uniform Resource Identifiers (URIs). URIs are global identifiers for items of interest (called resources ) in the information space of the Web. Collections of RDF triples constitute an RDF graph.
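A minimal sketch of patient-record content expressed as RDF triples and retrieved with SPARQL, using the Python rdflib package (the namespace and term names are hypothetical, not an established ontology):

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Patient-record content as RDF triples (subject, predicate, object).
EX = Namespace("http://example.org/cardiac/")
g = Graph()

patient = URIRef("http://example.org/patient/123")
g.add((patient, EX.hasDiagnosis, EX.IschemicMitralRegurgitation))
g.add((patient, EX.ejectionFraction, Literal(42)))

# A SPARQL query imposes whatever structure is needed at the time of retrieval.
query = """
    SELECT ?p ?ef WHERE { ?p <http://example.org/cardiac/ejectionFraction> ?ef . }
"""
for row in g.query(query):
    print(row.p, row.ef)
```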
Many requirements outlined as crucial for CPR systems are addressed by using RDF as a data format for patient record content. In particular, our ability to link with other clinical records can be facilitated when RDF is used in this way. Use of URIs as syntax for the names of concepts in RDF graphs is the primary reason for this. The meaning of terms used in a patient record (as well as the patient record itself or some part of it) can be made available over the Web in a (secure) distributed fashion for on-demand retrieval.
A judicious application of Semantic Web technologies can also lead to faster movement of innovation from the research laboratory to the clinic or hospital. In particular, it was envisioned that use of these technologies would improve productivity of research, help raise quality of healthcare, and enable scientists to formulate new hypotheses, inspiring research based on clinical experience.
Data collection for a clinical study
Clinical studies are only as accurate and complete as the data available in patients’ records. Therefore, cardiac surgeons and team members must ensure that their preoperative, operative, and postoperative records are clear, organized, accurate, and extensive so that information gathered from these records can be complete and meaningful. The records should emphasize description, and although they may well contain the conclusions of the moment, it is the description of basic observations that becomes useful in later analyses.
Core data elements
Beyond these core variables, there will likely be a need for variables specific to a particular study. These should be identified and reproducibly defined in a data dictionary. Experienced investigators realize that in the midst of a study, it occasionally becomes evident that some variables require refinement, others collecting de novo, others rechecking, and others redefining. It is important to understand that when this occurs, the variables must be refined, collected, rechecked, or redefined uniformly for every patient in the study.
Extract values for variables
A source, or sources, for obtaining values for the set of variables specified in the clinical research proposal must now be identified for the study group ( Box 7.6 ). These are often contained in an electronic format (e.g., a quality registry or a hospital information system), but values for some variables may be in narrative form in patients’ medical records and must be abstracted.
• BOX 7.6
Core Data Element Concept
Core data elements represent the most granular source information that can be logically combined or mapped in multiple ways to generate answers (values) to specific questions (variables). Schematically, this is shown in the following diagram for six core data elements (CDE) from sources a-f.
The diagram that follows answers two specific database queries concerning use of anti-anginal medications.
Finally, a third database query relates to a specific medication and requires a combination of temporal reasoning and medication class, prescription, and use.
Export from electronic information sources.
If some or all the variables specified are in electronic format, sources must be identified, and a query made for patients in the study to extract values for the variables. This often time-consuming step is facilitated by three factors. First, at the time the information system is created, procedures can be built in to ease extracting, formatting, and exporting values for variables. This is particularly feasible in a so-called metadata-driven system, in which “data about data” drives not only the data entry process but the data extraction process as well. It is also particularly feasible for relational databases that are electronically linked (e.g., portals) to the analysis system. Second, “standard” groups of core data elements can be identified that form the basis for at least a major portion of the variables needed for most studies, and these may be part of prospectively maintained registries for quality reporting. The advantage of this strategy is that queries can be assembled carefully and refined over time. Third, successful, accurate ad hoc queries can be stored so that when the same variables are again specified, these queries can be reused.
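As an illustration of the third factor, a refined extraction query can be stored once and reused. The sketch below uses Python's standard sqlite3 module with a hypothetical miniature registry table and a parameterized query saved as a named constant; the table and column names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical miniature registry for illustration only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE registry
               (pt_id INTEGER, surg_date TEXT, procedure TEXT, ef REAL)""")
con.executemany("INSERT INTO registry VALUES (?, ?, ?, ?)",
                [(1, "2019-03-16", "CABG", 55.0),
                 (2, "2019-12-12", "AVR",  40.0),
                 (3, "2012-03-01", "CABG", 62.0)])

# A stored, parameterized query: assembled and verified once, then
# reused whenever the same core data elements are specified again.
CABG_CORE_ELEMENTS = """
    SELECT pt_id, surg_date, ef
    FROM registry
    WHERE procedure = ? AND surg_date BETWEEN ? AND ?
"""

rows = con.execute(CABG_CORE_ELEMENTS,
                   ("CABG", "2015-01-01", "2020-12-31")).fetchall()
print(rows)   # [(1, '2019-03-16', 55.0)]
```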
Often more than one electronic data source must be used. In this case, values for variables in common may need to be adjudicated if they do not match in value, definition, granularity, or in variable nomenclature. Ultimately, unique variables must be joined into a common database.
Extract from medical records.
Even if the majority of information is available electronically, there are nearly always some variables new to the study that must be gathered from paper records or from free text in the EMR if a CPR is not available. A more arduous process must be put into place for extracting data from these original documents, whether by manual abstraction or by natural language processing. A precise methodology is necessary for assembling information to prevent repetitious handling of both the patient’s record and the extracted information, as well as to ensure complete and accurate data retrieval while preserving patients’ privacy and confidentiality.
All information should be recorded in clearly defined, objective terms. There may be a preference for using descriptive terms that have been clearly defined (e.g., absent, trivial, mild, moderate, severe). Alternatively, numeric coding may be used, with each numeral clearly defined. Pedal pulses, for example, may be recorded as 0, 1, 2, 3, or 4, with 4 indicating normal. The two methods are equally rigorous as long as values are picked from a controlled vocabulary with clear definitions.
What one must avoid is an uncontrolled approach to entry of data into a spreadsheet ( Table 7.3 A). Each column contains data in multiple formats: a mix of alphabetical and numeric entries, units attached to some numbers (and differing units at that), dates in several formats, anomalous dates in which day and month could be reversed, and different expressions of the same quantitative variable. Keep in mind the data scientist or statistician who will be analyzing the data; in the end, they will need the data as they appear in Table 7.3 B. What is needed is a data entry form with controlled entry, in this case separating numerical values from a column of units of measure and using unambiguous dates, together with a data dictionary. A sketch of the kind of programmatic cleanup that Table 7.3 A otherwise forces on the analyst follows Table 7.3 B.
TABLE 7.3
Uncontrolled (A) and Controlled (B) Collection of Primary Research Data
| A | ||||||
|---|---|---|---|---|---|---|
| PT# | Sex | Height | DOB | Race | Surgery Date | Aortic Regurgitation |
| 1 | M | 68 in | 1/1/1970 | 1 | 3/16/2019 | None |
| 2 | F | 64 | Feb. 5 1960 | O | Dec. 12 2019 | 0 |
| 3 | F | 160 | 4/3/80 | 2 | 3/1/12 | Trace |
| 4 | 1 | 152 | MAY211952 | W | FEB042013 | 4+ |
| 5 | M | 172 cm | 25/01/1978 | 3 | 08/08/2013 | Moderate |
| 6 | M | 65 inches | 01221979 | 1 | 04052001 | 2 |
| 7 | 1 | 166 cent. | AUG 5 1982 | Black | OCT 8 2016 | Mild |
| 8 | 2 | 62 | 5/1/1976 | 2 | 8/12/2013 | Moderate/Severe |
| 9 | M | 65 | 11151964 | 1 | 02132017 | 3+ |
| B | | | | | | |
|---|---|---|---|---|---|---|
| PT# | Male | Height (in) | DOB | Race (1 = White, 2 = Black, 3 = Other) | Surgery Date | Aortic Regurgitation Grade |
| 1 | 1 | 68 | 01/01/1970 | 1 | 03/16/2019 | 0 |
| 2 | 0 | 64 | 02/05/1960 | 3 | 12/12/2019 | 0 |
| 3 | 0 | 63 | 04/03/1980 | 2 | 03/01/2012 | 0 |
| 4 | 1 | 60 | 05/21/1952 | 1 | 02/04/2013 | 3 |
| 5 | 1 | 68 | 01/25/1978 | 3 | 08/08/2013 | 2 |
| 6 | 1 | 65 | 01/22/1979 | 1 | 04/05/2001 | 2 |
| 7 | 0 | 65 | 08/05/1982 | 2 | 10/08/2016 | 1 |
| 8 | 1 | 62 | 05/01/1976 | 2 | 08/12/2013 | 3 |
| 9 | 1 | 65 | 11/15/1964 | 1 | 02/13/2017 | 3 |
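The sketch below (pandas; column names follow Table 7.3, and the unit guess and regurgitation mapping are assumptions consistent with Table 7.3 B) illustrates the programmatic cleanup that uncontrolled entry forces on the analyst; real cleanup is rarely this simple.

```python
import pandas as pd

# Two of the messy columns from Table 7.3 A, for illustration.
raw = pd.DataFrame({
    "Height": ["68 in", "64", "160", "172 cm", "65 inches"],
    "Aortic Regurgitation": ["None", "0", "Trace", "Moderate", "2"],
})

def height_inches(value):
    """Guess the unit and convert to inches; heights > 90 are assumed cm."""
    number = float("".join(ch for ch in value if ch.isdigit() or ch == "."))
    if "cm" in value or "cent" in value or number > 90:
        return round(number / 2.54)
    return round(number)

# Mapping follows the 0-3 convention implied by Table 7.3 B.
AR_GRADE = {"none": 0, "0": 0, "trace": 0, "mild": 1, "1": 1,
            "2": 2, "moderate": 2, "3+": 3, "4+": 3,
            "moderate/severe": 3, "severe": 3}

clean = pd.DataFrame({
    "Height (in)": raw["Height"].map(height_inches),
    "AR grade": raw["Aortic Regurgitation"].str.lower().map(AR_GRADE),
})
print(clean)
```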
Accuracy of data entry is improved by recording only primary information (e.g., date of birth, date of operation) and not indices derived or calculated from them (e.g., patient age at operation, body surface area). Such indices can later be calculated quickly and reproducibly by computer.
A key concept is to record core data elements that can be logically combined in multiple ways to form derived variables (see Box 7.6 ). For example, for analysis one may want to use only the variable “current smoker.” If this were the variable gathered primarily, one would be unable later to derive other data about smoking, such as pack-years, duration of smoking, or when a previous smoker quit. Core data elements would instead relate to dates of smoking and intensity, from which all others could be derived.
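For example, a minimal sketch of deriving smoking variables from core data elements (the field names and dates are hypothetical; the dates and intensity are the primary observations, and everything else is computed):

```python
from datetime import date

# Core data elements (primary observations), hypothetical names:
smoking_start = date(1985, 6, 1)
smoking_stop  = date(2005, 9, 1)      # None if still smoking
packs_per_day = 1.5
index_date    = date(2019, 3, 16)     # date of operation (study entry)

# Derived variables, computed rather than collected:
current_smoker = smoking_stop is None or smoking_stop >= index_date
years_smoked = ((smoking_stop or index_date) - smoking_start).days / 365.25
pack_years = packs_per_day * years_smoked
years_since_quitting = (None if current_smoker
                        else (index_date - smoking_stop).days / 365.25)

print(current_smoker, round(pack_years, 1), round(years_since_quitting, 1))
```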
The process of gathering data is the most time-consuming step in a study. It is not unusual for it to consume months or years of work. Even if electronic sources of information are used, if values for variables were not entered at the point of care, the expense can be enormous. This is why the CPR, as the repository of all patient data and patient care workflow, is essential to increase the efficiency of clinical research.
For many institutions, identifying patients using a quality assurance registry for core data elements (see Chapter 8 , Quality Assurance) and extracting more detailed ancillary data for a specific study are the most cost-effective methods for clinical research. Although review of the medical record in this way is often considered a thankless chore, it has the benefit of an investigator gaining valuable in-depth insight into the patient cohort that is not captured by typical “case report forms” or routine registry capture. Indeed, the clinical importance of the research and the clinical inferences and practical recommendations coming from the research are often greatly enhanced by careful review of all or at least a substantial sampling of the medical records.
Time
The ability to manage time, that ubiquitous attribute of all medical data, is not part of any widely available information retrieval system (generally called query languages ). Some proposals have been tested in a limited fashion, such as the Tzolkin system developed at Stanford University, but the software is not generally available. The reason for needing to consider time is readily apparent: whenever we retrieve medical information along some time axis (e.g., sequence, duration, point in time), new logical relations must be generated to obtain reasonable results. For example, if we ask for all patients younger than 80 years who have undergone a second coronary artery bypass operation followed within 6 months by an MI, a number of time-related logical steps must be formulated. What is meant by patients younger than 80? Younger than 80 when? At the time of initial surgery, second surgery, MI, or at the time of the inquiry? The sequence of coronary artery bypass grafting (CABG) procedures must be ascertained from data elements about each procedure a patient has undergone. Information about the MI and its relation to the date of the second CABG must be retrieved. The process is even more complex if only approximate dates are available.
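A sketch of the temporal reasoning such a query demands, written here in plain Python over hypothetical per-patient event lists (a query language with native support for time would express this more directly; defining "younger than 80" at the time of the second CABG is only one of several possible choices):

```python
from datetime import date, timedelta

# Hypothetical event histories: (event_type, event_date) per patient.
patients = {
    "A": {"dob": date(1945, 5, 1),
          "events": [("CABG", date(2010, 2, 1)), ("CABG", date(2018, 6, 1)),
                     ("MI",   date(2018, 9, 15))]},
    "B": {"dob": date(1935, 1, 1),
          "events": [("CABG", date(2016, 3, 1)), ("MI", date(2016, 7, 1))]},
}

def qualifies(record):
    """Second CABG followed within 6 months by an MI,
    with age < 80 defined at the time of the second CABG."""
    cabg_dates = sorted(d for e, d in record["events"] if e == "CABG")
    if len(cabg_dates) < 2:
        return False
    second_cabg = cabg_dates[1]
    age_at_second = (second_cabg - record["dob"]).days / 365.25
    if age_at_second >= 80:
        return False
    window_end = second_cabg + timedelta(days=183)   # approximately 6 months
    return any(e == "MI" and second_cabg < d <= window_end
               for e, d in record["events"])

print([pid for pid, rec in patients.items() if qualifies(rec)])   # ['A']
```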
Perhaps a growing interest pertaining to the time axis in business may stimulate development of better tools for managing queries related to time in medical information.
Follow-up information
Time-related events occurring after hospital discharge are often extracted opportunistically from clinic visit records rather than by systematic patient contact. Patients not appearing for clinic visits are said to be untraced. This is an unacceptable method of follow-up.
Even with systematic methods, however, some patients cannot be traced, and in the United States, privacy and confidentiality regulations are making this task increasingly difficult. A high prevalence of untraced patients potentially introduces bias into the time-related analysis, leading to overestimating or underestimating survival or freedom from other events, and reduces effective sample size.
Active follow-up.
Active follow-up means that patients or their families are contacted directly by mailed questionnaire, telephone, or electronic means (e-mail, Internet). Active follow-up is essential for discovering time-related events such as reinterventions and longitudinal clinical condition such as periodic echocardiographic assessment, perhaps with the exception of vital status, which may be available from government sources. Active follow-up data, and particularly the date of last active follow-up, must be kept separate from any augmentation of these data from passive sources. If a patient is found to have died, nearest relatives are contacted in a sensitive, sympathetic fashion to document the circumstances of death and ascertain all other pertinent cardiac events that occurred between the date of last active contact and death.
There are two general methods of active follow-up: anniversary and cross-sectional.
Anniversary method.
In the anniversary method, patients are contacted yearly on the anniversary of their surgery or entry into the study (or periodically if not yearly). This method is ideal for sampling the time-varying condition of the patient (e.g., functional status, freedom from angina, health-related quality of life, growth and developmental patterns). It has the added advantage of maintaining yearly contact with the patient or patient’s family, an important consideration in a mobile society. It has also been demonstrated that nonlethal morbid events such as thromboembolism and hemorrhage after heart valve replacement are forgotten unless there is at least yearly contact. Yearly active contact also makes it more likely patients will report events to their physicians during the course of the year.
For practical purposes, “anniversary” follow-up may be batched, resulting in a graph depicting completeness of follow-up that consists of a series of stair steps ( Fig. 7.4 A). In that case, one may wish to truncate follow-up for a given patient at exactly his or her anniversary date of surgery, even if events have happened in the interval beyond their anniversary date.
FIG. 7.4 Completeness-of-follow-up event charts. In (A), representing anniversary follow-up, blue circles represent patients alive at last follow-up, and X’s represent deaths plotted at the time of death on the vertical axis. The stairstep of living patients along the diagonal represents year-end assessment, and thus the blue circles are stair-stepped across each year. The scattered blue circles represent transplant patients transferred to another center and lost to follow-up by the transplanting center. In (B), cross-sectional follow-up has been performed, so patients are more likely to lie along the diagonal line than to be stair-stepped at year end as with anniversary follow-up; patients would line up exactly on the diagonal if they were contacted on the exact anniversary of their surgery. In (C), a group of patients has been followed both cross-sectionally and at anniversaries at different times, showing stripes of incomplete follow-up. In (D), this same group of patients has been followed with a concerted follow-up effort, not only for vital status but for reoperations as well, something that cannot be assessed easily with passive follow-up. The blue circles lined up along 0 on the vertical axis represent patients who have not been followed after hospital discharge.
Cross-sectional method.
In the cross-sectional method, a specific follow-up inquiry of the patient cohort is initiated on a specific calendar date called the common closing date , with the goal of obtaining the status of all patients at a specific instant in time. In practice, of course, finite time is necessary to conduct the follow-up. For example, a cross-sectional follow-up may be initiated on August 1 and questionnaires returned over the ensuing 2 months. During this time, telephone calls may be made to non-responders or those whose questionnaires have been returned as undeliverable.
The status of every patient, including events observed, is that as of the common closing date (August 1 in this example). Any events occurring after the common closing date are ignored (censored). Patients still event-free by the common closing date will appear as a diagonal line on a completeness of follow-up graph ( Fig. 7.4 B).
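A minimal sketch of applying the common closing date (hypothetical dates; events after the closing date are censored and follow-up for event-free patients is truncated at that date):

```python
from datetime import date

CLOSING_DATE = date(2023, 8, 1)          # common closing date

# Hypothetical follow-up records: (surgery date, death date or None).
patients = [
    (date(2018, 3, 16), date(2022, 5, 2)),    # died before closing date
    (date(2019, 12, 12), date(2023, 10, 1)),  # died AFTER closing date
    (date(2020, 6, 1), None),                 # alive at follow-up contact
]

for surg, death in patients:
    if death is not None and death <= CLOSING_DATE:
        event, end = 1, death                 # death counted as an event
    else:
        event, end = 0, CLOSING_DATE          # censored at closing date
    follow_up_years = (end - surg).days / 365.25
    print(event, round(follow_up_years, 2))
```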
For patient condition (longitudinal data), a decision must be made about condition as of the closing date. This can be made clear on the follow-up form or via the telephone script.
Passive follow-up.
If vital status is the only outcome of interest, date of death may be obtainable from government vital statistics offices or a death registry. In passive follow-up, only death and date of death (which may be approximate) may be identified, not whether each individual in the study is alive or dead at the time of inquiry. Usually there is a lag between death and its reporting, so methods must be employed to determine the status of living patients at any given time. It is important to remember that nonfatal events cannot be determined by passive follow-up. If passive follow-up is used to supplement active follow-up, a separate date for end of active follow-up must be retained for analysis of nonfatal events. Passive follow-up can include use of office visits or hospital admissions as the sole source of follow-up; this is opportunistic, non-systematic follow-up. Non-systematic follow-up captures only some of the events and only part of the potential follow-up, so analyses of time-related events are distorted. Particular care must be taken if, for a clinical trial with yearly follow-up, deaths are required to be reported within a short interval, such as 48 hours; this will produce, for the last year of follow-up, numerators (deaths) with no denominators. It is avoided by truncating data for analysis at the preceding systematic follow-up.
Completeness of follow-up.
In performing active follow-up, every effort must be made to contact every patient in as short a time as possible. Special assistance may be required to achieve a high level of follow-up under these circumstances. In the past, we advised using cross-reference indices to former neighbors or contact of relatives and former physicians, churches, and other agencies. However, in the United States this is now prohibited.
There is no perfect way to describe and quantify completeness of follow-up. Fig. 7.4 illustrates one way of visually assessing completeness of follow-up for each patient. In Fig. 7.4 C, note that many patients are at the bottom of the graph, indicating that no follow-up has started. Other incompletely followed patients appear below the upper triangle of completely followed patients. Lack of follow-up for nonfatal events requires that these unfollowed patients be followed cross-sectionally, as shown in Fig. 7.4 D.
Grunkemeier and Starr have described a patient-year method for estimating goodness of follow-up based on observed versus potential follow-up duration. For each patient, the duration of potential follow-up is computed (this is the interval from study entry, such as surgery, until death or, for the patient still alive at follow-up, the common closing date or response date, anniversary date, or analysis date, depending on the type of follow-up study performed). The measure of completeness of follow-up is the ratio of total observed follow-up duration to total potential follow-up duration.
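A minimal sketch of this patient-year calculation (hypothetical intervals expressed in years; the same logic applies whatever index and closing events define the potential interval):

```python
# Observed and potential follow-up (years) for a hypothetical cohort.
# Potential follow-up runs from surgery to death or to the common
# closing date; observed follow-up stops at the last actual contact.
observed  = [5.0, 3.2, 0.0, 7.5, 2.1]    # 0.0 = never followed after discharge
potential = [5.0, 4.0, 6.0, 7.5, 2.1]

completeness = sum(observed) / sum(potential)
print(f"Completeness of follow-up: {completeness:.0%}")   # ~72%
```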
In highly lethal diseases, Korn suggests a different definition of potential follow-up, namely total patient-years or median follow-up as if no events had occurred. This is important in a setting where one has complete follow-up over 20 years, but median follow-up is only 1.5 years because of rapid demise of patients. It obviates the reader’s reaction that there is little follow-up or very incomplete follow-up, or that the therapy has only recently been introduced, all of which could be alternative explanations for short median follow-up.
Follow-up for longitudinal data.
Although follow-up for clinical events has dominated past cardiac surgery studies, longitudinal data are increasingly important outcomes. These include continuous data such as valve gradient, binary data such as recurrence of arrhythmias, and ordinal data such as grade of valvular regurgitation. Typically, longitudinal data represent snapshots at various time intervals, but the assessments are often made at somewhat irregular times, with a different number of measurements per patient depending on whether the operation occurred recently or many years ago.
At times investigators collect and assess values only at last follow-up. Such data are uninterpretable because the interval from surgery to last follow-up differs from patient to patient. Others set thresholds and, when a threshold is exceeded at the time of an assessment, analyze the data as if an event had occurred. This has several drawbacks: (1) it is uncertain when the value exceeded the threshold; all one knows is that it had not at the previous assessment; (2) longitudinal data are rarely static but vary from measurement to measurement, making the threshold a moving target; and (3) a great deal of detailed data is lost. As will be seen under “Longitudinal Outcomes” in Section IV, good methods are available today for analyzing longitudinal data.
When collecting longitudinal data, it may seem logical to string the values in a single row of a spreadsheet, as seen in Table 7.4 A. However, this creates variable-length records that are difficult to analyze. Instead, collect these repeated measures in what is known as “long format,” illustrated in Table 7.4 B. Doing so results in many records per patient, but this format, known as many-to-one, is ideal for analysis. Tables of such outcomes can be linked to the “one” static record of patient demographics, comorbidities, and surgical procedure in a relational database model. A sketch of converting the wide layout of Table 7.4 A to the long layout of Table 7.4 B follows the table.
TABLE 7.4
Longitudinal (Repeated Measures) Data
| A | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Patient_ID | Surg_dt | Race_Wh | Female | AR_base | EF_base | Echo_dt1 | Ef_po1 | Echo_dt2 | Ef_po2 | … |
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 2/3/12 | 75 | 7/6/12 | 70 | … |
| 2 | 6/1/12 | 0 | 1 | 3 | 55 | 9/1/12 | 65 | 12/12/12 | 60 | … |
| B | ||||||||||
| Study_ID | Surg_dt | Race_Wh | Female | AR_base | EF_base | Echo_dt | Ef_po | |||
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 2/3/12 | 75 | |||
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 7/6/12 | 70 | |||
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 12/18/12 | 75 | |||
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 1/28/14 | 65 | |||
| 1 | 1/1/12 | 1 | 0 | 4 | 65 | 2/7/15 | 70 | |||
| 2 | 6/1/12 | 0 | 1 | 3 | 55 | 9/1/12 | 65 | |||
| 2 | 6/1/12 | 0 | 1 | 3 | 55 | 12/12/12 | 60 | |||
| 2 | 6/1/12 | 0 | 1 | 3 | 55 | 3/1/13 | 55 | |||
| 2 | 6/1/12 | 0 | 1 | 3 | 55 | 4/14/13 | 45 |
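The sketch below converts the wide layout of Table 7.4 A to the long format of Table 7.4 B using pandas; the column names follow the table, and pd.wide_to_long assumes the numbered-suffix convention (Echo_dt1, Ef_po1, …) shown there.

```python
import pandas as pd

# Wide layout of Table 7.4 A: one row per patient, repeated measures
# spread across numbered columns Echo_dt1/Ef_po1, Echo_dt2/Ef_po2, ...
wide = pd.DataFrame({
    "Patient_ID": [1, 2],
    "Surg_dt":    ["1/1/12", "6/1/12"],
    "EF_base":    [65, 55],
    "Echo_dt1":   ["2/3/12", "9/1/12"],
    "Ef_po1":     [75, 65],
    "Echo_dt2":   ["7/6/12", "12/12/12"],
    "Ef_po2":     [70, 60],
})

# Long ("many-to-one") layout of Table 7.4 B: one row per echo per patient;
# the static columns (Surg_dt, EF_base) are carried along with each record.
long = (pd.wide_to_long(wide, stubnames=["Echo_dt", "Ef_po"],
                        i="Patient_ID", j="echo_number")
          .reset_index())
print(long)
```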
TABLE 7.5
Use of Symbols of Inequality: Illustration with P-Values and Their Interpretation

| P ≥ | P < | Interpretation of Null Hypothesis | Inferences About the Difference |
|---|---|---|---|
| | .05 | Almost certainly not true | Unlikely to be due to chance |
| .05 | .1 | Probably not true | Probably not due to chance |
| .1 | .2 | Possibly not true | Possibly not due to chance |
| .2 | | Nearly certainly true | Likely to be due to chance |
Follow-up instrument.
The follow-up instrument may be a simple questionnaire mailed to the patient (with one or two remailings followed by telephone calls to non-responders). If so, it is wise to not exceed a single sheet in length, relying on the telephone or personal contact to obtain more details if events have occurred. Alternatively, the inquiry may be completed by telephone, using well-trained individuals, a script, and a form that is filled out during the conversation, by electronic means coupled with the patient’s medical record, or by other electronic mechanisms.
Patients rarely resent being followed up; to the contrary, periodic follow-up is useful not only in detecting medical trends in individual patients who may need attention, but also in generating good will between the patient and the medical system.
Verify collected information
No matter what method of data export or extraction is used, experience dictates that one or more iterative data verification steps must be inserted before any analyses are performed. This can take four general forms: (1) value-by-value checking of recorded data against primary source documents, (2) random quality checking, (3) automatic reasonableness checking, and (4) comprehensive graphical review. If a routine activity of recording core data elements is used, it is wise to verify each element initially to identify those that are rarely in error (these can be “spot checked” by a random process) and those that are more often in error. The latter are usually a small fraction of the whole and are often values requiring interpretation. These may require element-by-element verification.
This process is long and tedious and can be boring. It usually reveals many errors, but it also allows review of each patient’s record to detect missed information. Data are then checked for reasonableness, a process greatly aided by computer, as described in Section III under “Screen and Scrub” and “Descriptive Data Exploration.” Discrepancies and inaccurate outliers are found and the data corrected. Importantly, if data errors are found, they should be corrected in the primary information repository and the data re-exported. This policy ensures upgrades of repository quality with each study.
Although a controversial statistical point, data should probably not be “doctored” by rejecting outliers unless there is a reason to suspect they are less reliable than values obtained for other patients. It is also useful to list all patients with missing data so that renewed efforts can be made to obtain them. Finally, a check is made to detect possible duplicate records, particularly if the same patient was the subject of a preceding study or if data were extracted from an electronic source in which patients may be entered multiple times.
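A minimal sketch of the duplicate and missing-data checks just described (pandas; the identifying columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "pt_id":     [1, 2, 2, 3, 4],
    "surg_date": ["2019-03-16", "2019-12-12", "2019-12-12", "2012-03-01", None],
    "ef":        [55.0, None, 40.0, 62.0, 58.0],
})

# Possible duplicate records: same patient and same operation date.
dups = df[df.duplicated(subset=["pt_id", "surg_date"], keep=False)]
print(dups)

# List patients with any missing values so a renewed search can be made.
missing = df[df.isna().any(axis=1)]
print(missing[["pt_id"]])
```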
Section III: Data
Data consist of organized information. We add the following further constraints.
First, data consist of values for variables . These values have been selected from a list of all possible values for a variable, and this list is part of a constrained vocabulary. The constrained vocabulary may consist of numbers, “yes” or “no,” ordered lists (none, mild, moderate, severe), nonordered lists (such as names of diseases), calendar dates, and so forth but not unconstrained free text.
Second, data consist of values for variables that have been accurately and precisely defined both at the level of the database and medically. One of the important benefits of multicenter randomized trials, concurrent observational studies, and national registries is that these activities require establishing agreed-upon definitions at the outset. Coupled with this is often intensive and ongoing education of study coordinators and other data-gathering personnel about these definitions, exceptions, and evolution of definitions and standards. There is a mechanism to monitor compliance with these definitions and standards throughout the study, and the same should hold true for any registry. A mechanism to ensure similar adherence to definitions is essential even for individual clinical studies. Further, documentation must be in place to identify dates on which changes in definition have occurred, and these must be communicated to the individuals analyzing the data (generally, indicator variables are created that “flag” cases for which definitions of an individual variable have changed). The rigor of establishing good definitions is considered distasteful by investigators who are impatient to collect data, but it is essential for successful research. It is also somewhat of an iterative process, which is why we suggest extracting data on the basis of initial definitions for a few patients scattered over the entire time frame of the study, then refining the definitions. One must also be aware of standards developed by national and international groups of cardiac surgeons and cardiologists assembled for this purpose.
Third, data consist of values for variables that have been organized , generally using a database management system, into a database or data set(s). At the present time, data analysis procedures presume that data will be organized in a fully structured format, generally a relational database. In such a database, tables with columns for each variable and rows for each separate patient contain static information (e.g., demographics, past medical history), but other tables linked to that table may contain multiple entries of longitudinal and follow-up data for patients, with a key that links that table to individual patients in a many-to-one fashion. It is our view that the fully structured organization of data, probably in relational database format (see “ Relational Information Model ” in Section II), should be imposed only at the point of extraction of values for variables from information (often called the “export” phase in a process termed rectangularization ). This allows the input of data to be semistructured (see “ Semistructured Information Model ” in Section II), maximally flexible, and with few imposed organizational constraints (outside of retrievability), so that relations among variables are imposed by the research question being asked and not by a priori database constraints. Thus, we advocate for “self-documenting” primary databases such as registries, with Yes (or Y), No (N), or Unknown (U) so there is no mistake as there can be with assigning a numerical value to these quantities.
Information to data
An idealized, linearized perspective on the process of transforming clinical information to data suitable for analysis requires three broad steps: (1) formulating a clinical research proposal that leads to identifying a suitable study group (see “ Research Proposal ” in Section I), (2) gathering proposed variables and values that lead to an electronic data set, and (3) manipulating the values and variables to create a data set in a format suitable for analysis. This is a linear process in theory only. In reality, it contains checks that cause the investigator to retrace steps.
Data conversion for analysis
An often underappreciated, unanticipated, and time-consuming effort is conversion of data elements residing in a database to a format suitable for data analysis. Even if the day comes when all medical information is recorded as values for variables in a computer-based patient record, this step will be unavoidable. Statistical procedures require data to be arranged in “columns and rows,” with each column representing values for a single variable (often in numeric format), and each row either a separate patient or multiple records on a single patient (many-to-one, as in repeated-measures longitudinal data analysis).
An important activity is managing sporadic missing data. If too many data are missing, the variable may be unsuitable for analysis (see “ Managing Missing Values ” later in this section). Otherwise, missing-value imputation is necessary so that entire patients are not removed from analyses, the default option in many analysis programs.
Analysis data set
The last step in transforming information into analyzable data is creating one or more analysis data sets in a format compatible with analytical procedures. This step includes (1) manipulating variables and values, (2) screening and scrubbing data, (3) imputing missing values, and (4) organizing a set of analysis variables into medically meaningful categories. The general process has been described under “Technique for Successful Clinical Research” in Section I.
Manipulate variables and values
Time intervals.
To interpret patient information meaningfully, an index time for study entry must be established for every patient. The reason is the central place of date:time in medicine. Thus, in a system of longitudinal data entry at the time of patient care, past, present, and future are defined in terms of this index time. For surgeons, fortunately, this is often the time of an operative procedure. It is more difficult in medical situations to define, say, onset of disease; often, date of diagnosis or patient encounter is used. Once index time for study entry has been determined, then, using dates for each data element, one can determine if there have been previous events, such as MIs, how many of these have occurred, and the interval from the most recent to the index time. All items in what we commonly think of as “past medical history” are defined in terms of this index time for study entry.
One of the most common requirements is to compute intervals between dates. For example, the age of a patient at index time is calculated from index time and date of birth. Follow-up intervals are similarly computed from dates. A common error is attempting to manually calculate intervals between dates and index time. This is rarely accurate and should be done by computer.
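For instance, a minimal sketch of computing age at index time and follow-up intervals from dates rather than by hand (dates are hypothetical):

```python
from datetime import date

date_of_birth = date(1952, 5, 21)
index_date    = date(2013, 2, 4)      # date of operation (study entry)

# Age in completed years at the index date, computed rather than hand-calculated.
age = index_date.year - date_of_birth.year - (
    (index_date.month, index_date.day) < (date_of_birth.month, date_of_birth.day))
print(age)   # 60

# Follow-up interval in years from index date to an event or censoring date.
follow_up_years = (date(2019, 8, 1) - index_date).days / 365.25
print(round(follow_up_years, 2))   # 6.49
```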
Indicator variables.
Indicator variables are always required. These may simply be the translation of a variable whose value has been coded as YES or NO into the numbers 1 and 0, respectively. A cardinal rule to avoid ambiguity, human error, and misinterpretation is that the computer name of an indicator variable must always state the condition indicated by YES or 1. Thus, in forming an indicator variable from a primary variable called SEX with values of MALE and FEMALE, one would name the indicator variable MALE, with values of 1 for male and 0 for female. An indicator variable called SEX or GROUP is ambiguous and should not be used as an analysis variable, because it is not self-documenting.
Another common requirement is to form multiple indicator variables from a single variable containing values from a nonordered list. These list variables often are represented by a set that allows selection of multiple items from the list. Typical list variables are diagnoses or type of operation. A nonordered (polytomous) list variable is not interpretable in many types of data analysis (the exception is mutually exclusive lists, for which polytomous methods are useful; see “ Logistic Regression Analysis ” in Section IV). Generally, a variable useful for data analysis from such list variables must take on at least one of three values: 0 (NO), 1 (YES), or blank (MISSING). However, in many cases the medical picklist can be so long that in general clinical practice, only positive findings are recorded and all the rest (e.g., thousands of possible diagnoses) “dismissed,” sometimes called “coding by exception.”
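A sketch of forming both kinds of indicator variables (pandas; the variable names are illustrative, and unselected list items are converted to 0, the very assumption discussed next):

```python
import pandas as pd

df = pd.DataFrame({
    "SEX":       ["MALE", "FEMALE", "MALE"],
    "DIAGNOSES": [["CAD", "AS"], ["MR"], ["CAD"]],   # nonordered list variable
})

# Self-documenting indicator: the variable name states what 1 means.
df["MALE"] = (df["SEX"] == "MALE").astype(int)

# One indicator per list item; items not selected become 0
# ("coding by exception" is assumed to mean absence).
diag = (pd.get_dummies(df["DIAGNOSES"].explode())
          .groupby(level=0).max()
          .astype(int))
df = df.join(diag)
print(df[["MALE", "CAD", "AS", "MR"]])
```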
When it comes to data analysis, however, we often need indicator variables for all list items that identify more than YES (or 1). Generally, the assumption is made that if a list item is not selected, the patient does not have that condition or procedure. However, it is possible that they did, but (1) the list item was added recently and so was not collected at index time, (2) the data abstractor could not find the item, forgot to look for it, or was distracted and did not return to find it, or (3) the item was recorded in a previous clinical record that was not retrieved by the criterion used to gather the electronic data set.
Such ambiguities can be avoided to some extent for a particular discipline by recording and abstracting important data elements using individual values for variables. For example, one may wish to have unambiguous information on the comorbidities diabetes (and its treatment); preoperative dialysis for chronic renal failure; prior MI (perhaps with date or a count of the number experienced); prior stroke; carotid artery and peripheral artery disease; chronic pulmonary disease; and the like. Rather than making them items in a long list of variables, create individual variables whose values of YES or NO must be explicitly entered. An alternative is to have multiple missing value indicators, representing by default “not yet found,” but then including the extremes of “pending” and “completely absent,” “don’t know,” “lost chart,” “illegible,” “invalid response,” “refused to answer,” and “not applicable” (e.g., child relation of parent fields to which the answer is NO, or all smoking history variables for a person who never smoked).
No matter the coding chosen, it is highly likely that there will be parent–child relationships, explicit or implied, that will lead to pseudo-missing data. For example, a child of diabetes = YES may be a checklist of pharmaceuticals for treating diabetes. If none is checked, it may be because diabetes = NO; one should then impute a 0 for each of the pharmaceuticals. Similarly, if only insulin, for example, is checked, the others should be set to zero. Depending on the construct schema of the database used for data abstraction, this can be an easy or difficult task, but what is always required is to think about smart imputation when you encounter missing (“null”) values in a data set.
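A minimal sketch of that kind of parent-aware ("smart") imputation, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "diabetes":   [0, 1, 1],
    "insulin":    [None, 1, None],
    "oral_agent": [None, None, None],
})

drug_columns = ["insulin", "oral_agent"]

# If the parent is NO, the children cannot be YES: impute 0, not missing.
df.loc[df["diabetes"] == 0, drug_columns] = 0

# If the parent is YES and at least one drug was checked, unchecked drugs
# were skipped by "coding by exception" and can also be set to 0.
checked_any = df[drug_columns].notna().any(axis=1)
mask = (df["diabetes"] == 1) & checked_any
df.loc[mask, drug_columns] = df.loc[mask, drug_columns].fillna(0)

print(df)   # the third patient (diabetes=1, nothing checked) stays missing
```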
Naming variables.
We advocate for a disciplined common naming convention for variables that is used across all studies, as opposed to an ad hoc approach. These common names are defined in a data dictionary or ontology. In our clinical research unit, two-letter prefixes and suffixes are used in part to naturally organize variables and for procedural studies to indicate temporal relations. For example, medical history names have the prefix hx, time intervals iv, surgical procedures sp, transcatheter procedures tp, and postoperative po.
With this disciplined approach, other data analysts, future research fellows, and investigators can readily understand the analysis data sets.
Screen and scrub.
After intervals and indicator variables are created, data screening is performed. If negative intervals are found, dates or times need to be investigated and corrected in original sources. Impossible combinations of variables may be found, such as a “normal” aortic valve said to have a 100-mmHg peak gradient and a valve area less than 1 cm². Parent–child relations are verified, particularly if the database has inadvertently overlooked constraining such relations. For example, specifying an aortic valve prosthesis should be a child variable of a parent variable for “aortic valve replacement.” Inadvertent redundancies must be resolved if they disagree.
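A sketch of automated screening for negative intervals and impossible combinations (pandas; column names and thresholds are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "surg_date":  pd.to_datetime(["2019-03-16", "2019-12-12"]),
    "death_date": pd.to_datetime(["2019-02-01", None]),        # before surgery!
    "av_status":  ["normal", "stenotic"],
    "av_gradient_mmHg": [100.0, 85.0],
})

# Negative intervals point to date errors in the source documents.
negative_interval = (df["death_date"] - df["surg_date"]).dt.days < 0
print(df[negative_interval])

# Impossible combinations, e.g., a "normal" valve with a high peak gradient.
impossible = (df["av_status"] == "normal") & (df["av_gradient_mmHg"] > 20)
print(df[impossible])
```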
Inconsistencies in applying definitions from the data dictionary are reported to the investigation team for resolution, and the iterative process is repeated. This process is often discouraging to the novice investigator who assumes that all data (particularly those personally extracted) are flawless.
Just as important as improving accuracy of the data set is evaluating the quality of each variable by this screening and scrubbing process. One may find that information is too often unavailable in medical records to trust the variable, and it is dropped. One may also find that interpretation of the clinical condition has been so variable, such as heart failure, that the values gathered are not reproducible. Either a better surrogate has to be found or the data element dropped from further consideration.
In times past, continuous-variable data screening has consisted of producing summary tables of means, standard deviations, medians, minima, maxima, selected percentiles, and numbers of missing values; categorical-variable data screening has consisted of tables of values and missing-data counts. We have found these of limited value in discovering “warts” in the data. What we now do is produce pages of “postage stamp” scattergrams of every continuous variable that include all data points, usually plotted against date of operation, with rug marks for missing values ( Fig. 7.5 ). These may reveal “outliers” that may be data entry errors, floor or ceiling effects, or an admixture of different units of measurement that need to be normalized to one measurement scale. For binary variables, we recommend two types of “postage stamp” pages: one of actual counts ( Fig. 7.6 A) and a second of percentages ( Fig. 7.6 B). These may show missing data, gaps in the data ( Fig. 7.6 C), different patterns of data collection, and many other anomalies that are not evident if only numbers and percentages are shown in a table. Ordinal data are displayed as stacked plots ( Fig. 7.7 ).
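A sketch of one such “postage stamp” panel with matplotlib (synthetic data; a real screening script would loop this over every continuous variable and assemble the panels into pages):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
years = rng.uniform(0, 10, 300)                   # years since time zero of surgery
bmi = rng.normal(28, 4, 300)
bmi[rng.choice(300, 5, replace=False)] *= 2.2     # a few unit-mix "outliers"
missing = rng.choice(300, 20, replace=False)      # simulate missing values
bmi[missing] = np.nan

fig, ax = plt.subplots(figsize=(3, 3))            # one "postage stamp" panel
ax.plot(years, bmi, ".", markersize=3)
# Rug marks near the horizontal axis for patients with missing values.
ax.plot(years[missing], np.full(missing.size, np.nanmin(bmi)), "|", color="black")
ax.set_xlabel("Years since time zero")
ax.set_ylabel("BMI")
plt.tight_layout()
plt.show()
```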
FIG. 7.5 Sample of “postage stamp” plots of raw continuous variables as a function of date of surgery (here the number of years after time zero of surgery). The “rug marks” (short vertical lines on the horizontal axis) represent missing values. The graphs demonstrate “outliers” for variables such as BMI, BUN, LV mass, and LV mass index that should be investigated (they may be due to differences in measurement units such as centimeters and inches, kilograms and pounds, SI versus conventional units, coding errors, etc.). In addition, this kind of plot can reveal “block missingness” for laboratory measurements that did not exist until well into the study or variables not collected before a certain time. Ceiling or floor effects may also be evident.
FIG. 7.6 Sample of “postage stamp” plots of raw binary variables as a function of date of surgery (here the number of years after time zero of surgery). Blue portions of bars indicate presence of the variable named in the chart title, and reddish areas represent its absence. Gray portions of the bars represent missing data. In (A), actual frequencies of values are given, which reflect changing denominators over time. In (B), the data are presented as percentages. Both depictions are useful. Note that if a variable is spuriously absent, as with Native AV, the incomplete data require investigation. Some variables may abruptly change in magnitude at some point in time, which may be due to a change in their definition or method of measurement. Some variables, such as robotic approaches or TAVR, appear only when they are introduced; in that case, rather than missing values, the variable should be set to zero.
FIG. 7.7 Sample of “postage stamp” plots of raw ordinal variables as a function of date of surgery, presented as stacked plots. These show the distribution of grade of mitral valve regurgitation in raw numbers (left) and as percentages (right). Note that the gray portion of the bars, representing missing data, decreases over time.
In addition to these individual data checks, we routinely plot height versus weight to find “skinny giants” or “fat midgets,” along with graphs of other highly correlated variables. These drive us back to data sources to determine what the problem is and correct it or, if the cause appears to be an error in clinical recording, to set the value to missing.
It is important that corrections of errors discovered in the data anywhere along the way be fed back to the original database and the data re-exported. In this way, database quality will be constantly improved. Ideally, the change in the primary data will be documented by an audit trail.
Managing missing values.
In addition to missing values from parent–child relationships, in any study there are likely to be values for some variables that have not been recorded. Most statistical procedures eliminate entire observations (e.g., patients) for which any data requested for analysis are missing. In medical data analysis, however, one is more likely to introduce bias by eliminating all data on an entire patient than by substituting a value for the missing data that can be shown not to importantly bias the analysis. The process is called missing value imputation .
Although the literature on managing missing data is extensive, much of it is directed toward survey investigations in which entire survey instruments have not been returned. The general directive for such data is to eliminate records for nonresponders. In clinical research, missing data are most commonly sporadic or systematic (block missing) for some specific time segment (e.g., missing magnetic resonance imaging data before the introduction of that technology). These common types of missing data should be managed in a different way from those of surveys.
Sporadic missing values.
For sporadic values missing in a tiny proportion of patients, it is reasonable to substitute (impute) the mean value for all patients with nonmissing data (called noninformative imputation ). Thus, if 1% of patients are missing values for ejection fraction, the mean value may be substituted.
If there are at least five outcome events associated with patients having sporadic missing values for a variable, a dichotomous (0,1) missing value indicator is created and forced into all models in which the primary variable is incorporated. If the indicator variable is not statistically significant, it is likely that the imputation has been noninformative with respect to outcome. If it is significant, the indicator variable both adjusts for this and serves as a warning that additional work must be done, such as use of informative imputation.
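A sketch of noninformative imputation with a missing-value indicator (pandas; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ef": [55.0, None, 40.0, 62.0, None, 58.0]})

# Missing-value indicator, created before any imputation.
df["ef_missing"] = df["ef"].isna().astype(int)

# Noninformative imputation: substitute the mean of the nonmissing values.
df["ef"] = df["ef"].fillna(df["ef"].mean())
print(df)

# Both ef and ef_missing are then forced into any model containing ef;
# a significant ef_missing coefficient warns that informative or
# multiple imputation is needed instead.
```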
Informative imputation capitalizes on redundancy in medical information. A multivariable equation is generated (see “ Multivariable Analysis ” in Section IV) for the variable of interest but using only patients for whom values are not missing. A value is predicted from this equation for the patient with the value missing, and this is the imputed value. Missing value indicators are just as germane for informative as for noninformative imputation.
Yet another strategy is multiple imputation . Briefly, a set of randomly chosen values is used for imputing missing values for each patient and the analysis is performed, followed by another set of values and another analysis. This process may be repeated as many as 200 to 1000 times, and the many analyses summarized. More commonly, an initial investigation data set is constructed and used for preliminary model building (see “ Multivariable Analysis ” in Section IV). This is followed by applying that preliminary model to additional imputed data sets and aggregating the results.
When to impute missing values by whatever method is chosen turns out to be important: transform, then impute. As noted in “Calibration of Continuous Variables” under “Risk Factor Identification” in the Section IV discussion titled “Multivariable Analysis,” the scale of continuous variables may have to be transformed to meet model assumptions (linearizing transformations ). It is important that transformations first be performed, followed by missing value imputation, as documented by von Hippel. Note that one advantage of nonparametric machine learning algorithms such as random forests is that such transformations of scale need not be done, and, indeed, partial dependency plots will reveal any nonlinearities (see “ Classification Using Machine Learning ” in Section IV).
For some forms of nonparametric machine learning, specialized types of missing value imputation are used, often particularly focused on 100% missing values, such as social media clues used by retail establishments to identify individuals for selective person-specific advertising.
Systematic missing values.
Systematic missing values occur under two conditions that can be managed similarly. First, a value may be inapplicable. For example, in a study of mitral valve surgery, values for various repair techniques are inapplicable to patients receiving a prosthesis. Second, some test may come into use partway (in calendar time) through a study, or information may not have been collected about some variable until a certain calendar date. For such patients, we suggest that missing data be managed as “interaction terms.” By this we mean that systematic missing values be set to zero (0). Then a missing value indicator is generated: 1 for patients with systematic missing values and 0 otherwise. Both variables are linked in all analyses. This makes interpretation of the models realistic, although it is a strategy that is computationally close to noninformative missing value imputation. A drawback is that the missing value indicator may become an unrecognized surrogate for temporal trends in the data if the block missing data are concentrated among patients early in the study.
Organize variables for analysis.
Once the aforementioned steps have been achieved, often iteratively, the result is a final data set in the format needed for analysis. However, one further step remains: organizing variables deemed suitable for analysis in a medically meaningful way. The reason for this is the importance we place on informed and supervised data analysis . Those analyzing the data must “know the data” just as the investigator knows the data. Not every variable has equal importance for analysis. For example, quantitative ejection fraction is “better data” than a qualitative assessment of left ventricular function on a coarsely graded scale; creatinine level at surgery contains more data than a diagnosis of renal failure; individual components of cusp morphology in atrioventricular septal defect contain higher information content than Rastelli type.
Not every variable is of equal reliability, but medical information tends to be redundant, so more reliable surrogates should be sought and analyzed. Many variables are highly correlated and may be of equal reliability, such as height, weight, body surface area, and body mass index (indeed, the latter two are calculated from the former two). Therefore, if such variables are equally associated with an outcome, and are naturally colinear, one may arbitrarily select the most reliable or easily measured representative of that concept (in this example, the concept of body size).
The data management team, in collaboration with the investigator, must then compile a final list of analyzable variables . These should be grouped in a medically meaningful fashion that aids data analysis. A suggested grouping might be as follows, although it will vary from study to study:
• Demographics (age, sex, body size, social determinants of health)
• Symptoms (functional status, angina class)
• Ventricular function (ejection fraction, number of previous MIs, interval from last MI to surgery)
• Pathophysiology and etiology (grade of mitral valve regurgitation, etiology of valvular regurgitation)
• Coronary artery anatomy and disease (degree of left main disease and that of each coronary system, dominance)
• Other cardiac comorbidity (previous cardiac operations, atrial fibrillation)
• Noncardiac comorbidity (smoking history, creatinine, pulmonary disease, diabetes, albumin level, organized by organ system)
• Preoperative management (preoperative temporary mechanical circulatory support for hemodynamic instability, intravenous nitroglycerin for unstable angina)
• Cardiac procedure (CABG, aortic valve replacement, mitral valve repair)
• Support techniques (duration of aortic clamping, use of warm substrate-enhanced induction cardioplegia, duration of circulatory arrest, duration of cardiopulmonary bypass)
• Experience (date of operation, surgeon)
• Outcome, in-hospital events (occurrence of various complications, hospital death, length of postoperative stay)
• Outcome, time-related events (all-cause mortality, interval from surgery to death or censoring)
• Longitudinal data (echocardiographic findings after valve operations)
• Missing value indicator variables
• Interaction terms (organized according to previous schema)
A practical way to implement this organizational structure is to isolate programming code in the form of a computer macro that contains the list of available variables for analysis and a place for those analyzing the data to insert code for imputing missing values, transforming the scale of variables, forming additional indicator variables, and performing other useful data manipulations. This strategy guards against human error in data analysis by isolating to a single location all data manipulation used in the analysis.
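In a scripting environment, the same idea can take the form of a single function through which every analysis data set passes; the sketch below is illustrative, and the specific manipulations and variable names are placeholders rather than a prescribed scheme.

```python
import pandas as pd

# Single location listing the variables available for analysis.
ANALYSIS_VARIABLES = ["male", "age", "ef", "ef_missing", "hx_diabetes"]

def prepare_analysis_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Isolate all data manipulation used in the analysis in one place."""
    out = df.copy()
    out["male"] = (out["sex"] == "MALE").astype(int)     # indicator variable
    out["ef_missing"] = out["ef"].isna().astype(int)     # missing-value indicator
    out["ef"] = out["ef"].fillna(out["ef"].mean())       # noninformative imputation
    out["age"] = out["age"].clip(upper=100)              # example transformation
    return out[ANALYSIS_VARIABLES]

example = pd.DataFrame({"sex": ["MALE", "FEMALE"], "age": [63, 71],
                        "ef": [55.0, None], "hx_diabetes": [0, 1]})
print(prepare_analysis_dataset(example))
```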
Descriptive data exploration
After the analysis data set has been constructed, data are explored by producing simple descriptive tables (sorting and tallying) and simple statistics about continuous variables, scatterplots of variables, and other exploratory data analyses. To understand this process, some appreciation of numeric data is necessary.
Numbers
Accuracy and precision.
Because both calculators and computers express numbers to many digits ( Box 7.7 ), it is necessary to know a set of rules for compaction and expression (display) of numeric data. The format in which a numeric value is expressed has implications. The number 493, for example, implies that the truth is somewhere between 492.5 and 493.5 (accuracy), and that the scatter in repeated measurements of the number (precision) is no greater than that explicitly expressed ( Box 7.8 ). The number 492.8 implies that the truth is somewhere between 492.75 and 492.85, and 492.76 implies that the truth is somewhere between 492.755 and 492.765. Each additional digit to the right (the right-most digit) thus implies greater accuracy and less scatter (greater precision) than when the number is expressed as 493 or 490.
• BOX 7.7
Expressing Numbers
Digit
A digit is one of the 10 Arabic number symbols, 0 through 9. Digits are also called numbers , numerals , or integers .
Number
Although number and digit are synonyms, a number is more generally applied to a series of digits, separators (commas, decimal points), and other notations (see “ Scientific Notation ” in this box) that together represent a numeric quantity.
Even numbers
The Arabic numerals beginning with 0, 2, 4, 6, 8…. These are numbers divisible by 2 without a remainder.
Odd numbers
The Arabic numerals beginning with 1, 3, 5, 7, 9…. These are numbers divisible by 2 with a remainder of exactly 1.
Decimal format
Most numbers in scientific work are expressed in decimal format—that is, in a numeric system based on multiples of 10 as the fundamental unit, called the base of the numbering system. Other systems were prominent in antiquity, such as base 60 in Babylonian times, existing now only as the basis for clock time. Bases other than 10 are used in computer systems, such as base 2, 8, or 16. Yet others have been suggested as having better arithmetical properties. However, the fact that humans have 10 fingers has played a dominant role in popularizing the decimal system.
In decimal notation, each place is a multiple of 10 and is named. A symbol known as the decimal point (a period is used in the United States and throughout this book) separates what is known as the units or ones place from the “tenths” (1/10) place. Whole numbers are to the left of the decimal point and fractional ones to the right.
To the left of the decimal point, the first place is called the units (or ones) place, the second the tens place, the third the hundreds place, and the fourth the thousands place. Commonly, a separator is inserted for each multiple of 1,000 (a comma in the United States, and period in Europe). In the number 1,234, the 4 is in the units place, 3 in the tens place, 2 in the hundreds place, and 1 in the thousands place. Another way to express this number is as the sum of 4 + 30 + 200 + 1,000.
To the right of the decimal point, the first place is called the tenths (1/10) place, the second the hundredths (1/100) place, and the third the thousandths (1/1000) place. Thus, in 0.1234, the 1 is in the tenths place, 2 in the hundredths place, 3 in the thousandths place, and 4 in the ten-thousandths place. Another way to express this number is as the sum of 0.0004 + 0.003 + 0.02 + 0.1.
Decimal place
In the decimal system, decimal place is the position of digits immediately to the right of the symbol designating the decimal point. Location of the decimal point reflects the scale of measurement and is unrelated to significant digits.
Significant digit
Digits of the decimal form of a number beginning with the leftmost nonzero digit and extending to the right, with the implication that all digits to the right are significant digits . That is, they are warranted either by inherent properties of the measuring device used to generate the numbers or by statistical properties of a collection of such numbers. Examples of 2 significant digits are 24,000, 240, 0.24, 0.0024, and 0.0000024. These examples illustrate that the position of the decimal point is irrelevant when expressing significant digits.
Scientific notation
A method of expressing (displaying) numbers as a single digit from 1 to 9, followed by a decimal point and the remaining significant digits, if any, multiplied by a power of 10. For example, 0.00037 in scientific notation is 3.7 × 10⁻⁴, where 10⁻⁴ is 0.0001. In general, the numeric value, here 3.7, is called the mantissa , 10 is called the radix , and −4 is called the exponent. In the examples under “Significant Digit,” the largest would be expressed as 2.4 × 10⁴ and the smallest as 2.4 × 10⁻⁶.
Leading zero
Zero placed before a decimal point that is not considered a significant digit. It is generally used (1) when it is implied that a nonzero significant digit could replace it, or (2) to separate a negative sign (−), a positive sign (+), or a plus or minus sign (±) from the decimal point. Increasingly, numbers that are constrained to the range 0 to 1, such as probabilities (including P values), are expressed (displayed) without a leading zero.
• BOX 7.8
Accuracy Versus Precision
Accuracy
Absence of systematic error of measurement (bias) from the “truth.” It is an expression of “rightness.”
Precision
Ability to provide the same answer in repeated measurements. It is an expression of “exactness.”
These terms are often interchanged, but in data analysis they are not synonymous! Repeated measurements of Po₂ in a blood gas machine may have a great deal of scatter on repeated readings (imprecise), but their average value may reflect faithfully the true Po₂ (accurate). Another blood gas machine may yield Po₂ with little scatter in repeat readings (precise), but may be uncalibrated, so the readings are inaccurate (biased). There is often a trade-off between accuracy and precision in medical measuring instruments.
Rounding.
In computation and computer storage, all available digits of numbers displayed or recorded by measuring devices should be retained. It is only at the last step of numeric presentation that numbers are rounded ( Box 7.9 ). In presenting numeric information, numbers should be rounded in such a way as to reflect their precision or reproducibility, although consistency within tables is also important.
• BOX 7.9
Rounding Numbers
Certain generally agreed-upon conventions for rounding numbers exist, although they are not easily found in print.
Step 1: Determine the number of digits to save
This is suggested by precision of the measuring instrument for individual numbers and by the standard error of the mean value or proportion associated with a series of numbers (see later Box 7.10 ). For the latter, the place of the first significant digit of the standard error is found, and the mean or proportion is then rounded to that place. The same place is saved in confidence limits. If the standard error is also being expressed, one additional place is saved in it (because the usual ± expression of the standard error is a form of shorthand, and saving the extra place helps in using the standard error to calculate confidence limits).
Step 2: Look for exceptions
Exceptions to Step 1 are as follows: (1) if the first significant digit of the standard error is 1, then one additional place may be saved; (2) for percentages with a floor of 0% and ceiling of 100%, if the percentage is between 0% and 10% or between 90% and 100%, keep at least two significant digits; and (3) within a single contingency table, consistency in saving digits is desirable, so all numbers may be rounded to the place indicated by the majority of the numbers. In medical data, two significant digits (see Box 7.7 ) usually suffice.
Step 3: Round
Round the number by removing digits from its right side that falsely suggest a high degree of precision or accuracy. This is done as follows:
• If the digit in the first place beyond (to the right of) the digit to be rounded is greater than 5, add 1 to the rightmost digit to be retained and drop all other digits to its right. This is called “rounding up.”
• If the digit in the first place beyond the digit to be rounded is less than 5, simply drop it and all other digits to its right. This is called “rounding down.”
• If the digits beyond the digit to be rounded are exactly 500…0, add 1 to the rightmost retained digit if it is odd (i.e., 1, 3, 5, 7, 9), and leave it as is if it is even (i.e., 0, 2, 4, 6, 8). After rounding, the rightmost retained digit is therefore always an even number (see the sketch that follows).
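For readers who round programmatically, the “round half to even” rule in Step 3 is available in standard libraries. The following is a minimal Python sketch; the function name and example values are ours, chosen only to illustrate the three rules:

```python
# A minimal sketch of the Step 3 rounding rules using Python's decimal module.
from decimal import Decimal, ROUND_HALF_EVEN

def round_half_even(value, places):
    """Round to `places` decimal places, sending exact halves to the nearest even digit."""
    quantum = Decimal(1).scaleb(-places)   # e.g., places=1 -> Decimal('0.1')
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)

print(round_half_even(2.46, 1))  # 2.5  (digit beyond is >5: round up)
print(round_half_even(2.44, 1))  # 2.4  (digit beyond is <5: round down)
print(round_half_even(2.35, 1))  # 2.4  (exactly 5: retained digit 3 is odd, so round up to even 4)
print(round_half_even(2.45, 1))  # 2.4  (exactly 5: retained digit 4 is already even, so unchanged)
```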
Tabular presentation.
Numbers are often presented in tabular form that indicates distribution of data between the extremes of a continuous variable (e.g., patient age). Such tables should be prepared so that positioning of any point along the continuous variable can be unambiguously determined. In this text, intervals between extremes of a continuous variable are indicated by symbols of inequality (see Table 7.5 ). This method of presentation of tabular information is mathematically conventional ( Box 7.10 ) but not conventional for medical publications, where ambiguity often abounds.
• BOX 7.10
Inequalities
<
Less than; 3 < 4 means “3 is less than 4.”
>
Greater than; 5 > 3 means “5 is greater than 3.”
≤
Less than or equal to; systolic blood pressure ≤ 130 means systolic pressure is “less than or equal to 130.”
≥
Greater than or equal to; diastolic blood pressure ≥ 80 means diastolic pressure is “greater than or equal to 80.”
30 ≤ X < 40
The number represented by x is greater than or equal to 30 (i.e., 30 is less than or equal to x ) but is less than 40. Note that x is strictly less than 40 (39.999…), not exactly equal to 40. This statement is unambiguous, whereas the statement that “ x is between 30 and 40” is ambiguous because it is unclear whether 30 or 40 (or both) is included by the word between .
Descriptive statistics
Descriptive statistics are numbers used to summarize values for a specific variable recorded for a group of patients ( sample ), such as age, presence or absence of coronary artery disease, or New York Heart Association (NYHA) functional class; nomenclature is given in Box 7.11 . Variables fall into two broad categories for which different methods and expression of summarization are appropriate: (1) categorical and (2) continuous .
• BOX 7.11
Words
Piantadosi, Kirklin, and Blackstone provided a glossary of statistical terms in the first edition of Pearson and colleagues’ Thoracic Surgery.
Population
The entire set of things with specified attributes. For example, the population of patients with ischemic heart disease encompasses everybody with that disease, not only at the present time, but anyone in the past or future.
Sample
One or more things with specific attributes belonging to a population. Thus, my next patient, or a group of patients I have operated on with ischemic heart disease, represents a sample of the population of such individuals.
Proportion
A proportion is a part compared with the whole. Specifically, it is the number having some attribute value of interest, divided by the number in the sample. Ten deaths among 30 patients is a proportion of 0.33.
Percent
Percent is a part compared with the whole, normalized to a sample size of 100. It is calculated by multiplying a proportion by 100.
Parameter
A constant used to characterize some attribute of a population. One generally uses a sample of patients to estimate such constants. These constants are commonly (but not always) designated by letters or symbols in mathematical equations called models (see later Box 7.17 ).
Variable
An attribute about a thing that can take on different values from one thing to another. For example, systolic blood pressure is a variable because its value differs from patient to patient. The word parameter is often used incorrectly when the word variable is meant.
Prevalence, incidence, rate
Prevalence , incidence , and rate are often used interchangeably. Perhaps common usage should prevail (it rarely leads to confusion), but from the standpoint of correct usage, these are not interchangeable terms. We prefer selecting the specific word whose technical definition matches the context.
Prevalence
Frequency of occurrence of some factor, characteristic, event, or incident in a group. Of the three words being considered, it is the least commonly used but the most commonly meant! For example, if 78% of patients are men, the prevalence of males in the sample is 78%; we would not use the phrase, “The incidence of males was….” Similarly, hospital mortality may be 1%. That is the prevalence , or occurrence, of hospital mortality. We would not use the phrase, “Hospital mortality rate was…” or “Incidence of hospital mortality was….” The word occurrence is often a suitable substitute for prevalence .
Incidence
Frequency of occurrence per unit of time . It is expressed on a scale of inverse time (cases per year, deaths per year) or rate of occurrence. The prevalence of mortality across time is expressed as survival; the incidence of mortality is expressed by the hazard function.
Rate
Quantity per unit time. Speed is a rate: km · h⁻¹; cardiac output is a blood flow rate: L · min⁻¹. In the context of events, rate is synonymous with incidence . The hazard function is a rate (mortality · year⁻¹) and is thus an incidence.
How, then, can we rephrase such common expressions as the following?
• Incidence of hospital mortality was…
• Hospital mortality rate was…
• Five-year survival rate was…
We could write “Prevalence of hospital mortality was….” However, in most instances, the words prevalence , incidence , and rate are superfluous. It is better to just write “Hospital mortality was…” or “Five-year survival was….”
Categorical variables take on a small number of values. If they take on just two (e.g., YES, NO), they are called dichotomous variables . If they have values that are ordered (e.g., none, mild, moderate, severe), they are called ordinal variables. If they are just a list (e.g., type of valve prosthesis), they are called polytomous variables.
Continuous variables take on a theoretically limitless number of values, although these values may have natural constraints (e.g., age, which cannot be negative). Their degree of granularity may vary (e.g., age may be calculated in whole years in adults, but in days or even hours [higher granularity] in neonates).
Categorical variables
Dichotomous.
Descriptive statistics for dichotomous categorical variables include simple counts (i.e., a count of the number of times the variable was YES [or 1] or NO [or 0]): How many cases were performed? How many men and women were in the study? How many patients died after operation? Summary counts are of limited value, however, because they do not reflect the size of the sample. Therefore, a summary statistic can be formulated that normalizes the counts to a standard denominator, commonly 100 (percent). This is a probability parameter estimate, so it not only reflects what is experienced within the sample but also begins to give insight into characteristics of the population ( Box 7.12 ).
• BOX 7.12
Parametric Versus Nonparametric
Nonparametric
A statistical method that summarizes data in specific ways and by specific procedures that do not use either an empirical or biomathematical model (see later Box 7.17 ). A median value in the distribution of values for age is a nonparametric estimate. Kaplan-Meier survival estimates are nonparametric estimates.
Parametric
A statistical method that summarizes data in terms of either an empirical or biomathematical model (see later Box 7.17 ). Numeric estimates of the constants in these models are called parameter estimates . Coefficients of a regression equation are parameters (see later Box 7.17 ), as are mean and standard deviation.
Parameter estimates
Parameters in mathematical models (see later Box 7.17 ) are placeholders for numeric values. When the parameters take on specific values, the model becomes an equation that can be solved—for example, for an individual patient’s risk. These numeric values are called parameter estimates . They are estimates because they are based on a finite sample of data. Just as a mean value (a parameter estimate) is associated with uncertainty that is proportional to the standard deviation (another parameter estimate) and inversely related to the effective sample size, so any parameter estimate is associated with uncertainty.
Parameter values are estimated by means of statistical theory and procedures. The estimation process may be complex or as simple as counting and dividing.
Ordinal.
Each value of an ordinal variable bears a strictly increasing or decreasing (monotonic) relation to all other possible values. For simplicity, it may be tempting to group some of these values together—forming a less granular dichotomous variable that lumps NYHA classes I and II versus III and IV, for example. This is an information-losing transformation of scale that should be done only if analysis truly shows that outcome does not follow the ordinal scale but instead suggests just two groups of patients, or if one or more categories are sparse.
When ordinal variables are analyzed with respect to outcome, it is important to use statistical methods that are appropriate for ordered values (trend statistics) rather than for lists (tests of independence of categories). This must be communicated to the data analyst.
Polytomous.
Variables with values that are simply a list (complications after operation, type of valve prosthesis) can be counted, but special mention must be made as to whether the counts represent mutually exclusive categories. A list of types of prosthesis used is likely to be mutually exclusive (a patient can fall into only one category), but a table of complications is unlikely to be so (a patient can experience more than one complication). In presenting lists, all categories should be represented, including number of missing values and whether some categories have been coalesced (e.g., under “other”).
List variables are often useful for analysis if they are mutually exclusive. Otherwise, the list should be decomposed into a set of dichotomous variables for each category.
Continuous variables.
The other broad category of variables is continuous, for which each patient in a study (sample) may have a different value (e.g., age, weight, ejection fraction). Thus, the raw data are rarely published, because each patient or subject in a study is likely to be unique in regard to continuous variables, making any tabular presentation unwieldy unless the number of patients and number of variables are small. Summarizing statements may be made of the raw data by one of several techniques.
A commonly used summarization of raw data is a simple table with patients grouped into “nice” ordered categories. A histogram is a plot of such a table ( Fig. 7.8 A). Another method of constructing a simple table is to sort patients into several groups of equal number, even if the width of the range of values in each group is different. Because the number of such groups was originally 10, these are called decile tables .
Yet another alternative is to divide patients into percentiles , stating the value of the variable at these percentiles as follows. Patients or subjects are first sorted by (generally) increasing magnitude of the variable under consideration (e.g., by increasing age). Then the number (or more commonly the proportion) of patients with values less than or equal to each value is calculated. For example, if there are 21 patients and each is a different age at operation, patients are first sorted from youngest to oldest. No patient is younger than the youngest one (0/21, 0%, or minimum); 1/21 are as young or younger than the youngest (4.8%), and for these data, this is also the 5th percentile; 2/21 (9.5%) are as young or younger than the second youngest patient, and this is also the 10th percentile. The middle value of age, that of the 11th patient in this list, is called the median or 50th percentile . All (21/21, 100%) are as young or younger than the oldest. A cumulative distribution plot , produced easily by computer but laboriously by hand, presents all the raw data in this percentile format ( Fig. 7.8 B).
Alternatively (and more commonly), a value is found at or below which a stated proportion of patients fall (100 times that proportion is the percentile). For example, the median is the 50th percentile: half the patients have a value for the continuous variable below the median, and half have values above it. For consistency, one might also state the 15th and 85th percentiles, because they correspond to 70% confidence limits (CLs; see “ Confidence Limits [Intervals] ” in Section IV). More commonly, the 25th and 75th percentiles (quartiles), which bracket the middle 50% of the data, or the 10th and 90th percentiles are used.
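These percentile summaries are simple to compute with standard numerical software. A minimal sketch follows; the 21 ages are hypothetical, not data from this chapter:

```python
# A minimal sketch of nonparametric percentile summaries (ages are hypothetical).
import numpy as np

ages = np.array([34, 41, 45, 48, 50, 52, 53, 55, 56, 57, 58,
                 59, 60, 61, 63, 64, 66, 68, 70, 73, 79])   # 21 patients, already sorted

median = np.percentile(ages, 50)            # 50th percentile (the 11th of the 21 sorted values)
p15, p85 = np.percentile(ages, [15, 85])    # analogous to 70% confidence-limit-style spread
q25, q75 = np.percentile(ages, [25, 75])    # quartiles bracketing the middle 50% of the data

print(f"median {median}, 15th/85th percentiles {p15}/{p85}, quartiles {q25}/{q75}")
```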
This method of summarizing data is called nonparametric (see Box 7.12 ). Beyond such simple counting (percentages and percentiles), more abstract methods are often brought into play to describe continuous data. The methods have in common a process whereby raw data on a sample of patients are used to estimate values of parameters of mathematical equations. The most familiar of these is the mean , which is estimated as the summation of all values of the continuous variable (e.g., age, pulmonary artery pressure) divided by the number of people or observations ( n ). The rationale for using the mean is that it provides an estimate of the central tendency of the data and a characteristic of the population studied. If the data are distributed perfectly symmetrically in the form of a bell-shaped curve, the mean is exactly at the midpoint of the data range (see Fig. 7.8 A). It is also the most frequently occurring number ( mode ), with half the patients above it and half below ( median ) (see Fig. 7.8 B).
Distribution of a continuous variable, age at operation. (A) Histogram of age at operation of 102 patients undergoing coronary artery bypass grafting. Approximately 30% were 50 to 55 years of age at operation, 25% were age 55 to 60, and lesser percentages of patients were older or younger. (B) Cumulative distribution plot of age at operation in these 102 patients. Vertical axis shows percentage of patients coming to operation at or younger than any given age on horizontal axis. It also gives directly the percentile of patients coming to operation by a given age: Median is the 50th percentile. S shape of this particular plot suggests a normal distribution; any other shape would suggest a different distribution.
The derivation of averages, or means, was begun by astronomers centuries ago. They thought that the scatter in their data was from observational error or imprecision, and they used means, or averages, in an attempt to obtain true values (accuracy). Later, Gauss discussed and described the symmetric normal distribution curve, which actually was described earlier by de Moivre ( Box 7.13 ).
• BOX 7.13
Gaussian Distribution
The equation of the bell-shaped Gaussian (normal) distribution curve is:

y = [1 / (σ√(2π))] · e^[−(x − µ)² / (2σ²)]

where:
• π is a constant, approximately 3.1415927…, pi
• e is a constant, approximately 2.7183…, the base of the natural logarithms
• σ is a parameter that represents the standard deviation of the variable
• µ is a parameter that represents the mean of the variable
• x represents a value of the variable X, generally graphed on the horizontal axis
• y represents the probability of occurrence of a particular value of x
Because in medicine normal has several unrelated meanings, we have used the more technical term Gaussian.
Standard deviation versus standard error
Standard deviation is the Gaussian distribution parameter representing the scatter or deviation of individual values from the mean. It is a descriptive statistic; graphically, it is the distance from the mean to the inflection point of the Gaussian distribution curve (see Fig. 7.9 B).
Standard error is the standard deviation of the mean, an estimate of the precision of the mean. Unlike the standard deviation, which is similar in value for large and small samples of data, the standard error decreases as the sample size increases, approximately in inverse proportion to the square root of n.
Because the Gaussian curve is symmetric around the mean, the two parameters of the Gaussian distribution are expressed by the shorthand mean ± SD, where SD is 1 standard deviation. This means 68.3% of values for patient age fall between (mean − SD) and (mean + SD). This is one instance, not common in statistics, where the shorthand ± is used instead of confidence limits (see later Box 7.14 ).
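The distinction between these two quantities is easy to demonstrate numerically. The following minimal sketch uses simulated, not real, ages:

```python
# A minimal sketch contrasting standard deviation (patient-to-patient scatter)
# with standard error (precision of the estimated mean); data are simulated.
import numpy as np

ages = np.random.default_rng(1).normal(loc=56, scale=15, size=102)

mean = ages.mean()
sd = ages.std(ddof=1)           # variability of individuals within the sample
se = sd / np.sqrt(len(ages))    # standard deviation of the mean

print(f"mean {mean:.1f} y, SD {sd:.1f} y, SE {se:.1f} y")
# Roughly 68.3% of individual ages fall within mean ± 1 SD,
# whereas mean ± 1 SE expresses uncertainty about the mean itself.
```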
The mean is the easiest statistic to calculate. Unfortunately, it is not a robust measure of central tendency. If many infants and only one or two adults are in a study, average age is greatly exaggerated by the few adults. A more robust measure of central tendency is the median. Whether or not the sample data are distributed in a Gaussian-type bell-shaped curve (see Fig. 7.9 and Box 7.13 ) may be tested by such statistics as the Shapiro-Wilk W statistic for small n (e.g., 50 or less) and the Kolmogorov-Smirnov D statistic for larger samples. The skewness of the data (rightward or leftward asymmetric tail) and their kurtosis (unusual peakedness of the distribution of values) are also tested.
Gaussian bell-shaped distribution curve for age. (A) Distribution curve (smooth line) fitted to data summarized in underlying bar graph (see also Fig. 7.8 A). (B) Distribution with mean ±1 and ±2 standard deviations (SD) marked by vertical lines. Point of inflection of the curve from concave to convex is 1 SD.
Thus, in addition to an estimation of the population mean, some measure of dispersion (variance, spread, scatter) of values is needed. One such measure is the standard deviation , the name of the second parameter of the Gaussian distribution equation (see Box 7.13 ). It refers to variability from subject to subject or variability of individuals within the sample and is used to determine whether an individual is “within limits of normal.” Standard deviation is necessary for comparison statistics. For example, an individual’s standard deviation from the mean regarding a particular measured variable (commonly called z ) is often useful. This is calculated from the difference between the measurement for the individual and the mean normal value divided by the standard deviation. A z may be negative or positive and has no units (see “ Standardization of Dimensions ” under “ Dimensions of Normal Cardiac and Great Artery Pathways ” in Chapter 1 ).
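As a worked illustration of the z calculation, the sketch below uses hypothetical values (they are not the normal values tabulated in Chapter 1):

```python
# A minimal sketch of a z value: deviation from the mean normal value in SD units.
def z_score(measurement, mean_normal, sd_normal):
    return (measurement - mean_normal) / sd_normal

# Hypothetical example: a dimension of 5.9 cm against a normal mean of 4.7 cm with SD 0.5 cm
print(f"{z_score(5.9, 4.7, 0.5):.1f}")   # 2.4, i.e., 2.4 standard deviations above the mean normal value
```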
However, if the distribution of data values does not conform to the normal distribution, then reporting two model parameters of the normal distribution, namely mean and standard deviation, represents a mismatch between the data and the model of the data, called model misspecification. Thus, for a continuous variable that is strictly positive (and nearly all of them in medicine have only positive values), if the standard deviation is larger than the mean or even close to it, the data are skewed ( Table 7.6 and Fig. 7.10 ), and median and selected percentiles must be provided instead to accurately summarize the distribution of data.
TABLE 7.6
Summary Statistics for Distributions Shown in Fig. 7.10
| Patient Characteristic | Mean ± SD | 15th/50th/85th Percentiles | Distribution |
|---|---|---|---|
| Age at surgery (y) | 56 ± 15 | 40/57/71 | Normal |
| Preop creatinine (mg/dL) | 1.8 ± 1.7 | 0.76/1.1/2.8 | Skewed |
| Preop BUN (mg/dL) | 24 ± 18 | 10/18/37 | Skewed |
| ICU length of stay (h) | 143 ± 214 | 25/69/237 | Skewed |
BUN, blood urea nitrogen; ICU, intensive care unit; Preop, preoperative; SD, standard deviation.
Illustration of a normally distributed variable and three skewed distributions. In (A), probability density functions (histograms and a smooth curve) are skewed “to the right” (long tail to the right). This is typical of many medical variables whose value cannot be less than zero. The long right tail often indicates pathologic values. In (B), cumulative distribution functions that correspond to the area beneath density functions shown in A . “Normally” distributed values take on an S-shaped curve, which is not so for the skewed variables. Parametric statistics such as the mean and standard deviation are appropriate for the more-or-less normal bell-shaped curves, but nonparametric statistics are usually needed for the skewed distributions. BUN , blood urea nitrogen; ICU , intensive care unit.
Standard error is a measure of the reliability with which the population mean is estimated from the sample mean, and it is needed for comparing one group with another. It is more appropriately (but infrequently) called the standard deviation of the mean and is obtained simply by dividing the standard deviation by the square root of n (see Box 7.13 ).
Other methods are available for summarizing skewed data. One is to resort to a purely nonparametric (i.e., without equations, coefficients) description (e.g., using the median and its various percentiles). Another is to transform the data into a more normally distributed scale. For example, a logarithmic transformation is often useful; the resultant mean is called the geometric mean .
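A minimal sketch of the logarithmic transformation and geometric mean follows; the lengths of stay are hypothetical and chosen only to be right-skewed:

```python
# A minimal sketch of summarizing skewed data by log transformation (hypothetical ICU hours).
import numpy as np

los_hours = np.array([18, 22, 25, 40, 69, 90, 140, 237, 400, 900])

arithmetic_mean = los_hours.mean()
geometric_mean = np.exp(np.log(los_hours).mean())   # mean on the log scale, back-transformed
median = np.median(los_hours)

print(f"arithmetic mean {arithmetic_mean:.0f} h, geometric mean {geometric_mean:.0f} h, median {median:.0f} h")
# The long right tail inflates the arithmetic mean; the geometric mean and median are more robust.
```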
Section IV: Analyses
Specific data analysis methods are described in this section. Here, we simply indicate how this aspect of the research process leads to success.
First, the analysis process leads to understanding of the “raw data,” often called exploratory data analysis. This understanding is gleaned from such analyses as simple descriptive statistics, correlations among variables, simple life tables for time-related events, cumulative distribution graphs of continuously distributed variables, and cluster analyses whereby variables with shared information content are identified.
Second, the analytic process attempts to extract meaning from the data by various methods akin to pattern recognition. Answers are sought for questions such as “Which variables relate to outcome and which do not? What inference can be made about whether an association is or is not attributable to chance alone? Might there be a causal relationship? For what might a variable associated with outcome be a surrogate?”
What will be discovered is that answering such questions in the most clinically relevant way often outstrips available statistical, biomathematical, and algorithmic methodology! Instead, a question is answered with available techniques, but it may not be the question actually asked. Some statisticians, because of insufficient continuing education, lack of needed statistical software, lack of awareness, failure of communication, or lack of time, may explore the data less expertly than required. One of the purposes of this chapter is to stimulate effective collaboration between cardiac surgeons and data analysis experts so that data are analyzed thoroughly and with appropriate methodology, and if methodology does not exist to answer the research question, then we hope data scientists and statisticians will be inspired to develop new analytic methods.
Historical note
Analysis, as expressed by Sir Isaac Newton, is that part of an inductive scientific process whereby a small part of nature (a phenomenon) is examined in the light of observations (data) so that inferences can be drawn that help explain that aspect of the workings of nature.
Philosophies underpinning methods of data analysis have evolved rapidly since the latter part of the 19th century and may be at an important crossroad. Stimulated in large part by the findings of his cousin Charles Darwin, Sir Francis Galton, along with Karl Pearson and Francis Edgeworth, established at that time what has come to be known as biostatistics . Because of the Darwinian link, much of their thinking was directed toward an empirical study of genetic versus environmental influence on biological development. It stimulated development of the field of eugenics (human breeding) and the study of mental and even criminal characteristics of humans as they relate to physical characteristics (profiling). The outbreak of World War I led to development of statistics related to quality control. Sir Ronald Fisher formalized a methodological approach to experimentation, including randomized designs, particularly in agriculture. The varying milieus of development led to several competing schools of thought within statistics, such as frequentist and Bayesian, with different terminologies and different methods. Formalization of the discipline occurred, and whatever the flavor of statistics, it came to dominate the analytic phase of inferential data analysis, perhaps because of its empirical approach and lack of underlying mechanistic assumptions.
Simultaneously, the discipline of biomathematics arose, stimulated in particular by the need to understand the growth of organisms (allometric growth) and populations in a quantitative fashion. Biomathematicians attempted to develop mathematical models of natural phenomena such as clearance of pharmaceuticals, enzyme kinetics, and blood flow dynamics. These continue to be important today in understanding such altered physiology as cavopulmonary shunt flow. Many of the biomathematical models came to compete with statistical models for distribution of values for variables, such as the distribution of times to an event.
Advent of the fast Fourier transform in the mid-1960s led to important medical advances in filtering signal from noise and image processing. The impetus for this development came largely from the communications industry, so only a few noticed that concepts in communication theory coincided with those in statistics, mathematics, and physics.
As business use of computers expanded, and more recently as genomic data became voluminous, computer scientists developed methods for examining large stores of data (see footnote a, p. 238). These included data mining in business and computational biology and bioinformatics in the life sciences. Problems of classification (e.g., of addresses for automating postal services) led to such tools as neural networks, which have been superseded in recent years by an entire discipline of machine learning.
In the past quarter century, all these disciplines of mathematics, computer science, information modeling, and digital signal processing have been vying for a place in the analytic phase of clinical research that in the past has largely been dominated by biostatistics (see footnote a, p. 238). Specifically, advanced statistics and algorithmic data analysis have conquered the huge inductive inference problem of disparity between number of parameters to be estimated and number of subjects (e.g., in genetics, hundreds of thousands of variables for n = 1). Advanced high-order computer reasoning and logic have taken the Aristotelian deterministic approach to a level that allows intelligent agents to connect genotype with phenotype. It may be rational to believe that the power of these two divergent approaches to science can be combined in such a way that very “black box” but highly predictive methods can be explored by intelligent agents that report the logical reasons for a black-box prediction.
Fortunately, those in cardiac surgery need not be threatened by these alternative voices but can seize the opportunity to discover how each can contribute to better understanding of the phenomena encountered in this medical discipline. For this reason, in this section, Dr. Hemant Ishwaran from the University of Miami in Florida has interjected a number of advanced concepts involving machine learning to complement the largely parametric methods presented.
Overview
This section highlights (1) the important statistical concept of dealing with uncertainty , illustrating it with CLs, P values, and measures of importance; (2) the increasingly important signal processing concept, multivariable analysis , illustrating it with logistic regression of early postoperative events; (3) analysis of time-related events; and (4) longitudinal data analysis, which we present in terms of biomathematical concepts and machine learning. In Section VI, other specialized methods are highlighted, including some that are only peripherally related to serious clinical research but importantly affect cardiac surgeons.
Uncertainty
Publication of an experience with triple valve replacement in 438 patients, among whom 8 (1.8%) died in the hospital, is in isolation a record of past achievement. Assuming honest reporting, there is no uncertainty about this result, but in and of itself, except for inviting applause or criticism, it has only historical value. Yet most persons expect past experience to be useful in predicting what can be accomplished in the present or the future, or in comparing outcome with that of other surgical options or continued medical therapy (see “ Nihilism Versus Predictability ” in Section I). That is, they recognize the future is uncertain, but they are not nihilists; they assume there is continuity in nature (see “ Continuity Versus Discontinuity in Nature ” in Section I). There are well-tested theories and methods that quantify the uncertainty of inferring from the past the probable results in the future (assuming nothing changes). Quantifying that degree of uncertainty is a major part of making results of past experience useful.
Point estimates
Point estimates represent the central tendency of a set of numbers that describe the characteristics or state of a sample (e.g., group of patients). The previously mentioned 1.8% hospital mortality is a point estimate. So are the mean value of age in a group of patients and percent survival 1 or 20 years after an operation.
Such numbers are generally derived from a study of a sample (see Box 7.11 ) of members of a population (e.g., everyone everywhere undergoing triple valve replacement). Yet the clinical study is nearly always performed to generalize beyond the sample examined.
Generalizing from a sample to the population is fraught with uncertainty. Recorded, unrecorded, or unrecognized patient characteristics may occur at a different frequency in the sample than in the population (including your future patients). Surgeons use expert clinical judgment in decision-making, and this introduces selection bias into the sample. Well-recognized variance in institutional policies, processes, procedures, skill, and experience influence outcomes in ways that may be difficult to dissect and confound inextricably both outcomes and interpretation of outcomes. These suggest that inferences from sample point estimates alone are unlikely to be predictive of results in either the population or future samples.
Nevertheless, over the past quarter century, fewer and fewer cardiac surgery publications accompany point estimates with a measure of uncertainty.
Confidence limits (intervals)
CLs, the two extremes of a confidence interval ( Box 7.14 ), are the fundamental statistics that quantify uncertainty of point estimates. It is not the underlying data that are uncertain (e.g., how many hospital deaths occurred in a defined group of patients), but inferences about the future based on known data from the past.
• BOX 7.14
Confidence Limits, Confidence Intervals
Confidence limits
Numbers at the two extremes of an interval that encompasses a stated percentage of the variability of a point estimate. In this book, we use confidence limits (CL) rather than confidence intervals (CI) to avoid confusion with cardiac index (CI), a familiar abbreviation used by cardiac surgeons.
Confidence interval
Interval encompassing a stated percentage of the variability of a point estimate.
For example, if there is one hospital death in three operations for postinfarction VSD, the proportion of hospital deaths (hospital mortality) is 0.33 (1/3, 33% hospital mortality). This is the mortality in that experience, looking at it solely as a record of achievement. Likewise, if 10 deaths occur among 30 such operations, or 100 occur among 300 operations, mortality is also 33%. Intuitively, there is more confidence that the risk in an entire population, not just in the small sample studied, is near 33% on the basis of the experience with 300 operations than on the basis of three operations. Yet, also intuitively, one suspects something has been learned about the risk in an entire population from only three operations. For example, the true risk cannot be exactly 0% or exactly 100%.
Historical note.
The questions “What is the risk of repair of postinfarction VSD in general?” and “Is risk with the method of repair I used higher or lower than that with the method another surgeon is using?” are similar to questions put to Galileo about the nature of chance, particularly games of chance, by 17th-century gamblers. From those questions emerged the Laws of Chance, now known as the theory of probability . These laws are believed to apply to all things that can have more than one possible result. Many scientists believe that all natural phenomena, including those of the physical world, behave in accordance with the theory of probability. Events and phenomena of cardiac surgery behave in accordance with this theory.
Galileo showed that there is variability in sample point estimates. To illustrate, if the risk of death in the entire population of patients undergoing repair of postinfarction VSD by a given method is 33%, and samples of three patients are taken repeatedly, 0 deaths among the three would be experienced in 30% of samples, 1 death in 44% of samples, 2 deaths in 22% of samples, and 3 deaths in 4% of samples. In larger samples, results are less variable. For example, with samples of size 300, although the number of deaths experienced may still be quite variable, the proportion dying will be 30% to 36% in 70% of samples taken.
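The sampling variability quoted above follows directly from the binomial distribution. A minimal sketch, assuming scipy is available:

```python
# A minimal sketch of sampling variability for samples of 3 patients
# when the true risk is 1/3, using the binomial probability mass function.
from scipy.stats import binom

for deaths in range(4):
    prob = binom.pmf(deaths, n=3, p=1/3)
    print(f"{deaths} death(s) among 3 patients occurs in {prob:.0%} of samples")
# 0 deaths ~30%, 1 death ~44%, 2 deaths ~22%, 3 deaths ~4%
```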
Because of this random variability in the sample estimates of risk, it is impossible to estimate the population parameter (see Box 7.11 ) with certainty (i.e., to know the risk in the entire population) from sample information. However, the pattern of variability in repeated sampling is well understood, and in most situations, it is possible to derive a formula to calculate the range of values that would contain the parameter for a specified percentage (e.g., 70%) of samples taken.
Users of CLs should be aware that this range of values for all proportions except 0.5 (50%) is asymmetric, in contrast to standard deviations, which are symmetric. Thus, we must report both the point estimate (probability) and lower and upper CLs.
As the sample size increases and more information becomes available, width of the confidence interval decreases (i.e., a more precise estimate is obtained). With a more precise estimate, the investigator is less uncertain where the population parameter lies, or in other words, what the “true” risk is. With a less precise estimate, the investigator is more uncertain.
Computational methods.
A number of methods have been developed to calculate CLs for proportions. , Bootstrapping is a generalized method for obtaining CLs for any statistic. The original sample of data is randomly sampled in such a way that the patient can be sampled again (sampled with replacement) to form a data set equal in size to the original. Because of replacement, some patients will appear more than once in this bootstrap sample, and others will not appear at all. The point estimate (e.g., hospital mortality) is estimated in this sample. Then another sample is drawn in the same fashion, and this process is repeated as many as 1000 times. All the point estimates from each sample are sorted from smallest to largest, as in forming a cumulative frequency distribution (see “ Descriptive Data Exploration ” in Section III). The “best” estimate of the point estimate is the median value (50% above and 50% below). If 70% CLs are desired, then the 15th percentile is the lower CL and the 85th percentile is the upper limit (if 68.3% limits are desired, the numbers would be approximately the 16th and 84th percentiles). Approximating formulae are used in most statistical packages.
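A minimal sketch of this bootstrap procedure for a hospital-mortality proportion follows; the data (10 deaths among 30 operations) and the random seed are ours, purely for illustration:

```python
# A minimal sketch of bootstrap 70% confidence limits for a proportion.
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([1] * 10 + [0] * 20)    # 1 = hospital death, 0 = survival (10/30 = 33%)

estimates = []
for _ in range(1000):
    resample = rng.choice(outcomes, size=outcomes.size, replace=True)  # sample with replacement
    estimates.append(resample.mean())

lower, point, upper = np.percentile(estimates, [15, 50, 85])  # 70% limits and the median estimate
print(f"hospital mortality {point:.0%} (70% CL {lower:.0%} to {upper:.0%})")
```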
What level of confidence?
Any desired CL can be derived, such as 50%, 70%, 90%, 95%, or 97.5%. Choice of CLs to be expressed (called the confidence coefficient ) depends on (1) use to be made of them, (2) consistency, or (3) convention, in that order of preference.
Most often in cardiac surgery, CLs are used as scanning tools to aid predictions and comparisons, either of proportions or time-related depictions (see “ Scanning Tool ” later in this section). If great certainty is desired in the inference that there is a difference between two proportions or time-related depictions, 95% confidence intervals may be chosen for the comparisons. If only moderate certainty is required that the evident difference is a true difference and would be found in larger samples, 50% confidence intervals might be chosen.
Most situations in cardiac surgery seem to lie somewhere between these extremes, so use of 70% CLs for most comparisons is reasonable. The interval is relatively narrow (specific), and although it is reasonably certain that truth lies within the CLs, there is a 15% chance it will be higher and a 15% chance it will be lower.
Seventy-percent CLs (actually 68.3%) are equivalent to 1 standard error (SE), and 95% CLs are consistent with 2 SEs. For consistency, if other numeric estimates are presented to 1 SE, 70% CLs should be used, and if 2 SEs are presented, 95% CLs should be used. We emphasize consistency because we believe surgeons should become familiar with using CLs as a scanning tool; to use a tool effectively, it is helpful to be consistent among all measures of uncertainty. Conventionally, many statisticians use 95% CLs, even in the context of using 1 SE for most everything else, and 50% limits for nonparametric statistics. This makes no sense and is simply a habit, not a product of reflective thinking about the inferences or about consistency.
In a numeric presentation of differences, such as difference in survival curves (see Box 7.3 ), 90% CLs are equivalent in comparative inference to individual 70% CLs, a largely empirical finding. The reason is that a one-sided confidence interval of a difference between two estimates is narrower than the sum of the 70% upper and lower CLs that just touch. This narrowness is compensated for by use of somewhat wider CLs (90%) of the difference.
Scanning tool.
Overlapping or nonoverlapping of CLs around two or more point estimates can be used as a simple and intuitive scanning method for determining whether the difference in point estimates is unlikely to be due to chance alone. They delimit the effect, and because they are accompanied by the magnitude of the effect, there is no confusion between statistical significance and magnitude of the effect, as there may be if P values are used (see “ P -Values ” later in this section). When CLs are not overlapping, the difference is unlikely to be due to chance alone.
Because “nonoverlapping CLs suggest with a stated degree of uncertainty that a difference exists” is cumbersome, the phrase evident difference may be used to express the same idea ( Appendix 7B ). Nonoverlapping CLs are easily visualized in a nomogram in which the CLs are displayed around the point estimate expressing the association between variables. Within this context, it can be said with a stated degree of uncertainty that the effect of the independent variable compared with a baseline value becomes evident at the point at which the CLs just separate. However, in contrast to evident differences in a contingency table, this point is not easily seen in a nomogram, and it does not appear in an equation. The point at which evident differences appear in equations can, however, be calculated mathematically (see Appendix 7B ).
We stress that comparing CLs in this way is a scanning tool . The classic method using P values involves computing the difference between the two proportions and testing the hypothesis that the difference is zero. Experience with scanning and P value methods has taught that when the lower 70% CL of one estimate just touches the upper 70% CL of the other, the P value for the difference is between .08 and .1; when similar 95% CLs just touch, the P value is about .01.
P values
The phrase “statistically significant,” generally referring to P values, has done disservice to the understanding of truth, proof, and uncertainty. This is because of fundamental misunderstandings, in part because of failure to appreciate that all test statistics are specific in their use, and in part because P values are frequently used for their effect on the reader rather than as one of many tools useful for promoting understanding and framing inferences from data. In fact, P values are deemed by some to be unnecessary statistics and not worth the risk of misinterpreting or misusing them. They prefer CLs. Others cite machine learning alternatives.
Definition.
In the context of hypothesis (or significance) testing, the P value is the probability of observing the data we have, or something even more extreme, if a so-called null hypothesis is true ( Box 7.15 ). Or, as stated by the American Statistical Association (ASA), “Informally, a P value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” However, if you must have a formal definition, three of them are presented in commentaries within the ASA white paper by Deborah Mayo, Michael Lavine, Joseph Horowitz, and Valen Johnson.
• BOX 7.15
Hypothesis (Significance) Testing
Statistical hypothesis
A claim about the value of one or more parameters. For example, the claim may be that the mean for some variable (e.g., creatinine) is greater than some fixed value or some value obtained under different conditions or in a different sample of patients.
Null hypothesis
A claim that the difference between one or more parameters is zero or no change (written H 0 ). It is the claim the investigator is arguing against . When a statistician infers that there is “statistical significance,” it means that by some criteria (generally a P value), this null hypothesis has been rejected. Some argue that the null hypothesis can never be true, and that sample size is just insufficient to demonstrate this fact. They emphasize that the magnitude of P values is highly dependent on n , so other “measures of surprise” need to be sought.
Alternative hypothesis
This is the “investigator’s claim” and is sometimes called the study hypothesis . Usually the investigator would like the data to support the alternative hypothesis.
Test statistic
A number, computed from the distribution of the variable to be tested in the sample of data, that is used to test the merit of the null hypothesis.
Type I error
Rejecting the null hypothesis when it is true (a false-positive inference). The probability of a type I error is designated by the Greek letter alpha (α).
Type II error
Not rejecting the null hypothesis when it is false (a false-negative inference). The probability of type II error is designated by the Greek letter beta (β).
Historically, hypothesis testing is a formal expression of English common law. The null hypothesis represents “innocent until proven guilty beyond a reasonable doubt.” Clearly, two injustices can occur: a guilty person can go free or an innocent person can be convicted. These possibilities are termed type I error and type II error , respectively (see Box 7.15 ). Evidence marshaled against the null hypothesis is called a test statistic , which is based on the data themselves (the exhibits) and n. The probability of guilt (reasonable doubt) is quantified by the P value or its inverse, the odds [(1/ P ) − 1] (see Box 7.3 ).
Had the originators been raised under a different judicial system, perhaps a different pattern for testing hypotheses might have arisen. Specifically, the system does not judge how innocent a person is (the “alternative hypothesis”; see Box 7.15 ), nor does it test for equivalence, a very important matter for comparing pharmaceuticals and even alternative surgical therapies.
Some statisticians believe that hypothesis or significance testing and interpretation of the P value by this system of justice is too artificial and misses important information. For example, it is sobering to demonstrate the distribution of P values by bootstrap sampling. Furthermore, the magnitude of the P value is dependent on two factors: magnitude of difference and sample size. These individuals would prefer that P values be interpreted simply as “degree of evidence,” “degree of surprise,” or “degree of belief.” We agree with these ideas and suggest that rather than using P values for judging guilt or innocence (accepting or rejecting the null hypothesis), the P value itself should be reported as degree of evidence. It is worthwhile considering the conclusion of the ASA white paper: “Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”
Calculating the P -value.
All methods for calculating P values have in common one or more point estimates, some measure of variability for each, some comparison statistic related to the point estimates (e.g., the difference or a ratio), an estimate of the variability of the comparison statistic, and size of the groups.
The test to be used is selected. This must be appropriate for the comparison. It is crucial that a biostatistician familiar with the data and desired comparison be the one to select this test and interpret the results. In general terms, this demands that a specific distribution of the difference or ratio be selected. From the difference or ratio, some measure of its variability, and n, a number is computed for the particular distribution selected, called the test statistic (see Box 7.15 ). There are a number of test statistics, which means there are a number of prescribed, defined, specific methods (tests) for calculating the test statistic. The statistician selects the test statistic to be used on the basis of the fit of the data to the assumptions underlying the test.
The magnitude of the computed test statistic among the hypothetically determined distribution of values for the test chosen is determined. The area under the distribution curve (proportion of the total area) occupied by more extreme values of the test statistic is the P value , a number ranging from 0 to 1.
In the case of many test statistics, a family of distribution curves exists, and to determine the P values, one of these must be selected. The selection is based, more or less, on the sample size ( n ). By “more or less,” we mean that some information content in the n may already have been “used up” in other calculations in the process and may not be available for computation of the P value. What is left, called degrees of freedom , determines the distribution curve selected.
The phrases one-tailed P value and two-tailed P value are commonly used. Which is appropriate depends on the research hypothesis being tested. When the hypothesis relates to differences in either direction (“different from zero”), a two-tailed P value is used; when it relates to differences in only one direction (“less than,” for example), a one-tailed P value is used. A two-tailed P value is always the same as or larger than a one-tailed P value. Generally, in the work described in this book, two-tailed P values are used.
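A minimal sketch of the distinction, using a two-sample t test on hypothetical creatinine values (scipy's alternative argument selects the tail; the data are invented):

```python
# A minimal sketch of two-tailed versus one-tailed P values (hypothetical creatinine, mg/dL).
from scipy.stats import ttest_ind

group_a = [0.9, 1.1, 1.0, 1.3, 1.2, 1.0, 1.4, 1.1]
group_b = [1.2, 1.5, 1.3, 1.7, 1.4, 1.6, 1.8, 1.3]

two_tailed = ttest_ind(group_a, group_b)                      # hypothesis: means differ in either direction
one_tailed = ttest_ind(group_a, group_b, alternative="less")  # hypothesis: group A mean is lower

print(f"two-tailed P = {two_tailed.pvalue:.3f}; one-tailed P = {one_tailed.pvalue:.3f}")
# The two-tailed P value is the same as or larger than the one-tailed P value.
```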
Use of expressions of degree of uncertainty
Whether one uses CLs or P values, a decision must be made concerning the degree of certainty desired in the inference that A is different from B. Some have a slavish attachment to a certain P value, such as .05, or a certain width of CLs, such as 95%, as the yardstick for all situations. Sir Ronald Fisher wrote, “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
This discussion would be unnecessary if all sample sizes were moderately large and the number of events ample, providing adequate power (information content) for all computations (see Box 7.4 ). In many clinical investigations, a large sample is simply not available, yet important decisions must be made on the basis of the inference generated. Then the cost of making a wrong decision based on an analysis, and the risk of overlooking or not finding a relation between two variables that in fact exists, weigh importantly in the decision regarding what P value to use (see Table 7.5 and Box 7.15 ). The greater the cost, the smaller the P value demanded.
An apparent contradiction to the foregoing discussion is the setting of so-called humongous databases of hundreds of thousands or millions of patients. In this setting, the dependence of P values on n becomes glaringly apparent. Essentially in every comparison, no matter how small the clinical difference, P values are small. “All null hypotheses are false.” In this circumstance, other measures of surprise must be devised for testing differences that take into account the magnitude of the difference.
Multivariable analysis
The necessity
Surgeons have intuitively understood that surgical outcomes, such as hospital mortality, may be related to a number of explanatory variables, such as age or renal and hepatic function. However, when presenting a risk factor analysis of outcome for a group of patients, two reactions are heard, often from the same critic: (1) “Your analyses are much too complex, far beyond the comprehension of ordinary cardiac surgeons,” and (2) “This is a very complex, multifactorial situation, and you have not begun to take all the things that could have influenced outcome into consideration.” This contradiction reflects the cognitive structure of the human mind, as discussed in Section I. On the one hand, we perceive, understand, and store in our brains simplified models of reality; on the other hand, our conscious minds recognize that “things are often less simple than they seem.”
To complicate matters, we generally know neither the cause nor the causal sequence that leads to a surgical failure, and that is what we want to know to make progress toward preventing future failures. The cause may in fact be buried in the clinical information and the data we have extracted therefrom, but we do not know if this is true and suspect we are ignorant of the real cause. The extensive cautionary literature on surrogate endpoints for clinical trials and how they can lead us astray fuels this anxiety. We need, perhaps, to be reminded that public health recommendations based on crude risk factors for the plague were effective in halting it and preventing its recurrence for 200 years until the causative organism and vector were discovered.
Faced with hundreds, perhaps thousands, of variables, the investigator seeks to find simple or dominant or stratospheric comprehension of the data. They want to discover the wood, not necessarily the trees (or branches and leaves, for that matter). Multivariable analysis ( Box 7.16 ) is a set of methods for considering multiple variables simultaneously and for (1) identifying those that by some criteria are associated with an outcome, (2) estimating the magnitude of each variable’s influence in light of all others, (3) quantifying the degree of uncertainty of those estimates, and (4) revealing the relation among the set of variables so identified while (5) dismissing others either as noise or as so correlated with other variables associated with outcome that they either do not contribute further information or are so lacking in additional information content that their association cannot be determined.
• BOX 7.16
Multivariable Versus Multivariate
Multivariable analysis is an analysis of a set of explanatory variables with respect to a single outcome variable. Multivariate analysis is an analysis of several outcome variables simultaneously with respect to explanatory variables. Before modern multivariable analysis was possible, the terms most used for a multivariable analysis were “multiple” or “multivariate.” Since the advent of methods to analyze multiple outcomes simultaneously, multivariable has come to be associated with single outcomes analysis. Some statisticians argue, however, that multivariate is still the correct word to use because multivariable analysis is the degenerate form of multivariate analysis when number of outcomes accounted for simultaneously is one.
Explanatory variables
The set of variables examined in relation to an outcome is called explanatory variables , independent variables , correlates , risk factors , incremental risk factors , covariables , or predictors . These alternative names distinguish this set of variables from outcomes. No statistical properties are implied. The least understood name is independent variable (or independent risk factor ). Some mistakenly believe it means the variable is uncorrelated with any other risk factor. All it actually describes is a variable that by some criterion has been found to (1) be associated with outcome and (2) contribute information about outcome in addition to that provided by other variables considered simultaneously. The least desirable of these terms is “predictor,” because it implies causality rather than association.
Dependent variable
In ordinary regression, this meant the variable on the left side of the equals sign and therefore was distinguished from independent variables on the right side (see later Box 7.17 ). In analysis of non–time-related outcome events, it is synonymous with an indicator variable for occurrence of the event. It is sometimes called the response variable or end point . In time-related analysis, the dependent variable is actually the entire distribution of times to an event, although the indicator variable (e.g., death) is often cited inaccurately as the dependent variable.
Historical note
Fisher understood the relation of outcome to possibly multiple explanatory variables when he wrote that the behavior of a sample could be considered characteristic of the population only when no subsets within the population behaved differently. Yet the use of these ideas in a formal way in medicine emerged only during the last half of the 20th century. This is because multivariable analysis, particularly of events after cardiac procedures and more especially time-related or longitudinal outcomes, involves considerable computational power. The mathematical models are generally nonlinear ( Box 7.17 ), so solving for their parameters is (1) an iterative process, that is, a series of systematically directed mathematical steps that follow an algorithm or plan to find the best value of the parameter (often called a coefficient) and its variability by gradually closing in on it, and (2) a mathematical process in which computations for explanatory variables are performed simultaneously. Because the computational challenge is considerable, use of multivariable analysis had to await development of computers.
• BOX 7.17
Regression
Sir Francis Galton, cousin of Charles Darwin, explored the relation between heights of adult children and average height of both parents (midparent height). He found that children born to tall parents were in general shorter than their parents, and children born to short parents, taller. He called this “regression towards mediocrity.” He even generated a “forecaster” for predicting son and daughter height as a function of father and mother height. It is presented in an interesting way as pendulums of a clock, with chains around two different-sized wheels equivalent to the different weights (regression coefficients) generated by the regression equation!
Today, any empirical relation of an outcome or dependent variable to one or more independent variables (see Box 7.16 ) is termed a regression analysis . Several of these are described later.
Linear
The form of a linear regression equation for a single dependent variable Y and a single independent variable x is:

Y = a + bx

where a is called the intercept (the estimate of Y when x is zero), and b is the slope (the increment in Y for a one-unit change in x ). More generally, when there are a number of x ’s:

Y = β0 + β1x1 + β2x2 + ⋯ + βkxk

where β0 is the intercept, x1 through xk are independent variables, and β1 through βk are weights, regression coefficients, or model parameters (see Box 7.11 ) that are multiplied by each x to produce an incremental change in Y.
It would be surprising if biological systems behaved as a series of additive weighted terms like this. However, this empirical formulation has been valuable under many circumstances in which there has been no basis for constructing a biomathematical model based on biological mechanisms (computational biology).
An important assumption is that Y is distributed in Gaussian fashion (see Box 7.13 ), and this may require the scale of the raw data to be mathematically transformed.
Log-linear
A log-linear regression equation has the following form:

Ln(Y) = β0 + β1x1 + β2x2 + ⋯ + βkxk

where Ln is the logarithm to base e . Such a format is used, for example, in the Cox proportional hazards regression model (see Section IV ). However, in studies of events (see “ Logistic Regression Analysis ” and “Time-Related Events” in Section IV), the estimation procedure does not actually use a Y . Rather, just as in finding the parameter estimates called mean and standard deviation of the Gaussian equation (see Box 7.13 ), parameter estimation procedures use the data directly. Once these parameters are estimated, a predicted Y can be calculated.
Logit-linear
A logit-linear regression equation, representing a mathematical transformation of the logistic equation (see Fig. 7.1 A and Section IV), has the following form:

Ln[P/(1 − P)] = β0 + β1x1 + β2x2 + ⋯ + βkxk

where Ln is the logarithm to base e and P is probability. The logit-linear equation is applicable to computing probabilities once the β s are estimated.
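Solving the logit-linear equation for P yields the familiar S-shaped logistic relation (see Fig. 7.1 A) by which estimated βs are converted back to absolute probability:

P = 1/(1 + e^−(β0 + β1x1 + ⋯ + βkxk))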
Model
A model is a representation of a real system, concept, or data, and particularly the functional relationships within these; it is simpler to work with, yet predicts real system, concept, or data behavior.
Mathematical model
A mathematical model consists of one or more interrelated equations that represent a real system, concept, or data by mathematical symbols. These equations contain symbols that represent parameters (constants) whose values are estimated from data and an estimating procedure.
A mathematical model may be based on a theory of nature or mechanistic understanding of what the real system, concept, or data represent (biomathematical models or computational biology). It may also be empirical. The latter characterizes most models in statistics, as depicted previously. The Gaussian distribution is an empirical mathematical model of data whose two parameters are called mean and standard deviation (see Box 7.13 ). All mathematical models are more compact than raw data, summarizing them by a small number of parameters in a ratio of 5 to 10 or more to 1.
Linear equation
When applied to mathematical models, it is an equation that can be solved directly with respect to any of its parameter values by simple mathematical manipulation. A linear regression equation is a linear model.
Nonlinear equation
When applied to mathematical models, a nonlinear equation is an equation that cannot be solved directly with respect to its parameter values but rather must be solved by a sequence of guesses following a recipe (algorithm) that converges on the answer (iterative). The logistic equation is a nonlinear model.
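As an illustration of such an iterative solution, the following minimal sketch (in Python, with simulated data; the language, optimizer, and parameter values are our assumptions, not part of the chapter) estimates the two parameters of a simple logistic model by repeatedly improving guesses until the likelihood can no longer be increased.

```python
# A minimal sketch (not the authors' software) of fitting a nonlinear model
# iteratively: logistic-regression coefficients estimated by maximizing the
# log-likelihood with a general-purpose iterative optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # a single explanatory variable
p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))  # true beta0 = -1, beta1 = 2
y = rng.binomial(1, p_true)                   # observed 0/1 outcomes

def neg_log_likelihood(beta):
    eta = beta[0] + beta[1] * x               # linear predictor (logit units)
    p = 1 / (1 + np.exp(-eta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Iterative search: repeated guesses converge on the maximum-likelihood values.
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
print("estimated beta0, beta1:", fit.x)
```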
The first use of multivariable analysis to identify risk factors for outcome events in humans was probably the Framingham epidemiologic study of coronary artery disease. Two papers are landmarks in this regard. In 1967, Walker and Duncan published their paper on multivariable analysis in the domain of logistic regression analysis, stating that “the purpose of this paper is to develop a method for estimating from dichotomous (quantal) or polytomous data the probability of occurrence of an event as a function of a relatively large number of independent variables.” Then in 1976, Kannel and colleagues coined the term “risk factors” (actually “factors of risk”), noting that (1) “a single risk factor is neither a logical nor an effective means of detecting persons at high risk” and (2) “the risk function…is an effective instrument…for assisting in the search for and care of persons at high risk for cardiovascular disease.” In 1979, the phrase “incremental risk factors” was coined at UAB to emphasize that risk factors add in a stepwise, or incremental, fashion to the risk present in the most favorable situation.
Before the advent of multivariable analysis, stratification of the values of one or more potential risk factors was often used to search for association of risk with outcome. Although this is still of interest as a scanning method, it has serious disadvantages, including (1) loss of information by coarseness of stratification and (2) possibly erroneous inferences from the necessarily arbitrary nature of stratification. These dangers were well summarized by Kannel and colleagues, who stated that “while there is some convenience in dichotomizing a continuous variable like blood pressure into high and low, one would prefer some method to take into account the exact value.”
Machine learning for multivariable analysis.
Advances in machine learning for multivariable analyses evolved rapidly beginning in the late 1990s. Freund and Schapire introduced “AdaBoost” (adaptive boosting) for classification, a precursor to the now widely used “gradient boosting” method. A remarkable feature of AdaBoost was that it required little supervision and appeared to be immune to overfitting. In many experiments it was found to outperform traditional methods, especially in challenging examples where standard model assumptions failed to hold. The mathematical explanation for AdaBoost’s success, however, seemed very mysterious: the algorithm iteratively reweighted the observations, fitting at each iteration a simple “weak” learner (such as a tree), with higher weights given to data points that were difficult to classify. This seemed intuitive, but a rigorous explanation of what it was doing was not clear at first.
Fig. 7.11 provides an illustration of how AdaBoost uses its weights to construct a classifier. Data were simulated from the circle in a square synthetic experiment (in this experiment, we have two covariates, x 1 and x 2 , and the outcome is one of two classes: class one is a circle and class two is the outside of the circle inside a square). Stumpy trees with a depth of 3 were used for weak learners (trees are characterized later in this section). The figure shows how AdaBoost’s weights vary with number of iterations, m. As seen, weights quickly migrate from a uniform distribution to small and large values, where large values are observed near the classification boundary (where the circle touches the square). The last panel displays test-set prediction error. Error drops substantially with only a few trees: This rapid effect is due to classifying easy cases, those away from the boundary, which make up the majority of the data. After that, AdaBoost spends its effort trying to improve classification for the hard to classify cases near the boundary, and the effect on reducing prediction error is much slower.
Circle in a square two-class synthetic experiment using AdaBoost with stumpy tree learners of depth 3. Shown are AdaBoost weights for iterations m = 1, 2, 3, 4, 5, 6, 10, 50. Sizes of weights are proportional to sizes of points: values near the boundary of the circle with the square are hardest to classify and receive the largest weights as m increases. The bottom right panel shows test-set prediction error as a function of m (number of trees).
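The reweighting loop itself is short. The following sketch is a schematic re-creation of the circle-in-a-square experiment (the radius, sample size, and number of iterations are illustrative assumptions, not the chapter's actual code), showing how observation weights grow where classification is hard.

```python
# A minimal sketch of AdaBoost's reweighting loop. Depth-3 "stumpy" trees are
# the weak learners, and observation weights grow where classification is hard.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(1000, 2))                    # points in a square
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.9**2, 1, -1)     # class 1 = inside circle

n = len(y)
w = np.full(n, 1 / n)          # start with uniform weights
F = np.zeros(n)                # combined (additive) classifier score
for m in range(50):
    stump = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)             # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))     # learner's vote
    w *= np.exp(-alpha * y * pred)                        # upweight mistakes
    w /= w.sum()
    F += alpha * pred
print("training error:", np.mean(np.sign(F) != y))
```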
Many learning methods apply the principle of empirical risk minimization. In many cases empirical risk equals the training error. For example, in regression the empirical risk is equivalent to goodness of fit, which essentially amounts to measuring how well the procedure performs in fitting the data. These procedures seek to minimize empirical risk and are constructed in a manner allowing them to fit flexible models while avoiding overfitting (this latter aspect of machine learning to avoid overfitting is called regularization). Empirical risk minimization is a well-understood concept with good properties, but how then was it possible for AdaBoost to work so well? There seemed no reason to believe that AdaBoost was an empirical risk-reduction strategy.
Surprisingly, researchers were later able to show that the weighting scheme used by AdaBoost was equivalent to fitting an additive stagewise learner. In fact, the algorithm was shown to be minimizing a loss function nearly identical to that used in logistic regression. The term stagewise refers to a procedure that repeatedly refits residuals. Thus, AdaBoost repeatedly refits residuals, each time seeking to minimize a loss function and its associated empirical risk.
The insight that AdaBoost was actually performing risk minimization quickly led to a unified treatment for general loss functions applicable to many problem settings. This unified treatment is now generally referred to as gradient boosting and is taken to be the modern standard for boosting. Just like AdaBoost, the idea is to repeatedly fit the data using weak learners. Gradient boosting, however, sets about choosing the weak learner using the principle of steepest descent, thereby directly attacking the issue of empirical risk minimization head on. (See also “matching pursuit,” which is a closely related idea.) The AdaBoost algorithm was originally proposed for classification. It constructs its classifier by using an additive expansion F_M = Σ_{m=1}^{M} c_m h_m, where h_1, h_2, …, h_M are weak learners and F_M is the combined learner used for classification. Therefore, AdaBoost adopts the view of combining weak base learners for the purpose of creating a better classifier. A closely related idea is ensemble learning, which was another intensely studied area in the early development of machine learning. Researchers found, surprisingly, that a procedure’s performance could be significantly improved by combining its runs over different subsets of the data. The resulting averaged procedure was termed an ensemble (as an example, an ensemble classifier is a collection of classifiers whose votes are averaged). One especially successful procedure is bagging, which combines bootstrapped trees. Random forests refine the idea of bagging by introducing further randomization into the tree construction. The averaged random trees result in a more accurate ensemble than bagging.
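For orientation, the sketch below (hypothetical simulated data; scikit-learn is assumed as the software, which the chapter does not specify) simply runs the three ensemble ideas just described (gradient boosting, bagged trees, and a random forest) side by side.

```python
# A brief sketch contrasting the ensemble methods named above on a simulated
# two-class problem: gradient boosting, bagged trees, and a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "bagged trees":      BaggingClassifier(random_state=0),
    "random forest":     RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()   # 5-fold cross-validation
    print(f"{name}: {acc:.3f}")
```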
A characteristic common to the aforementioned procedures is that they can all be implemented with little supervision. Typically, only a few parameters, called tuning parameters, need to be determined, and choosing them wisely prevents overfitting. Tuning of parameters is the so-called regularization step needed when fitting a machine learner. For gradient boosting, the most important parameter is the number of iterations (in our circle in the square problem, this is the number of trees used); for random forests, they are the parameters specifying how a random tree is constructed, for example, how many random features to be selected when splitting an internal tree node or how large the terminal nodes (end of a tree) are. Importantly, a hallmark of these methods is their robustness to tuning parameters and that parameter tuning is generally easy to do in practice; typically, this is accomplished by using cross-validation. Another hallmark is their ability to automatically learn from the data and find difficult-to-grasp relationships, such as interactions and nonlinear trends without human input. For this reason, these procedures are generally referred to as machine learning methods.
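Tuning by cross-validation can be sketched as follows (the parameter grids, data, and library calls are illustrative assumptions): the number of boosting iterations for gradient boosting, and the number of candidate variables per split and terminal node size for a random forest.

```python
# A minimal sketch of the "regularization step" described above: choosing a
# small number of tuning parameters by cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Gradient boosting: the key parameter is the number of trees (iterations).
gb = GridSearchCV(GradientBoostingClassifier(random_state=0),
                  {"n_estimators": [50, 100, 200, 400]}, cv=5).fit(X, y)

# Random forest: parameters govern how each random tree is grown.
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"max_features": ["sqrt", 0.5],      # variables tried per split
                   "min_samples_leaf": [1, 5, 20]},    # terminal node size
                  cv=5).fit(X, y)

print(gb.best_params_, rf.best_params_)
```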
Carrier of risk factors: Underlying mathematical model
Multivariable analysis as described by the Framingham investigators requires a model (equation) that relates a placeholder for explanatory variables (generally one or more of the model parameters) to the dependent (outcome) variable. The equation may be a completely linear one; for these, iterative techniques are not required, but for most models of surgical outcomes the computations are large and for all practical purposes require a computer. The general term for such a model is a regression equation (see Box 7.17 ).
Logistic multivariable regression analysis is a nonlinear model that is illustrative for understanding the nature of the relation of risk factors to outcome in a medically rational fashion. Fig. 7.1 A illustrates the relation between the absolute probability of a clinical event on the vertical axis and an expression of risk measured in logit units along the horizontal axis. The horizontal axis is the one related to risk factors. The relation is sigmoidal (S-shaped). Notice that an increment of risk along the horizontal axis, if far to the left or right of the curve, is not associated with a perceptible increase or decrease along the probability scale. However, a small increment near 0 logit units is associated with a large change in probability.
To illustrate, imagine two patients. One is a strapping football player who is mugged on his way to a pharmacy late at night. He is stabbed in the abdomen, and his inferior vena cava is lacerated. Fortunately, a trauma center is nearby, and he is rushed to surgery. His anxious parents arrive at the hospital about an hour after the incident and want to know “What are his chances, doctor?” Let us say that the injury moves the football player’s risk two units to the right on the logit scale. Before the incident, this robust individual was positioned far to the left on the logit curve, so his chances of recovery are good.
A week later, the second patient, a frail, elderly diabetic man, is walking to the same pharmacy for his insulin when he is stabbed in the abdomen, and his inferior vena cava is lacerated. He, too, is rushed to the trauma center and into the operating room. An hour later, his anxious daughter arrives at the hospital and wants to know “What are his chances, doctor?” The fragile patient may already have been sitting near the center of the logit curve, say at −1 logit units, before the incident. Two logit units of acute risk greatly increase his probability of hospital mortality.
These anecdotes emphasize that the models’ underlying composition makes good medical sense. They reflect what we mean by a robust patient, a fragile patient, and an unsalvageable patient. They reflect the reality that the identical risk factor may operate with respect to absolute risk differently, depending on the presence or absence of other risk factors, that is, where the patient is along the horizontal axis.
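A short calculation makes the anecdotes concrete. Assuming, for illustration, that the robust patient sits at −4 logit units and the fragile patient at −1 (positions not stated in the text), adding the same two logit units of acute risk produces very different changes in absolute probability:

```python
# Hypothetical illustration: probability of the event at a given position on
# the logit scale, before and after adding two logit units of acute risk.
import math

def probability(logit):
    return 1 / (1 + math.exp(-logit))

for label, baseline in [("robust patient", -4.0), ("fragile patient", -1.0)]:
    shifted = baseline + 2.0     # the stab wound adds two logit units of risk
    print(f"{label}: {probability(baseline):.1%} -> {probability(shifted):.1%}")
# robust patient:  1.8% -> 11.9%  (absolute risk rises only modestly)
# fragile patient: 26.9% -> 73.1% (the same increment is devastating)
```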
Risk factor identification
Given a mathematical model to carry risk factors (see Box 7.17 ), the next task is risk factor identification. It requires (1) screening of candidate variables for suitability in the analysis, (2) calibrating continuous and ordinal variables to outcome, (3) selecting variables related to outcome, and (4) presenting results in the format of incremental risk factors (see Box 7.16 ).
Screening.
Screening candidate variables has two purposes: (1) to determine whether there are sufficient data (see Box 7.4 ) to be suitable in the analysis and (2) to understand a variable in relation to other candidate variables. Because for outcome events the effective sample size for analysis is the number of events, not the number of patients, a variable may not be suitable for analysis when it represents a subgroup of patients with too few events to evaluate. This represents a limitation of the study, not of methodology. Indeed, one is generally happy with a therapy associated with few events ; however, it then makes sense that risk factors cannot be identified.
We do not screen variables to discover which ones relate individually to outcome. It is a common practice of many groups to ignore variables that are not univariably associated with outcome. However, there is a long history of occurrence of lurking variables ( Box 7.18 and Fig. 7.12 ) that are found to relate to outcome only when (1) other variables that mask their importance are accounted for in the analysis or (2) they are suitably transformed (or coupled with nonlinear rescaling of themselves), indicating a complex association with outcome. ,
• BOX 7.18
Lurking Variables
Lurking variables are those found to relate to some outcome or dependent variable (see Box 7.17 ) only after (1) other variables masking their importance are taken into account either by multivariable analysis or matched-type analyses (e.g., using balancing scores) or (2) the lurking variable (if continuous or ordinal) is properly rescaled (e.g., transformed) so that complex relations are revealed, such as higher risk of mortality at both old and young age.
Fig. 7.12 A shows survival in patients after exercise stress testing stratified according to long-term aspirin use. Apparently there is no relation to survival. However, Table 7.8 shows that there are multiple differences in patient characteristics between these two groups of patients, with those taking aspirin being older, for example. Indeed, in multivariable analysis, the moment age is taken into account, a beneficial effect of long-term aspirin is revealed. Fig. 7.12 B shows survival in propensity-matched pairs of patients (see “ Clinical Trials with Nonrandomly Assigned Treatment ” in Section I). The lurking benefit of long-term aspirin use is clearly revealed.
Demonstration of a lurking variable. Survival after stress testing is shown on an expanded scale and stratified according to use and nonuse of long-term aspirin therapy. (A) Risk-unadjusted survival in entire cohort. Note similarity of survival. (B) Survival in propensity-matched patients. Note dissimilarity of survival revealed when risk factors for death are balanced between groups.
(From Gum and colleagues. )
It is valuable to determine the pairwise correlation of variables. This will help one understand why many variables may be associated with outcome, but only a few are selected as risk factors. Medical data are highly redundant, sharing a great deal of information.
Calibration of continuous variables.
Continuous variables contain unique values for each patient and so are particularly valuable in analyses. For unclear reasons (statisticians uniformly decry the practice), many investigators stratify continuous variables into two or a few arbitrary categories, throwing away valuable information. This flies in the teeth of a fundamental philosophy of data analysis: continuity in nature (see “ Continuity Versus Discontinuity in Nature ” in Section I). Furthermore, to better understand the phenomenon one is studying, it is important to determine the shape of the relation of continuous variables (e.g., age, birth weight, creatinine) to outcome.
However, the scale on which a continuous variable has been measured or expressed may not coincide with a linear increase in risk. Nature does not know about man-made rulers! Therefore, the appropriate calibration of the variable to outcome must be discovered. One method to accomplish this is to examine various linearizing transformations ( Fig. 7.13 ). However, the “perfect” transformation of scale may not coincide with the best one after other factors have been considered in a multivariable model. Thus, we rely on graphical methods, as in the figure, to obtain a set of similar transformations, and then include all transformed variable candidates in the selection process to be described. A promising offshoot of nonparametric machine learning techniques, such as random forests technology, is the generation of risk-adjusted coplots and risk-adjusted partial dependency plots that can suggest the shape of the relationship of these continuous variables with risk ( Fig. 7.14 ). ,
Calibration of 1-second forced expiratory volume (FEV 1 ) to risk of hospital mortality. Scale of risk is given on vertical axis (akin to logit units of Fig. 7.1 ), and eight groups of equal numbers of patients according to value for FEV 1 along horizontal axis. Their mortality, converted to the risk scale, is shown by each closed circle . (Eighth closed circle cannot be shown because there were no deaths in the eighth group with highest FEV 1 s.) (A) Linear scale of FEV 1 . Clearly there is a decreasing (more negative) value of risk at higher FEV 1 (simple regression line shown, with explained scatter for these points of 80%). (B) Inverse scale of FEV 1 . Because of the inverse transformation, lower FEV 1 s are to right of scale, and higher FEV 1 s to left. Risk falls from left to right, unlike in A . There is now tighter correspondence of risk to this rescaling of FEV 1 (85% of scatter explained) than in the conventional scale of A .
(From Blackstone and Rice. )
Risk-adjusted partial dependency plot from a random forest analysis of operative mortality demonstrating the “shape” of its relationship to a number of continuous variables representing demographics and organ dysfunction.
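A minimal sketch of this transformation screening, using simulated data in which risk truly follows the inverse of FEV1 (the data, the statsmodels software, and the candidate transformations are illustrative assumptions, not the analysis behind Fig. 7.13), compares how well each scale fits a simple logistic model:

```python
# Screening candidate linearizing transformations of a continuous variable by
# comparing how well each scale fits a logistic model of mortality.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
fev1 = rng.uniform(0.8, 4.0, size=1500)                   # liters
p = 1 / (1 + np.exp(-(-4.0 + 2.5 / fev1)))                # risk truly follows 1/FEV1
death = rng.binomial(1, p)

for name, x in [("linear", fev1), ("inverse", 1 / fev1), ("log", np.log(fev1))]:
    model = sm.Logit(death, sm.add_constant(x)).fit(disp=0)
    print(f"{name:8s} scale: log-likelihood = {model.llf:.1f}")
# The transformation with the highest log-likelihood (here the inverse scale)
# is carried forward as a candidate into multivariable modeling.
```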
Variable selection.
A seminal contribution of the Framingham Study investigators was the idea that in the absence of identified mechanisms of either disease or treatment failure, useful inferences for medical decision-making, lifestyle modification, and programmatic decisions about avenues of further research can be gleaned by nonspecific risk factor identification. A direct consequence of the idea, however, is that for any set of potential variables that may be associated with outcome, there is no unique set of risk factors that constitute the best common denominators of disease or treatment failure. Therefore, different persons analyzing the same data may generate different sets of risk factors. As a consequence, multivariable identification of risk factors has become an art that depends on expert medical knowledge of the entity being studied, understanding the goals of the research, knowledge of the variables and how they may relate to the study goals as well as to one another, identification of the quality and reliability of each variable, and development of different, often sequential, analysis strategies appropriate to each research question. Not all these issues of art or expertise will disappear, but there are substantial aspects of multivariable analysis that are yielding to science.
Naftel of UAB, in an important 1994 letter to the editor of the Journal of Thoracic and Cardiovascular Surgery, addressed nine aspects of multivariable analysis that contribute to obtaining different models (sets of risk factors). He called these “steps and decisions that may influence the final equation”:
• Differing statistical models. For example, if time-related events are being modeled, results using a Cox proportional hazards model (see “ Cox Proportional Hazards Regression ” later in this section) will differ from those using a multiphase nonproportional hazards model (see “ Parametric Hazard Function Regression ” later in this section).
• Differing approaches to missing data (see “ Managing Missing Values ” in Section III).
• Differing approaches to minimal information (see Box 7.4 ).
• Differing approaches to correlated data. Variables with similar information content should be chosen for maximal insight by the clinical investigator, not necessarily the statistician.
• Differing coding of data. Some may pay more attention than others to linearizing transformation of continuous variables, to whether continuous or ordinal variables should be dichotomized or in other ways collapsed, or to management of interaction (multiplicative) variables.
• Differing approach to apparently incorrect data. True data outliers, handling of clearly imperfect data, improbable combinations of variables (e.g., apparently exceedingly short, very heavy patients as a result of misplaced decimal points or mixed metric and English units), and attitude toward whether or not a large sample negates errors are all decisions made during the screening process for multivariable analysis.
• Differing variable selection methods and P value criteria. This area is undergoing complete change through introduction of machine learning algorithmic methods. Even with new methods, however, a criterion must be arbitrarily established to differentiate what is signal from what is noise ( P values, for example).
• Differing computer resources. Although even desktop computers rival the computational capacity of large-scale computers of several decades ago, computer-intensive methods may require high-intensity parallel processing.
• Differing appreciations of the science. Unless data analysts work collaboratively with the surgeon-investigator, analysis may be unrevealing. One cannot divorce the underlying clinical science from data analysis.
In all areas, new knowledge has been generated that is beginning to differentiate inadequate techniques from reasonable techniques and optimal techniques. Perhaps the more active area presently is “differing variable selection methods,” and it is an important one. Part of the challenge is that variables may be thought to be risk factors because they are associated with a small P value, and other factors may be thought not to be risk factors because of larger P values, but both opinions may be erroneous (type I and type II statistical errors, respectively; see Box 7.15 ). There is therefore a need for a method that balances these two types of error. Closely coupled with this is the need for a statistic that measures the reliability with which a risk factor has been identified. Because one is analyzing only a single set of data rather than many sets of data about the same subject, determining this reliability has been elusive. It is in the arena of machine learning that promising solutions have arisen to address this gap in knowledge.
Variable selection by bootstrapping.
Thus, there is new thinking about what risk factor identification is. In thinking anew, we leave traditional statistical methodology out of the picture, and risk factor identification becomes an attempt to find signal (risk factors) in noise (other candidates). Important advances in pure mathematics (logical analysis) and machine learning (algorithmic analysis) are proving valuable for such diverse signal detection challenges as handwriting identification, genomic identification, and now risk factor identification. These techniques are evolving rapidly, and we will describe only the most basic here: bootstrap aggregation, or bagging .
Bootstrapping belongs to a class of methods that has been developed over the past 40 years. In 1983, an astonishing article entitled “Computer-Intensive Methods in Statistics” appeared in the popular scientific literature. Its authors, Persi Diaconis and Bradley Efron from Stanford University, indicated that “most statistical methods in common use today were developed between 1800 and 1930, when computation was slow and expensive. Now, computation is fast and cheap. The new methods are fantastic computational spendthrifts….The payoff for such intensive computation is freedom from two limiting factors that have dominated statistical theory since its beginnings: the assumption that the data conform to a bell-shaped curve and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically.”
Efron and his group demonstrated that random sampling with replacement from a data set to create a new data set, resampling to produce perhaps thousands of new data sets, and combining the information generated from these many data sets can produce robust and accurate statistics without assumptions. His group called this technique bootstrapping , after the expression “pulling yourself up by your own bootstraps,” because it reflected the fact that one could develop all the statistical testing necessary directly from the actual data simply by repeatedly sampling them (see footnote g, p. 278).
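In its simplest form, the bootstrap can be sketched in a few lines (simulated data; the statistic and confidence level are illustrative choices, not part of the original account):

```python
# A minimal sketch of Efron's bootstrap: resample the observed data with
# replacement many times and let the resamples themselves supply the
# uncertainty of a statistic, with no bell-curve assumption.
import numpy as np

rng = np.random.default_rng(4)
observed = rng.exponential(scale=2.0, size=120)     # skewed, non-Gaussian data

boot_means = [rng.choice(observed, size=observed.size, replace=True).mean()
              for _ in range(2000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {observed.mean():.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```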
These techniques have been applied to entire analytical processes, including multivariable analysis. , In fact, one still has to pay attention to appropriate models, missing data, variable considerations, correlated variables, appropriate strategy, and so forth, that remain part of a disciplined, informed approach to the data. However, the variable selection process is bootstrapped.
In practice, a carefully crafted set of variables is formulated that will be subjected to simple automated variable selection, such as forward stepwise selection, whereby the most significant variables are entered one by one into a multivariable model. Specific P value criteria for entering and retaining these variables are specified. Then a random bootstrap sample of cases is selected, generally of the same sample size as the original n. A complete automated analysis is performed, and its results are stored. Then another random bootstrap set of cases is drawn from the original data set, and analysis is performed. This resampling of the original data set, followed by analysis and storage of the results, continues perhaps hundreds and even thousands of times, then the frequency of occurrence of factors identified among these many models is summarized. Frequency of occurrence generally stabilizes after about 100 bootstrap analyses. The many models are also analyzed by cluster techniques to detect closely related variables that in the final model will be represented by the most commonly occurring representative and by noting if a variable and one or more transformations of scale occur with about equal frequency, indicating a nonlinear relationship to risk. All this information is used to select variables for the final multivariable model.
Of interest, the variables identified for every bootstrap data set are usually different, a sobering revelation. However, it becomes evident that some variables are never selected or seldom selected; these constitute “noise.” Variables that appear in 50% or more of models are claimed to be reliable and are considered “signal” for inclusion in the final model.
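The following schematic sketch (simplified assumptions: logistic regression, a simple P value-based forward selection, and simulated data in which only two of six candidate variables carry signal) illustrates bagging the variable selection step and tallying how often each variable is chosen:

```python
# Bagging the variable selection step: resample, select, and count how often
# each candidate variable is chosen across bootstrap analyses.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, p_enter=0.05):
    chosen = []
    while True:
        remaining = [v for v in X.columns if v not in chosen]
        pvals = {}
        for v in remaining:
            fit = sm.Logit(y, sm.add_constant(X[chosen + [v]])).fit(disp=0)
            pvals[v] = fit.pvalues[v]
        if not pvals:
            break
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        chosen.append(best)
    return chosen

rng = np.random.default_rng(3)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=list("ABCDEF"))
logit = -1 + 1.0 * X["A"] + 0.7 * X["D"]           # only A and D are true signal
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-logit))))

counts = {v: 0 for v in X.columns}
n_boot = 200
for _ in range(n_boot):
    idx = rng.integers(0, n, n)                    # bootstrap sample (with replacement)
    for v in forward_select(X.iloc[idx].reset_index(drop=True),
                            y.iloc[idx].reset_index(drop=True)):
        counts[v] += 1
print({v: round(100 * c / n_boot) for v, c in counts.items()})
# Variables selected in >=50% of bootstrap models are treated as reliable "signal."
```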
This phenomenon is illustrated in Table 7.7 and Fig. 7.15 . Fifteen variables were selected from among many being analyzed for the late hazard phase of death following mitral valve repair or replacement for degenerative disease. In analysis of the first bootstrap sample, 8 of these 15 variables were selected (only 5 were ultimately found to be reliable risk factors). By 100 analyses, although every variable had been identified as a risk factor in at least 2 analyses, 5 variables dominated the analyses (we considered these reliable risk factors), 8 rarely appeared, and 2 appeared in 22% to 32% of analyses.
TABLE 7.7
Frequency of Occurrence (%) of Variables Selected in Bootstrap Analyses of the Late Hazard Phase of Death after Mitral Valve Repair or Replacement for Degenerative Disease (column headings indicate the number of bootstrap analyses)

| Variable | 1 | 5 | 10 | 55 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| Demography | ||||||||
| Age | 100 | 100 | 100 | 100 | 99 | 99 | 99 | 99 |
| Women | 0 | 0 | 0 | 6 | 7 | 4 | 3 | 5 |
| Noncardiac Comorbidity | ||||||||
| Bilirubin | 0 | 40 | 20 | 16 | 12 | 10 | 10 | 10 |
| BUN | 100 | 40 | 60 | 72 | 76 | 78 | 77 | 78 |
| Hypertension | 0 | 0 | 10 | 6 | 6 | 5 | 6 | 6 |
| Peripheral artery disease | 0 | 0 | 0 | 4 | 2 | 4 | 3 | 3 |
| Smoker | 0 | 0 | 0 | 6 | 8 | 9 | 11 | 10 |
| Ventricular Function | ||||||||
| Ejection fraction | 0 | 0 | 0 | 18 | 22 | 22 | 24 | 25 |
| Left ventricular dysfunction (grade) | 100 | 60 | 70 | 70 | 66 | 66 | 68 | 68 |
| Right ventricular systolic pressure | 100 | 20 | 10 | 10 | 8 | 8 | 8 | 8 |
| Cardiac Morbidity | ||||||||
| Coronary artery disease | 100 | 100 | 100 | 96 | 94 | 92 | 92 | 91 |
| Anterior leaflet prolapse | 100 | 80 | 90 | 82 | 82 | 84 | 85 | 85 |
| Preoperative Condition | ||||||||
| NYHA class | 100 | 20 | 20 | 30 | 32 | 34 | 33 | 36 |
| Hematocrit | 0 | 0 | 20 | 16 | 17 | 14 | 16 | 17 |
| Experience | ||||||||
| Date of operation | 100 | 20 | 10 | 14 | 15 | 14 | 13 | 14 |
BUN, Blood urea nitrogen; NYHA, New York Heart Association.
Example of automated variable selection by bootstrap aggregation (bagging). Fifteen variables labeled A through O are depicted as potential predictors of death after mitral valve surgery. In column A, analyses of five bootstrap samples are shown. Tall bars indicate the variable was selected at P <.05, and gaps represent variables not selected. In all cases, variables A and D were selected, but otherwise analyses appear to be unique. Panel B shows a running average of these five analyses. Variables A, D, I, and J were selected more often than others. Panel C shows averages of 10, 50, 100, 250, and 1000 bootstrap analyses. Notice that no variable was selected 100% of the time, and all 15 were selected at one time or another. But if we consider variables appearing in 50% or more analyses as reliable risk factors, variables A, C, D, I, and J fit that criterion of “signal” and the rest are “noise.”
(From Blackstone and colleagues. )
What happens in bagging (bootstrap aggregation ) is similar to what is seen in signal averaging, such as in visual evoked potentials. Noise is canceled out, and signal amplified. In the same way, many variables appear rarely in models, but a few show up time and time again (see Fig. 7.15 ). One can therefore express the reliability of identification of a given risk factor at a selected level of statistical significance.
Bagging, although demanding a huge number of computer cycles, removes much of the human arbitrariness from multivariable analysis and provides another important statistic: a measure of reliability of each risk factor. Thus, increasingly we have been reporting not only the magnitude of the effect, its variance, and its P value, but also its bootstrap reliability. The technique appears to provide a balance between selecting risk factors that are not reliable (type I error) and overlooking variables that are reliable (type II error).
Variable selection by machine learning.
Variable selection for traditional parametric models generally relies on use of statistical significance ( P values) or ad hoc stepwise methods for reducing dimension and choosing variables. Issues with P values have been discussed earlier, and stepwise procedures are unreliable because their results are highly dependent on the order in which variables are entered or eliminated. Neither of these procedures is suitable when the number of variables is high. A regularization procedure that addresses this is the least absolute shrinkage and selection operator (lasso) method. In the case of regression, the lasso is fit in the same manner as least squares but with the additional constraint that the length of the vector of regression coefficients is bounded. The lasso measures length using absolute distance (mathematically called L1-distance), the sum of the absolute values of the coefficients. For example, if there are two variables x1 and x2 with coefficients β1 and β2, the lasso penalty term is λl(|β1| + |β2|), where λl > 0 is the lasso regularization parameter (as will be explained, larger values of λl induce more penalization and therefore sparser solutions). This differs from ridge regression, which constrains the length of the parameters using squared Euclidean distance (called L2-distance). Thus, in our example the ridge penalty is λr(β1² + β2²), where λr > 0 is the ridge regularization parameter.
Ridge regression was introduced to combat instability of least squares in linear regression arising from the presence of collinearity among the covariates. By applying the ridge penalty, we can always be assured that a solution exists even in high correlation and even in the presence of many covariates. This is a nice feature of using ridge penalization; however, the problem is that this does not address the issue of variable selection. The seemingly small difference between L 2 -regularization used by ridge versus L 1 -regularization used by lasso has tremendous consequences for selecting variables, especially in high dimensions. Unlike ridge regression, the estimated regression model using lasso will have coefficient values that are exactly zero, whereas ridge regression is not able to set any coefficient values to zero. Thus, the lasso has the desirable property that it achieves estimation and variable selection simultaneously, the latter being achieved by having coefficients that are exactly zero and therefore of no value to the model.
How does the lasso achieve this? Consider Fig. 7.16 . Displayed is the optimization problem for the lasso in a regression problem involving three variables, labeled x, y, and z . As mentioned, lasso is a penalization problem that penalizes the least squares solution by absolute distance. The least squares solution (the usual regression solution) is the center of the ellipsoid. To solve the penalization problem, the solution is to find the point at which the ellipsoid first touches the constraint region, which is the pointy region centered at zero. The size of the constraint region depends on the lasso parameter λ l , with larger values applying more constraint. Because of the shape of this region, it will often happen that the solution will touch one of the axes, which in this case is x . The result is that this coefficient becomes zero. In problems with many variables, the pointy nature of the constraint region will induce many variables to be zero. This leads to the so-called sparsity property of the lasso in high dimensions.
The lasso solution is the point where the ellipsoid touches the lasso constraint region |x| + |y| + |z| ≤ C.
Now consider Fig. 7.17 , which is the solution for ridge regression. The difference here is that the constraint region is a sphere centered at zero. The ridge solution is where the ellipsoid touches the sphere. The size of the sphere is related to the ridge parameter, which, like the lasso parameter, controls the amount of regularization. However, because of the round nature of the sphere, there is no possibility for any coefficient to become exactly zero as it can with the lasso. This problem persists and becomes worse as the number of variables increases. Thus, the ridge estimator does not possess sparsity and cannot be used for variable selection like the lasso.
The ridge solution ( p = 3 dimensions) is the point where the ellipsoid centered at the ordinary least squares value touches the constraint region x² + y² + z² = C.
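The practical consequence is easy to demonstrate. In the sketch below (simulated data; the regularization parameter is an arbitrary illustrative choice), the lasso zeroes out the noise variables while ridge merely shrinks them:

```python
# Contrasting lasso and ridge: with comparable regularization, lasso sets some
# coefficients to exactly zero (variable selection), whereas ridge only shrinks.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])   # only 3 real effects
y = X @ beta_true + rng.normal(size=300)

lasso = Lasso(alpha=0.5).fit(X, y)     # alpha is the regularization parameter
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso coefficients:", np.round(lasso.coef_, 2))   # several exact zeros
print("ridge coefficients:", np.round(ridge.coef_, 2))   # small but nonzero
```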
In another promising approach, machine learning technologies are being harnessed in interesting ways by either embedding traditional parametric models or extending nonparametric analytic strategies. As mentioned, results of traditional variable selection are highly dependent on the order in which variables are entered or eliminated. One can instead imagine forming thousands of bootstrap models with clusters of randomly chosen variables forced into each and aggregating the results. One can apply learning theories to model development. One can examine variable importance (often revealing that many variables actually degrade predictive power). Splitting algorithms can be averaged to reveal the most common splits ( Fig. 7.18 ).
Use of random forests for variable selection. (A) Example of a random tree. A bootstrap sample of patients from the original data set is used to create a random tree. At the root node, a random set of variables is chosen to be candidates, and the most predictive variable for survival among those is identified. Node levels are numbered based on their relative distance to the top of the tree (i.e., 0, 1, 2). Splitting of nodes to create trees continues until terminal nodes have a few distinct events (e.g., deaths). (B) Illustration of minimal depth of a variable in a random tree from a 2000-tree forest. Highlighted are three top variables: peak V̇O2 (violet), blood urea nitrogen (BUN; aqua), and exercise time (tan). Depth of a node is indicated by numbers 0, 1, 2, 3-8. Minimal depths are 0, 1, 2 for exercise time, peak V̇O2, and BUN, respectively. (C) Illustration of six random trees from a 2000-tree forest. The three most important variables among these trees are color coded blue for treadmill exercise time, violet for peak V̇O2, and green for serum BUN. (D) Minimal depth (variable importance) from random survival forests analysis. Dashed blue line is the threshold for filtering variables: all variables below the line are predictive. Diameter of each circle is proportional to the forest-averaged number of maximal subtrees for that variable.
(From Hsich and colleagues. )
In machine learning, variable selection is often performed using variable importance (VIMP), defined by how much prediction accuracy of the model depends on the information in each feature. One of the most popular methods is permutation importance, introduced in random forests by Breiman. To calculate a variable’s permutation importance, the given variable is randomly permuted in the out-of-sample data (i.e., the observations not selected in the bootstrap random sampling with replacement, called the out-of-bag [OOB] data), and the permuted OOB data are dropped down a tree. OOB prediction error is then calculated. The difference between this and the OOB error without permutation (i.e., from the original tree), averaged over all trees, is the importance of the variable. The larger the permutation importance of a variable, the more predictive the variable, as illustrated in Fig. 7.19 .
Illustration of how randomly permuting a variable leads to different terminal node assignments, which is at the heart of why permutation importance works. Red nodes are tree nodes that split on the target variable v. In the top panel on the left, the bold arrows show the path that a data point x takes as it traverses the tree to its terminal node assignment “1.” On the right, the v coordinate of x has been randomly permuted, and its new terminal node assignment is now “4.” The bottom panel shows the path for another x value. Its terminal node assignment is “6”; when v is permuted, its new terminal assignment becomes “5.” Importantly, notice that in the top panel the terminal node assignment after randomly permuting v is much farther from its original terminal node assignment than in the bottom panel. This shows that the higher v splits in the tree, the more effect permutation has on final terminal node membership, and hence on prediction error.
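A brief sketch of permutation importance follows (computed here on a held-out test set with scikit-learn for simplicity, rather than on the out-of-bag data of Breiman's original random-forest formulation; the data and settings are illustrative assumptions):

```python
# Permutation importance: shuffle one variable at a time and measure how much
# prediction accuracy deteriorates; larger drops indicate more predictive variables.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
result = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"variable {i}: importance = {imp:.3f}")   # larger = more predictive
```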
Other approaches not using prediction error have also been developed for selecting variables; however, these tend to be specifically designed for the algorithm being considered. Recently, attention has been given to developing variable importance that can apply more generally across different types of learning procedures within the framework of model selection. Related to this are methods developed within the framework of model-free feature screening in which explanatory variables are identified without producing an overall prediction model (see Li and colleagues for a discussion of the difference between model selection and variable selection).
A new promising method called variable priority (VarPro) takes a broader approach in the spirit of these latter methods. An interesting aspect of VarPro is that it does not assume linearity or other specific model formulations often used for the conditional means in model-free feature screening algorithms; on the other hand, it does construct trees, just as tree-based model-selection methods do. Its goal, however, is different: to construct neighborhoods of the covariate space rather than to predict the outcome. The method is therefore called model-independent, borrowing the best parts of both model selection and model-free variable selection. Indeed, “variable selection” is a misnomer, because the inner working of VarPro is to progressively exclude (give zero weight to) variables that contribute nothing to predicting outcome, leaving behind those that do. Using rules from externally constructed trees, the VarPro importance statistic for a set of variables equals the difference between the estimator of the conditional mean based on a rule and the estimator based on the released rule obtained by removing any constraints on the variables of interest. VarPro readily scales to large data sets, requiring only the calculation of sample averages, and can be used in a variety of settings, including regression, classification, and survival.
Verification.
The ideal verification of a multivariable analysis is to demonstrate its accuracy in predicting results of a new set of patients, preferably extramurally. Another popular method, if the data set or number of events is large, is to split the data set randomly into training and testing data sets. Modeling is performed on the former and verification on the latter. Whether this is an efficient and effective strategy has been debated. One of the first applications of bootstrapping was to address this issue by generating multiple training and testing sets. Within the domain of the primary multivariable analysis itself, there are, as it were, internal validity diagnostics. For example, in linear regression (see Box 7.17 ), a measure of explained scatter is the r² value (square of the familiar correlation coefficient). It is desirable that the value of r² be high (closer to 1 than 0); however, if a model is overdetermined by having in it either too many factors or surrogates for the outcome-dependent variable, a high r² may be spurious.
Calibration.
In logistic regression (see “ Logistic Regression Analysis ” later in this section), a number of diagnostic tools are available. One of the earliest was the decile table, often attributed to Hosmer and Lemeshow but used much earlier by the Framingham investigators and others. By solving the multivariable equation for each patient, patients are ordered with respect to their estimated probability of experiencing an event. They are then stratified in up to 10 groups (thus “decile”), and within each group the estimated probabilities are summed. This sum represents expected events; it is compared with observed events in each decile. The Hosmer-Lemeshow statistic is a general calibration test of the differences between observed and predicted events ( Box 7.19 ).
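A sketch of the decile table (simulated predictions; for illustration only, not a published model) orders patients by predicted probability, bins them into deciles, and compares expected with observed events in each bin:

```python
# Decile (Hosmer-Lemeshow-style) calibration table: expected vs. observed
# events within each decile of predicted probability.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
predicted = rng.uniform(0.01, 0.40, size=2000)          # model-estimated probabilities
observed = rng.binomial(1, predicted)                    # events actually observed

df = pd.DataFrame({"predicted": predicted, "observed": observed})
df["decile"] = pd.qcut(df["predicted"], 10, labels=False) + 1
table = df.groupby("decile").agg(expected=("predicted", "sum"),
                                 observed=("observed", "sum"))
print(table.round(1))    # well-calibrated models show expected close to observed
```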
• BOX 7.19
Calibration and Discrimination
Calibration
Calibration is the process of determining if the results predicted by a model are consistent with the actual data, the closeness (goodness) of fit of a model to data. Hosmer and Lemeshow introduced a method to test the goodness of fit of expected proportion of events as predicted from such regression analyses to observed proportion of events. , They proposed a simple way to do this was to “bin” data into deciles (10 groups) of ascending predicted proportion of events and determine within each decile the proportion of actual events observed. A good fitting model would line these up on the diagonal. Deviation from that line was tested by the simple chi-squared goodness of fit test. Despite advances since, this is still a useful tool to explore non-linearities of continuous variables in a model and presence of interactions between variables.
A calibration metric that is useful for both binary models and time-to-event models is the Brier score , which quantifies the differences between predicted probability ( p ) and observed events ( o ):

Brier score = (1/N) Σ (pᵢ − oᵢ)², summed over all N patients

Note that the better the accuracy, the smaller the value of the Brier score. Walsh and colleagues also provide a comparison of calibration methods.
Discrimination
Discrimination is the ability of a model to stratify data into 2 or more distinct classes. The most familiar test of discrimination is the C-statistic, or area under the curve (AUC) of a receiver operating characteristic (ROC) curve. To understand this, consider the following 2 × 2 table (sometimes called a confusion matrix):
| | Event happened (positive) | Event didn’t happen (negative) |
|---|---|---|
| Predicted positive | a True positive (TP) | b False positive (FP) |
| Predicted negative | c False negative (FN) | d True negative (TN) |
From this are derived a number of relationships:
• Sensitivity = TP/(TP + FN), detection of true positives
• Specificity = TN/(TN + FP), detection of true negatives
• Positive predictive value = TP/(TP + FP), for those testing positive, proportion actually positive
• Negative predictive value = TN/(TN + FN), for those testing negative, proportion actually negative
• Precision = TP/(TP + FP), which is identical to positive predictive value
• Recall = TP/(TP + FN), which is identical to sensitivity
The ROC curve is plotted as sensitivity versus 1 − specificity, and ROC-AUC is the area under this curve.
Trouble begins, however, when events, such as those after cardiac surgery, become rare and the denominator of nonevents proportionately large (known as imbalanced data), as demonstrated in detail under “Classification Using Machine Learning.” In such cases, false negatives become a large problem, particularly for machine learning classifiers, whose predictions will favor the majority class. This suggests a focus on the minority class, which is what recall (sensitivity) does along with precision (positive predictive value). As data become increasingly imbalanced, ROC-AUC approaches 1 even though prediction of the outcome of interest may be poor, because the curve is dominated by the many true negatives, which keep the false-positive rate low. In other words, sensitivity focuses on how well a model detects true positives, and precision focuses on how well the model avoids false positives. Thus, for imbalanced data, one wants the area under the precision-recall curve (PR-AUC) to approach 1.
There are a multitude of tests for discrimination in addition to ROC-AUC and PR-AUC. For time-related events, Harrell’s concordance index is analogous to ROC-AUC. The G-mean is the square root of the product of sensitivity and specificity; it exposes poor performance in classifying the minority class, because specificity will approach 1 when there are few false positives, while false negatives are penalized through the sensitivity term. F1 measures the balance between precision and sensitivity: it is 2 times the product of sensitivity and precision divided by the sum of sensitivity and precision. (A brief numerical sketch computing these metrics follows this box.)
One of the best sources of information on calibration and discrimination and other metrics for goodness of model fit is a series of articles by Frank Harrell and colleagues. For machine learning, O’Brien and Ishwaran have provided a method to generate accurate probabilities in the face of class imbalance.
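The metrics defined in Box 7.19 can be computed in a few lines. The sketch below uses simulated scores for an imbalanced outcome (about a 5% event rate); the threshold and score distribution are illustrative assumptions:

```python
# Computing discrimination metrics for an imbalanced two-class problem.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix)

rng = np.random.default_rng(7)
y = rng.binomial(1, 0.05, size=5000)                     # ~5% event rate (imbalanced)
score = np.clip(0.05 + 0.4 * y + rng.normal(0, 0.15, 5000), 0, 1)  # model output
pred = (score >= 0.3).astype(int)                        # a chosen decision threshold

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
sensitivity = tp / (tp + fn)          # recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)            # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)
g_mean = (sensitivity * specificity) ** 0.5
print(f"sens {sensitivity:.2f}  spec {specificity:.2f}  prec {precision:.2f}  "
      f"F1 {f1:.2f}  G-mean {g_mean:.2f}")
print(f"ROC-AUC {roc_auc_score(y, score):.2f}  PR-AUC {average_precision_score(y, score):.2f}")
```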
Discrimination.
Borrowing from classification theory, which deals with false and true positives and false and true negatives, an analysis of sensitivity and specificity can be performed, varying the cut point from 0 to 1 of what probability is considered predictive of an event occurring. The number of correctly predicted events (true positives) divided by the number of true positives plus false negatives is sensitivity. The number of correctly predicted nonevents (true negatives) divided by the number of true negatives plus false positives is specificity. A graph of 1 − specificity on the horizontal axis and sensitivity on the vertical axis is then constructed—the receiver operating characteristic (ROC) curve ( Fig. 7.20 ). The area beneath this curve is a measure of goodness of fit (discrimination; see Box 7.19 ). (Harrell and colleagues called this the c index of concordance.) It varies from .5 to 1. A concordance index between .8 and .9 is desirable for prediction purposes.
Illustration of receiver operating characteristic (ROC) curves. These are for renal failure after either coronary artery bypass grafting or cardiac valve procedures in 15,844 patients operated on at Cleveland Clinic from 1986 to 2000. Three ROC curves are shown. One is based on preoperative laboratory measurements alone, the second on extensive clinical data alone, and the third combines the two. Diagonal dashed line is the line of random prediction.
Other model diagnostics.
In addition, for all varieties of multivariable models, a number of regression diagnostic procedures are used, including formal testing of goodness of fit, identification of observations that particularly influence the results, and analysis of residuals (the difference between observed and predicted values) in linear regression.
However, as risk of cardiac surgery approaches zero (in terms of mortality and several morbidities), traditional ROC curves as a measure of discrimination become deceptive, yielding a misleading metric of accuracy. This is because the outcomes are highly imbalanced (few events among a large number of patients), and simply predicting no events will result in a high metric of accuracy. There are several metrics that mitigate this problem (see Box 7.19 ). One is the graph of precision versus recall. Precision is another term for positive predictive value (true positives/[true positives + false positives]), and recall is another term for sensitivity (true positives/[true positives + false negatives]).
A validation technique, part of calibration, that holds future promise is OOB prediction error assessment. As noted earlier in this section, any one bootstrap sample fails, on average, to select about a third of patients. These nonselected patients are known as the OOB sample. A model developed on the approximately two thirds of selected data can be applied to the OOB sample and its prediction error calculated, as is done for VIMP.
Presentation.
A multivariable parametric model analysis generates an enormous amount of information, including:
• The structure of the model and estimates of parameters related to that structure
• A list of risk factors identified
• Magnitude of association of each risk factor with outcome as adjusted for all other variables in the model (these multipliers may be expressed either as the parameter estimates themselves—called model coefficients—or as some reformatted relative risk expression; see Box 7.3)
• Direction of each relation, positive or negative
• Uncertainty of the associations, generally expressed as standard deviations of the coefficients
• A statistical score on which a P value is based
• P values
• A set of numbers indicating quantitative interrelation of all parameter estimates in the model (the variance–covariance matrix)
• Bootstrap reliability of each risk factor identified
There is some controversy about which of these nine sets of numbers should be reported in a manuscript. It may be sufficient for understanding the relations to simply list the risk factors and place in an appendix some of the numeric data. If the model is intended to be used for prediction, including CLs, the entire list must be reported or provided electronically, as was previously done for Society of Thoracic Surgeons National Cardiac Database models.
None of the nine, however, directly addresses the way a final multivariable model is formulated to reveal incremental risk factors (see “ Incremental Risk Factor Concept ” in text that follows). The incremental risk factor concept was developed to facilitate medical interpretation of a multivariable analysis. Any dichotomous risk factor in a multivariable analysis can be complemented to allow it to have a positive sign. This is desirable because we think of variables in the model as risk factors, and usually we consider risk to be increasing (positive value) with increasing value of the risk factor. Generally, continuous and ordinal variables cannot be formulated this way, so we recommend that each of these be accompanied by an indication of the direction of greater risk (younger age, lower ejection fraction, greater functional impairment, higher bilirubin).