Description and Analysis of Data, and Critical Appraisal of the Literature




The current standard of practice when faced with a clinical dilemma is to make decisions based on evidence. This is broadly defined as decision-making that incorporates the best available research, together with clinical judgement, while also considering the preferences of the patient, or of his or her parents. Thus, evidence based on research is brought to bear on individual and specific circumstances. Evidence based on research implies the use of data or measurements acquired in order to test a hypothesis concerning an area of uncertainty. The data themselves may have been collected initially for reasons other than research, as is the case with data acquired from administrative, medical record, or observational sources, or specifically to address a research question, as with prospective studies or clinical trials. The circumstances under which the data were collected, including their purpose, the rigour of measurement, and the design of the study, play an important role in determining the degree to which any conclusions or extrapolations from those data are valid, or reflect the truth, in other words, whether they are free from systematic error and bias, are reliable, being free from random error, and apply to the clinical situation at hand. Data must be clearly described, relationships must be accurately and reliably determined, and the report must have sufficient detail to allow complete critical appraisal. The evaluation of presented data and analyses is a key skill for the contemporary clinician faced with increasingly complex clinical situations and an exponentially expanding body of available literature.


MEASUREMENT AND DESCRIPTION OF DATA


Level of Measurement


Data are specific pieces of information that are defined by their level of measurement and their relationship to other data. Those pieces of information are often referred to as variables, since they may take on different values. The type of values that a variable may assume determines its level of measurement. The level of measurement, in turn, determines how the values for a given variable are to be described, and how associations between variables are to be assessed and related to random error or variation. Levels of measurement can be categorical, ordinal, ratio, and interval.


Categorical variables are those for which the values fall into discrete and mutually exclusive categories. The value for a given individual or measurement can be applied to only one category. The relationship between the different categories reflects a qualitative difference. Variables with only two categories are referred to as being dichotomous or binary. Examples of dichotomous categories include yes or no, present or absent, male as opposed to female, and right versus left. Examples of variables with more than two categories include the type of connection used to create the Fontan circulation, the position of a ventricular septal defect, and medications used.


When presenting measurements related to a categorical variable, the frequency can be given either as the absolute number of measurements within that category, or as a proportion or percentage of the measurements from all of the categories. It is important to be clear about the denominator when giving percentages, and not to provide decimal places for percentages unless they have meaning or importance. Giving the denominator is also important because the values for some individual measurements may be missing, either because they were not measured or because they are not available, or may be not applicable, in that the variable does not apply to that individual, such as the presence of amenorrhoea in male subjects. The denominator should represent only those with non-missing and applicable values. The number of missing and inapplicable values should be reported separately. Alternatively, the number of non-missing and applicable values should be given. For example, ‘Of 38 patients converted to the Fontan circulation, 21 (55%) were male, and chromosomal abnormalities were present in 4 (13%) of the 31 subjects tested’. For some categorical variables, an individual may have multiple values that place him or her in more than one category. A patient may be taking more than one medication. For this situation, there are two approaches. First, a category for that variable could reflect a specific combination of categories. For example, ‘At most recent follow-up, 19 (50%) of the 38 patients with the Fontan circulation were taking anticoagulant medication, including warfarin alone in 10, aspirin alone in 8, and both in one patient’. Second, the variable could be split into multiple categorical variables reflecting each category. In this example, the sentence would read, ‘Warfarin was used in 11 patients and aspirin in 9 patients, with one patient taking both’.
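

As a small illustration of keeping the denominator straight, the following is a minimal sketch in Python using pandas; the data and variable names are invented for the example and do not come from any real series.

```python
import numpy as np
import pandas as pd

# Hypothetical karyotype results for 38 patients: 7 were never tested,
# and the untested values are recorded as missing (NaN).
karyotype = pd.Series(["abnormal"] * 4 + ["normal"] * 27 + [np.nan] * 7)

tested = karyotype.dropna()
print(f"tested in {len(tested)} of {len(karyotype)} patients; "
      f"not tested in {karyotype.isna().sum()}")

# Percentages are calculated only among those with non-missing, applicable values.
print((100 * tested.value_counts(normalize=True)).round())
```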


A specific type of dichotomous categorical variable is the occurrence of a discrete event, such as the performance of an intervention, or death. Events are almost always associated with an interval reflecting the period of time over which the risk exists, and that period of time is an important aspect of the variable. This can be presented as the number of patients experiencing the particular event during a specified period, expressed as a proportion of the patients at risk for that event. For example, ‘There were 5 (13%) deaths within 30 days of surgical completion of the Fontan circulation in 38 patients’. A frequent problem, however, is that not all individuals will have the same period of risk, given that some patients will be lost to follow-up, or something will happen, usually another event, which will cause them no longer to be at risk. The interval of risk for these individuals is said to be censored. Thus, the numerator, that is, the number of individuals free of the event, and the denominator, namely, the number of individuals remaining at risk for the event, both decrease over time. For this situation, techniques for analysing survival, or time-to-event, data are applicable to describe the data, such as the estimates established by Kaplan and Meier to construct survival curves.
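

To show how censoring enters the calculation, the following is a minimal sketch of the Kaplan-Meier product-limit estimate in Python; the follow-up times and event indicators are entirely hypothetical and are not taken from any study described in the text.

```python
import numpy as np

# Hypothetical follow-up after an operation: time in months, with an event
# indicator (1 = death observed, 0 = censored at last follow-up).
time  = np.array([1, 3, 3, 6, 8, 12, 15, 20, 24, 30])
event = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

# Product-limit (Kaplan-Meier) estimate: at each distinct event time, multiply
# the running survival by (1 - deaths / number still at risk at that time).
order = np.argsort(time)
time, event = time[order], event[order]
survival = 1.0
for t in np.unique(time[event == 1]):
    at_risk = np.sum(time >= t)                  # patients still under observation at t
    deaths = np.sum((time == t) & (event == 1))  # events occurring exactly at t
    survival *= 1 - deaths / at_risk
    print(f"t={t:>2} months  at risk={at_risk}  S(t)={survival:.2f}")
```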


Ordinal variables reflect a specific type of categorical level of measurement whereby the categories can be ordered, such that we know that one category is more or less than another category. The exact magnitude by which one category differs from another category, however, is not known, and remains only semi-quantitative. An example would be the subjective grading of valvar regurgitation or ventricular function from echocardiography. We would know that mild regurgitation across the mitral valve is less than moderate, but more than trivial. We do not know the exact magnitude by which the so-called trivial variant differs from that labeled mild, or how the mild variant differs from the moderate grade. The categories are discrete and ordered, and the values would be presented in a manner similar to other categorical variables, that is, as frequencies, proportions, and percentages. A common mistake with ordinal variables is to assign or code them with a numerical grade, and then to present or analyse them as if they were continuous variables, erroneously implying that an equal distance or magnitude exists between each numerical category.


Quantitative or continuous variables are those for which the difference between two values reflects a quantifiable amount. In addition, a constant difference between two values represents the same amount across the spectrum of possible values. For some continuous variables, there is a defined zero, a point at which the quantity reaches nil, and there cannot be a negative value. These are referred to as ratio variables. Ratio variables are common, and examples include height, weight, age, ventricular ejection fraction, and blood pressure. We know the exact meaning of the absolute difference between two values regardless of those values. The difference between 120 and 100 cm in height is the same 20 cm as the difference between 80 and 60 cm in height. With a ratio variable, the distance from the defined zero also allows ratios to have meaning. For example, 120 cm is 1.2 times as great as, or 20% greater than, 100 cm. For some variables, in contrast, there is no defined zero or point of absence. These are referred to as interval variables. The ratios between their values have no quantitative meaning. For example, consider changes in ejection fraction in response to a medication. Some individuals will show a worsened ejection fraction and have a negative value, others will stay the same, while still others might improve and exhibit a positive value. The value recorded reflects the magnitude of the change.


Since continuous variables can usually take on an almost infinite variety of values, reporting frequencies of individual values is meaningless unless specific cut-points are used, or the values are grouped. Such categorisation of a continuous variable is rarely justifiable, however, since it oversimplifies the meaning and the presentation of the data, and diminishes statistical power. Continuous variables tend to take on a distribution, and the standard is to present some measure of the centre of the values, along with the magnitude and pattern of their variation. The first step is to look at a frequency plot of the distribution of values. In general, some sort of bell-shaped curve will be observed, with a central hump tapering to the two sides. If the distribution is bell-shaped, we refer to it as being normally distributed. This implies that the centre and the variance have specific definable properties or parameters. The measure of the centre would be the mean or average value, calculated as the sum of all values divided by the number of values. In a perfectly normal distribution, half of the values lie below the mean, and half lie above it. The measure of variation would be the standard deviation, calculated as the square root of the variance, which is itself the sum of the squared differences between each value and the mean divided by the number of values, or, for a sample, by one less than that number. The standard deviation has specific properties in defining the proportion of values represented, for example, approximately 95% of values in a normal distribution lie within 1.96 standard deviations of the mean, and these properties underpin the probability theory on which statistical or inferential testing is based.
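

Written as formulas, for n measured values x1, x2, ..., xn, the mean and the sample standard deviation described above are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}}
```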


The normal curve, however, can be subject to distortion. If the central hump is abnormally peaked or flat, this is referred to as kurtosis. If the tails or the sides of the distribution are unequal, then there is skewness. Important kurtosis or skewness can cause the distribution to depart from normality, and the standard parameters and characteristics of mean and standard deviation then no longer apply. In this case, measures of the centre might be chosen which reflect the ranking of values, and not their interval magnitude. In ranking all of the values, the median value would be the measured value at the 50th percentile. For skewed data, the greater the amount of skewness, the greater the difference between the median value and the calculated mean. Less frequently, the most common value, referred to as the mode, is taken as the measure of the centre. Measures of variation include values at specific percentiles, such as the quartile values, presented as the measured values at the 25th and 75th percentiles, with the interquartile range presented as the difference between these two values. Alternatively, the measured values at the 5th and 95th percentiles might be presented, or the minimum and maximum values. Since these summaries do not depend on the distribution being normal, they are often referred to as non-parametric measures of description. Non-parametric statistical tests tend to be less powerful than their parametric counterparts, although they are more robust to departures from normality. Hence, mathematical transformations of the measured values may be performed to create a more normal distribution. This can sometimes be accomplished by calculating and plotting the logarithm or the square root of the measured values, or by applying other types of mathematical transformation. The normalised transformed variable is then used in parametric statistical analyses.
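

The following minimal sketch in Python illustrates non-parametric description and a logarithmic transformation; the values are invented, chosen only to be right-skewed.

```python
import numpy as np

# Hypothetical right-skewed measurements, for example hospital stay in days.
values = np.array([3, 4, 4, 5, 5, 6, 7, 8, 10, 14, 21, 35])

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
print(f"median {median}, quartiles {q1} to {q3}")

# A logarithmic transformation often makes such a distribution more symmetric,
# after which parametric summaries and tests may be applied to the transformed values.
log_values = np.log(values)
print(f"mean of log-values {log_values.mean():.2f}, SD {log_values.std(ddof=1):.2f}")
```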


Validity, Accuracy, and Reliability


In addition to level of measurement, variables have properties reflecting the impact of how the measurements were made or determined. These properties include validity, accuracy, and reliability. Validity is the property by which the measurement used is a true reflection of the desired concept. It answers the question, ‘Am I really measuring what I think I am measuring?’ Accuracy is the degree to which the measurement comes close to the truth in the subjects being measured. It is a property of the measurement tool itself, and often reflects systematic bias or error. Reliability is the degree of variation in the measurement that is attributable to application of the tool, and may reflect both systematic and random error. For example, if the purpose of my measurement is analogous to shooting a gun at a target, validity is the degree to which I am aiming at the correct target, accuracy is the degree to which I come close to the bull’s-eye, and reliability reflects how closely multiple hits are clustered together. If the hits are tightly clustered together, but always to the right of the bull’s-eye, then this may reflect a problem with the sight of the gun. The error is systematic, or reflective of a component of the system. If the hits are at, or close to, the bull’s-eye, but widely scattered, then this may reflect some twitchiness or poor eyesight on the part of the shooter, or subtle variations in the bullets that influence their trajectory. The error may be both random and systematic. When using or interpreting a measurement, one must be aware of these properties of validity, accuracy, and reliability. This becomes all the more important when attempting to measure more subjective concepts, such as quality of life or cardiac failure.


Validity can be an elusive entity to achieve when the concept or phenomenon being measured is more qualitative and subjective. Rather than a gold standard or criterion, it is usual to begin with a definition, and then attempt to develop measurable aspects that reflect that definition. If we take aortic regurgitation as an example, a subjective grading is often applied when performing echocardiographic assessment, characterised by ordinal categories of none, trivial or trace, mild, moderate, or severe, with some intermediate categories. The subjective or qualitative grade is meant to reflect the overall impression of the observer, and informally takes into account many aspects related to the concept being pursued, giving more weight to some than to others. If we wished to validate our subjective system of grading, we might start by convening a panel of expert echocardiographers and asking them first to define the concept of aortic regurgitation. There may be agreement that it has something to do with the volume of blood that re-enters the left ventricle via the aortic valve during diastole. Some may argue, however, that this would represent the total volume adjusted for the size of the patient. Others may argue that it would represent the proportion or percentage of the forward stroke volume ejected through the aortic valve. The panel of experts may agree that there is no method of echocardiography permitting quantification of this volume, but that indirect measures may exist against which the subjective grade might be compared. It is agreed that no single indirect measure will suffice, and that multiple aspects may need to be considered simultaneously. These indirect measures are chosen because they have validity of content, meaning that they are judged to be related to specific aspects reflecting aortic regurgitation, and validity of construction, or construct validity, meaning that they are judged to have a plausible causal or physiological reason for having a relationship to aortic regurgitation. The panel may choose measures that reflect the state of the aortic valve in producing regurgitation, such as the width of the regurgitant jet relative to the width of the aortic outflow tract as measured using cross sectional and colour Doppler interrogation. They may choose measures that might reflect the volume of regurgitant blood, such as measurement of areas from colour Doppler mapping of flow, or even less direct measures, such as pressure half-time intervals or patterns of reversal of flow in the aorta as acquired using Doppler interrogation. They may choose measures that reflect the impact of aortic regurgitation on the ventricle, such as end-systolic and end-diastolic ventricular dimensions and volumes, as well as functional indexes such as shortening and ejection fraction. Alternatively, they may seek to measure this volume using other methods, such as with magnetic resonance imaging, or by creating an experimental model system. This process is aimed at criterion-related validity, or the degree to which the proposed measure relates to accepted existing measures. They may also seek to assess how the subjective grade relates to clinical or outcome measures, known as predictive validity. Subjective grading may be shown to relate to clinical symptoms, abnormalities of exercise capacity, ventricular dysfunction, or arrhythmias and sudden death, or to need for repair or replacement of the aortic valve.


If subjective grading of aortic regurgitation is judged, based on content, construct, criterion-related, and predictive validity, to be a valid measure of aortic regurgitation as it was conceived, then additional assessment is required regarding accuracy and reliability. Accuracy is a reflection of validity, but also includes any systematic error or bias in making the measurement. Systematic error refers to variations in the measurements that occur predominantly in one direction. In other words, the deviation from the criterion or reference standard tends to be consistent. This may occur at the level of the tools or instruments used to make the assessment, and may represent lack of calibration. Regarding aortic regurgitation, this might reflect differences in the settings of gain or the frequency of the probe used when the assessment was made. It may occur at the level of the observer, whereby the observer has a consistent bias in making the interpretation. Regarding aortic regurgitation, this might reflect the fact that some observers do not report physiological amounts of regurgitation, assigning a subjective grade of none, or might not use intermediate grades. Alternatively, some observers may place more weight on a specific aspect when assigning a grade, such as the areas revealed by colour Doppler interrogation. Systematic error may also occur at the level of the subject, whereby a condition exists relative to the subject which biases the measurement. In terms of aortic regurgitation, this might reflect differing physiological conditions present in some subjects, such as concomitant mitral valvar regurgitation or aortic valvar stenosis.


Reliability, or precision, refers to the reproducibility of the measurement under a variety of circumstances, and relates more to random error or bias. It is the degree to which the same value is obtained when the measurement is made under the same conditions. Specific aspects of reliability should be assessed. If we again consider aortic regurgitation, we would want to know that, if two observers assess the same subject using the same methodology, they would report the same subjective grade. This is referred to as inter-observer variability. We would also want to know that, if an observer repeats the assessment in the same subject using the same methodology, he or she would report the same subjective grade. This is referred to as intra-observer variability. Both of these sources of variability should be determined using measures of agreement, and not correlations or associations. Some of the random variation in measurements may be attributed to the instruments, such as variations in settings, differing equipment, or different algorithms or processing. Some of the random variation may also relate to the subject, such as variations in physiological state, medications, or the position of the subject when the assessment was made.
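

As an illustration of a measure of agreement, the following minimal sketch computes Cohen's kappa for two observers using scikit-learn; the grades shown are invented for the example, and kappa is only one of several possible agreement statistics.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical subjective grades of aortic regurgitation assigned independently
# by two observers to the same 10 echocardiograms.
observer_1 = ["none", "mild", "mild", "moderate", "mild",
              "severe", "none", "moderate", "mild", "moderate"]
observer_2 = ["none", "mild", "moderate", "moderate", "mild",
              "severe", "mild", "moderate", "mild", "mild"]

# Cohen's kappa expresses agreement beyond that expected by chance alone;
# simple percentage agreement, or a correlation coefficient, would overstate it.
kappa = cohen_kappa_score(observer_1, observer_2)
print(f"inter-observer agreement, kappa = {kappa:.2f}")
```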


Ideally, the assessment should be as accurate and reliable as possible. This would minimise variation in measurements related to systematic and random error or bias, and would improve statistical power, minimising the necessary number of subjects to be studied, or observations to be made. This can be optimised by standardising as much of the assessment as possible. Training sessions for observers can be designed whereby criterions and skills for making the assessment and interpreting the results are reviewed and practised, so that they are applied in a uniform manner. Limiting the number of observers, and having independent adjudication, also improves reliability, particularly if the review is blinded. Having protocols in place that define and standardise all aspects of the assessment also improves reliability. Such protocols would specify the equipment, calibration, and settings to ensure quality control. They would also specify the conditions under which the assessment should take place, standardising the setting and the preparation of the subject. Taking repeated assessments, and pooling the results within subjects, also reduces random error.


There are some additional features of measurement that may need to be considered. Sensitivity refers to the degree to which the value of the assessment can detect differences. If two patients with clinically important differences in the volume of aortic regurgitation are both subjectively graded as moderate, then such subjective grading might not be sufficiently sensitive. An aspect of sensitivity is responsiveness, or the degree to which the measurement reflects change related to time or an intervention. If a medication results in a clinically important reduction in the volume of aortic regurgitation, but the subjective grade remains moderate, then this means of grading might not be a sufficiently responsive measure. Measurements should also be specific, being a measurement only of the entity of interest as it was defined. Measures of ventricular function, such as shortening and ejection fraction, may not be very specific indirect measures of aortic regurgitation, and may be more influenced by many other factors. The measurement should also have a sufficient distribution of responses. If the object is to study factors associated with aortic regurgitation after repair of subaortic stenosis, yet the population studied has mainly grades of none to mild, then such subjective grading may not be a good measure to use for the assessment of aortic regurgitation. When many measures that assess different or similar aspects of an entity are available, completeness and appropriateness should be balanced against efficiency or parsimony. In a study of aortic regurgitation, all of the indirect measures could be assessed, together with subjective echocardiographic grading and some novel measures, which might include magnetic resonance imaging. This might provide the most complete assessment, but would be intensive in terms of time and resources, and might limit the recruitment of suitable subjects, as well as creating logistical issues regarding operationalisation, standardisation, and adjudication. It might also lead to the inclusion of measurements that are inappropriate to the aim of the study. It would be very complex to handle such a large number of different measurements with different properties in an analysis aimed at addressing the question at hand. It would be preferable to focus on those measures which give the highest level of measurement, in other words a continuous value, and which are valid, accurate, reliable, specific, sensitive or responsive, and parsimonious.




ANALYSIS OF DATA


Analysis is the method by which data or measurements are used to answer questions and generate knowledge, and then to assess the confidence in inferring those findings beyond the subjects that were studied. Analysis is often a secondary consideration when designing and implementing a study, and not considered until after the measurements have been made and the study is completed. In fact, the plan for analysis of the data is an integral part of the study protocol. It has an important relationship to the research questions being pursued, and a strong influence on the measurements to be made. Also, the chosen form of analysis can be an important tool to increase the validity of observed associations, by providing adjustment for potential sources of bias, such as confounding, or to determine more complex associations, such as interaction. The appropriate planning, strategy, execution and interpretation are essential elements to the critical appraisal of any research report.


What Is the Plan?


Research Question


Every study must begin with a well-defined question. The drafting of this question is the first step towards creating a research protocol. It is common to start with a topic of interest, or sometimes a clinical dilemma, and, through consultation with experts, review of the published literature, and other sources of information, to define a discrete area of controversy to be addressed. Keeping track of, and documenting, this process often provides the background section of a protocol, the rationale for the study, and the introduction of a subsequent manuscript. Often, constant revision, clarification, and specification of the question are required as part of this process. New investigators tend to be overly ambitious in their scope. The process of defining the research question often incorporates considerations of feasibility, relevance, novelty, and interest to the investigators. It is an iterative process that is not complete until a well-defined question, drafted in the form of a question, is finalised. It is not sufficient to have only objectives, aims, or purposes.


A well-defined question often suggests the design of the study, the population to be studied, the measurements to be made, and the plan for analysis of the data. It also determines whether the study is descriptive or comparative. For example, in considering the topic of hypertrophic cardiomyopathy, a descriptive research question might be, ‘What are the outcomes of hypertrophic cardiomyopathy?’ In appraising this question, its overt vagueness would be noted. The first step would be to determine what answers are already known regarding this question, and what areas of controversy remain. The investigator might look at the question and the background review and ask some of the following: ‘What outcomes do I wish to study?’, ‘How will I define hypertrophic cardiomyopathy, and in what subjects?’, ‘At what time point, or over what period, do I wish to examine these outcomes?’ In answering these questions, the research question is revised and further specified to ‘What is the subsequent risk of sudden death for children with familial hypertrophic cardiomyopathy presenting to a specialised clinic?’ This question suggests an observational study of a chosen cohort. Depending on feasibility for the investigators, the study may be conducted non-concurrently from existing medical record information, or concurrently by enrolling subjects in the present and following them forward in time. The question also suggests the population of interest, and from this the investigators would develop criterions for inclusion and exclusion, with definitions, and decide how the subjects would be identified from the specialised clinic. The measurements are also suggested, in that baseline characteristics would need to be obtained to apply the criterions, and to describe the characteristics of the included subjects. The definitions and sources of data would need to be specified for the detection of the events of interest. The plan for analysis is also suggested: since the question is descriptive, the analysis would need to describe the proportion of subjects experiencing sudden death over a defined period of time, or, more likely, a time-to-event analysis would be performed. While the primary question here is descriptive, the investigators would be missing an important opportunity if they did not have secondary questions that explored the impact of treatments or risk factors on the risk of sudden death. It is rarely justified, or necessary, to perform a study that is only descriptive in nature. A well-defined and focused research question, nonetheless, is essential before considering other aspects of the proposed study or report.


Using Variables to Answer Questions


When the question is established, the next step in generating a plan for analysis is to select and define variables. What information will be needed to answer the defined question? In the process of defining variables, consideration should be given to how they will be measured. Specifically, definitions should be established, sources of data determined, and issues of validity and reliability of measurement considered.


Classification


Variables can be classified for statistical purposes as either dependent or independent. Dependent variables are generally the outcomes of interest, and either change in response to an intervention or are influenced by or associated with factors. Independent variables are those which influence or are associated with the dependent variable. A detailed consideration of the question should clearly identify the key dependent and independent variables. Independent variables can be further classified. Some independent variables have a direct causal or predictive relationship with the dependent variables, and are the variables in which we are most interested. Some independent variables have a direct influence on the dependent variables, but influence each other in the nature of that relationship. This is referred to as interaction. Some independent variables also have an indirect relationship with the dependent variables, primarily by their association with other independent variables that have a more direct association. This is referred to as confounding. Some dependent or outcome variables can become independent variables in their association with subsequent outcomes or dependent variables. An important step in creating a plan for analysis is to classify the available variables prior to considering how they will be described and related to one another in order to address the research question.


Outcomes or Dependent Variables


In any study, there are usually one or two primary outcomes of interest. There are often additional secondary outcomes, which are usually included to support the analysis based on the primary outcomes. If analysis of the primary outcomes is negative, conclusions from a study must be negative, regardless of the findings for the secondary outcomes. Analysis of secondary outcomes is also used for exploring or generating additional hypotheses, and should not represent a definitive analysis. It should be recognised that the greater the number of outcomes examined in a study, the more likely it is that statistical significance will be reached on at least one of them. This is the challenge of multiple comparisons. With advances in the management of many diseases and conditions, some outcomes have become increasingly rare, and many studies group outcomes into composite outcomes. An example might be creating a composite outcome of death, myocardial infarction, or stroke in a clinical trial of cardiovascular risk factor reduction. The appropriateness of composite outcomes is questionable, and issues have been raised about their validity.1 First, not all possible outcomes that might be included in a composite outcome have the same importance for subjects. Second, the creation of a composite outcome might obscure differences between the individual outcomes. The risk for the component outcomes may differ, as may the impact of therapy or associated variables. Specific outcomes should be favoured over composite outcomes when feasible and relevant.


Outcome variables can differ in their nature, which influences how they should be handled in an analysis plan. Some outcomes are discrete, being either present or absent, or classified into nominal categories. Some outcomes can be ordinal or continuous in nature, indicating a degree or magnitude. Some outcomes are discrete events that have a relationship with time. These outcome events may be repeated or recurrent, and subjects may be simultaneously at risk for different and mutually exclusive outcome events that may compete. Some outcomes may evolve over time, and thus have a longitudinal dimension or trend over time. Each of these features needs to be considered, as specific methods of data and statistical analysis need to be applied.


Independent Variables


Independent variables are variables for which an association with the dependent variables is sought. These variables can include subject characteristics and interventions, but sometimes also outcomes that may be predictive of, or causal for, other subsequent outcomes. The research question should define the primary independent variable, which is commonly a specific treatment or a key subject characteristic. Independent variables can be grouped according to the proposed nature of their relationship with the dependent variable, which is described in more detail in a subsequent section. We usually have some independent variables in which we are particularly interested, namely whether or not they have a direct predictive or causal relationship with the dependent variable. We also have some independent variables in which we are less interested, but which may act as potential confounders.


Planning the Description


The first step in the plan for analysis is to define how all of the variables are to be described. The methods by which this is done have been covered in a previous section. In describing data, the results or values of each individual variable are summarised. Description is important in detailing the characteristics of the subjects to be studied, usually at baseline, and will allow the reader of a subsequent report to determine if the results of the study might be applicable to their own clinical practice. Description of the data also can give the details of what interventions were performed, and what outcomes occurred. Additionally, we can describe data that were collected in order to establish compliance with the study protocol. This might include compliance with therapy, co-interventions, cross-over from assigned treatment, duration of follow-up and completeness of follow-up. Description is also used to determine issues that might have an impact on statistical testing, such as extreme values or outliers, missing values, categories with too few values, and skewed distributions.


Planning to Establish Relationships and Associations


This step in the plan is guided by the research question. The aim of most research questions will fall into one of three main categories. Some studies aim to answer questions about the effect of a treatment, either alone or in comparison with either no treatment or an alternative treatment. In this case, the treatment is the primary independent or predictor variable, and there can be many different outcome or dependent variables. If the study is comparative, then the baseline characteristics of the subjects in each group should be compared. We also need to compare any additional management that might have occurred. If the treatment allocation was randomised, then comparison of baseline characteristics assesses whether randomisation achieved equivalence between the groups. If the treatment allocation was not randomised, the comparison of baseline characteristics might highlight important or relevant differences between groups that could potentially confound comparisons of outcomes, and for which statistical methods of adjustment should be used.


The second category includes those research questions aimed at determining the prognosis of subjects with a common characteristic or after a specific intervention, the risk of a particular adverse outcome, or factors that might be predictive or associated with prognosis or risk, particularly those of a potential causal nature. The dependent variable is the outcome variable of interest, and associations are explored with multiple independent variables that may include subject characteristics, interventions and other outcomes. This type of analysis plan should take into account independent variables that may serve as potential confounders, and that have a direct relationship to the outcome, but are also associated with the independent variable of interest for which we would like to infer a direct relationship with the outcome.


The third category includes those questions aimed at identifying factors that may discriminate or differentiate different groups of subjects. Some of these questions may be aimed at evaluating a diagnostic test. In this case the dependent variable is the group status of the subjects, usually as defined by application of a criterion standard, and the primary independent variable is the result of the diagnostic test. Some questions may be aimed at contrasting characteristics between two or more defined groups of subjects. Case-control studies are a classic example. In this case the primary dependent variable is the characteristic by which the cases and controls were defined, and multiple independent variables including subject characteristics, interventions and outcomes are contrasted.


Types of Relationships Between Variables


An important defining feature of variables, in addition to the level of measurement and measurement properties, is the relationship between them. Variables can be classified in terms of their relationship to one another. Independent variables are those variables which incite, influence or are associated with a response. The response variable is termed the dependent or outcome variable. Independent variables may have a direct association with the dependent variable, either through prediction or causality, or they may have an indirect or confounding association.


Causality is a desired feature of the associations we wish to determine. The aim of many studies is to determine relationships in which cause leads to effect. The nature of associations, and the features of study designs that help to give confidence that a discovered association is one of cause and effect, will be described later in this chapter, but they include evidence of a correct temporal relationship, a strong dose-response relationship, freedom from bias, consistency, and biological or pathophysiological plausibility. When we cannot be certain that the temporal relationship is correct, that is, that the potential causal factor was present first and led to the subsequent development of the effect or outcome, we cannot exclude that the relationship between the two variables is actually reversed. In this case, the putative causal factor may actually be caused by the assumed effect or outcome. For example, in a cross sectional study of treatment with β-blockers, we note that the use of β-blockers and the presence of palpitations are significantly associated. We might erroneously conclude that β-blockers are causal of palpitations, when in fact arrhythmia causing palpitations has led to treatment with β-blockers.


Confounding occurs when an independent variable exerts its influence or association with the dependent variable primarily through its relationship with a further independent variable that is more directly related to the dependent variable. The identification of, and adjustment for, confounding can be a challenge in the analysis and interpretation of observational or non-experimental study data. Confounding is most likely to occur when independent variables are highly related or correlated with one another, which is referred to as colinearity. For example, a hypothetical study shows an association between increased use of systemic anticoagulation and increased risk of death after the Fontan procedure. Consideration is given to recommending against the use of routine anticoagulation. Further analysis, however, may reveal that systemic anticoagulation was used predominantly in those patients with poor ventricular function. Poor ventricular function is then found to be causally and strongly related to mortality, and the association of anticoagulation with mortality is judged to be confounded, and hence indirect, because of its increased prevalence of use in patients with poor ventricular function. Stratified or multi-variable analyses are often used to explore, detect, and adjust for confounding, and to determine relationships between variables that are independent of other variables.
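

The following is a minimal sketch of how such an adjustment might look in practice, using simulated data and the logistic regression routines in statsmodels; the variable names and numbers are invented to mimic the hypothetical scenario above and do not come from any real Fontan series.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Simulated cohort: poor ventricular function drives both the decision to
# anticoagulate and the risk of death, so the crude association between
# anticoagulation and death is confounded.
poor_function = rng.binomial(1, 0.3, n)
anticoag = rng.binomial(1, np.where(poor_function == 1, 0.8, 0.2))
death = rng.binomial(1, np.where(poor_function == 1, 0.30, 0.05))

df = pd.DataFrame({"death": death, "anticoag": anticoag, "poor_function": poor_function})

crude = smf.logit("death ~ anticoag", data=df).fit(disp=0)
adjusted = smf.logit("death ~ anticoag + poor_function", data=df).fit(disp=0)

print(np.exp(crude.params["anticoag"]))     # crude odds ratio, confounded by ventricular function
print(np.exp(adjusted.params["anticoag"]))  # odds ratio after adjustment for ventricular function
```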


Interaction is a particular type of relationship between two or more independent variables and a dependent variable, whereby the relationship between one independent variable and the dependent variable is influenced or modified by an additional independent variable. For example, in our hypothetical study, further analysis shows that the relationship between systemic anticoagulation and mortality is more complex. For patients with poor ventricular function who are treated with systemic anticoagulation, mortality is less than for those not treated. For patients without poor ventricular function, there is no difference in mortality between those treated and those not treated with systemic anticoagulation. Thus, there is a significant interaction between systemic anticoagulation and poor ventricular function, demonstrated by the differential association of anticoagulation with mortality according to the presence or absence of poor ventricular function. In order to be certain that the anticoagulation was the true cause of the reduction in mortality, and that the reduction was not attributable to some other unmeasured confounding factor, a properly designed and executed randomised clinical trial would be required.
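

In the same spirit, a hedged sketch of how an interaction term can be examined with statsmodels; the simulated data reproduce the hypothetical pattern described above, in which anticoagulation lowers mortality only in patients with poor ventricular function.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400

poor_function = rng.binomial(1, 0.4, n)
anticoag = rng.binomial(1, 0.5, n)

# Risk of death: low at baseline, raised by poor function, and lowered by
# anticoagulation only when poor function is present.
p_death = 0.05 + 0.25 * poor_function - 0.15 * poor_function * anticoag
death = rng.binomial(1, p_death)

df = pd.DataFrame({"death": death, "anticoag": anticoag, "poor_function": poor_function})

model = smf.logit("death ~ anticoag * poor_function", data=df).fit(disp=0)

# The anticoag:poor_function coefficient is the interaction term.
print(model.params["anticoag:poor_function"], model.pvalues["anticoag:poor_function"])
```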


Statistical Analysis—Detecting Associations and Relationships with Confidence


Opinions about the extent of the statistical knowledge of physicians vary widely. On the one hand, conducting or appraising clinical research sometimes requires knowledge of increasingly complicated statistical methodology. On the other hand, little time is available in medical school to address appropriately statistics and their use and applicability in clinical practice. A recent survey of 171 articles published in the journal Pediatrics found that nine-tenths of published studies used some sort of inferential statistics. The same study found that a reader who understands descriptive statistics, proportions, risk analysis, logistic regression, t tests, non-parametric statistics, analysis of variance, multiple linear regression, and correlations could still only fully understand the analysis in just under half of the articles.2 Despite disagreements on the extent of statistical education in clinical training, the increasing complexity of statistics reported in the medical literature clearly calls for at least some statistical literacy among practising clinicians and clinical researchers.


Principles of Probability and Probabilistic Distribution—The Science of Statistics


While conducting a census of every citizen of a given country, you find that the proportion of women in the population is exactly 52%, and that their average height is 152 cm. You select a random sample of 100 people from this same population and, to your surprise, 55% of your sample is composed of women, and their average height is 149 cm. Subsequently, you decide to select a second random sample of 100 people. This time, 47% of your sample is women, and the average height is 158 cm. The phenomenon at play here is called random effect, random error, or random variation. Each individual sample taken randomly from a larger population will have an uncertain distribution of characteristics. The distribution of characteristics in each sample, nonetheless, is governed by the probability of those characteristics in the entire population, which has a specific distribution of features known as parameters.


If you flip a fair coin, the probability of the coin landing on heads is exactly 50%. If you flip that coin 100 times you might get 50 heads, as you would expect, but owing to random effect you may get 49 or 52 heads, and on occasion 45 or 55. If you repeat this exercise numerous times, you will get exactly 50 heads only occasionally, will frequently get somewhere between 48 and 52 heads, and in roughly three out of every four tries will get between 45 and 55 heads. The outcome of each series of 100 tosses is random, but remains a function of the true probability of getting a head with each coin toss. If we were to plot the frequency of heads in our many samples, we would see a bell-shaped curve centred at or very close to 50 that slopes away to the right and the left, with more extreme values occurring less frequently in the tails. This curve is called a probability distribution. It has specific properties of variation that allow us to predict how frequently a given outcome will be observed in an infinite number of random samples.
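

This behaviour is easy to reproduce by simulation; the following minimal sketch in Python repeats the series of 100 tosses 10,000 times and tabulates how often the count of heads falls within the ranges mentioned above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 10,000 repetitions of tossing a fair coin 100 times and count the
# heads in each repetition: the counts cluster around 50 and thin out in the tails.
heads = rng.binomial(n=100, p=0.5, size=10_000)

print(f"exactly 50 heads: {np.mean(heads == 50):.2f} of repetitions")
print(f"48 to 52 heads:   {np.mean((heads >= 48) & (heads <= 52)):.2f} of repetitions")
print(f"45 to 55 heads:   {np.mean((heads >= 45) & (heads <= 55)):.2f} of repetitions")
```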


Inference Based on Samples from Random Distributions


When we measure something in a research study, we may find that the values from our study subjects differ from what we might note in a normal or an alternative population. We want to know whether our findings represent a true deviation from normal, or whether they are just due to chance or random effect. Inferential statistics uses probability distributions to tell us how likely our observation would be, given that our subjects come from a larger population which forms the basis for the probability distribution. We can never know for certain whether our subjects truly deviate from the norm, but we can infer the probability of our observation from the probability distribution. If that probability is low, we can feel confident in our conclusion. In general, if we can be 95% certain, accepting a 5% chance that our observation is really no different from normal, or the centre of the probability distribution, then we state that our results are significant. The probability that the observed result may be due to chance alone is the P-value of inferential statistics.


When a random sample is selected from a population, differences between the sample and the population are due to random effect, or random error. In our census, although the average height was 152 cm, this does not mean that everyone in the population was 152 cm tall. Some were 130 cm in height, and others were 170 cm. Hence, if you select a random sample of 100 people, some will be taller and some shorter. By chance alone, a specific sample might contain fewer shorter people or fewer taller people, and so the average height for that sample might be lower or higher than the population average by chance alone. The key point to remember is that, as long as the sample was randomly chosen, the mean of your sample should be close to the mean of the population. The larger your sample, the more precise the measurement, and the closer you are to the true mean of 152 cm. This is because each value contributes less to the total and, as such, extreme values have less of an effect. The same is true of the example of coin tossing. Out of 10 tosses, a result of heads for 90% or 10% of tosses is rare but not impossible. If the coin toss is repeated 1000 times, the results will almost certainly be much closer to the 50% probability of a single coin toss. The fact that the result differs from that in the population from which the sample was taken does not necessarily imply that the subjects in your sample are inherently different from that population. The difference might just be due to random error.


Consider a situation in which a researcher polls a random sample of 100 paediatric cardiologists regarding their preferred therapy for heart failure, and finds that 72% of the sample prefers drug A over drug B. Since the sample was chosen at random, the researcher decides that it is a reasonable assumption that this group is representative of all paediatric cardiologists. A report is published entitled ‘Drug A is preferred over drug B for the treatment of heart failure in children’. Had all paediatric cardiologists been polled, would 72% of them have chosen drug A over drug B? If another researcher had selected a second random sample of 100 paediatric cardiologists, would 72% of them also have chosen drug A over drug B? The answer in both cases is probably not, but if both samples come from the same population, and were chosen randomly, the results should be close. Consider that a new study is subsequently published reporting that drug B is actually better than drug A. You then poll a new sample of 100 paediatric cardiologists and find that only 44% still prefer drug A. Is the difference between your original sample and your new sample due to random error, or did the publication of the new study have an effect on preferences regarding therapy for children in heart failure? The key to answering this question is to estimate the probability, by chance alone, of obtaining a sample in which 44% of respondents choose drug A when, in fact, 72% of the population from which the sample is drawn actually prefers drug A. In such a situation, inferential statistics can be used to assess the difference between the distribution in the sample and that in the population, and the likelihood or probability that the difference is due to chance or random error.


Relationship between Probability and Inference


A detailed explanation of the exact methods used to determine whether the probability distribution of the sample differs from, or is similar to, that of the overall or target population is beyond the scope of this chapter. Suffice it to say that each type of data, and each specific question, requires specific methodologies and tests, some of which will be briefly introduced later. The common feature of all statistical tests is that they produce a P-value, which represents the probability of observing a difference as large as, or larger than, that seen if the two distributions were in fact the same. Statistical testing takes into account the number of subjects being tested, the observed variation in the data, the magnitude of any differences, and the underlying nature of the probability distribution, and thus the results are influenced by these features. Every statistical test comparing two groups starts with the hypothesis that both groups are equivalent. A two-tailed test assesses the probability that group A differs from group B in either direction, higher or lower, while a one-tailed test assesses the probability that group A is specifically higher, or specifically lower, than group B, but not both. As a general rule, only two-tailed tests should be used in most situations, as they assess the probability of two groups being different without any presumption about the direction of the difference. There is no assumption that A is higher or lower than B, just that the two are different. One-tailed tests assume that the difference observed has a direction, for example higher or lower, but not both. These tests should be used only in specific situations, an example being a non-inferiority trial. By convention, statistical significance is reached when the P-value obtained from the test is under 0.05, meaning that the probability of observing such a difference by chance alone, if the groups were truly equivalent, is lower than 5%. The P-value is an expression of the confidence we might have that the findings are true and not the result of random error. In our previous example of preferred treatment for heart failure, the P-value was less than 0.001 for 44% being different from 72%. This means that, in a population in which 72% of cardiologists prefer drug A, the probability of obtaining a random sample in which only 44% favour drug A is less than 1 in 1000. We can confidently conclude, therefore, that the second sample is truly different from the original sample, and that opinion in the population of paediatric cardiologists has changed.
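

A calculation of this sort can be sketched in a few lines of Python; the example below assumes scipy's exact binomial test as one reasonable choice of test for a single observed proportion against a hypothesised population value.

```python
from scipy.stats import binomtest

# If 72% of the whole population of paediatric cardiologists truly prefers drug A,
# how likely is a random sample of 100 in which only 44 prefer it?
result = binomtest(k=44, n=100, p=0.72)
print(result.pvalue)  # far below 0.001, so chance alone is a very unlikely explanation
```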


Relevance of P-values


Limitations


The threshold of P < 0.05 for defining statistical significance is acknowledged to have been selected somewhat arbitrarily, and is consequently a subject of debate. With this in mind, it is important to consider the implications and meaning of a P-value. As noted above, statistical testing takes into account the number of subjects being tested, the observed variation in the data, the magnitude of any differences, and the underlying nature of the probability distribution, and the results are influenced by all of these features. The magnitude of any difference is only one component, yet many assume that if the P-value is less than 0.05 then the results are important and meaningful. This is not the case, since a P-value is only a measure of confidence. P-values are highly dependent on the number of study subjects or observations. As the size of the sample increases, the precision of the estimate around the true mean increases, and thus the random error decreases. With large enough samples, the random error will be close to zero, and hence very small differences or associations can meet the confidence threshold yet be unimportant. With a sample size of 10,000, a P-value of 0.04 is probably of limited interest, because the large size of the sample will ensure that even very small differences are statistically significant. On the other hand, with very small samples, statistical significance at P < 0.05 is much less likely to be achieved, even with large differences or associations. With small sample sizes, a non-significant P-value implies only that we do not have sufficient confidence in the result, not that an important difference or association is absent. P-values are only a measure of confidence, and the results must always be considered in the light of the magnitude of the observed difference or association, the variation in the data, the number of subjects or observations, and the clinical importance of the observed results.
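

The dependence of the P-value on the size of the sample is easy to demonstrate; the following minimal sketch applies a two-proportion z-test from statsmodels to the same 2% difference at two sample sizes, with figures invented purely for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# The same absolute difference (52% versus 50%) tested at two sample sizes:
# with 100 per group the P-value is far from the 0.05 threshold, whereas with
# 10,000 per group the same trivial 2% difference becomes 'statistically significant'.
for n in (100, 10_000):
    counts = [round(0.52 * n), round(0.50 * n)]
    stat, p = proportions_ztest(count=counts, nobs=[n, n])
    print(f"n per group = {n:>6}, P = {p:.3f}")
```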


Clinical Relevance versus Statistical Significance


A primary consideration regarding statistical analysis in medical research is the difference between statistical significance and clinical relevance. There is little real value in an association that is highly statistically significant but clinically trivial or biologically implausible. Likewise, studies with results that would be clinically important but are not statistically significant are of uncertain value. This is a key concept that is not widely acknowledged. Statistical significance does not necessarily equate to clinical relevance and, similarly, a result that is not statistically significant might still be clinically important. Statistics are a tool to describe, and to uncover evidence regarding, the underlying mechanisms of disease and the impact of therapy, but it is important to know how to translate statistical knowledge into clinical knowledge.


Assessing the clinical relevance of an observed difference or association, and the impact of random effect, can be helped by examining the confidence interval. Confidence intervals are important tools for assessing clinical relevance: they are intrinsically linked to the statistical P-value, but give much more information. The confidence interval is a representation of the underlying probability distribution of the observed result, based on the sample size and the variation in the data. A 95% confidence interval would encompass 95% of the results from similar samples with the same sample size and variation properties, and thus represents the range of values over which we can be reasonably confident that the true result lies. We can also construct a narrower interval, such as a 70% confidence interval, if we are willing to accept a greater chance of random error.


The confidence interval gives us the greatest information regarding the possibility that we have made an error in the interpretation of the results. The confidence interval answers two questions: first, what is the likely difference between groups, and second, what are the possible values this difference might take? For example, the randomised clinical trial testing the effect of drug B for children in cardiac failure found a reduction in mortality with drug B compared to drug A, but the P-value was greater than 0.05, and hence the result did not achieve statistical significance, or a sufficiently high level of confidence. Before we conclude that there was no relative benefit to drug B, we need to examine the result and the confidence interval. The trial randomised 60 patients equally to the two drugs. The mortality after 5 years was 40% with drug A and 20% with drug B, giving a 20% absolute difference in mortality after 5 years with drug B relative to drug A (−20%). We calculate a number needed to treat of 5 patients, meaning that, based on this result, we would need to treat only 5 patients with drug B in order to prevent one death relative to treatment with drug A. The 95% confidence interval, however, ranges from −41% to +3%, and the P-value is 0.09. Based on the P-value, we might mistakenly conclude that there is no benefit to drug B. The 95% confidence interval from this sample suggests that, if the true size of the effect is −20%, 95% of random samples would have a difference in mortality ranging from a 41% reduction in mortality with drug B compared to drug A through to a 3% increase in mortality with drug B compared to drug A. We are 95% confident that the truth lies somewhere between these two points and, since the interval includes the value 0%, we cannot confidently conclude that there is a difference between the two drugs. Neither can we confidently conclude that there is not an important difference between the two drugs, including a benefit as great as a 41% absolute reduction in mortality with drug B. It then becomes difficult to know what to conclude from this study. This situation most commonly arises when we have an insufficient number of study subjects or observations. We can calculate the power of this study, which is the probability of concluding that there was benefit with drug B when in truth a difference really exists. Based on the observed findings and the number of subjects studied, the power would be 0.29, meaning that, if the true benefit were a 20% absolute reduction in mortality with drug B over drug A, a study of this size would detect it with statistical significance only 29% of the time. We might suppose that a 4% difference in mortality would be the smallest effect that we would consider to be clinically relevant and that would prompt us to prefer drug B over drug A, with a number needed to treat of 25. With our observed difference of 20% and our sample size, we would have a power of 0.89 to detect this effect of 4%, meaning that we could conclude that the benefit of drug B is at least a 4% absolute reduction in mortality relative to drug A, with an 89% chance of being correct.
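

For readers who wish to reproduce this kind of calculation, the following is a minimal sketch in Python using statsmodels; it assumes a two-proportion z-test and a Newcombe-type interval for the difference in proportions, so the figures it prints approximate, but may not exactly match, those quoted above.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# The hypothetical trial from the text: 30 patients per arm, 5-year mortality
# 20% (6/30) with drug B and 40% (12/30) with drug A.
deaths = np.array([6, 12])   # drug B, drug A
n = np.array([30, 30])

stat, p = proportions_ztest(count=deaths, nobs=n)
low, upp = confint_proportions_2indep(deaths[0], n[0], deaths[1], n[1], compare="diff")

print(f"absolute difference {deaths[0]/n[0] - deaths[1]/n[1]:+.0%}, P = {p:.2f}")
print(f"95% CI for the difference: {low:+.0%} to {upp:+.0%}")
```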


Type I and Type II Error and Power


A study can reach one of four possible conclusions, of which two are desirable: we may correctly conclude, with confidence, that a difference or association exists, or that it does not. Alternatively, we may incorrectly conclude that a difference or association exists or does not exist, when in truth the opposite is true. Although the P-value is a highly useful statistical indicator, it is an imperfect one, and we cannot rely on it as the sole piece of information on which to base a conclusion. Results drawn from samples and inferred to a target population are susceptible to two types of error: type I error and type II error. A type I error is one in which we conclude that a difference or association of a certain magnitude exists, when in truth it either does not exist or is smaller. From probability distributions, we can estimate the probability of making this type of error, which is referred to as alpha. The P-value is the corresponding probability calculated from the observed data, and is often interpreted as the probability that we have made a type I error. Such errors occur when a P-value is statistically significant, but there is no true difference or association. With a P-value threshold of 0.05, we are accepting a 5% risk that our conclusion is incorrect and that the results are due to chance or random error. In a given study, we may conduct many tests of comparison and association, and each time we are willing to accept a 5% chance of a type I error. The more tests we do, the more likely we are to make a type I error, since, by definition, 5% of our tests may reach the threshold of a P-value less than 0.05 through chance or random error alone. This is the challenge of making multiple tests or comparisons. In order to avoid this error, we can lower our threshold for defining statistical significance, or we can perform adjustments that take into account the number of tests or comparisons being made. Alternatively, we can report only those comparisons or associations that remain significant in multi-variable analyses.
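

A minimal sketch of one such adjustment, using the multipletests helper in the statsmodels Python library; the ten raw P-values are invented purely for illustration.
```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw P-values from ten separate comparisons within one study
raw_p = [0.003, 0.020, 0.049, 0.051, 0.060, 0.150, 0.300, 0.450, 0.700, 0.900]

# Holm (step-down Bonferroni) adjustment keeps the overall type I error near 5%
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for p, p_adj, significant in zip(raw_p, adjusted_p, reject):
    print(f"raw P = {p:.3f}   adjusted P = {p_adj:.3f}   significant: {significant}")
```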


The second error of concern is the so-called type II error. In this situation, we might conclude from the results in our sample that there is no difference or association, when in truth one exists. We can determine the probability of making this type of error, called beta, from the probability distribution. Beta is most strongly influenced by the number of subjects or observations in the study, with greater numbers of subjects giving a lower beta. We can use beta to calculate power, which is 1 minus beta. Power is the probability of concluding from the results in our sample that a difference or association exists when in truth it does. It is a useful calculation to make when the P-value of a result is non-significant and we are not certain whether the negative result reflects a true absence of effect or merely chance and random error. Before we conclude that there is no difference or association, we need to be sure that there was sufficient power to detect reliably an important difference or association. If we can be fairly certain that we are not missing a clinically important difference or association, we can be confident in our conclusion. We calculate the power in order to determine the likelihood that a true difference or association of a given size would have been detected with the number of subjects we studied. As a general rule, a study reporting a negative result should have a power of at least 80%. This means that, had a difference of the specified size truly existed, the study would have detected it at least 80% of the time; we are accepting a 20% chance of making a type II error.
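

As a hedged sketch of how this reasoning is used in planning, the statsmodels power routines can estimate the number of subjects needed per group to detect a specified difference in proportions with 80% power; the 40% and 20% mortality figures are taken from the invented example above.
```python
import math
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Subjects per group needed to detect a fall in 5-year mortality from 40% to 20%
# with 80% power at a two-sided alpha of 0.05
effect = abs(proportion_effectsize(0.20, 0.40))
n_per_group = NormalIndPower().solve_power(effect_size=effect, power=0.80, alpha=0.05)
print(f"Approximately {math.ceil(n_per_group)} subjects required per group")
```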


Applying Statistical Testing


Techniques for analysis of data are used to describe characteristics and to determine associations. Statistical analysis is then used to determine the likelihood that observed differences or associations are due to chance or random error, and gives us confidence in inferring the results from our study sample to the target population. The selection of the statistical test depends on the level of measurement of the dependent and independent variables. It must be remembered that statistical analysis is strictly a mathematical operation based on both facts and assumptions, and that meaning comes from our interpretation of the variables and the results. As such, we must first specify a hypothesis, then apply the appropriate statistical test, and only then interpret the result. We must avoid the situation in which we have a certain result in mind, and then apply different manipulations of the variables and statistical tests until an analysis is achieved that supports that result.


Comparing Two or More Groups or Categories


The most common type of plan for analysis involves the comparison of two or more groups defined by a particular characteristic or treatment. The group assignment represents the independent variable, and we seek to determine differences in characteristics and outcomes, which are the dependent variables. The simplest form of comparison between two groups is the comparison of two dichotomous or binary variables from a two-by-two cross-tabulation table using a chi-square test, which compares the variables based on the probability of obtaining a given distribution of dichotomous outcomes. The chi-square test becomes inaccurate when any cell in the cross-tabulation table has an expected count of fewer than 5 subjects or observations, and a Fisher exact test is then used. The chi-square test can also be applied when there are more than two categories for the dependent or independent variable, or both. Again, if any cell in the cross-tabulation table has an expected count of fewer than 5 subjects or observations, the chi-square test cannot be reliably applied; categories should then be combined in a logical manner until all cells in the table have 5 or more, at which point the chi-square test can be applied. If the categories of the dependent variable are ordinal in nature, then a Mantel-Haenszel chi-square test can be used to test for trend. Chi-square testing contrasts the distribution of values in the cells of a cross-tabulation table that would be expected if the variables were unrelated with the observed distribution. The difference between the observed and expected results is then related to the chi-square probability distribution to determine whether a difference of that size is sufficiently unlikely to have arisen by chance.
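

As an illustrative sketch, both tests are available in the scipy Python library; the two-by-two table below is invented.
```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical two-by-two table: rows are treatment groups, columns are outcome (event, no event)
table = [[12, 18],
         [ 6, 24]]

chi2, p_chi2, dof, expected = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, P = {p_chi2:.3f}")
print("Expected counts:", expected)       # check that no expected count is below 5

# When expected counts are small, the Fisher exact test is preferred
odds_ratio, p_fisher = fisher_exact(table)
print(f"Fisher exact P = {p_fisher:.3f}")
```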


When the dependent variable is continuous, and the independent categorical variable has only two categories or groups, the Student t test is applied. The probability of the observed difference, relative to the hypothesis that there is no difference, is derived from the t probability distribution. When there are more than two groups or levels, an analysis of variance is applied, with use of the F distribution. If the dependent variable is a highly skewed continuous variable, or an ordinal variable with many levels, then a nonparametric analysis of variance that utilises ranks rather than the actual values, such as the Kruskal-Wallis test, may be needed.
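

A brief sketch, with invented ejection fractions, of how these three tests might be run in Python using scipy.
```python
from scipy.stats import ttest_ind, f_oneway, kruskal

# Hypothetical ejection fractions (%) measured in three groups of patients
group_a = [55, 60, 58, 62, 57, 59]
group_b = [50, 53, 49, 55, 52, 51]
group_c = [45, 48, 47, 50, 46, 49]

t_stat, p_t = ttest_ind(group_a, group_b)            # Student t test, two groups
f_stat, p_f = f_oneway(group_a, group_b, group_c)    # one-way analysis of variance
h_stat, p_h = kruskal(group_a, group_b, group_c)     # rank-based nonparametric alternative
print(f"t test P = {p_t:.3f}, ANOVA P = {p_f:.3f}, Kruskal-Wallis P = {p_h:.3f}")
```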


Correlations and Associations


In specific circumstances, a statistical test is aimed not at comparing two groups, but at characterising the extent to which two variables are associated with each other. Correlations estimate the extent to which change in one variable is associated with change in a second variable. Correlations cannot establish any cause-and-effect relationship, only association. The strength of the association is represented by the correlation coefficient r, which can range from −1 to 1. An r value of −1 represents a perfect inverse correlation, in which an increase in one variable is exactly accompanied by a proportional decrease in the other. Conversely, an r value of 1 represents a perfect positive correlation, in which an increase in one variable is exactly accompanied by a proportional increase in the other. An r value of zero indicates that a change in one variable is not associated with a change in the other, meaning that there is no linear association between them. The closer the magnitude of the correlation coefficient is to 1, the stronger the association between the variables. The P-value for a correlation coefficient represents the probability of observing that correlation if there were truly no association between the variables, that is, if the true r value were zero. There are many measures of association for assessing the relationship between two categorical variables. For two ordinal variables, the Spearman rank correlation is used; for two continuous variables, the Pearson correlation is used. Underlying the analysis of association between two continuous variables is a regression line, with the correlation coefficient reflecting the amount of variation around that line.
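

Both coefficients are available in scipy; the paired measurements below are invented for illustration.
```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired measurements: age (years) and aortic root diameter (mm)
age      = [2, 4, 6, 8, 10, 12, 14, 16]
diameter = [14, 16, 17, 19, 20, 22, 23, 25]

r_pearson, p_pearson = pearsonr(age, diameter)       # two continuous variables
rho, p_spearman = spearmanr(age, diameter)           # ordinal or skewed variables
print(f"Pearson r = {r_pearson:.2f} (P = {p_pearson:.4f}), Spearman rho = {rho:.2f}")
```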


Matched Pairs and Measures of Agreement


Usually, the measurements in a study are independent of one another. For example, we may wish to compare measurements made in two groups of subjects where we know that the groups are composed of separate individuals who bear no relationship to one another. In order to reduce variation, or to control for a given factor, we may instead create pairs of individuals matched for a common characteristic. Alternatively, the two groups may not be independent but have a relationship at the level of the individual, such as a group of subjects and a group of their siblings; the subjects and their siblings represent matched pairs. When this is the case, we must use statistical tests that take into account the fact that the two groups are not independent. If the dependent variable is categorical, we would use a McNemar chi-square test. If the dependent variable is ordinal, we would use an appropriate nonparametric test, such as the Wilcoxon signed rank test. If the dependent variable is continuous, we would use a paired t test. Each of these tests relates to a different and specific type of probability distribution.
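

A sketch of the three paired tests using scipy and statsmodels; the before-and-after measurements and the paired two-by-two table are invented.
```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired continuous data: oxygen saturation (%) before and after an intervention
before = np.array([82, 85, 80, 78, 84, 86, 81, 79])
after  = np.array([88, 87, 85, 84, 86, 90, 85, 83])

t_stat, p_paired_t = ttest_rel(before, after)    # paired t test for a continuous outcome
w_stat, p_wilcoxon = wilcoxon(before, after)     # signed rank test for an ordinal or skewed outcome

# Hypothetical paired binary outcome: symptom present (yes/no) before versus after;
# rows are 'before', columns are 'after'
paired_table = [[10, 2],
                [ 8, 20]]
mcnemar_result = mcnemar(paired_table, exact=True)
print(p_paired_t, p_wilcoxon, mcnemar_result.pvalue)
```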


Sometimes we will make repeated measurements in the same subject, but using different methodology. By their nature, it is not surprising that the measurements will correlate, since they are measuring the same thing. What we are actually interested in is the degree to which the measurements agree, or agree with a criterion standard. Agreement between two binary variables can be expressed in one of two ways, either as the raw agreement or as the chance-corrected agreement. The raw agreement is merely the number of times the two measures agree divided by the total number of measurements. By chance alone, two binary variables can be expected to agree a substantial proportion of the time, the exact proportion depending on how the values are distributed, so raw agreement is of limited interest. Cohen’s kappa, or chance-corrected agreement, is the degree of agreement between the variables beyond that expected by chance alone. When continuous variables are of interest, agreement is assessed and depicted using Bland-Altman plots. A Bland-Altman plot displays the difference between the two measurements in a pair on the y-axis against the mean of those two measurements on the x-axis. If agreement were perfect, all of the points would lie at a difference of zero, regardless of the value of the measurement. The plot shows the degree and limits of agreement, but also any patterns. Systematic bias can be noted, as well as changes in the magnitude of agreement as the average values become larger or smaller. A paired t test can be used to determine whether any systematic difference exists between the pairs of measurements.
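

A minimal sketch of both measures of agreement, using invented readings; the kappa statistic comes from scikit-learn and the Bland-Altman limits of agreement are computed directly.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary readings of the same ten echocardiograms by two observers
observer_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
observer_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa_score(observer_1, observer_2)    # chance-corrected agreement

# Hypothetical continuous measurements of the same quantity by two methods
method_a = np.array([4.1, 5.0, 6.2, 5.5, 4.8, 6.0, 5.2])
method_b = np.array([4.3, 4.9, 6.5, 5.4, 5.0, 6.3, 5.1])
differences = method_a - method_b
means = (method_a + method_b) / 2                    # x-axis of a Bland-Altman plot
bias = differences.mean()                            # systematic difference between methods
half_width = 1.96 * differences.std(ddof=1)          # 95% limits of agreement about the bias
print(f"kappa = {kappa:.2f}, bias = {bias:.2f}, "
      f"limits of agreement {bias - half_width:.2f} to {bias + half_width:.2f}")
```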


Linear and Logistic Regression Models


Often the relationship between two variables can be represented by a line. For continuous dependent variables, that line may be straight, in which case the analysis is a simple linear regression. Sometimes the relationship is more complex over the range of values of the dependent variable and may not be linear. In this case, transformations of the dependent and independent variables may be used, or non-linear regression techniques can be applied. If the dependent variable is dichotomous, then the relationship modelled is between the probability of one value of the dependent variable and the independent variable. In this case, a logistic regression or log-linear technique would be used. These regression techniques can be extended to incorporate the relationship between the dependent variable and multiple independent variables. The relationship between the dependent variable and each independent variable is then independent of the effect of the other independent variables included in the analysis. The output is in the form of an additive mathematical equation, called a regression equation, whereby the value of each independent variable is weighted in its effect on a baseline value called the intercept. The weighting factors are called regression coefficients, or parameter estimates, and are interpreted in the context of the level of measurement of the variable and its units of measurement. Regression equations are useful in determining the independent effect of specific variables of interest on a given dependent variable, by allowing the investigator to account for bias resulting from potential confounding factors. Statistical testing can be applied to the regression equation as a whole, and to the individual variables included in it. Interaction can also be explored within regression modelling by incorporating interaction terms as additional independent variables in the equation.
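

A hedged sketch of both types of regression using the statsmodels formula interface; the data frame, variable names, and values are invented.
```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a continuous outcome (ejection fraction) and a dichotomous
# outcome (arrhythmia), each modelled on two independent variables
df = pd.DataFrame({
    "ef":         [55, 60, 48, 52, 65, 58, 45, 50, 62, 57],
    "age":        [2, 5, 8, 10, 3, 6, 12, 9, 4, 7],
    "male":       [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "arrhythmia": [0, 0, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Multiple linear regression: continuous dependent variable
linear_model = smf.ols("ef ~ age + male", data=df).fit()
print(linear_model.params)       # intercept and regression coefficients

# Logistic regression: dichotomous dependent variable; coefficients are on the log-odds scale
logistic_model = smf.logit("arrhythmia ~ age + male", data=df).fit(disp=False)
print(logistic_model.params)
```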


Survival or Timed-event Analysis


Since death is a certain event for all of us, if every subject in a cohort were followed for an infinite amount of time, the cumulative survival would eventually reach 0%. If a study reports that the survival of a cohort of subjects is 50%, the number is meaningless unless the time interval over which that survival was accrued is also reported. Additionally, we need to be assured that all of the surviving subjects at risk are accounted for. One of the challenges of following cohorts is that subjects may be lost to follow-up, or otherwise cease to be observed, before they achieve the event of interest, which is known as censoring. For estimates of survival, subjects may drop out of the numerator as they achieve the event of interest, but also out of the denominator as they terminate their contribution to the cohort. In order to depict this phenomenon accurately over time, we need to account continuously for these changes to the cohort. We also need to line up the subjects at a common starting point. In order to assess time-related events correctly, a time-related analysis should be used. Common events that are studied include death, intervention, and repeat intervention; each must occur at a discrete time point. Common starting points include birth, presentation, diagnosis, and intervention, and these must also be defined at a discrete time point. For an analysis of survival from birth, every subject is assigned an interval either to death, or to the end of the observation period at which they were last known to be alive. The proportion of subjects surviving to a given time point is continuously calculated, with subjects who die dropping out of the numerator and denominator, and subjects ending their period of observation while still alive dropping out of the denominator, which represents the patients remaining alive and at risk of death at a given time point.


The most common form of survival analysis is to calculate and plot non-parametric Kaplan-Meier estimates. We can use statistical tests, such as the Wilcoxon and log rank tests, to determine whether independent variables are associated with time-related survival. We can also use a particular type of regression analysis that handles time-related events as the dependent variable, namely Cox proportional hazards regression modelling. This allows us to explore the independent relationship with multiple independent variables, with the exponentiated parameter estimates representing hazard ratios.
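

A brief sketch of all three steps, assuming the lifelines Python package is available; the follow-up times, event indicators, and treatment groups are invented.
```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Hypothetical cohort: follow-up (years), death (1) or censored (0), and treatment group
df = pd.DataFrame({
    "years":  [1.2, 3.4, 5.0, 2.1, 4.8, 0.9, 5.0, 3.3, 2.7, 4.1],
    "died":   [1,   0,   0,   1,   1,   1,   0,   0,   1,   0],
    "drug_b": [0,   0,   1,   0,   1,   0,   1,   1,   0,   1],
})

# Kaplan-Meier estimate of survival, with censored subjects leaving the denominator
km = KaplanMeierFitter().fit(df["years"], event_observed=df["died"])
print(km.survival_function_)

# Log rank test comparing survival between the two treatment groups
a, b = df[df["drug_b"] == 0], df[df["drug_b"] == 1]
result = logrank_test(a["years"], b["years"],
                      event_observed_A=a["died"], event_observed_B=b["died"])
print(f"Log rank P = {result.p_value:.3f}")

# Cox proportional hazards regression: exponentiated coefficients are hazard ratios
cox = CoxPHFitter().fit(df, duration_col="years", event_col="died")
cox.print_summary()
```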


If we wish to use independent variables to predict time-related survival, we can use mathematical modelling of the underlying rate, or hazard, at which events occur. Parametric analytic techniques are increasingly being used, and survival models can be derived that encompass three phases of time-related risk: early, constant, and late. Independent variables associated with each phase can be modelled separately. From this type of analysis, we might discover that the factors associated with early mortality are not the same as those associated with late mortality. One of the most interesting applications of parametric survival models is the creation of competing risks models. Competing risks analysis establishes the likelihood that subjects will have achieved one of two or more mutually exclusive events over time, for which they are simultaneously at risk. The competing risks analysis estimates, at each time point, the likelihood of each competing event occurring against all others, based on a parametric survival model for each event. An excellent example of how this might be applied was a study of outcomes after listing for heart transplantation.1 After listing, a patient may die without receiving a transplant, may undergo transplantation, may be taken off the list because they improve and no longer require transplantation or because they develop complications that preclude it, may undergo an alternative procedure, or may remain on the list, surviving without any of the previously mentioned events. A given patient is simultaneously at risk for all of these mutually exclusive events or end-states. The rate at which patients achieve these events over time can be modelled mathematically and associated factors determined. These rates and their associated factors can then be used for prediction: we can determine the likelihood that a patient with a given set of characteristics will have received a transplant at a given point in time, will remain on the list, will have been removed from the list, or will have died. The advantage of competing risks analysis over multiple separate survival curves is that every subject is counted once, rather than being censored in multiple different curves.
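

The text describes parametric competing-risks models; as a simpler, non-parametric sketch of the same idea, the cumulative incidence of one event in the presence of a competing event can be estimated with the Aalen-Johansen estimator, assuming the lifelines package provides it in the form shown below (the listing data are invented).
```python
import pandas as pd
from lifelines import AalenJohansenFitter

# Hypothetical waiting-list cohort: months from listing, and the first event achieved
# (0 = still waiting / censored, 1 = transplanted, 2 = died while waiting)
df = pd.DataFrame({
    "months": [2, 5, 1, 8, 3, 12, 6, 4, 9, 7],
    "event":  [1, 2, 1, 0, 1,  0, 2, 1, 0, 2],
})

# Cumulative incidence of transplantation, treating death on the list as a competing event
ajf = AalenJohansenFitter()
ajf.fit(df["months"], df["event"], event_of_interest=1)
print(ajf.cumulative_density_)
```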


Longitudinal or Serial Measures Analysis


The values of some variables change over time, and trends can be noted if we measure them repeatedly. Examples might include changes in left ventricular ejection fraction or in the subjective grade of mitral valve regurgitation. If we measure something repeatedly in the same subject, the measurements are not independent, and we need to account for this in the analysis. If the measurements occur at discrete and common time points across subjects, and there is little missing data, then a repeated measures analysis of variance can be used. This is rarely the case with clinical data, since patients rarely all have measurements at the same times, some patients may have more measurements than others or may have been followed longer, and there may be variable amounts of missing data. We may also wish to determine whether independent variables are associated with the measurements and with their changes over time. Specific types of regression analysis have been developed to handle this type of complex data. If the serial measurements are of a continuous variable, then mixed linear or non-linear regression analysis can be used. The effect of time can be included as an independent variable, and interactions between time and other independent variables can be explored to determine associations with trends. If the variable is categorical or ordinal, then a generalised estimating equation type of regression analysis can be used.
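

A minimal sketch of a mixed linear model with a random intercept for each patient, using the statsmodels formula interface; the repeated ejection fractions below are invented. For a categorical or ordinal outcome, the generalised estimating equation routines in the same library (smf.gee) could be used in a similar way.
```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: repeated ejection fraction measurements per patient,
# taken at irregular times and with differing lengths of follow-up
df = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    "years":   [0, 1, 2, 0, 2, 0, 1, 3, 0, 1, 0, 2, 4],
    "ef":      [60, 58, 55, 62, 59, 57, 56, 50, 64, 63, 59, 54, 49],
})

# A random intercept per patient accounts for the non-independence of repeated measurements;
# the coefficient for years estimates the average change in ejection fraction per year
mixed = smf.mixedlm("ef ~ years", data=df, groups=df["patient"]).fit()
print(mixed.params)
```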


Accounting for Non-random Allocation in Comparison of Therapies


Non-random allocation of patients to therapies is one of the limitations of using observational clinical data to compare therapies. Patient characteristics that influence the selection of particular therapies may also influence the outcomes of those therapies, and hence bias any comparison. This is especially true for patients receiving new and higher-risk therapies, since they are often the patients with the most severe or active disease and are the ones expected to have the worst clinical outcomes. The statistical analysis should therefore include methodology to adjust for non-random assignment to therapy and to equalise comparisons. One method often used is the creation of a propensity score.3 A propensity score is the probability, based on a subject’s characteristics, that the subject would have been selected for a specific therapy rather than the alternative. The propensity score is derived from a logistic regression model with the treatment assignment as the dependent variable and all known subject characteristics as independent variables. The logistic regression equation is then solved for each subject and converted to a probability, which is the propensity of that subject to have been selected for the specific therapy. The propensity score can be used in three ways. The first uses the propensity score to select and create pairs of subjects, one from each treatment group, matched according to the score. The second recreates a blocked randomised controlled trial by dividing subjects within treatment groups into tiers, quartiles, or quintiles based on their propensity score; outcomes for each block in one treatment group are then compared against those for the corresponding block in the alternative group. These two uses of the propensity score often require a large number of patients and assume that the propensity scores are balanced and sufficiently overlapping between treatment groups, which is not always the case. When propensity scores are highly unbalanced between groups and little overlap exists, the score can instead be used as an adjustment factor, that is, as an additional independent variable in regression analysis; this is, in fact, the most common way in which the propensity score is used. The use of propensity scores, and other forms of statistical adjustment for potential confounding, adjusts only for those variables that are measured. It does not replace randomisation, which randomly distributes not only measured but also unmeasured factors, and gives the greatest likelihood of an unbiased comparison.
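

As a hedged sketch of the first step, a propensity score can be derived with a logistic regression in statsmodels; the treatment assignments and baseline characteristics below are invented, and the subsequent matching, stratification, or adjustment step is only indicated in the comments.
```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical observational data: non-randomly assigned therapy and baseline characteristics
df = pd.DataFrame({
    "new_therapy": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    "age":         [3, 8, 2, 10, 8, 9, 4, 1, 6, 2, 11, 5],
    "severe":      [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Logistic regression of treatment assignment on the known baseline characteristics;
# each subject's predicted probability is his or her propensity score
ps_model = smf.logit("new_therapy ~ age + severe", data=df).fit(disp=False)
df["propensity"] = ps_model.predict(df)
print(df[["new_therapy", "propensity"]])

# The score can then be used to match pairs across treatment groups, to divide subjects
# into quintiles for blocked comparison, or as an additional covariate in an outcome model.
```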


Seeking Statistical Expertise


One of the key elements of data and statistical analysis is knowing when to seek the assistance of a statistician. Several statistical software programmes are available for performing analysis, and some are more user friendly than others. Nearly all will allow you to perform statistical analysis without knowledge of the underlying mathematics and assumptions on which the tests are based, which increases the chance that a novice will make important errors of inappropriate application. In general, most individuals can handle the description of data and some simple testing where two variables are involved. In contrast, analysis that incorporates multiple independent or dependent variables, such as most forms of regression analysis, should be performed by, or with the guidance of, a statistician. Consultation should also be obtained where nonparametric tests are needed. We might sometimes think that statisticians speak a different language, but it is important to remember that, for the majority of statisticians, your data is merely numbers, without meaning. It is essential, therefore, when enlisting a statistician, that your data be clearly organised and defined, and that you have a detailed plan of how the analysis should proceed. This plan should be discussed with the statistician, and any questions answered and misinterpretations clarified. The role of the statistician is to apply the most appropriate technique given the level of measurement and the assumptions about the data, to report the results, and to assist in the interpretation.
