Echocardiography has come a long way over the last 40 years, moving from an investigative tool that had little support from mainstream cardiology, to one that is considered essential in patients with suspected congenital heart disease. It is not uncommon in 2012 for a pediatric cardiology trainee to perform the echocardiogram prior to ordering an electrocardiogram, with the chest X-ray being all but relegated to the dustbin of diagnostic tools. We convinced our colleagues that echocardiography is a reliable tool for anatomical and functional evaluation, but have we been fooling them?
The initial emphasis was on cardiac morphological diagnosis, with M-mode being a poor tool for this job. Two-dimensional echocardiography was responsible for overcoming the diagnostic challenges that were faced by pediatric cardiologists. Those were the halcyon days of diagnosis, with new observations being made on a weekly basis, followed by an explosion of reports in the literature. Those of us who were actively engaged at that time were basically descriptive echocardiographers, reporting what we observed and how that correlated with cardiac pathology, either at autopsy or at cardiac surgery. This is not to belittle the contributions of our colleagues of that era, as they carried on a deep tradition of cardiac morphology pioneered by the likes of Richard and Stella Van Praagh and Robert Anderson. We described what we saw and applied simple statistical analysis to the data. Inter- and intra-observer variability were terms rarely used by morphological echocardiographers. Despite this, these studies changed the mode of diagnosis in pediatric cardiology forever.
At the same time there was a subgroup of pediatric echocardiographers with a passion for cardiac function, albeit through the application of M-mode techniques. M-mode was the champion of cardiac functional assessment, and those involved were obsessed with the technique, relying on its excellent temporal resolution. The next explosion in technology was continuous wave and range-gated pulsed Doppler. Continuous wave Doppler became a reliable and reproducible method for gradient measurement in the heart, and indeed has stood the test of time, being a modality that is currently relied upon by clinical and research cardiologists. The Bernoulli equation ruled then, and still does today. Pulsed Doppler was next out of the blocks, and the initial focus was on quantitative assessment of cardiac output and shunt quantification. The latter was fashionable for a while, and indeed many studies reported high reproducibility with excellent correlation between Doppler-derived and invasive hemodynamic calculations. It was at that time that the senior author became skeptical as to how the data were presented. This technique did not stand the test of time, and relatively few pediatric echocardiographers actually use Doppler-derived shunt calculations for their patients in the present day. The story goes on, as more and more techniques of varying complexity were developed by our colleagues, many of whom worked in conjunction with industry to help identify applications for the newer techniques. We became consumed with multiple techniques to assess cardiac function: volumetric calculations, pulsed Doppler techniques for assessing diastolic function, tissue Doppler, acoustic quantification, speckle-tracking, color M-mode, etc. Alongside the development of these techniques, more and more complicated mathematical models were being developed to aid the echocardiographer in “analyzing the data”. What is an echocardiographer to do with so many choices (often too many)?
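The gradient estimation the field still relies on is the simplified Bernoulli relationship, ΔP ≈ 4v², with the gradient in mmHg and the continuous wave Doppler peak velocity in m/s. A minimal sketch (the function name and example velocity are ours, for illustration):

```python
def simplified_bernoulli(peak_velocity_m_s: float) -> float:
    """Simplified Bernoulli equation: peak instantaneous gradient (mmHg)
    from a CW Doppler peak velocity (m/s), proximal velocity neglected."""
    return 4.0 * peak_velocity_m_s ** 2

# A 4 m/s jet across a stenotic valve implies a 64 mmHg peak gradient.
print(simplified_bernoulli(4.0))  # 64.0
```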
As well, short cuts were developed to make life simpler for the echocardiographer. One prime example was the study of the velocity of circumferential fiber shortening to wall stress relationship. Dr. Steven Colan was a proponent of this technique, and indeed spent part of his career validating this methodology in a very detailed and thorough manner. The theory was sound, and the methodology somewhat challenging, but done correctly it did provide great insight into this important relationship, probably the most important in the heart. What happened in the search for simplicity was that end-systolic pressure, one of the key measurements, was replaced with peak systolic pressure to make the method more “user friendly”. The senior author has known Dr. Colan for many years and, although they have never worked together, respects his contribution to the field. Dr. Colan and the senior author have followed different pathways, with the latter being interested in the application of cardiac ultrasound as it pertains to morphology, while Dr. Colan has focused on standards development and methodology. Dr. Colan has been at the forefront of the Pediatric Heart Network Investigators group with support from the NHLBI, and despite the lack of glamour that comes with some of these studies, has demonstrated his ability to set standards, to establish the importance of developing reproducible techniques, and to challenge how we measure things.
It is therefore very appropriate that the current issue of JASE includes a seminal paper by Colan et al. Who better has the credentials to challenge how we apply quantitative echocardiography in pediatrics and, more importantly, to question the validity of the findings?
Over the years there have been several standards documents produced by the pediatric echocardiography societies. These address training, as well as how data should be acquired and analyzed. It has always been the presumption that if such guidelines are followed, diagnostic errors can be avoided and the data acquired should be reproducible. This is true for morphological diagnosis; if it were not, our surgical colleagues would have abandoned us many years ago. However, this may not apply to functional assessment in pediatrics. In pediatrics we have an advantage over our adult colleagues in image acquisition, as many of our patients have excellent acoustic windows, certainly in their younger years. This does not apply once they have undergone several surgical procedures, and unfortunately this is the case for one of our most important and ever-growing populations, i.e., those who have had a Fontan operation.
Patient cooperation is our main Achilles’ heel and is one of the most limiting factors that impact image quality and reproducibility in the younger population. Although there are sedation guidelines that most of us follow, there is still reluctance to sedate the large numbers of patients whom we evaluate. As a result, many studies are performed on an uncooperative patient, which may be adequate for some of the simpler anatomical diagnoses, but certainly not for those with more complex forms of congenital heart disease. Functional assessment suffers an even worse fate, as how can reliable and reproducible data be acquired on an uncooperative patient? This is why it is very difficult to do research studies on younger patients, and even more challenging to persuade ethics boards to accept sedation as part of a study protocol. Building protocols into the echocardiography laboratory is another key to improving the quality of the studies. Our current laboratory, and the one at Sick Kids where the senior author previously worked, are protocol driven.
Not all echocardiographers and sonographers are equal in their scanning and interpretation abilities. This is an important factor in image quality and subsequent data analysis. Our subspecialty has an advantage over other imaging modalities in that it is more interesting for those performing the studies, but a disadvantage in that it is totally dependent on the ability of that individual to acquire an adequate “data set”. Human variability is probably one of the main weaknesses of echocardiography.
Another question is whether those echocardiography laboratories that perform research studies obtain higher quality data, both anatomical and functional. The answer is not clear, but we would speculate that it is probably yes. Our experience is that if one is working according to protocols, measuring and analyzing data, then one becomes more discriminating. The decision to reject studies is also a challenge, and any paper that quotes a 100% success rate in obtaining data acceptable for analysis is probably stretching the truth. Rejecting data is a challenge, as it might have taken a long time to recruit the patients, particularly younger ones, resulting in the temptation to add substandard studies into the data pool. This is only human nature, but it exposes one of the main weaknesses inherent to echocardiography: how do we decide what constitutes adequate data to analyze? This problem is magnified if mathematical formulas are applied to calculations: for example, measuring the area of the aortic root in calculations of left ventricular output. One of the inherent errors in this measurement is due to the limits of axial resolution; because the measured dimension is squared in the calculation, the error is magnified in those patients with a smaller aorta.
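The squaring effect is easy to demonstrate. As a sketch (the 1 mm axial-resolution figure and the diameters are illustrative assumptions, not values from the study), the same measurement uncertainty produces a far larger fractional area error in a small aorta than in a large one:

```python
import math

AXIAL_RESOLUTION_CM = 0.1  # illustrative 1 mm measurement uncertainty

def area_from_diameter(d_cm: float) -> float:
    """Cross-sectional area, assuming a circular aortic root outlet."""
    return math.pi * (d_cm / 2.0) ** 2

def fractional_area_error(d_cm: float, err_cm: float = AXIAL_RESOLUTION_CM) -> float:
    """Fractional area error when the diameter is off by one resolution cell."""
    true_area = area_from_diameter(d_cm)
    return (area_from_diameter(d_cm + err_cm) - true_area) / true_area

# The same 1 mm uncertainty matters far more in a small aorta:
print(f"{fractional_area_error(1.0):.1%}")  # 21.0%  (1 cm neonatal aorta)
print(f"{fractional_area_error(3.0):.1%}")  # 6.8%   (3 cm aorta)
```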
We believe that only by performing research studies do you really understand the strengths and limitations of any given technique. This is a somewhat altruistic approach, and does not reflect the real world of echocardiography. Our colleagues who are not actively engaged in research studies therefore need to know what measurements and calculations are valid, can be applied to their own patients, and are reproducible from one study to another.
Thus the manuscript at hand by Colan et al. is able to shed considerable light on the murky side of functional echocardiography. They embarked on a logistically large and complex study to determine the inter-study variability of echocardiographically derived left ventricular (LV) end-diastolic volume Z score, mass Z score, and ejection fraction Z score in patients with dilated cardiomyopathy. The paper reported in this manuscript is a sub-analysis of the data to examine the efficacy of beat averaging in improving measurement variability in 150 common LV echocardiographic variables in this pediatric population. In addition, and probably of more importance, they examined the weaknesses with regard to reproducibility of certain LV variables. The authors were able to summarize and categorize the 150 variables into important sub-types, such as measurements involving areas, dimensions, calculated variables of differing complexity (those using 2, 3, or 4 measured variables in the calculation), integrals, slopes, times, and velocities, to determine where the echocardiographic “errors” occur.
Beat averaging was found to reduce the mean percentage error in this pediatric population. Beat averaging is of particular importance in the pediatric population, where breath hold at end-expiration is all but impossible until children are old enough to comprehend what it means. Beat averaging did reduce the percent error in all sub-types of echocardiographic variables and appears to be more effective in the poorly reproducible variables, such as more complex derived parameters and slope measurements. We note that although benefits also exist in more robust variables, they were marginal. As the authors discussed, the use of beat averaging should be considered and probably utilized to reduce variability in research studies, and the use of a core lab with a single observer will probably provide the best standard in study design when echocardiographic parameters are chosen as a study end point. However, in daily clinical practice, routine beat averaging on all echocardiographic parameters would be too time consuming and impractical. As well, there is questionable benefit, as the upper quartile of all variable sub-types showed a percentage error of up to 150%, while beat averaging reduced mean percentage error by about 10% at best. Even some of our “good” variables may not be as reproducible as we believed; however, before we throw out all echocardiographic functional measures, it is important to provide words of caution on the interpretation of the findings of percentage error.
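The statistical intuition behind beat averaging can be sketched in a small simulation. Assuming independent, roughly Gaussian beat-to-beat noise (the ejection fraction value, noise level, and trial counts below are invented for illustration), averaging N beats shrinks the random component of the inter-reading error roughly √N-fold:

```python
import random
import statistics

def percent_error(a: float, b: float) -> float:
    """Percent error between a pair of repeat measurements, relative to their mean."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

def reading(true_value: float, beat_sd: float, n_beats: int, rng: random.Random) -> float:
    """One reported value: the mean of n consecutive beats with random beat-to-beat noise."""
    return statistics.fmean(true_value + rng.gauss(0.0, beat_sd) for _ in range(n_beats))

rng = random.Random(42)
TRUE_EF, BEAT_SD, TRIALS = 60.0, 3.0, 2000  # hypothetical values

single = statistics.fmean(
    percent_error(reading(TRUE_EF, BEAT_SD, 1, rng), reading(TRUE_EF, BEAT_SD, 1, rng))
    for _ in range(TRIALS))
averaged = statistics.fmean(
    percent_error(reading(TRUE_EF, BEAT_SD, 3, rng), reading(TRUE_EF, BEAT_SD, 3, rng))
    for _ in range(TRIALS))

# Averaging 3 beats reduces the random error roughly sqrt(3)-fold.
print(f"single-beat: {single:.1f}%, 3-beat average: {averaged:.1f}%")
```

Note that this only models random beat-to-beat variation; it cannot touch systematic differences between observers or machines, which is consistent with the modest gains the authors report for the already robust variables.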
The use of percentage error as the only reported measure of variability may at times be overly simplistic. Percentage error is often favored for its simplicity and does not “hide” errors from the readers, but it does have several limitations and may at times judge a variable more harshly than it deserves. For example, a 10% error for inter-observer variability may be a consequence of a systematic bias between the two observers, with one measuring consistently higher than the other by 10%. In this case their intra-observer variability would be 0%. Correction of variability arising from systematic bias can be achieved by repeat training of the observers. This is very different from an inter-observer percentage error in which no such bias exists, where correction of the variability through training is much more difficult, if possible at all. The science and statistics in the assessment of measurement error are complex, and this is an area of knowledge to which many researchers and clinicians have paid too little attention for far too long. This report of Colan et al. has helped to refocus the importance of understanding and recognizing variability in our measurements, and also, to a degree, how we should assess this variability.
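The Bland-Altman approach referenced below makes exactly this distinction, separating the mean difference (bias) from the spread of disagreement (limits of agreement). A minimal sketch, with invented observer values for illustration:

```python
import statistics

def bland_altman(obs1: list[float], obs2: list[float]) -> tuple[float, tuple[float, float]]:
    """Mean difference (bias) and 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(obs1, obs2)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Observer B reads ~10% higher than observer A on every study: a sizeable
# "percent error", but nearly all of it is a correctable systematic bias,
# visible as a large mean difference with narrow limits of agreement.
obs_a = [50.0, 55.0, 60.0, 65.0, 70.0]
obs_b = [v * 1.10 for v in obs_a]
bias, (lo, hi) = bland_altman(obs_a, obs_b)
print(f"bias = {bias:.1f}, limits of agreement = ({lo:.2f}, {hi:.2f})")
```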
Our increasing awareness is reflected in a trend among manuscript reviewers to demand intra- and inter-observer measurement variability testing to accompany each new research study. However, it is disappointing that the testing actually performed and the statistics used are often incomplete, misinterpreted, and at worst misleading. The many pitfalls of the various reproducibility tests are discussed in the education series by Bland and Altman, starting with their Lancet publication in 1986 and since carried on in the British Medical Journal. We think the need for statistical review is now almost mandatory, as is the application of the correct statistical method to test reproducibility accurately, but also to provide useful parameters that are applicable in single point measurements, such as calculations of repeatability. It is also important to recognize when reproducibility should be requested in a manuscript. The need to present reproducibility data on variables used in every single submission is excessive when previous studies have performed and presented reproducibility testing in a similar setting. It is even less relevant when the focus of the paper’s finding is not for direct application in a clinical setting. The conditions for achieving reproducibility in a research laboratory are far different from those in a clinical setting; hence reproducibility testing should be performed for an intentional purpose, as the authors have done.
In addition to highlighting the variability in our daily echocardiographic LV variables, Colan et al. made a valiant attempt at identifying the sources of variability. The control of differences between machines, transducers, and settings is difficult to do and more challenging to quantify. This is beyond the control of echocardiographers but is important if we want to produce reliable and reproducible data. This is where the main ultrasound manufacturers should provide leadership by setting similar standards between ultrasound machines, just as happened with DICOM, whose development was driven by complaints from echocardiographers who were tired of too many different digital formats. What are within our control are the variables we choose to measure and our interpretation of them during the single time point of assessment. The presentation of the least and most reproducible variables in the study of Colan et al. serves to highlight that less is best, with the most reproducible measurements being the simplest. The most reproducible are dimensions, time intervals, and areas. This is somewhat disappointing, particularly if we are trying to measure volume-derived ejection fraction, or end-systolic volume as a surrogate of ventricular function. The rogues’ gallery contains some of our old favorites, such as the Tei index, which is frequently used by pediatric echocardiographers. As well, many of the derived values and slopes that we have come to rely on in the field of research were prone to higher percent errors.
We have to speak out in defense of the much-maligned assessment of the slope of the isovolumic acceleration (IVA). The authors tested the variability of IVA using spectral tissue Doppler, which we would agree is nearly impossible, and certainly in our own practice it is never performed that way. The references mentioned in the discussion showed much more acceptable IVA reproducibility data presented in research settings. Both studies used offline analysis of color tissue Doppler, which most would agree is less prone to error and easier to reproduce. It is the evolving advancement of such technology and analysis software that has the potential to greatly improve our ability to measure these important variables when assessing LV function. Three-dimensional volumetric ejection fraction has already been demonstrated on numerous occasions to have better reproducibility than two-dimensional biplane Simpson’s method. The findings of this current study should serve as further motivation to apply this new technology. Tissue deformation assessments with speckle-tracking strain and strain rate are proving to be a robust, if not more sensitive, assessment of LV function, relating well to volumetric ejection fraction; however, the reproducibility of these approaches is yet to be truly tested.
The strength of these newer assessment tools is the trend towards removing human subjective input. This is probably the right approach, as the most reproducible data in the study of Colan et al. were obtained when a single trained observer performed the studies using beat averaging. It is impractical for one individual to do all the measurements with beat averaging, but this can easily be done by a machine. However, some obstacles remain, as competition between vendors and the lack of a standard for many of the auto-border detection and speckle-tracking algorithms make measurements between laboratories difficult to interpret. Koopman et al. recently compared vendor-specific measurements of circumferential and longitudinal strain, demonstrating a systematic bias for circumferential and a higher variation for radial, and emphasizing care when performing multi-institutional studies. With increasing sophistication of offline analysis software comes an increasing variation in the settings available for the algorithm to perform its analysis. For example, color TDI using EchoPAC software (GE Healthcare, Milwaukee, WI) has about six different ways of smoothing the velocity trace for display before measurements are performed with calipers, yet which smoothing algorithm has been used is hardly mentioned in most research methodology or in the current recommendations on evaluation of myocardial mechanics. Similarly, such settings exist for speckle-tracking temporal and spatial smoothing. Our lack of awareness of, and emphasis on, software settings could potentially be another future source of measurement-technology-related variability.
It would appear that there are inherent problems with many of the measurements that are being performed, and it may well be difficult to interpret the data from any one particular study and apply it to one’s own patient population. Some of the percent errors in the measurements from one center to another far exceed some of the functional changes that we are looking for in our patients from one clinic visit to another, making the values questionable, particularly when complex measurements and calculations are undertaken. Although the study protocol used in the Pediatric Heart Network study included inter-study variability, those results were not presented in this manuscript. The variability for most, if not all, of the variables may be magnified by differences in the two-dimensional imaging planes attained by a different observer. Three-dimensional imaging has the potential to somewhat reduce this variability. Finally, the most disturbing question is this: what is the percent error in those echocardiography laboratories that do not adhere to the rigorous protocols established by this working group? We would hazard a guess that many of the measurements being performed by all of us in the field are of little clinical value.
The study of Colan et al. has been a long time coming, considering that it reports on measurements and calculations that we have used for a long time. What we can take away from it is that echocardiography has some way to go toward becoming a reliable and reproducible technique for functional assessment in children, and that until this happens we must use such studies as a reference when interpreting data obtained at any point in time in our patients. It also has major implications with regard to the value of this technique in multi-institutional studies that address the impact of treatment, or the progression of ventricular disease, in the pediatric population.