Critical Review of Current Approaches for Echocardiographic Reproducibility and Reliability Assessment in Clinical Research




Background


There is no broadly accepted standard method for assessing the quality of echocardiographic measurements in clinical research reports, despite the recognized importance of this information in assessing the quality of study results.


Methods


Twenty unique clinical studies were identified reporting echocardiographic data quality for determinations of left ventricular (LV) volumes ( n = 13), ejection fraction ( n = 12), mass ( n = 9), outflow tract diameter ( n = 3), and mitral Doppler peak early velocity ( n = 4). To better understand the range of possible estimates of data quality and to compare their utility, reported reproducibility measures were tabulated, and de novo estimates were then calculated for missing measures, including intraclass correlation coefficient (ICC), 95% limits of agreement, coefficient of variation (CV), coverage probability, and total deviation index, for each variable for each study.


Results


The studies varied in approaches to reproducibility testing, sample size, and metrics assessed and values reported. Reported metrics included mean difference and its SD ( n = 7 studies), ICC ( n = 5), CV ( n = 4), and Bland-Altman limits of agreement ( n = 4). Once de novo estimates of all missing indices were determined, reasonable reproducibility targets for each were identified as those achieved by the majority of studies. These included, for LV end-diastolic volume, ICC > 0.95, CV < 7%, and coverage probability > 0.93 within 30 mL; for LV ejection fraction, ICC > 0.85, CV < 8%, and coverage probability > 0.85 within 10%; and for LV mass, ICC > 0.85, CV < 10%, and coverage probability > 0.60 within 20 g.


Conclusions


Assessment of data quality in echocardiographic clinical research is infrequent, and methods vary substantially. A first step to standardizing echocardiographic quality reporting is to standardize assessments and reporting metrics. Potential benefits include clearer communication of data quality and the identification of achievable targets to benchmark quality improvement initiatives.


Highlights





  • The authors reviewed approaches for assessing echocardiographic reproducibility in research.



  • Statistical methods assessing echocardiographic data quality vary substantially.



  • For each echocardiographic variable (such as LV ejection fraction), the range of values for each statistical metric is large.



  • The relative data quality varies depending on the metric examined.



  • Achievable precision metrics were identified for each echocardiographic variable (such as LV ejection fraction) in the studies examined.



Cardiac ultrasound has the potential to contribute critically important imaging end points in clinical cardiovascular research. However, its use is limited by real and perceived suboptimal measurement reproducibility. One notable example of poor data reliability is the Predictors of Response to Cardiac Resynchronization Therapy trial, which reported a large percentage of nonassessable echocardiographic data (up to 50% for Doppler tissue imaging parameters) and low interobserver reproducibility results (coefficient of variation up to 72%), making the trial findings difficult to interpret.


Additionally, comparison among studies is difficult, because a variety of statistical approaches to assess variability have been reported in the literature. Some studies report coefficient of variation (CV), while others report the Pearson correlation coefficient, percentage error, or the intraclass correlation coefficient (ICC). Of note, the American Society of Echocardiography Cardiovascular Technology and Research Summit developed a roadmap for 2020 that included the goals of (1) documenting the reproducibility of quantitative echocardiographic biomarkers and (2) developing reproducibility standards for echocardiography core laboratories.


In response to this roadmap, the aims of the present study were to (1) determine the range of reproducibility methods in use, (2) calculate not-reported reproducibility metrics for selected variables across representative studies to compare these results, and (3) determine if these findings hold implications for establishing a standard of precision for echocardiography in clinical research.


Methods


Overview


A representative cohort of studies was identified to capture the range of metrics used to assess reproducibility and their corresponding results for six variables: left ventricular (LV) end-diastolic volume, LV end-systolic volume, LV ejection fraction, LV mass, LV outflow tract diameter, and mitral diastolic inflow peak early velocity. Of these, studies with sufficient reported data were selected to calculate missing but commonly reported reproducibility metrics with reasonable assumptions. On the basis of these findings, estimates of reasonably achievable targets for these metrics are proposed for each variable.


Cohort Selection


To identify a sample of studies reporting metrics for echocardiographic reproducibility, a PubMed search was performed using the following search terms: “echocardiographic reproducibility,” “LV ejection fraction,” “interstudy echocardiography reproducibility,” “echocardiography core lab,” and “echocardiographic LV mass correlates” (see Supplemental Table 1 and Figure 1 ). Only human studies published in English with data providing quantitative assessments of the reproducibility of continuous variables were included. In addition, a review of each study’s cited references was performed to identify additional relevant studies. Finally, unpublished results from our own core laboratory reproducibility testing were included. Each study was reviewed in detail to identify the study population, types and variables of reproducibility assessed, number of echocardiograms analyzed, and reported reproducibility metrics. If other commonly used reproducibility metrics were not reported, we determined if sufficient data were provided in the articles to make reasonable assumptions to calculate the other de novo estimates of data quality. Only those studies with sufficient data for us to report all commonly used reproducibility metrics for a given variable were included in the cohort for analyses.




Figure 1


PubMed search strategy to identify studies with quantitative echocardiographic reproducibility data on LV end-diastolic and end-systolic volumes ( n = 13), LV ejection fraction (LVEF) ( n = 12), LV mass ( n = 9), LV outflow tract (LVOT) dimension ( n = 3), and mitral inflow Doppler early velocity ( n = 4). Of these, studies with sufficient reported data to enable further calculations were selected to calculate missing but commonly reported reproducibility metrics with reasonable assumptions.


Statistical Analysis


Descriptive statistics including mean and SD of the overall population were extracted from the selected published results. If only the reproducibility population was reported, then the descriptive statistics for the reproducibility population were used for the overall population. We also extracted any reported mean and SD of paired differences of measurements on the same subject (and assumed that the mean difference was zero if not reported) and all data presented on reproducibility methods, including (1) ICC, (2) 95% limits of agreement (LOA), (3) CV, (4) coverage probability (CP), and (5) total deviation index (TDI). These indices were selected on the basis of a previous review of agreement indices for assessing and improving measurement reproducibility in an echocardiography core laboratory setting.


Any indices not reported were calculated with specific approaches detailed in the Appendix, such that a complete set of indices was available for each variable in each study. Although the specific formula for each of the indices has been reported previously and is provided in the Appendix, we provide below a brief description and interpretation of these indices.


ICC


ICC has been the most popular index used to report reliability in the medical literature. Although there are different versions of ICC depending on different assumed analysis of variance (ANOVA) models, the original ICC based on a one-way ANOVA model with subject effect is still the most commonly reported index and is the ICC used in this study. It is defined as the ratio of between-subject variability to total variability (sum of between-subject variability and error variability). For the case of two observers, the error variability is equal to half of the variance of the differences of measurements by the two observers. ICC values range from −1 to 1. Interpretation of the ICC is that the larger the ICC value, the better the reproducibility. However, the definition of adequate and inadequate reproducibility on the basis of ICC is controversial. Landis and Koch provided adjectives of “substantial” for values between 0.6 and 0.8 and of “almost perfect” for values between 0.81 and 1.0, although these cut points are arbitrary and subjective. Intuitively, if the error variability is small relative to the between-subject variability, then the ICC value will be high (close to 1).
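As an illustration only, the one-way ANOVA ICC described above can be sketched for the two-observer case in Python (the analyses in this study were performed in SAS; the function name `icc_oneway` and the example readings are hypothetical). The sketch uses the usual mean-square estimator, (MSB − MSW)/(MSB + MSW) for two readings per subject, which we assume corresponds to the version used here.

```python
from statistics import mean

def icc_oneway(first, second):
    """One-way random-effects ICC for two readings per subject."""
    pairs = list(zip(first, second))
    n, k = len(pairs), 2
    subject_means = [(a + b) / 2 for a, b in pairs]
    grand_mean = mean(subject_means)
    # Between-subject mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject (error) mean square; for two readings per subject
    # this is half of the mean squared paired difference.
    msw = sum((a - b) ** 2 for a, b in pairs) / (2 * n)
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical LV ejection fraction readings (%) from two observers.
reader_1 = [55, 60, 65]
reader_2 = [60, 55, 65]
print(round(icc_oneway(reader_1, reader_2), 3))  # 0.636
```

Identical readings from both observers give an error mean square of zero and therefore an ICC of 1.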


Because the ICC is a relative index, an artificially high or low value can be obtained when the between-subject variability is large or small, respectively, even though the error variability is the same. This drawback can be visualized in Figure 2 with two examples of 20 pairs of hypothetically generated LV ejection fractions. In Figure 2A, the first observer’s readings were randomly generated with values from 10% to 90%. The second observer’s readings were obtained by adding 10% to or subtracting 10% from the first observer’s readings, so that the difference between the two observers is always equal to ±10%. The ICC for these data is estimated to be very high (ICC = 0.91; 95% CI: 0.79 to 0.96). In Figure 2B, 20 pairs of hypothetical LV ejection fraction readings were also generated randomly. However, in this case, the first observer’s readings were randomly generated between values of 55% and 65%. The second observer’s readings were again obtained by either adding 10% to or subtracting 10% from the first observer’s readings. As a result, the difference between the two observers again is always equal to ±10%. However, the ICC for the second example is estimated to be very low (ICC = 0.15; 95% CI: −0.29 to 0.54). On the basis of ICC, one would conclude that there is excellent reproducibility in the first data set but poor reproducibility in the second data set. In fact, the reproducibility is the same for both data sets, because the differences between the two observers are always equal to ±10%. The low ICC value in Figure 2B is due to the relatively small between-subject variability in the data. Even though this drawback has been recognized in the statistical literature, it is insufficiently recognized in the medical literature, and the ICC continues to be a popular index for reporting reliability. We direct the interested reader to a previous publication in which different reproducibility metrics were explored more thoroughly.
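The dependence of the ICC on between-subject spread can also be seen from a rough population-level calculation, sketched here in Python. This is not the sample-based estimation used for Figure 2. It assumes the first readings are uniformly distributed over the stated range and that +10 and −10 differences are equally likely, so that the variance of the differences is 100 and the error variability is 50. The function names are hypothetical.

```python
def uniform_variance(low, high):
    """Variance of a uniform distribution on [low, high]."""
    return (high - low) ** 2 / 12

def population_icc(between_var, error_var):
    """Ratio of between-subject variability to total variability."""
    return between_var / (between_var + error_var)

# Assumption: differences of +10 or -10 with equal probability, so the
# variance of the differences is 10**2 and the error variability is half of it.
error_var = 10 ** 2 / 2

wide = population_icc(uniform_variance(10, 90), error_var)    # readings from 10% to 90%
narrow = population_icc(uniform_variance(55, 65), error_var)  # readings from 55% to 65%
print(round(wide, 2), round(narrow, 2))  # 0.91 0.14
```

Under these assumptions the same error variability yields values close to the sample estimates quoted above, which shows that the contrast is driven by the range of the readings and not by the measurement error.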




Figure 2


(A) The difference in LV ejection fraction (LVEF) measurements between 20 pairs of randomly generated LVEF measurements is displayed on the y axis, and the average of the two LVEF measurements is displayed on the x axis. The first measurement has values ranging from 10% to 90%. The second measurement is always equal to the first measurement ±10. The ICC in this example is 0.91. (B) The difference in LVEF measurements between 20 pairs of randomly generated LVEF measurements is displayed on the y axis, and the average of the two LVEF measurements is displayed on the x axis. The first measurement has values ranging from 55% to 65%. The second measurement is always equal to the first measurement ±10. The ICC in this example is 0.15. However, the data in both (A) and (B) represent the same reproducibility, because the differences between the two readers are always equal to ±10%. The low ICC value in (B) is due to the relatively small between-subject variability in the data.


Ninety-Five Percent LOA


The 95% LOA of Altman and Bland are a popular tool for examining agreement between two measurements on the same subject because of their intuitive appeal. The 95% LOA are centered on the mean difference; assuming that the differences are normally distributed, 95% of the differences would fall within these limits. Limits that are asymmetric about zero imply some bias, and the magnitude of the limits indicates the magnitude of disagreement for 95% of the subjects. Smaller limits imply better reproducibility. A Bland and Altman plot, such as Figure 2A or 2B, plotting average versus difference, is used to display the data visually. For both hypothetical data sets in Figure 2, the estimated 95% LOA are −9.81% and 7.81%. This also implies that 95% of differences are within ±10%.
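A minimal Python sketch of the LOA calculation follows, assuming the conventional form of mean difference ± 1.96 × SD of differences. The function names are hypothetical. The example uses the mean difference and SD listed for study C in Table 1 (0.9 and 13.5 mL).

```python
from statistics import mean, stdev

def loa_from_summary(mean_diff, sd_diff, z=1.96):
    """95% limits of agreement from a mean difference and its SD."""
    return mean_diff - z * sd_diff, mean_diff + z * sd_diff

def loa_from_pairs(first, second, z=1.96):
    """95% limits of agreement computed from paired readings."""
    diffs = [a - b for a, b in zip(first, second)]
    return loa_from_summary(mean(diffs), stdev(diffs), z)

lower, upper = loa_from_summary(0.9, 13.5)
print(round(lower, 1), round(upper, 1))  # -25.6 27.4
```

The result agrees with the limits tabulated for that study, which suggests, but does not confirm, that the same multiplier was used for the calculated entries.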


Within-Subject CV


Only the within-subject CV is used for assessing reproducibility. The CV is defined as the within-subject SD (i.e., the square root of the error variability, or the SD of differences divided by √2) divided by the population mean. The smaller the within-subject CV, the better the reproducibility. By definition, the CV depends on the population mean, and its value could be artificially small or large for a study with a large or small population mean, even though the error variability is the same. For the two generated data sets in Figure 2, the estimated within-subject SD is equal to 2.25% because the differences between the two observers are the same, but the estimated population means are 43.3% and 59.5%, respectively, resulting in a larger within-subject CV for the first data set (0.052 or 5.2%) than for the second data set (0.038 or 3.8%).
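As a hedged illustration, the within-subject CV can be computed from summary statistics in Python, taking the within-subject SD as the square root of half the variance of the paired differences (the error variability defined in the ICC section). The function name is hypothetical. The example uses the SD of differences and population mean listed for study A in Table 1 (21.2 and 132.5 mL).

```python
from math import sqrt

def within_subject_cv(sd_of_differences, population_mean):
    """Within-subject CV for two readings per subject."""
    # Error variability is half the variance of the paired differences,
    # so the within-subject SD is the SD of differences divided by sqrt(2).
    within_subject_sd = sd_of_differences / sqrt(2)
    return within_subject_sd / population_mean

cv = within_subject_cv(21.2, 132.5)
print(f"{100 * cv:.1f}%")  # 11.3%
```

This reproduces the CV tabulated for that study. Halving the population mean with the same SD of differences would double the CV, which is the dependence on the mean noted above.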


CP and TDI


CP and TDI are two agreement indices, with equivalent concepts, to measure the proportion of cases within a boundary for allowed differences. Intuitively speaking, a reasonable criterion to judge whether reproducibility is satisfactory would be to require that an overwhelming majority (e.g., 95% or 80%) of the absolute differences be within a preset acceptable difference (i.e., prespecified acceptable measurement error). The probability of the observed absolute differences falling within the acceptable difference is the CP for the given acceptable difference. The higher the CP, the better the reproducibility. Estimation of the CP can be accomplished simply by using the proportion of paired absolute differences falling within limits that are considered acceptable. In the hypothetical data in Figure 2 , the CP for the acceptable difference of 10 is equal to 100% because the paired absolute differences are all equal to 10. If data are available, a CP curve can be constructed by plotting the observed absolute differences versus the corresponding estimated CPs connected with lines. One can then choose an acceptable difference and find the corresponding estimated CP in the curve. Therefore, the CP curve provides a visual spectrum of measurement error for different acceptable differences. For the hypothetical data in Figure 2 , the estimated CP curve is the straight line connecting two points: (0, 0) and (10%, 100%). If data are not available, the CP curve can be constructed, assuming that the differences are normally distributed, by using reported or imputed mean and SD. This latter approach is used for the computed CP curves presented in the “Results” section for the chosen studies, because the original data were not available. Comparisons of CP curves would be available in the future by using relative area under the curve.
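When paired readings are available, the empirical CP and the points of a CP curve can be sketched in Python as follows. The function names and example readings are hypothetical. Differences equal to the acceptable difference are counted as falling within it, consistent with the CP of 100% quoted for the Figure 2 data.

```python
def coverage_probability(first, second, acceptable_difference):
    """Proportion of paired absolute differences within the acceptable difference."""
    abs_diffs = [abs(a - b) for a, b in zip(first, second)]
    return sum(d <= acceptable_difference for d in abs_diffs) / len(abs_diffs)

def cp_curve(first, second):
    """(absolute difference, CP) points for each observed absolute difference."""
    abs_diffs = sorted(abs(a - b) for a, b in zip(first, second))
    n = len(abs_diffs)
    return [(d, sum(x <= d for x in abs_diffs) / n) for d in sorted(set(abs_diffs))]

# Hypothetical readings in which every paired difference is +10 or -10.
reader_1 = [50, 60, 55, 62]
reader_2 = [60, 50, 45, 72]
print(coverage_probability(reader_1, reader_2, 10))  # 1.0
print(coverage_probability(reader_1, reader_2, 9))   # 0.0
print(cp_curve(reader_1, reader_2))                  # [(10, 1.0)]
```

Plotting the returned points, together with the origin, and joining them with lines gives the empirical curve described above.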


Sometimes it may be difficult to prespecify the acceptable difference. If we believe that the observers are making quality measurements, and they represent the best achievable measurements for a given parameter, then it is of interest to know the range within which an overwhelming majority (e.g., 95% or 80%) of the observed absolute differences would be expected to fall. This expected absolute difference is called the TDI (for a given probability as the measure of majority). We used 80% and 95% as the given probabilities in this study. In Figure 2, the estimated TDI would be 10 for both cases.
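When only a mean difference and SD are available, one commonly used normal-theory approximation takes the TDI as the standard normal quantile at (1 + p)/2 multiplied by the square root of the mean squared difference. The Python sketch below assumes this approximation; the formulas in the Appendix are not reproduced in this section and may differ. The function name is hypothetical. The example uses the mean difference and SD listed for study D in Table 1 (8.9 and 24.8 mL).

```python
from math import sqrt
from statistics import NormalDist

def tdi(mean_diff, sd_diff, probability):
    """Approximate total deviation index under normally distributed differences."""
    # Two-sided quantile: the requested proportion of |difference| lies below it.
    z = NormalDist().inv_cdf((1 + probability) / 2)
    # Mean squared difference combines bias and random error.
    return z * sqrt(mean_diff ** 2 + sd_diff ** 2)

print(round(tdi(8.9, 24.8, 0.80), 1))  # 33.8
print(round(tdi(8.9, 24.8, 0.95), 1))  # 51.6
```

Both values agree with the TDI entries tabulated for that study. The approximation is expected to be less accurate when the mean difference is large relative to the SD.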


Barnhart et al. , who compared the pros and cons of various reproducibility metrics, concluded that the CP is the preferred index for assessing reproducibility. This conclusion was based on the computational simplicity of this index, its ability to identify discordant measurements to provide guidance for review and retraining, and its consistent evaluation of data quality across multiple reviewers, populations, and continuous as well as categorical data.


All statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC).




Results


Cohort Selection


Figure 1 details the search strategy used to identify 20 unique studies reporting reproducibility data for LV end-diastolic volume, LV end-systolic volume, LV ejection fraction, LV mass, LV outflow tract diameter, and mitral diastolic inflow peak early velocity. Individual study details regarding reproducibility types, number and type of observer, sample size, and statistical indices used are listed in Supplemental Table 1 .


Interestingly, there were a variety of different approaches to reproducibility assessments (interacquisition [ n = 1], interobserver only [ n = 1], both interobserver and intraobserver [ n = 6], intersite variability [ n = 1], intrasubject-interstudy variability [ n = 5], and temporal drift [ n = 2]), with different numbers of subject echocardiograms reviewed for reproducibility testing, ranging from 10 to 83 (or ranging from <1% to 100% of the overall study sample size), and different types of observers (sonographers, physicians, core laboratory, and site) ( Supplemental Table 1 ). Reported metrics included mean difference and its SD ( n = 7 articles), ICC ( n = 5), CV ( n = 4), and LOA ( n = 4). Finally, none of the 20 studies cited a benchmark to allow the reader to interpret their findings regarding the quality of the reproducibility.


Comparison of Reproducibility Results


Of the 20 unique studies identified in Figure 1 , sufficient quantitative echocardiographic data on reproducibility were available for us to compute the missing reproducibility metrics on LV volumes ( n = 6), LV ejection fraction ( n = 7), LV mass ( n = 7), LV outflow tract diameter ( n = 3), and mitral Doppler peak early velocity ( n = 3).


The range of reproducibility values reported or computed for LV end-diastolic volume is shown in Table 1 . Reported values are in boldface type, while computed values based on reported data are in regular type. We also reported unpublished results (in regular type) on the basis of the actual data from the Duke Clinical Research Institute Imaging Core Laboratory. The latter data are included because they are readily available, represent an in-depth assessment of intraobserver and interobserver reproducibility, and, unlike published studies, provide a complete data set that allows the calculation of all indices without any assumptions or extrapolations. For example, with the complete data from Duke, the actual CP curves are displayed rather than the extrapolated, smoothed versions due to best-case scenario assumptions we made from the available data in the published articles. Mean difference ranged from −4.3 to 8.9 mL, while the SD of differences ranged from 7.8 to 24.8 mL and ICC values ranged from 0.56 to 0.98. The CV ranged from 4.88% to 18.8%, while the CP (for an acceptable paired difference of 30 mL) ranged from 0.74 to 1.0.



Table 1

Reported and calculated measures of echocardiographic reproducibility of LV end-diastolic volume measurements

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, mL (SD) | Reproducibility: n | Mean difference, mL (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Hypertension; intraobserver/interstudy | 81 | 132.5 (40.0) | 81 | 0.0 (21.2) | 0.86 | −41.5, 41.5 | 11.3 | 0.844 | 27.1, 41.5 |
| B. Myocardial infarction and normal; interobserver | 24 | 184.8 (45.6) | 24 | 0.0 (12.8) | 0.96 | −25.0, 25.0 | 4.9 | 0.981 | 16.4, 25.0 |
| C. Mixed LV hypertrophy, heart failure, and normal volunteers; intraobserver/interstudy | 60 | 180.0 (65.0) | 60 | 0.9 (13.5) | 0.98 | −25.6, 27.4 | 5.3 | 0.973 | 17.3, 26.5 |
| D. Heart failure; interobserver | 1,460 | 222.4 (68.8) | 67 | 8.9 (24.8) | 0.94 | −39.7, 57.5 | 7.9 | 0.744 | 33.8, 51.6 |
| E. Cancer; interobserver | 56 | 88.0 (25.0) | 56 | 0.0 (23.3) | 0.56 | −45.7, 45.7 | 18.8 | 0.988 | 29.9, 45.7 |
| F. Elderly community population; intraobserver | 40 | 95.7 (39.0) | 20 | −4.3 (7.8) | 0.98 | −19.6, 11.0 | 5.8 | 0.801 | 11.5, 17.4 |
| G. Mixed normal volunteers, heart failure, aortic stenosis; interobserver | 10 | 198.3 (61.3) | 10 | 0.31 (17.1) | 0.96 | −28.3, 28.9 | 6.1 | 1.000 | 22.9, 38.0 |

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 30 mL.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Figure 3 displays the computed CP curves for a range of pairwise differences for echocardiographically determined LV end-diastolic volume. If an acceptable pairwise difference of 60 mL is selected, then all six studies achieved a CP of almost 100%. However, if an acceptable difference of 10 mL is selected, then the CPs ranged from about 25% to about 75%. The dotted vertical line represents an acceptable pairwise difference of 30 mL, a value for which the majority of studies achieved a CP of >80%.
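The normal-distribution approach used for the computed CP curves can be sketched in Python from a reported mean difference and SD alone. The function name is hypothetical, and the calculation assumes normally distributed differences, as stated in the Methods. The example evaluates study D of Table 1 (mean difference 8.9 mL, SD 24.8 mL) at the three acceptable differences discussed above.

```python
from statistics import NormalDist

def normal_cp(mean_diff, sd_diff, acceptable_difference):
    """CP under normality: probability that the difference lies within +/- the limit."""
    dist = NormalDist(mean_diff, sd_diff)
    return dist.cdf(acceptable_difference) - dist.cdf(-acceptable_difference)

# One point on the CP curve per acceptable difference (mL).
for limit in (10, 30, 60):
    print(limit, round(normal_cp(8.9, 24.8, limit), 3))
```

At 30 mL this gives approximately 0.744, the CP tabulated for that study. Evaluating the function over a fine grid of acceptable differences traces the full smoothed curve.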




Figure 3


A comparative graphical display of the reproducibility data for echocardiographically determined LV end-diastolic volume is shown for the studies identified in Table 1. The computed CP curves (in percentage on the y axis) for a range of pairwise differences (in milliliters on the x axis) for echocardiography-determined LV end-diastolic volume are based on an assumed normal distribution of the difference (except for study G, for which the curve was estimated from the data). If an acceptable pairwise difference of 60 mL is selected, then all seven studies achieved a CP of almost 100%. However, if an acceptable pairwise difference of 10 mL is selected, then the CPs ranged from about 25% to about 75%. The dotted vertical line represents an acceptable pairwise difference of 30 mL, a value for which the majority of studies achieved a CP of >80%.


Similarly, the ranges of reproducibility values reported or computed for echocardiographically determined LV end-systolic volume, LV ejection fraction, LV mass, LV outflow tract diameter, and mitral valve Doppler peak early diastolic velocity are shown in Tables 2–6, respectively. Reported values are in boldface type, while computed values are not.



Table 2

Reported and calculated measures of echocardiographic reproducibility of LV end-systolic volume measurements

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, mL (SD) | Reproducibility: n | Mean difference, mL (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Hypertension; intraobserver/interstudy | 75 | 31.3 (15.8) | 75 | 0.0 (11.6) | 0.73 | −22.7, 22.7 | 26.2 | 0.990 | 14.8, 22.7 |
| B. Myocardial infarction and normal; interobserver | 24 | 100.4 (46.7) | 24 | 0.0 (4.5) | 1.00 | −8.9, 8.9 | 3.2 | 1.000 | 5.8, 8.9 |
| C. Mixed LV hypertrophy, heart failure, and normal volunteers; intraobserver/interstudy | 60 | 87.0 (70.0) | 60 | 0.9 (14.0) | 0.98 | −26.5, 28.3 | 11.4 | 0.968 | 18.0, 27.5 |
| D. Heart failure; interobserver | 1,460 | 160.7 (60.4) | 67 | 7.6 (23.3) | 0.93 | −38.1, 53.3 | 10.3 | 0.779 | 31.5, 48.0 |
| E. Cancer; interobserver | 56 | 37.0 (17.0) | 56 | 0.0 (10.5) | 0.81 | −20.5, 20.5 | 20.0 | 0.996 | 13.4, 20.5 |
| F. Elderly community population; intraobserver | 40 | 41.4 (36.8) | 20 | 0.4 (5.2) | 0.99 | −9.8, 10.6 | 8.9 | 1.000 | 6.7, 10.2 |
| G. Mixed normal volunteers, heart failure, aortic stenosis; interobserver | 10 | 133.4 (62.6) | 10 | 2.6 (12.2) | 0.98 | −19.7, 24.8 | 6.5 | 1.000 | 13.7, 21.2 |

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 30 mL.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Table 3

Reported and calculated measures of echocardiographic reproducibility of LV ejection fraction

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, % (SD) | Reproducibility: n | Mean difference, % (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Hypertension; intraobserver/interstudy | 78 | 77.3 (7.8) | 78 | 0.0 (3.3) | 0.91 | −6.4, 6.4 | 3.0 | 0.998 | 4.2, 6.5 |
| B. Myocardial infarction and normal; interobserver | 24 | 47.3 (11.2) | 24 | 0.0 (3.8) | 0.94 | −7.4, 7.4 | 5.7 | 0.992 | 4.9, 7.4 |
| C. Mixed LV hypertrophy, heart failure, and normal volunteers; intraobserver/interstudy | 60 | 57.0 (20.0) | 60 | −0.3 (6.1) | 0.95 | −12.3, 11.7 | 7.6 | 0.898 | 7.8, 12.0 |
| D. Oncology; interobserver | 35 | 59.6 (7.3) | 35 | −0.2 (9.5) | 0.79 | −18.8, 18.4 | 11.3 | 0.707 | 12.2, 18.6 |
| E. Aortic stenosis; interobserver | 592 | 52.6 (13.6) | 30 | 0.0 (6.4) | 0.89 | −12.5, 12.5 | 8.6 | 0.883 | 8.2, 12.5 |
| F. Cancer; interobserver | 56 | 61.0 (6.0) | 56 | 0.0 (5.7) | 0.56 | −11.1, 11.1 | 6.6 | 0.923 | 7.3, 11.1 |
| G. Elderly community population; intraobserver | 40 | 59.5 (13.9) | 20 | −2.3 (5.2) | 0.93 | −12.5, 7.9 | 6.2 | 0.922 | 7.3, 11.1 |

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 10%.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Table 4

Reported and calculated measures of echocardiographic reproducibility of LV mass

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, g (SD) | Reproducibility: n | Mean difference, g (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Hypertension; intraobserver/interstudy | 74 | 321.5 (79.5) | 74 | 0.0 (42.3) | 0.86 | −83.0, 83.0 | 9.3 | 0.363 | 54.3, 83.0 |
| B. Myocardial infarction and normal; interobserver | 24 | 201.8 (44.7) | 24 | 0.0 (23.7) | 0.86 | −46.5, 46.5 | 8.3 | 0.600 | 30.4, 46.5 |
| C. LV hypertrophy; intraobserver/interstudy | 366 | 242.0 (53.5) | 183 | −1.7 (19.8) | 0.93 | −40.5, 37.1 | 5.8 | 0.686 | 25.5, 39.0 |
| D. Young adults; interobserver | 1,189 | 138.5 (38.6) | NR | 0.0 (20.8) | 0.85 | −40.7, 40.7 | 10.6 | 0.664 | 26.6, 40.7 |
| E. Mixed LV hypertrophy, heart failure, and normal volunteers; intraobserver/interstudy | 60 | 195.0 (51.0) | 60 | 8.7 (25.0) | 0.88 | −40.3, 57.7 | 9.1 | 0.549 | 34.0, 51.8 |
| F. Elderly community population; intraobserver | 40 | 156.7 (82.5) | 20 | −3.2 (20.2) | 0.97 | −42.8, 36.4 | 9.1 | 0.672 | 26.2, 40.1 |
| G. Mixed normal volunteers, heart failure, aortic stenosis; interobserver | 10 | 257.1 (46.0) | 10 | −2.27 (24.9) | 0.85 | −45.1, 40.6 | 6.8 | 0.621 | 33.2, 53.3 |

NR , Not reported.

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 20 g.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Table 5

Reported and calculated measures of echocardiographic reproducibility of LV outflow tract dimension

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, cm (SD) | Reproducibility: n | Mean difference, cm (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Aortic stenosis; interobserver | 20 | 2.1 (0.2) | 20 | 0.0 (0.1) | 0.83 | −0.2, 0.2 | 3.4 | 0.953 | 0.1, 0.2 |
| B. Mixed aortic stenosis and normal volunteers; interobserver | 50 | 2.1 (0.2) | 50 | 0.1 (0.2) | 0.74 | −0.2, 0.4 | 5.0 | 0.706 | 0.3, 0.4 |
| C. Aortic stenosis; interobserver | 10 | 2.1 (0.1) | 10 | 0.0 (0.1) | 0.62 | −0.2, 0.2 | 3.0 | 0.967 | 0.1, 0.2 |

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 0.2 cm.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Table 6

Reported and calculated measures of echocardiographic reproducibility of mitral Doppler peak early diastolic velocity

| Data source: population studied; reproducibility type | Overall population: n | Overall population: mean, cm/sec (SD) | Reproducibility: n | Mean difference, cm/sec (SD) | ICC | 95% LOA | CV (%) | CP | TDI 80%, 95% |
|---|---|---|---|---|---|---|---|---|---|
| A. Hypertension; intraobserver/interstudy | 88 | 53.4 (15.1) | 88 | 0.0 (12.9) | 0.64 | −25.3, 25.3 | 17.1 | 0.302 | 16.5, 25.3 |
| B. General population; interobserver | 3,022 | 68.1 (15.7) | 58 | 0.0 (7.8) | 0.88 | −15.3, 15.3 | 8.1 | 0.478 | 10.0, 15.3 |
| C. Elderly community population; intraobserver | 40 | 67.0 (11.0) | 20 | 2.1 (2.2) | 0.98 | −2.2, 6.4 | 2.3 | 0.906 | 4.0, 5.7 |

Reported data are in boldface type, and extrapolated data are in regular type. General guidelines for statistical test interpretation are as follows: the ICC ranges between −1 and 1, with higher values indicating better reproducibility; the CV ranges from 0 to infinity, with smaller values indicating better reproducibility; the CP ranges from 0 to 1, with higher values indicating better reproducibility; and the TDI ranges from 0 to infinity, with smaller values indicating better reproducibility.

The CP is the proportion of subjects or values that fall within the preset acceptable paired absolute difference of 5 cm/sec.


The TDI is the absolute paired difference with the desired CP of 80% and 95%.



Correspondingly, Figures 4–8 display the computed CP curves (y axis) for a range of pairwise differences (x axis) for echocardiographically derived LV end-systolic volume, LV ejection fraction, LV mass, LV outflow tract diameter, and mitral valve Doppler peak early diastolic velocity, respectively. A dotted vertical line represents an acceptable pairwise difference value for which the majority of studies achieved a CP of >80% for LV end-diastolic volume, LV end-systolic volume, LV ejection fraction, and LV outflow tract diameter, >50% for LV mass, and >45% for echocardiography-derived mitral valve Doppler peak early diastolic velocity.




Figure 4


A comparative graphical display of the reproducibility data for echocardiographically determined LV end-systolic volume is shown for the studies identified in Table 2 . The computed CP (in percentage on the y axis) for a range of pairwise differences (in milliliters on the x axis) is shown for echocardiography-determined LV end-systolic volume. If an acceptable pairwise difference of 60 mL is selected, then all seven studies achieved a CP of almost 100%. However, if an acceptable pairwise difference of 10 mL is selected, then the CP ranged from about 25% to >90%. Similarly, a dotted vertical line represents an acceptable pairwise difference of 30 mL, a value for which the majority of studies achieved a CP of >80%.



Figure 5


A comparative graphical display of the reproducibility data for echocardiographically derived LV ejection fraction is shown for the studies identified in Table 3. The computed CP (in percentage on the y axis) for a range of pairwise differences (in percentage on the x axis) is shown for echocardiography-determined LV ejection fraction. If an acceptable pairwise difference of 20% is selected, then all seven studies achieved a CP of >90%. However, if an acceptable pairwise difference of 5% is selected, then the CP ranged from 47% to 87%. A dotted vertical line represents an acceptable pairwise difference of 10%, a value for which the majority of studies achieved a CP of >80%.



Figure 6


A comparative graphical display of the reproducibility data for echocardiographically derived LV mass is shown for the studies identified in Table 4 . The computed CP (in percentage on the y axis) for a range of pairwise differences (in grams on the x axis) is shown for echocardiography-derived LV mass. If an acceptable pairwise difference of 40 g is selected, then all seven studies achieved a CP of >60%. However, if an acceptable pairwise difference of 5 g is selected, then the CP ranged from 9% to 20%. A dotted vertical line represents an acceptable pairwise difference of 20 g, a value for which the majority of studies achieved a CP of >60%.

Posted Apr 17, 2018, in CARDIOLOGY
