## Background

Interpretative variability can adversely affect echocardiographic reliability, but there is no widely accepted method to minimize variability and improve reproducibility.

## Methods

A continuous quality improvement process was devised that involves testing reproducibility by assessment of measurement differences followed by robust review, retraining, and retesting. Reproducibility was deemed acceptable if ≥80% of all interreader comparisons were within a prespecified acceptable difference. Readers not meeting this standard underwent retraining and retesting until acceptable reproducibility was achieved for the following parameters: left ventricular end-diastolic volume, biplane ejection fraction, mitral and aortic regurgitation, left ventricular outflow tract diameter, peak and mean aortic valve gradients, and aortic valve area. Eight hundred interreader comparisons for evaluation of reproducibility were generated from five readers interpreting 10 echocardiograms per testing cycle. The applicability and efficacy of this method were then evaluated by testing a second larger group of 10 readers and reevaluating reproducibility 1 year later.

## Results

All readers demonstrated acceptable reproducibility for biplane ejection fraction, mitral regurgitation, and peak and mean aortic valve gradients. Acceptable reproducibility for left ventricular end-diastolic volume, aortic regurgitation, and aortic valve area was achieved by four of five readers. No readers attained acceptable reproducibility on initial evaluation of left ventricular outflow tract diameter. After review and retraining, all readers demonstrated acceptable reproducibility, which was maintained on subsequent testing 1 year later. A second larger group of 10 readers was also evaluated and yielded similar results.

## Conclusions

A continuous quality improvement process was devised that successfully reduced interpretative variability in echocardiography and improved reproducibility, and this improvement was sustained over time.

Echocardiography provides essential structural and functional information about cardiac pathophysiology that guides clinical management. However, technical and interpretative variability can limit the reliability of echocardiographic results. The American Society of Echocardiography delineates several echocardiographic parameters that should be assessed on quality reviews, including an annual evaluation of interobserver and intraobserver reproducibility for left ventricular (LV) ejection fraction (EF) and valve regurgitation. However, discrete standards that define acceptable reproducibility for individual echocardiographic parameters are lacking. The Intersocietal Accreditation Commission accreditation process also requires routine evaluation of echocardiographic variability and mandates the adoption of a quality improvement (QI) program, but there is no widely accepted approach that can be easily implemented to accomplish these goals.

Many potentially lifesaving clinical decisions are directly influenced by the echocardiographic assessment of cardiac function, including the timing of valve replacement, chemotherapy regimens, and the insertion of implantable cardioverter-defibrillators for the primary prevention of sudden cardiac death. Additionally, there is increasing competition for quantitative assessments from alternative noninvasive modalities and a national imperative to minimize duplicative testing. Thus, it is crucial that echocardiography yield reliable and reproducible quantitative and qualitative results.

We hypothesized that interpretative variability in echocardiographic laboratories could be improved by integrating the statistical assessment of reproducibility into a feasible, operational framework applied across a spectrum of disease severities encountered in clinical practice. In this study, we devised and prospectively tested a method for continuous QI implementation to standardize the assessment of interpretative variability and improve echocardiographic reproducibility.

## Methods

We aimed to quantify interpretative variability and improve interreader reproducibility through a cyclical process of testing, review, and retraining (Figure 1). This study was approved by the Institutional Review Board at Duke University Medical Center. All readers gave their consent for participation.

## Reproducibility Testing, Review, and Retraining

For the initial implementation testing, five readers were selected to participate: three registered diagnostic cardiac sonographers with 2 to 20 years of prior experience and two cardiologists with board certification in adult echocardiography and 1 to 32 years of experience in echocardiographic interpretation. Readers initially interpreted 10 previously acquired, deidentified two-dimensional transthoracic echocardiograms for the following parameters: LV end-diastolic volume (LVEDV), biplane EF, mitral regurgitation (MR), aortic regurgitation (AR), LV outflow tract (LVOT) diameter, peak and mean aortic valve gradients, and aortic valve area (AVA). These parameters were chosen on the basis of national guideline recommendations, association with clinical outcomes, and importance in preprocedural planning for common cardiovascular interventions. The test echocardiograms were selected to have adequate views for interpreting all eight parameters and covered a range of disease severities and image quality. Specifically, the test echocardiograms used to assess reproducibility depicted both normal and abnormal LV function (EF range, 15% to 65%) and contained varying degrees of MR, AR (none to severe), and aortic stenosis (AVA range, 0.6–3.0 cm²).

EF and LVEDV were measured by tracing the LV endocardial border and calculated using the modified Simpson biplane technique. MR and AR severity was semiquantitatively evaluated on the basis of a combination of quantitative methods (proximal isovelocity surface area, vena contracta, and flow reversal) when available and expert visual interpretation. Valvular regurgitation was then graded on a five-grade scale: none = 0, trace = 1, mild = 2, moderate = 3, and severe = 4. Linear caliper measurements were used to measure LVOT diameter. Mean aortic valve gradient and velocity-time integral were measured by tracing Doppler velocity profiles. The peak aortic valve gradient was derived from the peak velocity across the aortic valve. AVA was calculated by the continuity equation using velocity-time integrals. Each reader interpreted the test echocardiograms independently and recorded his or her measured values on a data collection form. The test echocardiograms with each reader’s measurements were saved with unique identifiers.
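The derived aortic valve quantities described above can be sketched in a few lines. This is an illustration only, not the laboratory's software: it assumes the conventional simplified Bernoulli relation for the peak gradient and a circular LVOT cross section for the continuity equation, and the function names and input values are hypothetical.

```python
import math

def peak_gradient_mmhg(peak_velocity_m_s):
    """Peak gradient from peak velocity via the simplified Bernoulli
    equation (assumed here): gradient = 4 * v^2."""
    return 4.0 * peak_velocity_m_s ** 2

def aortic_valve_area_cm2(lvot_diameter_cm, vti_lvot_cm, vti_av_cm):
    """Continuity equation: AVA = LVOT area * VTI_LVOT / VTI_AV,
    with the LVOT area taken as a circle (assumption)."""
    lvot_area_cm2 = math.pi * (lvot_diameter_cm / 2.0) ** 2
    return lvot_area_cm2 * vti_lvot_cm / vti_av_cm

# Hypothetical measurements, not taken from the study
gradient = peak_gradient_mmhg(4.0)                   # 4 m/s jet
ava = aortic_valve_area_cm2(2.0, 20.0, 80.0)         # LVOT diameter 2.0 cm
ava_hi = aortic_valve_area_cm2(2.2, 20.0, 80.0)      # LVOT diameter 2.2 cm
relative_change = ava_hi / ava                       # (2.2 / 2.0) ** 2
```

Because the LVOT diameter enters the continuity equation squared, a 0.2-cm difference in diameter on these hypothetical inputs changes the computed AVA by about 21%, which is one reason LVOT diameter variability matters for AVA reproducibility.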

Reproducibility was evaluated by pairwise comparisons of the measurement difference (interpretative variability) among all readers for each echocardiographic parameter. The “acceptable difference,” or limit of measurement variability, for each parameter served as the benchmark for reproducibility (Table 1). Acceptable differences for each parameter were defined a priori on the basis of a literature review of reported variability and results from prior reproducibility testing in our laboratory and included 30 mL for LVEDV, 10% for biplane EF, one sequential grade for qualitative MR and AR, 0.2 cm for LVOT diameter, 20 and 10 mm Hg for peak and mean aortic valve gradients, respectively, and 0.3 cm² for AVA.

Echocardiographic parameter | Acceptable difference |
---|---|
LVEDV (mL) | 30 |
Biplane EF (%) | 10 |
MR | ≤1 sequential grade∗ |
AR | ≤1 sequential grade∗ |
LVOT diameter (cm) | 0.2 |
AV peak pressure gradient (mm Hg) | 20 |
AV mean pressure gradient (mm Hg) | 10 |
AVA (cm²) | 0.3 |

∗ On a five-grade scale (none, trace, mild, moderate, and severe).
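For anyone applying Table 1 programmatically, the thresholds can be held in a simple lookup. The sketch below is illustrative: the dictionary keys, the `within_acceptable` helper, and the sample values are hypothetical names chosen here. It assumes that regurgitation grades are coded 0 to 4 as described in the text and that a difference exactly equal to the limit counts as acceptable.

```python
# Acceptable differences transcribed from Table 1 (keys are hypothetical labels)
ACCEPTABLE_DIFFERENCE = {
    "LVEDV_mL": 30.0,
    "EF_pct": 10.0,
    "MR_grade": 1.0,
    "AR_grade": 1.0,
    "LVOT_cm": 0.2,
    "AV_peak_mmHg": 20.0,
    "AV_mean_mmHg": 10.0,
    "AVA_cm2": 0.3,
}

# Five-grade scale used for MR and AR
GRADE = {"none": 0, "trace": 1, "mild": 2, "moderate": 3, "severe": 4}

def within_acceptable(parameter, value_a, value_b):
    """True if two readers' values differ by no more than the Table 1 limit.
    A tiny tolerance absorbs floating-point rounding (e.g. 2.2 - 2.0)."""
    limit = ACCEPTABLE_DIFFERENCE[parameter]
    return abs(value_a - value_b) <= limit + 1e-9

# Hypothetical reader pairs
ok_lvedv = within_acceptable("LVEDV_mL", 120.0, 145.0)                    # differ by 25 mL
ok_mr = within_acceptable("MR_grade", GRADE["mild"], GRADE["severe"])     # two grades apart
ok_lvot = within_acceptable("LVOT_cm", 2.0, 2.2)                          # differ by 0.2 cm
```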

When all readers completed the echocardiographic interpretations, a blinded review of the paired data set and image measurements by the lead sonographer and senior cardiologist was performed to ensure adherence to expert recommendations and national guidelines and to assess for clustering of erroneous results. Subsequently, result reports were generated and the pairwise measurements among readers were displayed as dot plot graphs and tables. Every result report included the overall group results for all five readers. Additionally, each reader received his or her individual reproducibility results compared with all other readers for every parameter. The magnitude of measurement differences between reader pairs was also evident through visual inspection of the graphical results. Paired measurements whose difference exceeded the prespecified acceptable difference (outliers) were reviewed. If reproducibility was not acceptable for a reader, then a process of retraining and retesting ensued.

Retraining involved individualized review and group-based sessions led by a senior cardiologist and lead sonographer. During 1-hour group-based sessions, national guidelines and illustrative case examples were discussed to promote uniformity of interpretation, eliminate individual idiosyncrasies, and provide an open forum for questions. Individual retraining was conducted one on one with the lead sonographer. These 30-min sessions incorporated review of individual statistical results, which guided the retraining process by indicating the extent and direction of outlier pairs and identified the source of interpretation and/or measurement error. This was followed by analysis of the individual reader’s test images and stored results for parameters with unacceptable reproducibility. After retraining, all readers interpreted a different set of 10 test echocardiograms for parameters that had unacceptable reproducibility on the initial testing, and reproducibility was reevaluated. This cyclical QI process of image interpretation, reproducibility assessment, and retraining continued until reproducibility was acceptable for every reader on all parameters. To assess the durability and efficacy of this QI initiative, this process was repeated with the same readers 1 year later.

## Statistical Analysis

A recently published review critically evaluated and compared several statistical indices to determine the most appropriate analytic approach for the assessment of interreader reproducibility. The coverage probability (CP) method was preferred given its computational simplicity, rapid identification of group and individual reader variability, and broad applicability to different patient populations, clinical settings, and continuous and categorical variables. Therefore, CP analysis was used to evaluate interreader reproducibility for this study. CP is the probability that the difference between any two measurements of a parameter on the same echocardiogram is within an acceptable difference. Specifically, all possible interreader comparisons (100 for each parameter: 10 reader pairs on each of 10 echocardiograms) were examined to determine whether the measurement difference between paired readers was within the prespecified acceptable difference. The nonparametric approach was used to obtain the estimate of CP, which is the number of pairwise interreader comparisons within the acceptable difference divided by the number of all possible pairwise comparisons. Perfect reproducibility corresponded to 100% CP (i.e., CP = 1), indicating that all pairwise measurement differences were within the acceptable difference for that parameter.
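The nonparametric CP estimate amounts to counting. A minimal sketch follows, assuming measurements are stored per reader in echocardiogram order; the function name and the synthetic EF values are hypothetical and are not the study's data.

```python
from itertools import combinations

def coverage_probability(readings, limit):
    """readings: dict mapping reader -> list of measurements, one per echo.
    Returns (group CP, per-reader CP, number of pairwise comparisons)."""
    within = {r: 0 for r in readings}
    total = {r: 0 for r in readings}
    group_within = 0
    group_total = 0
    for a, b in combinations(readings, 2):          # every reader pair
        for x, y in zip(readings[a], readings[b]):  # same echocardiogram
            ok = abs(x - y) <= limit + 1e-9         # inclusive limit (assumed)
            group_within += ok
            group_total += 1
            for r in (a, b):                        # credit both readers
                within[r] += ok
                total[r] += 1
    individual = {r: within[r] / total[r] for r in readings}
    return group_within / group_total, individual, group_total

# Synthetic biplane EF values (%): readers B-E agree exactly, while
# reader A reads 12 points higher on the first three echocardiograms.
base = [15, 20, 25, 30, 35, 40, 45, 55, 60, 65]
readings = {r: list(base) for r in "BCDE"}
readings["A"] = [v + 12 for v in base[:3]] + base[3:]

group_cp, individual_cp, n_pairs = coverage_probability(readings, limit=10)
```

With five readers and 10 echocardiograms this yields 100 comparisons per parameter. In this synthetic case, 12 comparisons (all involving reader A) exceed the 10% limit, so the group CP is 0.88 and reader A's individual CP is 0.70, which falls below the 0.80 cut point.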

The prespecified acceptable differences and the cut point for the CP determine the standard for acceptable reproducibility. For this study, the cut point for the estimated CP was set at 0.80, meaning that reproducibility was considered acceptable if ≥80% of all possible pairwise comparisons were within the prespecified acceptable difference for each parameter. With acceptable reproducibility defined as an estimated CP ≥ 0.80, we were 95% confident that the true CP would be ≥72% (the lower limit of the 95% CI for an estimated CP of 0.80 is approximately 72% on the basis of 100 pairs from five readers each interpreting 10 echocardiograms per testing cycle). Additionally, all pairwise comparisons (100%) were required to be within twice the acceptable difference. To easily visualize the reproducibility data and rapidly identify outlier pairs, the results of the CP analysis were displayed graphically for continuous parameters and in tables for categorical parameters. The graphs for continuous parameters were generated by plotting all pairwise differences from a zero-difference center line with superimposed vertical lines of a positive or negative acceptable difference for each test echocardiogram. The tables contained the categorical results for each echocardiogram in rows, which were distributed among five columns indicating the measured severity (0 = none, 1 = trace, 2 = mild, 3 = moderate, and 4 = severe). To assess if reproducibility was influenced by underlying disease severity, we arranged the graphs and tables in descending order on the basis of extent of disease (i.e., largest LVEDV at the top and smallest LVEDV at the bottom in the graphs, most severe MR at the top and least severe MR at the bottom in the tables). The statistical platform used for this analysis was SAS version 9.2 (SAS Institute Inc, Cary, NC), but the basic arithmetic required for the CP method can also be calculated using Microsoft Excel.
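The approximately 72% lower limit quoted above can be reproduced with a normal-approximation (Wald) interval. The text does not state which interval method was used, so the Wald form is an assumption here, as is treating the 100 pairs as independent. The function names are hypothetical, and the second function simply combines the two stated rules (CP at or above the cut point, and every pair within twice the acceptable difference).

```python
import math

def wald_lower_limit(cp_hat, n_pairs, z=1.96):
    """Lower limit of a two-sided 95% Wald interval for a proportion (assumed method)."""
    standard_error = math.sqrt(cp_hat * (1.0 - cp_hat) / n_pairs)
    return cp_hat - z * standard_error

def reproducibility_acceptable(differences, limit, cut_point=0.80):
    """differences: all pairwise measurement differences for one parameter.
    Acceptable if CP >= cut_point AND every pair is within 2 * limit."""
    eps = 1e-9  # tolerance for floating-point rounding
    n = len(differences)
    cp_hat = sum(abs(d) <= limit + eps for d in differences) / n
    all_within_twice = all(abs(d) <= 2 * limit + eps for d in differences)
    return cp_hat >= cut_point and all_within_twice

lower = wald_lower_limit(0.80, 100)   # 0.80 - 1.96 * 0.04

# Hypothetical difference sets for a parameter with a limit of 10
passes = reproducibility_acceptable([5] * 80 + [15] * 20, limit=10)
fails_twice_rule = reproducibility_acceptable([5] * 80 + [15] * 19 + [25], limit=10)
fails_cut_point = reproducibility_acceptable([5] * 79 + [15] * 21, limit=10)
```

The second hypothetical set has the same CP as the first (0.80) but fails because one pair differs by more than twice the acceptable difference.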

To test the applicability of this analytic approach with a larger number of readers, a “reference cohort” method was used to simplify the statistical analysis. Ten readers (four cardiologists and six sonographers ranging in expertise from novice to highly experienced) interpreted 10 transthoracic echocardiograms for the parameters of LVEDV and biplane EF. All possible pairwise interreader comparisons were evaluated. The top four readers, as defined by the four highest CPs, were designated the “reference cohort.” The four readers in the reference cohort were required to have an estimated CP ≥ 0.80 compared with one another. The remaining readers were subsequently compared with this reference cohort. Using the reference cohort method, a total of 300 interreader comparisons were generated for the assessment of reproducibility for each parameter. Reproducibility was deemed acceptable if the CP was ≥0.80. Any reader who did not achieve this standard underwent retraining and repeat assessment until reproducibility was acceptable for every reader.
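The reference cohort procedure can be sketched as follows. This is one possible reading of the steps described above, not the authors' SAS code: ranking readers by their CP against all other readers and breaking ties by input order are assumptions, and the function name and synthetic EF data are hypothetical.

```python
from itertools import combinations

def reference_cohort_analysis(readings, limit, cohort_size=4):
    """readings: dict mapping reader -> list of measurements, one per echo."""
    readers = list(readings)
    n_echo = len(readings[readers[0]])

    def cp_against(reader, others):
        flags = [abs(x - y) <= limit + 1e-9
                 for other in others
                 for x, y in zip(readings[reader], readings[other])]
        return sum(flags) / len(flags)

    # Rank each reader by CP against all other readers (assumed ranking rule)
    overall = {r: cp_against(r, [o for o in readers if o != r]) for r in readers}
    cohort = sorted(readers, key=overall.get, reverse=True)[:cohort_size]
    rest = [r for r in readers if r not in cohort]

    # Cohort members are compared with one another; the rest with the cohort
    within_cohort = {r: cp_against(r, [o for o in cohort if o != r]) for r in cohort}
    versus_cohort = {r: cp_against(r, cohort) for r in rest}
    n_comparisons = (len(list(combinations(cohort, 2))) * n_echo
                     + len(rest) * len(cohort) * n_echo)
    return cohort, within_cohort, versus_cohort, n_comparisons

# Synthetic EF values (%): R1-R8 agree; R9 reads high and R10 reads low
# by 12 points on the first three echocardiograms.
base = [15, 20, 25, 30, 35, 40, 45, 55, 60, 65]
readings = {f"R{i}": list(base) for i in range(1, 9)}
readings["R9"] = [v + 12 for v in base[:3]] + base[3:]
readings["R10"] = [v - 12 for v in base[:3]] + base[3:]

cohort, within_cohort, versus_cohort, n_comparisons = reference_cohort_analysis(
    readings, limit=10)
```

With 10 readers and 10 echocardiograms, the count is 6 cohort pairs × 10 plus 6 remaining readers × 4 cohort members × 10, which gives the 300 comparisons per parameter stated above, compared with 450 if all possible pairs were used.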

## Results

On the initial reproducibility testing, analysis of 800 pairwise comparisons (100 for each of the eight parameters) for all five readers demonstrated acceptable reproducibility for biplane EF, MR, and peak and mean aortic valve gradients, with individual CPs of ≥0.95, ≥0.85, ≥0.90, and ≥0.85, respectively (Table 2). The reproducibility results for MR are depicted in Table 3. One or more readers did not achieve a CP ≥ 0.80 for LVEDV, AR, LVOT diameter, and AVA (Table 2). For LVEDV, four of the five readers demonstrated acceptable reproducibility, with individual CPs ≥ 0.85, while the fifth reader, reader A, had a CP of 0.675. All interreader comparisons for LVEDV are illustrated in the dot plot in Figure 2a, which indicates the direction and absolute magnitude of the measurement difference between each pair of readers. Sixteen of 100 interreader comparisons exceeded the acceptable measurement difference for LVEDV (indicated in red), and most of these involved reader A. Figure 3a depicts the individual results for reader A when paired with all other readers. This dot plot illustrates that reader A tended to make smaller measurements for LVEDV across the spectrum of LV volumes compared with all other readers. On the initial evaluation of LVOT diameter, no reader had acceptable reproducibility (all individual CPs ≤ 0.70). The group results for LVOT diameter are depicted in the dot plot in Figure 4a, which shows considerable scatter among the reader pairs. The average time needed to interpret all parameters on the 10 test echocardiograms was 75.6 min (range, 51.3–92.5 min). There was no association between prior experience and reproducibility.

Reader | LVEDV | EF | MR | AR | LVOT diameter | AV mean pressure gradient | AV peak pressure gradient | AVA |
---|---|---|---|---|---|---|---|---|
Initial assessment | | | | | | | | |
A | 0.675 | 0.975 | 0.850 | 0.950 | 0.675 | 0.975 | 0.950 | 0.925 |
B | 0.850 | 0.950 | 0.925 | 0.825 | 0.600 | 0.975 | 0.925 | 0.900 |
C | 0.850 | 0.950 | 0.925 | 0.825 | 0.700 | 0.950 | 0.975 | 0.875 |
D | 0.875 | 0.950 | 0.925 | 0.725 | 0.625 | 0.850 | 0.900 | 0.800 |
E | 0.950 | 0.975 | 0.975 | 0.875 | 0.650 | 0.950 | 0.900 | 0.700 |
Group | 0.840 | 0.960 | 0.920 | 0.840 | 0.650 | 0.940 | 0.930 | 0.840 |
Postretraining | | | | | | | | |
A | 0.950 | NA | NA | 1.000 | 0.900 | NA | NA | 0.920 |
B | 0.925 | NA | NA | 1.000 | 0.850 | NA | NA | 0.920 |
C | 0.950 | NA | NA | 0.975 | 0.775 | NA | NA | 0.920 |
D | 0.950 | NA | NA | 0.975 | 0.850 | NA | NA | 0.880 |
E | 0.975 | NA | NA | 1.000 | 0.825 | NA | NA | 0.820 |
Group | 0.950 | NA | NA | 0.990 | 0.840 | NA | NA | 0.892 |
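A quick consistency check on the initial-assessment rows of Table 2: every pairwise comparison involves two readers, so if each reader takes part in the same number of comparisons (assumed here to be 40 per parameter, that is, four other readers on 10 echocardiograms), the group CP should equal the mean of the five individual CPs. The sketch below verifies this on values transcribed from the table; the variable names are hypothetical.

```python
# Individual CPs for readers A-E and the reported group CP (initial assessment)
initial = {
    "LVEDV":   ([0.675, 0.850, 0.850, 0.875, 0.950], 0.840),
    "EF":      ([0.975, 0.950, 0.950, 0.950, 0.975], 0.960),
    "MR":      ([0.850, 0.925, 0.925, 0.925, 0.975], 0.920),
    "AR":      ([0.950, 0.825, 0.825, 0.725, 0.875], 0.840),
    "LVOT":    ([0.675, 0.600, 0.700, 0.625, 0.650], 0.650),
    "AV mean": ([0.975, 0.975, 0.950, 0.850, 0.950], 0.940),
    "AV peak": ([0.950, 0.925, 0.975, 0.900, 0.900], 0.930),
    "AVA":     ([0.925, 0.900, 0.875, 0.800, 0.700], 0.840),
}

# For each parameter, does the mean of individual CPs match the group CP?
consistent = {
    name: abs(sum(cps) / len(cps) - group) < 1e-9
    for name, (cps, group) in initial.items()
}

# Reader A's LVEDV CP expressed as a count out of 40 comparisons (assumed)
reader_a_lvedv_within = round(0.675 * 40)
```

On these values the check passes for all eight parameters, and reader A's LVEDV CP of 0.675 corresponds to 27 of 40 comparisons within the acceptable difference.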