Rapid Responses published:
Teresa M Steffen, Professor, Program in Physical Therapy, Concordia University Wisconsin, Mequon, Wis; Megan Seney
csteffen1{at}wi.rr.com
Using an intraclass correlation coefficient (ICC[2,1]) rather than ICC(3,1) changed the ICCs for 13 of the 24 tests by less than one hundredth of a point. Using ICC(2,1) increased the reliability coefficients for the Berg Balance Scale and the Sharpened Romberg Test with eyes open, reducing the minimal detectable change (MDC) scores by 1 point each. ICC(2,1) decreased the remaining ICCs by one hundredth of a point, which increased the MDC scores of 6 tests by 1 point and left 2 unchanged; the MDC for the Six-Minute Walk Test (6MWT) increased to 86 meters.

Dr Stratford was sent the gait speed data so that he could apply his suggested ICC(2,2) formula for tests that incorporated averaged scores. This ICC formula is not available in the SPSS software we used. The analysis did not change the ICC values or the MDCs for the gait speed tests. Our article states that gait speed is the strongest gait outcome variable in the population with parkinsonism, and Stratford's analysis supports this.

We understand Stratford's suggestion that test-retest reliability should always use the ICC(2,k) formula. However, the article by Shrout and Fleiss¹ did not suggest an ICC formula for test-retest reliability, and changing the ICC formula had little effect on our study. Because the same rater performed the same test in each session, the formula for intrarater reliability, ICC(3,k), was used. We appreciate Stratford's correction that arose from our report of the 6MWT being the only test to demonstrate a small learning effect. The incorrect use of the ICC formula can affect test-retest reliability when a systematic error occurs.

Table 1 reports ICC(3,k), ICC(2,k), and minimal detectable change values using a 95% confidence interval (MDC95) for all the tests. Eleven MDC95 values had no change, 6 decreased, and 7 increased using ICC(2,k) rather than ICC(3,k).

Table 1.
a SF-36 = 36-Item Short Form Health Survey, UPDRS = Unified Parkinson Disease Rating Scale.

Teresa M Steffen and Megan Seney

References
1 Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.
Paul W. Stratford, Professor, McMaster University
stratfor{at}mcmaster.ca
To the editor:
Translating reliability coefficients into clinically meaningful representations of measurement error is a necessary and important step when the goal is to link clinical research to clinical practice. The study by Steffen and Seney¹ investigates the reliability of several balance and ambulation tests and converts the obtained coefficients into minimal detectable change (MDC) estimates. The authors apply Shrout and Fleiss² type 3,k intraclass correlation coefficients (ICC) to quantify relative reliability and, from these estimates, they calculate the standard error of measurement (SEM) to quantify measurement error in the same units as the original measurement. For some of the balance and ambulation tests, 2 trials were performed on each of 2 occasions (eg, Timed "Up & Go" Test [TUG]); for other tests (eg, Six-Minute Walk Test [6MWT]), a single measurement was performed on each of 2 occasions. In the former case, the authors reported a type 3,2 ICC; in the latter case, they presented a type 3,1 ICC.

The authors’ rationale for applying the type 3,k ICC was “The ICC (3,k) was used instead of the Pearson correlation coefficient (r) for test-retest reliability because it assesses rating reliability by comparing the variability of different ratings of the same subject with the total variation across all ratings and all subjects.” In fact, the type 3,1 ICC provides an estimate of reliability similar to the Pearson r because neither coefficient accounts for a systematic difference in scores between the replicate measures (eg, either trials or occasions in Steffen and Seney’s study). Presumably, in a test-retest reliability study one is interested in both systematic and random errors, and, if this is true, the type 2,k ICC is the better choice because it includes both sources of variance in the reliability coefficient calculation. When the systematic error is zero, the type 2,k and 3,k ICCs provide identical estimates of reliability. However, when systematic error is present, as in the case of Steffen and Seney’s 6MWT data, the type 2,k ICC will be less than the type 3,k ICC.

My second reflection addresses the use of the Shrout and Fleiss classification system in situations where two or more facets exist, such as for the TUG data. Here, the facets are trials and occasions. A dilemma occurs when attempting to interpret the meaning of the type 3,2 ICC reported by Steffen and Seney. It is not clear if the second digit (2) refers to 2 trials, 2 occasions, or 2 trials performed on each of 2 occasions (ie, a total of 4 measurements). I propose that a generalizability³ approach to the analysis has the potential to provide a clearer picture of the sources of variance, their magnitude, and the relative merits of averaging over either trials or occasions, or both.

To illustrate the points raised above, I have generated synthetic data for the TUG. Paralleling the design of Steffen and Seney, the synthetic data represent 2 TUG trials performed on each of 2 occasions for 10 persons. The data presented in Table 1 were contrived to illustrate a systematic difference between occasions, but no systematic difference between trials.

Table 1.
Table 2 reports the mean scores for trials and occasions. Of interest is that
the trial means averaged over occasions are almost identical; however, the occasion
means differ. Stated another way, a systematic difference exists between occasions,
but not between trials averaged over occasions.
Table 2.
Table 3 displays Shrout and Fleiss type 2,1 and type 3,1 ICCs obtained by performing
randomized block analysis of variance (ANOVA). Negative variance estimates were
set to zero for all analyses. Pearson r values also are reported in
this table. That the inter-trial type 2,1 and 3,1 ICCs are identical to 2 decimal
places reflects the similarity of trial means shown in Table 2. By contrast,
the inter-occasion means shown in Table 2 differed and this systematic difference
is not reflected in the type 3,1 ICC or in the Pearson r. Accordingly,
the type 3,1 ICC is greater than the type 2,1 ICC because the variance due to
occasion is greater than zero.
Table 3.
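The contrast between the type 2,1 and type 3,1 ICCs can be sketched in a few lines of Python. The block below is only an illustration: it applies the Shrout and Fleiss single-measure formulas to the mean squares of a two-way (persons by occasions) ANOVA. The function name `icc_2_1_and_3_1` and the scores are hypothetical; they are not the synthetic TUG data of Table 1, and the occasion 2 scores are deliberately shifted by 2 units to mimic a systematic difference between occasions.

```python
def icc_2_1_and_3_1(scores):
    """Return (ICC(2,1), ICC(3,1)) for a persons x occasions table.

    scores: list of rows, one row per person, one column per occasion.
    """
    n = len(scores)          # number of persons
    k = len(scores[0])       # number of occasions (or trials)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-persons mean square
    jms = ss_cols / (k - 1)               # between-occasions mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square

    # Type 2,1 keeps the occasion variance in the denominator; type 3,1 drops it.
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    return icc21, icc31


# Hypothetical scores: occasion 2 = occasion 1 + a 2-unit shift + small noise.
occasion_1 = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
noise = [0.5, -0.5] * 5
occasion_2 = [a + 2.0 + e for a, e in zip(occasion_1, noise)]
scores = [[a, b] for a, b in zip(occasion_1, occasion_2)]

icc21, icc31 = icc_2_1_and_3_1(scores)
print(f"ICC(2,1) = {icc21:.3f}, ICC(3,1) = {icc31:.3f}")
```

With a systematic shift between occasions, the between-occasions mean square exceeds the residual mean square, so the type 2,1 value comes out below the type 3,1 value.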
The following section illustrates a generalizability analysis that includes
both trials and occasions in a single analysis. I applied a 3-way random effects
ANOVA. The rationale for applying a random effects model was that I wished to
generalize beyond the persons, trials, and occasions composing the study sample.
The ANOVA and variance components were calculated using MINITAB statistical
software (Minitab Inc, Quality Plaza, 1829 Pine Hall Rd, State College, PA 16801-3008)
and the results appear in Table 4. Once again, negative variance estimates were
set to zero.
Table 4.
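For readers without MINITAB, the variance components of a fully crossed persons by trials by occasions design can be estimated from the ANOVA mean squares. The following is a minimal sketch, assuming a balanced design with one score per cell and the standard expected mean squares for a 3-way random effects model; it is not the MINITAB procedure itself. The function name `variance_components_pto` and the data are hypothetical, and negative estimates are set to zero as described above.

```python
from itertools import product


def variance_components_pto(x):
    """Variance components for x[person][trial][occasion], one score per cell."""
    n_p, n_t, n_o = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(n_p), range(n_t), range(n_o)))
    grand = sum(x[p][t][o] for p, t, o in cells) / len(cells)

    def mean_over(keep):
        """Cell means grouped by the index positions listed in `keep`."""
        totals, counts = {}, {}
        for idx in cells:
            key = tuple(idx[i] for i in keep)
            totals[key] = totals.get(key, 0.0) + x[idx[0]][idx[1]][idx[2]]
            counts[key] = counts.get(key, 0) + 1
        return {key: totals[key] / counts[key] for key in totals}

    m_p, m_t, m_o = mean_over([0]), mean_over([1]), mean_over([2])
    m_pt, m_po, m_to = mean_over([0, 1]), mean_over([0, 2]), mean_over([1, 2])

    # Sums of squares, accumulated cell by cell (balanced design).
    ss = dict.fromkeys(["p", "t", "o", "pt", "po", "to", "e"], 0.0)
    for p, t, o in cells:
        ep, et, eo = m_p[(p,)] - grand, m_t[(t,)] - grand, m_o[(o,)] - grand
        ept = m_pt[(p, t)] - grand - ep - et
        epo = m_po[(p, o)] - grand - ep - eo
        eto = m_to[(t, o)] - grand - et - eo
        res = x[p][t][o] - grand - ep - et - eo - ept - epo - eto
        for name, effect in zip(ss, (ep, et, eo, ept, epo, eto, res)):
            ss[name] += effect ** 2

    df = {
        "p": n_p - 1, "t": n_t - 1, "o": n_o - 1,
        "pt": (n_p - 1) * (n_t - 1), "po": (n_p - 1) * (n_o - 1),
        "to": (n_t - 1) * (n_o - 1),
        "e": (n_p - 1) * (n_t - 1) * (n_o - 1),
    }
    ms = {name: ss[name] / df[name] for name in ss}

    # Solve the expected-mean-square equations of the random effects model.
    raw = {
        "p": (ms["p"] - ms["pt"] - ms["po"] + ms["e"]) / (n_t * n_o),
        "t": (ms["t"] - ms["pt"] - ms["to"] + ms["e"]) / (n_p * n_o),
        "o": (ms["o"] - ms["po"] - ms["to"] + ms["e"]) / (n_p * n_t),
        "pt": (ms["pt"] - ms["e"]) / n_o,
        "po": (ms["po"] - ms["e"]) / n_t,
        "to": (ms["to"] - ms["e"]) / n_p,
        "e": ms["e"],
    }
    # Negative variance estimates are set to zero.
    return {name: max(0.0, value) for name, value in raw.items()}


# Hypothetical data: 10 persons, 2 trials, 2 occasions, with an occasion shift.
persons = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
shift = [0.0, 2.0]
data = [[[a + shift[o]
          + (0.5 if (p + o) % 2 == 0 else -0.5)       # person-by-occasion wobble
          + (0.1 if (p + t + o) % 2 == 0 else -0.1)   # small residual
          for o in range(2)] for t in range(2)] for p, a in enumerate(persons)]

for name, value in variance_components_pto(data).items():
    print(f"{name}: {value:.3f}")
```

The keys `p`, `t`, `o`, `pt`, `po`, `to`, and `e` stand for persons, trials, occasions, their two-way interactions, and the residual. The sketch requires at least 2 levels of every facet, otherwise some degrees of freedom are zero.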
Inspection of the variance components reveals the following important findings: (1) there is a large variance among persons and this is desirable; (2) the variance between trials averaged over occasions is zero (this reflects the near identical means reported in Table 2); (3) there is a relatively large variance due to occasions (this reflects the difference in occasion means reported in Table 2); (4) the person by occasion (P×O) variance is substantially greater than the person by trial (P×T) variance (this suggests that averaging over occasion will have a greater effect than averaging over trials); and (5) the residual error is relatively small compared to the person variance.
The variance components reported in Table 4 can be applied to calculate generalizability
coefficients that represent inter-trial and inter-occasion reliability. They
can also be used to examine the distinct effect of averaging over trials, occasions,
or both.
Equation 1:
The theoretical inter-trial reliability (generalizability) for a single trial
is obtained by substituting the variance components into Equation 1 and
by setting n_t (the number of trials) and n_o (the number of occasions) to 1.
The obtained value is .97 and this is
analogous to the Shrout and Fleiss type 2,1 inter-trial ICCs of .96 reported
in Table 3. The inter-trial reliability for an average of 2 trials can be obtained
by setting n_t to 2 and n_o to 1. This yields an inter-trial reliability of .98,
which is analogous to a Shrout and Fleiss type 2,2 ICC.
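As a rough cross-check on the effect of averaging, the Spearman-Brown formula steps a single-measurement reliability up to the reliability of an average of k measurements. This is a different route from the generalizability equations, and it agrees with them only under the assumption that every error term shrinks in proportion to the number of measurements averaged. The function name `spearman_brown` is hypothetical, and the inputs are the rounded coefficients quoted in this letter, so the results are approximate.

```python
def spearman_brown(single_reliability: float, k: int) -> float:
    """Reliability of the mean of k measurements, given single-measure reliability."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)


# Inter-trial: single trial .97, stepped up to an average of 2 trials.
inter_trial_avg2 = spearman_brown(0.97, 2)

# Inter-occasion: single occasion .74 (reported in the next paragraph),
# stepped up to an average over 2 occasions.
inter_occasion_avg2 = spearman_brown(0.74, 2)

print(round(inter_trial_avg2, 2), round(inter_occasion_avg2, 2))
```

Both stepped-up values round to the coefficients reported in the letter for 2 trials and for 2 occasions (.98 and .85).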
When the goal is to draw inferences about the change status of a person, as
is the case when MDC is applied, the inter-occasion reliability (generalizability)
coefficient is of interest. It is calculated by applying Equation 2. The theoretical
inter-occasion reliability for a single trial is obtained by substituting the
variance components into Equation 2 and by setting n_t and n_o
to 1. This gives an inter-occasion reliability of .74, which is the average of
the 2 inter-occasion reliability estimates reported in Table 3. The inter-occasion
reliability for a single trial performed on each of 2 occasions is obtained
by setting n_t to 1 and n_o to 2. This yields an inter-occasion
reliability of .85.
Equation 2:
Finally, one can examine the inter-occasion reliability for the average of
2 trials on each of 2 occasions. This is accomplished by setting n_t to 2 and
n_o to 2 in Equation 2. A value of .86 is obtained and, to my knowledge, there
is no equivalent Shrout and Fleiss coding scheme to represent this combination.
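To connect these coefficients back to minimal detectable change, the sketch below converts a reliability coefficient into an SEM and an MDC95 using the commonly cited formulas SEM = SD × √(1 − reliability) and MDC95 = 1.96 × √2 × SEM. The function names and the standard deviation of 4.0 seconds are hypothetical and are not taken from the letter or from Steffen and Seney's data. Holding the SD fixed across the 3 coefficients is a simplification, because the SD of averaged scores would differ somewhat from the SD of single scores.

```python
import math


def sem_from_reliability(sd: float, reliability: float) -> float:
    """Standard error of measurement, in the units of the original measure."""
    return sd * math.sqrt(1.0 - reliability)


def mdc95(sem: float) -> float:
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2.0) * sem


hypothetical_sd = 4.0  # seconds; an assumed value for illustration only

# Inter-occasion coefficients quoted above: 1 trial on 1 occasion,
# 1 trial on each of 2 occasions, 2 trials on each of 2 occasions.
for reliability in (0.74, 0.85, 0.86):
    sem = sem_from_reliability(hypothetical_sd, reliability)
    print(f"R = {reliability:.2f}  SEM = {sem:.2f} s  MDC95 = {mdc95(sem):.2f} s")
```

Under these assumptions, a higher inter-occasion reliability gives a smaller SEM and therefore a smaller MDC95.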
References
1 Steffen T, Seney M. Test-retest reliability and minimal
detectable change on balance and ambulation tests, the 36-Item Short-Form Health
Survey, and the Unified Parkinson Disease Rating Scale in people with parkinsonism.
Phys Ther. 2008 Mar 20 [Epub ahead of print].
2 Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing
rater reliability. Psychol Bull. 1979;86:420-428.
3 Brennan RL. Elements of Generalizability Theory.
Iowa City, Iowa: ACT Publications; 1983.