Test-Retest Reliability and Minimal Detectable Change on Balance and Ambulation Tests, the 36-Item Short-Form Health Survey, and the Unified Parkinson Disease Rating Scale in People With Parkinsonism

Teresa Steffen, Megan Seney

Abstract

Background and Purpose: Distinguishing between a clinically significant change and change due to measurement error can be difficult. The purpose of this study was to determine test-retest reliability and minimal detectable change for the Berg Balance Scale (BBS), forward and backward functional reach, the Romberg Test and the Sharpened Romberg Test (SRT) with eyes open and closed, the Activities-specific Balance Confidence (ABC) Scale, the Six-Minute Walk Test (6MWT), comfortable and fast gait speed, the Timed “Up & Go” Test (TUG), the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36), and the Unified Parkinson Disease Rating Scale (UPDRS) in people with parkinsonism.

Subjects: Thirty-seven community-dwelling adults with parkinsonism (mean age=71 years) participated. The Hoehn and Yahr Scale median score of 2 was on the lower end of the scale; however, the scores ranged from 1 to 4.

Methods: Subjects were tested twice by the same raters, with 1 week between tests. Test-retest reliability was calculated using intraclass correlation coefficients (ICCs). Minimal detectable change was calculated using a 95% confidence interval (MDC95).

Results: The ICCs for test-retest reliability were above .90 for the BBS, ABC Scale, SRT with eyes closed, 6MWT, and comfortable and fast gait speeds. The MDC95 values for those functional tests were: BBS=5/56, ABC Scale=13%, SRT with eyes closed=19 seconds, 6MWT=82 m, comfortable gait speed=0.18 m/s, and fast gait speed=0.25 m/s. The ICCs for test-retest reliability of SF-36 scores were above .80, with the exception of the social functioning subscale. The MDC95 values for the SF-36 ranged between 19% and 45%. The MDC95 values for the UPDRS Activities of Daily Living section, Motor Examination section, and total scores were 4/52, 11/108, and 13/176, respectively.

Discussion and Conclusion: Minimal detectable change values are useful to therapists in rehabilitation and wellness programs in determining whether change during or after intervention is clinically significant. High test-retest reliability of scores for the BBS, ABC Scale, SRT with eyes closed, 6MWT, and gait speed make them trustworthy functional assessments in people with parkinsonism. The SF-36 and UPDRS provide quality-of-life and disease severity rating values in the ongoing assessment of people with parkinsonism.

Physical therapists strive to create interventions that focus on improving a patient's functional ability. Function gained during or after therapy often is measured by change in scores on a functional assessment instrument over time.1 When results improve from one assessment to another, therapists often assume that the patient has progressed. Unfortunately, there is a chance the difference between assessments is a result of measurement error.2 A common problem involves deciding whether the results are clinically significant or an error in measurement. To determine whether an improvement is significant, researchers need to provide minimal detectable change (MDC) scores, by patient population, for tests. Minimal detectable change is defined as the minimal amount of change that is not due to variation in measurement.3

Clinicians can interpret MDC scores as the minimal change that is not due to error. Changes at or above the MDC level can be attributed to patient improvement on the test rather than measurement error. Measurement error includes expected or typical variability in patient performance. In the literature, various methods are utilized to calculate change scores, including the standard error of measurement (SEM), minimal clinically important difference (MCID), and smallest detectable difference (SDD). The SEM is calculated by multiplying the standard deviation by the square root of 1 minus the reliability coefficient; it reflects the stability or variability of response and indicates the range of scores that can be expected upon retesting.4 The MCID is the smallest meaningful change, as judged by the patient or experts in the field,3 and is determined by questioning or observing the patient. Some researchers refer to the MDC utilizing a 95% confidence interval (MDC95) as the SDD.5

Once the MDC is determined on a particular test for a given population, therapists can interpret whether the change score for their patient is at or above the minimal level of detectable change reported in the literature. If the patient's score is less than the MDC value, it is considered to be indistinguishable from measurement error. Accordingly, a patient who demonstrates less than the MDC value is viewed as not benefiting from the intervention. For example, following hip fracture, the MDC is 0.08 m/s for comfortable gait speed.6 If a patient's comfortable gait speed increases less than 0.08 m/s, the change is within measurement error, leading to the conclusion that a clinically significant change did not occur as a result of the therapeutic intervention.
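The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the patient values are hypothetical, and the use of the absolute change is a simplifying assumption. The 0.08 m/s threshold is the hip fracture value cited in the text.

```python
def exceeds_mdc(baseline: float, follow_up: float, mdc: float) -> bool:
    """Return True when the absolute change is at or above the MDC.

    Changes below the MDC are treated as indistinguishable from
    measurement error.
    """
    return abs(follow_up - baseline) >= mdc


# Hypothetical comfortable gait speeds (m/s) for a patient after hip fracture.
MDC_HIP_FRACTURE = 0.08  # value cited in the text

print(exceeds_mdc(0.60, 0.65, MDC_HIP_FRACTURE))  # change of about 0.05 m/s
print(exceeds_mdc(0.60, 0.70, MDC_HIP_FRACTURE))  # change of about 0.10 m/s
```

For tests on which lower scores indicate improvement, such as the TUG, the direction of the change would need to be interpreted separately.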

To evaluate MDC, researchers first must measure test-retest reliability. On functional tests, a 7-day separation period typically is used. Sources of error may include inconsistencies caused by the participant's physical or mental condition, variations in the testing procedure, or tester error. Maintaining consistency and using standardized protocols for testing, such as using the same tester, setup, testing order, and time of day, can improve test-retest reliability.

The MDC is based on the SEM and is calculated using the following formula3:

$$\mathrm{MDC} = z \times SD \times \sqrt{1 - r} \times \sqrt{2}$$

The z-score represents the confidence interval from a normal distribution, SD is the standard deviation at baseline, and r is the test-retest reliability coefficient. The multiplier of √2 is used to account for the additional uncertainty introduced by using difference scores from measurements at 2 points in time. Some researchers1,3 suggest using a confidence interval of 90% because it is more common in the literature; however, a confidence interval of 95% increases the precision of score estimation and is the SDD.3 Internal consistency, determined by the Cronbach alpha, of a multiple-item test such as the Berg Balance Scale (BBS), the Activities-specific Balance Confidence (ABC) Scale, the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36), and the Unified Parkinson Disease Rating Scale (UPDRS) sometimes replaces r or intraclass correlation coefficients (ICCs) in the MDC formula when test-retest reliability is not reported. Internal consistency is the extent to which multiple items within a scale or subscale measure one characteristic and nothing else. Internal consistency of a multiple-item test is considered high if it approaches a Cronbach alpha of .90 in a given population. When both internal consistency and test-retest reliability are reported, using the test-retest values is considered the more conservative approach to calculating MDC values.3 Test-retest reliability and internal consistency reliability are necessary forms of reliability that should be reported in multiple-item tests in which item scores are summed or averaged.
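The calculation described above can be sketched as follows. The function names are illustrative, the z-scores of 1.96 and 1.645 are the standard two-sided values for 95% and 90% confidence, and the standard deviation and reliability coefficient in the example are hypothetical rather than taken from any study.

```python
import math

Z_95 = 1.96   # z-score for a 95% confidence level
Z_90 = 1.645  # z-score for a 90% confidence level


def sem(sd_baseline: float, reliability: float) -> float:
    """Standard error of measurement: SD x sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd_baseline * math.sqrt(1.0 - reliability)


def mdc(sd_baseline: float, reliability: float, z: float = Z_95) -> float:
    """Minimal detectable change: z x SEM x sqrt(2)."""
    return z * sem(sd_baseline, reliability) * math.sqrt(2.0)


# Hypothetical gait speed data: baseline SD of 0.20 m/s, ICC of .96.
print(round(mdc(0.20, 0.96), 2))        # MDC95 in m/s
print(round(mdc(0.20, 0.96, Z_90), 2))  # MDC90 in m/s
```

Passing a Cronbach alpha in place of the test-retest coefficient would apply the same arithmetic, which is why a higher alpha yields a smaller MDC value.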

One patient population for which MDC scores would be useful in helping to distinguish actual change from measurement error is people with parkinsonism. Parkinsonism is a constellation of symptoms, including tremor, rigidity, bradykinesia (slow movements), and loss of postural reflexes. Although Parkinson disease (PD) is the most frequent cause of parkinsonism, it includes other diagnoses such as Parkinson-plus syndromes of progressive supranuclear palsy and corticobasal degeneration. Parkinsonism symptoms create functional limitations of balance—which often are measured in the clinic with the BBS, ABC Scale, Functional Reach Test (FRT), Romberg Test (RT), and Sharpened Romberg Test (SRT)—and difficulties in mobility—which often are measured with the Six-Minute Walk Test (6MWT), Timed “Up & Go” Test (TUG), and gait speed. The disease also affects quality of life (measured most often with the SF-36). Disease severity is measured in people with PD with the UPDRS.

Literature Review of MDC Values

Extensive literature searches were done to find all previous test-retest articles published up to March 2007 on each of the instruments listed above. This was done to determine whether MDCs could be calculated for each of the instruments. When test-retest ICCs or Pearson (r) values and standard deviations were reported in the literature, those values were used to calculate MDC95 for that study.

BBS

The BBS is a 14-item test, using ordinal scoring from 0 to 4 for each item, designed to measure static and dynamic standing balance. The total score range is 0 to 56, with higher scores indicating better balance. The internal consistency of the BBS is moderate to high, ranging from .85 to .98.7–12 In 3 studies,9,13,14 ICCs for test-retest reliability of .97 to .99 were reported in subjects with stroke and traumatic brain injury (TBI), respectively. One study performed over 1 week on 26 people with PD reported an ICC of .87.15 Four studies contained sufficient data to determine MDC95. The MDC95 values were 2 for 26 people with PD,15 5 for 24 elderly people with or without stroke,9 3 for 20 people with hemiparesis,13 and 4 for 5 people with TBI.14 All of these studies were performed 7 days apart. The high test-retest reliability, moderate to high internal consistency, and low MDC95 scores in these studies indicate the BBS is a valuable measure to monitor responsiveness to change in patients with neurological disease.

FRT

The FRT is a static balance test designed to measure margins of stability. Based on a review of 10 articles, test-retest reliability for functional reach has been shown to vary from low to high, with ICCs ranging from .42 to .93. Nine studies15–23 examined forward functional reach, and 1 study19 examined backward functional reach, with the time between tests varying greatly from 1 day to 1 month.15–23 Only 3 studies examining test-retest reliability had a sample size over 30.16,17,24 Three studies15,20,22 reported test-retest reliability in subjects with PD. One study of 26 subjects with idiopathic PD reported an ICC of .74 for the FRT with a testing interval of 1 week,15 and a study of 14 subjects with PD reported an ICC of .84 for the FRT with a testing interval of 1 day.20 Another study of 10 elderly subjects with no known neurological impairment and 20 subjects with PD, using a testing interval of a week, reported ICC (2,1) values of .62 for the subjects with no known neurological impairment, .93 for subjects with PD who had a history of falls, and .42 for subjects with PD with no history of falls.22

Of the current studies examining test-retest reliability of the FRT, 4 studies19,21–23 provided enough data to calculate MDC95, which ranged from 4 to 11 cm. Two studies reporting test-retest reliability of the FRT in 20 people with PD, with tests 1 week apart, demonstrated MDC95 values of 4 cm for people who had fallen, 8 cm for people who had not fallen, and 12 cm for 26 people with a diagnosis of idiopathic PD.15,22 Studies on forward functional reach have provided a wide range of MDC95 values for people with PD, and no MDC95 data on backward functional reach.

RT and SRT

The RT and SRT are tests of static balance that measure the ability to maintain balance or equilibrium with a narrowed base of support. Currently, there are no studies that have examined test-retest reliability of the RT or SRT for subjects with PD. In one study of 30 subjects with unilateral vestibular loss, aged 29 to 78 years, test-retest reliability values (ICC [2,2]) were .63 for the SRT with eyes open and .76 for the SRT with eyes closed.25 In a study of 20 subjects with central neurological dysfunction, aged 58 to 85 years, test-retest reliability values (ICC [2,2]) were .75 for the SRT with eyes open and .97 for the SRT with eyes closed.25 In 2 studies, one with 18 volunteers aged 24 to 39 years17 and one with 45 volunteers aged 55 to 75 years,26 test-retest reliability values (ICC [2,1]17 and r26) were .72 and .76 for the SRT with eyes closed and .90 for the SRT with eyes open. One study with a small sample size (n=12) used the coefficient of variation (CV), which indicates an association between 2 variables, and showed a high degree of variability, ranging from .14 to .86, between the tests on 5 consecutive days.27 In 2 studies, MDC95 scores for the SRT with eyes open ranged from 9 to 10 seconds,25 and MDC95 scores for the SRT with eyes closed ranged from 3 to 9 seconds.25,26 Test-retest studies are needed on the RT and SRT for populations with neurological disorders, including people with PD.

ABC Scale

The ABC Scale is a 16-item questionnaire used to measure balance confidence in specific situations, with scores ranging from 0% to 100%. Internal consistency of the ABC Scale in 4 studies19,28–30 ranged from .80 to .98. These 4 studies also addressed test-retest reliability of the ABC Scale. The time between testing dates ranged from 1 to 4 weeks in various populations, including personal care home residents, patients from outpatient clinics, and community-dwelling older adults. The test-retest reliability values (ICC [1,1],19 ICC,28 ICC[2,1],29 and r30) were .70 to .92.19,28–30 Two of the 4 studies28,29 had sample sizes greater than 30 subjects, and 3 studies19,29,30 provided enough data to calculate MDC95 scores of 18% to 38%. In these studies,19,29,30 MDC95 values also were calculated using the Cronbach alpha, with results ranging from 6% to 15%. Minimal detectable change values by patient diagnoses, including PD, are needed for the ABC Scale.

6MWT

The 6MWT tests endurance by measuring the maximum distance that a person can walk in 6 minutes. For some patients, it is a submaximal test of aerobic capacity. Eighteen articles were obtained on the test-retest reliability of the 6MWT.31–48 The majority of these studies used a 7-day interval between tests. None of the 18 studies evaluated test-retest reliability for individuals with PD. Two reliability studies of individuals with stroke demonstrated high ICCs of .99.31,32 In 2 studies, MDC95 values were 34 m without a report of days between tests31 and 36 m with a 7-day testing interval.32

Five studies33–37 assessed test-retest reliability for subjects with other neurological disorders, with ICCs ranging from .93 to .96. Sample sizes ranged from 12 to 25 subjects, and testing was conducted between 1 and 14 days apart. The MDC95 values were 20 m for subjects with chronic poliomyelitis,35 53 and 65 m for adults with cerebral palsy,34 71 m for subjects with acquired brain injury,33 and 106 m for subjects with multiple sclerosis.36

Five studies38–42 evaluated test-retest reliability of individuals with cardiac problems, with ICCs ranging from .88 to .97. The MDC95 values were 18 m for subjects with congestive heart failure (CHF),39 50 m for subjects with peripheral arterial occlusive disease,41 51 m for subjects with cardiac rehabilitation,38 and 74, 86, and 90 m for subjects with heart failure.40 One study42 reported an SEM of 15 m for patients with CHF. Standard deviations were variable, sample sizes ranged from 43 to 786 subjects, and days between tests ranged from 1 to 14.

Four studies examined test-retest reliability for subjects with lung disease, with ICCs ranging from .88 to .95.39,43,44 The MDC95 values were 53 and 63 m for subjects with chronic obstructive pulmonary disease (COPD),43,45 87 m for subjects with emphysema,44 and 18 m for subjects with lung disease.39 One study46 reported an MCID of 54 m for subjects with COPD. Standard deviations were variable, sample sizes ranged from 15 to 470 subjects, and days between tests ranged from 1 to 10.

Two studies47,48 assessed test-retest reliability in older adults, with ICCs of .87 and .93. The MDC values were 77 m for community-dwelling elderly subjects,48 89 m for those living in retirement homes,47 and 94 m for those living in community centers.47 Sample sizes ranged from 5 to 22 subjects, and days between tests ranged from 7 to 14. Overall, the 6MWT is a reliable test, with different MDC95 values by client population. Minimal detectable change studies with larger samples are needed on the 6MWT for people with neurological diseases, including those with PD.

TUG

The TUG is a mobility test for the geriatric population. It includes a sit-to-stand component as well as walking 3 m, turning, and returning to the chair. Three studies49–51 reported test-retest reliability ranging from .92 to .99 in older adults with arthritis in nursing homes or with multiple conditions. The MDC95 in one study of 78 participants was calculated as 15 seconds, with this large value being attributed to a large standard deviation of 17 seconds.50 A study of 9 men with PD, in Hoehn and Yahr (H&Y) stages 3 to 4 and tested over 7 days, had a test-retest reliability value of .75 and an MDC95 of 5 seconds.52 A study of 26 participants with idiopathic PD, tested over 7 days, reported an ICC of .88 and an MDC95 of 2 seconds.15 Studies with larger samples are needed to verify that a change of 2 to 5 seconds is a meaningful change score on the TUG for people in all stages of PD.

Gait Speed

Gait speed is a measure of overall walking performance, but does not include an endurance component. Both fast and comfortable gait speeds often are measured to ensure that patients have the ability to change walking speed. Fifteen articles13,15,21,32,37,53–62 were found on test-retest reliability of measurements for gait speed. Of these 15 studies, 6 studies21,32,37,57,58,60 also measured fast gait speed. The MDC95 scores were calculated from previous literature, using measured distances from 3.3 to 10 m. One study15 evaluated test-retest reliability in individuals with PD and reported an ICC of .81 and an MDC95 value of 0.19 m/s. Six studies assessed test-retest reliability in individuals with stroke13,32,53–55 or TBI,37 with ICC (3,2) values ranging from .94 to .98 for comfortable gait speed.13,32,37,53,55 One study37 assessed test-retest reliability of measurements of fast gait speed in people with TBI and reported an ICC of .96. The MDC95 values were 0.11 to 0.24 m/s for comfortable gait speed32,53,55 and 0.24 m/s for fast gait speed.32

Four studies56–59 examined test-retest reliability in community-dwelling elderly people. Intraclass correlation coefficients ranged from .79 to .95 overall,56–59 and ICCs for fast gait speed ranged from .87 to .97.57,58 The MDC95 values were 0.25 to 0.29 m/s for comfortable gait speed and 0.25 m/s for fast gait speed in individuals with or without assistive devices.57 In one study,58 MDC95 values of 0.06 to 0.14 m/s for comfortable gait speed and 0.08 to 0.15 m/s for fast gait speed were calculated for four 10-year cohort groups over the age of 60 years.

Three studies assessed test-retest reliability in patients with musculoskeletal problems, including osteoporosis,60 knee osteoarthritis,61 and hip fracture.21 Intraclass correlation coefficients ranged from .88 to .97 for comfortable gait speed60–62 and from .91 to .94 for fast gait speed.21,60 The MDC95 values for comfortable and fast gait speeds were 0.25 and 0.30 m/s, respectively, for people with osteoporosis60 and 0.49 and 0.51 m/s for people with hip fracture.21 Overall, the MDC95 values for both comfortable and fast gait speeds appeared to be about 0.25 m/s or less for populations tested to date, except for patients with hip fracture, with an MDC95 value of approximately 0.50 m/s.

SF-36

The SF-36 is a quality-of-life questionnaire developed as a part of the Medical Outcomes Study to assess 8 physical and mental health concepts as seen from the respondent's point of view. These concepts are: (1) limitations in physical activities because of health problems (Physical Functioning), (2) limitations in social activities because of physical or emotional problems (Social Functioning), (3) limitations in usual role activities because of physical health problems (Role–Physical), (4) bodily pain (Bodily Pain), (5) psychological distress and well-being (Mental Health), (6) limitations in usual role activities because of emotional problems (Role–Emotional), (7) energy and fatigue (Vitality), and (8) general health perceptions (General Health). These 8 domains are relevant to general functional status and well-being. The survey was designed for self-administration by people 14 years of age and older or for administration by a trained interviewer in person or by telephone. For each scale, item scores are coded, summed, and transformed, with final values (expressed as a percentage) ranging from 0 (worst health) to 100 (best health).
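The final step of the scoring described above, transforming a summed raw scale score to the 0 to 100 range, can be sketched as a linear rescaling. This is an illustrative assumption about the transformation rather than the official scoring algorithm; item recoding is omitted, and the function name and the raw score range in the example (a 10-item scale with items scored 1 to 3) are hypothetical inputs.

```python
def transform_scale(raw_score: float, lowest_raw: float, highest_raw: float) -> float:
    """Linearly rescale a summed raw scale score to 0 (worst) to 100 (best)."""
    if highest_raw <= lowest_raw:
        raise ValueError("highest_raw must exceed lowest_raw")
    if not lowest_raw <= raw_score <= highest_raw:
        raise ValueError("raw_score is outside the possible range")
    return (raw_score - lowest_raw) / (highest_raw - lowest_raw) * 100.0


# Hypothetical 10-item scale with items scored 1 to 3 (raw range 10 to 30).
print(transform_scale(20, 10, 30))
```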

No articles were found on the test-retest reliability of SF-36 scores in patients with PD; however, 17 articles were found for other populations. These articles studied patients with vestibular dysfunction63; veterans64,65; adults with an intensive care unit stay of greater than 24 to 48 hours66; patients with spinal cord injury67; patients with confirmed or suspected ischemic stroke68; patients with rheumatoid arthritis69; patients with systemic lupus erythematosus70; patients with knee disorders71; elderly patients72; patients with ulcerative colitis73; patients with low back pain, menorrhagia, suspected peptic ulcer, or varicose veins73; a nonclinical normative sample74; and general populations in China,75 the Basque region of Spain,76 the Hunter region of New South Wales, Australia,77 Japan,78 and Sheffield, United Kingdom.79 The time interval between tests in these studies often was 2 weeks.64,65,67,69,71,73,75,76,78,79 Other time intervals used were 4 weeks,72 3 weeks,68 1 week,66,70,73,74,77 and 2 days.63,71 Fourteen studies63–65,68–76,78,79 were self-administered with either paper or a computer, 2 studies66,67 were administered via telephone or personal interview, and 1 study77 was administered via self-administration and telephone interview.

Internal consistency was reported for some of the studies, and the values ranged from .84 to .98 for Physical Functioning, from .83 to .98 for Role–Physical, from .79 to .96 for Bodily Pain, from .72 to .95 for General Health, from .66 to .96 for Vitality, from .39 to .98 for Social Functioning, from .78 to .99 for Role–Emotional, and from .72 to .95 for Mental Health.65–68,70,72,73,75–79

The test-retest ICCs, Pearson (r) values, or Spearman (r) values reported in these studies ranged from .34 to .98 for Physical Functioning, from .36 to .97 for Role–Physical, from .35 to .95 for Bodily Pain, from .41 to .93 for General Health, from .36 to .93 for Vitality, from .05 to .96 for Social Functioning, from .23 to .99 for Role–Emotional, and from .30 to .95 for Mental Health.63–74,76–79

The MDC95 values ranged from 11 to 63 for Physical Functioning, from 23 to 81 for Role–Physical, from 19 to 54 for Bodily Pain, from 14 to 43 for General Health, from 13 to 49 for Vitality, from 16 to 85 for Social Functioning, from 11 to 110 for Role–Emotional, and from 12 to 57 for Mental Health.63–67,69–71,73,74,76,77 These large MDC95 values occurred because of large standard deviations within groups and the large range of test-retest reliability values among studies. Considering that people with PD often are treated over long periods of time, MDCs should be developed on quality of life (using SF-36 scales), with progress measured at regular intervals.

UPDRS

The UPDRS is the gold standard instrument used to measure disease severity in PD. It has 3 subscales: I—Mentation, Behavior, and Mood (range=0–16), II—Activities of Daily Living (ADL) (range=0–52), and III—Motor Examination (range=0–108). A total score (range=0–176) can be derived by summating the 3 subscales. Lower scores indicate a less involved disease process. The UPDRS has moderate internal consistency values across multiple studies in the 3 subscales and total score. A Cronbach alpha value of .79 has been reported for the Mentation, Behavior, and Mood subscale,80 Cronbach alpha values of .85 to .92 have been reported for the ADL subscale,80–83 Cronbach alpha values of .88 to .95 have been reported for the Motor Examination subscale,80,81,83–85 and a Cronbach alpha value of .96 has been reported for the total UPDRS score.86
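The way the total score is derived from the 3 subscales can be sketched as a range check followed by a sum. The subscale maxima come from the text; the dictionary keys, the function name, and the example scores are hypothetical.

```python
# Maximum score for each UPDRS subscale, as stated in the text.
UPDRS_MAX = {"mentation": 16, "adl": 52, "motor": 108}


def updrs_total(scores: dict[str, int]) -> int:
    """Validate each subscale score against its range and return the total."""
    if set(scores) != set(UPDRS_MAX):
        raise ValueError("expected exactly the 3 UPDRS subscales")
    for name, value in scores.items():
        if not 0 <= value <= UPDRS_MAX[name]:
            raise ValueError(f"{name} score {value} is out of range")
    return sum(scores.values())


# Hypothetical subscale scores for one participant.
print(updrs_total({"mentation": 2, "adl": 10, "motor": 21}))
print(sum(UPDRS_MAX.values()))  # maximum possible total score
```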

One study87 examined the test-retest reliability of UPDRS scores in 400 patients with early stage, mild PD who were not taking medications. The subjects were examined on 2 occasions, separated by an average of 15 days (SD=8). The ICC (1,1) values were .74 for the Mentation, Behavior, and Mood subscale, .85 for the ADL subscale, .90 for the Motor Examination subscale, and .92 for the total score. The calculated MDC95 values were 2, 4, 7, and 9, respectively.87 In 26 ambulatory subjects with idiopathic PD and no comorbidities, test-retest reliability with a 7-day interval between tests was .84 for the Motor Examination subscale and .74 for the total score.15 The MDCs were 13 and 15, respectively.15 Test-retest reliability of Motor Examination subscale scores was evaluated in 34 patients with advanced PD on 2 separate occasions, 1 to 3 weeks apart, with an ICC (3,1) of .90.88

Minimal detectable changes of 1 to 2 points for the Mentation, Behavior, and Mood subscale, 2 to 4 for the ADL subscale, 7 to 8 for the Motor Examination subscale, and 9 for the total UPDRS encompass the existing studies. Different versions of the UPDRS are being used in studies, and a shorter version is being developed. The lower reliability of the Mentation, Behavior, and Mood subscale scores suggests the need for caution when using reliability values to calculate an MDC value. Physical therapists are most interested in the Motor Examination subscale of the UPDRS to measure responsiveness to change.

The purpose of this study was to determine the MDC95 for people with parkinsonism on the following tests and measures: BBS, forward and backward functional reach, RT and SRT, ABC Scale, 6MWT, comfortable and fast gait speeds, TUG, the 8 subscales of the SF-36, and UPDRS (Mentation, Behavior, and Mood subscale, ADL subscale, Motor Examination subscale, and total score).

Method

Subjects

Participants were recruited via bulletin advertisements and flyer distribution at local fitness centers, physical therapy sites, meal sites throughout southeast Wisconsin, Wisconsin PD organizations, church bulletins, newspapers, and other local news media. Previous research study and pro bono clinic participants also were contacted, and advertisements were placed on the Concordia University Wisconsin Web site and in faculty bulletins.

Eligibility for the study was determined by the presence of a clinical diagnosis of PD or Parkinson-plus syndrome. All potential volunteers were contacted by telephone and given an oral questionnaire. Participants were included if they were able to stand independently for 1 minute and could walk independently with or without the use of an assistive device. Participants were excluded if they reported a history of a heart condition limiting their activity level, experienced a fall as a result of dizziness or fainting within the previous 2 months, or required help with following directions.

A demographic questionnaire (sex, date of birth, date of diagnosis with PD or Parkinson-plus syndrome, ethnicity, living situation, history of falling, other medical conditions, and current medications) was completed on the first day of testing and reviewed in the participant's presence with a researcher to ensure accuracy. Participants were reminded not to change medications during their scheduled test week and to take medications at the same time on both testing days.

During the spring of 2007, 37 participants with PD (n=35) or Parkinson-plus syndromes (n=2) met all inclusion criteria and consented to participate in this study. This sample reflected the general demographics of the PD population, with more men (n=26) than women (n=11) and an elderly age distribution (mean age=71 years, SD=12). There was a wide range of UPDRS total scores (mean=33/176, range=7–70), demonstrating a sample that captured a wide spectrum of disease severity. The average H&Y score was 2 (range=1–4). The distribution of H&Y stages was as follows: 13 subjects in stage 1, 7 subjects in stage 2, 9 subjects in stage 3, 8 subjects in stage 4, and no subjects in stage 5.

The average disease duration was 14 years (SD=6), and participants were primarily of white/non-Hispanic descent (n=36), with 1 participant of Asian/Pacific descent. Of the 37 participants, 32 were living at home with another person, 3 were living at home alone, and 2 were in assisted living facilities. The mean number of falls in the previous 6 months was 7 (range=0–182); 21 participants had experienced more than one fall. None of the participants changed medications during their testing week, and all participants reported taking medications at the same time on both testing days. Thirty-one participants were using levodopa, with an average of 412 mg/d (SD=310, range=125–1,150). Participants, on average, had 3 comorbidities (SD=2, range=0–6), including 17 with arthritis, 3 with asthma, 7 with a history of cancer, 11 with high blood pressure, 5 with low blood pressure, 1 with diabetes, 8 with a previous fracture, 9 with depression or other mental health condition, 6 with a history of heart disease, 7 with osteoporosis, 1 with stroke, and 14 with other, unspecified comorbidities.

Procedure

Testing was administered at Concordia University Wisconsin. Any classes scheduled to occur in the vicinity of the testing area were relocated to limit interruptions, and barriers were placed to ensure participant privacy. After participants signed consent forms, testing began with the SF-36 questionnaire and completion of the demographic information. Balance testing followed and consisted of 4 tests administered to each participant in the following order: BBS, forward and backward functional reach, RT and SRT (eyes open and eyes closed), and ABC Scale. The ambulation tests and the UPDRS were administered last and done in the following order: 6MWT, UPDRS, TUG, and comfortable and fast gait speeds. Total testing time per participant was approximately 1 hour on each testing day. Prior to testing, all researchers were trained in their assigned test, and they performed the same duties on each testing day. Researchers who collected the reliability data were monitored by the coinvestigators before and during the testing procedures to maintain accuracy. All researchers had previous patient experience using the functional tests. If an assistive device was used, the type was documented and the participant was required to use it on the subsequent testing day. Thirty-nine participants were scheduled for the study; 2 participants cancelled due to weather or transportation issues. Researchers did not have access to the previous test results on the second day of testing.

Test-retest reliability was established over a period of 7 days in all participants, with the exception of 1 participant, who was tested 10 days apart. Although a 14-day separation may be preferred for the SF-36 questionnaire, a 7-day interval was used based on previous test-retest studies of the other functional assessments.

Balance testing.

The method for the BBS test followed the original design,7 which consists of 14 items scored on a scale of 0 to 4. A score of 0 indicates the participant was unable to complete the task, and a score of 4 indicates the participant was able to complete the task based on the assigned criteria. The floor-to-seat height of the chair used on items 1, 3, 4, and 5 was 47 cm. The height of the chair without armrests used on item 5 was 44.5 cm, and the height of the step stool used on item 12 was 23 cm. A 1.27-cm (0.5-in) slipper was used on item 9. The participants were asked to perform each of the items on the original BBS, with rests as needed. The 14 items were scored by a total of 3 researchers. One researcher scored item 8 while the participants performed the FRT, another researcher scored items 7 and 13 while the participants performed the RT and SRT, and 1 researcher scored the remaining 11 items.

Equipment used for forward and backward functional reach included a level with attached wooden sliders fixed to an adjustable tripod with C rings. Participants were asked to make a fist, raise their dominant arm parallel to the floor with the elbow fully extended, and reach as far forward or backward as possible without losing their balance, lifting their feet off the ground, or touching the equipment. The foot placement and method of reach were not controlled, except to keep the arm at the height of the level. Participants who inquired about foot placement were instructed to stand in a comfortable position. Participants were allowed multiple practice trials until they performed the test correctly. Once a participant was able to perform the test correctly, 2 graded trials were completed. The dominant arm was recorded on the first testing day and used on the second testing day to maintain consistency. The averages of the 2 trials for each direction were used for data analysis, due to the high intratrial reliability reported in previous studies.10,89 Measurements were recorded (in centimeters) using the third metacarpal as the reference point. Two researchers participated in the data collection. One researcher gave instructions and maintained participant safety. The other researcher adjusted the equipment to match the participant's acromion height, adjusted the wooden slider during reach, and recorded initial and final measurements.

The RT was performed with feet together and eyes open for 60 seconds and with feet together and eyes closed for 60 seconds. The SRT was performed in a tandem standing position, with the dominant foot behind the nondominant foot for 60 seconds with eyes open and for 60 seconds with eyes closed. Timing started after the participant assumed the proper position and stopped if the participant moved his or her feet from the proper position, touched the table, or opened his or her eyes on the eyes-closed trials or when the maximum balance time of 60 seconds was reached. Participants were given assistance to assume the test position and allowed rest breaks if needed. Up to 3 trials were performed if the maximum balance time was not reached in either of the first 2 trials. Data analysis utilized the longest balance time of all of the trials. Upper-extremity use was not controlled during testing. One researcher administered the RT and SRT to all participants, while another researcher supervised participant safety.
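The scoring rule for the RT and SRT can be written out as a short sketch. The Python below is illustrative only and is not the authors' procedure; the function names `needs_third_trial` and `balance_score` are hypothetical, and capping recorded times at 60 seconds is an assumption based on the stated maximum balance time.

```python
MAX_BALANCE_TIME_S = 60.0  # maximum balance time stated in the protocol


def needs_third_trial(first_two_times_s):
    """Return True when neither of the first 2 trials reached the maximum time."""
    return all(t < MAX_BALANCE_TIME_S for t in first_two_times_s)


def balance_score(trial_times_s):
    """Return the longest balance time (seconds) across 1 to 3 trials."""
    if not 1 <= len(trial_times_s) <= 3:
        raise ValueError("expected between 1 and 3 trials")
    if any(t < 0 for t in trial_times_s):
        raise ValueError("balance times must be non-negative")
    # Assumption: recorded times are capped at the 60-second ceiling.
    return min(max(trial_times_s), MAX_BALANCE_TIME_S)
```

For example, hypothetical trial times of 12.5, 30.0, and 21.0 seconds would yield a score of 30.0 seconds.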

The ABC Scale was administered as an interview consisting of 16 items describing various activities for which participants are asked to rate their confidence in maintaining balance on a scale of 0% (not confident) to 100% (completely confident). Final scores were determined by calculating the average score on the 16 items. To assist participants, an enlarged version of a 0–100 scale was provided.
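The ABC Scale scoring described above amounts to a simple mean of the 16 item ratings. The following Python sketch shows that calculation under the assumption that all 16 items were answered; the function name `abc_scale_score` is hypothetical, and the handling of missing items is not addressed because it is not described here.

```python
from statistics import mean

ABC_ITEM_COUNT = 16  # number of items on the ABC Scale


def abc_scale_score(item_ratings):
    """Return the mean confidence rating (0% to 100%) across the 16 items."""
    if len(item_ratings) != ABC_ITEM_COUNT:
        raise ValueError("expected ratings for all 16 items")
    if any(not 0 <= r <= 100 for r in item_ratings):
        raise ValueError("each rating must be between 0 and 100")
    return mean(item_ratings)
```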

Ambulation.

All ambulation tests were performed on a level tile floor under quiet conditions. The 6MWT was conducted in a 3-m-wide hallway with a 15-m area marked off at 1-m intervals and large cones placed at each end. Participants were read the following instructions: “When I say ‘go,’ I want you to walk around this track. Keep walking until I say ‘stop’ or until you are too tired to go any further. If you need to rest, you can stop until you're ready to go again. I am interested in measuring how far you can walk. You can begin when I say ‘go.’” The following encouragements were provided: (1) after 1 minute, “You are doing well. You have 5 minutes to go.”; (2) at 2 minutes, “Keep up the good work. You have 4 minutes to go.”; (3) at 4 minutes, “Keep up the good work. You have 2 minutes left.”; and (4) at 5 minutes, “You are doing well. You have only 1 minute to go.” Fifteen seconds prior to completion, participants were informed that time would stop shortly, and the test was stopped at 6 minutes.90 Total distance walked was measured to the nearest meter.

For the TUG, participants were instructed to sit with their back against a chair (47 cm from floor to seat with armrests), feet behind the tape marker, and arms resting in their lap. Participants were instructed to independently rise on the word “go,” comfortably walk a clearly marked distance of 3 m, turn around a cone, walk back to the chair, and sit down with their back against the chair. Time started once the participant's back left the chair and ended when the participant's back returned to the chair. Time to complete the course was measured to the nearest 100th of a second. One practice trial and 2 timed trials were performed; the 2 timed trials were averaged for data analysis.

For the test of comfortable gait speed, participants were asked to walk 10 m and were instructed to “walk at your own comfortable walking speed and stop when you reach the far line.” For the test of fast gait speed, participants walked the 10 m with the instructions to “walk as fast as you can safely walk” and to stop at the far line. Time to complete the central 6 m was measured to the nearest 100th of a second using a stopwatch. Time started when any part of the foot crossed the plane of the first tapeline and ended when any part of the foot crossed the plane of the 6-m mark. Rest breaks were allowed between tests or trials, if needed. Participants completed 2 comfortable gait speed trials, followed by 2 fast gait speed trials. The 2 trials were averaged for data analysis, and gait speeds were calculated (in meters per second).
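As an illustration of the gait speed calculation, the Python sketch below divides the timed 6-m distance by the mean of the 2 trial times. Whether the times or the resulting speeds were averaged is not stated, so averaging the times first is an assumption, and the function name `gait_speed_m_per_s` is hypothetical.

```python
TIMED_DISTANCE_M = 6.0  # central portion of the 10-m walkway that was timed


def gait_speed_m_per_s(trial_times_s):
    """Return gait speed (m/s) from the times of 2 trials over the central 6 m."""
    if len(trial_times_s) != 2:
        raise ValueError("expected exactly 2 trials")
    if any(t <= 0 for t in trial_times_s):
        raise ValueError("trial times must be positive")
    # Assumption: the 2 trial times are averaged before speed is computed.
    mean_time_s = sum(trial_times_s) / len(trial_times_s)
    return TIMED_DISTANCE_M / mean_time_s
```

With hypothetical trial times of 4.0 and 6.0 seconds, the mean time is 5.0 seconds and the resulting speed is 1.2 m/s.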

SF-36.

The SF-36 was administered via personal interview by 2 researchers using the interview script provided in the SF-36 Health Survey: Manual and Interpretation Guide.91 Standard procedures for repeating questions and response choices were followed, as outlined in the SF-36 Health Survey: Manual and Interpretation Guide.91 The participants were able to choose from a typed list of response choices that was enlarged and placed on a table in front of them for each question. To avoid influencing the participants' answers on the SF-36, it was the first test given to each participant on both testing days, before they were asked any other health-related questions.

UPDRS.

The UPDRS subscales were administered as described by Goetz and colleagues,92 and a UPDRS total score was calculated based on the sum of the scores of the 3 subscales. The test was administered by 1 of 2 researchers, both of whom reviewed the UPDRS teaching videotape. The original 5-point (1–5) H&Y Scale staging of PD was used in the study.93 Higher scores on the H&Y Scale indicate greater impairment of PD.

Data Analysis

Internal consistency and test-retest reliability were calculated using SPSS (version 15.0) software.* Internal consistency, assessed using the Cronbach alpha, was calculated for multiple-item tests, such as the BBS, ABC Scale, SF-36, and UPDRS. Internal consistency of .70 or greater was required on the multiple-item test before other forms of reliability were considered trustworthy. The ICC (3,k) was used instead of the Pearson correlation coefficient (r) for test-retest reliability because it assesses rating reliability by comparing the variability of different ratings of the same subject with the total variation across all ratings and all subjects. For test-retest reliability, either a type 3,1 or type 3,2 ICC was used. The ICC (3,1) was used for the BBS, RT, SRT, ABC Scale, 6MWT, SF-36, and UPDRS because final scores on these tests were based on a single measure from one rater. The ICC (3,2) was used for the TUG, forward and backward functional reach, and comfortable and fast gait speeds because final scores for these tests were based on an average of 2 trials. Normal distribution was assessed for each outcome variable at test day 1 using a histogram plot. Data from 2 participants on the SF-36 and 1 participant on the ABC Scale were excluded from the data analysis due to the presence of cognitive deficits, as judged by the researchers administering the tests. Due to fatigue, the gait speed tests were not administered to one participant.
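As a rough illustration of the two statistics named above, the following Python sketch computes the Cronbach alpha and the ICC (3,1) and ICC (3,k) forms (two-way model, consistency definition) from a participants-by-columns table. It is not the SPSS procedure used in the study; the function names and any data passed to them are hypothetical, and the mapping of columns to items, trials, or test days is an assumption left to the reader.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """rows[i][j] is the score of participant i on item j."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def icc_3(rows):
    """Return (ICC(3,1), ICC(3,k)); rows[i][j] is participant i, measurement j."""
    n, k = len(rows), len(rows[0])
    grand_mean = mean([x for row in rows for x in row])
    row_means = [mean(row) for row in rows]
    col_means = [mean([row[j] for row in rows]) for j in range(k)]

    # Two-way analysis of variance sums of squares.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    between_ms = ss_rows / (n - 1)             # between-participants mean square
    error_ms = ss_error / ((n - 1) * (k - 1))  # residual mean square

    single = (between_ms - error_ms) / (between_ms + (k - 1) * error_ms)
    average = (between_ms - error_ms) / between_ms
    return single, average
```

For a given table, the ICC (3,k) value returned here equals the Cronbach alpha, which is one reason the single-measure and average-measure forms should not be compared directly.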

Results

Table 1 reports internal consistency for all multiple-item tests used in this study. All tests met the criterion of Cronbach alpha being .70 or greater, with the exception of day 1 for the Social Functioning subscale of the SF-36 and both days for the Mentation, Behavior, and Mood subscale of the UPDRS. Internal consistency from previous studies also is reported in Table 1. In previous studies, both the SF-36 Vitality and Social Functioning subscales have had internal consistency values less than .70.

Table 1.

Internal Consistency for Balance Tests, a Quality-of-Life Measure, and a Disease Severity Rating Scale in People With Parkinsonism (n=36–37)

Table 2 reports means, standard deviations, and confidence intervals from the first testing day, as well as the ICCs and MDC95 values for all tests and measures administered in this study. The 6MWT was the only test that demonstrated statistically higher retest values (t=−2.15, P<.04), indicating that learning could be a factor for this test.
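The reported t value is consistent with a paired comparison of day 1 and day 2 scores, although the exact procedure is not described. The Python sketch below shows how such a statistic could be computed, assuming a paired t test on day 1 minus day 2 differences (so that higher retest values give a negative t); the function name `paired_t` is hypothetical, and no study data are reproduced.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(day1_scores, day2_scores):
    """Return (t statistic, degrees of freedom) for day 1 minus day 2 differences."""
    if len(day1_scores) != len(day2_scores):
        raise ValueError("each participant needs a score on both days")
    differences = [a - b for a, b in zip(day1_scores, day2_scores)]
    n = len(differences)
    # t is the mean difference divided by its standard error.
    t_statistic = mean(differences) / (stdev(differences) / sqrt(n))
    return t_statistic, n - 1
```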

Table 2.

Sample Sizes, Means, Standard Deviations, 95% Confidence Intervals (CIs), Intraclass Correlation Coefficients (ICCs) for Test-Retest Reliability, and Minimal Detectable Changes (MDCs) for Balance and Ambulation Tests, a Quality of Life Measure, and a Disease Severity Rating Scale in People With Parkinsonism

The BBS and ABC Scale were the most reliable of the balance measures, with MDC95 values of 5 and 13, respectively. The BBS and ABC Scale both demonstrated a right-skewed distribution due to a ceiling effect on these scales.

Comfortable and fast gait speeds had the highest test-retest reliability, normal distributions, and MDC95 values of 0.18 and 0.25 m/s, respectively. The 6MWT had excellent test-retest reliability and a normal distribution, but a large standard deviation that produced a high MDC95 value of 82 m. The TUG displayed a right-skewed distribution, and its test-retest reliability was low compared with that of the 6MWT and gait speeds.
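The dependence of the MDC95 on both the standard deviation and the reliability coefficient can be seen in a short sketch. The Python below assumes the commonly used formulas, SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, which are not spelled out in this section; the function names and the input values in the example are hypothetical and are not taken from Table 2.

```python
from math import sqrt

Z_95 = 1.96  # z value for a 95% confidence level


def standard_error_of_measurement(sd, icc):
    """Return the SEM from a standard deviation and a reliability coefficient."""
    if not 0 <= icc <= 1:
        raise ValueError("ICC must be between 0 and 1")
    return sd * sqrt(1 - icc)


def mdc95(sd, icc):
    """Return the minimal detectable change at the 95% confidence level."""
    # sqrt(2) accounts for measurement error in both the test and the retest.
    return Z_95 * sqrt(2) * standard_error_of_measurement(sd, icc)
```

Under these assumptions, doubling the standard deviation doubles the MDC95 even when the ICC is unchanged, which is how a test with excellent reliability can still have a large MDC95.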

All 8 subscales of the SF-36 had a test-retest reliability of .80 or above, except for the Social Functioning subscale. The Physical Functioning subscale, the scale most often used by physical therapists, had an MDC95 value of 28% in our sample.

The UPDRS test-retest reliability values of .89 to .93 for the 3 subscales and total score were high, with MDC95 values of 2, 4, 11, and 13, respectively. Minimal detectable change values for the UPDRS Mentation, Behavior, and Mood subscale should be used with caution due to its low internal consistency.

Discussion

The convenience sample of people with parkinsonism who participated in this study may be similar to patients with parkinsonism seen in outpatient clinics and wellness programs but may have less severe PD than patients with the disease seen in long-term care and acute care inpatient facilities. Minimal detectable change values could vary not only by disease but also by stage of the disease. The BBS, ABC Scale, SRT with eyes closed, 6MWT, and gait speed tests all demonstrated test-retest reliability values above .90. The MDCs calculated from these test-retest values are considered dependable. Functional tests with test-retest reliability values below .90 (forward and backward functional reach, RT with eyes open and eyes closed, SRT with eyes open, TUG) should be used with caution in people with parkinsonism. All of the SF-36 subscales except the Social Functioning and Role–Physical subscales and all of the UPDRS subscales except the Mentation, Behavior, and Mood and ADL subscales had internal consistency and test-retest reliability values above .80, indicating that the scales measure one concept and that the MDCs are trustworthy.

Balance Testing

Internal consistency for the BBS in this study was similar to findings of a previous study of people with PD.10 Our study's high test-retest reliability and calculated MDC95 values were similar to values reported in current literature (2–12) for people with various disabilities.9,13–15,19 An MDC95 value of 5 on the BBS for people with parkinsonism is useful to physical therapists.

The test-retest reliability value of .73 for forward functional reach in this study was within the wide range of .42 to .93 reported in 2 previous studies of subjects with PD.15,22 The calculated MDC95 of 9 cm for forward functional reach is between the values of 4 to 12 cm calculated from the previous literature for subjects with PD.15,22 The low test-retest reliability value of .67 for backward functional reach, with a calculated MDC95 value of 7 cm, indicates that this test should be used with caution. Our test-retest reliability values for the SRT with eyes open and closed were slightly lower than the values obtained for the SRT with eyes open and closed in a previous study of elderly women who were healthy.17 No previous research reports MDC values on these tests for individuals with parkinsonism. Many subjects reached the 60-second ceiling on the RT and SRT with eyes open. A floor effect was seen for the SRT with eyes closed, but this test had higher test-retest reliability values than the SRT with eyes open or the RT with eyes open and closed. Due to the low reliability of scores obtained for forward and backward functional reach and for the RT and SRT (except for the SRT with eyes closed) in this study, these tests should be used cautiously as a measure of responsiveness to change in this population.

The ABC Scale had excellent internal consistency and test-retest reliability, with values being higher than those reported in the previous literature.19,28–30 The MDC95 value of 13% in our study fell below the 18% to 38% calculated from the previous literature for other patient populations. A change score of 13% or greater should be used for people with parkinsonism.

Ambulation Testing

The test-retest reliability and MDC95 values obtained for the 6MWT in this study fell within the range found in current literature.31–41,43,44,47,48 None of these studies, however, assessed individuals with parkinsonism. The MDC95 value of 82 m was larger than desired due to a large standard deviation resulting from a wide range of disease severity of the participants on the H&Y. Even though the MDC95 value was high, test-retest reliability on the 6MWT for people with PD was excellent. Thus, an MDC95 value of 82 m is valid for clinicians using the 6MWT in individuals with parkinsonism. Future studies with greater numbers of patients in each H&Y stage will determine whether the standard deviation decreases secondary to better homogeneity of the group by stage. If so, separate MDCs on the 6MWT should be determined for each stage of the disease. Future researchers should check this functional test for learning effects. The effects of learning on the 6MWT found in this study, although significant, were small.

Test-retest reliability values obtained for the TUG in this study fell within the range of reliability values found in previous research studies.15,50,52 The MDC95 values were higher than desired but fell within the range of values reported in the current literature.50,52 The mean score of 15 seconds on the TUG in this study would make a change score of 11 seconds or better unrealistic for the majority of the group. An MDC study based on each of the H&Y stages may decrease the standard deviation and subsequently provide lower MDCs on the TUG.
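The point about the TUG can be made concrete by expressing the MDC95 as a percentage of the group mean. The Python sketch below does this for the rounded values quoted above (an MDC95 of 11 seconds against a mean of 15 seconds); the function name `mdc_percent_of_mean` is hypothetical and is not a statistic reported in the study.

```python
def mdc_percent_of_mean(mdc, group_mean):
    """Return the MDC expressed as a percentage of the group mean score."""
    if group_mean <= 0:
        raise ValueError("group mean must be positive")
    return 100.0 * mdc / group_mean


# Rounded values quoted in the text: MDC95 of 11 s and a mean TUG time of 15 s.
tug_mdc_percent = mdc_percent_of_mean(11, 15)
```

With these values the MDC95 is roughly 73% of the mean, so a participant at the group mean would need to finish the TUG in about 4 seconds to show a detectable improvement.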

Test-retest reliability values for comfortable gait speed in this study fell within the values previously reported13,21,32,37,53,55–58,60 and were higher than the reliability values obtained in the only other study reporting test-retest reliability for people with PD.15 Our calculated MDC95 value of 0.18 m/s fell within the range of values reported in the literature13,21,32,37,53,55–58,60 and was similar to the SDD value of 0.19 m/s reported in a previous study of people with PD.15 The reliability and MDC95 values obtained for fast gait speed in the current study were similar to values reported in previous research.21,32,37,57,58,60 The MDC95 values calculated for the gait speed tests in this study are valid for individuals with PD. Of the 4 ambulation tests presented, clinicians should consider using both the comfortable and fast gait speeds to measure responsiveness to change over time because of the high test-retest reliability, normal distribution, and useful MDC scores in people with PD.

SF-36

Internal consistency values for all 8 SF-36 subscales in this study fell within the Cronbach alpha values reported in previous research.65–68,70,72,73,75–79 Similar to previous research, the Social Functioning scale had the poorest internal consistency (Tab. 1).

The test-retest reliability and MDC95 values (Tab. 2) calculated for all of the SF-36 subscales in this study fell within the ranges reported in the previous literature.63–74,76–79 None of these studies, however, assessed individuals with parkinsonism. The large MDC95 values can be attributed to the broad diversity of populations tested. Each subscale of the SF-36 can be used independently. Therapy may improve a patient's quality of life as measured by the Physical Functioning subscale (10 items) and the Bodily Pain subscale (2 items), and these SF-36 subscales should be utilized by therapists. A change of 28/100 or higher on the Physical Functioning subscale and a change of 25/100 or higher on the Bodily Pain subscale would demonstrate an improvement in these quality-of-life dimensions.

UPDRS

Internal consistency for the Motor Examination subscale of the UPDRS in this study was similar to the ranges reported in previous studies.10,81–85 Cronbach alpha values for the UPDRS subscales and total score in this study all fell slightly below reported values in the literature, which may be due to the large sample sizes used in previous studies.10,82,83,86 Internal consistency of the Mentation, Behavior, and Mood subscale of the UPDRS was below the acceptable level of .70, and this subscale should be used with caution in measuring change over time, despite acceptable test-retest scores.

The test-retest reliability and MDC95 values for the UPDRS subscales and total score were similar to values obtained in previous studies.15,87,88 The MDC95 values of 2 for the Mentation, Behavior, and Mood subscale and 4 for the ADL subscale in this study were the same as the values obtained for those subscales in a previous study that examined 400 patients with early stage, mild PD, but 4 points higher for the Motor Examination subscale and total score.87 The higher MDC95 values in this study may have been due to the smaller sample size and wider representation of PD severity.

Conclusion

Therapists have evaluation choices for measuring balance, ambulation, quality of life, and disease severity when assessing change over time in patients with chronic disease. The MDCs found for the BBS, ABC Scale, SRT with eyes closed, comfortable and fast gait speeds, 6MWT, SF-36 subscales (except Social Functioning), and UPDRS ADL and Motor Examination subscales and total score will be useful to therapists working with patients with parkinsonism in rehabilitation and wellness programs to determine whether change is due to testing error or is a result of intervention techniques. These values also help therapists interpret literature comparing statistical significance with meaningful clinical change. Test-retest reliability studies with larger samples by stage of PD and for patients with Parkinson-plus syndromes will help further define MDC values.

Footnotes

  • Dr Steffen provided concept/idea/research design, project management, fund procurement, subjects, facilities/equipment, and institutional liaisons. Ms Seney provided writing and data collection and analysis.

  • A special thanks to Rebecca Zabkowicz, Stacey Snider, Monique Serpas, Travis Rasinski, Asha Rani, Dana Pechawer, Jennifer Millard, Andrea Kriese, Anne Haseman, Nicole Hale, Stephanie Georgia, Amy Guathier, Stephanie Davis, Kathryn Cushman, Jennifer Braier, and Krista Bitetto, who assisted with the literature review and data collection while they were physical therapy or master of science in rehabilitation students at Concordia University Wisconsin, and to Cheryl Petersen for professional supervision on the project.

  • This research was presented at the Combined Sections Meeting of the American Physical Therapy Association; February 6–9, 2008; Nashville, Tenn.

  • * SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.

  • Received July 31, 2007.
  • Accepted February 4, 2008.
