Background and Purpose. Physical therapists routinely observe gait in clinical practice. The purpose of this study was to determine the accuracy and reliability of observational assessments of push-off in gait after stroke. Subjects. Eighteen physical therapists and 11 subjects with hemiplegia following a stroke participated in the study. Method. Measurements of ankle power generation were obtained from subjects following stroke using a gait analysis system. Concurrent videotaped gait performances were observed by the physical therapists on 2 occasions. Ankle power generation at push-off was scored as either normal or abnormal using two 11-point rating scales. These observational ratings were correlated with the measurements of peak ankle power generation. Results. A high correlation was obtained between the observational ratings and the measurements of ankle power generation (mean Pearson r=.84). Interobserver reliability was moderately high (mean intraclass correlation coefficient [ICC (2,1)]=.76). Intraobserver reliability also was high, with a mean ICC (2,1) of .89 obtained. Discussion and Conclusion. Physical therapists were able to make accurate and reliable judgments of push-off in videotaped gait of subjects following stroke using observational assessment. Further research is indicated to explore the accuracy and reliability of data obtained with observational gait analysis as it occurs in clinical practice.
Although instrumented gait analysis systems are frequently used in research, therapists use observation on a daily basis to evaluate gait because ready access to such measurement systems is currently impractical.1 Observational gait analysis (OGA) refers to the qualitative approaches to gait analysis used by clinicians whereby gait deviations are identified in patients from visual observations.2 As with all clinical measures, establishing the accuracy and reliability of OGA measurements is an essential consideration to ensure optimal clinical practice.
Despite its widespread use, there is a scarcity of evidence to support the accuracy (validity) of clinical observational analysis (Tab. 1). In 3 studies,3–5 observational accuracy was compared with various criterion measures and revealed mixed results. In a study of gait following stroke, Pearson product moment correlations (r) between observation of selected gait components and waveform indexes from a foot force measurement device varied from .32 to −.72, with a mean correlation of .55.3 Clinicians observing prosthetic alignment detected only 22% of the deviations predicted by biomechanical analysis of gait.4 Observations of a mixed subject group were compared with kinematic angles obtained from simple protractor measurements of videotaped gait, with variable results.5 Although the accuracy of determining foot placement was high, observers were generally inaccurate at judging joint angles, with an average score of 1.2 (1–2) out of a maximum of 5. Results from these studies contrast with those of a more recent study of gait after stroke,6 which determined high Pearson correlations (r=.88) between physical therapists' judgments of timing symmetry and those concurrently obtained with an instrumented footswitch measurement device. Physical therapists also were highly accurate in judgments of other movements, including single-leg stance steadiness7 and upper-limb tasks after stroke.8,9 These are encouraging findings and suggest that therapists have the capacity to make accurate judgments under certain conditions.
Studies of the reliability of OGA measurements are more common and have included a wide range of gait variables in varied populations, using different methods. The majority of these studies have been reviewed comprehensively by Malouin2 and are further described in Table 1.3–6,10–18 Despite the growing number of studies and the diversity of results, the majority of OGA studies have demonstrated poor to only moderate reliability.12,14,16 The widespread use of OGA in everyday clinical management of gait dysfunction highlights the need for further investigation of this clinical tool. Prior to further evaluation of OGA, we believe it is important to carefully evaluate the methods used in previous studies.
A range of factors have been identified and described as potential influences contributing to the generally poor to moderate reliability of OGA measurements reported.2 These factors include the potential error associated with observation of live gait performances and the lack of both operational definitions and uniform observer instruction and training. Furthermore, the majority of studies of OGA have encompassed a wide variety of gait variables, thus requiring multiple and complex judgments. The low reliability found may relate to these design factors, rather than accurately reflecting the underlying agreement of the observers. Systematic studies that minimize variability due to live gait performances and reduce the number of observed variables provide insight into the capacity of therapists to reliably observe gait.8 Although the current study deviates from the role of OGA in clinical practice, it represents an initial step in the exploration of the potential accuracy and reliability of observation under optimized circumstances.
The selection of gait variables in observational studies also warrants careful attention.2,8 Variables selected should have clinical significance and be representative of gait capacity in the desired population.2 Examination of the variables previously studied reveals wide diversity in both the type and number of selected variables. Almost all of the variables were spatiotemporal or kinematic in nature and reflect those variables typically considered in OGA and appearing on common gait evaluation forms.18–20 Although therapists have traditionally focused OGA on these gait features, a rapid expansion in measurement technology has allowed insight into other important aspects of gait. A review of current biomechanical knowledge of gait suggests strongly that kinetic variables also should be considered in any form of gait analysis.21
A comprehensive biomechanical description of the kinetic forces underlying walking in subjects without known impairments or pathology21–23 and in subjects with hemiplegia24,25 has been provided by instrumented gait analysis. In walking, concentric and eccentric muscle contractions create moment forces across the lower-limb joints. Joint mechanical power is the product of the moment of force and the angular velocity across the joint.21 The visible kinematic gait pattern observed by the clinician is simply an outcome of these invisible kinetic forces. Knowledge of these underlying patterns of power generation and absorption, therefore, provides an explanation of any kinematic deviations that can be seen. Herbert et al26 have argued that clinical analysis of movement dysfunction requires more than a simple description of kinematic deviations. Rather, they suggested that therapists may be able to observe the visible features of gait and combine these observations with biomechanical knowledge to make inferences about the muscle forces that occur. This approach can provide insight into the nature of gait deficits and direct guidance toward appropriate intervention strategies.
A review of kinetic features of gait suggests that ankle power generation in late stance is an important gait variable that is impaired in subjects with gait dysfunction following a stroke.21,24,25 Ankle plantar-flexor muscles contract rapidly in late stance phase to provide the single largest burst of power generation in the gait cycle of adults without impairments.21 The magnitude of ankle power generated by individuals after stroke also is highly correlated with gait speed, a measure of gait performance for this population.24,27 Reduced ankle power also is associated with reduced peak knee flexion in the swing phase, which is assumed to be related to reduced walking efficiency and increased risk of tripping.25 This action in the late stance phase is commonly described as “push-off” and has been considered 1 of 6 critical features of human gait.28
To date, only 3 studies3,5,10 have included push-off as a component of OGA, and they had different methods and highly variable results. Goodkin and Diller10 reported 23 of 30 possible agreements across 3 therapists when observing subjects with hemiparesis. However, the absence of statistical analysis accounting for chance agreement limits conclusions from this study. Miyazaki and Kubota3 reported interobserver agreement of .63 among 4 observers who viewed 48 subjects with hemiparetic gait. A Pearson correlation of .59 also was obtained between the observations of push-off and the data from a foot-force measurement device. Although accuracy and reliability were low, the use of live rating sessions, multiple rating variables, and a limited ordinal scale may have adversely influenced the results. In marked contrast, kinesiology students achieved high reliability on a nominal scale (average agreement: 4.5 out of 5) when judging push-off in videotaped performances by 8 subjects with varied pathologies.5 However, because 2 subjects had amputations, it is not surprising that the observers were consistent in judging push-off as either normal or abnormal. No decisive evidence has emerged that indicates whether therapists can reliably or accurately use OGA to evaluate kinetic aspects of gait such as push-off. Further investigation with optimal methods, therefore, is warranted.
Some authors1,29 have proposed that OGA requires much practice, combined with an understanding of biomechanics. This reflects what we believe is a common assumption by physical therapists that superior OGA skills may be associated with experience and practice. The few researchers30–33 who have examined the relationship between experience and reliability of data obtained with assessment techniques have not provided convincing evidence to support the assumption that experience has a positive influence on reliability. The only study14 of the relationship between therapist experience and reliability of OGA data demonstrated no consistent difference between experienced and less experienced therapists. In our research, therefore, we examined the relationship between the accuracy and reliability of therapists' observations and factors such as length and type of clinical experience.
In summary, the primary aim of this study was to investigate, after eliminating error due to variability in subject performance through the use of videotapes: (1) the accuracy, or criterion-related validity, of observations of push-off in videotaped gait performances of individuals after stroke, (2) the intraobserver and interobserver reliability of such observations, and (3) the relationship between clinical experience and the accuracy and reliability of OGA measurements.
Development of a New Observational Rating Scale
Two factors were considered in selecting and developing a new rating scale for study. These factors included a review of previously used scales and consideration of the role of OGA in clinical settings. A review of existing scales showed that almost all scales previously used to investigate OGA were nominal or ordinal scales of no more than 4 points (see, for example, studies by Potter et al,13 Hughes and Bell,15 and Keenan and Bach16 in Tab. 1). The limited precision of these scales does not necessarily reflect the role of OGA in clinical practice. We believe therapists use OGA in conjunction with other techniques to make refined judgments about movement quality. Furthermore, OGA may be used to monitor change and evaluate the effects of intervention, rather than to simply identify the presence of gait abnormalities.14 A therapist may use OGA to determine whether improvement has occurred in a patient's gait, although the gait may remain abnormal. A 3-point scale may not adequately reflect the small changes that can occur over a short period of gait training. More discriminative scales have been used in studies that have obtained high accuracy and reliability.7–9 Accordingly, we chose an 11-point (0–10) scale to allow observers to form what we would consider more precise discriminations in their judgments of the degree of abnormality of the observed gait performance.
We believe the literature describing ankle power generation in adults without impairment also supports the inclusion of an 11-point (0–10) range for grading “normal” push-off. This range contrasts with the single category typically found in existing studies of OGA.3,5,10,12,14,15 A wider scale enables the full range of normal abilities to be incorporated, recognizing that it may be desirable to detect clinical change in an individual throughout the normal range of ability. Scales that allow discrimination within the normal range of performance values permit continued measurements beyond the lower values of the normal range. For example, normal gait speed for adults may be described as a range of values between 1.2 and 1.5 m/s.34 We believe that clinicians continue to measure gait speed beyond the lower threshold limit when evaluating change in an individual during gait rehabilitation after a stroke. Older subjects (±60 years of age) without impairments also have demonstrated a wide range of values for push-off power.21 Accordingly, a range of 0 to 10 was chosen for the normal scale to reflect the variability of ankle power generation within the population of people without impairments. Together, these two 11-point scales provide a model of ankle power generation as a continuum of human performance values (Fig. 1).
Eighteen physical therapists were recruited as observers as a sample of convenience from 4 rehabilitation centers. Their clinical experience varied between 2.1 and 11.2 years (X̄=6.1, SD=3.0), with “neurology-specific” experience ranging from 0.8 to 11.0 years (X̄=4.8, SD=3.1). All therapists reported always using OGA when assessing gait after stroke. These therapists frequently assessed stroke gait, with 16 therapists assessing stroke gait daily.
Subjects after stroke.
Eleven subjects with hemiplegia following a single stroke were recruited using the subject database of the Human Motion Laboratory, Queen's University, Kingston, Ontario, Canada. All subjects walked independently or with an assistive device or an orthosis. Eleven walking trials were selected from a wider database of 32 subjects with hemiparesis, each of whom completed 3 walks. The 11 walking trials were selected to provide a sample with the widest range and most even spread of ankle power generation values. Characteristics of the individuals and their gait variables are described in Table 2. All participants provided informed consent prior to the commencement of the study.
Apparatus and Procedure
Obtaining the measurements of ankle power generation.
The criterion measure data were obtained as the subjects walked along a 9-m walkway at a self-selected, comfortable speed. Reflective markers were attached at the following anatomical landmarks: the fifth metatarsal joint, the heel, the ankle lateral malleolus, the lateral femoral condyle, the greater trochanter at hip joint level, and the acromioclavicular joint. Gait was filmed with a single LOCAM 51 cine camera* (50 frames per second) that was guided manually on a tracking cart beside the walkway. These data were synchronized with force data obtained simultaneously from a single forceplate.† A digitizer (GTCO Datatizer‡) was used to extract the coordinates of the body and background markers from each cine film frame for each stride. After scaling and correction for parallax error, the coordinate data were digitally filtered using a second-order low-pass Butterworth recursive filter. These processed coordinates then were combined with anthropometric and forceplate data using biomechanical software adapted from Winter21 to determine the kinematic and kinetic variables. Anthropometric data in the biomechanical modeling were obtained by combining individual subject measurements with standard constants35 to estimate segment masses and inertias. The data collection process is described in further detail by Olney et al.24
Ankle joint power ($P_a$) was calculated as the product given by the equation $P_a = M_a \, \omega_a$, where $M_a$ is the net moment across the ankle joint (in newton-meters) and $\omega_a$ is the ankle joint angular velocity (in radians per second). The data were then normalized by dividing each value by the subject's body weight. The criterion measure selected was the peak value of ankle power generation.
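The calculation of the criterion measure can be sketched in a few lines. The following is only an illustration, not the software used in the study: the function names and the sample values are hypothetical, and the normalization is assumed to use body mass in kilograms, as the watts-per-kilogram units reported later suggest.

```python
def ankle_power(moments_nm, ang_vel_rad_s, body_mass_kg):
    """Return the normalized ankle power curve (W/kg), sample by sample.

    Power at each sample is the net ankle moment (N*m) multiplied by the
    ankle angular velocity (rad/s), divided by body mass (assumed, in kg).
    """
    if len(moments_nm) != len(ang_vel_rad_s):
        raise ValueError("moment and angular-velocity series must have equal length")
    return [m * w / body_mass_kg for m, w in zip(moments_nm, ang_vel_rad_s)]


def peak_power_generation(power_w_per_kg):
    """Peak generation is taken as the largest value of the power curve."""
    return max(power_w_per_kg)


# Hypothetical three-sample stride for an 80-kg subject.
curve = ankle_power([10.0, 80.0, 120.0], [0.5, 2.0, 3.0], 80.0)
peak = peak_power_generation(curve)
```

Positive values of the curve correspond to power generation and negative values to absorption, so taking the maximum isolates the push-off burst in this simplified sketch.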
Construction of the gait performance videotape.
Videotapes for each of the 2 testing occasions were produced, with each videotape containing 11 randomly ordered images of subjects walking. Gait was viewed from a sagittal perspective closest to the affected side, as the subjects walked at their self-selected speed. Each walking trial was edited to include 4 repetitions of a single gait stride, from foot contact of the affected lower extremity through to early stance of the affected lower extremity in the subsequent stride. An audible tone was inserted at foot contact to alert the observers to watch the subsequent stance and push-off action. The total duration of the videotape was approximately 7 minutes, with individual subject performances varying from 12 to 40 seconds.
All observers participated in a 15-minute standardized instruction/familiarization session at the beginning of each testing occasion. This session included a brief review of normal ankle joint kinematics and kinetics21 and an operational definition of push-off. Push-off was defined as “a component of gait late in stance phase where the plantarflexors generate a concentric 'explosive' burst of energy, causing the foot to rapidly plantarflex” (adapted from Winter21). The rating scale was introduced, and the anchor descriptors were defined (Fig. 1). Observers practiced rating 2 example gait performances to become familiar with both the rating scale and the observational task. No specific training of OGA was provided, and no feedback was given regarding practice ratings, as this study aimed to investigate the observers' baseline ability.
Observers were tested in small groups at the rehabilitation centers on 2 occasions. The entire videotape was first shown to allow observation of the total range of subject variability. This aimed to minimize potential contraction bias effect.36 This form of bias occurs when observers use a range of responses smaller than the range of stimuli, such as clustering scores into the center of a scale. The videotape was then replayed and rated by the observers, who recorded their scores for each walk by simply circling a number on either the abnormal or normal scale (test 1). The videotape was played at normal speed to reflect the self-selected gait speed of each subject. Stopping, repeating, or slow-speed viewing of the videotape did not occur.
Observers attended an identical second testing session (test 2) approximately 4 weeks after the first testing session and viewed a second test videotape with a randomly altered order of image presentation. The 4-week time delay was used in an effort to maximize the possibility that the observers had forgotten their previous allocation of ratings. The altered order of subjects aimed to minimize the influence of a rating order effect. Observers also completed questionnaires about their clinical practice and experience.
Data Reduction and Analysis
Data available for analysis included both the observational rating scale data and the corresponding criterion measure data. The observational data comprised both categorical data (abnormal/normal) and ordinal data (the ratings). First, the ratings on the 2 scales were considered as parts of a single scale measuring push-off. This continuous scale extended from a “zero” point on the abnormal scale to 10 on the normal scale, with a total range of 22 points. Observers' ratings were converted to numbers on a single scale, with potential values of 0 to 21. In this manner, the values of any ratings of the 11-point (0–10) abnormal scale were unchanged, with a rating of 2 on the abnormal scale represented in the analysis as a score of 2. Ratings of 0 to 10 on the normal scale were converted to numbers 11 to 21, with a rating of 0 on the normal scale converted to a score of 11, a rating of 1 converted to a score of 12, and so on. In view of the interval construction and high number (22) of categories used to measure this continuous variable, parametric statistical methods were selected for data analysis involving the rating scale numbers. Second, data from the categorical judgments of normality were available to enable comparison with other studies of OGA that used categories of normal or abnormal in their scales. These data were analyzed using nonparametric techniques.
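The conversion of the two 11-point scales to a single 22-point scale can be expressed as a short function. This is a minimal sketch of the mapping described above; the function name and the category labels are hypothetical.

```python
def to_single_scale(category, rating):
    """Map a rating on the abnormal or normal scale to the combined 0-21 scale.

    Abnormal-scale ratings (0-10) are unchanged; normal-scale ratings (0-10)
    are shifted to 11-21.
    """
    if category not in ("abnormal", "normal"):
        raise ValueError("category must be 'abnormal' or 'normal'")
    if not 0 <= rating <= 10:
        raise ValueError("rating must lie between 0 and 10")
    return rating if category == "abnormal" else rating + 11


# The categorical judgment is retained alongside the converted score.
converted = [to_single_scale(c, r) for c, r in [("abnormal", 2), ("normal", 0), ("normal", 1)]]
```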
Criterion-related validity: individual judgment models and accuracy.
The relationships between the recorded ratings and the associated criterion measure data were plotted for each observer, and we visually examined the plots to confirm the linearity of the relationships. Each scatterplot and regression equation, therefore, reflected a unique model of judgment.8 To investigate accuracy, individual Pearson product moment correlations were calculated using each observer's scores of push-off and the measurements of ankle power generation. These correlations were then transformed to Z scores using the Fisher r to Z transformation, averaged, then converted back to Pearson r indexes to provide a group mean correlation value for test 1.
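The averaging of correlations through the Fisher transformation can be sketched as follows. The function names are hypothetical and the input values are illustrative, not the study data; the transformation itself is the inverse hyperbolic tangent, which is undefined for a correlation of exactly 1 or −1.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_r_fisher(correlations):
    """Average correlations by transforming to Z, averaging, and back-transforming."""
    z_scores = [math.atanh(r) for r in correlations]  # Fisher r-to-Z
    return math.tanh(sum(z_scores) / len(z_scores))   # Z back to r


# Illustrative values only: two observers with correlations of .69 and .91.
group_mean_r = mean_r_fisher([0.69, 0.91])
```

Because the transformation stretches values near 1, the back-transformed mean of these two illustrative values is slightly higher than their arithmetic mean of .80.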
The precision of the rating scale was investigated by calculating the standard error of estimating the peak ankle power generation ($s_{est}$). This was determined for each of the observers, using the test 1 results. The $s_{est}$ was calculated according to the following formula37: $s_{est} = SD\sqrt{1 - r_{xy}^{2}}$, where SD is the standard deviation of the 11 subjects' measurements of ankle power generation and $r_{xy}$ is the Pearson product moment correlation value for test 1 for each observer.
These calculations are in the same units (watts per kilogram) as the criterion measure, thus providing a direct method of interpreting the accuracy of the observers' use of the 22-point rating scale. The sest quantifies the amount of judgment error (in watts per kilogram) for each observer using the rating scale.
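A worked sketch of the standard error of estimate follows. It assumes the conventional form of the formula, in which the standard deviation of the criterion measure is multiplied by the square root of 1 minus the squared correlation; the function name and the input values are hypothetical.

```python
import math


def standard_error_of_estimate(sd_criterion, r_xy):
    """Standard error of estimating the criterion from the ratings.

    Assumes the conventional form SD * sqrt(1 - r**2); the result is in the
    units of the criterion measure (here, W/kg).
    """
    return sd_criterion * math.sqrt(1.0 - r_xy ** 2)


# Hypothetical observer: criterion SD of 1.0 W/kg and a correlation of .60.
s_est = standard_error_of_estimate(1.0, 0.6)
```

For a fixed standard deviation, a higher correlation yields a smaller standard error, which is why the most accurate observers also show the least judgment error in criterion units.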
The discrimination between normal and abnormal push-off.
The categorization of each walk as abnormal or normal was inspected in relation to the associated measurements of ankle power generation. This inspection provided us with insight into the judgment patterns of the observer group. Each of the 18 judgment models was explored to determine the range of measurements of ankle power generation at which observers divided the gait performances into each of the abnormal or normal categories. Regression equations from each judgment model enabled prediction of this division or cutoff point, which reflected the intersection point of the 2 scales. This point was determined in the following manner. A value of 10.5 was selected to substitute into each regression equation as the “x” rating value. This value represented a midline point that may arbitrarily separate the abnormal scale range of points (0–10) and the normal scale range of points (11–21). From this “x” midline point, a corresponding “y” criterion measure value could be predicted, representing the threshold value of the criterion measure at which each observer's judgment model divided the gait performances as normal or abnormal. An example of this prediction process is illustrated in Figure 2. Values for these predicted thresholds of normality can be compared with the range of ankle power generation values attained by older subjects without impairments.
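The prediction of a threshold of normality from one observer's judgment model can be sketched as a simple least-squares regression of the criterion measure on the ratings, evaluated at a rating of 10.5. The function names and the data below are hypothetical and serve only to show the steps.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_threshold(ratings, criterion_w_per_kg, midpoint=10.5):
    """Criterion value (W/kg) predicted at the midpoint between the 2 scales."""
    slope, intercept = fit_line(ratings, criterion_w_per_kg)
    return slope * midpoint + intercept


# Hypothetical observer whose ratings rise by 10 points per 1.5 W/kg.
threshold = predicted_threshold([0, 10, 20], [0.0, 1.5, 3.0])
```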
Intraclass correlation coefficients (ICC [2,1]) were selected to investigate the agreement among observers on both test occasions.38 Average ICCs across the 2 tests were determined by converting the ICCs of each test occasion to Z scores using the Fisher r to Z transformation, averaging the Z scores, and converting the average back to ICCs. The standard error of measurement (SEM) for the ratings of push-off among the observers also was determined for each test occasion according to the following formula,39 in which an ICC (2,1) value was used as the reliability coefficient: $SEM = SD\sqrt{1 - r_{xx}}$, where SD is the standard deviation of the rating scale scores recorded by observers on a test occasion and $r_{xx}$ is the ICC value (among observers).
The SEM is provided in the same units as the 22-point observational rating scale, thus providing a direct method of interpreting the reliability of the scale. In this instance, the SEM reflects how much error there was across the observer group when rating the gait performances.
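The two reliability indexes can be illustrated with a short sketch. It assumes the standard two-way random-effects, single-rater form of the ICC (2,1) computed from the mean squares of a subjects-by-observers table, and the conventional SEM form of the standard deviation multiplied by the square root of 1 minus the reliability coefficient. The function names and data are hypothetical.

```python
import math


def icc_2_1(table):
    """ICC (2,1) for a table with one row per subject and one column per observer."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between observers
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def standard_error_of_measurement(sd_ratings, r_xx):
    """SEM in rating scale units, assuming the form SD * sqrt(1 - r_xx)."""
    return sd_ratings * math.sqrt(1.0 - r_xx)


# Hypothetical ratings: the second observer scores every subject 2 points higher.
icc = icc_2_1([[1, 3], [5, 7], [9, 11]])
sem = standard_error_of_measurement(4.0, 0.75)
```

Because the ICC (2,1) reflects absolute agreement, the constant 2-point offset in this example lowers the coefficient below 1 even though the observers rank the subjects identically.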
Interobserver reliability of the categorical data was investigated using the Cohen Kappa statistic. Kappa is a corrected measure of agreement between observers that takes into account the effect of chance agreement.40 A value of 1 represents perfect agreement, and a value of 0 indicates agreement that is no better than would be expected by chance. Kappa was calculated for each pair of observers for test 1. Kappa values were transformed to Z scores, averaged, and converted back to Kappa values to provide an average value.
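The chance correction performed by Kappa can be shown with a minimal sketch for one pair of observers. The function name, the category labels, and the data are hypothetical.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen Kappa for two observers' categorical judgments of the same walks."""
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    # Observed proportion of walks on which the two observers agree.
    p_observed = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
    # Agreement expected by chance, from each observer's category proportions.
    p_chance = sum((ratings_a.count(lab) / n) * (ratings_b.count(lab) / n) for lab in labels)
    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when both observers use a single category")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical pair of observers judging 4 walks as abnormal (A) or normal (N).
kappa = cohen_kappa(["A", "A", "N", "N"], ["A", "A", "N", "A"])
```

In this example the observers agree on 3 of 4 walks (75%), but half of that agreement would be expected by chance, so Kappa is lower than the raw percentage of agreement.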
An ICC (2,1) also was calculated to determine the agreement consistency of each observer across the 2 test occasions. Individual ICC (2,1) values were determined for each observer and then averaged by the previously described method. The SEM was calculated for each clinician to reflect the error involved when a single observer repeated the rating on 2 occasions.39 The SEM for each observer then was calculated and averaged by the method described previously, using the standard deviation of the ratings of individual observers (averaged over test 1 and test 2) and the individual observers' ICC (2,1) values. Percentage of agreement and the Cohen Kappa statistic also were determined to evaluate the agreement of the categorical data.
The relationship between the observers' length of experience and individual values of accuracy and intraobserver reliability was investigated by calculation of Pearson product moment correlation coefficients. Prior to calculating the correlation, the individual ICC (2,1) or Pearson product moment values were again transformed to Z scores, and the data were inspected to confirm a linear relationship. The relationship between experience and interobserver reliability also was investigated by selecting the 5 most experienced therapists (mean experience=8.3 years) and the 5 least experienced therapists (mean experience=2.8 years) and then comparing these 2 groups. Intraclass correlation coefficients (2,1) were obtained for the “most experienced” and “least experienced” groups by the method previously described, and confidence intervals around these values were determined.
Use of the Observational Rating Scale
In view of the development of the new 2-part observational rating scale, the continuity of the underlying single scale was examined by investigating the relationship between the observers' ratings and the associated criterion measure values. The observer group used all numbers of the 22-point scale to assign ratings to the 11 gait performances. Because the actual ankle power generation values of the subjects allocated to each scale number were known, the mean ankle power generation of subjects ranked at each rating scale number could be determined. Further examination of this relationship by simple regression analysis (Fig. 3) revealed a linear relationship with a Pearson product moment correlation coefficient of .94, suggesting a strong association between the rating scale and the mean ankle power generation of the subjects rated at each number. Greater variability in this relationship was evident at the higher rating scale numbers; however, these data were based on fewer subjects, and no extreme deviations were apparent. The interval between numbers 10 and 11 did not appear to be different from the other intervals across the scale. This finding suggests that dividing the rating scale into the two 11-point (0–10) scales did not influence the continuity of the underlying single variable.
Therapist Accuracy: Criterion-Related Validity
Inspection of the 18 judgment scatterplots revealed linear relationships, with Pearson product moment correlation coefficients varying from .69 to .91 and an average correlation of .84 for both test occasions (Tab. 3). These correlation coefficients represented fair to high values. Figure 4 illustrates the relationship between the rating scale scores and the criterion measure values for 2 observers: (1) the least accurate and (2) the most accurate. The sest varied from 0.38 to 0.67 W/kg, with a group mean of 0.51 W/kg.
Discrimination of Normality: Regression Analysis
Figure 5 illustrates the pattern of discrimination used by the observers to categorize the gait performances as abnormal or normal. Inspection of the raw scores shows a discrimination of normality between the values of 1.00 W/kg and 1.23 W/kg. Performances of subjects with criterion measure values of ≤1.0 W/kg were rated by all observers as abnormal. At higher values, there was a progressive trend to rate performances as normal. This trend indicates a systematic scoring process relative to power generation.
Further insight into this discriminative process can be gained by closer inspection of the individual judgment models. Table 3 contains the individual predicted threshold of normality values for each therapist during test 1, as predicted by the regression analysis. These values indicate that the therapists varied in the point of the criterion range at which they separated normal performance from abnormal performance, from 1.21 W/kg to 3.02 W/kg, with a mean value of 1.80 W/kg.
Interobserver agreement was moderately high for both test occasions, with ICC (2,1) values ranging between .75 and .77 (Tab. 4). The average SEM (between all observers) was reasonably consistent across observers, at around 2.7 rating scale units.
Prior to determining Kappa values, the categorical data were inspected to ensure that variability was present in each observer's ratings, as a requirement for Kappa calculations. Because observers 1 and 5 consistently rated all subjects' performances as abnormal, data from these observers were excluded from this analysis. Kappa coefficients varied from .23 to 1.00, with an average of .66, representing substantial agreement among the 16 observers, according to the classification of Landis and Koch.41
Observers were consistent in their individual use of the rating scale, with a mean ICC (2,1) of .89 obtained (Tab. 5). Variability was apparent, with ICC (2,1) values ranging from .64 to .96. The SEM also varied across observers, with a relatively low average of less than 2 rating scale units (Tab. 5). Consistency of the categorical judgment of normal versus abnormal across the 2 tests was similarly high, with average agreement of 89.5%. An averaged Kappa value of .79 for the observers (Tab. 5) indicated substantial to almost perfect agreement of categorical judgment.41
No relationships existed between the observers' general or neurology-specific experience and their accuracy or intraobserver reliability indexes. Experience was poorly associated with interobserver reliability, with the least experienced group showing higher agreement than the most experienced group (ICCs [2,1] of .88 and .82, respectively). Inspection of confidence intervals (CIs) around each value (least experienced group's CIs=.59–.97; most experienced group's CIs=.49–.95) confirmed that no difference occurred between these 2 groups.
Accuracy of Therapists' Observational Judgments
The high correlations between the observational ratings and the criterion measure values indicate a strong relationship between the subjects' ankle power generation, as measured by the motion analysis system, and push-off, as observed by the therapists. This positive linear relationship (Fig. 3) suggests that therapists view both abnormal and normal push-off along a continuum. This finding is consistent with data describing therapists' observations of upper-limb movement after stroke.8
The indexes of accuracy of the observational ratings in this study were high. We believe that this is an important result because clinicians report including push-off as a component of OGA when observing gait.42 The importance of the therapists' individual accuracy coefficients in relation to clinical practice is difficult to define. It cannot be stated with certainty what values of criterion-related validity are acceptable for such a clinical measure to be used with confidence. This study, therefore, provides evidence that therapists are able to accurately discriminate between levels of ankle power generation in gait after stroke, when observing videotaped gait performances.
These high correlations between the observational ratings and the criterion measure values are encouraging. We contend that further insight into observational accuracy can be achieved by examining the metric estimates of the accuracy of the therapists' ratings. These estimates provide an indication of whether therapists are able to detect what we would consider clinically meaningful change. The mean standard error of estimating the criterion measure values was found to be 0.51 W/kg, compared with the range of 3.16 W/kg in the sample. This means that in using the rating scale, therapists could be 68% confident that an individual rating score would fall plus or minus 0.51 W/kg from the true criterion value. A 95% CI would result from approximately 2 times this range, or plus or minus 1.02 W/kg. The mean error range of 1.02 W/kg is large, covering approximately one third of the range of the criterion measure. This error range implies that therapists would only be able to accurately discriminate changes of approximately 1 W/kg. Therapists frequently use measures to quantify changes in push-off across time, thereby evaluating a series of small changes at intervals during rehabilitation. Unfortunately, there are no group data for sequential measurements of ankle power generation after stroke, and the amount of change in ankle power generation that occurs during the rehabilitation phase is unknown. It is impossible, therefore, to know how precise or accurate therapists' observational measures need to be. Further research is needed to quantify the timing and amount of change in ankle power generation after stroke.
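The arithmetic behind these error ranges can be sketched as follows. The two input values are those quoted above; the helper name is illustrative, and the multiplier of 2 follows the approximation used in the text rather than the more exact normal value of 1.96.

```python
# Values quoted in the text; the helper name is illustrative only.
SEE_W_PER_KG = 0.51              # mean standard error of estimate
CRITERION_RANGE_W_PER_KG = 3.16  # range of the criterion measure in the sample

def error_band(standard_error, multiplier):
    """Half-width of the band around the true value: multiplier x standard error."""
    return multiplier * standard_error

band_68 = error_band(SEE_W_PER_KG, 1.0)  # about 68% of ratings fall within this
band_95 = error_band(SEE_W_PER_KG, 2.0)  # text rounds the 1.96 multiplier to 2
share_of_range = band_95 / CRITERION_RANGE_W_PER_KG

print(f"68% band: +/-{band_68:.2f} W/kg")
print(f"95% band: +/-{band_95:.2f} W/kg")
print(f"95% band as a share of the sample range: {share_of_range:.2f}")
```

The final ratio is approximately 0.32, which corresponds to the "approximately one third" of the criterion range described above.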
This typical error size of 0.51 W/kg can be considered in relation to the ability to discriminate between older people without known impairments and people following stroke. Olney et al24 reported the mean maximum ankle power generation of a group of 30 subjects following stroke as 0.60 W/kg (SD=0.51). This value is substantially different from the reported mean of 2.48 W/kg (SD=0.46) of elderly subjects without known impairments.21 This typical error of estimation of ankle power generation, in our opinion, would allow the use of observational judgments to differentiate among subjects with performance values near to the means of these groups.
Further insight into clinical judgment can be gained by consideration of the predicted normality values. Although therapists commonly decide whether a movement is normal, it is not known how this judgment is made. In the literature, it is common to find individual subject performance measures compared with an appropriate normative reference group using a statistical framework. For example, gait speed may be considered abnormal when it is more than 1.65 standard deviations from the mean of a reference group.43 Similarly, the measurements of ankle power generation following stroke in our study can be compared with those obtained from older subjects without impairments. The most appropriate reference group is that described by Winter,21 who collected gait data from 18 older subjects without impairments in an identical manner to that of the collection of gait data in our study. This provides what we consider the most valid comparable normative reference group. An estimate of the fifth percentile of this group's performance indicates that 95% of these subjects obtained ankle power generation values of 1.72 W/kg or greater. These values lie within the range of predicted normality values determined from the therapist judgment models and are remarkably close to the average of 1.8 W/kg. This finding indicates that, on average, the group of 18 therapists were placing the cutoff defining normality at around the fifth percentile mark.
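One way such a fifth-percentile estimate can be reproduced is sketched below. The sketch assumes that ankle power generation in the reference group is normally distributed, which is an assumption made here for illustration; the mean and standard deviation are the values reported for the older reference group (2.48 W/kg, SD=0.46), and the variable names are illustrative.

```python
from statistics import NormalDist

# Reference-group values reported in the text (older subjects without impairments)
REFERENCE_MEAN_W_PER_KG = 2.48
REFERENCE_SD_W_PER_KG = 0.46
# Average predicted normality cutoff from the therapist judgment models
THERAPIST_CUTOFF_W_PER_KG = 1.8

# Assumption: the reference data are approximately normally distributed
reference = NormalDist(mu=REFERENCE_MEAN_W_PER_KG, sigma=REFERENCE_SD_W_PER_KG)

# Value below which 5% of the reference group is expected to fall
fifth_percentile = reference.inv_cdf(0.05)
gap_to_cutoff = THERAPIST_CUTOFF_W_PER_KG - fifth_percentile

print(f"Estimated fifth percentile: {fifth_percentile:.2f} W/kg")
print(f"Therapist cutoff minus fifth percentile: {gap_to_cutoff:.2f} W/kg")
```

Under the normality assumption, the estimate is about 1.72 W/kg, which is equivalent to the mean minus roughly 1.65 standard deviations.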
Reliability of the Therapists' Observational Judgments
The indexes of agreement achieved in this study indicate that the therapists were able to make highly stable individual judgments when observing a single kinetic gait variable on 2 occasions in videotaped gait performances. These observers also demonstrated moderately consistent agreement with each other about the push-off ability of the subjects during gait.
In addition to measures of agreement, metric indexes of error are useful to provide a meaningful indication of the size of measurement error in units of the rating scale. The SEM (within a single observer) was relatively low, at around 1.8 units of the 22-point scale. This value allows estimation of the error size that could be expected when an observation is repeated by the same observer on another occasion. In this instance, a change of plus or minus 3.6 units would be required to be 95% confident that change had occurred beyond that potentially due to measurement error. The average SEM (among the group of observers) was larger, at around 2.5 units. This finding means that change of around 5 units would be necessary to demonstrate change beyond measurement error if an observation were repeated by a group of observers. These error values provide a framework based on measurement error against which true change can be evaluated. Because the magnitude of typical change in ankle power generation in recovery after stroke is not known, the clinical significance of these error sizes cannot easily be determined. However, relative to the 22-point scale, these error sizes appear to be small.
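The change thresholds quoted above follow from doubling the SEM, as in this minimal sketch. The SEM values and the 22-point scale width are taken from the text; the function name and the multiplier of 2 (the approximation used in the text) are illustrative assumptions rather than a prescribed formula.

```python
SCALE_POINTS = 22  # width of the rating scale, in units

def change_threshold(sem_units, multiplier=2.0):
    """Change (in scale units) needed to exceed measurement error at about 95%."""
    return multiplier * sem_units

within_observer = change_threshold(1.8)   # SEM for a single observer
between_observers = change_threshold(2.5)  # SEM among the group of observers

for label, threshold in (("within observer", within_observer),
                         ("between observers", between_observers)):
    share = threshold / SCALE_POINTS
    print(f"{label}: {threshold:.1f} units ({share:.0%} of the scale)")
```

Expressing each threshold as a share of the 22-point scale gives one way to compare these error sizes with the scale width.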
The higher reliability values for individual intraobserver reliability compared with interobserver reliability were expected. It is more likely that an individual observer will agree more consistently with himself or herself than with other therapists. This finding is consistent with previous investigations of OGA data reliability15,16 and other studies of movement observation.7,8 In current clinical practice, patients are commonly assigned to an individual therapist, who is responsible for their treatment for the duration of intervention. The findings of our study confirm that this OGA measurement is more consistent and has lower error when repeated by the same therapist, as compared with OGA measurements obtained by multiple therapists. This finding suggests that in clinical practice an individual therapist should endeavor to be responsible for OGA measures in a single patient across multiple sessions whenever possible. This practice should enhance reliability and minimize between-observer error.
The poor relationship between therapist experience and intraobserver reliability we demonstrated is comparable to that reported by Eastlack et al.14 Furthermore, we showed poor correlations between the obtained accuracy indexes and the experience of the therapists. More experienced therapists were not more accurate and did not obtain more reliable measurements than therapists with less experience. In addition, specialized neurology-specific experience did not enhance either reliability or accuracy. This finding contrasts with the belief that clinicians with more specialized experience are likely to be more accurate and to obtain more reliable measurements. These poor correlations strongly suggest that experience is not a factor that influences accuracy or reliability of measurements obtained for this rating task. We examined single observations only, and we did not examine the way in which these observations are interpreted or how the information is used. It is possible that expert-novice differences may lie more in the complex decisions arising from observations, rather than the observations themselves.
The reliability and accuracy indexes obtained in this study were higher than the majority of those previously reported in the OGA literature (Tab. 1). Our higher values may relate to the method used and the design chosen. Provision of an operational definition, the selection of a single variable, and the simple scoring system allowing discriminating judgments are likely to have enhanced the accuracy of the observers and the reliability of their measurements. Our inclusion of these design features and the favorable results achieved support the premise that observers can be accurate and can obtain reliable measurements if the task is structured with specific guidelines.
The selection of a single videotaped stride as a rating stimulus may have enhanced the achieved reliability by reducing error due to patient variability. Live gait performance raises a strong question of performance inconsistency between gait trials, which is potentially accentuated by fatigue. The poor reliability reported by authors such as Miyazaki and Kubota3 and Goodkin and Diller10 thus may have been influenced by the inclusion of live rating sessions. Designs including videotaped performances have attempted to control performance variability, but reliability has not necessarily improved.5,12,14,16 Such studies, however, have included subjects who walked a number of strides across varied distances over a number of trials. In such instances, videotapes are likely to have controlled the variation in gait performance due to subject fatigue. However, the videotapes, in our opinion, may not have controlled variability due to stride-to-stride or trial-to-trial gait fluctuations, which may have contributed to increased error in rating scores when combined with inadequately defined rating instructions. For example, one observer may observe the best, worst, or a random stride performance for a single variable and then rate the stride performance accordingly. A second observer may seek to rate an average stride after viewing a number of strides. Thus, unless instructions are very specific, error in rating due to stimulus variability is likely and may result in lower reliability. All previous OGA studies have included a number of strides and thus may have been subject to this type of increased rating error.
The higher accuracy and reliability achieved with a wider 22-point scale in our study counters the contention that discriminative grading of OGA is very demanding and that use of broader scales may adversely influence reliability.2,5 Other studies of movement observation have described successful use of similarly discriminating scales, with comparable reliability outcomes.7,8 An expanded scale with clear descriptors may be easier to use than a narrow scale with descriptors relating to ill-defined categories of abnormality, such as “noticeably,” “moderate,” or “severe.” In our study, observers were able to demonstrate discrimination of rating that is more comparable to the refined judgments of movement quality that may be necessary in clinical settings.
In nearly all studies of OGA, multiple variables were rated, with up to 50 variables considered by the observer.5 Furthermore, frequently cited OGA tools, such as the Rancho Los Amigos Medical Center form,20 demand that multiple joints be assessed over the many phases of the gait cycle, with in excess of 100 decisions to be made. We contend that the larger the number of variables to be judged, the greater the attentional demands on the observer. As the number of demands, or attention required, increases, the potential for error and inconsistency may also increase. It is not known how observers rated multiple variables in previous reliability studies. In clinical practice, it is suggested that a systematic approach to observation should be adopted, with observers attending to single variables at a time.29,44 When observation was structured in this manner in Bernhardt and colleagues' study of 3 kinematic tasks, high accuracy and reliability were demonstrated.8 The results of our study and those of the study by Bernhardt et al8 suggest that it may be desirable for observers to attend to single variables sequentially when observing gait. Use of a videotape for analysis may augment the reliability of data obtained with this decision process. Indeed, videotaping of performance has frequently been advocated as a method to optimize movement analysis.1,29,44
Our study represents an initial step in the exploration of the accuracy of observations of a single variable in OGA by use of videotape. We believe the method we used enhanced observer accuracy, but it also limits the immediate generalizability of the findings. The findings indicate that therapists are able to make accurate and reliable judgments of a single kinetic gait variable when viewing videotaped gait performances in a quiet environment. This finding confirms that therapists may be able to provide reliable and accurate decisions about observed movement when the task is clearly defined and focused on a single judgment and their attention is focused on preselected videotaped segments. The findings are limited to the context of this study and cannot be immediately translated into clinical practice, where patients are observed directly and multiple variables are considered. These favorable results, however, establish that further research directed into clinical observation is appropriate. In view of the fundamental role of OGA within clinical decision making, it is important that further research define and direct the clinical utility of such observations.
Considerations for Future Research
The widespread continuing use of OGA to assess gait after stroke suggests that clinicians value information gained from these observations. Despite the emergence and emphasis of other measures such as gait speed, clinicians continue to observe their patients walking and focus on these “qualitative” aspects of gait. We therefore believe that research should focus on exploring and defining the potential value of such observations. Key questions for clinical practice have emerged from this and other studies of movement observation after stroke.8
Observational gait analysis is currently used on a daily basis by therapists because of its easy application and the lack of easily accessible instrumented measurement tools.1,8,29,44 We believe OGA has an important role in clinical decision making about gait training and in evaluation of progress after stroke. The therapists in this study, under specific standardized conditions, were able to accurately infer the impaired ankle power generation after stroke from videotaped gait performances. The high accuracy and moderate to high reliability indexes obtained in this study indicate that further investigation of the measurement properties of OGA is warranted. Despite these favorable findings, further work is needed to generalize these results directly to the clinical situation. Further research is warranted to explore the ability of therapists to judge further kinetic aspects of gait and to define the circumstances that enhance the accuracy of this judgment.
Ms McGinley, Dr Goldie, and Dr Greenwood provided concept/research design, writing, data analysis, and project management. Ms McGinley and Dr Olney provided data collection and subjects. Dr Goldie and Dr Olney provided institutional liaisons. Dr Goldie, Dr Greenwood, and Dr Olney provided facilities/equipment and consultation (including review of manuscript before submission).
This research was approved by the La Trobe University Faculty of Health Sciences Human Ethics Committee.
This research was presented at the Sixth International Physiotherapy Congress of the Australian Physiotherapy Association; May 24–27, 2000; Canberra, Australian Capital Territory, Australia.
* Redlake Corp, 1711 Dell Ave, Campbell, CA 95008.
† Advanced Medical Technology Inc, 141 California St, Newton, MA 02158.
‡ GTCO Corp, 1055 First St, Rockville, MD 20850.
- Received February 22, 2002.
- Accepted September 2, 2002.
- Physical Therapy