Research Reports
K Starrost, PT, MSc, is Head of Physical Therapy, Clinics Schmieder, Allensbach, Germany.
S Geyh, PhD, is Research Scientist, Swiss Paraplegic Research, Nottwil, Switzerland.
A Trautwein, PT, is Physical Therapist, Clinics Schmieder, Gerlingen, Germany.
J Grunow, PT, is Physical Therapist, Neurological Hospital Tristanstrasse, Munich, Germany.
A Ceballos-Baumann, MD, is Professor and Head, Neurological Hospital Tristanstrasse.
M Prosiegel, MD, is Head, Centre for Swallowing Disorders, Speciality Clinic for Physical Medicine and Medical Rehabilitation, Bad Heilbrunn, Germany.
G Stucki, MD, MS, is Professor and Head, Department of Physical Medicine and Rehabilitation, Ludwig-Maximilian University, Munich, Germany; Head, ICF Research Branch of the WHO CC FIC (DIMDI), Institute for Health and Rehabilitation Sciences, Ludwig-Maximilian University; and Head, Swiss Paraplegic Research, Nottwil, Switzerland. Dr Stucki's institutional mailing address is: Department of Physical Medicine and Rehabilitation, University Hospital Munich, Marchioninistrasse 15, 81377 Munich, Germany.
A Cieza, PhD, is Research Scientist and Group Leader, ICF Research Branch of the WHO CC FIC (DIMDI), Institute for Health and Rehabilitation Sciences, Ludwig-Maximilian University, and Research Scientist and Group Leader, Swiss Paraplegic Research, Nottwil, Switzerland.
Address all correspondence to Dr Stucki at: gerold.stucki{at}med.uni-muenchen.de
Submitted July 27, 2007;
Accepted March 24, 2008
| Abstract |
|---|
Subjects and Methods: A monocentric, cross-sectional reliability study was conducted. A consecutive sample of 30 subjects after stroke participated. Two physical therapists rated the subjects' functioning in 166 ICF categories.
Results: The interrater agreement of the 2 physical therapists was moderate across all judgments (observed agreement=51%, kappa=.41). Interrater reliability was not related to rater confidence or to the physical therapists' areas of core competence.
Discussion and Conclusion: The present study suggests potential improvements to enhance the implementation of the ICF and the Extended ICF Core Set for Stroke in practice. The results hint at the importance of the operationalization of the ICF categories and the standardization of the rating process, which might be useful in controlling for rater effects and increasing reliability.
| Introduction |
|---|
However, with the practical application of the ICF, important challenges arise. The main challenge is the length of the classification, with more than 1,400 categories. To address this challenge, internationally agreed-on ICF Core Sets for various health conditions have been developed in a scientific evidence-based process.12 The common standardized procedure for developing ICF Core Sets integrates evidence gathered from preliminary studies with a formal decision-making and expert consensus process. The methodological approaches in the preliminary studies include, for each health condition: (1) systematic literature reviews of outcome measurements used in clinical trials, (2) Delphi exercises capturing experts' views, and (3) collection of empirical data from people undergoing inpatient or outpatient rehabilitation. The results of the preliminary studies are the foundation for the subsequent decision-making and consensus process with a nominal group technique. The resulting ICF Core Sets are practical tools that represent selections of categories from the whole classification. They comprehensively describe the prototypical spectrum of problems in the functioning of patients with specific health conditions. They are based on the universal language of the ICF but enhance its applicability through their manageable size.
In the context of neurorehabilitation, stroke plays a prominent role. In this field, 3 ICF Core Sets can be applied, namely, the ICF Core Set for Stroke13 and the ICF Core Sets for patients with neurological conditions in acute care hospitals14 and early postacute care rehabilitation facilities.15 These ICF Core Sets have been combined to create the Extended ICF Core Set for Stroke. It contains all ICF categories that have been selected for any of the 3 ICF Core Sets mentioned above. The Extended ICF Core Set for Stroke contains 166 categories of the ICF: 59 categories of the component "body functions," 11 categories of the component "body structures," and 59 categories of the component "activity and participation," as well as 37 categories describing the influence of the component "environmental factors."
A further challenge for the implementation of the ICF is the operationalization of ICF categories. The ICF comprises "qualifiers" to quantify the level of functioning or the severity of the problem in the various ICF categories. The WHO suggests that all categories of the classification be quantified with the same generic scale (Tab. 1).
Therefore, the application of the ICF and the ICF Core Sets is a challenge to the user and poses the question of reliability and rater agreement when qualifiers are assigned to describe patients' functioning and disability. So far, only a few studies have dealt with the reliability of ICF qualifiers, and the interrater reliability of the qualifiers used with the ICF Core Set for Stroke has not been studied yet. Okochi et al16 used the ICF Checklist to examine test-retest reliability in geriatric patients and found moderate overall reliability during retesting after 1 week. Reliability varied among categories of the ICF (weighted kappa values=.46 for body functions and .55 for activity and participation). Van Triet et al17 studied the intertester reliability of a schedule based on the International Classification of Impairments, Disabilities, and Handicaps (ICIDH) in patients with musculoskeletal problems. The ICIDH18 is the predecessor of the ICF. Kappa values ranged from –.06 to 1.00 and were higher in "disability" categories than in "impairment" categories.
The study by van Triet et al,17 however, is clearly outdated because it is based on the ICIDH. The authors departed greatly from the categories of the classification and from its qualifier scale in creating their assessment schedule. Both studies16,17 were conducted with poorly specified mixed samples; thus, the results were not generalizable to the functioning ratings for patients with stroke. In addition, the investigators in both studies made arbitrary selections of various areas of functioning, not covering the full scope of the ICF, as reflected in the carefully chosen categories of the Extended ICF Core Set for Stroke. In neither of the studies did the investigators consider the full qualifier scale in their analyses. The investigators in both studies applied designs in which different raters completed their recording of patients' functioning at different time points, mixing variation of time points with variation of raters. Thus, the type and the amount of information underlying the ratings might not have been comparable.
Although Okochi et al16 examined the influence of the experience of raters on retest reliability, rater confidence and core competence might be more proximate variables connected to reliability. Within the context of reliability, rater confidence is an important variable frequently examined in clinical research. For example, in studies dealing with the reliability of imaging techniques19,20 and behavioral observations,21,22 confidence ratings are often used as independent outcomes to demonstrate diagnostic accuracy. The results of these studies hint at a possible relationship between agreement and confidence. Confidence might serve as an explanatory factor for rater agreement. Thus, with regard to the reliability and the application of the Extended ICF Core Set for Stroke, the association of rater agreement and rater confidence is of interest.
Furthermore, in reliability studies, the experience and training of raters seemed to be highly relevant and were frequently reported.23–29 In these studies, "experience" and "training" referred not only to the handling of the specific rating instrument used, but also to the clinical experience of the raters within the field and the concepts to be rated and the patient group with the given disease condition. These studies drew an equivocal picture of the relationship between rater competence and interrater reliability and suggested that the different results might have depended on the specific rating instrument examined. Thus, the role of raters' areas of competence should be considered for any new rating tool.
Therefore, the overall objective of this investigation was to study the interrater reliability of physical therapists' ratings of the functioning of study participants with the Extended ICF Core Set for Stroke. The specific aims were: (1) to study the agreement of the 2 physical therapists in rating participants' functioning with the Extended ICF Core Set for Stroke, (2) to explore the relationship between rater agreement and rater confidence, and (3) to explore rater agreement in relation to physical therapists' areas of core competence.
| Method |
|---|
To avoid bias caused by different levels of information, the participants were known to neither of the interviewers or to both. In addition to the disability and confidence ratings, sociodemographic and stroke-related data were recorded for each participant.
Analyses
To study the agreement of the 2 physical therapists in rating participants' functioning with the Extended ICF Core Set for Stroke, the overall percentage of observed agreement and an overall kappa coefficient32 with the associated bias-corrected 95% bootstrap confidence interval33 were calculated. These calculations included all judgments across all ICF categories and participants. In addition, for each ICF category, the percentage of observed agreement, the kappa coefficient, and the associated bias-corrected 95% bootstrap confidence interval were calculated.
Because the ICF qualifiers "8: not specified" and "9: not applicable" cannot be integrated into the ordinal scale of the ICF qualifiers 0 to 4 and –4 to 4, respectively, a Cohen kappa statistic for nominal scale response categories was used. The kappa statistic is a measure of agreement that exists beyond the amount of agreement expected by chance alone. Kappa values generally range from 0 to 1, where 1 indicates perfect agreement and 0 indicates no additional agreement beyond that expected by chance alone. A negative kappa value indicates less agreement than that expected by chance alone. The bias-corrected 95% bootstrap confidence interval33 allows determination of the precision of kappa values without assumptions being made about the homogeneity of the marginal distributions in the data. The bootstrap confidence interval is especially useful with small sample sizes.
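The two statistics described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SAS procedure used in the study: the function names are hypothetical, the resampling unit (individual pairs of ratings) is an assumption, and the interval uses the simple bias-corrected (BC) bootstrap without an acceleration term.

```python
import random
from collections import Counter
from statistics import NormalDist


def observed_agreement(r1, r2):
    """Share of judgments in which both raters chose the same qualifier."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Unweighted (nominal) Cohen kappa for two raters."""
    n = len(r1)
    p_o = observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the two raters' marginal proportions.
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if p_e == 1.0:  # both raters used a single identical category
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


def bc_bootstrap_ci(r1, r2, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected bootstrap interval for kappa (assumed resampling
    unit: one pair of ratings; no acceleration term)."""
    rng = random.Random(seed)
    n = len(r1)
    theta = cohen_kappa(r1, r2)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([r1[i] for i in idx], [r2[i] for i in idx])
        if k == k:  # drop undefined (NaN) replicates
            boots.append(k)
    if not boots:
        raise ValueError("no valid bootstrap replicates")
    boots.sort()
    b = len(boots)
    nd = NormalDist()
    # Bias correction: how far the bootstrap median sits from the estimate.
    frac = sum(v < theta for v in boots) / b
    frac = min(max(frac, 1 / (b + 1)), b / (b + 1))  # keep inv_cdf finite
    z0 = nd.inv_cdf(frac)
    lo_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    lo = boots[min(b - 1, int(lo_p * b))]
    hi = boots[min(b - 1, int(hi_p * b))]
    return lo, hi
```

Because the qualifiers are treated as plain labels, the codes 8 and 9 need no special handling here; any disagreement counts the same regardless of how far apart the two qualifiers are.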
It is well known that kappa values depend on the prevalence of the attribute measured and can show biased results with skewed marginal distributions.34–36 Consequently, kappa values for various ICF categories cannot be compared with each other properly because their baseline prevalences are unknown and their marginal distributions may be more or less skewed. Thus, in the present study, only information about whether kappa values exceeded chance or not was used for comparisons across ICF categories.37 The percentage of observed rater agreement was preferred as the indicator for the level of agreement. Emphasizing the actual observed agreement can be further justified because the kappa statistic is a chance-corrected measure of agreement, but the definite role of chance in the rating process is not clear.34,35
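The prevalence dependence can be made concrete with two hypothetical 2×2 rating tables (the counts are invented for illustration and are not study data). Both tables have the same observed agreement of 80%, yet kappa differs markedly because the skewed table has a much higher chance-expected agreement.

```python
def agreement_and_kappa(table):
    """Return (observed agreement, Cohen kappa) for a square table of
    counts, with rows for rater 1 and columns for rater 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement from the marginal totals of both raters.
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)


# Hypothetical counts: [[both "problem", r1 only], [r2 only, both "no problem"]]
balanced = [[40, 10], [10, 40]]  # problem present in about half of the cases
skewed = [[75, 10], [10, 5]]     # problem present in most cases

print(agreement_and_kappa(balanced))
print(agreement_and_kappa(skewed))
```

In the balanced table the chance-expected agreement is 0.50, giving a kappa of 0.60; in the skewed table it is 0.745, giving a kappa of about 0.22 despite identical observed agreement.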
The levels of confidence of the 2 raters were compared across all judgments. For this purpose, first the confidence ratings of each rater were checked for a normal distribution with the Kolmogorov-Smirnov test. The t test for independent samples could be used to check the null hypothesis of no difference in the levels of confidence of the 2 physical therapists when the requirement for a normal distribution of the data was met. Otherwise, the nonparametric Mann-Whitney U test was applied.
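As a rough illustration of what the nonparametric alternative measures, the following sketch computes the Mann-Whitney U statistic directly from its pairwise definition. The function name and the confidence ratings are hypothetical, no P value is computed, and the study itself ran these tests in SPSS.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with
    x_i > y_j, counting ties as one half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical confidence ratings (percentages) for two raters.
rater_1 = [90, 80, 85, 70, 95]
rater_2 = [100, 100, 90, 95, 100]

u_1 = mann_whitney_u(rater_1, rater_2)
u_2 = mann_whitney_u(rater_2, rater_1)
# The two U values always sum to the number of pairs.
print(u_1, u_2, len(rater_1) * len(rater_2))
```

Dividing U by the number of pairs gives the proportion of pairs in which one rater's confidence exceeds the other's, which is why the test does not depend on normally distributed ratings.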
For exploring the relationship between the level of rater agreement and rater confidence, Pearson correlation coefficients were calculated across all judgments for all ICF categories and participants. The level of agreement was quantified as the percentage of observed agreement in each category. The overall correlation coefficient and the percentage of variance in the level of rater agreement explained by confidence are reported. When the levels of confidence of the 2 raters were different, the correlation between the level of agreement and confidence and the percentage of variance explained by confidence were reported for the 2 raters separately as well.
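A minimal sketch of this calculation follows. The pairing of per-category agreement with mean confidence and all numbers are assumptions made for illustration; only the relationship between r and the percentage of variance explained (100 × r²) is general.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percent_variance_explained(r):
    """Share of variance in one variable accounted for by the other."""
    return 100.0 * r * r


# Hypothetical values: observed agreement (%) per ICF category and the
# mean confidence rating (%) given in that category.
agreement = [37, 50, 63, 70, 43, 57]
confidence = [85, 90, 88, 95, 92, 86]

r = pearson_r(agreement, confidence)
print(round(r, 2), round(percent_variance_explained(r), 1))
```

Because the explained variance is the square of r, small correlations translate into very small explained shares; a correlation of .10, for example, corresponds to 1% of the variance.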
For exploring rater agreement in relation to physical therapists' areas of core competence, each ICF category of the Extended ICF Core Set for Stroke was classified according to the results of Glaessel and colleagues (Glaessel A, Kirchberger I, Cieza A, Stucki G; 2007; unpublished research) as a "core competence ICF category" for physical therapists or as a "not-core competence ICF category." ICF categories that have been agreed on by a panel of international experts as areas being treated by physical therapists were classified as core competence ICF categories. For these ICF categories, at least 80% of the experts agreed that patients' problems in the areas considered were being treated by physical therapists.
The levels of agreement in the 2 classes of ICF categories (core competence versus not-core competence) were compared by use of the t test for independent samples after the data were examined for the requirement for a normal distribution with the Kolmogorov-Smirnov test. If the requirement for a normal distribution was not met, then the Mann-Whitney U test was applied. The variance in the level of rater agreement accounted for by the variable "core competence" was reported with the Spearman correlation coefficient.
Analyses were conducted with SPSS,*,38 except for the kappa and bias-corrected 95% bootstrap confidence interval calculations, which were conducted with SAS.†,39
| Results |
|---|
Figure 2 provides information about the physical therapists' level of confidence in their ratings (for details, see Supplemental Tables 1, 2, 3, and 4). The medians of the confidence ratings were 90% (range=35%–100%) for rater 1 and 100% (range=70%–100%) for rater 2. For raters 1 and 2, the confidence ratings across all participants and all 166 categories of the Extended ICF Core Set for Stroke showed a normal distribution. The t test for independent samples confirmed a statistically significant difference between the 2 raters (P<.00) with regard to their levels of confidence. Thus, the relationship between confidence and agreement was examined separately for each rater. Pearson correlation coefficients were significantly different from 0 (P<.01) when calculated overall as well as for the 2 raters separately; however, the strength of the correlations was negligible (r=.08 overall, r=.1 for rater 1, and r=.07 for rater 2).
=.26, P=.74).
| Discussion |
|---|
The overall result for interrater agreement was a kappa value of .41. This value can be interpreted as "moderate" according to Altman40 but as "poor" according to El Emam.41 However, interpreting kappa values according to recommendations in the literature is only an initial step. The acceptability of a certain reliability value depends on the intended uses and types of decisions based on the instrument. Instruments used for decisions about an individual should be more reliable than instruments used for decisions about a group or for research purposes. The interrater reliability of the qualifier scale used with the Extended ICF Core Set for Stroke should be enhanced in the future. However, for research purposes involving large samples, it might be acceptable. Furthermore, the usefulness of the Extended ICF Core Set for Stroke in current physical therapy practice essentially is based on its capability to integrate the results of different examinations, measurements, different professionals' clinical observations, and patients' self-reports into one and the same common framework. Decisions in current ICF-based practice rely on such an integrative perspective and use the complete ICF functioning profile of a patient to select and prioritize intervention goals, to decide on the course of interventions, and to manage the rehabilitation process.42 The 2 physical therapists in the present study gathered and analyzed a large amount of data in rating the 166 categories of the ICF Core Set for each participant. They did so largely without being able to rely on the results of standardized measurements and within a clinical environment that was not yet familiar with ICF-based practice. Under these conditions, the overall kappa value of .41 may be regarded as a promising starting point.
In a recent study, Uhlig et al43 examined the interrater reliability of the ICF Core Set for Rheumatoid Arthritis in a sample of 25 people seen on an outpatient basis. The mean observed agreement across the 95 ICF categories was 47%, a value slightly lower than that in the present study of the Extended ICF Core Set for Stroke (52%). However, the patterns of agreement across the components of the ICF were very similar in both studies (higher for functioning and lower for environmental factors). Uhlig et al43 did not report an overall kappa value, but they indicated that 64% of the ICF Core Set for Rheumatoid Arthritis showed agreement beyond chance. Only 52% of the categories of the Extended ICF Core Set for Stroke demonstrated a significant kappa value. However, Uhlig et al used weighted kappa calculations instead of the nominal kappa calculations used in the present study, which are a more conservative approach regarding the characteristics of the ICF qualifier scale.
A recent methodological study by Grill et al44 exemplified various approaches to the assessment of rater agreement, including observed agreement rates and kappa calculations. They used as an example the ICF category "d430: lifting and carrying objects" in a survey with a repeated-measurement design of a convenience sample of 25 patients in an acute-care hospital setting. For this single ICF category, Grill et al found results similar to those of the present study. The observed agreement rate was 52%; that in the present study was 57%. The weighted kappa value was .36; the nominal kappa value in the present study was .46 (95% confidence interval=.26–.69).
The observed rater agreement and the results of the chance-corrected kappa analyses suggested that interrater reliability was notably higher for the components of functioning than for the environmental factors. Thus, the level of rater agreement might be connected to content-related features in the various ICF categories. Categories with high agreement were likely to be narrow categories with clearly defined concepts and a less complex rating process. The judgments in these categories could easily be obtained from observation (b215: function of structures adjoining the eye) and common knowledge (d940: human rights), taken from a medical chart (s120: spinal cord and related structures), or obtained by asking the participant (d850: remunerative employment).
Categories with low agreement were mainly broad and complex categories (eg, e440: individual attitudes of personal care providers and personal assistants) frequently encountered in environmental factors. However, for body functions and activity and participation as well, several categories subsumed broad contents and therefore might be difficult to agree on (eg, d455: moving around; and b240: sensations associated with hearing and vestibular function). For broad ICF categories, raters frequently face inconsistent or vague information because of conflicting statements from patients, proxies, and medical charts45 or unawareness of the patient.31 For example, the raters often observed that participants were distracted during the interview or did not completely understand the questions. However, when asked specifically, the participants stated no subjective impairment in "listening" (d115) and "focusing attention" (d160). In the absence of neuropsychological records, the judgments were based on the personal impressions of the physical therapists.
Low agreement in some categories (eg, d640: doing housework; and e425: individual attitudes of acquaintances, peers, colleagues, neighbors, and community members) might have resulted from the fact that most inpatient participants were not yet confronted with their home environment and usual tasks. Thus, the ratings of their difficulties in these categories were based on inferences rather than direct observation and factual information.
However, the difference in agreement between the components of functioning and the environmental factors might also have been influenced by the different numbers of response options for the ICF qualifier scale. For the environmental factors, the qualifier scale contains more steps (–4 to +4), the variability is higher, and it is more difficult to achieve exact agreement between 2 raters.
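The effect of the number of response options can be illustrated with a deliberately simplified model: if two raters chose independently and uniformly at random among k options, their expected exact agreement would be 1/k. The uniform-choice assumption is made here for illustration only and does not describe how the raters in the study behaved; the option lists assume the qualifiers 0 to 4 or –4 to +4 plus the codes 8 and 9.

```python
from fractions import Fraction


def chance_exact_agreement(n_options):
    """Expected exact agreement of two independent raters who each pick
    uniformly at random among n_options response options."""
    return Fraction(1, n_options)


# Qualifier codes per component (8 = not specified, 9 = not applicable).
functioning_options = list(range(0, 5)) + [8, 9]     # 0..4, 8, 9
environmental_options = list(range(-4, 5)) + [8, 9]  # -4..+4, 8, 9

for name, options in [("functioning", functioning_options),
                      ("environmental factors", environmental_options)]:
    p = chance_exact_agreement(len(options))
    print(f"{name}: {len(options)} options, chance agreement {float(p):.1%}")
```

Under this toy model, exact agreement by chance falls from about 14% with 7 options to about 9% with 11 options, which is consistent with the direction of the difference discussed above.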
The ratings of the participants' functioning following stroke were based on information gathered by interviewing, observing, and reading medical records and were rarely based on measurement results. For many areas encompassed by the ICF Core Set, no measures exist or measures are not routinely used in the clinical setting. Developing standardized measurement methods and operationalizations, such as patient self-rating questionnaires, for these areas is a future step that needs to be based on, guided, and justified by the evaluation of the ICF Core Set and the qualifier scale as they are now, a goal that has been undertaken in the present study.
It is important to stress that the ICF Core Set is not a measurement instrument but rather a classification-based tool, with the qualifiers assigned being global severity judgments but not measures. Thus, common rater effects (such as the halo effect and severity-leniency errors), the raters' knowledge, attitudes, and beliefs, the interaction between the rater and the patient, and demographic characteristics may influence the rating process in a highly individual way.46–48 Within the ICF Core Set, these effects are not controlled for and most probably contribute to nonagreement in several categories. Severity ratings may have differed between the physical therapists because of their personal characteristics (eg, if one was stricter while the other was more serene about the participants' state in general). Confidence ratings may have differed because one of the physical therapists, as the leader of the present study, was familiar and connected with ICF-related issues in a more in-depth way than the other, despite the ICF training completed before the study. The present study involved a simple design and a straightforward method of analysis of rater agreement that did not take into account the factors that influence the rating process in a systematic way. However, in future studies, systematic variations in the rating process could be examined and accounted for by use of a more comprehensive design and modeling methods or Rasch analysis techniques.46
The present study revealed no relationship between agreement and confidence. From a statistical perspective, this result was attributable to the low variability of the confidence ratings of the 2 physical therapists. Two possible explanations might account for this low variability. Some studies on the cognitive processes of decision and judgment provide evidence that confidence in judgment can be understood in terms of stable interindividual differences or as a personality trait.49,50 Thus, in light of this evidence, it is not surprising that the confidence judgments in the present study showed only low variations across different rating situations, that is, different participants and categories, but remained relatively constant for a given individual or rater.
In addition, the ICF qualifier scale includes the option of choosing the qualifier "8: not specified," which can be applied when the available information is not sufficient to make a judgment. Thus, low confidence levels can be avoided by raters by use of the qualifier 8. Doing so reduces the variability of the confidence ratings and complicates any relationship between confidence and agreement. Therefore, future studies on the reliability of the ICF qualifier scale, including confidence ratings, should omit the qualifier "8: not specified." However, in the present study, the effect of the use of the qualifier 8 on the confidence-agreement relationship might have only minor relevance, because only 5% of the judgments of rater 1 and 7% of those of rater 2 were assigned the qualifier "8: not specified" (data not shown).
The results of the present study further showed that rater agreement was independent of the raters' areas of core competence as physical therapists. The raters achieved a high level of agreement in several categories that were not among the physical therapists' areas of core competence (eg, d330: speaking; d710: basic interpersonal interactions; d850: remunerative employment; and d845: acquiring, keeping, and terminating a job). ICF categories defined as physical therapists' areas of core competence are aspects of functioning that are treated by physical therapists. However, physical therapists are trained to have comprehensive knowledge of stroke and are experienced in observing and detecting the full scope of patients' problems, for example, when taking their history. This means that physical therapists are well able to identify patients' problems, for example, in the category "d330: speaking," even though these problems are typically treated by speech and language therapists. That is, although the ICF categories that cover the core competencies describe the areas in which physical therapists are usually trained, their experience, skills, and knowledge as health care professionals working in an interdisciplinary team surpass the specified focus of these ICF categories.
At this point, the potential of the ICF Core Set as a basis for multidisciplinary communication and cooperation in rehabilitation practice and management arises. Still, the question of agreement and interrater reliability among health care professionals of different specializations remains open. Future studies involving various health care professions should be conducted to clarify this question.
However, the results also revealed several categories that addressed physical therapists' areas of core competence but that were rated differently by the 2 physical therapists (eg, b176: mental functions of sequencing complex movements; b755: involuntary movement reaction functions; and b260: proprioceptive functions). These results indicated that for the participants in the present study, the information available on problems was not based on measurements and in-depth examinations but relied on global impressions from the interviews.
The present study has several limitations. Because of the monocentric design, the small sample size, and the small number of participating raters, the generalizability of the results is limited, and the results need to be interpreted with caution. The results suggest next steps for future investigations. In addition, the present study addressed interrater reliability in terms of agreement between 2 raters. It did not address the quality or "truth" of the ratings. Currently, no gold standard exists against which a description of patients' problems across the categories of the ICF Core Set can be compared.
The present study suggests potential improvements to enhance the implementation of the Extended ICF Core Set for Stroke in practice. To enhance interrater reliability, the training of health care professionals with regard to the ICF should be further developed and standardized. Implementing the ICF in rehabilitation practice at an institutional level may enhance the availability and accessibility of information about all aspects of functioning in individual patients, which in turn may enhance the reliability of ICF-based ratings.8 In addition, the metric characteristics of the ICF qualifier scale should be taken into account. In particular, the distance between the steps of the scale may be too narrow and therefore may lead to disagreements. Thus, examining and restructuring the rating scale, for example, with Rasch analyses, may also enhance its interrater reliability.43 However, the results of the present study mainly hint at the importance of operationalization of the categories and standardization of the rating process to control for rater effects and to increase reliability.
In the future, 2 paths toward the operationalization of ICF categories can be followed, namely, the development of ICF-based measures and the development of detailed ICF manuals. Efforts in the latter direction are already being made, for example, by the American Psychological Association51 and the Australian Institute of Health and Welfare.52 Operationalizing the categories of the Extended ICF Core Set for Stroke could be an important next step to ease and to facilitate the application of the ICF in clinical practice and to use its full potential. Physical therapists can make valuable contributions to these developments to enhance professional, scientifically founded, multidisciplinary practice for the benefit of patients.
| Footnotes |
|---|
This study was approved by the Ethics Committee of the University of Munich.
* SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.
† SAS Institute Inc, PO Box 8000, Cary, NC 27513.