Research Reports
In his highly informative, cutting-edge commentary1 on our article,2 Jette takes up the theme of interrater reliability in the application of the Extended ICF Core Set for Stroke to point out and discuss the challenges of operationalization and measurement in relationship to the implementation of the World Health Organization's International Classification of Functioning, Disability and Health (ICF).3 Jette presents and explains with deep insight important facts about the ICF, the Extended ICF Core Set for Stroke, and the concept of the ICF qualifiers. He summarizes the value of the ICF and the ICF Core Sets and their different fields of use. In addition, he emphasizes the relevance of the ICF framework within and beyond the field of physical therapy.
In turning to our current study and the question of interrater reliability in the context of ICF implementation, Jette highlights the need for meeting methodological standards in measurement: "the ICF classification system must meet key methodological standards, one of which is acceptable reliability of the ratings when performed by different raters." He concludes that our study "raises serious questions about the methodological adequacy of the ICF Core Set ratings" and that the "methodological concerns appear inherent in the approach" of the ICF and ICF Core Sets.
We agree with Jette that moderate interrater reliability (overall kappa of .41) represents the bottom level of quality requirements for an outcome measure. Reliability for outcome measures definitely should be higher for clinical purposes (eg, detecting change during rehabilitation) and for far-reaching decisions about an individual patient's life (eg, post-rehabilitation discharge to the community or to an institutional setting).
In appraising the reliability of the Extended ICF Core Set for Stroke, first of all, one fundamental issue has to be taken into account: an ICF Core Set is not a measure at all, but is a classification-based tool. The key difference between a measure and a classification tool is that a classification can serve as the reference among different measures, different approaches, different perspectives at different facilities, and even different countries, thus being the framework and organizing principle where all information can flow together into the same common system. This means that a classification-based tool can be used to organize qualitative information (eg, "How are you doing today, Mrs Jones?") as well as quantitative information (eg, Life Satisfaction Scale4 scores), results of observer-rated (eg, Functional Independence Measure5) and patient-reported (eg, visual analog scale for pain) outcomes, physical measures (eg, grip strength dynamometer) and questionnaire scores (eg, Hand Function domain of the Stroke Impact Scale), and so on.
In the current study, the data collections using the ICF Core Sets relied on clinical judgments combining all different kinds of available information (observation, interviews, measurements, and records), summarizing them into one qualifier per ICF category. Considering clinical judgments in relationship to reliability, the question that arises is: Does the Extended ICF Core Set for Stroke make the clinical judgments of physical therapists more or less reliable? The results of the current study illustrate that using the ICF Core Set does not bring clinical judgments of experienced and well-trained physical therapists to a level of metric quality, as would be desirable for standardized quantitative measurements. However, the results document where we are now in terms of ICF operationalization and from what point we start on the way from expert clinical judgments, which can have very diverse levels of interrater reliability,6–8 to quantitative representation of patients' functioning state by high-quality standardized measures. The results for interrater reliability of the single ICF categories can be taken as indications of (1) areas of functioning in which it seems to be justifiable to rely on clinical judgments and (2) areas of functioning in which efforts to develop elaborate operationalizations are necessary. In addition, if an ICF-based instrument or an ICF manual for the Extended ICF Core Set for Stroke is developed and tested for interrater reliability, then it might be possible to conclude, using the results of the current study, how much more reliable these developments are in contrast to simply mapping the clinical judgment into the ICF's qualifier scale.
In his commentary, Jette refers to measurement as a contrast or alternative to classification. However, it is important to note: To measure or to classify, that is not the question. The quantitative results of assessments are the objects that enter the classification, which is the ordering principle or the structure to store, retrieve, and convey any kind of information about patients' functioning, disability, and health. In this way, the results of quantitative measurement are the flesh or the muscles attached to the bones of the ICF classification. Therefore, obviously, there is no antagonism between classification and standardized quantitative interval scale measurement. On the contrary, they are complementary approaches, each backing up the other. The ICF Core Set approach can indicate what ICF concepts to measure in a certain health condition9 or in a specific setting,10 and, in turn, the results of different measurements can be organized into the framework of the ICF.
In considering the results of our study, the overall reliability indexes should be interpreted with caution. A conclusive final judgment on interrater reliability should not rely only on kappa or percentage of exact agreement because of certain technical features of these indexes. Nominal kappa especially is not sensitive to other important factors (besides chance variation) that may influence reliability and that may lie outside the features of the rating tool. Thus, kappa analyses can be seen only as a first step in gaining a full picture about interrater reliability in the use of the Extended ICF Core Set for Stroke. Further studies are needed to arrive at a definitive conclusion regarding interrater reliability and, consequently, to arrive at well-founded recommendations about suitable strategies for its improvement.
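One well-known technical feature of these indexes can be illustrated with a minimal sketch. It assumes two raters and a simplified two-level qualifier (0 = no problem, 1 = mild problem); all ratings below are hypothetical and are not data from our study. The function names `percent_agreement` and `cohen_kappa` are illustrative only. The sketch computes the percentage of exact agreement and the unweighted (nominal) Cohen kappa for two samples that differ only in how common each qualifier is.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of patients for whom both raters chose the same qualifier."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of patients.")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Unweighted (nominal) Cohen kappa for two raters."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal distribution.
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical sample 1: both qualifiers are equally common.
balanced_a = [0] * 10 + [1] * 10
balanced_b = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8

# Hypothetical sample 2: qualifier 0 dominates (skewed prevalence).
skewed_a = [0] * 17 + [1] * 3
skewed_b = [0] * 15 + [1] * 2 + [0] * 2 + [1] * 1

for label, a, b in [("balanced", balanced_a, balanced_b),
                    ("skewed", skewed_a, skewed_b)]:
    print(f"{label}: agreement={percent_agreement(a, b):.2f}, "
          f"kappa={cohen_kappa(a, b):.2f}")
```

In both hypothetical samples the raters agree on 16 of 20 patients (80%), yet kappa is 0.60 in the balanced sample and about 0.22 in the skewed one. The difference arises only from the distribution of qualifiers in the sample, not from any change in the rating tool, which is one reason why kappa alone cannot settle the question of reliability.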
Additionally, it is to be noted that the usefulness of a tool with a specific reliability depends on the purpose for which it is applied. In principle, an instrument with moderate reliability could be used in large samples (eg, for survey purposes) if no better alternative is available. There is currently no alternative to the use of ICF-based tools and no other standard for describing the full scope and all relevant and specific aspects of disability and functioning in a universal, etiology-neutral, cross-disciplinary way.
Our experience with the data collection for the current study showed that it was difficult to achieve exact agreement in ICF categories that are especially broad. Even the most precise ICF category contains and addresses several different aspects of functioning and disability at the same time. This breadth of the categories, in the sense of openness and nonrestrictiveness, is a major advantage of the classification, as it ensures the ICF's applicability in different contexts and for different purposes. Therefore, changing the ICF—as suggested in the commentary's final paragraph—might not be the most practical and beneficial way to enhance measurement standards in the implementation of the ICF. Instead, establishing accompanying operationalizations and their explicit linking to the categories and qualifiers of the classification seems to be a more appealing solution. A suitable approach has been developed by Cieza et al.11
Thus, we fully agree with Jette's commentary about the challenge of operationalization and the need to meet the methodological standards of objectivity, validity, reliability, and feasibility. Clearly, ICF-based measurement instruments need to be developed. Instruments that are based on the ICF's conceptual framework, such as the Burden of Stroke Scale,12 the Stroke Impact Scale,13 and so on, are already an important step in this direction. However, the next step to be taken is the development of measures based on the classification itself (ie, based on the categories of the ICF).
Without doubt, Jette's clear-sighted argument for the application of item response theory (IRT) can only be supported. Today, IRT is the method of choice for developing high-quality measurement instruments and for facilitating computer adaptive testing, the coming method of choice for assessment applications, as it has been realized for the Activity Measure for Post-Acute Care (AM-PAC)14 cited by Jette.
In line with Jette, it can be concluded that there is a need for measures that are consistent with the ICF as a model and as a classification, as well as for developing algorithms that make the relationship between different measures and the classification explicit to enhance the objectivity of the qualifier scaling and to facilitate the implementation of the ICF in physical therapists' everyday clinical practice.