Research Reports
NC Foley, MSc (Candidate), is Research Associate, Department of Physical Medicine and Rehabilitation, Parkwood Hospital, St Joseph's Health Care London, London, Ontario, Canada. Address all correspondence to Ms Foley at 801 Commissioners Rd East, London, Ontario, Canada N6C 5J1
SK Bhogal, MSc, is PhD Candidate, Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada
RW Teasell, MD, FRCPC, is Professor and Chair/Chief, Department of Physical Medicine and Rehabilitation, Parkwood Hospital, St Joseph's Health Care London and the University of Western Ontario, London, Ontario, Canada
Y Bureau, PhD, is Statistical Consultant, Imaging Program, Lawson Health Research Institute, London, Ontario, Canada
MR Speechley, PhD, is Associate Professor, Department of Epidemiology and Biostatistics, Faculty of Medicine and Dentistry, Schulich School of Medicine and Dentistry, University of Western Ontario
(norine.foley{at}sjhc.london.on.ca)
Submitted July 28, 2005;
Accepted January 16, 2006
| Abstract |
|---|
Key Words: Meta-analysis, Quality assessment, Reliability, Systematic review.
| Introduction |
|---|
Most quality assessment tools provide standardized administration guidelines to ensure uniform application; however, the scores awarded by abstractors depend markedly on the level of methodological detail described in each study. When reporting lacks clarity, individual interpretation may differ between abstractors, affecting the consistency of agreement and thereby reducing reliability. Disagreements typically are resolved by third-party review or through arbitration between reviewers. Surprisingly, although many scales are in use, few estimates of reliability have been published.
The Physiotherapy Evidence-Based Database (PEDro) Scale, developed by the Centre for Evidence-Based Physiotherapy, is an example of one such quality assessment scale.2 The scale is based largely on the Delphi List3 and was developed to assess the methodological quality of RCTs specifically pertaining to physical therapy interventions that were included in the database. The interrater reliability of the PEDro Scale was assessed previously in only a single trial, and no studies assessing the reliability of the Delphi List have been published. By use of the kappa statistic for pair-wise comparisons, reliability estimates determined with the PEDro tool for 2 raters assessing 120 RCTs were found to range from .50 to .79 after consensus was achieved (.36 to .80 before consensus). The intraclass correlation coefficient (ICC) for total scores was .56 (95% confidence interval [CI]= .47–.65) for ratings by individual raters. The percentage of agreement ranged from 70% to 98%.4
In an effort to determine which tools had been used previously to assess the methodological quality of published reviews, we surveyed 10 randomly selected reviews that evaluated physical therapy interventions from the Cochrane Database of Systematic Reviews. We found a wide range of approaches used to assess the methodological quality of individual RCTs.5–14 Most reviews used a qualitative, checklist approach, whereby individual methodological components were noted to be present or absent, but a total score was not determined. The number of quality items assessed for adequacy ranged from 1 to 10 and most frequently included randomization, allocation concealment, masking, intention-to-treat analysis, and accounting for dropouts. In 1 case, although 10 individual items had been summed, the authors noted that the purpose was to gain an overall impression of quality, and the data were not used for quantitative purposes.8 Three reviews used previously validated tools to assess methodological quality. The Jadad Scale15 was used in 2 reviews,13,14 and the PEDro Scale was used in the third.9 Two reviews used the Delphi List, another previously validated tool, with modifications.5,6 Two reviews quantitatively assessed only whether concealed allocation had been adequately described, although they included a more comprehensive list of quality indicators.7,8 None of these reviews reported estimates of reliability between raters.
We previously used the PEDro Scale to assess the methodological quality of 272 RCTs that were included in a systematic review of the stroke rehabilitation literature.16 In addition to physical and rehabilitation therapies (n=215), many of the therapies assessed in this review were pharmacological or surgical (n=57). The methodological quality of pharmacological trials included in this review was found to be significantly higher than that of nonpharmacological trials when the PEDro Scale was used (mean±SD=6.77±1.3 versus 5.53±1.3; P<.0001).17 The difference in quality between study types was largely attributable to the inherent difficulty of designing single-blind studies (ie, those in which participants are not aware of their group assignments) for nonpharmacological interventions, although double-blind studies (ie, outcome assessors are not aware of group assignments) also were less frequent for nonpharmacological interventions. As a means of formulating final conclusions in this review, only studies that achieved a PEDro score of 6 or greater were used when there was an abundance of evidence. Although an assessment of the reliability of the PEDro Scale was not included in this review and could not be conducted retrospectively, it was of interest to establish whether reliability estimates would vary depending on intervention type (pharmacological versus nonpharmacological).
Therefore, the purposes of this study were to assess how well 2 examiners agreed on specific items when using this scale (reliability), to determine whether the reliability and methodological quality differ between pharmacological and nonpharmacological studies, and to identify what aspects of RCTs tend to detract from their quality because these aspects are not incorporated into the study's design or they are not reported or stated clearly. We anticipated that there would be good agreement between study types (pharmacological and nonpharmacological) for individual PEDro items and that the composite scores for pharmacological trials would once again be higher than those for nonpharmacological trials because of the inability in the latter to keep subjects unaware of group assignments (masking). Discrepancies attributable to interpretation and differences in scoring patterns between study types are discussed, with an emphasis on highlighting the practical considerations encountered with the PEDro Scale under typical use.
| Method |
|---|
PEDro Scale
The PEDro Scale consists of 10 criteria assessing the quality of study components related to internal validity2 (Tab. 1). Each item receives either a "yes" or a "no" score. The maximum score that a study can receive is 10. The PEDro score allocates up to 3 points for the level of masking achieved (eg, masking of subject, therapist, and assessor), 2 points for randomization procedures (random allocation, concealment of allocation), 3 points for the reporting of appropriate data (baseline characteristics, between-group comparisons, and point and range estimates of efficacy), and 1 point each for analysis of data (intention-to-treat analysis) and adequacy of follow-up. For the purposes of this review, follow-up (criterion 7) was considered adequate if all of the originally randomized participants were accounted for at the end of the study. This interpretation differs from that described by the PEDro Scale, which defines adequacy as the measurement of the main outcome in more than 85% of the participants. We modified this criterion because we believed that substantial bias could be introduced through imbalanced dropout rates between groups, even though 85% or more of the original participants were analyzed.18
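The point allocation described above can be illustrated with a short Python sketch. It is an illustration only: the item labels, the function names `pedro_total` and `follow_up_adequate`, and the example ratings are hypothetical shorthand, not official PEDro wording or data from this study.

```python
# Hypothetical shorthand labels for the 10 scored items, grouped as in the text.
PEDRO_ITEMS = (
    "random_allocation",                # randomization procedures (2 points)
    "concealed_allocation",
    "baseline_comparability",           # reporting of appropriate data (3 points)
    "between_group_comparisons",
    "point_and_variability_estimates",
    "subject_masking",                  # masking (3 points)
    "therapist_masking",
    "assessor_masking",
    "adequate_follow_up",               # follow-up (1 point)
    "intention_to_treat",               # analysis of data (1 point)
)


def pedro_total(ratings):
    """Sum yes/no ratings (True/False) over the 10 scored items."""
    missing = [item for item in PEDRO_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"unrated items: {missing}")
    return sum(1 for item in PEDRO_ITEMS if ratings[item])


def follow_up_adequate(n_randomized, n_counted, rule="modified"):
    """Return True if follow-up meets the chosen rule.

    rule="modified": every randomized participant is accounted for
                     (the interpretation described in the text).
    rule="pedro":    the main outcome was measured in more than 85%
                     of participants (the scale's own definition).
    """
    if rule == "modified":
        return n_counted == n_randomized
    if rule == "pedro":
        return n_counted / n_randomized > 0.85
    raise ValueError("rule must be 'modified' or 'pedro'")


# Hypothetical study: every item met except subject and therapist masking.
example = {item: True for item in PEDRO_ITEMS}
example["subject_masking"] = False
example["therapist_masking"] = False
print(pedro_total(example))
```

Under these assumptions, a trial that retains 90 of 100 randomized participants would satisfy the scale's 85% rule but not the stricter rule applied in this review.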
The methodology of each study was scored by 2 experienced, independent raters who were familiar with the PEDro tool and who were well matched in terms of education and knowledge in research methodology (NCF and SKB), although neither had formal training in the application of the PEDro tool. Both raters were unaware of each other's results until all of the studies were assessed, at which point discrepancies were identified and discussed. Discrepancies were classified as "error" or "interpretation." Errors were resolved easily when it was evident that 1 of the abstractors had simply missed the relevant passage in the original article, and consensus was easy to achieve. Interpretation discrepancies occurred when the abstractors interpreted the presence or absence of an item differently because of its presentation in the original article. Items of disagreement and reasons for discrepancies were recorded and tabulated.
Statistical Analysis
Both mean (±SD) and median (interquartile range) composite PEDro scores, achieved after consensus was reached, were analyzed. Differences in median scores between pharmacological and nonpharmacological studies were analyzed by use of the Mann-Whitney U test. Differences in proportions of studies meeting criteria between intervention types (nonpharmacological and pharmacological) were evaluated by use of the chi-square statistic with a continuity correction.
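For a 2×2 comparison (criterion met or not met, by intervention type), the continuity-corrected chi-square statistic can be computed directly. The sketch below is a standard-library illustration of that test, not the SPSS routine used in the study; the function name `chi_square_yates` and the example counts are hypothetical.

```python
import math


def chi_square_yates(a, b, c, d):
    """Continuity-corrected chi-square test for a 2x2 table.

    Table layout (counts):
                     met   not met
        group 1       a       b
        group 2       c       d
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a marginal total is 0; the statistic is undefined")
    # Yates correction: shrink |ad - bc| by n/2, but never below 0.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability for chi-square with 1 df: P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, not data from the study.
chi2, p = chi_square_yates(10, 20, 20, 10)
print(f"chi2={chi2:.2f}, p={p:.3f}")
```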
The Cohen kappa statistic assessing pair-wise comparisons was used to estimate the interrater reliability of each of the 10 PEDro items for all intervention arms. The kappa statistic is a popular chance-corrected measure of agreement between 2 raters assessing a nominal-level variable.19 A kappa value of 0 indicates agreement no better than that expected by chance, and a value of 1.00 indicates perfect agreement; negative values are possible when observed agreement falls below that expected by chance. A higher value is indicative of better reliability. Agreement between data abstractors on total composite PEDro scores was assessed by use of the ICC with a 2-way mixed-effects model (with the absolute agreement definition). In addition to scores for all studies combined, the kappa and ICC scores were derived for pharmacological and nonpharmacological interventions. SPSS version 12*1 was used for all analyses. A P value of less than .05 was considered statistically significant.
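As an illustration of the chance correction, the following Python sketch computes Cohen's kappa for 2 raters scoring the same item across a set of studies. The function name `cohen_kappa` and the ratings are hypothetical; the sketch shows the standard formula, kappa = (observed − expected) / (1 − expected), and is not the SPSS procedure used in the study.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same studies on one item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of studies scored identically.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected (chance) agreement: product of the raters' marginal
    # proportions, summed over every category that was used.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(cat) / n) * (rater_b.count(cat) / n) for cat in categories
    )
    if expected == 1.0:
        # Both raters used one identical category throughout.
        raise ZeroDivisionError("expected agreement is 1; kappa is undefined")
    return (observed - expected) / (1 - expected)


# Hypothetical yes (1) / no (0) ratings for one item across 10 studies.
ratings_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
ratings_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(ratings_a, ratings_b)
print(round(kappa, 3))
```

In this hypothetical example the raters agree on 8 of 10 studies (80%), but because half of that agreement would be expected by chance given the marginal totals, kappa is noticeably lower than the raw percentage of agreement.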
| Results |
|---|
Quality
The percentages of all studies that met criteria for each of the PEDro items after consensus was reached are shown in Table 2. The final PEDro scores, achieved after consensus was reached, ranged from a low of 2 (1.2%) to a high of 10 (1.2%). The most frequently occurring intermediate scores were 5 (25.3%), 6 (25.3%), and 7 (28.9%). Seven studies achieved a score of 8 (8.4%), and no studies achieved a score of 9.
Reliability
Regardless of study type, there was 100% agreement between raters for the PEDro Scale items randomization and reporting of between-group comparisons, whereas the poorest agreement was found for concealed allocation and baseline comparability. The kappa scores for all studies and the breakdown for drug and nondrug studies are shown in Table 3. Kappa scores varied from a low of .452 for concealed allocation among drug trials to perfect agreement (1.00) for randomization and reporting of results from between-group comparisons. Because of the inherent limitations of the statistical test, the kappa score was 0 for 3 of the PEDro items, despite a high percentage of agreement, and the kappa score was small and negative for 1 item, despite a high percentage of agreement (the Appendix shows examples of these 2 phenomena). The ICCs associated with the cumulative PEDro score were .91 (95% CI=.83–.94) for all studies, .89 (95% CI=.78–.95) for pharmacological studies, and .91 (95% CI=.84–.95) for nonpharmacological studies.
| Discussion |
|---|
Moseley et al2 also assessed the percentages of studies meeting PEDro criteria by evaluating 2,376 RCTs within the PEDro database. Our results are remarkably similar to theirs, with a few exceptions, which may have been attributable to either different inclusion criteria for the studies or the correctness of the ratings. Moseley et al2 reported that 94% of studies fulfilled the criteria for randomization, whereas we included only studies that were clearly randomized and excluded quasi-randomized or controlled trials. Higher percentages of studies included in the present review met the criteria for baseline comparability (84% versus approximately 65%) and for between-group comparisons (100% versus 89%).
Differences in Quality Between Pharmacological and Nonpharmacological Studies
The cumulative PEDro scores of RCTs evaluating drug interventions were significantly higher than those of RCTs evaluating therapy interventions, although drug studies did not consistently outperform therapy trials on an item-by-item basis. The percentages of nonpharmacological studies that met criteria for concealed allocation, baseline comparability, and the inclusion of point estimates were slightly higher, although the differences were small and no statistical tests of significance were performed. As predicted, the greatest difference in scores between intervention types was for subject masking, in which virtually all drug trials succeeded, whereas none of the therapy trials did. An unexpected finding was the difference between study types in the area of masked assessments; only a small percentage (33%) of nondrug trials succeeded in the masking of outcome assessors. Moseley et al2 also reported that a small percentage of trials used masked assessments in evaluations of physical therapies. However, the difference that we report with respect to study type is not easily explained. Although the difficulties with masking of subjects to group assignments in nonpharmacological trials are obvious, the obstacles to ensuring masked assessments are less so. A possible explanation for the shortcoming of therapy trials could be a lack of resources (eg, additional personnel were not available to carry out masked assessments), as these trials may have been conducted in a clinical setting rather than in a research setting. There was no more than a 5% difference between intervention types in the number of studies that met criteria for any of the other 8 PEDro items.
Estimates of Reliability
Although there is no consensus as to what constitutes a "good" or "acceptable" kappa score, for agreement that is less than 100%, guidelines interpreting the strength of agreement have been published.27–29 With the use of any 1 of these 3 published guidelines, our agreement ranged from substantial or good to perfect for each of the 10 pair-wise comparisons. These estimates of reliability are consistent with those in 1 other published report.4 To date, we are not aware of other evaluations of the reliability of the PEDro tool.
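To show how such guidelines map a kappa value to a verbal label, the sketch below encodes the widely used Landis and Koch bands. This choice is an assumption made for illustration: the text does not name the 3 guidelines it cites, and other schemes use different cut points and labels. The function name `landis_koch_label` is hypothetical.

```python
def landis_koch_label(kappa):
    """Verbal label for a kappa value under the Landis and Koch bands."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    )
    for upper, label in bands:
        if kappa <= upper:
            return label


# Two of the kappa values quoted in the text.
for value in (0.788, 1.00):
    print(value, landis_koch_label(value))
```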
In many cases, scoring discrepancies arose from ambiguity in reporting, as it was unclear whether criteria had been satisfied on the basis of what was explicitly stated. The extent to which raters adhere to a literal interpretation of the criteria will affect the consistency of the agreement. For example, in 1 case, the term "placebo" was used, although the word "blinding" or "masking" was not. In another case, the authors reported that they attempted to keep assessors unaware of group assignments. Disagreement also arose when details of the study methodology appeared outside of the Method section. Although the value of assessing differences in baseline prognostic factors in RCTs has been the subject of debate,30 it was the second largest source of scoring disagreement. There were 3 cases in which 1 of the abstractors believed that too few clinically important baseline variables had been assessed for the equality of traits. On 2 occasions, a point was not awarded when there appeared to be a significant difference for a variable thought to be important, on the basis of either the results of a significance test or the abstractor's own judgment. In these cases, abstractors had to make an educated guess as to the potential for bias arising from the imbalance. Subject area knowledge and expertise were influential in scoring under these conditions and resulted in the second lowest kappa score. There were also 3 disagreements as to whether criteria had been fulfilled for the reporting of point estimates and measures of variability. When improvements over time between intervention groups are reported, point estimates are not applicable and can lead to uncertainty in scoring.
The high ICC for the composite PEDro scores also suggests good agreement, although this test does not consider the way in which the final scores were reached. It is possible for 2 raters to reach similar scores without achieving consistency on an item-by-item basis.
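The point can be made concrete with a small Python sketch using invented ratings for 3 studies. The formula shown is the single-measure, absolute-agreement ICC for a 2-way model; whether the study reported single or average measures is not stated in the text, so that choice is an assumption, as are the function name `icc_absolute_agreement` and all of the data.

```python
def icc_absolute_agreement(totals):
    """Single-measure ICC, 2-way model, absolute agreement definition.

    totals: one tuple per study holding each rater's composite score.
    """
    n, k = len(totals), len(totals[0])
    grand = sum(sum(row) for row in totals) / (n * k)
    row_means = [sum(row) / k for row in totals]
    col_means = [sum(row[j] for row in totals) / n for j in range(k)]
    # Partition the total sum of squares into studies, raters, and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in totals for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Invented item-level ratings (1 = yes, 0 = no) for 3 studies and 10 items.
rater_a = [
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 1, 1, 1],
]
rater_b = [
    [1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0, 1, 1, 1, 1],
]

totals = [(sum(a), sum(b)) for a, b in zip(rater_a, rater_b)]
item_pairs = [(x, y) for a, b in zip(rater_a, rater_b) for x, y in zip(a, b)]
item_agreement = sum(x == y for x, y in item_pairs) / len(item_pairs)
icc = icc_absolute_agreement(totals)

print(totals, round(icc, 3), item_agreement)
```

In this invented data set the 2 raters reach identical composite scores for every study, so the ICC for the totals is at its maximum, yet they disagree on 2 items in each study.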
Estimates of Reliability Between Intervention Classes (Drug and Nondrug)
In general, there was better agreement between raters for the nonpharmacological studies than for the pharmacological studies, although the differences were small. The item that caused the largest number of disagreements between intervention types was concealed allocation (κ=.452 for drug studies versus κ=.788 for nondrug studies). This item also represented the criterion met least often (Tab. 2). One possible explanation for the poor agreement was that although it was generally clear for nonpharmacological studies that there had been no attempt to ensure that allocation had been concealed, pharmacological studies more frequently had attempted to achieve this goal, although ambiguous language and incomplete descriptions of processes resulted in disagreements. This finding was particularly true for multicenter trials, in which the term "concealed allocation" often was not used, although one rater thought that it could be inferred, because centralized assignment usually is associated with this trial design. Adequately concealed randomization procedures, such as the use of opaque, sequentially numbered envelopes or off-site randomization, ensure that the investigators have no foreknowledge of subject group assignments and reduce bias by minimizing the possibility that the randomization schedule can be subverted. Although this concept seems straightforward, Schulz and Grimes30 suggested that the definition of concealed allocation generally is not well understood and is often confused with both masking and randomization, a situation that further underscores the need for the use of clear and explicit language. Although the percentages of studies that were awarded a point for masking of subjects and intention-to-treat analysis were low, there was still good agreement between raters, suggesting that the criteria had been stated explicitly.
The mathematical limitations of the kappa statistic were evident for several cases in which the kappa value was 0, despite a high percentage of raw agreement. This situation occurred when 1 of the marginal totals was 0 (ie, 1 rater awarded the same score to every study); the product of the marginal totals for that category was then 0 and remained 0 after it was divided by "n," so that the expected agreement equaled the observed agreement. The kappa value also took on a negative and nonsensical value in 2 cases in which, again, there was a high percentage of raw agreement. This result occurred because the value for expected agreement was greater than that for observed agreement. Examples of these calculations are provided in the Appendix for clarification. As far as we are aware, there is no agreed-upon solution to this dilemma (eg, adding 0.5 to each cell, in a manner similar to a Yates correction of a chi-square statistic).
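Both situations can be reproduced with simple arithmetic. The worked example below uses hypothetical counts for 50 studies; the counts are assumptions chosen for illustration and are not the tables from the Appendix. Here P_o is the observed agreement, P_e is the agreement expected by chance from the marginal totals, and kappa = (P_o − P_e) / (1 − P_e).

```latex
% Case 1 (hypothetical): rater A scores "yes" for all 50 studies;
% rater B scores "yes" for 48 studies and "no" for 2.
\[
P_o = \frac{48 + 0}{50} = .96, \qquad
P_e = \frac{(50)(48) + (0)(2)}{50^2} = .96, \qquad
\kappa = \frac{.96 - .96}{1 - .96} = 0
\]

% Case 2 (hypothetical): both raters score "yes" for 49 studies and "no" for 1,
% but the single "no" falls on a different study for each rater.
\[
P_o = \frac{48 + 0}{50} = .96, \qquad
P_e = \frac{(49)(49) + (1)(1)}{50^2} = .9608, \qquad
\kappa = \frac{.96 - .9608}{1 - .9608} \approx -.02
\]
```

In both hypothetical cases the raters agree on 96% of studies, yet kappa is 0 or slightly negative because nearly all of that agreement is attributed to chance when almost every study receives the same score.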
Limitations of the study are that the "true" PEDro scores remain unknown, and consensus agreement on scale items does not necessarily mean that the raters were correct. In this respect, no claim can be made about the validity of the tool. However, we have likely successfully simulated a practical situation faced by many representative users of the tool attempting to score studies for potential inclusion in systematic reviews. Although the kappa value has statistical limitations, it is a commonly used statistical tool that many clinicians are comfortable using.
| Conclusion |
|---|
| Appendix |
|---|
| Footnotes |
|---|
Ms Foley and Sanjit Bhogal were both involved in concept/idea/research design, writing, and data collection and analysis. Dr Bureau and Dr Speechley were consultants on the project. Dr Teasell procured funds and was a consultant.
This project was funded by the Canadian Stroke Network and the Heart & Stroke Foundation of Ontario.
* SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.
| References |
|---|