Background and Purpose: The methodological quality of randomized controlled trials (RCTs) is commonly evaluated in order to assess the risk of biased estimates of treatment effects. The purpose of this systematic review was to identify scales used to evaluate the methodological quality of RCTs in health care research and summarize the content, construction, development, and psychometric properties of these scales.

Methods: Extensive electronic database searches, along with a manual search, were performed.

Results: One hundred five relevant studies were identified. They accounted for 21 scales and their modifications. The majority of scales had not been rigorously developed or tested for validity and reliability. The Jadad Scale presented the best validity and reliability evidence; however, its validity for physical therapy trials has not been supported.

Discussion and Conclusion: Many scales are used to evaluate the methodological quality of RCTs, but most of these scales have not been adequately developed and have not been adequately tested for validity and reliability. A valid and reliable scale for the assessment of the methodological quality of physical therapy trials needs to be developed.

The medical literature is an important resource to guide clinical decision making and research. The evaluation of the methodological quality of studies is an essential step in the process of selecting the best clinical literature. According to Verhagen et al,1 assessment of methodological quality involves evaluation of internal validity (the degree to which a study's design, conduct, and analysis have minimized biases) and external validity (the extent to which the results of a study can be generalized outside the experimental situation) as well as statistical analysis of primary research. Taken together, these validity constructs are important in determining the methodological quality of primary research. Khan et al2 pointed out that some reasons for performing quality assessment include: to determine a minimum quality threshold for the selection of the primary studies for a systematic review; to explore differences in quality as an explanation for heterogeneity in study results; to weight the results in proportion to the quality in meta-analysis; and, more importantly, to guide interpretation of findings, help determine the strength of inferences, and guide recommendations for future research and clinical practice.

The assessment of the quality of controlled trials is essential because variations in the quality of trials can affect the conclusions about the existing evidence.3 In a review of trials evaluating primarily medical treatments, Moher and colleagues4,5 demonstrated that trials that did not include features such as blinding and allocation concealment tended to report an exaggerated treatment effect compared with trials that did include these features. These facts emphasize the importance of methodological quality assessment in order to provide accurate information on therapeutic effects.

Trial quality can be divided into 2 categories (which overlap to some degree): methodological quality and reporting quality. Methodological quality is defined as “the confidence that the trial design, conduct, and analysis have minimized or avoided biases in its treatment comparisons.”6(p63) Reporting quality is defined as “the provided information about the design, conduct and analysis of the trial.”6(p63) Inadequate reporting makes the interpretation of studies difficult if not impossible.

Scales and checklists are 2 types of instruments that may be used to assess the methodological quality of clinical trials. These 2 types have been used interchangeably; however, they are actually quite distinct. Scales and checklists both include items measuring quality; however, with a scale, the responses to the individual items are summed to create an overall summary score representing trial quality. For example, with the Physiotherapy Evidence Database (PEDro) scale, a summary quality score can be created by determining the number of “yes” responses to items 2 through 11. A single score of trial quality is obviously appealing because it seems easier to interpret than a series of ticks on a checklist. However, unless accepted guidelines have been followed in scale development and the scale has performed well in subsequent psychometric testing (panel of experts; Delphi procedure; and tested for reliability, responsiveness, and content, construct, and concurrent validity),7 scale scores may provide a false impression of meaningfulness.
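The distinction between a checklist and a scale can be illustrated with a minimal sketch of the PEDro summation described above. The function name and the example responses are hypothetical and serve only to show how “yes” responses to items 2 through 11 are summed; this is not an official scoring tool.

```python
def pedro_total(responses):
    """Sum the 'yes' responses to PEDro items 2 through 11.

    responses: dict mapping item number (1..11) to True ('yes') or False ('no').
    Item 1 may be recorded but does not contribute to the summary score.
    """
    scored_items = range(2, 12)  # items 2..11 inclusive
    return sum(1 for item in scored_items if responses.get(item, False))


# Hypothetical ratings for a single trial (illustration only).
example = {1: True, 2: True, 3: True, 4: True, 5: False, 6: False,
           7: True, 8: True, 9: False, 10: True, 11: True}
print(pedro_total(example))  # 7
```

A checklist would stop at the dictionary of item responses; the scale is the additional step of collapsing them into one number.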

The identification of a reliable and valid scale to assess the literature on a specific topic minimizes the chances of errors when determining the quality of the scientific literature. Thus, the purposes of this systematic review were: (1) to summarize the content, construction, areas of development, and psychometric properties of scales used to evaluate the quality of the randomized controlled trials (RCTs) in health care research and (2) to identify an appropriate scale to evaluate methodological quality of RCTs in the physical therapy and rehabilitation research field.


Search Strategy

A computerized database search was performed to identify relevant articles. For this review, the literature was searched for published studies describing or using a scale to evaluate the methodological quality of RCTs in health care research.

The studies were searched in all languages according to the search strategy of Dickersin and Lefebvre.8 The search included studies from 1965 up to March 2, 2007, which were obtained through an extensive search of bibliographic databases, including MEDLINE (1966–February 2007, week 4); EMBASE (1988 to 2007, week 8); CINAHL (Cumulative Index to Nursing and Allied Health Literature) (1982–February 2007, week 3); ISI Web of Science (1965–March 2, 2007); EBM (Evidence-Based Medicine) Reviews-Cochrane Central Register of Controlled Trials (CCTR), Cochrane Library, and Best Evidence (1991–first quarter, 2007); All EBM Reviews, comprising the Cochrane Database of Systematic Reviews (CDSR), ACP (American College of Physicians) Journal Club, Database of Abstracts of Reviews of Effects (DARE), and CCTR (1991–first quarter 2007); Global Health (1973–present); and HealthSTAR (1910–February 2007). Key words used in the search were: “scale,” “critical appraisal,” “critical appraisal review,” “appraisal of methodology,” “research design review,” “quality assessment,” “randomized controlled trial,” and “RCT.” Subject subheadings and some word truncations, according to each database, were used as well to map all possible key words. Table 1 provides details on the specific search terms and combinations. The selection of these terms was made with the help of a librarian specializing in health sciences databases. The literature search also involved a manual search of the bibliographies of the identified papers, looking for key authors (ie, Jadad, Moher, and Chalmers) and relevant information to meet the objectives of this study. Finally, each study in which the original scale development was described was tracked through the Web of Science database in order to access all studies that referenced the original scale development.

Table 1.

Search Results From Different Electronic Databases

Inclusion and Exclusion Criteria

Published studies reporting on scale development or the psychometric evaluation of a scale were eligible for inclusion. The inclusion criterion was: published scales developed to evaluate methodological quality of RCTs in any area of medical research. No unpublished scales were included. Scales were excluded if they were developed for the analysis of the methodological quality for only one specific systematic review or if the development of the scale was not described and the psychometric properties of the scale were not tested. Based on this information, we believe that, although the inclusion of these excluded scales would greatly increase the number of scales in this review, these scales would not contribute to the results because they were most likely not developed systematically. Checklists that clearly were not designed to be summed also were excluded from this systematic review.

Data Extraction

Five independent reviewers (SAO, LGM, ICG, JF, and TS) screened abstracts and titles for eligibility. When the reviewers felt that the abstract or title was potentially useful, full copies of the article were retrieved and considered for eligibility by all reviewers. When discrepancies occurred between reviewers, the reasons were identified and a final decision was made based on the agreement of all reviewers. STATA software (version 9.0) was used to calculate kappa agreement among the multiple raters in selecting the studies for this review.
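For readers unfamiliar with agreement statistics for more than 2 raters, the following is a minimal sketch of one common choice, Fleiss' kappa. It is an assumption that this corresponds to the multiple-rater kappa computed in STATA; the function name and the example counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for agreement among a fixed number of raters.

    counts[i][j] = number of raters who assigned article i to category j
    (eg, j=0 'include', j=1 'exclude'). Every row must sum to the same
    number of raters. Undefined (division by zero) if all ratings fall
    in a single category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Overall proportion of ratings falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # Observed agreement for each article.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_observed = sum(p_i) / n_subjects
    p_expected = sum(p * p for p in p_j)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical screening decisions by 5 raters on 4 articles.
ratings = [[4, 1], [1, 4], [5, 0], [0, 5]]
print(round(fleiss_kappa(ratings), 2))  # 0.6
```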

The next step involved extracting the information regarding the content, construction, special features (eg, area of development, number of items, how items were selected for inclusion, time to complete, how scales and items were scored, the use of guidelines), and psychometric properties for each scale. Psychometric properties that were extracted and analyzed were: face validity, content validity, construct validity, concurrent validity, internal consistency, and reproducibility (intrarater and interrater reliability/agreement). We used the definitions of Streiner and Norman9–11 and the guidelines established by Terwee et al12 to determine quality of measurement properties. In short, quality of measurement included internal components of validity (ie, content validity: internal consistency, relevance of items and representativeness of items of the scale) as well as the external component of validity (ie, construct validity: the relationship with other tests in a manner that is consistent with theoretically derived hypotheses). In addition, intrarater and interrater reliability (ie, repeatability of measurements taken by the same tester at different times and repeatability of measurements taken by different testers, respectively) also were considered.

Scales were identified as being important to physical therapy if the authors specifically stated that the scale was developed for the physical therapy practice area or was developed by a group of physical therapist researchers, or if the Web of Science search identified that the scale was used in at least 2 physical therapy reviews.


The initial electronic database search of the literature resulted in a total of 7,720 articles (Tab. 1). Of these, 49 were selected as potential studies based on their titles and abstracts. After the complete article was read, however, only 19 of these actually fulfilled the initial criterion.3,6,13–29 Thirty papers were excluded after reading the complete article.1,30–58 The main reasons for exclusion were: (1) the tool was a checklist and not a scale, (2) the tool was developed for a single systematic review, and (3) information regarding the scale's construction, development, and psychometric properties was missing or impossible to obtain. The agreement between the reviewers in selecting these articles after applying the inclusion and exclusion criteria was analyzed with a kappa statistic for multiple raters, which resulted in a value of κ=.90.

Each original scale was tracked in the Web of Science database in order to find any additional information that could add to the psychometric properties of the selected scales. A total of 3,158 articles were found by tracking each scale. From these, 56 new articles were selected from the Web of Science.59–114

Thirty articles also were obtained through a hand search (ie, bibliographies of the identified papers, key authors).115–144 Thus, a total of 105 studies were finally included in the study and analyzed. The Figure details the searches.


QUORUM (Quality of Reporting of Meta-analyses) statement flow diagram of the literature search.

The included studies accounted for 21 scales and their modifications including: Jadad,15,17,19,25,27,60–62,64,67–69,71,72,78,80,118,120,121,144 Maastricht,3,62,126 Delphi List,23,28,79,81–83,90,92,99,107–110 PEDro,13,74,77,130,131,140,141 Maastricht-Amsterdam List (MAL),29,85–90,93–97,99–102,104–106,112,113,143 van Tulder,81,82,84,91,92,98,103,107,108,110,111,114,142 Bizzini,26 Chalmers,14,16,22,63,65,70,117,124,125,127–129,138,139 Reisch,122,123 Andrew,115,116 Imperiale,135 Detsky,16,59 Cho and Bero,119 Balas,133 Sindhu,20 Downs and Black,134 Nguyen,137 Oxford Pain Validity Scale (OPVS),21 Arrivé,76 CONSORT,18,66,73,132 and Yates75 scales. Details of each scale and their modifications regarding content, construction, special features, and psychometric properties are provided in Tables 2 and 3.

Table 2.

Characteristics of the Scales

Table 3.

Summary of the Psychometric Properties of the Analyzed Scales

The majority of the adapted scales were based on the Chalmers14,16,22,63,65,117,124,125,127–129,138,139 and Jadad15,17,19,25,60,61,64,68,69,71,80,118,120,121,144 scales. The Chalmers Scale was developed to assess clinical trials on the use of aspirin in coronary heart disease studies, and the Jadad Scale was developed for pain research. The Chalmers Scale was modified for different areas and topics: abdominal surgical practice,127 alcoholism,129 maxillary sinusitis,22 breast cancer,128 periodontal research,117,124 psychology,65 pulmonary rehabilitation,63 and lung cancer.138 The Jadad Scale also has been adapted for use in many health care areas such as medicine, dentistry, psychology, and physical therapy15,17,19,25,60,61,64,68,69,71,80,118,120,121,144 (Tab. 2). In addition, according to our Web of Science search, the Jadad Scale was by far the most frequently cited and the most commonly used scale by the health care community.

In the majority of cases, the process of constructing these scales was not rigorous. Only 5 of the 21 scales included in this study were developed using a standardized procedure (Delphi method and a panel of experts).7 The Jadad,120 Delphi List,23 Yates,75 and Sindhu20 scales were based on the Delphi method, and Bizzini et al26 (Bizzini Scale) used a panel of experts along with Cochrane Collaboration Handbook guidelines.145 Seven scales were developed based on the authors’ knowledge or information about methodological quality and standards for research design in the literature (OPVS,21 Chalmers,125 Arrivé,76 Reisch,122 Downs and Black,134 Nguyen,137 and Balas133 scales). Five of the scales were based on previous checklists. The CONSORT Scale was based on the CONSORT (Consolidated Standards of Reporting Trials) statement146; the Cho and Bero Scale119 was developed based on the Spitzer guidelines147; and the PEDro,130 MAL,143 and van Tulder142 scales were based on the Delphi List.23 Four scales (the Andrew,116 Detsky,16 Maastricht,126 and Imperiale135 scales) did not report their method of development.

The scales identified as being within the scope of physical therapist practice were the PEDro, Maastricht, Delphi List, MAL, van Tulder, Bizzini, and Jadad scales. The Jadad Scale was the only scale that was not originally developed for physical therapy but has been used in physical therapy reviews.144,148–150 It is important to point out that 5 of these scales (the Maastricht, Delphi List, MAL, van Tulder, and PEDro scales) are interrelated. The Maastricht Scale was developed in the Department of Epidemiology of the University of Maastricht without using formal scale development techniques. The same group of authors decided to develop a methodological quality scale using formal techniques of scale development; thus, the Delphi List emerged. Since then, the Maastricht Scale seldom has been used.

The Cochrane Collaboration Back Group (CCBG) used the Delphi List as a basis for their analysis, and added some items they found relevant for back pain. This list was then called the “Maastricht-Amsterdam List” (MAL) (19 items) because of the cooperation between the 2 groups. Later, the CCBG updated the MAL and only considered 11 items. It is this list that is considered the “van Tulder List” and has been used by the CCBG and many systematic reviews since 2003. In addition, the PEDro Scale was derived from the Delphi List. Therefore, the Delphi List is the basis for most of the scales used in physical therapy and its items are included in many of the scales. However, the scales derived from the Delphi List (the MAL, van Tulder List, and PEDro) have added some items and did not consider further validation of these new scales. Table 4 summarizes the items of the most commonly used scales in physical therapy.

Table 4.

Items Included in the Analyzed Scales Used in Physical Therapy

All of the scales that were included in this study had “face validity.” The Cho and Bero,119 Chalmers,125 and Sindhu20 scales were tested for content validity. The Jadad,24,120 Delphi List,79 Maastricht,151 Cho and Bero,119 Sindhu,20 Detsky,16,24 Downs and Black,134 Imperiale,24 Reisch,24 Chalmers,125 and van Tulder24 scales were tested for criterion (concurrent) validity. However, it is important to remember that criterion (concurrent) validity is used to validate a tool in relation to a gold standard. In this case, the concurrent validity of the above-mentioned scales was tested using the Chalmers, Delphi List, Jadad, CONSORT, Imperiale, Reisch, van Tulder, Standards of Reporting Trials Group (SRTG),152 and European Lung Cancer Working Party70 tools, which are not recognized as gold standards in this field. Based on this fact, the concurrent validity of the Jadad, Delphi List, Maastricht, Cho and Bero, Sindhu, Detsky, Downs and Black, Imperiale, Reisch, Chalmers, and van Tulder scales may be inappropriate.9

The Jadad120 and Yates75 scales were tested for construct validity. Construct validity, as mentioned previously, refers to the extent to which scores of a scale are based on hypothetical grounds and should be tested by predefined hypotheses (eg, expected correlations between measures, expected behavior of scales in a determined situation).9 For example, the Jadad and Yates scales were tested to confirm whether they could discriminate between articles that had good or bad methodological quality and were previously judged by a group of experts (construct validity evidence).

Two scales (the OPVS and Nguyen scales) were not tested for reproducibility (ie, agreement and intrarater and interrater reliability) and internal consistency. Values such as intraclass correlation coefficient, kappa, Kendall coefficient, and Pearson correlation coefficient were used to analyze interrater reliability. Only one scale (Downs and Black) has reported internal consistency values.134 In general, scales had interrater reliability that ranged from .32 (poor) to 1.0 (excellent). Intrarater reliability was only reported for one of the modified Chalmers scales (κ=.66).139 Test-retest reliability was reported for the Downs and Black (r=.88),134 Jadad (intraclass correlation coefficient=.98),64 and Chalmers (intraclass correlation coefficient=.81)14 scales. The scales that have been tested for reliability in different areas are the Jadad16,19,28,61,64,71,78,99,118,144 and Delphi List28,81–83,90,92,99,107–110 scales (Tab. 2). The scale that presented the worst reliability was the Andrew Scale (r=.32).115

The Jadad Scale presented the best validity evidence and has been tested for reliability in different settings. However, the Delphi List and Yates scales have been developed based on high standards as well (Delphi procedure, panel of experts, and tested for validity and reliability).7,12 The Delphi List also has been used in many areas; however, it has not been as popular (cited 180 times) as the Jadad Scale (cited 1,780 times). The Yates Scale is a recently published scale that was created for use in cognitive behavior therapy for chronic pain. Its use has been limited (only cited once by the same group of authors).

Regarding the scales used specifically in the physical therapy area, the Delphi List, along with the Jadad Scale, has greater validity evidence compared with the other scales (the MAL, van Tulder, PEDro, and Bizzini scales). However, the Delphi List lacks evidence of internal consistency and construct validity. These psychometric properties are of importance in a scale because they indicate that the construct, in this case “methodological quality,” is fully represented by the items of the scale (internal consistency), and that the scores of a scale are based on hypothetical grounds and should behave based on predefined hypotheses.7,9 Scales used in the physical therapy area had interrater reliability that ranged from .37 (poor) to 1.0 (excellent). The reliability values found for each scale depended on the setting used, the raters’ characteristics, the length of the scale, and also the training given on how to use a specific scale.
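Internal consistency is often summarized with Cronbach's alpha. The sketch below assumes that statistic, although the reviewed studies may have reported other indexes; the function name and the item ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a set of rated trials.

    scores[i][j] = score given to trial i on item j of the scale.
    Uses sample variances throughout. Requires at least 2 items,
    at least 2 trials, and nonzero variance of the total scores.
    """
    n_items = len(scores[0])
    item_variances = [variance([row[j] for row in scores])
                      for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical yes/no (1/0) ratings of 4 trials on a 3-item scale.
# The items agree perfectly here, so alpha reaches its maximum.
ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(round(cronbach_alpha(ratings), 2))  # 1.0
```

Values near 1 suggest that the items vary together across trials; low values suggest that the items do not reflect a single underlying construct.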


This systematic review evaluated the content, construction, areas of development, and psychometric properties of scales used to evaluate the quality of the RCTs in health care research. The findings of this study demonstrated that a large number of scales and modified scales are available in the literature to evaluate methodological quality in different health care areas.

The scales analyzed in this review differed in several aspects, such as area of development, complexity, length, type of items, and importance given to the included items. The scale modifications were performed in the majority of the cases to adapt a scale to a specific topic. The primary scales were normally developed with the objective of analyzing the quality of RCTs in a specific area, and the items cover topics that are important to that area. Chalmers and colleagues,125 for example, developed a scale to analyze clinical trials on the use of aspirin in coronary heart disease studies and included an item about the taste and appearance of the drug in the scale. This item is very important for this type of study because it provides information regarding the true blinding of the patients. However, this item would become completely inappropriate when using this scale to analyze the quality of a nonpharmacological study. Based on this fact, many authors modified an original scale so they could use it in a systematic review on a topic different from the one for which the scale was originally developed. However, if one single item is added to or taken from a scale, if the weighting system is changed, or if any other minor change is performed, the psychometric properties of the original scale may no longer be applicable.

A modified scale developed from a validated and reliable primary scale cannot be considered valid and reliable unless it is tested for validity and reliability itself. According to Streiner and Norman, “modifications of existing scales often require new validity studies.”9(p186) This means that the psychometric properties of the modified scale have to be assessed to ensure that the new scale can actually identify papers with good or bad methodological quality. Most of the scales used in the physical therapy area are modifications of the Delphi List. These scales, however, did not follow any further validation process and, therefore, cannot be considered to be as valid as the original (the Delphi List). The use of modified scales is a step forward in creating scales specific to each area; however, they should be used with caution because of the lack of information about their construction, applicability, and psychometric properties. Some reports136,153 did not account for this fact and considered a modified scale to be a new scale. However, this misunderstanding added confusion about the number of existing scales in the literature. Our systematic review grouped all of the original scales and their modifications to highlight the fact that there are few original scales with clear and reported psychometric properties. Nevertheless, many adaptations of these scales have occurred without considering a new validation process. These unvalidated “new” scales have been freely used in health care research, which makes the interpretation and the validity of the results of these scales even more complex and open to question.

Quality assessment instruments have to be developed according to established principles of scale development. However, most instruments analyzed in this study have not been developed rigorously. Our results are in agreement with those obtained by Moher et al.6 It has been suggested5,154 that some specific issues must be addressed when developing a scale: definition of the quality construct, definition of the scope and purpose of quality assessment, definition of the population of end-users (background), selection of raters, and trial scoring (open or blind). Validity evidence such as the internal component of validity (internal consistency, relevance of items, and representativeness of items of the scale) as well as the external component of validity (ie, the relationship with other tests) is needed to support the use of scales to measure the methodological quality of RCTs. Intrarater and interrater reliability (ie, repeatability of measurements taken by the same tester at different times and reproducibility of measurements taken by different testers, respectively) also are important considerations when developing a tool.155

As with any procedure, assessment of methodological quality is prone to bias. Thus, in order to be consistent and avoid bias, researchers should use robust tools that are able to objectively evaluate methodological quality. However, which issues are relevant to consider for quality assessment tools? According to our results, randomization is one of the most commonly used items across different scales to measure methodological quality. It has been shown that lack of randomization can change the treatment effects.2 Chalmers et al156 found that trials that were not randomized had a 58.1% difference in case-fatality rates in the treatment group compared with the control group, whereas trials with randomization had a 24.4% difference and blinded randomized trials had an 8.8% difference. Allocation concealment was considered in 44.4% of the analyzed scales in our systematic review. Trials that reported allocation concealment inadequately have been shown to exaggerate treatment effects by 37% compared with trials that reported it adequately.5,157 According to these findings, therefore, randomization as well as allocation concealment should be evaluated when assessing methodological quality because they eliminate study selection and confounding biases.158

Blinding was another item frequently used by the scales studied in this paper and has been found to be an important consideration when evaluating methodological quality. Schulz et al159 found that trials with no double-blinding procedure increased the treatment effects by 17%. These results are in agreement with those of Colditz and colleagues.160,161 However, according to the results of Moher et al,5 double blinding did not significantly affect the estimates of effect.

In addition, sample size calculation was an item frequently used by the scales, and it has been shown to be important for methodological quality. Trials with small sample sizes have a greater risk of a type II error. For example, Freiman et al162 and Moher et al163 found that most of the trials with negative results did not have a large enough sample size, leading to wrong conclusions and wasting of resources. Thus, sample size is an important issue when evaluating methodological quality because if trials are not adequately powered, they will probably not show an effect.
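The link between sample size and type II error can be made concrete with a minimal sketch of an a priori sample size calculation. It assumes a 2-group parallel trial comparing means, equal group sizes, a common standard deviation, and the normal approximation; the function and parameter names are hypothetical, and the cited studies did not necessarily use this formula.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a mean
    difference `delta` given a common standard deviation `sigma`,
    using a 2-sided test at level `alpha`.

    power = 1 - beta, where beta is the type II error rate.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Detecting a difference of half a standard deviation with 80% power.
print(n_per_group(delta=0.5, sigma=1.0))  # 63
```

Halving the detectable difference roughly quadruples the required sample, which is why small trials of modest effects are prone to false-negative conclusions.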

Other scale items considered to evaluate methodological quality were items such as appropriate statistical analysis, description of withdrawals and dropouts, baseline comparability, definition of inclusion and exclusion criteria, use of intention-to-treat analysis,164 and outcomes objectivity. These items are generally assumed to affect the quality of the trial; however, no studies supporting this assumption have been performed. Nevertheless, basic methodological standards2,43,146 support the inclusion of these items in quality assessment tools. Future studies should evaluate the influence of these issues in estimates of treatment in different areas because most of the information on this topic comes from medicine5,157–159,165 and may not be applicable to physical therapy and rehabilitation.166 This research could guide improvement in the methodological quality assessment scales in physical therapy and rehabilitation. Future research evaluating the relationship between design characteristics and treatment effect sizes in physical therapy is urgently needed in order to determine important items in the scales with a scientific basis for their construction. This also will help to guide research planning.

Although this systematic review focused on scales that summarize their results in a final score, the use of a summary score is still debatable. According to Balk et al,165 it is clear that the assessment of the methodology of studies is useful to better understand the quality of RCTs and meta-analyses; therefore, summary scores could be useful as a tool for clinicians to evaluate the strength of individual studies. This may be important because scales are used not only by researchers but also by clinicians and students who may not have sufficient knowledge to make their own judgments about quality based on individual items. In this case, the use of a summary score may facilitate an understanding of the literature such as is done when using the PEDro database. Most importantly, clinicians have to apply scientific knowledge to practice, and they need to have a simple and interpretable method to be confident about the quality of the research in order to apply this knowledge in clinical practice.

Most of the studies relating a summary quality score and its effect on outcomes have controversial results. Although the use of a summary quality score in meta-analysis has been suggested, its influence on outcome remains unclear and needs further research.136,153,165,166 The results of this review showed that many scales have not been developed with systematic rigor and have been validated only for the areas for which they were originally designed. Therefore, the results of the articles that question the usefulness of a summary score are themselves open to question. Herbison et al,153 for example, contended that the use of a quality score in adjustments of meta-analysis was not sensitive enough to distinguish high-quality trials from low-quality trials and that each scale came to a different result. However, lack of sensitivity may be due to the low quality of the scales used and not necessarily related to the summary scores. The scales used may not have represented the construct of “methodological quality,” perhaps because they did not reach suitable standards for its validation. Therefore, caution is necessary when using summary scores before a valid and systematically developed scale is used to test the usefulness of a scoring system.
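One way a summary quality score can enter a meta-analysis is as a multiplier on the usual inverse-variance weights. The sketch below shows that idea under a fixed-effect model; it is an illustrative assumption and not the method of any study cited here. The function name and all numbers are hypothetical.

```python
def pooled_effect(effects, variances, quality=None):
    """Fixed-effect pooled estimate with optional quality weighting.

    effects[i]   = effect estimate of trial i
    variances[i] = variance of that estimate
    quality[i]   = quality score of trial i rescaled to (0, 1];
                   if omitted, plain inverse-variance weights are used.
    """
    if quality is None:
        quality = [1.0] * len(effects)
    weights = [q / v for q, v in zip(quality, variances)]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


# Two hypothetical trials of equal precision; the second is rated
# at half the quality of the first.
effects = [0.2, 0.8]
variances = [0.04, 0.04]
print(round(pooled_effect(effects, variances), 2))              # 0.5
print(round(pooled_effect(effects, variances, [1.0, 0.5]), 2))  # 0.4
```

The pooled result shifts toward the higher-rated trial, so the conclusion depends directly on how well the scale measures quality, which is the concern raised above.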

Methodological Quality Scales and Physical Therapy

The use of quality scales to assess RCTs in physical therapy is a relevant issue. As mentioned above, 7 scales have been used to assess physical therapy trials (ie, the Jadad, Maastricht, Delphi List, MAL, van Tulder, PEDro, and Bizzini scales). The Jadad Scale (because of its brevity and widespread use), the Delphi List, and the PEDro Scale have been used most frequently for this purpose.144,148–150 In addition, the MAL (1997–2003) and van Tulder (2003–present) scales have been used for physical therapy reviews from the CCBG. The Maastricht Scale was developed for physical therapy; however, it has not commonly been used since the appearance of the Delphi List and its modifications (MAL, van Tulder, PEDro). Despite the fact that the Jadad Scale is a pain-validated tool that was not created to evaluate specific information related to physical therapy, it has been widely used in physical therapy research,17,144,149,150,167–171 and its authors suggest that this scale could be used in other areas. However, until now, there has not been any validation on the use of this scale in areas other than pain.

The Jadad Scale focuses only on randomization, blinding, and withdrawals and dropouts to evaluate the methodological quality of primary research. Only 2 of these 3 items would always be applicable to physical therapy because the nature of physical therapy interventions (eg, manual therapy, exercises) does not always allow for blinding of the therapists, the patients, or both. Proper double blinding, therefore, is unlikely to be accomplished for most physical therapy trials, and, as a result, this item becomes irrelevant and most likely will not contribute to the ability of the scale to determine the quality of RCTs in physical therapy. Furthermore, the Jadad Scale does not include any item about treatment protocol specifications, treatment adherence, or treatment integrity, which are important issues in physical therapy. As such, the Jadad Scale may not provide the most comprehensive measure of methodological quality for physical therapy trials.

Currently, most clinical trials include the Jadad Scale items in their methodology in order to achieve good methodological quality. Herbison et al153 found that, for a large proportion of the studies they analyzed, the Jadad Scale did not allow them to divide the studies into “high” and “low” quality and concluded that this scale might not be responsive enough to distinguish between different levels of quality. Therefore, the use of the Jadad Scale and its validity should be reassessed, not only for drug trials but also for different areas of research (eg, physical therapy, nonpharmacological trials).

Conversely, the PEDro, Maastricht, Delphi List, MAL, van Tulder, and Bizzini scales appear to be more connected to the physical therapy field. For example, it has been shown that the PEDro Scale, a modification of the Delphi List, offers a more comprehensive measure of methodological quality in stroke rehabilitation literature compared with the Jadad Scale.148 Furthermore, in addition to the blinding component, the PEDro Scale assesses the methodological quality of a study based on other important criteria, such as concealed allocation, intention-to-treat analysis, and adequacy of follow-up. As such, the PEDro Scale appears to be a more useful tool to assess the methodological quality of physical therapy trials.

Because physical therapy clinical trials are much more complex than pharmacological RCTs, the physical therapy–related scales should carefully take into account not only patient adherence and standardization of the treatment protocol but also the precise performance of the intervention as well as the validity, reliability, and responsiveness of the tests and measurements included in the trial. None of these variables have been considered in the scales regularly used (the Jadad, Delphi List, van Tulder, and PEDro scales) to assess the methodological quality of trials in physical therapy. As has been noted (Tab. 4), all of these scales have neglected the intervention items as well as the validity, reliability, and responsiveness of the outcomes used. The Maastricht and Bizzini scales consider parameters of treatment such as frequency, intensity, duration, and adherence, making these scales more comprehensive; however, these scales lack the higher levels of validity evidence. Therefore, a content analysis of the current scales used in physical therapy clearly identifies a gap around the issue of treatment implementation and outcome measurements, which are often relevant for physical therapy.

Another relevant issue is the different interpretation of the items. For example, the Delphi and Bizzini scales ask only whether randomization was performed, but the MAL and van Tulder scales require the reviewer to determine whether the method of randomization is appropriate. In addition, regarding baseline comparability of the most important prognostic factors, the Delphi List requires the reviewer to determine the comparable item, whereas the MAL specifically requires adequate description of the patients’ age and duration of complaints, percentage of patients with pain, and main outcome measures to evaluate similarity. Thus, these items could elicit different responses and scores, depending on the scale. Therefore, precise guidelines with unified criteria should exist in order to provide the same information.

Based on the information given about the scales, either the quality of the existing scales needs to be improved (eg, through validation testing) or a new tool closely related to physical therapy practice needs to be developed that includes all items and relevant issues related to rehabilitation and physical therapy. In addition, this new tool must include the concept of quality in its broadest sense and be tested for validity and reliability across different areas of physical therapy practice (eg, orthopedics, neurology, respiratory care) in order to ensure that it is relevant and applicable to different areas of physical therapy research.


Based on the findings of this systematic review, many scales are being used to evaluate the methodological quality of RCTs in health care research. Most of the analyzed scales did not follow methodological standards during development7 and have not been tested for validity and reliability in the areas to which they have been applied. Our findings indicate that no scale that is being used to evaluate the quality of physical therapy research has been subjected to scientifically rigorous development or to testing for validity and reliability. Therefore, readers should be careful when using a scale to assess the methodological quality of primary research articles. Scale limitations should be taken into consideration, and the information provided by scales should be interpreted with caution. Future research aimed at developing a new scale to evaluate the methodological quality of RCTs in physical therapy should take into consideration the results of this review regarding the flaws and limitations of the existing scales.


  • All authors provided concept/idea/research design and writing. Ms Armijo Olivo, Ms Macedo, Ms Gadotti, Mr Fuentes, and Ms Stanton provided data collection and data analysis. Ms Armijo Olivo provided project management.

  • The authors acknowledge the following agencies and funding: Alberta Provincial CIHR Training Program in Bone and Joint Health, Izaak Walton Killam scholarship from the University of Alberta, Canadian Institutes of Health Research, Government of Chile (MECESUP Program), University Catholic of Maule, Endeavour International Postgraduate Research Scholarships from the University of Sydney, University of Sydney International Research Scholarship, Strathcona Physiotherapy Scholarship, Province of Alberta Graduate Scholarship, and Ann Collins Whitmore Memorial Award from the Physiotherapy Foundation of Canada. The authors also acknowledge Sandra Shores, librarian specializing in health sciences databases at the University of Alberta, for her assistance.

  • This study was given as a platform presentation at the 15th International Conference of the World Confederation for Physical Therapy; June 3, 2007; Vancouver, BC, Canada.

  • * Stata Corp, 4905 Lakeway Dr, College Station, TX 77845.

  • Received May 14, 2007.
  • Accepted August 27, 2007.

