Research Reports |
S.M. Haley, PT, PhD, FAPTA, is Associate Director, Health and Disability Research Institute, School of Public Health, Boston University, Medical Campus, 580 Harrison Ave, 2nd Floor, Boston, MA 02218 (USA).
M.A. Fragala-Pinkham, PT, MS, is Research Associate, Research Center for Children With Special Health Care Needs, Franciscan Hospital for Children, Boston, Massachusetts.
H.M. Dumas, PT, MS, is Manager, Research Center for Children With Special Health Care Needs, Franciscan Hospital for Children.
P. Ni, MD, MPH, is Senior Data Analyst, Health and Disability Research Institute, School of Public Health, Boston University.
G.E. Gorton, BS, is Director, Clinical Outcomes Assessment Laboratory, Shriners Hospital for Children, Springfield, Massachusetts.
K. Watson, PT, DPT, is Research Specialist, Shriners Hospital for Children, Philadelphia, Pennsylvania.
K. Montpetit, OT, MScOT, is Head of Occupational Therapy and Outcomes Coordinator, Shriners Hospital for Children, Montreal, Quebec, Canada.
N. Bilodeau, OT, MScOT, is Occupational Therapist, Shriners Hospital for Children, Montreal.
R.K. Hambleton, PhD, is Executive Director, Center for Educational Assessment, Department of Educational Policy, Research and Administration, University of Massachusetts, Amherst, Massachusetts.
C.A. Tucker, PT, PhD, PCS, is Associate Professor, Department of Physical Therapy, College of Health Professions, Temple University, Philadelphia, Pennsylvania.
Address all correspondence to Dr Haley at: smhaley{at}bu.edu
Submitted January 8, 2009;
Accepted March 17, 2009
| Abstract |
|---|
Objective: The objective of this study was to examine the psychometric properties of a new item bank and simulated computerized adaptive test to assess activity level abilities in children with CP.
Design: This was a cross-sectional item calibration study.
Methods: The convenience sample consisted of 308 children and youth with CP, aged 2 to 20 years (mean=10.7, SD=4.0), recruited from 4 pediatric hospitals. We collected parent-report data on an initial set of 45 activity items. Using an Item Response Theory (IRT) approach, we compared estimated scores from the activity item bank with concurrent instruments, examined discriminant validity, and developed computer simulations of a CAT algorithm with multiple stop rules to evaluate scale coverage, score agreement with CAT algorithms, and discriminant and concurrent validity.
Results: Confirmatory factor analysis supported scale unidimensionality, local item independence, and invariance. Scores from the computer simulations of the prototype CATs with varying stop rules were consistent with scores from the full item bank (r=.93–.98). The activity summary scores discriminated across levels of upper-extremity and gross motor severity and were correlated with the Pediatric Outcomes Data Collection Instrument (PODCI) physical function and sports subscale (r=.86), the Functional Independence Measure for Children (Wee-FIM) (r=.79), and the Pediatric Quality of Life Inventory–Cerebral Palsy version (r=.74).
Limitations: The sample size was small for such IRT item banks and CAT development studies. Another limitation was oversampling of children with CP at higher functioning levels.
Conclusions: The new activity item bank appears to have promise for use in a CAT application for the assessment of activity abilities in children with CP across a wide age range and different levels of motor severity.
| Introduction |
|---|
The use of parent-report measures is an accepted method for documenting the physical functioning of a child with CP in home and community environments.6,9–11 Parent-report measures such as the Pediatric Evaluation of Disability Inventory (PEDI)12 and the Pediatric Outcomes Data Collection Instrument (PODCI)13 are commonly used in clinic and research settings to measure activity-level abilities in children with CP. Other instruments, such as the Activity Scale for Kids (ASK),14 the Pediatric Quality of Life–Cerebral Palsy version (PedsQL-CP),15 the Functional Assessment Questionnaire (FAQ),16 and the Functional Mobility Scale (FMS),17 also have been used to measure activity-level changes in children with CP. These measures yield data that are reliable and valid; however, as individual instruments, all have limitations. The FAQ and FMS concentrate on functional mobility only, whereas the ASK, PEDI, and PODCI include other areas of function such as self-care and play, but may not assess mobility skills in sufficient depth. The PODCI combines health-related quality-of-life questions with physical function questions, thereby limiting content breadth and specificity.18 The PEDI and PODCI can take 30 minutes or more to administer, which often creates a substantial response burden in a busy clinic or a research study with multiple outcome measures. Other limitations of these scales include ceiling and floor effects when used across wide age and ability ranges and limited item content, especially in the areas of more-complex tasks.3
There is a need for a parent-report measure that: (1) can be used to document activity abilities at program and individual child levels, (2) is inclusive of young children and teen age groups, and (3) is feasible to administer in a clinical setting. Yet, fixed-length instruments that cover a broad range of ages and functional abilities often have too many items and are overly burdensome. Using traditional methods, it has become clear that no single, fixed-length instrument can meet these content and psychometric standards for children and youth with CP throughout a wide age range (2–21 years).19
The use of computerized adaptive testing (CAT) provides an alternative approach to these measurement and practical issues.20 Computerized adaptive testing utilizes a software algorithm that selects questions appropriate to the child's functional ability by using previous responses to create a score estimate. Item information functions,20 based on the locations and discrimination of each item, form the basis for the selection of each new item within the CAT program, as the item with the maximum information at the current score level is chosen. In effect, items administered in a CAT are customized to the individual parent-report of the child's functional level by skipping items that are clearly too easy or too difficult for the child's expected capabilities, given the previous responses. Computerized adaptive testing software shortens or lengthens the test to achieve the desired precision and scores all children on a standard metric so that results can be compared across children. The potential advantages of CAT programs in the evaluation of functional abilities of children have been documented previously.21–28
Starting from a list of items adapted from existing instruments and newly created items, we collected parent-report data on a new item bank to measure the ability of children with CP to perform activities in their home and community environments. Our goal was to build an item bank that would cover relevant clinical ages (2–21 years) and levels of severity of children and youth with CP typically seen at the Shriners Hospitals for Children (SHC) orthopedic hospitals. In order to fully assess the potential usefulness of the item bank, we created simulations of CAT scores based on the data collected during the item bank calibration phase. Computerized adaptive testing simulations are a common approach for investigating the merits of an item bank and its potential for providing the foundation for a CAT program.29 During the simulation program, as items are selected for administration to parents, responses are taken directly from the actual data set. The complete set of activity item responses and subsequent score estimates serve as the criteria against which CAT-based scores are compared.
The purpose of this study was to examine initial psychometric properties (unidimensionality; local item dependence; item invariance, including stability across groups and differential item functioning [DIF]; scale coverage; score agreement with CAT algorithms; and discriminant and concurrent validity) of an activity item bank and resulting simulated CAT program designed specifically to assess activity level and physical function in children with CP. Our long-term goal was to create a series of multifaceted item banks that can assess global physical health, upper- and lower-extremity skills, and activity via CAT technology for the monitoring of functional outcomes and the assessment of change with interventions.30 In this article, we report on the results of the activity scale.
| Method |
|---|
The mean age of the sample was 10.7 years (SD=4.0). Demographic characteristics of the sample are presented in Table 1. Our sample is not fully representative of population data reported elsewhere,31 as we have underrepresented children with more-severe gross motor disabilities. Because 2 of the 3 SHC sites were motion laboratory-based facilities, it was typical at those sites to recruit primarily ambulatory participants.
Procedure
Activity items along with items from the 3 other scales (global physical health, lower-extremity and mobility, and upper-extremity skills) were administered to parents using a PC-based tablet. The global physical health scale33 assesses pain and fatigue, the upper-extremity skills scale34 samples functional status in hand and dexterity skills, and the lower-extremity and mobility scale35 examines lower-extremity functioning and mobility, including mobility with devices. In a few cases (n=11), parents who were unable to complete the survey during the clinic visit completed it at home using a Web-based interface. For the calibration testing, the activity items were rated by a parent or caregiver and were judged on the basis of the following 5-point rating scale: 0="unable to do," 1="with much difficulty," 2="with some difficulty," 3="with a little difficulty," and 4="without any difficulty."
External Measures
Severity of CP initially was rated by the parents and then confirmed by the research staff using the Gross Motor Function Classification System (GMFCS),36 which rates children on a 5-point severity scale based primarily on ambulatory ability, and the Manual Ability Classification System (MACS),37 which categorizes children on a 5-point severity scale based on hand function and dexterity. Both classification systems have been shown to have reliability and validity for use in children and youth with CP.37–39
Concurrent validity comparisons were chosen on the basis of whether the instruments were currently used across many of the SHC hospitals. To serve as concurrent validity comparisons, subsets of parents also completed the PODCI40 (n=168), the PedsQL-CP15 (n=77), and the Functional Independence Measure for Children (Wee-FIM)41 (n=113). The PODCI was developed specifically to assess changes following pediatric orthopedic interventions for a broad range of diagnoses, including CP. The dimension most similar to the new activity scale was the subdomain of physical function and sports. The Wee-FIM is a standard outcome measure used in many of the SHC hospitals, and its scores include a motor function score. The PedsQL-CP is an adapted form of the generic PedsQL developed specifically for children with CP. The most relevant subscale within the PedsQL-CP is daily activity.
Data Analysis
Unidimensionality.
Item Response Theory (IRT) and CAT methods assume certain measurement properties of item sets. These include the assumptions of unidimensionality, local independence, and stability of item parameters (item invariance) across groups (eg, types of CP). Item sets that violate these assumptions may be less effective in modeling the latent variable and may limit the accuracy of a CAT instrument.42,43 We tested the latent structure of the activity items by confirmatory factor analyses (CFAs)44 and evaluated item loadings and residual correlations among items using Mplus software.45,* Model fit was assessed by multiple fit indexes such as the Comparative Fit Index (CFI), the Tucker Lewis Index (TLI), and root mean square error approximation (RMSEA). Recent simulation reports suggest that for most of the fit indexes, it is difficult to establish strict cutoff criteria.46,47 We used unweighted least squares means and variance-adjusted estimation methods, which are more precise when analyzing small to moderate-size samples with skewed categorical data.48,49 Four pieces of evidence were reviewed to determine the extent to which a unidimensional model adequately represented the activity scale: (1) item loadings on the primary factor, the percentage of variance attributed to the first factor, and the ratio of eigenvalues between the first and second factors; (2) results from overall model fit tests; (3) residual correlations between all possible pairs of items; and (4) the patterns of inter-item correlations among items. We retained items with factor loadings greater than 0.4 in the item bank. We considered items with residual correlations greater than 0.2 to be locally dependent items.50
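The two screening rules stated above (retain items with loadings greater than 0.4; flag item pairs with residual correlations greater than 0.2) can be illustrated with a minimal sketch. The function name, the item labels, and all numeric values below are hypothetical, and comparing the absolute value of the residual correlation with the cutoff is an assumption, not something the text specifies.

```python
def screen_items(loadings, residual_corr, min_loading=0.4, max_residual=0.2):
    """Apply the loading and residual-correlation screens to CFA output.

    loadings: dict mapping item name -> loading on the primary factor.
    residual_corr: dict mapping (item, item) pair -> residual correlation.
    Returns (retained item names, item pairs flagged as locally dependent).
    """
    retained = [name for name, lam in loadings.items() if lam > min_loading]
    # Assumption: a large negative residual is flagged as well as a positive one.
    dependent = [pair for pair, r in residual_corr.items() if abs(r) > max_residual]
    return retained, dependent


# Hypothetical values for illustration only (not taken from the study).
example_loadings = {"item_a": 0.82, "item_b": 0.35, "item_c": 0.61}
example_residuals = {("item_a", "item_c"): 0.05, ("item_a", "item_b"): 0.27}

retained, dependent = screen_items(example_loadings, example_residuals)
print(retained, dependent)
```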
Item calibrations, fit, and score estimates.
The item parameters for each scale were estimated using the Graded Response Model (GRM), with the slope parameter restricted to a single value.51 This one-parameter logistic model using GRM was selected as the best solution for this phase of the project because of the relatively small sample size and the observation that most of the items had high, but similar, point-biserial correlations, suggesting that discrimination did not vary much across items. The item parameters and fit statistics were calculated using PARSCALE,51,† which is based on marginal maximum likelihood estimation. We evaluated item fit using the likelihood ratio chi-square statistic. Probability values less than .05 suggest item misfit. We evaluated the individual scores by weighted maximum likelihood estimation.52 The individual scores were standardized to a mean of 50 and standard deviation of 10 (T-scale).
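To make the model concrete, the following is a minimal sketch of graded response category probabilities with one slope shared by all items, together with the T-scale transformation. It is not the PARSCALE implementation: the function names, the threshold values, and the omission of any scaling constant are assumptions made for illustration.

```python
import math


def grm_category_probs(theta, thresholds, slope=1.0):
    """Category probabilities for one item under a graded response model.

    thresholds: increasing list of m-1 boundary locations for m categories.
    slope: common discrimination shared by every item (the restricted model).
    """
    # Cumulative probabilities P(X >= k), padded with 1 (k=0) and 0 (k=m).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def to_t_scale(theta, sample_mean, sample_sd):
    """Standardize a score to a metric with mean 50 and SD 10."""
    return 50.0 + 10.0 * (theta - sample_mean) / sample_sd


# Hypothetical thresholds for a 5-category item (0 = "unable to do" ... 4).
probs = grm_category_probs(0.0, [-2.0, -1.0, 1.0, 2.0])
print([round(p, 3) for p in probs])
print(to_t_scale(0.5, sample_mean=0.0, sample_sd=1.0))
```

Fixing the slope leaves only the category thresholds to estimate for each item, which is why the restricted model is less demanding of sample size than a model with item-specific slopes.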
Item invariance.
To examine item parameter stability, we grouped the participants into high- and low-function groups according to activity ability level and calculated the correlation between the item parameters estimated based on those 2 groups. With IRT, the child's score on an item should depend entirely on the latent variable (ability to perform activities). Significant DIF indicates that variables other than the activity variable, such as age (<11 years, ≥11 years), type (hemiplegic, diplegic, quadriplegic), or severity of CP (as assessed with the GMFCS or the MACS), are likely influencing the responses.53 The analysis of DIF was conducted using ordinal logistic regression.54 If a variable produced significant model coefficients and explained more than 2% of the variance, considering the total score, then an item was considered to exhibit DIF. Because of the number of items that were analyzed in the final item bank (36 items), the alpha level was set to .0014 (using the Bonferroni adjustment, .05/36). If the likelihood ratio test was statistically significant and the R-square change was greater than .07, we designated that as large DIF. If the likelihood ratio test was statistically significant and the R-square change was between .035 and .07, we designated that as moderate DIF. Otherwise, values indicated small DIF.
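The DIF decision rule described above can be written as a short function. This is only a sketch: the function name is hypothetical, the likelihood ratio test P value and R-square change are assumed to come from ordinal logistic regression models fitted elsewhere, and treating the boundaries .035 and .07 as part of the moderate band is an assumption.

```python
def classify_dif(p_value, r2_change, n_items=36, alpha=0.05):
    """Label an item's DIF as 'large', 'moderate', or 'small'.

    p_value: P value of the likelihood ratio test for the grouping variable.
    r2_change: change in R-square when the grouping variable is added.
    """
    adjusted_alpha = alpha / n_items  # Bonferroni adjustment, .05/36 = .0014
    significant = p_value < adjusted_alpha
    if significant and r2_change > 0.07:
        return "large"
    # Assumption: both boundary values count as moderate.
    if significant and 0.035 <= r2_change <= 0.07:
        return "moderate"
    return "small"


# Hypothetical inputs for illustration only.
print(round(0.05 / 36, 4))
print(classify_dif(p_value=0.0001, r2_change=0.05))
```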
Scale coverage.
To evaluate the matching of item content with the estimated activity scores of the sample, we produced parallel item maps in which item category expected values55 and person scores were plotted on the same metric (mean=50, SD=10). The expected value is the sum of the category values multiplied by their probabilities:

$$E_i(\theta) = \sum_{j=0}^{m-1} j \, P_{ij}(\theta)$$

where $E_i(\theta)$ is the expected value for item i at score level $\theta$, m is the number of rating scale categories (scored 0 to m−1, as in the 5-point rating scale used for calibration), and $P_{ij}(\theta)$ is the category probability for item i category j at score $\theta$. The logit score that is the best estimate of the expected value is in the middle range of each response category. For each item, the expected value of each response category (5 per item) was plotted in this item map. These expected item response category values are used rather than step estimates because the expected values are more representative of the full content range of each item. The content range was based on estimated locations of the item-response categories that represent the lowest and highest levels of ability of the sample. In addition, we identified the number of individuals who received the highest possible score (ceiling) and the lowest possible score (floor).
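As a worked illustration of the expected value, the sketch below sums each category value multiplied by its probability, assuming categories scored 0 to 4 and graded response probabilities with a common slope. The function names and thresholds are hypothetical and are not the calibrated item parameters.

```python
import math


def category_probs(theta, thresholds, slope=1.0):
    """Graded response category probabilities (common slope across items)."""
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def expected_item_value(theta, thresholds, slope=1.0):
    """Sum of category values (0..m-1) multiplied by their probabilities."""
    probs = category_probs(theta, thresholds, slope)
    return sum(j * p for j, p in enumerate(probs))


# Hypothetical thresholds, symmetric around 0 on the logit metric.
toy_thresholds = [-2.0, -1.0, 1.0, 2.0]
for theta in (-6.0, 0.0, 6.0):
    print(theta, round(expected_item_value(theta, toy_thresholds), 3))
```

The expected value rises smoothly from 0 toward 4 as the score increases, which is what allows each response category to be placed at a location on the same metric as the person scores.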
CAT real data simulations.
We based the activity CAT algorithms on the HDRI software24 developed at the Health and Disability Research Institute, Boston, Massachusetts. The CAT software includes options for item selection, score estimation using weighted likelihood,52 and stop rules based on the number of items, the level of precision, or both. We used a real data simulation approach for investigating the merits of CAT; that is, the score estimated from the complete set of actual parent responses to the activity items (IRT criterion score) served as the criterion against which scores from the CAT were compared. As items were selected for administration in the simulation, responses were taken from the actual data set. We selected the item "getting up from the floor" to be the first activity item administered to all participants because its difficulty parameter was in the middle of the range and its content seemed appropriate for most children. After each response, an estimated score based on all items administered to that point in the simulation, along with its associated standard error, was calculated. The next item selected was the one that could provide the highest information at the estimated score. We established specific stop rules based on the number of items (5, 10, or 15) and did not use precision for stop rule decisions. The validity of this real data simulation approach for studying CAT estimated scores assumes that people respond in much the same way to items regardless of their context; that is, neither the items that precede or follow a given item nor the use of short versus long forms would influence a person's responses. Basically, this is the assumption of independence of item responses that is made with all common IRT models. In the present study, we developed 3 CAT scores in the simulations to reflect the 3 stop rules based on number of items (CAT-15, CAT-10, and CAT-5). These simulated scores were compared with a "gold standard" (ie, the actual IRT latent trait score for activity estimated by the full item bank).
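The real data simulation logic can be sketched as follows. This is not the HDRI software: the item names, thresholds, and responses are hypothetical, and a grid-based expected a posteriori (EAP) estimate with a standard normal prior is swapped in for the weighted likelihood estimator to keep the example short. Item selection uses maximum information at the current score estimate, and the stop rule is a fixed number of items.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # score grid from -4 to 4 logits


def cumulative(theta, thresholds, slope):
    """Cumulative GRM curves P(X >= k), padded with 1 and 0."""
    inner = [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def category_probs(theta, thresholds, slope=1.0):
    cum = cumulative(theta, thresholds, slope)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def item_information(theta, thresholds, slope=1.0):
    """GRM item information: sum over categories of (dP/dtheta)^2 / P."""
    cum = cumulative(theta, thresholds, slope)
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = slope * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 1e-12:
            info += dp * dp / p
    return info


def eap_score(responses, bank, slope=1.0):
    """Grid-based EAP estimate (a stand-in for weighted likelihood)."""
    log_w = []
    for theta in GRID:
        lw = -0.5 * theta * theta  # standard normal prior, up to a constant
        for item, x in responses.items():
            lw += math.log(max(category_probs(theta, bank[item], slope)[x], 1e-300))
        log_w.append(lw)
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]
    return sum(t * wi for t, wi in zip(GRID, w)) / sum(w)


def simulate_cat(full_responses, bank, first_item, n_items):
    """Replay stored responses, choosing the most informative item each step."""
    administered = {first_item: full_responses[first_item]}
    theta = eap_score(administered, bank)
    while len(administered) < min(n_items, len(bank)):
        remaining = [i for i in bank if i not in administered]
        next_item = max(remaining, key=lambda i: item_information(theta, bank[i]))
        administered[next_item] = full_responses[next_item]  # taken from the data set
        theta = eap_score(administered, bank)
    return theta, list(administered)


# Hypothetical 5-item bank and one hypothetical response record (categories 0-4).
toy_bank = {
    "easy": [-3.0, -2.5, -2.0, -1.5],
    "mid_low": [-1.5, -1.0, -0.5, 0.0],
    "middle": [-1.0, -0.5, 0.5, 1.0],
    "mid_high": [0.0, 0.5, 1.0, 1.5],
    "hard": [1.5, 2.0, 2.5, 3.0],
}
toy_responses = {"easy": 4, "mid_low": 3, "middle": 2, "mid_high": 1, "hard": 0}

cat3_score, cat3_items = simulate_cat(toy_responses, toy_bank, "middle", 3)
full_score = eap_score(toy_responses, toy_bank)  # criterion from the full bank
print(cat3_items, round(cat3_score, 2), round(full_score, 2))
```

Running the same routine over every record in a calibration data set and correlating the short-form scores with the full-bank scores would give the kind of agreement statistic used to judge each stop rule.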
Discriminant and concurrent validity.
Our logic in analyzing the concurrent and discriminant validity was to determine whether the person score estimates from the full item bank and the 3 CAT versions could produce interpretable scores. The ability of the full item bank and each CAT version (5-, 10-, and 15-item stop rules) to discriminate between groups of children based on levels of severity was evaluated by comparing average scores across the MACS and GMFCS levels using one-way analysis-of-variance tests with post hoc comparisons. Because of the relatively small numbers in GMFCS and MACS levels IV and V, we combined them. To assess concurrent validity, Pearson correlations were calculated between the full item bank and the PODCI physical function and sports skills summary score, the Wee-FIM motor score, and the PedsQL-CP daily activity and school activity subscales.
Funding Source for the Study
This study was supported by the Shriners Hospital for Children Foundation (grant 8957) and an Independent Scientist Award to Dr Haley (National Center on Medical Rehabilitation Research/National Institute of Child Health and Human Development/National Institutes of Health, grant K02 HD45354–01A1).
| Results |
|---|
Unidimensionality
Based on the final item bank of 36 items, one factor explained 72% of the item variance, and all of the factor loadings were moderate to very high (range=0.607–0.918). The ratio of the eigenvalues of the first and second factors was 18.8:1. The CFI and TLI values indicated acceptable fit; the RMSEA of 0.103 was higher than the acceptable range. A summary of the CFA results for the 45- and 36-item scales is provided in Table 2. The average inter-item correlation was .70 (SD=.09). Based on the multiple fit indexes, factor analysis, and inter-item correlation results, we concluded that the unidimensionality assumptions of the activity scale were met. There were no residual correlations greater than .2, so the local independence assumption also was satisfied.
χ²=842, df=815, P=.243). The item fit was generally acceptable in the final item bank, with the exception of 2 items ("eats a meal" [P=.014] and "gets into and out of a car" [P=.02]). We chose to retain these items in the bank because of the importance of their content or their location along the activity scale.
Item Invariance
We examined item parameter stability by grouping participants into high- and low-function groups according to activity ability level; the correlation between the item parameters estimated based on those 2 groups was .88. The activity items in the final item bank are listed in the Appendix.
No DIF was noted for MACS level. Only one item ("climbs and moves on high playground equipment") showed moderate DIF for age. Two items ("hops and skips while playing games with other children of similar age, such as during hopscotch or a relay race" and "prepares to eat a meal") showed moderate DIF in CP diagnosis. Children with quadriplegia had more difficulty with these 2 items compared with children with either hemiplegia or diplegia. Three similar items ("prepares to eat a meal," "crosses a quiet 2-lane neighborhood street," and "keeps up with other children of similar age while walking up stairs") showed moderate DIF in GMFCS levels, with children with greater gross motor severity having more difficulty with these items. Because of their important content, as indicated during the cognitive testing sessions, we did not remove any of the DIF items at this time. In the future, these items may be removed or revised.
Scale Coverage
We found generally good coverage of the sample with the 36 items, as indicated in Figure 1, which displays the item map of person scores and expected response category values for all 36 items. For example, the expected category value for "unable to do" for the item "eats a meal" (easiest item) is about 26 on the T-scale metric (mean=50, SD=10). The expected category value for "without any difficulty" for the most-difficult item ("hikes or jogs for 2 or more miles") is about 75. Locations for all of the other categories for the items in between are depicted in Figure 1. There were minimal ceiling effects (n=3 [1%]) and floor effects (n=11 [3.6%]). A couple of areas along the vertical item category scale (around a score of 50 to 60 in Fig. 1) had some small gaps in content coverage.
| Discussion |
|---|
Based on the CFAs and item fit analyses, the final set of activity items was sufficiently unidimensional to meet the assumption of IRT modeling. A number of items with large misfit were removed; for example, "social dancing" could be accomplished at a very rudimentary level by children in wheelchairs, whereas children with high levels of physical functioning ability may choose not to take part in dancing activities due either to lack of peer acceptance or to being uncomfortable with the activity. We did keep in the bank a few items that exceeded the threshold for DIF or fit. These items were retained mainly for content. By keeping items in the bank that exhibit either DIF or misfit, the estimation of scores may have been affected negatively. However, these items appeared to have a trivial effect on the CAT-15 scores and likely a small effect on the CAT-10 scores.
Using the full set of 36 final items, we found minimal ceiling and floor effects across a very diverse sample of children with CP with a wide age range. The children with the minimum score (n=11) were not in the range of 2 to 5 years; the youngest child who received a minimum score was 7 years of age, and the average age of children who received a minimum score was 12 years. These findings indicate that the coverage for all ages was appropriate; however, we had some problems covering children with the most-severe activity restrictions. We did have relatively small numbers of children at both the low and high ends of the age range of 2 to 21 years. This creates some limitation in knowing how well the scale works for the entire age range. We hope in the future to sample more children in the lower (2–5 years) and higher (14–21 years) age groups to make sure the item bank is robust for these age groups. As shown in Figure 1, we found only small gaps in item coverage along the full activity scale. A previous study19 has reported major problems with ceiling effects for children with CP at GMFCS level I and floor effects for children with CP at GMFCS levels IV and V. We will need to address the floor effects for children who are most severely involved in future item bank development.
The results of the CAT simulations indicate that the 5-, 10-, and 15-item models yield accurate estimates of activity in children with CP. Several other studies on the development of CAT models also have conducted simulations using real data sets, comparing responses to all items in the item bank with CAT simulations of various lengths.21,24,27,28 We believe these simulations are likely good approximations of actual CAT administrations, yet some overestimation is possible. In future studies, administration of the full item bank along with the CAT models is recommended in prospective clinical studies to examine accuracy in real clinical situations.
These preliminary data indicate that the full item bank and all 3 CAT versions can discriminate among GMFCS and MACS levels for children with CP. It is noteworthy that all of the simulated CAT versions were able to discriminate across the 4 categories (combined IV and V) of the GMFCS and MACS. One limitation we should note is that although the GMFCS is valid for children up through the age of 12 years, we applied it to the entire sample. Future work should use the expanded and revised version of the GMFCS.56
We found the activity scale unable to discriminate between MACS levels II and III. In contrast, we found the companion scale on upper-extremity skills clearly discriminated between these 2 levels.34 The PODCI physical function and sports subscale, the PedsQL-CP daily activity scale, and the Wee-FIM motor scale appear to measure related activity concepts, as indicated by their high positive correlations.
Our sample size was relatively small for the analyses that we conducted. The effects of a small sample size were minimized by using a one-parameter model. In the future, a larger sample will be used for the calibration work so that we can be more confident of the findings and extend our analyses to using 2-parameter models, if warranted. We decided to publish results at this stage because the findings were compelling and this work is one of the first examples of developing a CAT for children with CP. Our data are not fully representative of population-based studies of children with CP, as we have underrepresented children with more-severe limitations in mobility. Nevertheless, we had sufficient low-level activity items for most of the children with low activity levels.
We acknowledge an additional concern in the interpretation of the data. We combined what are arguably 2 modes of data collection: collection of data in the clinic and collection of data by parents at home (n=11) using the Internet. We combined these modes primarily to increase our sample size. This difference in mode may have created some error and bias; however, we believe any error or bias was small.
Although most indicators we used (TLI, CFI, inter-item correlations, factor analysis results) suggested good fit to a unidimensional scale, the RMSEA was higher than ideal. These findings may have been due to our effort at putting some nontraditional activity items into the item pool, such as taking part in indoor games and sports. We felt these items were important to the activity scale and should be able to be scaled into one activity scale. Future work will be needed to confirm these findings.
A number of additional steps in our CAT development will be needed before widespread use can be expected. First, we have conducted test-retest reliability on the activity CAT, and the results will be presented in a future report. We are currently testing the sensitivity and responsiveness of the activity CAT in a series of children with lower-extremity surgeries. Another approach has been to expand the item bank to include children with conditions other than CP. In our first phase, we are building and testing new items for children with brachial plexus birth palsy. This effort also may help fill in some items that are needed at the lower end of the activity continuum.
If successful, CAT versions will be used within the SHC system to evaluate activity changes in children with CP after orthopedic surgeries and conservative interventions such as bracing, therapy, and spasticity (hypertonicity) management medications and injections. Using CAT versions of the other scales (global physical health, upper- and lower-extremity skills) in conjunction with the activity scale will assist physical therapists and other clinicians in understanding the relationship among changes in different functional areas following rehabilitation interventions.
| Conclusion |
|---|
| Appendix. |
|---|
| Footnotes |
|---|
Human subject approval was obtained at each participating institution and through the Boston University Institutional Review Board.
* Muthén & Muthén, 3463 Stoner Ave, Los Angeles, CA 90066.
† Scientific Software International Inc, 7383 N Lincoln Ave, Ste 100, Lincolnwood, IL 60712-1747.
| References |
|---|