Development of a Computerized Adaptive Test for Assessing Balance Function in Patients With Stroke

I-Ping Hsueh, Jyun-Hong Chen, Chun-Hou Wang, Cheng-Te Chen, Ching-Fan Sheu, Wen-Chung Wang, Wen-Hsuan Hou, Ching-Lin Hsieh


Background An efficient and precise measure of balance is needed to improve administration efficiency and to reduce the assessment burden for patients.

Objective The purpose of this study was to develop a computerized adaptive testing (CAT) system for assessing balance function in an efficient, reliable, and valid fashion in patients with stroke.

Design Two cross-sectional prospective studies were conducted.

Setting This study was conducted in the departments of physical medicine and rehabilitation in 6 hospitals.

Patients The participants were inpatients and outpatients who were receiving rehabilitation.

Measurements A balance item pool (41 items) was developed on the basis of predefined balance concepts, expert opinions, and field testing. The items were administered by 5 raters to 764 patients. An item response theory model was fit to the data, and the item parameters were estimated. A simulation study was used to determine the performance (eg, reliability, efficiency) of the Balance CAT. The Balance CAT and the Berg Balance Scale (BBS) then were tested on another independent sample of 56 patients to determine the concurrent validity and time needed for administration.

Results Seven items did not meet the model's expectations and were excluded from further analysis. The remaining 34 items formed the item bank of the Balance CAT. Two stopping rules (ie, reliability coefficient ≥0.9 or 6 items administered) were set for the CAT. The simulation study showed that the patients' balance scores estimated by the CAT had an average reliability of .94. The scores obtained from the CAT were closely associated with those of the full item set (Pearson r=.98). The scores of the Balance CAT were highly correlated with those of the BBS (Pearson r=.88). The average time needed to administer the Balance CAT (83 seconds) was only 18% of that of the BBS.

Limitations The convenience sampling of both samples may limit the generalization of the results. Further psychometric investigation of the Balance CAT is needed.

Conclusion The results provide strong evidence that the Balance CAT is efficient, reliable, and valid for patients with stroke.

Measuring balance is important for clinicians in determining the severity of a stroke, selecting the most appropriate therapy, and evaluating treatment outcomes for people with stroke.1,2 To be clinically useful, a short and precise measure is needed in busy clinical settings to enhance administration efficiency and to reduce the assessment burden for raters and patients with severe stroke.

Traditional balance measures, such as the Berg Balance Scale (BBS),3 the Balance Evaluation Systems Test (BESTest),4 and the Dynamic Gait Index (DGI),5 are commonly suggested for both researchers and clinicians. The BBS is identified by physical therapists as the most commonly used measure in patients with stroke. However, possible floor and ceiling effects of the BBS and the time required for administration may limit its application.6 The BESTest recently was developed to help therapists identify the type of balance impairment to determine specific treatments for their patients.4 However, the BESTest, which consists of 36 items, is time-consuming for both clinicians and patients. The DGI contains 8 walking tasks and was developed to capture walking problems in older people. However, the DGI is not appropriate for patients who cannot walk, which is a common problem in patients with stroke.7 Thus, these traditional balance measures are not perfectly suited to making precise and efficient assessments for a wide variety of patients with stroke.

To achieve both precision and efficiency in assessments, computerized adaptive testing (CAT) has been suggested.8,9 This assessment tool uses a computer to administer items to respondents and allows respondents' levels of function to be estimated as precisely as desired (ie, to reach a preset reliability level). Because each assessment is tailored to the unique ability level of each respondent, CAT is adaptive to each patient: items are administered on the basis of the patient's performance. For example, if a patient cannot stand independently, the computer "knows" not to ask whether he or she can walk; instead, it asks whether he or she can sit without assistance. The result is a decrease in administrative burden, with little loss in precision.8 Thus, CAT has shown efficiency, reliability, and validity in health-related measurements.9–13

The purpose of this study was to develop a CAT system for raters to assess balance function in patients with stroke. We further investigated the reliability, concurrent validity, and efficiency (ie, time and number of items needed to administer) of the Balance CAT. Our goal was to develop the Balance CAT as an efficient and precise measure for assessing balance deficits in a wide variety of patients with stroke.


The study consisted of 3 parts: (1) development of the Balance CAT, (2) a simulation study to determine the performance and stopping rules of the Balance CAT for further usage, and (3) a validity and efficiency study to determine the concurrent validity and administration efficiency of the Balance CAT.

Development of the Balance CAT

Item construction.

We developed a pool of 41 balance items on the basis of predefined balance concepts (ie, the ability to maintain a given posture and the ability to ensure equilibrium while changing position).14 Clinical experts (including physical therapists, occupational therapists, and physiatrists) were consulted, and we conducted field testing to finalize the item descriptions and scoring. We aimed to include items covering the entire spectrum of balance function (ie, items with a wide, evenly distributed range of difficulty, from very easy to very difficult). We also aimed to develop items that were easy and feasible to administer; items that were difficult to administer in a clinical setting were deleted. For example, items assessing reactive postural control were not included because the raters found it difficult to control the perturbation. We also deleted items requiring much time or materials to administer; for example, walking and stair climbing were not included because they are time-consuming or require materials (eg, stairs). We expected that these principles would make the items applicable to a wide range of patients and settings.

Of the 41 items, 29 have 2 response categories (able or unable to perform a balance-related task). The other 12 have 3 response categories (ie, 0=unable, 1=able to complete the task but not smoothly, and 2=able to complete the task smoothly; alternatively, 0=unable, 1=able to maintain balance while performing a task for 1 to 5 seconds, and 2=able to maintain balance while performing a task for more than 5 seconds). The patients were asked to perform the 41 standardized tasks without using any devices. Raters demonstrated the tasks when necessary. Only one trial was allowed for each participant, and performance was rated using the aforementioned categories.

Item testing.

We tested the item pool on 764 patients with stroke. Inpatients and outpatients in 6 hospitals in Taiwan were invited to participate in our study. The following criteria were used to determine whether patients could be included: (1) first or recurrent onset of cerebrovascular accident, (2) stroke with hemiparesis or hemiplegia, (3) ability to follow instructions to complete the assessments, and (4) informed consent given personally or by proxy. We excluded patients with other major diseases (eg, severe rheumatoid arthritis) that might affect their balance ability.

Five raters were trained to administer the 41 items. The interrater reliability of the raw sum score of the items was satisfactory (intraclass correlation coefficient=.95).

Data analysis.

We first examined the basic assumptions of item response theory (IRT): unidimensionality and local independence. Items violating these assumptions were deleted. We then used the generalized partial credit model (GPCM)15 to fit the data and estimate the item parameters for the item bank. The GPCM allows each item its own slope (also called "discrimination") parameter and thus generally provides a better fit to the data. Misfit items were deleted, and the GPCM was refit to the remaining items to determine the final item parameters.

Evaluation of IRT assumptions.

We used Mplus software* to perform a 1-factor confirmatory factor analysis (CFA) to investigate the unidimensionality of all 41 items. The CFA model was estimated by weighted least squares with robust standard errors and mean- and variance-adjusted chi-square statistics, using the tetrachoric and polychoric correlations derived from the item responses. We used 4 indexes to assess model-data fit (ie, unidimensionality). The first was a chi-square test with a critical P value of .05. The second was the root mean square error of approximation (RMSEA); an RMSEA below 0.06 indicated good fit, and one below 0.08 indicated acceptable fit.16,17 The third and fourth were the comparative fit index (CFI) and the Tucker-Lewis index (TLI), both with critical values above 0.95.17 In addition, items with factor loadings below 0.40 were deleted.18,19

Local independence means that, given an individual's score on the latent variable, no relationship remains between the responses to any pair of items.20 To examine local independence, we calculated the residual correlation of each pair of items from the CFA. A residual correlation below 0.10 was considered good, and one below 0.25 acceptable, for local independence.17,19
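The residual-correlation check described above can be sketched as follows: under a 1-factor model with standardized loadings, the model-implied correlation between two items is the product of their loadings, so the residual is the observed correlation minus that product. A minimal Python sketch (the loadings and correlations here are illustrative, not the study's data):

```python
import numpy as np

def residual_correlations(observed_corr, loadings):
    """Residual inter-item correlations after removing the single
    factor: observed correlation minus the model-implied correlation
    (the product of the two standardized loadings)."""
    lam = np.asarray(loadings, dtype=float)
    resid = np.asarray(observed_corr, dtype=float) - np.outer(lam, lam)
    np.fill_diagonal(resid, 0.0)  # the diagonal is not informative here
    return resid

def flag_dependent_pairs(resid, cutoff=0.25):
    """Item pairs whose absolute residual correlation exceeds the
    cutoff (0.25 was the acceptability threshold used above)."""
    i, j = np.where(np.triu(np.abs(resid) > cutoff, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

In practice the observed matrix would be the tetrachoric/polychoric correlations and the loadings those estimated by the CFA; flagged pairs would be candidates for deletion.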

Fitting an IRT model.

We used PARSCALE 4.1 software21 to fit the GPCM, calculate the item fit index, and estimate the item parameters. We deleted items with a significant likelihood ratio test (ie, chi-square test) or large standard errors in the estimated parameters. Furthermore, the item response curve was plotted to examine the order of the step difficulties for each item. If a higher-numbered step difficulty of an item was smaller than the adjacent lower-numbered one, the item's response categories were considered disordered. Disordered response categories were collapsed to restore suitable scaling.
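The GPCM and the step-difficulty ordering check can be sketched in Python. This is an illustrative sketch, not the PARSCALE implementation, and the item parameters shown are hypothetical:

```python
import numpy as np

def gpcm_probs(theta, a, steps):
    """Category response probabilities for one item under the
    generalized partial credit model (GPCM).

    theta : latent balance ability
    a     : item slope ("discrimination")
    steps : step difficulties b_1..b_m (category 0 contributes no step)
    """
    # Cumulative logits z_k = sum_{v<=k} a * (theta - b_v), with z_0 = 0
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(steps)))))
    ez = np.exp(z - z.max())  # shift by the max for numerical stability
    return ez / ez.sum()

def steps_disordered(steps):
    """True if any higher-numbered step difficulty is smaller than
    its adjacent lower-numbered one (the criterion described above)."""
    return any(b2 < b1 for b1, b2 in zip(steps, steps[1:]))

# Hypothetical trichotomous item: slope 1.2, ordered steps -0.5 and 0.8
p = gpcm_probs(theta=0.0, a=1.2, steps=[-0.5, 0.8])
```

Higher ability shifts probability mass toward the higher categories, which is what makes the model usable for adaptive item selection.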

CAT Simulations to Propose Stopping Rules and to Validate the CAT

To determine the most appropriate stopping rules for the Balance CAT, we used a Fortran CAT program written by the authors to perform a simulation with the actual data of the 764 patients with stroke who participated in the item pool testing. We used maximum Fisher information (MFI) as the item selection algorithm and the maximum a posteriori (MAP) method to estimate the patients' balance ability; both are efficient for CAT applications. We investigated the relationship between reliability and the number of items used, and the simulation results were used to propose the stopping rules for the Balance CAT.
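The two components named above are standard in CAT: MFI picks the unused item with the greatest Fisher information at the current ability estimate, and MAP maximizes the posterior of ability under a normal prior. A simplified Python sketch, assuming GPCM items, a standard normal prior, and a discrete ability grid (the 3-item bank in the usage example is hypothetical, not the study's item bank):

```python
import numpy as np

THETA = np.linspace(-4, 4, 161)  # grid over the latent ability metric

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities for one item (see text)."""
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(steps)))))
    ez = np.exp(z - z.max())
    return ez / ez.sum()

def item_information(theta, a, steps):
    """Fisher information of one GPCM item at ability theta:
    a^2 * (E[k^2] - E[k]^2), the score variance scaled by the slope."""
    p = gpcm_probs(theta, a, steps)
    k = np.arange(len(p))
    return a**2 * (np.dot(p, k**2) - np.dot(p, k)**2)

def map_estimate(responses, bank):
    """Maximum a posteriori ability given (item_index, score) pairs,
    with a standard normal prior evaluated on the grid."""
    log_post = -0.5 * THETA**2  # N(0, 1) prior, up to a constant
    for idx, score in responses:
        a, steps = bank[idx]
        log_post += np.log([gpcm_probs(t, a, steps)[score] for t in THETA])
    return THETA[np.argmax(log_post)]

def next_item(theta_hat, bank, used):
    """Maximum Fisher information (MFI) item selection."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *bank[i]))

# Hypothetical bank of 3 dichotomous items with difficulties -1, 0, 1
bank = [(1.0, [-1.0]), (1.0, [0.0]), (1.0, [1.0])]
```

For a patient at average ability, MFI selects the moderate-difficulty item first, which mirrors the Balance CAT's behavior of starting with items of moderate difficulty.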

Then the validity of the CAT was examined. Pearson r was used to examine the relationship between the estimated ability using the CAT with the proposed stopping rules and the ability estimated by all items in the established item bank. We also calculated root mean squared differences between the estimated abilities using the CAT and all items of the item bank.

The CAT reports a participant's score and a standard error for each individual. A CAT's reliability can be computed for each participant. For a group of participants, reliability can be averaged across all participants (called the mean CAT reliability) or a set of participants (eg, those patients with extreme balance ability). We expected that with the Balance CAT, the mean reliability would reach 0.90 or above for all participants and that the reliability would reach 0.85 or above for a participant with extreme balance ability.
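Per-person reliability in IRT-based CATs is conventionally derived from the standard error of the ability estimate: with the latent trait scaled to unit variance, the error variance SE² is the unreliable share of the score. The authors do not state their exact formula, so the following one-liner is an assumption based on that convention:

```python
def cat_reliability(se_theta):
    """Reliability of an ability estimate from its standard error,
    assuming the latent trait is scaled to unit variance (so the
    error variance se^2 is the unreliable share of the score)."""
    return 1.0 - se_theta**2
```

Under this convention, reaching the study's .90 target corresponds to driving the standard error below about 0.32.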

Validity and Efficiency Study to Determine Concurrent Validity and Time Needed to Administer the CAT

We administered the Balance CAT and the BBS3 on another independent sample (56 patients who were receiving inpatient or outpatient rehabilitation). The selection and exclusion criteria for the patients in this part of the study were the same as those of the aforementioned study.

Our Balance CAT program was installed on a Web-based server. The raters used a personal digital device (iPod touch) or a cell phone to administer the CAT via the Internet. The Balance CAT presents scoring criteria and animations for each item, which are useful for both patients and raters to speed up assessment. The testing results were shown on the screen and could be exported, if necessary.

One rater administered the CAT, and the other rater administered the BBS independently within 24 hours. The order of administrations was counterbalanced. We recorded the time needed to administer the BBS and CAT individually.

The BBS has 14 items (ie, 1 sitting item and 13 standing items).3 These items are based upon a 5-level scale (0–4). Its total score ranges from 0 to 56. The psychometric properties (including reliability, validity, and responsiveness) of the scale are well established in patients with stroke.22 The BBS has been identified as the most commonly used measure for patients with stroke.23 Thus, we used the BBS as our criterion to validate the Balance CAT.

We used Pearson r to examine the concurrent validity of the Balance CAT. We examined the association between the scores of the Balance CAT and those of the BBS.


A total of 764 patients participated in the item bank development study. These participants, with a mean age of 61.6 years, had experienced a stroke from 2 weeks to more than 5 years previously. Few (4.2%) had bilateral hemiparesis. The patients' mean Balance CAT score was 5.4 (SD=2.1, range=0–10). The patients had a wide range of balance function (from "unable to sit independently with trunk support [on a chair with a backrest]" to "able to hop in place on the more affected foot"). Table 1 shows further characteristics of the patients.

Table 1.

Characteristics of the Participants

Evaluation of IRT Assumptions

Except for the chi-square test (χ2=22.4, df=5, P<.001), the other 3 model-data fit indexes—RMSEA (0.07), CFI (0.99), and TLI (0.99)—showed good model fit. The chi-square test reached significance mainly because of our large sample size. In addition, no items had a factor loading below 0.40, and the residual correlations for all pairs of items were below 0.25. These results indicated that all 41 items generally met both assumptions of unidimensionality and local independence. Figure 1 shows a flowchart of the data analysis and main results.

Figure 1.

The flowchart of data analysis and main results. GPCM=generalized partial credit model, IRC=item response curve, CAT=computerized adaptive testing.

GPCM Model Fit and the Final Item Bank

All item parameters were estimated using the GPCM. Seven items did not fit the model's expectations and were therefore excluded from the item bank (Tab. 2). In addition, the step difficulties (thresholds) of 3 items were disordered; their 2 highest response categories were collapsed to form a single category. In the end, the item bank consisted of 34 items (ie, 26 dichotomous items and 8 trichotomous items). We further estimated the item parameters and the fit index of the 34 items. The results were acceptable (χ2=262.7, df=246, P=.220). Table 2 shows the details of the items and parameters.

Table 2.

Contents, Factor Loadings, and Item Parameters of the Balance Item Bank

The IRT modeling generated standardized scores ranging from −2.4 to 2.3 for balance function in the sample of patients with stroke. Because negative scores could be confusing and awkward in practical use, we linearly transformed them to a 0-to-10 scale for ease of interpretation: transformed score = ([2.4 + IRT score]/4.7) × 10. Figure 2 presents the item difficulty hierarchy and participants' balance function.
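The transformation above is a simple linear rescaling of the observed IRT score range onto the 0-to-10 reporting metric; as a sketch:

```python
def transform_score(irt_score, lo=-2.4, hi=2.3):
    """Linearly rescale an IRT score (observed range lo..hi in this
    sample) onto the 0-10 reporting metric:
    ((score - lo) / (hi - lo)) * 10, equivalently ([2.4 + score]/4.7) * 10."""
    return (irt_score - lo) / (hi - lo) * 10
```

The endpoints of the observed range map to 0 and 10, and an average-ability score of 0 maps to roughly 5.1.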

Figure 2.

Item difficulty hierarchy and participants' balance function. The numbers on the right indicate item number and its response category. For trichotomous items, for example, 34.1 represents the first step difficulty of item 34 and 34.2 represents the second step difficulty of item 34. For dichotomous items, only the item number is shown.

CAT Simulations and Stopping Rules

Figure 3 shows the mean and 90% confidence intervals of reliability across all patients as a function of the number of items used. On average, 4 items were sufficient to reach a reliability of .90, and 6 items were needed to reach a reliability of .95. Only 13 patients (1.7% of the 764), all with extremely poor balance function, reached a reliability of only .85 after 6 items.

Figure 3.

Reliability of the Balance computerized adaptive test (CAT) under different test lengths. CI=confidence interval.

On the basis of these results, the stopping rules were determined by a combination of reliability and test length. We set the reliability coefficient at ≥.90 because it is a common standard for individual comparison.24 The other rule (a maximum of 6 items) was set because our preliminary simulation indicated little increase in reliability beyond 6 items. In addition, because the difficulty of administered items generally increases or decreases steadily, patients with extremely poor or excellent function usually need more items than patients with moderate function to reach a predefined level of reliability (eg, .90). Thus, for patients with potentially extreme balance function (ie, those passing or failing the first 2 items), the CAT applied special rules, administering the 2 most difficult (or easiest) items to increase test efficiency. Patients who passed the 2 most difficult (or failed the 2 easiest) items were assigned the highest (or lowest) scores.
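The combined stopping rule can be expressed as a single predicate: testing ends as soon as the target reliability is reached or the maximum test length is exhausted, whichever comes first. A minimal sketch with thresholds mirroring those stated above:

```python
def should_stop(reliability, n_items, rel_target=0.90, max_items=6):
    """Balance CAT stopping rule: stop once the ability estimate is
    reliable enough (>= rel_target) or the maximum test length
    (max_items) has been reached, whichever comes first."""
    return reliability >= rel_target or n_items >= max_items
```

A user wanting higher precision (eg, .95) would raise `rel_target`, at the cost of a longer test, as the simulation results above indicate.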

The Balance CAT started testing with items of moderate difficulty (eg, "sitting to standing," "picking up a pen from the floor"). Subsequent items were provided automatically by the CAT depending on the patient's performance. Using the aforementioned stopping rules, we found that the ability estimated by the Balance CAT was highly correlated with the ability estimated by all items (Pearson r=.98). The root-mean-square difference between the two estimates was negligible (0.40). The average reliability coefficient was .94, and the average test length was 4.3 items. As expected, patients with moderate balance function reached the highest reliability, and patients with extreme balance function reached modest reliability: the estimated CAT scores for the patients with the poorest and the best balance function had reliability coefficients of .85 and .90, respectively. Patients with extremely poor balance constituted 1.7% of the 764 patients.

Concurrent Validity and Efficiency of the Balance CAT

A total of 56 patients (28 men; mean age of 62 years) voluntarily participated in the validity and efficiency study. Their balance function was widely distributed, as shown by BBS scores ranging from 0 to 56 (the lowest and highest possible scores). The mean BBS score was 21.5. One patient achieved the maximum score on the BBS; none had the maximum score on the CAT. However, 10.7% of the patients achieved the minimum score on the BBS, indicating a substantial floor effect, whereas only 1 patient had the minimum score on the CAT. Furthermore, the two samples in this study were similar in terms of sex, age, lesion side, diagnosis, and balance score (chi-square or t test, as appropriate; P>.10).

The scores of the Balance CAT were highly correlated with those of the BBS (Pearson r=.88). The CAT used only 3 items, on average, for these patients. The average time needed to administer the CAT to the 56 patients was only about 18% of that of the BBS (1 minute 23 seconds versus 7 minutes 54 seconds). The longest time required to administer the CAT was 7 minutes, and the longest time needed to administer the BBS was 20 minutes.


To the best of our knowledge, the current study is the first to develop a CAT assessing balance function in patients with stroke. Our results show that the Balance CAT, a performance-based measure, takes 4 items, on average, to estimate a patient's balance function with high reliability and validity. Thus, the Balance CAT is efficient, reliable, and valid for patients with stroke and has potential utility for both clinicians and researchers.

We also found that the estimated scores of the Balance CAT were highly associated with those of the BBS. These results support the concurrent validity of the Balance CAT. Furthermore, we found that the BBS had a substantial floor effect (>10% of the 56 patients had the lowest score), as also shown in previous studies.6,22 Only 1 patient had the minimum score on the Balance CAT. The main reason is that the easiest task of the BBS is sitting without support, whereas that of the CAT is sitting with support on a chair with a backrest. These observations indicate that the Balance CAT is more discriminative and useful than the BBS for patients with poor balance function.

One of the greatest advantages of the Balance CAT is the reduction of patients' and clinicians' assessment time and burden. Our results show that only about 4 items, requiring an average of 83 seconds, were needed to complete the balance assessment with sufficient reliability and validity. In contrast, raters and patients needed 14 items and, on average, nearly 8 minutes to complete the BBS assessment. Thus, the Balance CAT could substantially enhance administration efficiency in time-pressed clinical settings. In addition, although CATs are frequently applied to questionnaires9,10,13 rather than to performance-based measures, our results indicate that CAT also can reduce the burden of performance-based measures (eg, balance in this study) on both raters and patients.

Another advantage of using the Balance CAT is that the test length, or the stopping rules, of the instrument can be adjusted on the basis of a user's needs. If a user would like to increase the reliability of the test (eg, .95), more items (6 items, on average, as shown in our simulation study) are needed. On the other hand, if limited time is available to administer a CAT, the user has to compromise and accept a loss of precision.

The test results of the Balance CAT are reported and stored immediately. Particularly, the reliability and 95% confidence interval of the estimated balance ability can be provided simultaneously. The 95% confidence interval of the estimated score for each patient may be useful for users to determine real change across repeated assessments. That is, the users can claim that a patient has made a real change if the confidence intervals for 2 consecutive assessment scores do not overlap with one another. In addition, the automatic storage of testing results is useful for documentation and further data analysis. The efficiency of clinical data management, therefore, could be further improved.
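The non-overlap criterion for real change can be sketched directly: build a 95% confidence interval for each assessment from its score and standard error, and declare real change only when the two intervals do not overlap. A hedged Python sketch (the z value of 1.96 assumes approximately normally distributed estimates):

```python
def real_change(score1, se1, score2, se2, z=1.96):
    """Flag a real change between two assessments when their 95%
    confidence intervals (score +/- z*SE) do not overlap."""
    lo1, hi1 = score1 - z * se1, score1 + z * se1
    lo2, hi2 = score2 - z * se2, score2 + z * se2
    return hi1 < lo2 or hi2 < lo1
```

This is a conservative criterion; small true improvements within the measurement error of either assessment are not flagged.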

As expected, the performance of the Balance CAT depended on a patient's balance function. Patients with moderate balance function reached the highest reliability, whereas the Balance CAT needed more items to assess patients with extreme balance function, particularly poor balance function, with sufficient reliability. Because only 1.7% of the patients could not reach a reliability coefficient of .90 (reaching .85 instead), this weakness compromises the performance of the CAT only slightly. In the future, additional items assessing very poor or excellent balance function should be developed to improve the performance of the CAT.

The current advantage of the Balance CAT is that it provides efficient and precise assessments of balance deficit. Some potential advantages of the Balance CAT remain to be explored. For example, the item bank of the Balance CAT has been calibrated according to item response theory. The item parameter (ie, item difficulty) is useful for arranging the items from easy to difficult. Thus, a functional hierarchy (eg, sitting, standing, stepping) or recovery of balance function could be developed among the items. Such information could be useful for clinicians in determining treatment plans. It also is worthwhile to explore whether the Balance CAT estimates are associated with stroke severity and falls at hospitals and in the community. The information could be useful for data interpretation and treatment planning.

The Balance CAT might not yet be able to replace the existing balance measures for patients with stroke. One reason is that the cost benefit of the Balance CAT seems limited at present, as both software and hardware are needed for the CAT. In addition, the Balance CAT requires administration of only a few items, which may limit opportunities for clinicians to observe patients' difficulty in performing the tasks. However, if we develop other CATs (eg, motor, mobility, depression), integrate them together, and manage the data storage and reporting, the cost benefit of the CATs would be enormous.

Our study had at least 2 limitations. First, our Balance CAT might be applicable to a wide range of patients because our participants were recruited from a variety of patients and settings (eg, both inpatients and outpatients, from very mild to very severe balance impairment). However, our convenience sampling of both samples may limit the generalization of our results. Second, further psychometric investigation of the Balance CAT is needed. Given the floor effect of the BBS, as found in this study, further validation of the Balance CAT using other balance measures (eg, the BESTest) is warranted. Furthermore, the test-retest reliability and responsiveness of the Balance CAT are particularly important in demonstrating whether it is appropriate (ie, a good outcome measure) for monitoring change in balance in patients with stroke. In addition, the minimal important difference,25,26 which represents a change that is meaningful to patients, is critical in decision making in clinical settings. To further improve the utility of the Balance CAT, future research to examine the responsiveness and estimate the minimal important difference for the Balance CAT is warranted.

In brief, our results provide strong evidence that the Balance CAT is efficient, reliable, and valid for patients with stroke. These results suggest that the Balance CAT holds promise for measuring balance function in patients with stroke in clinical settings.


  • All authors provided concept/idea/research design, writing, and consultation (including review of manuscript before submission). I.-P. Hsueh, J.-H. Chen, Dr C.-T. Chen, and Dr Hsieh provided data analysis. I.-P. Hsueh, C.-H. Wang, and Dr Hsieh provided project management and institutional liaisons. Dr Hsieh provided fund procurement. C.-H. Wang and Dr Hsieh provided facilities/equipment. C.-H. Wang provided clerical support.

  • The study protocol was approved by the Institutional Review Board of National Taiwan University Hospital.

  • This study was supported by research grants from the National Science Council (NSC96–2314-B-002–168-MY2) and the National Health Research Institute (NHRI-EX96–9512PI and NHRI-EX97–9512PI).

  • * Muthén & Muthén, 3463 Stoner Ave, Los Angeles, CA 90066.

  • Scientific Software International, 7383 N Lincoln Ave, Suite 100, Lincolnwood, IL 60712-1747.

  • Apple Inc, 1 Infinite Loop, Cupertino, CA 95014.

  • Received December 1, 2009.
  • Accepted April 28, 2010.

