Berg Balance Scale: Intrarater Test-Retest Reliability Among Older People Dependent in Activities of Daily Living and Living in Residential Care Facilities

Mia Conradsson, Lillemor Lundin-Olsson, Nina Lindelöf, Håkan Littbrand, Lisa Malmqvist, Yngve Gustafson, Erik Rosendahl


Background and Purpose: The Berg Balance Scale (BBS) is frequently used to assess balance in older people, but knowledge is lacking about the absolute reliability of BBS scores. The aim of this study was to investigate the absolute and relative intrarater test-retest reliability of data obtained with the BBS when it is used among older people who are dependent in activities of daily living and living in residential care facilities.

Subjects: The participants were 45 older people (36 women and 9 men) who were living in 3 residential care facilities. Their mean age was 82.3 years (SD=6.6, range=68–96), and their mean score on the Mini Mental State Examination was 17.5 (SD=6.3, range=4–30).

Methods: The BBS was assessed twice by the same assessor. The intrarater test-retest reliability assessments were made at approximately the same time of day and with 1 to 3 days in between assessments. Absolute reliability was calculated using an analysis of variance with a 95% confidence level, as suggested by Bland and Altman. Relative reliability was calculated using the intraclass correlation coefficient (ICC).

Results: The mean score was 30.1 points (SD=15.9, range=3–53) for the first BBS test and 30.6 points (SD=15.6, range=4–54) for the retest. The mean absolute difference between the 2 tests was 2.8 points (SD=2.7, range=0–11). The absolute reliability was calculated as 7.7 points, and the ICC was .97.

Discussion and Conclusion: Despite a high ICC value, the absolute reliability showed that a change of 8 BBS points is required to reveal a genuine change in function among older people who are dependent in activities of daily living and living in residential care facilities. This knowledge is important in the clinical setting when evaluating an individual's change in balance function over time in this group of older people.

The maintenance of balance function is essential to stay physically active in life.1 Due to aging processes, diseases, and inactivity, balance often is impaired among older people. The impairment can lead to dramatic consequences such as dependency in activities of daily living (ADL), admission to nursing homes, and falls and fractures.2–4 Training of static and dynamic balance control, therefore, is of great importance in rehabilitation for older people, and there is a need for valid and reliable instruments for evaluating the effects of treatment.5 It is crucial for the clinician to know whether a change in scores on functional tests is due to a real change in functioning or to measurement error.5

Reliability (ie, when repeated measurements of an individual's performance are consistent from one time to another)5,6 has been described as relative or absolute.5 Relative reliability examines the relationship between 2 or more measurements and the consistency of an individual's position within the group. Absolute reliability examines variability in scores in repeated measurements. The intraclass correlation coefficient (ICC) is commonly used to evaluate relative reliability.7 However, the ICC value is of limited use to the clinician because it is not related to the actual scale of measurement and is dependent on the range of the individuals’ performances.7 If the individuals’ range of scores is low, the ICC value often will show poor reliability, and vice versa.7,8 This means that the clinician cannot be sure whether a high ICC value for an instrument actually means low variability at the individual level.

A more appropriate way of investigating the reliability of an instrument intended for use in a clinical setting seems to be to examine absolute reliability.9 When using absolute reliability, the assessor receives information about how much variability caused by measurement error can be expected in scores for an individual.5 In 1996, Bland and Altman9 presented a way of calculating absolute reliability, which they referred to as the “repeatability.” Other methods also are available for estimating absolute reliability; however, it appears that the same calculations are used but the outcome is given different names, thereby creating confusion. The equation used by Bland and Altman also has been referred to as the “smallest detectable difference” (SDD),10 the “minimum detectable change” (MDC),11 and the “smallest real difference” (SRD).12

Reliability can be tested either by having 2 different observers independently assessing the same individual (interrater reliability) or by having the same assessor performing the assessments (intrarater reliability).5,13 When investigating variability within the assessed individual, intrarater reliability seems preferable because the variability from the assessor is minimized.5

The Berg Balance Scale (BBS) was developed to measure balance among older people with impairment in balance function by assessing the performance of functional tasks.13–15 It is a valid instrument used for evaluation of the effectiveness of interventions and for quantitative descriptions of function in clinical practice and research. The BBS has been evaluated in several reliability studies, of which 3 were concerned with intrarater reliability among older people.10,13,14 Two of these studies,13,14 which evaluated relative reliability only (ICC=.97 and .99, respectively), were performed among older people who were living in the community or in a senior citizens’ residence or who were inpatients and among patients with a recent stroke.

To our knowledge, the absolute intrarater reliability of BBS scores has been evaluated in only one study,10 which was performed among people with Parkinson disease (n=26) who were living in the community. The ICC was .87, and the absolute reliability was 2.8 BBS points (95% confidence level). Thus, there is a lack of absolute intrarater reliability studies of the BBS among older people, and absolute reliability has never been evaluated among older people in residential care facilities. This is a group of people who are commonly seen for rehabilitation. They have a high prevalence of diseases and cognitive and physical impairments,16 which may cause a multisystem reduction in reserve capacity.17 This makes these older people sensitive to internal and external disturbances, such as pain or environmental stressors, and may result in marked day-to-day fluctuations in function.17 Cognitive impairment may cause difficulties in processing information and making judgments, and the reliability of assessments, therefore, may be affected due to difficulties in understanding instructions.18,19

Thus, our hypothesis was that, because of the health status of these individuals, the variability between repeated assessments would be large. The purpose of this study was to examine the absolute and the relative intrarater test-retest reliability of BBS scores among older people who are dependent in ADL and living in residential care facilities.



Method

This study was performed in September 2004 at 3 residential care facilities in Umeå, Sweden. It was part of a larger study with the purposes of evaluating reliability and validity for a number of measures of physical function and monitoring change in physical function over a period of 6 months (from March to September 2004) among older people who were dependent in ADL. All facilities comprised private apartments with access to a common dining room, alarms, and on-site nursing and care. In Sweden, a person needs a decision from the municipality to move into this kind of facility, and all residents have cognitive or physical impairments and require supervision, functional support, or nursing care. These facilities seem similar to the “assisted living” facilities in the United States. One facility also comprised units for people with dementia, where they have private rooms with staff on hand.


Inclusion criteria were: aged 65 years or over, dependent on assistance in one or more personal ADL according to the Katz Index,20 able to stand up from a chair with armrests with help from no more than one person, a Mini Mental State Examination (MMSE)21 score of 10 or higher, and approval from the resident's physician.

Screening and Inclusion Process

All participants underwent a screening process carried out by a physical therapist. They were given oral and written information about the study. The participants themselves, or the relatives of those with severe cognitive impairment, gave their informed oral consent. One hundred four people met the inclusion criteria. Forty-five of those individuals completed 2 testing sessions with the BBS in September and were included in this study (Fig. 1).

Figure 1.

Flow chart of inclusion in this study. ADL=activities of daily living, MMSE=Mini Mental State Examination.

Katz ADL scores and proportion of men and women did not differ between the participants (n=45) and those who met the inclusion criteria but were not included in the study (n=59). However, age differed significantly between the 2 groups; the participants were younger than those who were not included in the study (mean age [±SD] of 82.3±6.6 years versus 86.3±7.5 years, respectively; P=.006).


A member of the staff who knew the participant well was interviewed about the participant's ability to manage ADL according to the Barthel Index, which has scores ranging from 0 to 20.22,23 Cognitive function was screened using the MMSE,21 which has scores ranging between 0 and 30. A score of 18 to 23 indicates mild cognitive impairment, and a score of ≤17 indicates severe cognitive impairment.24 The Geriatric Depression Scale (GDS) was used to measure symptoms of depression.25 The scores on the GDS range from 0 to 15, with a score of 5 to 9 indicating mild depression and a score of 10 or higher indicating moderate to severe depression.26 The Functional Ambulation Categories (FAC) were used to measure walking ability with or without a walking aid at 6 levels (0–5).23,27 A score of 3 (verbal supervision or standby help without physical contact) or less was chosen as the indicator of severe physical impairment. The Mini Nutritional Assessment was used to describe nutritional status.28 The scores range from 0 to 30, with scores between 17 and 23.5 indicating risk for malnutrition and scores of less than 17 indicating malnutrition.29 In March 2004, a registered nurse from each facility completed a questionnaire regarding diagnoses and prescribed drugs. This information, together with assessments and measurements, was evaluated by a specialist in geriatric medicine before completion of the final diagnoses. Dementia and depression were diagnosed using the DSM-IV criteria.30

The BBS consists of 14 tasks of varied difficulty, all graded on a 5-point ordinal scale (0–4) in accordance with detailed descriptions.13–15 Lower points are given for people in need of supervision or assistance, or if specific time or distance requirements are not met. In this study, the Swedish version of the BBS was used, and the item “tandem standing” was modified allowing 2 attempts.31 The BBS was administered twice by the same assessor for each participant to assess intrarater test-retest reliability. The first test (Berg A) and the retest (Berg B) were done with 1 to 3 days in between tests and were started at the same time of the day (±1 hour). After Berg A, date and time of day were noted and the test protocol for Berg A was placed in an envelope. Thus, only the date and time of Berg A were provided at the administration of Berg B.

Test Procedure

The participants were interviewed and tested in their own homes or in the corridor outside their rooms. They were informed that they could stop the testing session whenever they wanted and that they were allowed to rest between tasks, if necessary. They also were told to wear stable and comfortable shoes. There were 4 assessors: 2 physical therapists and 2 physical therapist students. All of the assessors were trained before the study. They received the test manual in advance of the training, which consisted of a half-day session during which they could ask questions and a practice assessment was made on a geriatric patient. In addition, the physical therapists had performed the same assessments in 2 previous data collections. For practical reasons, the assessor for each participant was not randomized.

Data Analysis

To investigate the incidence of outliers, a box plot of the distribution of the absolute differences between tests (Berg A and Berg B) was used. An outlier was defined as a participant with a difference of 1.5–3 × interquartile range (IQR) from the upper or lower edge of the box. An extreme outlier was defined as >3 × IQR. Two subjects were considered to be outliers according to the box plot, but because they were not extreme outliers, their scores were included in the analyses. Their absolute differences were 9 and 11 BBS points, respectively. In Berg A, they had 7 and 48 BBS points, respectively, and their MMSE scores were 12 and 13, respectively.
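The box-plot rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_outliers` and the sample data are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges that a statistics package such as SPSS uses for its box plots.

```python
from statistics import quantiles

def classify_outliers(abs_diffs):
    """Label each value 'ok', 'outlier' (more than 1.5 and up to 3 x IQR
    beyond the box) or 'extreme' (more than 3 x IQR beyond the box)."""
    # Quartile method is an assumption; other packages may use other hinges.
    q1, _, q3 = quantiles(abs_diffs, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for d in abs_diffs:
        # Distance beyond the nearer edge of the box (0 if inside the box).
        dist = max(q1 - d, d - q3, 0)
        if dist > 3 * iqr:
            labels.append("extreme")
        elif dist > 1.5 * iqr:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels

# Hypothetical absolute differences in BBS points, not the study data.
example_diffs = [0, 1, 1, 2, 2, 3, 3, 4, 9, 20]
example_labels = classify_outliers(example_diffs)
print(list(zip(example_diffs, example_labels)))
```

Under this rule, values flagged only as "outlier" would be retained, as was done for the 2 participants in this study, whereas "extreme" values would prompt closer inspection.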

The absolute reliability, also referred to as “repeatability,” was calculated using an analysis of variance (ANOVA), according to the procedure described by Bland and Altman.9 In the one-way ANOVA, the square root of the within-people residual mean square is the within-subject standard deviation ($s_w$), which enables the size of the measurement error to be calculated.9 The repeatability is calculated according to the equation: $\sqrt{2} \times 1.96\,s_w$, or $2.77\,s_w$. For 95% of pairs of observations, the measurement error is expected to be less than $2.77\,s_w$. To be certain that a change in score is due to a change in function and not just to measurement error, the difference has to be $\geq 2.77\,s_w$. Likewise, measurement errors were calculated with 90% and 80% confidence levels ($\sqrt{2} \times 1.645\,s_w$ and $\sqrt{2} \times 1.28\,s_w$, respectively). The exception to the use of this method is when heteroscedasticity occurs (ie, when the measurement error is dependent on the size of the measurement). The occurrence of heteroscedasticity was investigated graphically by plotting each individual's absolute difference against his or her mean and by calculating a rank correlation coefficient using the Kendall Tau-B statistic.9
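As a minimal sketch of this calculation, the Python below computes the within-subject standard deviation from repeated scores and scales it to a repeatability value. The function names and the 4 score pairs are hypothetical and are not taken from the study; the analysis in the paper itself was run in SPSS.

```python
import math

def within_subject_sd(scores):
    """Square root of the within-subject mean square from a one-way ANOVA.

    `scores` holds one sequence of repeated measurements per participant,
    each with the same number of measurements.
    """
    n = len(scores)
    k = len(scores[0])
    ss_within = 0.0
    for repeated in scores:
        subject_mean = sum(repeated) / k
        ss_within += sum((x - subject_mean) ** 2 for x in repeated)
    # Within-subject degrees of freedom are n * (k - 1).
    return math.sqrt(ss_within / (n * (k - 1)))

def repeatability(s_w, z=1.96):
    """Repeatability as sqrt(2) * z * s_w; z = 1.96 gives the 95% level."""
    return math.sqrt(2) * z * s_w

# Hypothetical test and retest scores for 4 participants.
example_pairs = [(30, 32), (12, 11), (45, 45), (20, 24)]
example_sw = within_subject_sd(example_pairs)
print(round(example_sw, 2), round(repeatability(example_sw), 2))
```

For exactly 2 measurements per participant, the within-subject mean square reduces to the sum of squared differences divided by twice the number of participants, which is a convenient check on the output.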

The relative reliability was analyzed using the ICC. The ICC (version 3,1) was calculated using a 2-way mixed-effects model for a single measure, and the ICC (version 1,1) was calculated using a 1-way random-effects model for a single measure. All error is assumed to be random measurement error with ICC(1,1). With ICC(3,1), it is assumed that systematic error is not part of the measurement error. When ICC(1,1) equals ICC(3,1), no systematic error is present.32 The ICC ranges from 0 to 1, where 1 indicates perfect agreement and 0 indicates no agreement. Intraclass correlation coefficients of .90 or higher are generally considered high.6
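The two ICC versions can be compared with a short Python sketch based on the widely used mean-square formulas of Shrout and Fleiss. The function name `icc_single` and the sample scores are hypothetical; the sample is built with a constant 2-point shift between occasions to show how a purely systematic error lowers ICC(1,1) while leaving ICC(3,1) untouched.

```python
def icc_single(scores):
    """Return (ICC(1,1), ICC(3,1)) for an n x k table of scores.

    Rows are participants and columns are test occasions.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_occasion = n * sum((m - grand) ** 2 for m in col_means)
    ss_within = ss_total - ss_between      # one-way ANOVA error
    ss_error = ss_within - ss_occasion     # two-way ANOVA residual

    bms = ss_between / (n - 1)
    wms = ss_within / (n * (k - 1))
    ems = ss_error / ((n - 1) * (k - 1))

    icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)
    icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)
    return icc_1_1, icc_3_1

# Hypothetical scores: every retest is exactly 2 points higher.
example_scores = [(10, 12), (20, 22), (30, 32), (40, 42)]
example_icc_1_1, example_icc_3_1 = icc_single(example_scores)
print(round(example_icc_1_1, 3), round(example_icc_3_1, 3))
```

In this constructed example ICC(3,1) is 1 because the shift is removed as an occasion effect, while ICC(1,1) is slightly lower because the same shift is counted as error. Equal values, as reported in this study, are therefore consistent with the absence of systematic error.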

Additional analyses also were performed. The variation in each item of the BBS was analyzed by weighted kappa. Suggested interpretation of the results of the weighted kappa analysis is: .41–.60=fair agreement, .61–.80=good agreement, .81–.92=very good agreement.33 The effect of the participant's cognitive function on the difference in total score between tests was evaluated using linear regression analysis. The dependent variable was absolute difference in BBS score. The independent variables were the MMSE score and the MMSE score dichotomized into participants with severe cognitive impairment (MMSE score of ≤17) and those without severe cognitive impairment, respectively. Likewise, analyses were performed evaluating the effect of the participant's age as well as evaluating differences between assessments performed by physical therapists and those performed by physical therapist students. All additional regression analyses (concerning participants’ cognitive function, age, and physical therapist assessment versus physical therapist student assessment) were adjusted for initial BBS score (Berg A) by adding this as an independent variable. Analyses were performed using SPSS software, version 10.0.* A P value of <.05 was considered to indicate statistical significance.
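For the item-level analysis, a weighted kappa for one BBS item scored 0 to 4 on 2 occasions might be computed as sketched below. The text does not state which weighting scheme was used, so the linear disagreement weights here are an assumption, and the function name and ratings are hypothetical.

```python
def weighted_kappa(first, second, categories=5, power=1):
    """Cohen's weighted kappa for two sets of ordinal ratings.

    Disagreement weights are |i - j| ** power: power=1 is linear,
    power=2 is quadratic. Ratings are integers 0 .. categories - 1.
    """
    n = len(first)
    observed = [[0.0] * categories for _ in range(categories)]
    for a, b in zip(first, second):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(categories)]
    col = [sum(observed[i][j] for i in range(categories))
           for j in range(categories)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(categories):
        for j in range(categories):
            weight = abs(i - j) ** power
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    # Undefined (division by zero) if all ratings fall in a single category.
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical item scores for 6 participants on the 2 test occasions.
item_first = [0, 1, 2, 3, 4, 4]
item_second = [0, 1, 2, 3, 4, 3]
example_kappa = weighted_kappa(item_first, item_second)
print(round(example_kappa, 2))
```

Because the weights grow with the distance between the 2 scores, a 1-point disagreement on an item is penalized less than a 3-point disagreement, which suits an ordinal 0–4 scale.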


Results

A description of the characteristics of the 45 participants is presented in Table 1. Four participants had declined in cognitive function to an MMSE score of <10 since inclusion in March 2004, but their data were still included in the analyses.

Table 1.

Characteristics of the Participants (N=45)

Thirty-six participants (80%) had 1 day in between the 2 BBS testing occasions, 8 participants (18%) had 2 days in between tests, and 1 participant (2%) had 3 days in between tests. The difference in the time of the day when the test started between Berg A and Berg B ranged from 0 to 60 minutes.

The participants’ mean BBS score was 30.1 points (SD=15.9, range=3–53) for Berg A and 30.6 points (SD=15.6, range=4–54) for Berg B. The distribution of the participants’ differences in BBS scores between the 2 test occasions is shown in Figure 2. The absolute differences in the BBS scores ranged from 0 to 11 (Tab. 2). Eight participants (18%) showed no difference between the test occasions, 18 participants (40%) had a difference of 1 BBS point or fewer, and 25 participants (56%) had a difference of 2 points or fewer. The mean absolute difference was 2.8 points (SD=2.7, range=0–11), and the median was 2 points. The absolute differences for the 4 participants with an MMSE of <10 were 1, 3, 4, and 8 BBS points, respectively.

Figure 2.

Results of the Berg Balance Scale (BBS) for each participant (n=45): mean value and difference for the 2 test occasions. Berg A=first BBS test, Berg B=BBS retest.

Table 2.

Absolute Difference in Total Scores of the Berg Balance Scale (BBS) Between the 2 Test Occasions

Absolute Reliability

No heteroscedasticity was found, either graphically or statistically (P=.905). The variation analysis, performed according to the procedure described by Bland and Altman,9 gave a residual mean square of 7.6278. The equation of repeatability gives $2.77 \times \sqrt{7.6278} = 7.7$ for a 95% confidence level. This implies that a change of 7.7 BBS points must occur to reveal a genuine change in function for a participant. Corresponding figures for 90% and 80% confidence levels were 6.4 and 5.0 BBS points, respectively. When performing the analyses without the participants who had declined in cognitive function below initial inclusion criteria (MMSE score of <10), the results for absolute reliability were 7.4 BBS points (95% confidence level), 6.2 BBS points (90% confidence level), and 4.8 BBS points (80% confidence level).
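The reported thresholds can be checked directly from the residual mean square given above. The short Python snippet below is only an arithmetic check; the variable names are illustrative.

```python
import math

residual_mean_square = 7.6278           # reported in the text
s_w = math.sqrt(residual_mean_square)   # within-subject standard deviation

# z values for the 95%, 90%, and 80% confidence levels used in the study.
z_values = {95: 1.96, 90: 1.645, 80: 1.28}
thresholds = {level: round(math.sqrt(2) * z * s_w, 1)
              for level, z in z_values.items()}
print(thresholds)  # {95: 7.7, 90: 6.4, 80: 5.0}
```

The within-subject standard deviation itself is about 2.76 BBS points, so the 95% threshold is a little under 3 times the typical measurement error for a single assessment.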

Relative Reliability

Both the ICC(3,1) and the ICC(1,1) were .97.

Additional Analyses

Twelve of the 14 items showed a good or very good agreement, using weighted kappa (Tab. 3). The variation ranged from .55 (item 14) to .83 (item 5).

Table 3.

Overview of the Berg Balance Scale Items and Absolute Difference in Score for Each Item Between the 2 Test Occasions

Regarding absolute difference in scores between Berg A and Berg B, participants with severe cognitive impairment had a mean (±SD) difference of 3.2±3.0 BBS points compared with 2.6±2.4 BBS points for participants without severe cognitive impairment (P=.498). There was no significant association between the MMSE scores and absolute difference in BBS scores between tests (P=.882). In addition, the absolute difference was not significantly associated with the participant's age (P=.286).

The physical therapists assessed 28 participants, and the physical therapist students assessed 17 participants. The absolute differences in BBS scores between tests were a mean (±SD) of 2.5±2.2 BBS points for the physical therapists and 3.5±3.4 BBS points for the physical therapist students (P=.233).


Discussion

The absolute reliability indicated large variability on an individual level where a change of 8 BBS points must occur before a change in function can be detected, using a 95% confidence level, among older people who are dependent in ADL and living in residential care facilities. Contrary to the absolute value, the relative intrarater reliability for the BBS was high (ICC=.97). The variability between the 2 test occasions was not significantly associated with the participants’ cognitive function, age, or whether a physical therapist or a physical therapist student performed the assessments.

The variability found in this study of absolute intrarater reliability was higher than in a study of patients with Parkinson disease,10 where absolute reliability was 2.8 BBS points. However, in an absolute interrater reliability study by Stevenson11 of patients who had a stroke, the variability was 6.9 BBS points (95% confidence interval). Stevenson also calculated an absolute reliability of 6.2 BBS points (95% confidence interval) in data from patients with a recent stroke from an interrater reliability study by Berg et al.13 Thus, large variability also has been seen in other studies, even though a direct comparison with these 2 studies11,13 is not applicable because they used interrater reliability and the study by Stevenson used the best performance of 3 attempts at each item.

One important reason for the large variability might be difficulties in measuring physical function.23 However, the main reason for the high variability in the present study is most likely the population's fluctuating function, which is indicated by a high prevalence of diseases and cognitive and physical impairments. This is supported by the large variability in an instrument that evaluates basic mobility and balance (the Timed “Up & Go” Test) when evaluated in a similar population.34 In addition, the mean value for the BBS was rather low in the present study, which may reflect that many of the participants performed at their maximum physical capacity in a majority of the items. When assessing maximum physical capacity, one could assume that minor day-to-day fluctuations in function also would lead to considerable variation in repeated measurements.

Cognitive function was not found to be significantly related to the variability in BBS scores in the present study. However, because the great majority of the participants had cognitive impairment, this might have contributed to the large variability found. In earlier studies of reliability of BBS scores,10,11,13 the participants seem to have had a higher level of cognitive function than those in the present study.

The high ICC value in this study supports earlier intrarater reliability studies of the BBS where ICCs ranged from .87 to .99.10,14 However, this finding contradicts the interpretation of the result of the absolute reliability. The limitations with ICC, where the ICC value might be an effect of the range of scores among individuals, also could be demonstrated from earlier reliability studies of the BBS. In the study by Berg et al,13 with a large range among individuals’ scores (mean BBS score [±SD]=37.1±17.2), ICC was calculated as .98 and absolute reliability was 6.2 points11,13 compared with an ICC value of .87 and an absolute reliability value of 2.8 points in a study by Lim et al10 with a lower range of participants’ scores (mean BBS score [±SD]=53.8±2.0). Although absolute reliability showed less variability in the study by Lim and colleagues, the ICC value indicated the opposite. Several authors7,8,11,12,35 have discussed the limited value of ICCs when interpreting results from reliability studies, and, according to Finch,36 relative reliability is used to provide information about whether a scale can discriminate among individual performances. This means that high ICC values show that the individuals tested get different scores along the scale, indicating that the scale can differentiate among individual functions.
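One way to see how a high ICC and a large measurement error can coexist is the commonly cited approximation that the within-subject standard deviation is about SD × √(1 − ICC). This relation was not used in the present study and is introduced here only as an assumption for illustration; the function name is hypothetical, and the inputs are the SDs and ICCs quoted in the text.

```python
import math

def repeatability_from_icc(sd, icc, z=1.96):
    """Approximate repeatability from a between-person SD and an ICC.

    Assumes s_w is roughly sd * sqrt(1 - icc); this is an approximation,
    not the ANOVA-based calculation used in the study.
    """
    s_w = sd * math.sqrt(1.0 - icc)
    return math.sqrt(2) * z * s_w

present_study = repeatability_from_icc(15.9, 0.97)  # SD of Berg A, ICC=.97
wide_range = repeatability_from_icc(17.2, 0.98)     # Berg et al values
narrow_range = repeatability_from_icc(2.0, 0.87)    # Lim et al values
print(round(present_study, 1), round(wide_range, 1), round(narrow_range, 1))
```

The approximation does not reproduce the published values of 6.2 and 2.8 points exactly, but it shows the same pattern: a sample with a wide spread of scores can have a very high ICC and still a much larger measurement error than a homogeneous sample with a lower ICC.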

One limitation in this study is that the sample size did not reach the recommended 50 people for using Bland and Altman's statistical method.8 However, the 45 people included in the present study approached the recommended number. Even though the interval may have become slightly smaller with a larger sample, the results still indicate wide variability in balance performance for these individuals. The main result in this study was calculated with a 95% confidence level, but one could question whether this confidence level is always necessary in the clinical setting when interpreting results from an assessment such as the BBS. Stevenson11 also has presented lower levels of confidence, which seems useful as an alternative when clinically evaluating an individual's function.

Two people were excluded before the second test because of acute illness. However, a more thorough assessment prior to the test occasions (eg, assessment of whether the participant had slept badly or had increased pain) might have provided additional information that may explain some of the variability between the test occasions. The exclusion at baseline of those people who were independent in ADL and those with very low physical or cognitive function limits the external validity. Despite this limitation, the participants in this study had a wide range of both cognitive and physical performance, making them representative of a large part of the population living in residential care facilities. Inclusion was done 6 months before the start of the study; consequently, 4 participants had declined in cognitive function to a level below the inclusion criterion. However, exclusion of those 4 participants in the analyses did not notably affect the results.

The methodological strengths of the present study are that the testing environment and time of the day were standardized and that adherence was high. It was considered necessary to have at least one day of rest between tests to avoid having a lower result on the retest because of fatigue. A larger number of days between tests might have increased the risk of other incidents occurring that could affect the health of these older people.

Despite large variability at an individual level, the mean for the group was rather consistent between the 2 test occasions, indicating that the BBS is suitable for evaluating groups over time. In addition, the BBS is a functional way of measuring an individual's balance and can provide valuable information for clinicians designing individual exercise programs. It also is easy to administer because it does not require much time or equipment. However, it seems important for clinicians to be careful when using single assessments of the BBS to draw conclusions concerning a change in balance function in the studied population. It has been indicated that the use of the mean of repeated measurements increases the reliability for tests of walking ability.37 It would be interesting to evaluate this with reference to the BBS.


Conclusion

Despite a high ICC value, the result of the absolute reliability indicates that a change of 8 BBS points is required to reveal a genuine change in function between 2 assessments using a 95% confidence level among older people who are dependent in ADL and living in residential care facilities. This knowledge is important in the clinical setting when evaluating an individual's change in balance function over time in this group of older people.


  • All authors provided concept/idea/research design and consultation (including review of manuscript before submission). Ms Conradsson provided writing. Ms Conradsson, Ms Lindelöf, Ms Malmqvist, and Dr Gustafson provided data collection. Ms Conradsson, Dr Lundin-Olsson, Ms Lindelöf, Mr Littbrand, Dr Gustafson, and Dr Rosendahl provided data analysis. Dr Gustafson provided fund procurement, subjects, facilities/equipment, institutional liaisons, and clerical support. The authors acknowledge and thank Michael Stenvall and Halvar Sivertsson for contributing to the data collection and Hans Stenlund for excellent guidance in statistics. They also thank the Social Authorities of the municipality of Umeå, the participants, and the staff at the facilities for their cooperation and participation.

  • The study was approved by the Ethics Committee of the Medical Faculty of Umeå University.

  • Parts of the study were given as abstract presentations at the 18th Nordic Congress of Gerontology; May 28–31, 2006; Jyväskylä, Finland and at a European Union workshop; September 25–26, 2006; Oslo, Norway.

  • This work was supported by grants from the Swedish Research Council K2005–27VX-15357–01A, Erik and Anne-Marie Detlof's Foundation, and the European Union (Interreg Kvarken-MittSkandia).

  • * SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.

  • Received November 10, 2006.
  • Accepted April 20, 2007.

