In 1986, Congress passed Public Law 99–457, which provided incentives for states to develop early childhood intervention programs for qualified infants and toddlers from birth through 2 years of age and their families. The law later became known as the Individuals With Disabilities Education Act (IDEA),1 and, with the 1997 amendments, the early intervention program became Part C of the Act.

Part C of IDEA defines the terms “evaluation” and “assessment” as they relate to early intervention programs. “Evaluation” means procedures used to determine a child's initial and continuing eligibility for services. Eligibility criteria are defined by each state, but they typically include documentation of delay in one or more areas of development listed in the federal law, including cognitive, adaptive (self-help), physical (eg, gross and fine motor), communication, and social-emotional development. “Assessment” is defined as ongoing procedures to identify a child's strengths and needs, and the services required to meet those needs. This is the process of program planning.

Evaluation requires what Kirshner and Guyatt referred to as a “discriminative index,” which distinguishes “between individuals or groups on an underlying dimension when no external criterion or gold standard is available for validating these measures.”2 Measures useful for determining a child's eligibility for services are those that can be used to differentiate between children who have developmental delays and children who do not have such delays.3 Discriminative measures include test items that distinguish between children of varying ages, such as stacking blocks or jumping down from a step. Although such information can be useful for determining whether a child has a developmental delay, knowledge that a child can or cannot perform such test items often is not useful for program planning purposes. Program planning often requires identification of goals specific to an individual, particularly when intervention is unlikely to enable a child to “catch up” with typically developing peers.3

Another purpose of measurement is what Kirshner and Guyatt referred to as “evaluation.” They defined evaluation as a measure of “the magnitude of longitudinal change in an individual or group on the dimension of interest.”2 Although progress toward individual goals is usually the most appropriate measure of progress for individual children,3 early intervention programs must implement program-based outcome studies that look at longitudinal changes in children and families receiving services.4

One tool that has been used for both determining children's eligibility for services and measuring change longitudinally for program-based studies is the Battelle Developmental Inventory (BDI).5 The BDI was developed in 1984 and is both norm-referenced and criterion-referenced. It is a comprehensive test of development that evaluates the 5 domains of development listed in Part C of IDEA: cognitive, adaptive (self-help), motor, communication, and personal-social development. Each of the domains is further divided into subdomains, which can be scored separately.

In addition to covering the 5 areas of development listed in the law, an advantage of the BDI is that it covers an age range from birth to 8 years, which is a wider range than that of many other tests that can be used with infants. The wide age range facilitates longitudinal comparisons of the same measure over a longer period of time than is possible with most other tests.


Examiners can administer the items for each domain separately, or they can test all 5 domains of development. This aspect of the BDI is useful if different team members evaluate different domains of development. A physical therapist, for example, might administer items in the motor domain, an occupational therapist might administer the adaptive items, a speech-language pathologist might administer the communication items, and a teacher might administer the personal-social and cognitive items.

The BDI has 3 administration formats: structured administration, observation, and interviews with parents or other sources. The authors provide detailed instructions for the structured administration procedure, making it, in our opinion, the most clear-cut format to administer and score, followed by observation and then the interview approach.6 Another feature of the structured administration format that sets the BDI apart from other tools is that it allows examiners to make specific adaptations for children with disabilities when needed. It also allows some deviation from the exact words if the child does not understand the instructions.

The availability of 3 test formats increases the likelihood that children receive the highest possible score for all skills they can perform. If a child does not perform well or refuses to perform activities during the structured administration format, the examiner may ask the child's parents or teachers whether the child can perform certain tasks. If the primary interest is in identifying all that the child is able to do, not just what the child is willing to do in a testing situation, then the 3 test formats are a benefit.7 From the standpoint of standardization, however, the 3 formats are problematic. Depending on the extent to which the examiner uses one format or another, the results could differ. Data obtained through parent report, for example, are not always consistent with results from standardized administration.8

The entire BDI takes approximately 1 1/2 hours to administer, which is similar to other comprehensive developmental assessment tools. The scoring process is complicated, particularly establishing and scoring basal levels.9 Bailey et al9 found that teachers took longer than 30 minutes to score the test and made many errors. Simple math errors were most common (45% of teachers), followed by errors in establishing a basal level (43%) and scoring items below the basal level (29%). Only 14.5% of the teachers made no errors. Based on the results of their study, the authors recommended that people who administer the BDI receive training in administration and scoring of the test.

The BDI is administered by first finding a basal level, which is the age level at which the child gets full credit for all items in a subdomain. Subdomains are specific skill areas that make up a domain, such as the locomotor subdomain of the motor domain. The ceiling is the level of item difficulty at which a child would get a score of 0. The items are scored on a 3-point system. A score of 2 indicates the child's response meets the specified criteria. A score of 1 means the child attempted the item but did not meet all criteria. A score of 0 is given when the response is incorrect or there is no response or opportunity to respond. This system of scoring allows examiners to determine whether children display emerging skills on which they can build. The examiner's manual includes chapters on scoring and interpretation that show how to apply BDI scores.5 The examiner can calculate developmental quotients, which can then be expressed as age equivalents (in months). It is possible to profile domain and subdomain scores and compare strengths and weaknesses in various areas. These profiles can be used to help determine whether a child's deficit is due to weaknesses in all areas of development or in one specific area such as fine motor skills.
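The basal-credit and 3-point scoring rules described above can be sketched in code. This is an illustrative sketch only; the function name, response labels, and example counts are hypothetical and are not taken from the BDI manual.

```python
# Hypothetical sketch of the BDI basal-credit and 3-point scoring rules.
# The function name, response labels, and counts are illustrative only;
# they are not taken from the BDI manual.

POINTS = {"meets": 2, "attempted": 1, "none": 0}

def raw_subdomain_score(responses, items_below_basal=0):
    """Sum item points; every item below the basal level earns full credit (2)."""
    return 2 * items_below_basal + sum(POINTS[r] for r in responses)

# A child with 4 items below basal who meets criteria on 3 items,
# attempts 2, and fails 1:
score = raw_subdomain_score(
    ["meets", "meets", "meets", "attempted", "attempted", "none"],
    items_below_basal=4,
)  # -> 16
```

The middle score of 1 is what lets the raw score register emerging skills that an all-or-nothing scoring rule would miss.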


The standardization process for the BDI consisted of testing a norming sample of 800 children, with approximately 100 children (50 male and 50 female) at each 1-year age level from birth to 8 years.5 Geographically, 75% of the children lived in urban areas and 25% lived in rural areas. The sample was 84% white and 16% minorities, primarily African-American and Hispanic-American children. There was no difference in scores when gender or race was considered in this sample.

The developers did not control for socioeconomic status (SES), but the manual states that test sites were 75% urban and 25% rural and represented a wide range of socioeconomic statuses.5 No data were provided on the occupations, income, or educational levels of the parents.

The size of the sample at each age level also was potentially limited. The sample consisted of 50 children each in the 0- to 5-, 6- to 11-, 12- to 17-, and 18- to 23-month-old age ranges and 100 children in the 24- to 35-month-old range, for a total of 300 children in the 0- to 35-month-old range. Particularly at these lower age ranges, because young children's development can be so rapid, the wide age spans can cause age-related discontinuities.10 If, for example, a child's age is just short of a cutoff level, the child would appear to be functioning at a higher level than if the same child were tested a few days later and performed in the same way. Examiners should be cautious when testing young children who are close to the age cutoff levels to avoid inappropriate eligibility and intervention decisions.10

Another problem with BDI scoring is that procedures recommended to calculate extreme scores of children who have severe and profound disabilities do not appear to be adequate. Bailey et al9 noted that the tables for calculating deviation quotients (DQs) do not provide DQs less than 65. If children's scores are lower than 65, the test manual gives a method to extrapolate a DQ. Using this method, however, some children can receive a negative DQ. A child 22 months of age, for example, who received a raw score of 20 for the motor domain would have a DQ of −45, if calculated using the formula in the manual. The test manual does not explain why negative scores occur or how they should be interpreted or reported.11 Researchers could remedy this situation by obtaining norms at the lower levels to avoid the need to extrapolate scores.
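To see how extrapolation can drive a quotient below zero, consider the common deviation-quotient form DQ = 100 + 15z. The norm mean and standard deviation below are invented for illustration; they are not the BDI's actual norms, and the manual's exact extrapolation formula may differ.

```python
# Illustration of how linear extrapolation of a deviation quotient (DQ)
# can go negative. The mean and SD below are invented for illustration;
# they are NOT the BDI norms, and the manual's exact extrapolation
# formula may differ.

def deviation_quotient(raw, mean, sd):
    """Common DQ form: 100 + 15 * z, where z = (raw - mean) / sd."""
    return 100 + 15 * (raw - mean) / sd

# With a hypothetical norm mean of 120 and SD of 10, a raw score of 20
# extrapolates far below the tabled floor of 65:
dq = deviation_quotient(20, mean=120, sd=10)  # -> -50.0
```

Because the formula is linear in the raw score, nothing in it prevents a sufficiently low raw score from producing a value below zero, which is the anomaly Bailey et al9 described.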

Reliability
The BDI test manual reports the standard error of measurement (SEM), test-retest reliability, and interrater reliability.

Standard Error of Measurement

The SEM refers to the “band of confidence around a child's test score.”6 The SEM for all domains listed in the test manual was between 2.12 (6–11 months) and 9.05 (48–59 months).5 The smaller the SEM, the more sure an examiner can be that the score a child obtains on the test is close to the child's true score. The first BDI manual, published in 1984, included an error in the directions for calculating the SEM. Instead of the SEM, the manual provided directions for calculating the standard error of the mean. The authors corrected this error in the 1988 printing of the BDI manual, so it is important to know which printing of the test is being used.6,8
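The relation between the SEM, the standard deviation of scores, and reliability can be illustrated with the standard psychometric formula SEM = SD√(1 − r). The SD and reliability values below are illustrative only, not values taken from the BDI manual.

```python
import math

# Standard psychometric relationship: SEM = SD * sqrt(1 - r), where r is
# the test's reliability coefficient. The SD and reliability values used
# below are illustrative only, not values from the BDI manual.

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

def confidence_band(score, sem_value, z=1.96):
    """Approximate 95% band of confidence around an observed score."""
    return (score - z * sem_value, score + z * sem_value)

e = sem(sd=15, reliability=0.96)      # -> 3.0
low, high = confidence_band(100, e)   # roughly (94.1, 105.9)
```

The formula makes the trade-off explicit: the higher the reliability, the smaller the SEM, and the tighter the band of confidence around the child's observed score.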

Test-Retest Reliability

The developers of the BDI determined test-retest reliability by retesting 183 children of the 800 children in the sample within 4 weeks of the initial test. Test-retest reliability of BDI total scores was between .90 (at 72–83 months) and .99 (at 6–11 months, 12–17 months, 24–35 months, and 36–47 months).5 For all reliability and validity data, the BDI manual states only that correlation coefficients were calculated. The authors did not identify the statistic(s) used, which makes interpretation of the data somewhat difficult.

Interrater Reliability

Newborg et al5 examined interrater reliability by having a second rater score the tests of 148 children. They found interrater reliability to be high, ranging from .90 (at 72–83 months) to .99 (at 18–23 months and 36–47 months). Subsequent studies of interrater reliability correspond well to these results. McLean and colleagues11 found interrater reliability to be .928 when testing children younger than 30 months with sensory impairments, developmental delay, physical disabilities, and multiple other disabilities. Snyder et al12 tested 78 children with severe disabilities and, using generalizability theory, estimated internal consistency reliability to be .85 or higher.

Validity
Validity refers to the degree to which a meaningful interpretation can be inferred from a test.13 Researchers have conducted studies of the content validity,5,7 construct validity,5,12 concurrent validity,5,14–20 and predictive validity17,21,22 of the BDI.

Content Validity

Content validity refers to how adequately a test covers “all parts of the universe of content and reflects the relative importance of each part.”23 The BDI test manual does not include data supporting the selection of the items included in the test and the manner in which the items were grouped into domains. Newborg et al5 explained that they selected the 341 items from a pool of 4,000 items from other developmental tests. When selecting items, Newborg et al5 stated that they considered the importance of the items in the functioning of the child's everyday life, support for the items in the literature, the educational practitioner's acceptance of the skill as a milestone in a child's development, and whether therapists and educators could intervene on the item. In 1980, a pilot study of 500 children was conducted to refine the BDI items.5 When 75% of the children passed an item, the authors assigned that item to that particular developmental age level. The authors supported the content validity of the developmental nature of the BDI with t-test comparisons between age groups on parts of the BDI.6,8,24 Keyser and Sweetland7 stated that these age comparisons support the argument that the BDI is developmental in nature, thus supporting the content validity of the BDI.

Construct Validity

Construct validity “reflects the ability of an instrument to measure an abstract concept, or construct.”23 The developers of the BDI based predictions for construct validity on “general developmental theory.”5 They hypothesized that a child who performs well in one domain should perform well in all domains. Correlations were high and positive for total BDI scores against 30 subdomain categories, providing support for the belief that a child's performance should be consistent across domains. The correlations were between .56 for the total BDI score and the muscle control subdomain and .99 for the total BDI score and the motor domain and receptive subdomain. The lower correlation between the total BDI scores and the muscle control subdomain was attributed to an item ceiling, in which all children received the highest possible score for all items at 18 months of age, which restricted the size of the correlation.

Factorial validity was described by Newborg et al5 as a type of construct validity. They measured factorial validity through a factor analysis of the pilot study data and found that the factor structure differs depending on the age of the child. Intercorrelations among domains showed that 5 BDI domains are more accurate for children over the age of 2 years. For children under 24 months of age, it appears there are 3 general factors (which the manual does not specify), so it is important to administer the entire BDI to children under 2 years of age. For children above 24 months of age, the factor analysis supports the structure of the 5 domains, although the communication and cognitive domains overlap.8,44

Snyder et al12 tested the construct validity of the BDI. The subject sample consisted of 78 children with disabilities tested over a 5-year period. The results of the study suggested that examiners should be cautious about obtaining and reporting isolated scores in the social-emotional, cognitive, and communication domains because they appear not to reflect unique developmental domains.

Concurrent Validity

Concurrent validity is “studied when the measurement to be validated and the criterion measure are taken at relatively the same time (concurrently), so they both reflect the same incident of behavior.”23 Although the developers of the BDI did not describe their studies well, the initial correlations between the 10 major BDI components and the Vineland Social Maturity Scale25 and the Developmental Activities Screening Inventory26 were high (.78–.93). The correlations between the Stanford-Binet Intelligence Test27 and the BDI were from .40 to .61, but the developers of the BDI did not intend the BDI to be used as a measure of intelligence. The correlations between the Peabody Picture Vocabulary Test-Revised (PPVT-R)28 and the BDI were relatively high, from .36 (adaptive domains) to .83 (total scores).5

Since the development of the BDI in 1984, several studies have provided support for the concurrent validity of the BDI scores with the Vineland Adaptive Scales29 and the Bayley Scales of Infant Development.30 Johnson and colleagues14 studied 67 children with motor delays and a mean age of 18.6 months. Pearson product-moment correlation coefficients for the Vineland Adaptive Scales, the Bayley Scales, and the BDI ranged from .48 to .81 and were reported to be statistically significant, although the error associated with prediction from the lower correlations could be quite high. These correlations were not as consistently high as those found by McLean et al,11 whose sample consisted of children with disabilities under the age of 30 months. Pearson product-moment correlations between the BDI and the Bayley Scales ranged from .75 to .92, and correlations between the BDI and the Vineland Adaptive Scales ranged from .73 to .95. Boyd et al15 found Pearson product-moment correlations of .27 to .95 between the Bayley Scale and BDI scores of children younger than 30 months. The results were supported by another study of 70 children with disabilities under 30 months of age in which correlations between the Bayley Scale and BDI scores, using canonical analyses, ranged from .65 to .95.31 These studies of concurrent validity support the use of the BDI as an appropriate assessment tool for infants with known or suspected disabilities.

Although most studies support the concurrent validity of the BDI with the Bayley Scales, Gerken et al16 found a Pearson product-moment correlation coefficient of −.03 between the BDI and the Bayley Mental Scale in a sample of 34 midwestern children aged 3 to 30 months who lived with their mothers, who were 15 to 21 years of age. The low correlation between the 2 tests suggests that they measure different elements of development.

Other researchers have related BDI scores with scores of other tests. Guidubaldi and Perry17 evaluated 124 children in kindergarten with the BDI, Draw-A-Person Test,32 PPVT-R,28 Kohn Social Competence Scale,33 Sells and Roff Scale of Peer Relations,34 Bender Visual-Motor Gestalt Test,35 Metropolitan Readiness Test,36 Stanford-Binet Intelligence Scale,27 Vineland Social Maturity Scale,25 and Wide Range Achievement Test (WRAT)37 to measure each of the 5 domains. The results generally indicated significant relationships between the tests and the individual domains of the BDI, although some of the correlations were not high. They found the strongest and most consistent relationships between the cognitive, personal-social, and communication domains of the BDI and the other tests.

Smith18 tested 30 typically developing preschool children aged 3 years 11 months to 6 years 2 months using the BDI cognitive domain, the Stanford-Binet Intelligence Scale,27 and the Kaufman Assessment Battery.38 Pearson product-moment correlations among the 3 tests ranged from .41 (Stanford-Binet Intelligence Scale composite score versus BDI cognitive domain score) to .63 (Stanford-Binet Intelligence Scale composite score versus Kaufman Assessment Battery mental processing composite score). The BDI cognitive domain total score was lower than the Stanford-Binet Intelligence Scale and Kaufman Assessment Battery composite scores, but Smith concluded that the BDI measures similar constructs. Lidz19 compared the BDI cognitive domain scores and Stanford-Binet Intelligence Scale scores of 32 African-American children aged 3 to 5 years. Lidz found that more than one half of the children had a lower DQ on the BDI cognitive domain than on the Stanford-Binet Intelligence Scale.

Mott20 investigated the concurrent validity of the BDI communication domain using the PPVT-R,28 the Preschool Language Scale-Revised (PLS-R),39 and the Arizona Articulation Proficiency Scale-Revised (AAPS-R), which are tools that test children's speech and language. Mott did not specify the statistic used, but found correlations supporting the concurrent validity of the total communication domain and the expressive language subdomain of the BDI. The correlation between the PPVT-R score and the BDI total communication domain and expressive communication subdomain scores was .60. The correlation between the PPVT-R score and the BDI total score was .66. The correlation between the PLS-R score and the BDI total communication domain score was .81, and the correlation between the AAPS-R score and the BDI expressive communication subdomain score was .68. These correlations for the total communication score and the expressive communication subdomain score support using the BDI for testing children who have general speech and language problems. The BDI receptive communication subdomain did not correlate with any of the language measures, so it is important to use the entire communication domain when testing children who have suspected speech problems.

Tests that are discriminative “emphasize the ability to distinguish between individuals or groups. These tools can lead to the identification of children who are not functioning within age-appropriate or performance-based expectations.”40 A review of the studies that investigated concurrent validity suggests that the BDI is an appropriate discriminative tool for the following groups of children: children who are typically developing from birth to 8 years of age,5,18 children with motor delays with a mean age of 18 months,14 children with disabilities under the age of 30 months,11,15 and children with suspected language disorders aged 35 to 60 months.20 Examiners should be cautious when evaluating and interpreting the BDI results for African-American children aged 3 to 5 years and children who have severe and profound disabilities.9,14 Gerken et al16 cautioned examiners of all children aged 3 to 30 months who are at risk for developmental delay, but other research11,14,15,31 supports using the BDI with this group of children.

Predictive Validity

Predictive validity refers to the ability of a measure to be used to predict some future event. If the BDI has good predictive validity, then it can provide a basis for decisions by predicting outcomes and future behaviors.23 Three studies have addressed the predictive validity of the BDI. Guidubaldi and Perry17 investigated the predictive validity of the BDI with 124 kindergarten children whom they retested in the first grade using the WRAT.37 They did not specify the statistic used, but found correlations between .30 and .62, thus moderately supporting the predictive validity of kindergarten BDI scores for performance in first grade, as measured by the WRAT. Merrell and Mauk21 used the BDI and the Social Skills Rating System (SSRS)41 to determine the predictive validity of the BDI as a measure of future social-behavioral development and developmental outcomes. The children entered the study at ages 2 to 5 years and were retested at a 2- to 3-year interval. The authors found relationships between the BDI and SSRS to be very weak (between negative coefficients and .25). They concluded that interpretations and decisions based on BDI results are limited in the area of future social-behavioral development. The authors did not state the statistic used to calculate the correlation coefficients.

Behl and Akers22 investigated the ability of the BDI to predict later achievement with a sample of children with or at risk for developmental delays using the BDI and the Woodcock-Johnson Test of Achievement-Revised (WJR-ACH).42 The authors found low correlations between later achievement and BDI scores at age 1 year. When children were tested at age 3 years and older, correlations were substantially higher; for example, Pearson product-moment correlations between BDI-computed DQ total scores at ages 3, 4, 5, and 6 years and corresponding WJR-ACH Broad Knowledge scores at ages 9, 10, 11, and 12 years were .67, .72, .75, and .82, respectively.22 The results of this study suggest that BDI scores of children aged 3 years and older consistently predict Woodcock-Johnson Test of Achievement-Revised scores at 6 to 12 years of age. The BDI does not demonstrate good predictive validity for children younger than 18 months of age.

Battelle Developmental Inventory Screening Test

The Battelle Developmental Inventory Screening Test (BDIST)5,24,43 is available for more rapid administration than the BDI, but it has drawbacks. The BDIST consists of 96 items that were taken from the parent BDI and can be administered in 10 to 30 minutes, depending on the age of the child. The test developers list the purposes of the BDIST as general screening of preschool and kindergarten children, monitoring children's progress, identifying strengths and weaknesses of children to determine which children would benefit from a comprehensive assessment, and making placement and eligibility decisions.

Kramer and Conoley43 identified several problems with the BDIST. Instead of norming the BDIST, the authors used the norms and reliability and validity data derived from the BDI, so it is not possible to know whether the BDIST yields valid and reliable data. Standards for the sensitivity and specificity of screening tests indicate that a sensitivity of approximately 80% and a specificity of approximately 90% are preferable to correctly identify children with disabilities (sensitivity) and children who do not have disabilities (specificity).44–46 Glascoe and Byrne47 studied 89 children aged 7 to 70 months to determine the sensitivity and specificity of the BDIST. They found that the BDIST correctly identified 72% of the children with disabilities, indicating that it has good sensitivity. They found specificity to be 76%, which indicates the BDIST overidentifies children who do not have disabilities. They also found the BDIST time-consuming to administer, but they did not report mean times of administration. McLean and colleagues48 studied 65 children aged 7 to 72 months. They found that the BDIST accurately identified only 13 of the 35 subjects without disabilities, with 22 children referred for further testing. This finding demonstrates a specificity lower than 50%. Sensitivity was much higher: of the 30 children with disabilities, 29 children were referred for further testing. Little research has been conducted with the BDIST, but the work that has been done suggests that its use results in over-referral of children for further testing. Another problem is that the reliability and validity of this 96-item version of the BDI cannot be assumed.
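The sensitivity and specificity figures implied by the McLean et al48 counts reported above can be reproduced directly from their definitions:

```python
# Sensitivity and specificity computed from the McLean et al counts
# reported above: 29 of 30 children with disabilities were referred
# for further testing (sensitivity), and 13 of 35 children without
# disabilities were correctly identified (specificity).

def sensitivity(true_positives, total_with_condition):
    return true_positives / total_with_condition

def specificity(true_negatives, total_without_condition):
    return true_negatives / total_without_condition

sens = sensitivity(29, 30)  # ~0.97, above the ~80% standard
spec = specificity(13, 35)  # ~0.37, far below the ~90% standard
```

The arithmetic makes the trade-off concrete: the BDIST rarely misses a child with a disability, but it refers well over half of the children without disabilities for unnecessary further testing.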


Studies generally support the reliability of data obtained with the BDI and its content, construct, and concurrent validity. Its predictive validity is better for older children than for younger children. The strengths of the BDI and the comprehensiveness of the domains it measures are reasons that it has been used by many researchers as a tool for longitudinal studies, determining developmental trajectories and outcomes, and classifying children. In early intervention programs, the BDI appears to be a good discriminative measure to determine children's eligibility for services based on degree of developmental delay, and it appears to have promise as a tool to measure change over time for population-based longitudinal studies. The relatively long administration time, however, could be a drawback for repeated measurements across a population of children. If administration time is a problem, the BDIST might be of value, but not until research has demonstrated that it yields reliable and valid data.


  • Concept, research design, and writing were provided by Berls and McEwen.

  • This work was supported, in part, by a personnel preparation grant (#H029G60186) from the US Department of Education, Office of Special Education and Rehabilitative Services. The article does not, however, necessarily reflect the policy of that office, and official endorsement should not be inferred.

