BHCOE ABA Outcomes White Paper

www.bhcoe.org • 1
TABLE OF CONTENTS
Part 1: Overview
Section 1: Executive Summary
Section 2: About Autism Spectrum Disorder
Section 3: About Applied Behavior Analysis
Section 4: Societal and Economic Considerations
Section 5: ABA-based Treatment for ASD
Section 6: Measuring Outcomes of ABA-based Treatments in Research
Section 7: The Need for a Unified Approach to Selecting Assessment Instruments in ABA
Section 8: How We Selected the Assessment Instruments in This Guide
Part 2: Practical Applications
Section 1: Considerations
Section 2: Selecting Measurement Instruments: A Decision Model
Section 3: Additional Considerations for Selecting Measurement Instruments
Part 3: Considerations for Storing Assessment Data
Section 1: Keeping Your Data Sets Tidy
Section 2: Some Considerations Regarding Tidy Data
Section 3: Working with Related Datasets
Section 4: Examples of Basic Analytics with Assessment Data
Summary
References
Appendices
Acknowledgments
Part 1: Overview
Section 1: Executive Summary
This document is meant for applied behavior analysis (ABA) practitioners, researchers,
insurance providers, and other relevant stakeholders. The purpose of this document is
twofold: first, to provide a systematic approach for selecting instruments to assess and
plan treatment for individuals with autism spectrum disorder (ASD); and second, to
inform data collection and reporting on treatment outcomes.
Documenting treatment outcomes in health care professions has become increasingly
important to both patients and stakeholders (e.g., third-party payors). Specifically,
practitioners are increasingly asked to demonstrate that their treatments work and to
be more accountable for the costs of treatment. Practitioners and stakeholders, such
as third-party payors, agree that accountability is important, but the challenge lies in
agreeing on how to achieve it.
The guidelines we have provided in this document are based on the best available
research evidence and subject matter expertise regarding instrument¹ selection for the
assessment and planning of treatment for individuals diagnosed with ASD. We intend
for these guidelines to be practical and digestible for diverse audiences ranging from
researchers and practitioners to insurance providers. Because all treatment is
individualized, these guidelines are intended to inform and are not meant as a
substitute for the expertise of the practitioner who has observed a patient directly.
Where stakeholder opinions diverge, significant weight should be given to the
recommendations of the qualified practitioner who has observed the patient in person.
¹ Throughout this document, we have used the terms assessment instruments, assessment tools, and
measurement instruments interchangeably.
Section 2: About Autism Spectrum Disorder
ASD is a lifelong, pervasive developmental disability characterized by deficits in social
and communication skills, as well as restricted and repetitive behaviors (American
Psychiatric Association, 2013). There is no genetic test or single known cause for ASD.
Furthermore, the autism spectrum describes a broad range of behavioral patterns and
levels of functioning. This variability in skill deficits and behavior excesses continues to
challenge efforts to standardize treatment of the condition. Today, ASD is diagnosed by
experienced licensed professionals through standardized behavior observation tools in
conjunction with parents’ reports about their child’s behavior and development. Over
the last few decades, the number of children diagnosed has steadily increased as
awareness and acceptance of ASD have increased along with better screening and
diagnostic tools (Autism Speaks, 2021). According to data from the Centers for Disease
Control and Prevention (CDC), one in 54 children had a diagnosis of ASD by age 8 in
2016 (CDC, 2020).
Section 3: About Applied Behavior Analysis

Applied behavior analysis (ABA) is a term used to describe the application of
known behavioral principles and processes to help people improve their quality
of life (Cooper et al., 2020). More specifically, researchers working in laboratory,
clinic, and other workplace settings have studied how changing the environment
around someone will reliably lead to changes in behavior. ABA practitioners work
with their patients to identify ways that behavior change and learning improve the
quality of life for them and their families. Then, leveraging the evidence published
by researchers, they work with their patients to design and implement
environmental changes toward the behavior change their patients desire.
ABA practitioners currently work across many different areas of society such
as gerontology, education, health and fitness, substance abuse and misuse,
workplace safety, and—most commonly—ASD and related developmental
disabilities (Behavior Analyst Certification Board [BACB], 2021).
Section 4: Societal and Economic Considerations
In addition to the significant emotional and social impact of ASD on individuals and
their loved ones, the financial cost of supporting an individual with ASD is high.
Researchers have found the cost of raising an individual with ASD throughout their
lifespan averages between $1.4 million and $2.4 million in the U.S., depending on
whether the person also has an intellectual disability (Chasson et al., 2007; Buescher
et al., 2014). In contrast, the cost of raising a child without any disabilities in the
U.S. in 2015 was estimated at $233,610 (USDA, 2017). In addition, the medical
expenditures for children and adolescents with autism are on average between
4.1 and 6.2 times greater than for those without autism (Shimabukuro et al., 2007).
The reported cost of ABA therapy services for individuals with ASD ranges
between $40,000 and $60,000 per year (Rogge & Janssen, 2019). Globally, this
translates to an overall cost for providing care and services to children with ASD
estimated at between $61 and $66 billion a year (Autism Speaks, 2021). For adults
with ASD, the overall costs are estimated at $175 to $196 billion a year for
accommodation, direct medical costs, and individual productivity loss.
Section 5: ABA-Based Treatment for ASD

Currently, there is no cure for ASD and there is limited empirical support for
pharmaceutical or medical treatments as the primary intervention for ASD
(Williamson et al., 2017). The increasing prevalence and pervasive nature of the
condition, coupled with an unknown cause, may make this population or their
caregivers more vulnerable to ineffective interventions or fads that come and go.
However, interventions based on ABA are consistently shown to lead to significant
improvements in core ASD impairments. These improvements include clinically
meaningful gains in adaptive skills and intellectual functioning, and a reduction in
autism severity (e.g., Cohen et al., 2006; Eikeseth et al., 2012; Howard et al., 2014;
Sallows & Graupner, 2005; Waters et al., 2018). Interventions based on ABA have
gained the support of organizations such as the American Academy of Pediatrics
(Hyman et al., 2020) and the Centers for Disease Control and Prevention (CDC, 2019).
The goal of ABA for individuals with ASD is to improve the individual’s quality of life
by changing the environment around the individual to teach them useful skills. For
example, ABA practitioners may target increasing communication skills to help
individuals with ASD state their needs and wants, use their voices effectively,
and make choices based on their preferences.
Section 6: Measuring Outcomes of ABA-Based Treatments in Research
Thus far, all studies using controlled clinical trials to assess ABA outcomes have
focused on early intensive behavioral interventions (EIBIs) where “intensive” typically
means 20–40 hours per week of individualized therapy. The results of these studies
and systematic reviews have shown that, compared with treatment as usual and
other treatment types, EIBI leads to the best outcomes for children with ASD,
with significant improvements in cognitive and adaptive skills (Eldevik et al., 2009;
Makrygianni et al., 2018; Peters-Scheffer et al., 2011; Reichow et al., 2018; Rodgers et
al., 2021; Virués-Ortega, 2010). EIBI can lead to increased learning rates such that
children who are diagnosed as developmentally delayed may catch up with their
typically developing peers, with some scoring within the normal range on valid
standardized tests of intelligence, language, socialization, and daily living skills
(Klintwall & Eikeseth, 2015).
The effects of EIBI also tend to persist over time. For example, a recently published
study followed up with individuals after they had received EIBI services and found
that the effects of EIBI remained 10 years later (Smith et al., 2019). Another important
finding in this study was that none of the participants developed additional diagnoses
such as anxiety, attention deficit hyperactivity disorder, or depression. These co-
occurring disorders are otherwise common for adolescents and adults with ASD
(Gjevik et al., 2011). In sum, research clearly suggests that EIBI is, at present, one
of the best treatment options for children with ASD.
Several assessment instruments have been used to measure the outcomes of EIBI. Most
studies have reported the effects of EIBI using standardized measures of adaptive
behavior in everyday life (e.g., Vineland Adaptive Behavior Scales), intellectual
functioning (e.g., Bayley Scales of Infant Development, Wechsler Preschool and Primary
Scales, Mullen, Stanford-Binet-5), and various measures of autism severity (e.g., ADOS,
ADI-R, Childhood Autism Rating Scale, Social Responsiveness Scale; Ridout & Eldevik,
2021). When measuring across patients, past research has consistently demonstrated a
dose–response relationship between the number of hours per week of ABA (dose) and the
amount of change in an outcome measure (response), such as IQ scores or adaptive
behavior in everyday life. Additionally, at the group level, parent engagement and
participation in treatment are associated with better outcomes. However, at the
individual patient level, response to intervention can vary for currently unknown
reasons. Thus far, systematic reviews have been unable to identify consistent
predictors or risk factors for individual patient outcomes following ABA.
Section 7: The Need for a Unified Approach to Selecting Assessment Instruments in ABA
Considerable variability in service delivery exists across ABA service providers. As a
result, there is an industry-wide need for ABA service providers to monitor treatment
outcomes, identify where treatment outcomes differ, and determine why those
differences occur. There are numerous reasons why ABA service providers should
document the effectiveness of the behavioral treatments they provide, and
organizations should adopt a standardized approach to measuring outcomes.
At the forefront is the ethical responsibility of practitioners and organizations for
accountable patient care (Guideline 4.04 – BACB, 2020; Standard F.03 – BHCOE, 2021).
In addition, and particularly where ASD is concerned, documenting treatment outcomes
allows ABA practitioners to set themselves apart from providers who provide treatments
for which there is little to no research evidence. Yet, despite widespread agreement that
practitioners and researchers must evaluate treatment outcomes, there is little
agreement about how this should be done in ABA (Smith, 2013).
Systematically evaluating the effectiveness of ABA treatment has been complicated
by several factors. First, ABA service providers offer patient care across the lifespan,
which makes it difficult to use the same measurement instruments for individuals
at different ages. This presents a challenge for comparing ABA treatment outcomes
across patients as well as for the same patient over time. Second, many ABA
treatment providers have not been trained in, or are not credentialed to use,
the primary measurement instruments and tests historically used by researchers
to measure ABA treatment effects (e.g., intelligence tests, diagnostic instruments).
Third, treatments based on ABA are highly individualized to the specific needs of
the patient. Therefore, instead of using standardized measurement tools, ABA
treatment providers have historically relied on their training and experience to
target and measure specific behavior(s) needing improvement. This approach leads
to idiosyncratic data collection methods and visual inspection of data unique to the
patient to determine treatment progress.
For example, consider a child who is screaming and throwing himself on the floor
when asked to use the restroom. Based on the patient’s stated definition of a quality
life, an ABA treatment provider may be asked to develop an intervention to teach the
child how to approach and use the toilet effectively. Here, a standardized measure of
problem behavior or adaptive skills would have limited value for designing the
intervention and capturing improvement. Instead, the results of a functional behavior
assessment allow the treatment team to develop a function-based intervention plan
where behavior data before, during, and after the intervention demonstrate whether
a change in behavior occurs that improves the patient’s quality of life. The data
resulting from this more nuanced and individualized approach are likely to be more
meaningful to the treatment team for monitoring the patient’s improvement.
Nevertheless, equipped only with individualized data, one finds it difficult to compare
outcomes of idiosyncratic approaches across patients and across providers. To do
this, practitioners must agree on common data definitions and methods of data
collection to measure the overall treatment gains across patients and to predict
rate of improvement for patients with similar behavioral presentation. Standardized
measurement tools provide common definitions and methods of data collection that
allow for large-scale, cross-patient comparison of adaptive skills changes and socially
significant gains. Succinctly, standardized instruments offer the field a unified
approach for a more global analysis of the effectiveness of the general approaches
that ABA practitioners take. A critical first step to create fieldwide systems for
documenting treatment outcomes is to develop a systematic and objective process
for selecting appropriate measurement instruments to evaluate treatment
effectiveness (Romanczyk & Gillins, 2008).
Section 8: How We Selected the Assessment Instruments in This Guide
We considered a few important factors when selecting the measurement instruments
outlined in this document. First, we aimed to select scientifically robust instruments
that are also practical to administer, considering time and cost. To do this, the team
of subject matter experts considered the following:
1. The reliability and validity of the instrument
2. The sample size and characteristics (e.g., ethnicity) of the norm comparison group
3. Evidence the instrument has been effectively used with individuals with ASD
4. Time required to administer the instrument
5. Qualifications needed to administer the tool and interpret the results
6. The cost of purchasing protocols
7. The age range for the instrument
8. Availability of protocols in alternative languages
9. The utility of the instrument for treatment planning
10. The acceptability and likelihood of adoption by practitioners
Next, we used the Consensus-based Standards for the Selection of Health Measurement
Instruments (COSMIN). COSMIN is an initiative of an international multidisciplinary
team of researchers who aim to improve the selection of outcome measurement
instruments in both research and clinical practice by developing tools for selecting
the most appropriate available instrument (Mokkink et al., 2010). COSMIN calls for
standardization of outcomes and outcome measurement instruments by developing Core
Outcome Sets (COS) and COS methodology. A COS is a consensus-based minimum set of
outcomes that should be measured and reported in all clinical trials of a specific
disease or trial population (COSMIN, n.d.). We followed the recommended four steps to
assess whether a study met the standard for good methodological quality (Mokkink et
al., 2010):
1. Determine which properties are evaluated in an article (i.e., internal
consistency, reliability, measurement error, content validity, structural
validity, hypotheses testing, cross-cultural validity, criterion validity,
responsiveness, and interpretability)
2. Determine if the statistical methods used in the article are based on
Classical Test Theory (CTT) or on Item Response Theory (IRT)
3. Determine if a study meets the standards accompanying the properties
chosen in step 1 for good methodological quality
4. Determine the generalizability of the results
We then used the following four steps to select the measurement instruments
(Mokkink et al., 2016):
1. Consider the following concepts:
   a. The construct (e.g., outcome or domain) to be measured
   b. The target population (e.g., age, gender, disease characteristics)
2. Find existing outcome measurement instruments through systematic
reviews, literature search, and other sources
3. Check the quality of the outcome measurement instruments through the
evaluation of measurement properties and feasibility aspects of the
identified instruments
4. Select one instrument for each outcome in a COS
   a. The minimum requirements for including an outcome measurement
   instrument in a COS are
      i. At least strong evidence for good content validity and for
      good internal consistency (if applicable)
      ii. Evidence that the instrument is feasible
   b. A consensus procedure is used to agree on the instruments for
   each outcome included in the COS
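The minimum requirements in step 4 can be sketched as a simple filter. This is only an illustration under stated assumptions: the instrument names, outcomes, and ratings below are hypothetical, and the dictionary keys are our own labels rather than COSMIN terminology. The sketch stops at a shortlist because the final choice for each outcome is made through a consensus procedure, not by code.

```python
from collections import defaultdict


def shortlist_by_outcome(candidates):
    """Keep instruments that meet the minimum requirements, grouped by outcome."""
    shortlist = defaultdict(list)
    for c in candidates:
        # Internal consistency is only required when it applies to the instrument.
        consistency_ok = (
            not c["consistency_applicable"] or c["strong_consistency_evidence"]
        )
        if c["strong_content_validity_evidence"] and consistency_ok and c["feasible"]:
            shortlist[c["outcome"]].append(c["name"])
    return dict(shortlist)


def candidate(name, outcome, content, applicable, consistency, feasible):
    """Build one hypothetical candidate record (illustrative helper)."""
    return {
        "name": name,
        "outcome": outcome,
        "strong_content_validity_evidence": content,
        "consistency_applicable": applicable,
        "strong_consistency_evidence": consistency,
        "feasible": feasible,
    }


# Hypothetical candidates; names and ratings are invented for illustration.
candidates = [
    candidate("Instrument X", "communication", True, True, True, True),
    candidate("Instrument Y", "communication", False, True, True, True),
    candidate("Instrument Z", "adaptive behavior", True, False, False, True),
    candidate("Instrument W", "adaptive behavior", True, True, True, False),
]

shortlist = shortlist_by_outcome(candidates)
print(shortlist)
```

Instruments that survive the filter would then go to the consensus procedure described in step 4b.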
Additionally, we considered the measurement instruments’ acceptability to the
research and practice community. In a recent study, Padilla (2020) surveyed 1,428
individuals recruited from the BACB listserv to evaluate global use of assessments
by behavior analysts. The author found that the three most frequently reported
instruments were the Verbal Behavior Milestones Assessment and Placement Program
(VB-MAPP), Assessment of Basic Language and Learning Skills–Revised (ABLLS-R®), and the
Vineland Adaptive Behavior Scales (VABS). These findings were in line with what researchers
who conducted similar surveys had reported in previously published studies (see
Padilla, 2020 for a review of these studies).
Though not a perfect measure, widespread adoption of an instrument typically
suggests that a tool is feasible to administer and useful to clinical practice. If a large
proportion of practitioners already use a particular measurement instrument, they are
much more likely to continue using the instrument. However, practitioner adoption of
instruments is sometimes influenced by third party funding sources. Therefore, we
also reviewed instruments that large insurance providers currently require their
provider network to administer and found those to be the same instruments
practitioners reported using most frequently in Padilla (2020). At a broader level, we
found that the measurement tools most frequently utilized by practitioners were
already on the list of instruments that we selected based on the COSMIN method.
Lastly, we considered several other practical factors that should affect decisions
regarding which measurement instruments to utilize. For example, we only included
instruments that practitioners of ABA are qualified to administer. Also, when possible,
we selected instruments that are comparatively time and cost efficient and can be
administered to a broad patient population (e.g., larger age range, availability of
instrument in different languages).
Part 2: Practical Applications
Section 1: Considerations
In this part, we provide recommendations that reflect established research findings and
best clinical practices for selecting measurement instruments to assess the outcome of
ABA therapy. However, assessment decisions should be made based on the individual
needs of each patient. These guidelines should not be used to diminish access, quality,
or frequency of currently available behavioral treatment services. Coverage of behavioral
treatments for ASD by healthcare funders should not supplant responsibilities of
educational or governmental entities, except where required by state or federal law.
In this document, guidance is limited to assessments for ABA treatment only.
Section 2: Selecting Measurement Instruments: A Decision Model
When selecting a measurement instrument, the assessor must consider several factors.
For this reason, payor guidelines should convey flexibility when describing assessment
requirements. The decision-making process should include reason for referral, criteria
for instrument selection, and practical considerations such as the time to administer
and score the instrument (see Figure 1). Below, we have outlined the factors that
affect the practitioner’s decision to select a particular instrument and provided this
information in the Appendices to help assessors make decisions about each measurement
instrument.
Assessor Must First Consider the Reason(s) for Assessment
The initial referral concern(s) is usually the first thing that should influence the
assessor’s selection of specific assessment tools. For example, the assessor may use
an instrument that specifically addresses problem behaviors for a patient referred for
aggression. Furthermore, an assessor may wish to evaluate a patient’s performance
compared with an objective pre-determined criterion (criterion-referenced
interpretation) or compare the patient’s performance with the performance of same-
age peers (norm-referenced interpretation).
Measurement instruments enable practitioners to assess individual skills in an objective
manner (Raynolds & Livingstone, 2012). In this guide we focus on two types of
interpretations of assessment results: norm-referenced and criterion-referenced
(Sattler, 2014).
Norm-referenced interpretations. Tools that rely on norm-referenced interpretations
allow comparison of the patient’s performance with the performance of same-age peers
that represent a norm group or standardization sample (Sattler, 2014). A norm provides
an average (typical) performance of the comparison group and the spread of the scores
above and below the average. In other words, norms allow for the comparison of an
individual’s performance to same-age peers, which provides valuable information about
the level of deficit. A norm-referenced interpretation answers the following question:
“How does the examinee’s performance on a specific skill compare to others of the same
age?” (Raynolds & Livingstone, 2012).
All assessments produce raw scores for assessed skills. When norm-referenced
assessment measures are used, the raw scores alone are an uninterpretable measure of
performance because they do not provide any way of contextualizing the score (Sattler,
2014). For example, a 4-year-old child’s raw score on the Expressive Vocabulary Test-3
(EVT-3) might be 25. This score alone does not help the evaluator determine if the child
shows deficits or age-level skills as a speaker because the evaluator has no information
on how other 4-year-olds performed on the test. To make a comparison, the evaluator
must first convert the raw score to a derived score using the mean and standard
deviation of the standardization sample of same-aged peers (i.e., 4-year-olds). After
converting the raw score of 25 to a derived score (e.g., a standard score) and comparing
the derived score to scores obtained from the standardization sample, the evaluator can
determine if the child’s performance is within, below, or above the average range
compared with the performance of all 4-year-olds in the standardization sample
(Raynolds & Livingstone, 2012).
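The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring procedure of any published instrument. The normative mean and standard deviation are hypothetical values invented for the example (real values come from the norm tables in the test manual), and both the mean-100, SD-15 scale and the one-SD "average range" band are assumptions, although they are common conventions.

```python
def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Linear derived score on an assumed mean-100, SD-15 scale."""
    return scale_mean + scale_sd * z_score(raw, norm_mean, norm_sd)


def describe(score, scale_mean=100.0, scale_sd=15.0):
    """Label a score using an assumed band of one SD around the scale mean."""
    if score < scale_mean - scale_sd:
        return "below average"
    if score > scale_mean + scale_sd:
        return "above average"
    return "within the average range"


# Hypothetical norms for 4-year-olds (NOT taken from the EVT-3 manual):
# mean raw score 31, standard deviation 4.
derived = standard_score(raw=25, norm_mean=31.0, norm_sd=4.0)
print(derived, describe(derived))
```

With these made-up norms, the raw score of 25 sits 1.5 standard deviations below the mean of same-age peers, which is information the raw score alone could not convey.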
When using an assessment instrument that is norm-referenced, the practitioner should
select instruments that include a standardization sample that has at least 100 individuals
in each age group and accurately represents the population of the country or region
where the patient lives (Sattler, 2014).
Raynolds & Livingstone (2012) recommend considering the following variables when
selecting a norm-referenced measure:
1. How representative is the standardization sample of the population across age,
race, sex, education level, language, and geographic location?
2. Is the standardization sample large enough to provide stable statistical
information (it should include at least 100 participants for each age group)?
For norm-referenced interpretations, the characteristics of the population that are
most often considered for selecting a standardization sample are age, gender, grade,
geographic region, ethnicity, and socioeconomic status (Sattler, 2014). Comparing the
standardization sample to the data from the most recent census is the best way to
evaluate how well the child being assessed is represented. For example, consider a
practitioner who is evaluating two measurement instruments that assess the same skill.
The standardization sample for measure A includes 110 individuals from each age
group, and individuals sharing the patient’s specific ethnicity comprise 22% of the
sample. Measure B includes 101 individuals for each age group, and individuals sharing
the patient’s specific ethnicity comprise 28% of the sample. Census data show that 27%
of the U.S. consists of that specific ethnic group. Though both measures meet the
minimum sample size requirement for each age group, measure B is a better measure
because it is more representative of the patient’s background. For all norm-referenced
measures, the standardization sample data is included in the manual for that measure.
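The comparison of measures A and B can be written out as a short calculation. The sketch below uses the figures from the example in the text. The function names are our own, and reducing representativeness to a single demographic variable is a simplifying assumption, since in practice several characteristics would be compared against census data.

```python
MIN_PER_AGE_GROUP = 100  # minimum per age group cited in the text (Sattler, 2014)


def representation_gap(sample_pct, census_pct):
    """Absolute difference, in percentage points, between sample and census shares."""
    return abs(sample_pct - census_pct)


def pick_more_representative(measures, census_pct):
    """Among measures meeting the size minimum, return the name with the smallest gap."""
    eligible = {
        name: m
        for name, m in measures.items()
        if m["n_per_age_group"] >= MIN_PER_AGE_GROUP
    }
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda name: representation_gap(eligible[name]["pct"], census_pct),
    )


# Figures from the worked example in the text.
measures = {
    "A": {"n_per_age_group": 110, "pct": 22},
    "B": {"n_per_age_group": 101, "pct": 28},
}
best = pick_more_representative(measures, census_pct=27)
print(best)  # B: a gap of 1 percentage point versus 5 for A
```

The size check runs first because a sample that closely matches the census is still of limited use if it is too small to give stable statistics.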
Criterion-referenced interpretations of scores. In criterion-referenced interpretation of
scores, the patient’s performance is compared with an objective criterion (Sattler, 2014).
When using a criterion-referenced interpretation of the assessment results, the emphasis
is on the behaviors in the patient’s repertoire and not behavioral patterns relative to
others of the same age or grade (Raynolds & Livingstone, 2012). A criterion-referenced
interpretation answers the following question: “Does the child’s performance reach a
predetermined level of proficiency?” (Raynolds & Livingstone, 2012). A common example
of criterion-referenced interpretation of assessment results is performance on the
VB-MAPP assessment. If a 4-year-old child being assessed using the VB-MAPP obtained a
raw score of 10 on the tact subtest, the raw score would show only the tacts the child
has in her or his repertoire compared with the performance criterion set for the tact
subtest (i.e., a score of 15; Raynolds & Livingstone, 2012). Given that the VB-MAPP uses
criterion-referenced interpretation of scores, the assessor cannot determine if a raw
score of 10 is average, below average, or above average for the 4-year-old client
because there is no information on how a sample of typical 4-year-old children would
perform on the same subtest of the VB-MAPP (Sattler, 2014).
The most common criterion-referenced interpretations of assessment results are
percentage and mastery testing (Raynolds & Livingstone, 2012). Percentage is derived
using perfect performance as the criterion (i.e., answering all exam questions correctly).
When conducting mastery testing, the assessor measures whether the examinee has
achieved a specific level of mastery for the given skill. For example, the VB-MAPP uses
mastery criteria as a measure of performance. The child is considered to show mastery of
the targeted subtest if the set criterion for that given subtest is met.
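Both interpretations reduce to simple arithmetic, as the following sketch suggests. It is an illustration only and does not reproduce the scoring rules of the VB-MAPP or any other instrument. The 20-item test in the percentage example is hypothetical, while the raw score of 10 and the criterion of 15 are taken from the tact example earlier in this section.

```python
def percent_correct(raw_score, max_score):
    """Percentage score, using perfect performance as the criterion."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return 100.0 * raw_score / max_score


def meets_mastery(raw_score, criterion):
    """True when the raw score reaches the predetermined mastery criterion."""
    return raw_score >= criterion


# Hypothetical 20-item test with 15 items answered correctly.
pct = percent_correct(raw_score=15, max_score=20)

# Values from the tact example in the text: raw score 10, criterion 15.
mastered = meets_mastery(raw_score=10, criterion=15)

print(pct, mastered)
```

Neither result says anything about how same-age peers perform, which is the defining difference from a norm-referenced interpretation.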
Assessor Must Consider the Following Criteria for
Instrument Selection
When selecting assessment instruments, practitioners must consider the reason for
referral, the patient’s age, the primary language spoken by the patient, and the
validity and reliability of the assessment instruments. It is important that the
practitioner selects instruments that are valid and reliable to give confidence that the
assessment results accurately reflect the patient’s abilities.
Reliability and Validity of the Instruments. Reliability in the context of assessments
refers to stability and consistency of the measure (Sattler, 2018; Raynolds & Livingstone,
2012). Reliable assessment results must show stable scores across time (test–retest
reliability), within the scores themselves (internal consistency), with parallel forms of the
measure (alternate form reliability), and when used by various assessors (interrater or
interobserver reliability). Validity in this context asks if the test measures what it is
supposed to measure. There are several forms of validity that ask whether the items
represent the domain that is assessed (content validity), if the test appears effective
relative to its stated goal (face validity), and whether the test measures the specified
constructs (construct validity). For the purposes of this practice guideline, reliability and
validity related to the selection of assessment instruments will be discussed instead of
the specific type of reliability and validity measures².
² For more information on reliability and validity, interested readers are referred to Sattler (2014) and
Raynolds & Livingstone (2012).
All manuals for norm-referenced interpretation of results include information on past
measures of the reliability of the test results. Reliability is reported as the ratio of
true-score variance to observed-score variance (i.e., reliability coefficient;
Sattler, 2014). Reliability is a critical component when selecting among assessment
instruments. Low reliability coefficients indicate a high level of error in obtained scores,
which calls into question the interpretation of the obtained scores (Sattler, 2014). For
selecting assessment measures, tests with higher reliability are better measures than
tests with lower reliability (Sattler, 2014; Raynolds & Livingstone, 2012). What is
considered sufficient reliability may vary depending on the assessment under
consideration. For example, diagnostic assessments require a reliability coefficient
of .90 or above, but tests for screening purposes or for purposes of skill acquisition
programming require reliability coefficients of .80 or above (Sattler, 2014).
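The thresholds above, and the link between low reliability and score error, can be sketched as follows. The purpose labels and function names are our own. The standard error of measurement formula, SD × √(1 − reliability), is the standard classical test theory expression and is not stated in this document. The score SD of 15 and the coefficients used in the example are hypothetical.

```python
import math

# Minimum reliability coefficients by purpose, as cited in the text (Sattler, 2014).
MIN_RELIABILITY = {
    "diagnostic": 0.90,
    "screening": 0.80,
    "skill_acquisition": 0.80,
}


def is_reliable_enough(coefficient, purpose):
    """Check a reported reliability coefficient against the minimum for a purpose."""
    return coefficient >= MIN_RELIABILITY[purpose]


def standard_error_of_measurement(score_sd, coefficient):
    """Classical test theory SEM: larger values mean more error in obtained scores."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - coefficient)


# Hypothetical instruments scored on a scale with SD 15.
for r in (0.91, 0.84):
    sem = standard_error_of_measurement(15.0, r)
    print(r, is_reliable_enough(r, "diagnostic"), round(sem, 2))
```

In this made-up comparison the instrument with the lower coefficient has the larger standard error, which is why it would be the weaker choice for diagnostic use.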
Primary Language of the Patient. The patient’s primary language must also be
considered when choosing an assessment and, to the best ability of the assessor, the
assessment should be conducted in the child’s primary language (Sattler, 2018). When
using tests that rely on norm-referenced interpretation, translation of the assessment
questions can affect the reliability of the test and increase score errors (Sattler, 2018).
It is strongly recommended to use assessment instruments that have been norm
referenced in the patient’s primary language (Sattler, 2018). For example, the
Receptive and Expressive One-Word Picture Vocabulary Tests–Fourth Edition uses
a norm-referenced interpretation of the scores and has both English and Spanish
versions. If one needed a standardized language test for a child whose primary
language is Spanish, the Receptive and Expressive One-Word Picture Vocabulary
Tests–Fourth Edition could be an option. If there are no assessment instruments in
the patient’s primary language, use of interpreters who are fluent in English and the
patient’s primary language is the best option (Sattler, 2018). When reporting the
results, the assessor should note that an interpreter was used during the assessment.
Age of the Patient. When using assessment measures, the patient’s age must also be
considered. For instruments that assess skills, age of the patient provides information
about what is developmentally appropriate for the patient to have in their repertoire.
When using tests that use norm-referenced interpretation, it is important to use
assessment instruments that have been norm referenced with individuals who are in
the patient’s age group. If a specific measurement instrument is the only option to
use and it is outside of the patient’s age range, then the assessor can use a qualitative
interpretation of the test scores instead of reporting standard scores.
Specific Areas of Need. When selecting assessment measures, practitioners should
consider the patient’s individualized needs. This can include the patient’s own wants
and desires, the patient’s family or caretaker’s wants and desires, personal safety,
and the skills needed to assist the individual in living a full and meaningful life as
defined by the patient. The areas of need include, but are not limited to, severity of
ASD, communication skills, readiness to learn, daily living skills, social and play skills,
problem behaviors, and executive functioning. Additionally, at a higher level, socially
important outcomes that matter when providing care include the individual’s quality
of life, stress, and overall happiness and satisfaction.
Section 3: Additional Practical Considerations for Selecting Measurement Instruments
It is important for the practitioner to select instruments that are easy to use, cost
effective, and efficient. Furthermore, it is imperative that the assessor has the training
and qualifications required to administer the instrument for valid results. The
following paragraphs list additional considerations for selecting measurement
instruments.
Assessor Qualifications. The Behavior Analyst Certification Board (BACB), American
Psychological Association (APA) and the Behavioral Health Center of Excellence
(BHCOE) include boundaries of competence as an ethical standard. The BACB ethical
guidelines include the following: (a) All behavior analysts provide services, teach,
and conduct research only within the boundaries of their competence, defined as
being commensurate with their education, training, and supervised experience;
and (b) Behavior analysts provide services, teach, or conduct research in new areas
(e.g., populations, techniques, behaviors) only after first undertaking appropriate
study, training, supervision, and/or consultation from persons who are competent in
those areas.
Guidelines from the APA include the following: (a) Psychologists provide services, teach,
and conduct research with populations and in areas only within the boundaries of their
competence, based on their education, training, supervised experience, consultation,
study, or professional experience; and (b) When psychologists are asked to provide
services to individuals for whom appropriate mental health services are not available and
for which psychologists have not obtained the competence necessary, psychologists
with closely related prior training or experience may provide such services to ensure
that services are not denied if they make a reasonable effort to obtain the competence
required by using relevant research, training, consultation, or study.
Standards for ABA organizations from the BHCOE are that an ABA organization must
act honestly and responsibly to (a) promote ethical practices of its employees and (b)
support certified employees in complying with ethical and professional requirements
of their certifying and/or licensing body. The organization never directs employees to
act in violation of those requirements and resolves any conflicts between
company policy and those requirements.
It is the responsibility of the assessor to follow ethical guidelines regarding the
selection and administration of tests. Test publishing companies (e.g., Pearson, WPS)
provide clear information on the minimum qualifications needed to purchase,
administer, and use the results of each test they sell.
Most publishers use three main qualification levels (usually designated by letters).
For example, Pearson uses A, B, and C qualification levels, whereas WPS
uses A, B, C, and N qualification levels. The minimum requirements to use tests at
each qualification level are provided by the publishers. For example, the Pearson
website shows that to purchase or use the Vineland-3 one must have a
level B qualification, which means that the assessor must possess one of the following:
• A master's degree in psychology, education, speech–language pathology,
occupational therapy, social work, counseling, or a field closely related to
the intended use of the assessment, along with formal training in the ethical
administration, scoring, and interpretation of clinical assessments; or
• Certification by or full active membership in a professional organization
(such as ASHA, AOTA, AERA, ACA, AMA, CEC, AEA, AAA, EAA, NAEYC, NBCC)
that requires training and experience in the relevant area of assessment; or
• A degree or license to practice in the healthcare or allied healthcare field.
A Unified, Systematic Approach to Selecting Measurement
Instruments To Measure ABA-Treatment Outcomes:
The BHCOE ABA Outcomes Framework™
In Figure 2 we depict a decision-making model that includes the measurement
instruments selected in this guide. For each instrument, we provide information in
Tables 1–6 in the Appendices, such as the age range, duration of assessment, relative
cost, qualifications required to administer the instrument, and general pros and cons.
If more than one instrument met criteria for selection during review by subject matter
experts, we included them all in Figure 2 and left room for the practitioner/assessor
to choose between assessments based on their training and experience. These
instruments are depicted in alphabetic order and separated by the word “or.”
In Step 1, the practitioner should select one instrument to measure impact or severity
of ASD. Ideally, each ABA provider organization would adopt one instrument
organization-wide based on its patients’ age ranges, practitioners’ experience, and
organizational preference. Selecting one instrument will make it easier to compare
patients’ scores across time and to other patients. In Step 2, for each skill area and
based on the patient’s reason for referral, the practitioner should use an
assessment that allows comparing the patient’s score to a norm-group. In Step 3,
again based on the need of the patient and practitioner experience, the
assessor/practitioner should use an assessment that allows targeting skills to teach
to a criterion. For example, for communication skills, norm-referenced instruments
provide scores on level of deficits in expressive (i.e., speaker behaviors) and
receptive (i.e., listener behaviors) language. But these instruments are too broad
to select specific verbal behaviors to teach. Criterion-referenced instruments in
this guide allow expressive and receptive language to be separated into specific
components such as requesting, labeling, answering “Wh” questions (i.e., Who,
What, Where, When, Why), identifying nouns, and following directions.
For practicality, some instruments minimize overall assessment time and facilitate
analyses across patients because they measure multiple skill domains (e.g.,
communication, daily living, social) and cover broad age ranges. For example, a patient in
comprehensive EIBI may benefit from the CARS-2 for measuring severity of ASD; the
Vineland-3 for measuring communication, daily living, and social skills; and the
VB-MAPP for measuring basic communication, learning readiness, skills barriers, social
skills, and play.
Step 4. In addition to measuring the impact of ASD and specific behaviors targeted
for treatment, assessors should measure the social significance of their treatment. At
a higher level, meaningful intervention for ASD should improve the quality of life, and
reduce the emotional distress, of the individual and their family. Research evidence suggests that
the quality of life of families with a child with ASD is more impaired than the quality of
life of families with children with other developmental disabilities. The same trends
have been found regarding parent stress. Along the same lines, adolescents and
adults with ASD report lower quality of life. Therefore, evaluating change in quality of
life and stress is important for guiding the course of a patient’s progress during
treatment and as outcome measures, allowing for evaluation of more global family-
system effects.
Another socially relevant and important factor to measure is the patient’s, and their
parents’/guardians’, satisfaction with treatment. Engaging parents in treatment is
important (see BHCOE Accreditation, 2021) as it is related to better treatment
outcomes. Parental satisfaction with, and acceptability of, treatment likely relates to
their engagement with the treatment team, continuity with treatment, and
involvement in treatment sessions. Furthermore, a patient’s satisfaction with their
own treatment may be related to their self-observed improvements or likelihood to
adhere to treatment protocols. Along the same lines, there is some evidence that
parents’ scores on acceptability of a treatment are related to their likelihood to adhere
to the treatment protocol. Currently, however, the literature is mixed regarding the
relationship between measures of satisfaction and acceptability and treatment
effectiveness or parental adherence to treatment. The mixed findings may arise
because, unlike skills assessments, measures of social validity are self-reports and
capture the informants’ perceptions of treatment, which may be affected by factors
other than skills improvement or reductions of problem behaviors. Further research is
needed in this area.
At this time, compared with instruments that measure skills, problem behaviors,
and severity of ASD, very few instruments that measure social significance following
ABA treatment have been developed and examined for reliability and validity in
the literature. For example, in the 14 studies reviewed by McNaughton (1994),
14 different instruments were used, each individually developed for a specific
research project, to measure parent satisfaction.
The few instruments that have been administered with individuals with ASD relate to the
individual’s, or their parents’, quality of life; their level of stress; general satisfaction with
treatment; and treatment acceptability. Unfortunately, few instruments include both
self- and proxy-report (e.g., PedsQL™). Most instruments have been developed either
for the patient or their parents/guardians. Therefore, at this time, practitioners/assessors
are encouraged to interpret results with caution as much more research is needed in
this area.
These social validity instruments are included in Figure 2. To measure social
significance, ABA organizations are encouraged to select the instruments most
appropriate for their patient population and adopt instrument(s) they can use across
their patients. Given the scarcity of robust measures of social validity at this time,
organizations may adapt these instruments and make slight revisions to suit their
needs (e.g., change “EIBI” in PSS-EIBI to ABA Services).
Instruments in Step 4 are typically not administered by assessors/practitioners but are
instead completed by the patient or their parents/guardians and other caregivers as
proxies. These instruments are typically administered outside of clinical sessions and
may be administered via an automated email survey.
FIGURE 2: THE BHCOE ABA OUTCOMES FRAMEWORK™
FIGURE 3: CASE EXAMPLE
A 3-year-old patient is referred for ABA therapy based on needs in communication, daily living, and
social and play skills.
Assessment Re-administration. Some instruments use rating scales to measure mood
or symptomatology and are more sensitive to short-term changes. These instruments
do not provide a norm group for reference or specific criterion for treatment
planning, but they can be re-administered as often as every few months. Generally,
however, to observe progress, it is best to re-administer instruments that contain a
norm group or criterion for reference every 6 months to 1 year. For standardized
measurement instruments, meaningful changes are unlikely to be detected until at
least 1 year of treatment has occurred.
In conclusion, Figure 4 shows an overview of the assessment process.
The assessor begins by selecting the informants who would be the best sources for
information regarding the patient based on their relationship, the patient’s needs,
and the context of services. The patient’s needs and the underlying reason for referral
lead the assessor’s decision making regarding the specific types of assessment.
Social validity assessments allow the assessor to gain a more global understanding
of the impact of treatment. The assessment results are then interpreted collectively,
and the interpretations lend themselves to various recommendations for treatment.
The clinician can then begin treatment and should conduct follow-up assessments
throughout the treatment period to measure the patient’s progress.
Summary Regarding Instrument Selection
Choosing assessments to measure treatment outcomes is a multi-faceted process
that requires careful consideration of many factors. Measuring all patient progress
with a single instrument fails to adequately capture the many behavioral and skill
changes a patient is likely to experience following ABA treatment. In the first part
of this guideline, we provided practitioners and other stakeholders with a decision-
making model that outlines the measurement instruments that can be used with
individuals with ASD depending on the reason for referral and why they choose to
receive ABA therapy. To do this, we used the COSMIN method to select reliable and
valid measurement instruments that are comparatively efficient in terms of cost and
time to administer and that are applicable to broad age ranges. We have provided a
roadmap for selecting measurement instruments based on the referral problem.
We have also outlined factors that should affect the practitioner’s decision to use
one instrument rather than another, including the patient’s age, primary language,
and other personal characteristics. Lastly, we provided guidelines regarding
re-administration of assessment instruments for detecting change. In the next part
of this guideline, we discuss how to record, store, analyze, and interpret the data
obtained from administering these measurement instruments.
Having read Part II, we hope the reader can differentiate between different types
of assessments (e.g., norm-referenced, criterion-referenced) and determine which
assessment(s) is (are) most appropriate for the types of skills that patients of their
organization target through ABA. Hopefully, the reader also has identified how
they want to analyze those data at the organizational level. Such analyses allow the
organization to understand how well different practitioners lead programs that
positively affect patient progress and how the organization compares with other
organizations on treatment outcomes. But it is one thing to collect assessment
data and know what you want to do; it is another thing entirely to make use of
those data efficiently. In Part III, we discuss basic principles for creating and
maintaining tidy data sets and relational databases and how these translate to
best practices when practitioners store assessment data in Excel documents or
the latest cloud-based big data database.
Section 1: Keeping Your Data Sets Tidy
Perhaps the most important aspect of storing assessment data is how each Excel
sheet or dataframe is structured. It is often stated that 80% of data analysis is cleaning
and preparing data in a format that can be analyzed (Dasu & Johnson, 2003). If analyses
are to be repeated, then the behaviors necessary to clean and prepare data would be
repeated every time the analyses are conducted. However, with a little bit of planning
and foresight, data can be cleaned, prepared, and stored in Excel documents or
databases to facilitate efficient data analytics. Structuring datasets to facilitate efficient
analysis is sometimes referred to as keeping “tidy data” (Wickham, 2014).
One Row per Observation. In Figure 5 we have shown one example of a well-
organized VB-MAPP dataframe (bottom panel) as well as a dataframe that would
require cleaning and preprocessing to be used (top panel). The first characteristic
of a tidy dataframe is that a single row is used for each observation. For example,
if the purpose of this dataframe is to compare the progress a patient made between
two assessments, and if we have data from more than two assessments for one
patient, then each assessment comparison would be placed in a separate row.
Row 2, Column H of the top panel in Figure 5 shows an example where multiple
assessment scores are stored in a single row. These should be separated into multiple
rows (e.g., rows 2–3; bottom panel of Figure 5).
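As an illustration of this first characteristic, the following minimal sketch splits a row that holds several assessments into one row per assessment. The records, column names, and the ";" separator are hypothetical and are not taken from Figure 5. Only the Python standard library is used so the sketch runs without additional packages.

```python
# Hypothetical untidy records: patient P001 has two assessments stored in one row.
untidy_rows = [
    {"patient_id": "P001",
     "assessment_dates": "2016-02-22; 2016-08-22",
     "milestone_scores": "11; 27"},
    {"patient_id": "P002",
     "assessment_dates": "2016-03-01",
     "milestone_scores": "22"},
]


def split_observations(rows):
    """Return one row per assessment from rows holding ';'-separated values."""
    tidy = []
    for row in rows:
        dates = [d.strip() for d in row["assessment_dates"].split(";")]
        scores = [s.strip() for s in row["milestone_scores"].split(";")]
        if len(dates) != len(scores):
            # Refuse to guess which score belongs to which date.
            raise ValueError(f"Mismatched dates and scores for {row['patient_id']}")
        for assessment_date, score in zip(dates, scores):
            tidy.append({
                "patient_id": row["patient_id"],
                "assessment_date": assessment_date,
                "milestone_score": int(score),
            })
    return tidy


tidy_rows = split_observations(untidy_rows)
for row in tidy_rows:
    print(row)
```

The function raises an error instead of pairing values silently, because a mismatch between the number of dates and scores usually signals a data entry problem that a person should review.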
One Variable per Column. The second characteristic of the tidy dataframe is one
variable per column. Again, using the example untidy dataframe in Figure 5, Column D
contains the data from multiple VB-MAPP assessments for the patient in row 2. Similarly,
Column I contains the data from multiple subdomains for a single assessment for the
patients in rows 2 and 4. Storing the data in this structure makes it difficult to
query these data without a lot of manual work. In contrast, by keeping one
variable per column and separating out the subdomains into their own column (one for
each subdomain), the resulting tidy dataframe can be analyzed more efficiently.
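A minimal sketch of separating subdomains into their own columns follows. The cell format ("name=value" pairs separated by ";"), the column names, and the scores are assumptions made for illustration only.

```python
# Hypothetical row in which one cell holds several subdomain scores.
untidy_row = {"patient_id": "P001", "subdomain_scores": "mand=5; tact=4; listener=3"}


def spread_subdomains(row, column="subdomain_scores"):
    """Move each 'name=value' pair in `column` into its own '<name>_score' column."""
    tidy = {key: value for key, value in row.items() if key != column}
    for pair in row[column].split(";"):
        name, value = pair.split("=")
        tidy[f"{name.strip()}_score"] = float(value)
    return tidy


tidy_row = spread_subdomains(untidy_row)
print(tidy_row)
```

Once each subdomain has its own column, a question such as "what is the average mand score at intake?" becomes a single-column calculation and no longer requires parsing text.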
One Data Type per Column. A third characteristic of a tidy dataframe is that all data
entered in a column is of an identical data type. Common data types that readers are
likely to use include numbers (whole or decimal), text, and dates. Importantly, to make
use of all rows of a column, all rows must have an identical data type rather than mixed
data types. For example, Column C in the untidy dataframe contains both dates (e.g.,
02/22/2016, Sep-14) and text (e.g., “Sometime in July of 2017”). Another
common challenge arises when dates are entered in different formats.
Thus, in the above example, “Sep-14” may or may not have the correct year attached to
it, and the data analyst would have to do more work to verify the accuracy of that datum.
As additional examples of mixed data types, Columns D and E contain numeric data
(e.g., 11, 27, 22), text data (e.g., “13/170,” “28/170,” “16, maybe 17”), and date data
(e.g., 24-Dec³). Though humans can easily parse what data is useful when looking at
these data, computers cannot without being explicitly directed what data to extract and
how to extract it for each individual cell. This can become tedious and time consuming
and will decrease the overall utility of the dataframe.
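One way to enforce a single data type is to parse every entry against one expected format and flag anything that does not conform. The sketch below uses the date examples quoted above. The MM/DD/YYYY format and the function name are assumptions.

```python
from datetime import datetime

# Entries quoted in the text for the untidy date column.
raw_dates = ["02/22/2016", "Sep-14", "Sometime in July of 2017"]


def parse_date(text, fmt="%m/%d/%Y"):
    """Return a date if `text` matches `fmt` exactly; otherwise return None."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


parsed = [parse_date(text) for text in raw_dates]

# Entries that failed to parse are queued for a person to correct at the source.
needs_review = [text for text, value in zip(raw_dates, parsed) if value is None]
print(parsed)
print(needs_review)
```

Rejecting "Sep-14" is deliberate here: the entry could mean September 14 or September 2014, and the code should not guess on the analyst's behalf.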
³ Readers familiar with Excel are likely familiar with Excel’s autoformatting when entering data as
fractions (e.g., 12/24, which gets converted to 24-Dec). Without careful inspection upon data entry,
you may end up with a dataframe containing inaccurate or invalid data.
FIGURE 5. Demonstration of the exact same data saved in an untidy format (top panel, “Untidy Dataframe”) and tidy format (bottom panel, “Tidy Dataframe”).
One Datum per Cell. A fourth characteristic of a tidy dataframe is that a single
datum is entered per cell and as close to the raw data as possible. If two data
elements are needed to capture an observation (e.g., how many points were earned
on an assessment and how many points were possible), then best practice is to create
two columns, one for each data element, and to enter data accordingly. The top
panel in Figure 5 shows examples of multiple data entered within a single cell (e.g.,
rows 2 or 4 of columns D, H, I, and K), and the bottom panel shows how these data
would ideally be entered (e.g., rows 2, 3, and 5). For example, to use patient age at
the time of the assessment, it would be more helpful to store the patient’s date of
birth and the date of the assessment (two columns) rather than only the patient’s age
in years (one column). Collecting and storing the dates of birth and assessment allows
us to derive the patient’s age as needed or, if age is often used for analyses, age
could be derived and stored as a third column in the dataframe. In contrast, if only
the patient’s age is stored, organizations would be unable to analyze assessment
results based on age cohorts (e.g., millennials, generation Z, generation alpha),
historical periods in the organization (e.g., assessments conducted between 2018
and 2020), or other interesting scenarios where the raw dates would be needed.
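The two examples in this paragraph can be sketched as follows: splitting a combined "earned/possible" entry into two values and deriving age from the two stored dates. The function names and the dates are hypothetical.

```python
from datetime import date


def split_score(cell):
    """Split an 'earned/possible' entry such as '13/170' into two integers."""
    earned, possible = cell.split("/")
    return int(earned), int(possible)


def age_in_years(date_of_birth, on_date):
    """Whole years between date of birth and a later date (e.g., the assessment date)."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


points_earned, points_possible = split_score("13/170")

# Hypothetical patient: born 2013-05-10, assessed 2016-02-22 (before the third birthday).
age_at_assessment = age_in_years(date(2013, 5, 10), date(2016, 2, 22))
print(points_earned, points_possible, age_at_assessment)
```

Because both dates are stored, age can be recomputed in years, months, or days as an analysis requires, whereas a stored age cannot be converted back into dates.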
Section 2: Some Considerations of Tidy Data
What Data to Store. The previous section highlights an inherent tradeoff that occurs
during data entry. The more columns we include, the more data entry must occur
across separate columns, which means more work. As a result, when entering data, people
may be inclined to either add multiple variables to the same column or to aggregate
data into a single score and store only the aggregated data. For example, rather than
entering the raw data for each subdomain score for the VB-MAPP in a separate
column (bottom panel, Figure 5), people may be tempted to enter only a single,
total score column as this takes less work when initially entering the data. However,
by entering only a single, aggregate score per assessment, the organization would be
unable to use any subdomain scores for analyses in the future, which may significantly
inhibit the types of analyses that can be conducted.
Another consideration for behavior analysts is the temporal nature of much of the data
we collect. For assessment data, this might be the scores on specific assessments
conducted at intake and every year thereafter. At least two options exist to store these
kinds of time series data. One option is to capture different assessments as different
variables in the dataframe (i.e., as different columns). Here, the level of observation
(i.e., what defines a row) would be at the patient level, as each patient would take up one
row. The benefit to this approach is that analyzing trends for individual patients would be
straightforward for most individuals, as all observations are on the same row. You would
simply create another column with the analysis you wanted to conduct. This dataframe
structure is sometimes referred to as a wide format.
A downside to creating wide dataframes is that they grow indefinitely to the right.
For each new assessment for a patient, you would need to add a new set of columns
for all data elements. With 20–25 elements collected per assessment, the dataframe
would expand to 100 or more columns after only five assessments, which might be difficult to keep track of
or scan efficiently. Further, if you were interested in looking at only the most recent
assessment for all patients, these might be contained across different columns for
different patients (e.g., the second set of assessment scores for one patient,
the fourth set of assessment scores for a different patient). This would involve
significant restructuring of your data to complete any analyses.
A second option is to capture different assessments as different observations (i.e., as
different rows). Here, the level of observation would be at the assessment level, as
each assessment would take up one row. This method of structuring a dataframe is
sometimes referred to as a long format. The benefit to creating long dataframes is
that the number of columns in the dataframe remains unchanged and consists only of
those elements that are collected with each assessment. This makes managing and
visually inspecting the dataframe very easy. The downside to this approach is that
within-patient analyses require the use of a patient ID or other unique identifier to
map the rows to one another and slightly more advanced data querying skills. Though
this sounds like it might be challenging, it is important to note that most analyses
require restructuring of the data in some manner (e.g., removing observations with
missing data, limiting the analysis to specific groups of patients).
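The difference between the two formats can be sketched in a few lines. The example below converts a hypothetical wide dataframe (one row per patient) into a long one (one row per assessment) and then finds each patient's most recent assessment. All names and values are illustrative. Dates are stored as ISO 8601 text (YYYY-MM-DD), which sorts chronologically.

```python
# Hypothetical wide format: one row per patient, one set of columns per assessment.
wide_rows = [
    {"patient_id": "P001", "date_1": "2018-01-15", "score_1": 40,
     "date_2": "2019-01-20", "score_2": 75},
    {"patient_id": "P002", "date_1": "2018-06-01", "score_1": 55,
     "date_2": None, "score_2": None},
]


def wide_to_long(rows, n_assessments):
    """Return one row per assessment, skipping assessments that never occurred."""
    long_rows = []
    for row in rows:
        for i in range(1, n_assessments + 1):
            if row.get(f"date_{i}") is None:
                continue
            long_rows.append({
                "patient_id": row["patient_id"],
                "assessment_date": row[f"date_{i}"],
                "score": row[f"score_{i}"],
            })
    return long_rows


long_rows = wide_to_long(wide_rows, n_assessments=2)

# Most recent assessment per patient, using patient_id to relate rows to one another.
latest = {}
for row in long_rows:
    current = latest.get(row["patient_id"])
    if current is None or row["assessment_date"] > current["assessment_date"]:
        latest[row["patient_id"]] = row

print(latest)
```

In the long format the "most recent assessment" question is answered by the same loop no matter how many assessments each patient has, which is the main practical advantage described above.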
Data Management. A second topic of significant consideration for behavior analysts
is metadata. Metadata is data that provides information and context about other
collected data. Metadata can be captured in two ways. The first is at the observation
level (i.e., data entered with each row of a dataframe). For example, with assessment
data, we likely want to store data for each assessment conducted, such as the date of
the assessment, who conducted the assessment, where it was conducted (e.g., home,
clinic, school), the patient’s age at the time of the assessment, and any other information
that might be relevant for interpreting or analyzing the data.⁴ For example, perhaps the
time of day when a patient is assessed plays a significant role in their behavior. If so, then
we would want to collect this data for all patients (not just the one for whom we believe
it matters) so that our dataframe is consistent and complete and (more practically) so
that we know the precise degree to which time of day influences assessment scores for
one patient compared with all other patients.
⁴ Though tempting, including important information in a “Notes” column often limits the usefulness of
that information for two reasons. First, unless similar and consistent notes are entered for every row, the
amount of missing data makes that information unlikely to be useful for analytic purposes. Entering notes
takes time and resources. You should always ask whether the time and resources needed to transcribe
those notes into the database are worth that effort. Second, analyzing open text data objectively and
consistently requires skills in an area known as natural language processing (e.g., Bird et al., 2009; Vajjala
et al., 2020). Most behavior analysts are unlikely to have received training in these analyses. Thus, again,
you may end up with a column filled with data that go unused. If the information contained in a “Notes”
or open-text column is believed to be valuable, a better approach is to create a formal column in the
dataframe that captures that data element and to ensure that people consistently collect those data.
The second way that metadata can be captured is at the dataframe level. These
metadata are often stored as a separate tidy dataframe to describe the contents of
the original dataframe. Figure 6 shows an example of a data dictionary that contains
metadata about the assessment dataframe in the bottom panel of Figure 5. Like the
examples above, each row has a single observation, each column has data of a single
type, and each cell has a single data element. This specific kind of metadata
dataframe can be referred to as a data dictionary because it provides (ideally) all
information someone might need to understand what is contained in the dataframe it
references and how they might then use the referenced dataframe. In addition to the
name, definition, and data type for each variable, data dictionaries often also contain
information such as whether missing values are allowed (i.e., the “mandatory”
column), how categorical data are stored or transformed into numbers so they can
be analyzed, whether the variable is a primary or secondary key (more on this below),
and when and what changes have been made to variable definitions over time.
FIGURE 6. Example of a data dictionary with information about the variables stored in a dataframe. Many data
dictionaries include much more information. This is an example of the minimum information you would likely store.
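A data dictionary can also be used to check incoming rows automatically. The following is a minimal sketch under stated assumptions: the variable names, definitions, and the `validate_row` helper are hypothetical and do not reproduce the contents of Figure 6.

```python
from datetime import date

# Hypothetical data dictionary: one row per variable in the assessment dataframe.
data_dictionary = [
    {"name": "patient_id", "definition": "Unique patient identifier",
     "data_type": str, "mandatory": True},
    {"name": "assessment_date", "definition": "Date the assessment was administered",
     "data_type": date, "mandatory": True},
    {"name": "milestone_score", "definition": "Total milestones score",
     "data_type": int, "mandatory": False},
]


def validate_row(row, dictionary=data_dictionary):
    """Return a list of problems found when checking one row against the dictionary."""
    problems = []
    for variable in dictionary:
        value = row.get(variable["name"])
        if value is None:
            if variable["mandatory"]:
                problems.append(f"{variable['name']}: missing but mandatory")
        elif not isinstance(value, variable["data_type"]):
            problems.append(f"{variable['name']}: expected {variable['data_type'].__name__}")
    return problems


good_row = {"patient_id": "P001", "assessment_date": date(2016, 2, 22), "milestone_score": 11}
bad_row = {"patient_id": "P002", "assessment_date": "Sep-14", "milestone_score": None}

print(validate_row(good_row))
print(validate_row(bad_row))
```

Here the missing score in the second row is accepted because that variable is not mandatory, whereas the text entry in the date column is reported.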
Handling Missing Data. A final set of proactive decisions that are worth noting relative
to working with related datasets is how to manage missing values. As practitioners begin
to combine multiple datasets for more advanced data analysis, it is likely that some rows
of one dataset do not have all the corresponding information in the second (or third, or
fourth) datasets. For example, the untidy dataset shown in Figure 5 shows common
patterns of missing data. As a result, most analytic datasets that are the combination of
several related datasets will contain missing values. Although a full treatment of ways to
manage missing data is well beyond the scope of this white paper, two broad strategies
are commonly used.
The first strategy is to drop the observations with missing data. The benefit of dropping
rows with missing data is that you know the results of your analyses are accurate because
they use only observations containing all necessary information. The downside is that the
rows dropped from the final analysis may contain important information or ranges of the
data that are relevant to what you are analyzing. For example, dropping patients with
any missing information from the dataframe in Figure 5 would reduce the dataset from
30 patients to five patients. One way to mitigate this challenge is to drop only the rows
with missing data from the subset of columns specific to your analysis. For example,
perhaps we are only interested in looking at differences in overall milestone scores from
patients in the top panel of Figure 5. Dropping only patients without an overall score
would reduce the total number of rows from 31 to 21.
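The difference between dropping on all columns and dropping on only the columns needed for an analysis can be sketched as follows. The rows and column names are hypothetical, and `None` stands for a missing value.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"patient_id": "P001", "assessor_id": "A7", "milestone_score": 40},
    {"patient_id": "P002", "assessor_id": None, "milestone_score": 55},
    {"patient_id": "P003", "assessor_id": "A2", "milestone_score": None},
]


def drop_missing(rows, columns=None):
    """Keep rows with no missing values in `columns` (all columns if None)."""
    kept = []
    for row in rows:
        to_check = columns if columns is not None else list(row.keys())
        if all(row[column] is not None for column in to_check):
            kept.append(row)
    return kept


complete_cases = drop_missing(rows)                            # strict: every column
score_cases = drop_missing(rows, columns=["milestone_score"])  # only what the analysis needs

print(len(complete_cases), len(score_cases))
```

Restricting the check to the columns used in the analysis retains patient P002, whose only missing value is in a column the analysis does not use.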
A second strategy is to fill in the missing data (referred to as data imputation) using
dummy coding, logical values from domain expertise, or mathematical modeling (e.g.,
Molenburgh et al., 2020; van Buren, 2018). Filling missing data with dummy coding is
essentially creating a specific data value that represents “missing” and that matches the
data type for that column. Using the top panel of Figure 5 as an example, you could use
“missing” for the “1st Assessor ID” column, 01/01/1900 for missing data in the date
columns, and -10 for missing data in the assessment score columns. With this approach the idea
is to assign values that would produce outliers or non-logical values to make it readily
identifiable that you have handled those missing values but that they are, in fact, missing
data and are not a true representation of that variable for that observation.
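The dummy coding approach can be sketched with the placeholder values named above ("missing", 01/01/1900, and -10). The column names and rows are hypothetical. The last step matters most: placeholder values must be excluded before any calculation, or they will distort the result.

```python
from datetime import date
from statistics import mean

# One placeholder per column, matching that column's data type (values from the text).
SENTINELS = {
    "assessor_id": "missing",
    "assessment_date": date(1900, 1, 1),
    "milestone_score": -10,
}


def fill_with_sentinels(row):
    """Replace None with the placeholder defined for that column, if one exists."""
    return {column: SENTINELS[column] if value is None and column in SENTINELS else value
            for column, value in row.items()}


rows = [
    {"assessor_id": "A7", "assessment_date": date(2019, 3, 4), "milestone_score": 40},
    {"assessor_id": None, "assessment_date": None, "milestone_score": None},
    {"assessor_id": "A2", "assessment_date": date(2019, 5, 6), "milestone_score": 60},
]
filled = [fill_with_sentinels(row) for row in rows]

# Exclude placeholder scores before summarizing; otherwise -10 would pull the mean down.
valid_scores = [row["milestone_score"] for row in filled
                if row["milestone_score"] != SENTINELS["milestone_score"]]
print(mean(valid_scores))
```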
Filling in missing data using logical values involves looking at each missing value and
filling it in based on what you know about that observation and the variable/column with
missing data. For example, common practice at an organization may be to conduct the
VB-MAPP sometime within 30 days of intake. Thus, if those data were missing and the
intake date was known, you could fill in the VB-MAPP assessment date with a best guess
of a calendar date within 30 days of the known intake date. Filling in missing data using
mathematical modeling is a more advanced topic. Succinctly, however, these techniques
involve looking for patterns in the available data, predicting the likely value of the
missing data, and inserting that predicted value into the dataframe.
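Both ideas can be sketched briefly. The first function fills a missing assessment date using the 30-day rule described above; the 15-day midpoint is our assumption, not a recommendation from the literature. The second function uses the mean of the observed scores as a stand-in for mathematical modeling; it is the simplest possible model, and the techniques cited above are considerably more sophisticated. Function names and data are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def impute_assessment_date(intake_date, days_after_intake=15):
    """Best-guess assessment date: a fixed offset inside the 30-day window after intake."""
    if not 0 <= days_after_intake <= 30:
        raise ValueError("offset must fall within the 30-day window")
    return intake_date + timedelta(days=days_after_intake)


def impute_with_mean(scores):
    """Replace None with the mean of observed scores; also return imputation flags."""
    observed = [score for score in scores if score is not None]
    fill_value = mean(observed)
    values = [fill_value if score is None else score for score in scores]
    was_imputed = [score is None for score in scores]
    return values, was_imputed


guessed_date = impute_assessment_date(date(2020, 3, 1))
values, was_imputed = impute_with_mean([40, None, 60])
print(guessed_date, values, was_imputed)
```

Returning the flags alongside the values lets an analyst store them as a separate column, so later analyses can be run with and without the imputed observations.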
Filling in missing data using logical values or mathematical modeling has benefits and
drawbacks. The benefit of these approaches is that they improve the usability of those
observations for analyses as they contain less missing information. The downside to this
approach is that it can reduce the overall accuracy of analyses because the data being
used for analyses are of unknown accuracy. For dataframes with a lot of missing data,
imputing missing values may alter the results of subsequent analyses.
Data Markup. A final consideration pertains to datasets stored using programs with data
markup options. For example, when storing data in Excel, the user can provide
information about the data by changing the font color or font style or by using
conditional highlighting. These methods of data markup can be valuable sources of
information when a user is visually looking at a dataframe in Excel or a similar software
program. However, when data are moved between database systems or storage methods,
the information contained within data markup is often lost (see footnote 5). For organizations interested
in conducting analytics across larger stores of data, tools like Excel are inefficient, and
data are often read into alternative analytic environments (e.g., R, Python, SPSS).
Thus, if there is important information about the data that you currently capture with
data markup, it is better to add this information as a new column so the information is
maintained regardless of who analyzes the data and the environment they use.
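As a small sketch of this recommendation, the snippet below stores the meaning of a highlight in its own column and confirms that it survives a round trip through plaintext CSV. The `score_flag` column and the "needs_review" label are hypothetical stand-ins for whatever a font color or highlight meant in the original spreadsheet.

```python
import csv
import io

# Information that was previously conveyed by cell color is stored
# as an explicit column instead (column name is an assumption).
rows = [
    {"patient_id": "1", "score": "54", "score_flag": ""},
    {"patient_id": "2", "score": "61", "score_flag": "needs_review"},
]

# Write to an in-memory plaintext CSV, as a file on disk would be written.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["patient_id", "score", "score_flag"])
writer.writeheader()
writer.writerows(rows)

# Read the plaintext back: the flag is still there, unlike cell formatting.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```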
Footnote 5: The easiest way to see what information is retained is to save your data as a plaintext file (e.g., .txt, .csv),
close the document, then reopen the plaintext file. Best practice is to save data as plaintext files only to
force data restrictions at the dataframe level and allow for reliable use of data across analytic platforms.
Section 3: Working with Related Datasets
When maintaining tidy datasets, practitioners will likely find that different datasets
define unique observations (rows) at different levels. For example, we likely need one
observation per patient for a dataset storing patient demographic information (e.g., date
of birth, cultural variables relevant to intervention, gender identity). But we would need
multiple observations per patient for a dataset containing quarterly change in the
number of programs mastered or annual Vineland scores for patients. If we wanted to
know whether annual change in assessment scores differed based on patient age,
primary language, or household size, the patient demographics and annual assessment
scores datasets would need to be combined.
As another example, the patient assessment dataset may contain a column about who
conducted the assessment or the patient’s program supervisor. If another dataset
contained information about each employee (e.g., education, training, years of
experience in ABA, number of cases with different patient profiles), we might want to
combine these datasets to ask questions about how an employee’s background, training,
and success with different patient profiles (i.e., the employee’s competence) might relate
to annual change in assessment scores. To do this, the employee competence dataset
would need to be combined with the annual assessment scores dataset.
Data Models. The relationship between different datasets is often stored graphically in
data models (Mosley et al., 2009). Figure 7 shows one example of a very simple data
model relative to patient assessment data. The purpose of a data model is to show what
information is contained in each dataset and which columns are used to relate one
dataset to another. Data models are helpful for at least two reasons. First, creating data
models helps an organization efficiently develop and execute data strategies to learn
more from their data than any single dataset can tell them. Second, data models are
critical for understanding where data lives and what can be accomplished with the data
for employees working as data analysts or database managers (see footnote 6).
Footnote 6: Many books exist on the topic of data modeling and how different database designs are more or less
useful depending on the ways that data are used for an organization. Curious readers may want to begin
with Silverston and Agnew (2009), Simsion and Witt (2004), or Umanath and Scamell (2014).
Primary and Foreign Keys. To combine datasets, each dataset must contain a primary
key and one or more foreign keys. A primary key is a column in a dataset where every
observation or row in that column is unique (Mosley et al., 2009). For example, column A
in the bottom panel of Figure 5 is the primary key. Each patient can exist in only one row,
and the patient identifier number is unique to each patient. If we want to combine this
dataset with the data in Figure 8, then Figure 8 must contain a column with the primary
key from the bottom panel in Figure 5. These columns in Figure 8 provide a link for
combining the data between the two tables and are thus referred to as foreign keys
(Mosley et al., 2009). When designing and storing datasets, practitioners should consider
what additional datasets might be combined with an assessment dataset so that they can
better understand variables related to patient outcomes. Once additional datasets are
identified, developing primary keys and embedding foreign keys across datasets
improves the efficiency with which larger scale analyses can be conducted.
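The relationship between a primary key and a foreign key can be sketched with Python's built-in `sqlite3` module. The table and column names below are hypothetical and the data are invented; the point is only that `patient_id` is unique in the demographics table (primary key) and repeated in the assessments table (foreign key), which is what allows the two to be combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys

# One row per patient: patient_id is the primary key.
conn.execute(
    "CREATE TABLE patients ("
    " patient_id INTEGER PRIMARY KEY,"
    " primary_language TEXT)"
)
# Many rows per patient: patient_id here is a foreign key.
conn.execute(
    "CREATE TABLE assessments ("
    " assessment_id INTEGER PRIMARY KEY,"
    " patient_id INTEGER NOT NULL REFERENCES patients(patient_id),"
    " assessment_year INTEGER,"
    " score REAL)"
)

conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1, "English"), (2, "Spanish")])
conn.executemany("INSERT INTO assessments VALUES (?, ?, ?, ?)",
                 [(1, 1, 2020, 70.0), (2, 1, 2021, 78.0), (3, 2, 2021, 65.0)])

# Combine the two datasets through the shared key.
joined = conn.execute(
    "SELECT a.patient_id, p.primary_language, a.assessment_year, a.score"
    " FROM assessments AS a"
    " JOIN patients AS p ON p.patient_id = a.patient_id"
    " ORDER BY a.assessment_id"
).fetchall()

# An assessment for a patient who does not exist is rejected.
try:
    conn.execute("INSERT INTO assessments VALUES (4, 99, 2021, 50.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Enforcing the foreign key at the storage level is one design option; the same join logic applies when the tables live in spreadsheets, but nothing then prevents an assessment row from pointing at a patient identifier that does not exist.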
FIGURE 7. Example data model showing how the data from different tables relate to one another. Data models are
helpful as they show you how you might combine data from multiple sources into one analytic dataframe.
FIGURE 8. Example dataframe with simple supervisor characteristics and the patients currently in their caseload.
Section 4: Examples of Basic Analytics with
Assessment Data
On some regular cadence, organizations are likely interested in analyzing patient
assessment scores. Perhaps they are interested in patients’ rates of improvement.
Maybe the organization is interested in the types of patients (e.g., age, reason for
referral) that they are better or worse at treating. Or maybe they are interested in
understanding the supervisors who are performing better or worse in treating different
patients. Answering all these basic questions will likely require tidy datasets and the
ability to combine data from multiple dataframes.
Rate of Patient Improvement. The top panel in Figure 9 shows how the structure of
the tidy dataframe allows us to easily analyze the rate of change in VBMAPP scores per
month for each patient in our dataframe (see footnote 7). The top-left panel shows the average overall
change in VBMAPP Milestones scores per month for each patient as a jitter plot where
each marker represents a single patient’s average change score. The top-right panel
shows the same data but as a box-and-whisker plot. For these box-and-whisker plots,
the ‘X’ corresponds to the average change score for all patients; the line across the
middle of the box represents the median (i.e., 50th percentile); the top and bottom
edges of the box correspond to the 75th and 25th percentiles, respectively; the lines
extending out from the box (aka the whiskers) show the maximum and minimum values
excluding outliers; and individual circle markers correspond to those outliers.
To plot the top panel in Figure 9, we needed a minimum of four data points per patient. Two are
overall milestone scores from at least two assessments; the other two are the calendar dates
when those assessments were conducted. The structure of the tidy dataframe makes it
easy because we can simply call the four columns needed to plot these data. However,
if the columns contained different data types or multiple data per cell, we would have
been unable to plot these data as easily and would have had to spend time cleaning
and organizing the data, creating inefficiencies.
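The change-per-month calculation described above needs only those four values per patient. The sketch below is one way to compute it in Python; the column names are hypothetical, the data are invented, and converting days to months with an average month length is an assumption rather than the method used for Figure 9.

```python
from datetime import date

# Assumption: one "month" is the average calendar month length in days.
DAYS_PER_MONTH = 365.25 / 12


def change_per_month(first_score, first_date, last_score, last_date):
    """Average change in score per month between two assessments."""
    elapsed_days = (last_date - first_date).days
    if elapsed_days <= 0:
        raise ValueError("last_date must be after first_date")
    return (last_score - first_score) / (elapsed_days / DAYS_PER_MONTH)


# Hypothetical wide-format rows: one row per patient, four columns used.
rows = [
    {"patient_id": 1, "score_1": 40.0, "date_1": date(2021, 1, 1),
     "score_2": 58.0, "date_2": date(2021, 7, 1)},
    {"patient_id": 2, "score_1": 25.0, "date_1": date(2021, 2, 1),
     "score_2": 25.0, "date_2": date(2021, 5, 1)},
]

rates = {
    row["patient_id"]: change_per_month(
        row["score_1"], row["date_1"], row["score_2"], row["date_2"]
    )
    for row in rows
}
```

Because every column holds a single value of a single type, the calculation needs no cleaning step before it runs.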
Footnote 7: These data correspond to the sheet titled “Demo-Px Improve – Wide” in the accompanying
Excel document.
FIGURE 9. Demonstration of basic assessment analyses using combined dataframes.
Focused vs. Comprehensive. The middle panel in Figure 9 shows the same change
scores per month as the top panel but with patients stratified based on the type of
intervention they receive – comprehensive or focused. Unlike the top panel, creating
these plots required joining data from two different dataframes. First, we needed the
calculations for obtaining the change score per month from a tidy assessment
dataframe. Second, we needed each patient’s authorized number of treatment hours
per week from a “Patient Information” dataframe to get labels of comprehensive or
focused intervention (see footnote 8). To join these dataframes, we used a function in Excel called
“VLOOKUP” that allows you to use keys as described above to find the value in one
sheet that is associated with a value in a second sheet. Similar simple functions for
joining multiple dataframes exist for many analytic software programs or user
interfaces for databases. The trick is making sure the data are in the right format
and are complete so that joining dataframes is possible.
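For readers working outside Excel, a lookup-style join can be sketched in plain Python with a dictionary keyed on the patient identifier. The column names and data are hypothetical; the 10-hour cutoff is the arbitrary demonstration threshold from this section, not a clinical definition of service type.

```python
# Arbitrary demonstration cutoff: 10 or fewer authorized hours = focused.
FOCUSED_MAX_HOURS = 10

# Hypothetical tidy tables that share the key "patient_id".
change_scores = [
    {"patient_id": 1, "change_per_month": 3.2},
    {"patient_id": 2, "change_per_month": 1.1},
    {"patient_id": 3, "change_per_month": 2.0},  # no patient-information row
]
patient_info = [
    {"patient_id": 1, "authorized_hours": 30},
    {"patient_id": 2, "authorized_hours": 10},
]

# Build the lookup once: key -> value, like the lookup range in a VLOOKUP.
hours_by_patient = {row["patient_id"]: row["authorized_hours"] for row in patient_info}


def service_type(hours):
    """Label a patient from authorized weekly hours; None if hours are unknown."""
    if hours is None:
        return None
    return "focused" if hours <= FOCUSED_MAX_HOURS else "comprehensive"


joined = []
for row in change_scores:
    hours = hours_by_patient.get(row["patient_id"])  # None when the key is absent
    joined.append({**row, "authorized_hours": hours,
                   "service_type": service_type(hours)})
```

Patient 3 shows why complete data matter: with no matching row in the second table, the join cannot assign a service type.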
Teasing Out the Details. The data in the middle panel of Figure 9 suggest that patients
receiving comprehensive services show more change in assessment scores per month
than patients receiving focused intervention. A natural follow-up question is whether
there is something specific to receiving comprehensive vs. focused services or if the
differences are the result of simply receiving more hours of ABA. The bottom panel in
Figure 9 aims to answer this question for these fake patients and hypothetical data.
Specifically, these scatterplots show the average change scores per month based on the
number of ABA therapy hours each patient receives per week (bottom left panel) and
based on the percentage of authorized hours used by each patient per week (bottom
right panel). No trends are noticeable in either plot, suggesting that for these fake
patients at this fake organization there is something specific to the different types of
services they provide that contributes to the average change in assessment scores per
month beyond just the raw number of hours of ABA being received. This insight might
be something that the Clinical Director can follow up on.
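Visual inspection of a scatterplot can be supplemented with a simple numeric check. The sketch below computes a Pearson correlation coefficient by hand; it is offered as one possible check, not the analysis behind Figure 9, and the data are invented. A coefficient near zero only rules out a straight-line trend, so the plot should still be inspected.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: weekly therapy hours and change score per month.
hours_per_week = [10, 20, 30, 40]
change_per_month = [2.0, 1.0, 1.0, 2.0]

r = pearson_r(hours_per_week, change_per_month)
```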
Footnote 8: We arbitrarily classified any patient authorized for 10 or fewer hours per week as receiving focused
intervention and any patient authorized for more than 10 hours per week as receiving comprehensive
services. This was done as a simple demonstration of the methods practitioners can use to join
dataframes for analysis, not to define service type solely based on this criterion.
Getting Fancy. At this point, readers are likely thinking of many ways they might slice
and dice their data to understand the variables that contribute to differing rates of
changes in assessment scores across all the patients in their organization. For
example, Figure 10 shows how we might analyze average change in assessment
scores based on the supervising practitioners’ years of experience in ABA or their
level of education. These analyses required joining data from a patient assessment
sheet and an employee information sheet. The takeaway is that analyzing the
variables that contribute to patient change in assessment scores across many different
datasets becomes much easier and more efficient when the data are stored in a tidy
format using a basic relational schema between the datasets. If this is done well,
practitioners need not spend their time entering and wrangling data into workable
structures and formats. Instead, behavior analysts could move to more advanced
analyses of change in assessment scores such as controlling for patient risk factors,
improvement as a function of patient cohorts, or improvement as a function of other
therapist characteristics.
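A grouped summary of this kind could be sketched as follows: patient change scores are linked to an employee table through a supervisor identifier and then averaged by education level. All names, identifiers, and values are hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee information, keyed on supervisor_id (primary key).
supervisors = {
    "S1": {"education": "Master's", "years_experience": 4},
    "S2": {"education": "Doctorate", "years_experience": 11},
}

# Hypothetical patient change scores; supervisor_id acts as the foreign key.
patients = [
    {"patient_id": 1, "supervisor_id": "S1", "change_per_month": 2.0},
    {"patient_id": 2, "supervisor_id": "S1", "change_per_month": 3.0},
    {"patient_id": 3, "supervisor_id": "S2", "change_per_month": 4.0},
]

# Join each patient to the supervisor's education level, then group.
changes_by_education = defaultdict(list)
for patient in patients:
    education = supervisors[patient["supervisor_id"]]["education"]
    changes_by_education[education].append(patient["change_per_month"])

mean_change_by_education = {
    level: mean(values) for level, values in changes_by_education.items()
}
```

The same pattern works for years of experience or any other employee column once the key linking the two datasets is in place.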
FIGURE 10. Demonstration of analyses based on employee education and experience.
SUMMARY
The field of ABA has a long history of collecting, analyzing, and using assessment data to
drive the delivery of evidence-based treatment for the individual patients it serves.
Increasingly, patients and other stakeholders (e.g., third-party payors) are asking ABA
providers to demonstrate more broadly that their treatments work, to quantify the cost
of treatment relative to improvement in patient behavioral health, and to allow for
patients and stakeholders to compare different ABA organizations against one another, because
patients have a right to choose the provider best suited to their needs. ABA providers
are also increasingly interested in objectively measuring the effectiveness of the services
they provide compared with other organizations. Systematically measuring patient
outcomes in a manner similar to that of other providers allows organizations to better
understand their strengths and weaknesses. In turn, they can take actionable steps to
improve their services and better treat their patients.
Efficiently analyzing and reporting on treatment outcomes requires that the assessment
data being collected meet at least two criteria: first, that the data are stored in a tidy
format; and second, that each dataframe containing potentially important information
includes the necessary data to join multiple dataframes with related information. When
assessment data are stored in ways that meet these two criteria, ABA organizations can
begin to leverage more advanced data analytics to self-assess their employees’ skills and
abilities thoroughly. Most importantly, these analyses will result in improved patient care
and more efficient use of clinical resources.
REFERENCES
Abidin, R. R. (2012). Parenting Stress Index, 4th Edition | PSI-4. Parinc.
https://www.parinc.com/Products/Pkey/333
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental
Disorders, 5th Edition: DSM-5. Washington, DC: Publisher.
Autism Speaks. (2020). CDC estimate on autism prevalence increases by nearly
10 percent, to 1 in 54 children in the U.S. Autism Speaks.
https://www.autismspeaks.org/press-release/cdc-estimate-autism-prevalence-increases-nearly-10-percent-1-54-children-us
Autism Speaks. (2021). Financial resources. https://www.autismspeaks.org/financial-resources
Autism Speaks. (2021). Autism statistics and facts.
https://www.autismspeaks.org/autism-statistics-asd
Behavior Analyst Certification Board. (2020). Ethics code for behavior analysts.
Littleton, CO: Author. Retrieved from
https://www.bacb.com/wp-content/uploads/2020/11/Ethics-Code-for-Behavior-Analysts-2102010.pdf
Behavior Analyst Certification Board. (2021). About behavior analysis.
https://www.bacb.com/about-behavior-analysis/
BHCOE Accreditation. (2021). BHCOE accreditation standards.
https://bhcoe.org/standards/
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python.
O’Reilly.
Brownell, R. (2010). Receptive and Expressive One-Word Picture Vocabulary Tests
(4th ed.). Available from
https://www.pearsonassessments.com/store/usassessments/en/Store/Professional-Assessments/Speech-%26-Language/Receptive-and-Expressive-One-Word-Picture-Vocabulary-Tests-%7C-Fourth-Edition/p/100000338.html
Bruni, T. P. (2014). Test review: Social Responsiveness Scale–Second Edition (SRS-2).
Journal of Psychoeducational Assessment, 32(4), 365–369.
https://doi.org/10.1177/0734282913517525
Buescher, A. V. S., Cidav, Z., Knapp, M., & Mandell, D. S. (2014). Costs of autism spectrum
disorders in the United Kingdom and the United States. JAMA Pediatrics, 168(8),
721–728. https://doi.org/10.1001/jamapediatrics.2014.210
Centers for Disease Control and Prevention (2019). Treatment and intervention services
for autism spectrum disorder. https://www.cdc.gov/ncbddd/autism/treatment.html
Centers for Disease Control and Prevention (2020). Autism and Developmental
Disabilities Monitoring (ADDM) Network.
https://www.cdc.gov/mmwr/volumes/69/ss/ss6904a1.htm?s_cid=ss6904a1_w
Chasson, G. S., Harris, G. E., & Neely, W. J. (2007). Cost comparison of early intensive
behavioral intervention and special education for children with autism. Journal of
Child and Family Studies, 16(3), 401–413.
https://doi.org/10.1007/s10826-006-9094-1
Chezan, L. C., Liu, J., Cholewicki, J. M., Drasgow, E., Ding, R., & Warman, A. (2021). A
psychometric evaluation of the Quality of Life for Children with Autism Spectrum
Disorder Scale. Journal of Autism and Developmental Disorders.
https://doi.org/10.1007/s10803-021-05048-y
Cohen, H., Amerine-Dickens, M., & Smith, T. (2006). Early intensive behavioral
treatment: Replication of the UCLA model in a community setting. Journal of
Developmental & Behavioral Pediatrics, 27(2), S145–S155.
https://doi.org/10.1097/00004703-200604002-00013
Cohen, I. L., & Sudhalter, V. (1999). PDD Behavior Inventory. Parinc. Available from
https://www.parinc.com/Products/Pkey/318
Cooper, J. O., Heron, T. E., & Heward, W. L. (2020). Applied behavior analysis (3rd ed.).
Pearson.
Conners, K. C. (2008). Conners (3rd ed.). Pearson. Available from

Assessments/Behavior/Comprehensive/Conners-3rd-Edition/p/100000523.html
Constantino, J. N. (2012). (SRS™-2) Social Responsiveness Scale (2nd ed.). Available
from https://www.wpspublish.com/srs-2-social-responsiveness-scale-second-edition
COSMIN. (n.d.). About the initiative. Retrieved June 16, 2021, from
https://www.cosmin.nl/about/
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. Wiley.
Dixon, M. R. (2019). PEAK comprehensive assessment: Administration manual. Shawnee
Scientific Press.
Dunn, D. (2019). Peabody Picture Vocabulary Test, Fifth Edition (PPVT-5). Pearson.
Available from
https://www.pearsonassessments.com/content/dam/school/global/clinical/us/assets/ppvt-5/ppvt-5-sample-score-summary-report.pdf
Eikeseth, S., Klintwall, L., Jahr, E., & Karlsson, P. (2012). Outcome for children with
autism receiving early and intensive behavioral intervention in mainstream
preschool and kindergarten settings. Research in Autism Spectrum Disorders, 6(2),
829–835. https://doi.org/10.1016/j.rasd.2011.09.002
Eldevik, S., Hastings, R.P., Hughes, J.C., Jahr, E., Eikeseth, S., & Cross, S. (2009). Meta-
analysis of early intensive Behavioral intervention for children with autism. Journal of
Clinical Child & Adolescent Psychology, 38, 439–450.
https://doi.org/10.1080/15374410902851739
Gilliam, J. E. (2016). A test review. Journal of Psychoeducational Assessment, 35(3),
342–346. https://doi.org/10.1177/0734282916635465
Gjevik, E., Eldevik, S., Fjæran-Granum, T., & Sponheim, E. (2010). Kiddie-SADS reveals
high rates of DSM-IV disorders in children and adolescents with autism spectrum
disorders. Journal of Autism and Developmental Disorders, 41(6), 761–769.
https://doi.org/10.1007/s10803-010-1095-7
Goldstein, S., & Naglieri, J. A. (2012). Autism Spectrum Rating Scales (ASRS) [Technical
Report #1]. https://www.acer.org/files/ASRS-Tech-Supp.pdf
Gresham, F., & Elliott, S. (2008). Social Skills Improvement System SSIS rating scales.
Assessments/Behavior/Social-Skills-Improvement-System-SSIS-Rating-
Scales/p/100000322.html
Grey, I., Coughlan, B., Lydon, H., Healy, O., & Thomas, J. (2017). Parental satisfaction
with early intensive behavioral intervention. Journal of Intellectual Disabilities, 23,
174462951774281. https://doi.org/10.1177/1744629517742813
Hendrickson, N. K., & McCrimmon, A. W. (2018). Test review: Behavior Rating
Inventory of Executive Function®, Second Edition (BRIEF®2) by Gioia, G. A.,
Isquith, P. K., Guy, S. C., & Kenworthy, L. Canadian Journal of School Psychology.
https://doi.org/10.1177/0829573518797762
Howard, J. S., Stanislaw, H., Green, G., Sparkman, C. R., & Cohen, H. G. (2014).
Comparison of behavior analytic and eclectic early interventions for young children
with autism after three years. Research in Developmental Disabilities, 35(12), 3326–3344.
https://doi.org/10.1016/j.ridd.2014.08.021
Hyman, S. L., Levy, S. E., & Myers, S. M. (2020). Identification, evaluation, and management of
children with autism spectrum disorder. American Academy of Pediatrics, 145(1), 1–63.
https://doi.org/10.1542/peds.2019-3447
Rogge, N., & Janssen, J. (2019). The economic costs of autism spectrum disorder: A
literature review. Journal of Autism and Developmental Disorders, 49, 2873–2900.
https://doi.org/10.1007/s10803-019-04014-z
Klintwall, L., Eldevik, S., & Eikeseth, S. (2015). Narrowing the gap: Effects of
intervention on developmental trajectories in autism. Autism, 19, 53–63.
https://doi.org/10.1177/1362361313510067
Lino, M., Kuczynski, K., Rodriguez, N., & Schap, T. (2017). Expenditures on children by
families, 2015 (No. 1528–2015). U.S. Department of Agriculture, Center for
Nutrition Policy and Promotion.
https://fns-prod.azureedge.net/sites/default/files/crc2015_March2017.pdf
Markowitz, L. A., Reyes, C., Embacher, R. A., Speer, L. L., Roizen, N., & Frazier, T. W.
(2016). Development and psychometric evaluation of a psychosocial quality of life
questionnaire for individuals with autism and related developmental disorders.
Autism : The International Journal of Research and Practice, 20(7), 832–844.
https://doi.org/10.1177/1362361315611382
Makrygianni, M. K., Gena, A., Katoudi, S., & Galanis, P. (2018). The effectiveness of
applied behavior analytic interventions for children with autism spectrum disorder:
A meta-analytic study. Research in Autism Spectrum Disorders, 18–31.
https://doi.org/10.1016/j.rasd.2018.03.006
Martens, B. K., Witt, J. C., Elliott, S. N., & Darveaux, D. X. (1985). Teacher judgments
concerning the acceptability of school-based interventions. Professional
Psychology: Research and Practice, 16(2), 191–198.
https://doi.org/10.1037/0735-7028.16.2.191
Mokkink, L. B., Prinsen, C. A. C., Bouter, L. M., Vet, H. C. W. de, & Terwee, C. B.
(2016). The Consensus-based Standards for the selection of health Measurement
INstruments (COSMIN) and how to select an outcome measurement instrument.
Brazilian Journal of Physical Therapy, 20(2), 105.
https://doi.org/10.1590/bjpt-rbf.2014.0143
Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L.,
Bouter, L. M., & Vet, H. C. W. de. (2010). The COSMIN checklist for assessing the
methodological quality of studies on measurement properties of health status
measurement instruments: an international Delphi study. Quality of Life Research,
19(4), 539. https://doi.org/10.1007/s11136-010-9606-8
Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., & Verbeke, G. (Eds.)
(2020). Handbook of missing data methodology. CRC Press.
Mosley, M., Brackett, M., & Earley, S. (Eds.) (2009). The DAMA guide to the data
management body of knowledge enterprise server version. Technics Publications.
Partington, J. W. (2010). The ABLLS-R: The Assessment of Basic Language and
Learning Skills–Revised. Behavior Analysts, Inc.
Partington Behavior Analysts. (2012). AFLS.
https://partingtonbehavioranalysts.com/collections/afls
Peters-Scheffer, N., Didden, R., Korzilius, H., & Sturmey, P. (2011). A meta-analytic
study on the effectiveness of comprehensive ABA-based early intervention
programs for children with autism spectrum disorders. Research in Autism Spectrum
Disorders, 5, 60–69. https://doi.org/10.1016/j.rasd.2010.03.011
Phelps-Terasaki, D., & Phelps-Gunn, T. (2007). (TOPL-2) Test of Pragmatic Language
(2nd ed.). Available from
https://www.wpspublish.com/topl-2-test-of-pragmatic-language-second-edition
Reichow, B., Hume, K., Barton, E. E. & Boyd, B. A. (2018). Early intensive behavioral
intervention (EIBI) for young children with autism spectrum disorders (ASD).
Cochrane Database of Systematic Reviews.
https://doi.org/10.1002/14651858.CD009260.pub3
Reynolds, C. R., & Livingston, R. B. (2012). Mastering modern psychological testing:
Theory and methods. Pearson Education.
Reynolds, C. R., & Kamphaus, R. W. (2015). Behavior Assessment System for Children
(3rd ed.). Pearson.
Assessments/Behavior/Comprehensive/Behavior-Assessment-System-for-Children-
%7C-Third-Edition-/p/100001402.html
Ridout, S., & Eldevik, S. (2021). Measures used to assess treatment outcomes in
children with autism receiving early and intensive behavioral interventions: A
Review. Manuscript submitted for publication.
Rodgers, M., Simmonds, M., Marshall, D., Hodgson, R., Stewart, L. A., Rai, D., Wright,
K., Ben-Itzchak, E., Eikeseth, S., Eldevik, S., Kovshoff, H., Magiati, I., Osborne, L. A.,
Reed, P., Vivanti, G., Zachor, D., & Couteur, A. L. (2021). Intensive behavioural
interventions based on applied behaviour analysis for young children with autism:
An international collaborative individual participant data meta-analysis. Autism,
25(4), 1137–1153. https://doi.org/10.1177/1362361320985680
Romanczyk, R. G., & Gillis, J. M. (2008). Practice guidelines for autism education and
intervention: historical perspective and recent developments. In J. Luiselli, D. C.
Russo, & W. P. Christian (Eds.), Effective practices for children with autism:
educational and behavior support interventions that work. Oxford University Press.
Sallows, G. O., & Graupner, T. D. (2005). Intensive behavioral treatment for children
with autism: Four-year outcome and predictors. American Journal on Mental
Retardation, 110(6), 417–438.
Sattler, J. M. (2014). Foundations of behavioral, social and clinical assessment of
children. Jerome M. Sattler.
Waters, C. F., Amerine Dickens, M., Thurston, S. W., Lu, X., & Smith, T.
(2018). Sustainability of early intensive behavioral intervention for children with
autism spectrum disorder in a community setting. Behavior Modification, 00(0), 1–24.
https://doi.org/10.1177/0145445518786463
Williamson, E., Sathe, N. A., Andrews, J. C., Krishnaswami, S., McPheeters, M. L.,
Fonnesbeck, C., Sanders, K., Weitlauf, A., & Warren, Z. (2017). Medical therapies for
children with autism spectrum disorder: An update. Agency for Healthcare
Research and Quality (U.S.).
Schopler, E., Bourgondien, M. E. V., Wellman, G. J., & Love, S. R. (2010). (CARS™-2)
Childhood Autism Rating Scale™ (2nd ed.).
https://www.wpspublish.com/cars-2-childhood-autism-rating-scale-second-edition
Silverston, L., & Agnew, P. (2009). The Data Model Resource Book. Wiley.
Simsion, G. C., & Witt, G. C. (2004). Data Modeling Essentials (3rd ed.). Morgan
Kaufmann.
Shimabukuro, T. T., Grosse, S. D., & Rice, C. (2007). Medical expenditures for children
with an autism spectrum disorder in a privately insured population. Journal of
Autism and Developmental Disorders, 38(3), 546–552.
https://doi.org/10.1007/s10803-007-0424-y
Smith, T. (2013). What is evidence-based behavior analysis? The Behavior Analyst, 36(1),
7–33. https://doi.org/10.1007/BF03392290
Smith, T., & Antolovich, M. (2000). Parental perceptions of supplemental interventions
received by young children with autism in intensive behavior analytic
treatment. Behavioral Interventions. 15, 83–97. https://doi.org/10.1002/(sici)1099-
078x(200004/06)15:2<83::aid-bin47>3.0.co;2-w
Smith, D. P., Hayward, D. W., Gale, C. M., Eikeseth, S., & Klintwall, L. (2019). Treatment
gains from early and intensive behavioral intervention (EIBI) are maintained 10 years
later. Behavior Modification, 1–21. https://doi.org/10.1177/0145445519882895
Sparrow, S. S., Cicchetti, D. V., & Saulnier, C. A. (2016). Vineland Adaptive Behavior
Scales (3rd ed.).
Assessments/Behavior/Adaptive/Vineland-Adaptive-Behavior-Scales-%7C-Third-
Edition/p/100001622.html
Sundberg, M. (2008). VB-MAPP. https://marksundberg.com/vb-mapp/
Umanath, N. S., & Scamell, R. W. (2014). Data modeling and database design (2nd ed.).
Cengage Learning.
USDA. (2020). The cost of raising a child.
https://www.usda.gov/media/blog/2017/01/13/cost-raising-child
van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.
Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language
processing: A comprehensive guide to building real-world NLP systems. O’Reilly.
Virués-Ortega, J. (2010). Applied behavior analytic intervention for autism in early
childhood: Meta-analysis, meta-regression and dose-response meta-analysis of
multiple outcomes. Clinical Psychology Review, 30, 387–399.
https://doi.org/10.1016/j.cpr.2010.01.008
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23.
TABLE 1: Tests Using Norm Referenced Interpretation of Scores Measuring Severity of Autism
Autism Spectrum Rating Scales (ASRS), Goldstein and Naglieri (2009)
ASRS™ is a multi-informant norm-referenced measure using a 5-point Likert scale that can be used to identify severity of symptoms and behaviors associated with
ASDs completed by caregivers and teachers.
Age Range: 12–18 years
Total Items: Full form: 70 items (age 2-5) and 71 items (age 6-18). Short form: 15 items (age 2-18)
Administration Time & Scoring: Full form: 20 minutes. Short form: 5 minutes
Cost: $405.00 (hand-scored); $613.00 (software); $25.00 (package of 25 forms)
Qualification of the Assessor: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, or counseling, or in a field
closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General Strengths:
- Provides standardized scores in the areas of communication, social interactions, unusual behaviors, self-regulation (full form for ages 6–18), peer
socialization, adult socialization, atypical language, and stereotypy
- Items in the rating scales are based on DSM-V diagnostic criteria for autism
- Allows comparisons of performance within age groups
- Includes scoring software
- Includes Spanish language forms
General Weaknesses:
- The obtained scores are based on caregiver reports rather than direct observation of the client
Childhood Autism Rating Scale (CARS™-2), Schopler et al. (2010)
CARS™-2 is a brief rating scale that helps identify autism in children.
Age Range: 2–57 years
Total Items: 15 items scored on a 4-point Likert scale. Three forms: Questionnaire for parents/caregivers, High functioning (6 to 57), Standard Version (2 to 36)
Administration Time & Scoring: 5-10 minutes
Cost: $237.00 (Starter and complete kit, print and digital); $41.00 (Test forms and reports); $41.00 (All products: tests and materials for CARS2)
Qualification of the Assessor: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, counseling, or in a field
closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General Strengths:
- Includes enhancements that make the test more responsive to individuals on the high functioning end of the autism spectrum – those with average or higher
IQ scores, better verbal skills, and more subtle social and behavioral deficits
- Items are based on DSM-IV diagnostic criteria for autism
- Gives quantifiable ratings based on direct behavior observation
- Has a Spanish language version available
General Weaknesses:
- None identified in the literature
Gilliam Autism Rating Scale, Third Edition (GARS-3), Gilliam (2013)
GARS-3 is a multi-informant norm-referenced measure that can be used to identify severity of symptoms and behaviors associated with ASDs completed by
caregivers, teachers, and clinicians.
Age Range: 3–22 years
Total Items: 56 items
Administration Time & Scoring: 5-10 minutes
Cost: $172.00 (GARS-3 Kit); $62.00 (GARS-3 Summary/Response Form, Pack of 50); $31.00 (GARS-3 Spanish Summary/Response Form, Pack of 15)
Qualification of the Assessor: A master’s degree in psychology, school counseling, occupational therapy, speech–language pathology, social work, education,
special education, or related field
General Strengths:
- Provides standardized scores in the areas of restrictive and repetitive behaviors, social interaction, social communication, emotional responses, and
cognitive style
- Items in the rating scales are based on DSM-V diagnostic criteria for autism
- Includes Spanish language forms
General Weaknesses:
- The obtained scores are based on caregiver reports rather than direct observation of the client
PDD Behavior Inventory (PDDBI), Cohen and Sudhalter (1999)
PDDBI™ is a rating scale filled out by caregivers or teachers designed to assess children having a pervasive developmental disorder.
Age Range: 5 months–18 years
Total Items: Standard form: 124 items. Extended forms: 180-188 items. Scored on a 3-point Likert scale
Administration Time & Scoring: Standard form: 20-30 minutes. Extended form: 30-45 minutes
Cost: $464.00 (PDDBI Teacher and Parent Comprehensive Kit); $111.00 (PDDBI Teacher Rating Form, Pack of 25); $33.00 (PDDBI Parent Score Summary Sheets, Pack of 25);
$33.00 (PDDBI Teacher Score Summary Sheets)
Qualification of the Assessor: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, counseling, or in a field
closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General Strengths:
- Provides age-standardized scores in two broad categories: approach/withdrawal problems and receptive/expressive social communication
- Subscales are based on DSM-IV diagnostic criteria for autism
- Gives consistent measurements of progress over time when compared against treatment plan goal progress
- Includes online scoring software
- Has a Spanish language version available
General Weaknesses:
- The obtained scores are based on caregiver and teacher reports rather than direct observation of the client
Social Responsiveness Scale, Second Edition (SRS-2), Constantino (2012)
SRS™-2 identifies social impairment associated with ASD and quantifies its severity. It is completed by multiple raters who have at least 1 month of experience with the rated individual.

Age range: 3 to 89 years
Number of items: Four forms with 65 items scored on a 4-point Likert scale
Administration time and scoring: 15-20 minutes
Cost:
• $334.00 (SRS-2 Introductory Kit)
• $224.00 (Child/Adolescent Introductory Kit)
• $224.00 (Adult Introductory Kit)
Assessor qualifications: A master's degree in psychology, school counseling, occupational therapy, speech-language pathology, social work, education, special education, or related field
General strengths:
• Provides standardized scores in the areas of restrictive and repetitive behaviors, social interaction, social communication, emotional responses, and cognitive style
• Two subscales are based on DSM-5 diagnostic criteria for autism
• Allows the assessment of social impairment in natural settings
• Can be used to monitor progress over time and response to intervention
• Includes scoring software
General weaknesses:
• SRS-2 provides preschool and adult forms, but the majority of independent research has been limited to the school-age form. Additional research is needed to provide information on the preschool and adult forms
• Test questions at an 8th grade reading level could be too difficult for some parents and for some ASD-affected individuals (self-report form) with language deficits
TABLE 2: Tests Using Norm-Referenced Interpretation of Scores Measuring Communication Skills (Reliability Coefficient of 0.8 or Above)
Receptive and Expressive One-Word Picture Vocabulary Tests, Fourth Edition (ROWPVT-4, EOWPVT-4), edited by Brownell (2010), Spanish-Bilingual (2012)
EOWPVT-4 and ROWPVT-4 are individually administered, co-normed tests that measure receptive and expressive vocabulary skills.

Age range: 2 to 70 years
Number of items: 190 items
Administration time and scoring: Manual scoring: 15-25 minutes
Cost:
• $195.00 (ROWPVT-4 complete kit)
• $195.00 (EOWPVT-4 complete kit)
Assessor qualifications: [...] therapy, social work, or counseling, or in a field closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of receptive and expressive language that can be used to determine severity of communication deficits
• Allows comparisons of performance within age groups
• Allows comparison between receptive and expressive vocabulary
• Has a Spanish-language version
General weaknesses:
• Does not break down expressive language skills into specific verbal operants
• Provides one overall standard score for receptive and one for expressive language
• Online scoring and report generation not available
Expressive Vocabulary Test, Third Edition (EVT-3), Williams (2019)
EVT-3 is a norm-referenced and individually administered test of expressive vocabulary that measures use of nouns, verbs, adjectives, and adverbs.

Age range: 2 yrs 6 mo to 90 yrs
Number of items: 190 items
Administration time and scoring: Online scoring: 10-15 minutes
Cost: $405.00 (Complete Kit, Form A and B)
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of receptive and expressive language that can be used to determine severity of communication deficits
• Allows comparisons of performance within age groups
• Allows comparison between receptive and expressive vocabulary
• Includes Growth Scale Values (GSVs), which allow progress to be measured with the use of two parallel forms
• Availability of online administration, scoring and report generation at additional cost
General weaknesses:
• Does not break down expressive language skills into specific verbal operants
• EVT-3 [...]
Peabody Picture Vocabulary Test, Fifth Edition (PPVT-5), Dunn (2018)
PPVT™-5 is a norm-referenced and individually administered measure of receptive vocabulary based on words in Standard American English. It assesses use of nouns, verbs, adjectives, and adverbs of a speaker (expressive language).

Age range: 2 yrs 6 mo to 90 yrs
Number of items: 240 items
Administration time and scoring: 10-15 minutes
Cost: $405.00 (Complete kit, Form A and B)
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of receptive and expressive language that can be used to determine severity of communication deficits
• Allows comparisons of performance within age groups
• Allows comparison between receptive and expressive vocabulary
• Growth Scale Values (GSVs): an objective score for measuring changes in performance over time
• Availability of online administration, scoring and report generation at additional cost
General weaknesses:
• Does not break down expressive language skills into specific verbal operants
• PPVT-5 Complete Kit (Form A and B)
Test of Pragmatic Language, Second Edition (TOPL-2), Phelps-Terasaki and Phelps-Gunn (2007)
Uses norm-referenced interpretation of scores to evaluate pragmatic language skills that involve social communication in context, selecting appropriate content, expressing feelings, manding, and handling other aspects of pragmatic language.

Age range: 6 yrs to 18 yrs 11 mo
Number of items: 43 items
Administration time and scoring: 45-50 minutes
Cost: $270.00 (TOPL-2 Kit)
Assessor qualifications: A master's degree in psychology, school counseling, occupational therapy, speech-language pathology, social work, education, special education, or related field
General strengths:
• Assesses six core subcomponents of pragmatic language: physical setting, audience, topic, purpose (speech acts), visual-gestural cues, and abstraction
• Can be used with children and adolescents who have average scores on tests of expressive and receptive vocabulary to identify needs related to use of pragmatic language
General weaknesses:
• Long administration time
• English-only test
• Online scoring and report generation not available
Vineland Adaptive Behavior Scales, Third Edition (Vineland-3), Sparrow, Cicchetti and Saulnier (2016)
Vineland-3 uses norm-referenced interpretation to measure communication, daily living skills, socialization, and motor skills.

Age range: Birth to 90 years
Number of items:
• Survey Interview Form: 195 items (Domain), 502 items (comprehensive)
• Parent/Caregiver Rating Form: 180 items (Domain), 502 items (comprehensive)
• Teacher Rating Form: 149 items (Domain), 333 items (comprehensive)
Administration time and scoring:
• 40-50 minutes for ages 3-9 when motor skills and maladaptive behavior domains are included
• 8-10 minutes without the motor skills or maladaptive behavior domains included
Cost:
• $215.50 (Starter and complete kits, print & digital)
• $3.20 (Test forms & reports)
• $159.00 (Support materials)
Assessor qualifications: A master's degree in psychology, education, school counseling, occupational therapy, speech-language pathology, social work, special education, or related field
General strengths:
• Provides standard scores
• Allows comparisons of performance within age groups using normal curve
• Allows comparison between communication, daily living and social skills
• Has Spanish forms for parent/caregiver forms
• Has online scoring and report options
General weaknesses:
• Based on indirect assessment using interviews with caregivers and teacher or rating scales completed by the caregivers and teacher
• The communication domain uses receptive, expressive, and written interpretation
TABLE 3: Tests Using Criterion-Referenced Interpretation of Scores Measuring Communication Skills
Assessment of Basic Language and Learning Skills, Revised (ABLLS-R), Partington (2010)
The ABLLS-R® is an assessment tool that helps identify deficiencies in language, academic, self-help, and motor skills and supports progress monitoring using criterion-referenced interpretation of scores.

Age range: 0 to 12 years
Number of items: 25 functional areas
Administration time and scoring: 3-10 hours depending on age and functioning level of child
Cost:
• $69.95 (ABLLS-R Kit, includes guide and 1 protocol)
• $39.95 (Each protocol)
Assessor qualifications: [...] therapy, social work, counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical [...]
General strengths:
• Based on Skinner's analysis of verbal behavior and provides information on basic verbal operant and listener skills
• Measures other skills such as imitation, matching, and basic academic skills
• Provides the practitioner with options for selecting goals for intervention
• Can be used to track progress over time
• Can be administered in any language
General weaknesses:
• Does not provide standardized score
• Does not include standardized assessment procedures
• Age level comparisons cannot be made
• Limited published studies that have assessed the psychometric properties of assessment protocols or the efficacy of treatments based on them
PEAK Comprehensive Assessment (PCA), Dixon (2019)
PCA is designed as an assessment instrument and treatment protocol for addressing language and cognitive deficits in children with autism.

Age range: 2+ years
Number of items: 344 items
Administration time and scoring: 30 minutes to 2 hours depending on skill ability level
Cost: $495.00 (kit with 3 full color stimulus books with over 500 pages of images, assessment administration manual, challenging behavior index, assessment fidelity checklists, client record forms, and accompanying case to store/transport all materials safely)
Assessor qualifications:
• PEAK level 2 certification
• Board certified/Licensed Behavior Analyst
• Special education teacher or administrator
• School Psychologist
• Clinical Psychologist
• 1 year experience with PEAK
General strengths:
• Includes standardized assessment directions
• Addresses foundational learning skills and foundational speaker and listener skills
• Generalization module is designed to build a generalized repertoire
• Transformation module is designed not only to teach abstract concepts but also perspective taking
General weaknesses:
• Does not provide standardized score and as a result age level comparisons cannot be made
• Limited published studies that have assessed the psychometric properties of assessment protocols
Verbal Behavior Milestones Assessment and Placement Program (VB-MAPP), Sundberg (2008)
VB-MAPP is an assessment tool, curriculum guide, and skill-tracking system that uses criterion-referenced interpretation of scores.

Age range: 0 to 4 years
Number of items: 170 measurable learning and language milestones
Administration time and scoring: 30 minutes to 2 hours depending on the skill and compliance level of the individual being evaluated
Cost:
• $69.95 (VB-MAPP Kit, includes guide and 1 protocol)
• $25.95 (Each protocol)
Assessor qualifications: [...] speech-language pathology, occupational therapy, social work, counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of [...]
General strengths:
• Based on Skinner's analysis of verbal behavior and provides information on basic verbal operant and listener skills
• Measures other skills such as imitation, matching, and basic academic skills
• Provides the practitioner with options for selecting goals for intervention
• Can be used to track progress over time
• Can be administered in any language
General weaknesses:
• Does not provide standardized score
• Does not include standardized assessment procedures
• Age level comparisons cannot be made
• Limited published studies that have assessed the psychometric properties of assessment protocols or the efficacy of treatments based on them
TABLE 4: Tests Measuring Daily Living and Social Skills
Assessment of Functional Living Skills (AFLS), Partington and Mueller (2012)
The AFLS uses criterion-referenced interpretation of scores. It provides a systematic way to evaluate, track, and teach functional, adaptive, and self-help skills so that individuals with autism or developmental delays can become more independent.

Age range: 2 years to adulthood
Number of items: 1900 skills in 66 functional areas
Administration time and scoring: 30 minutes to 2 hours depending on the skill and compliance level of the individual being evaluated
Cost:
• $144.95 (AFLS Bundle) (Note: School Skills Assessment Protocol, Vocational Skills Assessment Protocol, and Independent Living Skills Assessment Protocol are not included in the Bundle)
• $179.95 (AFLS Starter Set)
• $249.95 (AFLS Assessments)
• $39.95 (Each protocol)
Assessor qualifications: A master's degree (MA, MS, MSW, CAGS) in psychology, school counseling, occupational therapy, speech-language pathology, social work, education, special education, or related field
General strengths:
• Skill ratings completed by parents, educators, or therapists, using information from observation, interview, and task performance, plus a progress-tracking grid and teaching guidance based on task analyses
• AFLS online allows the assessment to be completed and a report generated online
General weaknesses:
• Does not provide standardized score
• Age level comparisons cannot be made
Social Skills Improvement System (SSIS) Rating Scales, Gresham and Elliott (2008)
Offers a targeted and comprehensive assessment of an individual's social skills (conversations, cooperation, assertion, responsibility, empathy, engagement, and self-control), problem behaviors, and academic competence.

Age range: 3 to 18 years
Number of items: 140 items
Administration time and scoring: 10-25 minutes
Cost:
• $139.00 (SSIS Rating Scales)
• $57.75 (Handscoring package of 25)
• $71.75 (Computer scoring package of 25)
Assessor qualifications: A master's degree in psychology, school counseling, occupational therapy, speech-language pathology, social work, education, special education, or related field
General strengths:
• Provides standard scores
• Allows comparisons of performance within age groups using normal curve
• Rating scales for parents, clients, and teachers
• English and Spanish forms
• Computer scoring is available
General weaknesses:
• The SSIS uses caregiver, teacher, and client self-reports, which could result in over- or underestimation of the client's actual social skills
Vineland Adaptive Behavior Scales, Third Edition (Vineland-3), Sparrow, Cicchetti and Saulnier (2016)
Vineland-3 uses norm-referenced interpretation to measure communication, daily living skills, socialization, and motor skills.

Age range: Birth to 90 years
Number of items:
• Survey Interview Form: 195 items (Domain), 502 items (comprehensive)
• Parent/Caregiver Rating Form: 180 items (Domain), 502 items (comprehensive)
• Teacher Rating Form: 149 items (Domain), 333 items (comprehensive)
Administration time and scoring:
• 40-50 minutes for ages 3-9 when motor skills and maladaptive behavior domains are included
• 8-10 minutes without the motor skills or maladaptive behavior domains included
Cost:
• $215.50 (Starter and complete kits, print & digital)
• $3.20 (Test forms & reports)
• $159.00 (Support materials)
Assessor qualifications: A master's degree in psychology, education, school counseling, occupational therapy, speech-language pathology, social work, special education, or related field
General strengths:
• Provides standard scores
• Allows comparisons of performance within age groups using normal curve
• Allows comparison between communication, daily living and social skills
• Has Spanish forms for parent/caregiver forms
• Has online scoring and report options
General weaknesses:
• Based on indirect assessment using interviews with caregivers and teacher or rating scales completed by the caregivers and teacher
• The communication domain uses receptive, expressive, and written interpretation
TABLE 5: Tests Using Norm-Referenced Interpretation of Scores Measuring Severity of Problem Behaviors and Executive Functioning Skills
Aberrant Behavior Checklist, Second Edition (ABC), Aman and Singh (1994)
ABC is a symptom checklist for assessing problem behaviors of children and adults with developmental disabilities.

Age range: 5 years to adult
Number of items: 58 items that resolve onto five subscales
Administration time and scoring: 10-15 minutes
Cost:
• $179.75 (ABC-2-1 complete kit)
• $145.00 (ABC-2-1C complete kit)
• $145.00 (ABC-2-1R complete kit)
• $115.00 (ABC-2-2 Combo Manual)
• $48.75 (ABC-2-3 Community Forms)
• $48.75 (ABC-2-4 Residential Forms)
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Empirically developed by factor analysis on data from 1,000 residents
• Translated into 35 foreign languages and dialects
• ABC subscales have high internal consistency, good reliability, and established validity
General weaknesses:
• ABC-C may over- or underestimate behavior problems in younger children; more extensive investigation is needed on the utility of the ABC-C for children under 5
Behavior Assessment System for Children, Third Edition (BASC-3), Reynolds and Kamphaus (2015)
BASC-3 is a multi-informant, norm-referenced measure that can be used to measure severity of problem behaviors in the community, school, and home settings.

Age range: 2 to 22 years
Number of items:
• BASC-3 Parent Rating Scales - Preschool: 139 items
• BASC-3 Parent Rating Scales - Child/adolescent: 173 items
• BASC-3 Teacher Rating Scales - Preschool: 105 items
• BASC-3 Teacher Rating Scales - Child: 156 items
• BASC-3 Teacher Rating Scales - Adolescent: 165 items
Administration time and scoring: 10-20 minutes
Cost:
• $671.00 (BASC-3 Hand-Scored Starter Set, English)
• $996.00 (BASC-3 Hand-Scored Starter Set, English/Spanish)
• $605.30 (BASC-3 starter kit with 1-year Q-global online scoring subscription)
• $882.00 (BASC-3 starter kit with 1-year Q-global online scoring subscription, Spanish and English)
• $45.00 (Record forms, package of 25 for each age group)
Assessor qualifications: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, or counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General strengths:
• Provides standardized scores in the areas of hyperactivity, attention problems, aggression, conduct problems, anxiety, depression, atypical behaviors, adaptability, social interactions, functional communication, and adaptive skills
• Includes Q-global™ web-based administration, scoring and reporting
• Includes Spanish language forms
General weaknesses:
• The obtained scores are based on caregiver reports, not on direct observation of the client
• Limited items measuring adaptive and functional communication skills
Behavior Rating Inventory of Executive Function, Second Edition (BRIEF-2), Gioia, Isquith, Guy and Kenworthy (2015)
BRIEF-2 is a multi-informant, norm-referenced rating scale that measures executive functioning skills in home and school environments.

Age range: 5 to 18 years
Number of items: Parent, Teacher, and Self-Report forms: 12 items
Administration time and scoring: 5-10 minutes
Cost:
• $457.00 (BRIEF-2 Parent/Teacher/Self-Report Hand-Scored Kit)
• $350.00 (BRIEF-2 Parent/Teacher Hand-Scored Kit)
• $278.00 (BRIEF-2 Screening Parent/Teacher/Self-Report Hand-Scored Kit)
• $83.00 (BRIEF-2 forms, package of 25)
Assessor qualifications: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, or counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General strengths:
• Provides standardized scores in the areas of impulsivity, self-monitoring, emotional control, task completion, planning and organization, working memory and organization of materials
• Includes Spanish language forms
General weaknesses:
• The obtained scores are based on caregiver reports, not on direct observation of the client
Conners, Third Edition (Conners-3), Conners (2012)
Conners-3 is a multi-informant, norm-referenced measure that can be used to measure deficits in executive functioning, attention, and levels of hyperactivity/impulsivity.

Age range: 6 to 18 years
Number of items:
• Parent form and self-report form: 99 items
• Short form, parent: 45 items
• Short form, teacher and self-report form: 41 items
Administration time and scoring: 10-20 minutes
Cost:
• $115.00 (Conners-3 Manual)
• $495.00 (Conners-3 Hand-Scored Kit With DSM-5 Update)
• $911.00 (Conners-3 Software Kit With DSM-5 Update)
• $72.00 (Conners-3 Short forms, Pack of 25)
• $72.00 (Conners-3 long forms, Pack of 25)
• $371.00 (Conners 3 Unlimited-Use Scoring Software Installation)
Assessor qualifications: [...] speech language pathology, occupational [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of hyperactivity, impulsivity, attention problems, executive functioning, learning problems, aggression and peer and family relations
• Includes scoring and reporting software
• Includes Spanish language forms
General weaknesses:
• The obtained scores are based on caregiver reports, not on direct observation of the client
TABLE 6: Measuring Social Significance
Child and Family Quality of Life Scale, Second Edition (CFQL-2), Frazier et al. (2020)
CFQL-2 evaluates clinically relevant aspects of psychosocial quality of life in individuals at risk for or with an existing developmental disorder diagnosis and is completed by caregivers.

Age range: 9 months to 19 years
Number of items: 32 items scored on a 5-point Likert scale
Administration time and scoring: 5-10 minutes
Cost: Free
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• The measure was designed to measure change in quality of life and provides information that can be useful for treatment planning
General weaknesses:
• The scores are based on self-report
• Validation of the CFQL-2 was based on a single clinical site, although it included a large and clinically representative sample
Family Empowerment Scale (FES), Koren et al. (1992)
FES is designed to measure empowerment in families with children who have emotional, behavioral, or mental disorders.

Age range: Birth to 20 years
Number of items: 34 items scored on a 5-point Likert scale
Cost: Free
Assessor qualifications: A master's degree in psychology, education, speech-language pathology, occupational therapy, social work, or counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments
General strengths:
• Robust psychometric properties
• Captures change over time
General weaknesses:
• More data is needed on the cultural appropriateness of the items
• Remains to be investigated whether FES is a sensitive measure of empowerment in families of diverse background
Intervention Rating Profile (IRP-15), Witt and Elliott (1985)
IRP-15 is a single-factor scale that has been demonstrated to assess treatment acceptability of various interventions. It can be completed by teachers, parents/caregivers, and interventionists.

Age range: 3 to 21 years
Number of items:
• IRP-15: 15 items scored on a 6-point Likert scale
• Children's Intervention Rating Profile (CIRP): 7 items scored on a 7-point Likert scale; children 8 to 18 can use it
Administration time and scoring: 5-7 minutes
Cost: Free
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of classroom intervention general acceptability, risk to the target student, amount of teaching time required for implementation, negative effects on nontarget students, and amount of teaching skill required for implementation
General weaknesses:
• The obtained scores are based on self-report rather than direct observations
Pediatric Quality of Life Inventory (PedsQL), Varni et al. (2001)
PedsQL is a generic health status instrument with parent and child forms that assesses four domains of health (physical functioning, emotional functioning, social functioning, and school functioning) in children and adolescents.

Age range: 2 to 18 years
Number of items: 5-point Likert scale
Administration time and scoring: 7-10 minutes
Cost: Free
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• It has good validity, reliability, and internal consistency
• Translated into multiple languages
• Responsive to clinical change over time
General weaknesses:
• Reliance on the caregiver version of the scale may show imperfect agreement between children and parents, as well as parents and professionals
Parenting Stress Index (PSI-4), Abidin (2012)
PSI-4 is a screening and triage measure for evaluating the parenting system and identifying issues that may lead to problems in the child's or parent's behavior.

Age range: Birth to 12 years
Number of items: 120 items
Administration time and scoring: 10-20 minutes
Cost: $314.00 (PSI-4 Introductory kit with professional manual, 10 reusable item booklets, 25 answer sheets, and 25 profile forms)
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of child characteristics, parent characteristics, and situational/demographic life stress
• Revised to improve cultural sensitivity of language and to include fathers in the standardization sample
• Includes scoring and reporting software
• Includes Spanish-language version
General weaknesses:
• The obtained scores are based on self-report rather than direct observations
Parental Satisfaction Scale-EIBI (PSS-EIBI), Gray et al. (2019)
PSS-EIBI assesses parental satisfaction following an early intervention program.

Age range: 2-? years
Number of items: 35 items scored on a 10-point Likert scale
Administration time and scoring: 15-20 minutes
Cost: Free
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Provides standardized scores in the areas of child outcome, effects on the family, intervention characteristics, and relationships with team members. Thus far, the instrument has been used only with young children.
General weaknesses:
• The obtained scores are based on self-report rather than direct observations
Quality of Life for Children with Autism Spectrum Disorder (QOLASD-C), Chezan et al. (2021)
QOLASD-C assesses quality of life (QOL) as a treatment outcome for children with ASD. Parents rate their child's satisfaction level across three domains: interpersonal relationships, self-determination, and emotional well-being.

Age range: 5 to 10 years
Number of items: 4-point Likert scale
Administration time and scoring: 10 minutes
Cost: Free
Assessor qualifications: [...] closely related to the intended use of the [...]
General strengths:
• Consists of simple structure with three domains
• Short length of the scale
• Decent psychometric properties
General weaknesses:
• Small sample size included in the analysis
• One of the three factors (i.e., emotional well-being) had only marginal reliability, lower than the other two factors
• Demographic data related to children's age, gender, and school attendance were available only for a subsample of children
ACKNOWLEDGMENTS
BHCOE thanks the volunteers and subject matter experts for their assistance
in developing this publication and the resources associated with it, as well as
additional support.
This document should be referenced as follows:

Behavioral Health Center of Excellence (2021). Selecting Appropriate Measurement
Instruments to Assess Treatment Outcomes of Individuals with Autism Spectrum
Disorder: Guidelines for Practitioners, Payors, Patients, and Other Stakeholders. Los
Angeles, CA: Author.
No part of this publication may be reproduced in any form, in an electronic retrieval
system, or otherwise, without the prior written permission of the publisher.

The information below constitutes guidance and recommendations to managed
care organizations, third-party payors, health insurance issuers, state agencies, ABA
organizations, and practitioners on best practices for assessment selection and
interpretation for applied behavior analysis (ABA) services.

This guidance does not replace payor-specific requirements regarding assessment
requirements and interpretation of Current Procedural Terminology (CPT) and
Healthcare Common Procedure Coding System (HCPCS) codes typically found in the
Provider Manual and/or Contract unless explicitly stated by the payor. Deviation from
this format or its requirements should not be used to deny or limit coverage of
applied behavior analysis services.
