
Describing Data and Ethics

Overview
When starting on the journey of learning statistics, one of the first things that
comes to mind is the controversial ways in which statistics are used. This prompts a
conversation about the ethics of statistics. Consider the following scenario:

Case-Study: Funding for Social Services


Maria works in social services for a program which advocates for abused,
abandoned and neglected children. The program keeps detailed records of the
children and their moves and services throughout the dependency care system.
Much of the funding for the program is determined by the numbers associated with
the record keeping. In the U.S., sources of funding vary from state to state, but it
is generally a mix of local, state and federal dollars. The allocation of grants and
other funding is typically tied to the result of some metric that can be measured
over time. Maria is struggling to understand how best to describe her metrics and,
perhaps more importantly, how to ethically balance the need to comply with
reporting and the need to justify additional funding. How can Maria work with her
data and balance the competing forces within an ethical decision-making
process?

Since you are reading these notes you are probably thinking about statistics. Just
like Maria in the example above, we need to think about how we use basic
descriptive statistics in our work and daily lives. Descriptive statistics are basic
operations performed on data. They do not convey significance, prediction, or
certainty; they simply describe the data in front of you.
Before Maria can have any conversations with funding agencies or regulatory
bodies she must be familiar with her clients, staff, hours of service, and outcomes,
which she can learn from descriptive statistics. Maria would not (or should not)
attend a meeting without knowing this information. As statistical concepts
become more complex, the importance of understanding the fundamental building
blocks of descriptive statistics becomes more apparent, allowing us to correctly
select and apply more advanced inferential statistics. These advanced techniques
allow us to model the real world, based on our own local or sample data.

The first step in understanding descriptive statistics is to start thinking about some
ways in which statistics and numbers are used. Select some media content on a
study, poll, or trend, and you will see statistics. Whether in political polling, the
effectiveness of a product, or an analysis of your favorite sports team, statistics are
all around us. Often, we are not even aware of the potential misuses of
data in our world. It helps to ask: how was the sample drawn? What exact question
was asked? What data were included and what data were excluded? The
information in the accompanying materials on ethics (links: ethics in statistics, use
and misuse of numbers, and statistics ethics advice) is presented to get you
thinking about the issues around ethics.

So, as we begin our journey of learning statistics, let’s start with some of the ethics
involved in statistics along with some basic concepts of descriptive statistics.

Objectives
Upon completion of this lesson, you should be able to:

• Identify ethical dilemmas
• Choose among alternative actions using the ASA ethical guidelines as a
  framework for values
• Correctly identify measures of central tendency (mean, median, and mode)
• Match appropriate descriptive statistics with the type of data
• Match the appropriate graph with type of data

1.1 - Classifying Statistics


What's a variable?
Let’s get to know some of the descriptive statistics. The first challenge is
determining what kind of data you are dealing with. There are generally two main
types of data, qualitative and quantitative.

Qualitative data is typically words, but could also be images or other media. We will
refer to this data in this course as categorical. Qualitative data may be labeled with
numbers, allowing this type of data to be analyzed using some of the techniques in
the course. Maria might encounter some qualitative data in her work by labeling
some of the mental health diagnoses (depression might be a “1”; anxiety a “2”).
Note that these numerical labels are arbitrary. Quantitative data, on the other hand,
is numerical and is the focus of this course. If Maria counts the number of patients
seen each day, this data is quantitative.

Quantitative variables may be discrete or continuous. Discrete variables can only
take on a limited number of values (e.g., only whole numbers) while continuous
variables can take on any value and any value between two values (e.g., out to an
infinite number of decimal places).

Before we get too far along, let’s take a moment to think about what the word
“variable” means. A variable (notice this is a noun, not an adjective) is an element
or a feature. In statistics, this is typically something that is measured or recorded.
In Maria’s case, the “number of patients” is a variable, and the mental health
diagnosis is also a variable.
Summarizing Types of Variables
To summarize:

Categorical variable

Names or labels (i.e., categories) with no logical order or with a logical order
but inconsistent differences between groups, also known as qualitative.

Example: Eye Color

Quantitative variable

Numerical values with magnitudes that can be placed in a meaningful order
with consistent intervals, also known as numerical.

Continuous variable

Characteristic that varies and can take on any value and any value between
values

Example: Gas Prices


Discrete variable

Characteristic that varies and can only take on a set number of values

Example: Number of Customers

If a child admitted to Maria’s program is weighed upon admission, this weight is
a quantitative variable because it takes on numerical values with meaningful
magnitudes. It is a continuous variable because, theoretically, weight could take
on any value. Any value between any two values is a possibility.

Example 1.1: Favorite Ice Cream Flavor


If each child at Maria’s organization is offered an ice cream cone, there may be
three choices of flavors: chocolate, vanilla, or strawberry. The ice cream flavor is
a categorical variable because the different flavors are categories with no
meaningful order.
1.2 - Summarizing Data Visually
Summarizing Categorical Variables
Once the type of data, categorical or quantitative, is identified, we can consider
graphical representations of the data, which would help Maria understand it.

Frequency tables, pie charts, and bar charts are the most appropriate graphical
displays for categorical variables. Below are a frequency table, a pie chart, and a
bar graph for data concerning Mental Health Admission numbers.

Frequency Table
A table containing the counts of how often each category occurs.

Diagnosis     Count    Percent
Depression    40835      48.5%
Anxiety       29388      34.9%
OCD            5465       6.5%
Abuse          8513      10.1%
Total         84201     100.0%
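
The Percent column of the frequency table can be reproduced from the counts alone. The following is a minimal sketch in Python; the counts are copied from the table above, while the function name `relative_frequencies` is made up for illustration.

```python
# Counts copied from the frequency table of diagnoses.
counts = {"Depression": 40835, "Anxiety": 29388, "OCD": 5465, "Abuse": 8513}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


total = sum(counts.values())
percents = relative_frequencies(counts)

# Print one row per category: label, count, percent to one decimal place.
for category, n in counts.items():
    print(f"{category:<12}{n:>7}{percents[category]:>8.1f}%")
print(f"{'Total':<12}{total:>7}{sum(percents.values()):>8.1f}%")
```

Reporting the count next to the percentage keeps the reader aware of how many people are behind each share.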

Pie Chart
Graphical representation for categorical data in which a circle is partitioned into “slices” on
the basis of the proportions of each category.
[Figure: Pie Chart of Diagnosis]

Pitfalls
One of the pitfalls of a pie chart is that if the “slices” only represent percentages, the
reader does not know how many actual people fall in each category.
Bar Chart
Graphical representation for categorical data in which vertical (or sometimes
horizontal) bars are used to depict the number of experimental units in each
category; bars are separated by space.

Note that in the bar chart, the categories of mental health diagnoses (bars) have
white spaces in between them. The spaces between the bars signify that this is a
categorical variable.

Pie charts tend to work best when there are only a few categories. If a variable has
many categories, a pie chart may be more difficult to read. In those cases, a
frequency table or bar chart may be more appropriate.

Pitfalls
While bar charts can be presented as either percentages (in which case they are
referred to as relative frequency charts) or counts, readers often assume that
differences in the heights of the bars are meaningful, even when they are not.

Summarizing Quantitative Variables


But what of variables that are quantitative, such as math SAT scores or the
percentage of students taking the SAT? For these variables we should use histograms
or boxplots. Histograms differ from bar graphs in that they represent frequencies by
area and not height. A good display will help to summarize a distribution by
reporting the center, spread, and shape for that variable.

For now, the goal is to summarize the distribution or pattern of variation of a single
quantitative variable.

Histogram
Histograms are graphical displays that can be used with one quantitative variable.
In these plots the horizontal axis represents the values of the variable and the
height of each bar represents how many observations fall within the corresponding
class (range of values).

From the histogram of children’s heights below, Maria can see that about 10
children have a height equal to “60”.

Pitfalls
People frequently confuse bar charts and histograms. The first test should be to
identify what kind of data you are charting (or what kind of data was charted):
quantitative or categorical. Another hint is that the x-axis of a histogram will
contain labels that reflect a quantitative variable, while a bar chart will have an
x-axis that contains category labels, generally not numbers.
To draw a histogram by hand we would:

1. Divide the range of data (range is from the smallest to largest value within the data
for the variable of interest) into classes of equal width.
2. Count the number of observations in each class.
3. Draw the histogram using the horizontal axis as the range of the data values and
the vertical axis for the counts within the class.
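
The three steps above can be sketched in a few lines of Python. This is only an illustration: the heights are made-up values, the function name `histogram_counts` is hypothetical, and placing the largest value in the last class is one common convention among several.

```python
def histogram_counts(values, num_classes):
    """Count observations in equal-width classes spanning the data range."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are equal; the range has zero width")
    width = (high - low) / num_classes            # step 1: classes of equal width
    counts = [0] * num_classes
    for v in values:                              # step 2: count per class
        index = int((v - low) / width)
        index = min(index, num_classes - 1)       # the maximum goes in the last class
        counts[index] += 1
    edges = [low + i * width for i in range(num_classes + 1)]
    return edges, counts


# Hypothetical heights, made up for illustration only.
heights = [52, 55, 57, 58, 60, 60, 61, 62, 63, 64, 66, 68]
edges, counts = histogram_counts(heights, 4)

# Step 3: draw a text histogram, one row per class, one star per observation.
for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f} | {'*' * count}")
```

Changing `num_classes` changes the picture, so it is worth trying a few values before settling on a display.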

Choosing the appropriate display


When selecting a visual display for your data you should first determine how many
variables you are going to display and whether they are categorical or quantitative.
Then, you should think about what you are trying to communicate. Each visual
display has its own strengths and weaknesses. When first starting out, you may
need to make a few different types of displays to determine which best
communicates your data.
1.4 - Measures of Central Tendency
Visual summaries of data are effective, but someone like Maria will
probably need to present some numerical summaries of her data to use in her
reporting. The most common measures to describe data are measures of central
tendency.
Mean, Median, Mode
A measure of central tendency is an important aspect of quantitative data. It is an
estimate of a “typical” value. Maria may be asked for the typical number of children
seen per month.

Three of the many ways to measure central tendency are the mean, median and
mode.

There are other measures, such as a trimmed mean, that we do not discuss here.
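
As a quick illustration, the mean (the sum of the values divided by how many there are), the median (the middle value of the sorted data), and the mode (the most frequent value) can be computed with Python's standard `statistics` module. The monthly counts below are made up for this sketch and are not Maria's actual data.

```python
import statistics

# Hypothetical number of children seen in each of seven months.
children_per_month = [12, 15, 15, 18, 20, 22, 31]

mean_value = statistics.mean(children_per_month)      # sum of values / number of values
median_value = statistics.median(children_per_month)  # middle value of the sorted data
mode_value = statistics.mode(children_per_month)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

The three measures give three different answers here, which is why it matters to say which "typical" value is being reported.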
Effects of Outliers
One shortcoming of the mean is that it is easily affected by extreme values.
Measures that are not greatly affected by extreme values are called resistant.
Measures that are affected by extreme values are called sensitive.
Maria would use the median if she felt her numbers could be affected by
outliers, because the median is resistant to outliers.
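
A small sketch makes the difference between a sensitive and a resistant measure concrete. The data are hypothetical, and the extreme value of 200 is chosen only to exaggerate the effect.

```python
import statistics

# Hypothetical monthly counts, then the same data with one extreme month added.
typical = [12, 15, 15, 18, 20, 22, 31]
with_outlier = typical + [200]

mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print("mean moves by:", mean_shift)      # the mean is sensitive
print("median moves by:", median_shift)  # the median is resistant
```

A single extreme month pulls the mean far from the bulk of the data, while the median barely moves.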
Adding and Multiplying Constants
What happens to the mean and median if we add a constant to each observation in a
data set, or multiply each observation by a constant?
Consider, for example, an instructor who curves an exam by adding five points to each
student’s score. What effect does this have on the mean and the median? Adding a
constant to each value shifts both the mean and the median by that constant.

For example, consider the 9 participation rates for the South Atlantic states, which
have a mean of 66.11 and a median of 73. If 5 were added to each participation rate,
the mean of this new data set would be 71.11 (the original mean of 66.11 plus 5) and
the new median would be 78 (the original median of 73 plus 5).

Similarly, if each observed data value was multiplied by a constant, the new mean
and median would change by a factor of this constant. Returning to the 9
participation rates, if all of the original rates were multiplied by 1.20 (a 20 percent
increase), then the new mean and new median would be found by multiplying the
original mean and median by 1.20, giving about 79.33 and 87.6, respectively. As we
will learn shortly, the effect is not the same on the variance!
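
Both rules are easy to check numerically. The sketch below uses made-up exam scores rather than the participation rates, which are not listed in these notes; the comparisons use `math.isclose` because multiplying by 1.20 involves floating-point arithmetic.

```python
import math
import statistics

# Hypothetical exam scores, made up for illustration.
scores = [60, 70, 73, 80, 90]

curved = [s + 5 for s in scores]      # add a constant to every observation
scaled = [s * 1.20 for s in scores]   # multiply every observation by a constant

# Adding 5 shifts both the mean and the median by 5.
print(statistics.mean(scores), statistics.mean(curved))
print(statistics.median(scores), statistics.median(curved))

# Multiplying by 1.20 scales both the mean and the median by 1.20.
print(math.isclose(statistics.mean(scaled), 1.20 * statistics.mean(scores)))
print(math.isclose(statistics.median(scaled), 1.20 * statistics.median(scores)))
```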

Shape and Central Tendency


The shape of the data helps us to determine the most appropriate measure of
central tendency. The three most important descriptions of shape are Symmetric,
Left-skewed, and Right-skewed. Skewness is a measure of the degree of
asymmetry of the distribution. Maria might want to examine the shape of the
distribution of the number of children seen.
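
One common way to put a number on asymmetry is the moment coefficient of skewness: the average cubed deviation from the mean divided by the 1.5 power of the average squared deviation. The sketch below assumes that definition, which these notes do not spell out, and uses made-up data; a value near zero suggests symmetry, a positive value a right skew, and a negative value a left skew.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # average squared deviation
    m3 = sum((x - mean) ** 3 for x in values) / n   # average cubed deviation
    if m2 == 0:
        raise ValueError("all values are equal; skewness is undefined")
    return m3 / m2 ** 1.5


# Hypothetical data sets, made up for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]        # long tail toward large values
left_skewed = [-x for x in right_skewed]     # mirror image: long tail toward small values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: skewness = {skewness(data):.2f}")
```

In a skewed distribution the mean is usually pulled toward the long tail, which is one reason the median is often preferred as the measure of central tendency for skewed data.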
1.5 - Ethical Considerations
As in Maria’s case, you may now be wondering: have I ever misused data, or read a
report in which data may have been misused? This thinking is part of the process of
“ethics spotting”, the first step to realizing data may have some ethical issues.
Fortunately, the American Statistical Association (ASA) publishes guidelines on
ethics to guide us to appropriate courses of action when working with data. While
there are not necessarily right or wrong answers to these questions, the fact that
you are now thinking about ethics in data is the important part.

The American Statistical Association's Committee on Professional Ethics published the
following: Ethical Guidelines for Statistical Practice
From their website:

The Ethical Guidelines address eight general topic areas and specify important
ethical considerations under each topic.

1. Professionalism points out the need for competence, judgment, diligence,
self-respect, and worthiness of the respect of other people.
2. Responsibilities to Funders, Clients, and Employers discusses the
practitioner's responsibility for assuring that statistical work is suitable to the needs
and resources of those who are paying for it, that funders understand the
capabilities and limitations of statistics in addressing their problem, and that the
funder's confidential information is protected.
3. Responsibilities in Publications and Testimony addresses the need to report
sufficient information to give readers, including other practitioners, a clear
understanding of the intent of the work, how and by whom it was performed, and
any limitations on its validity.
4. Responsibilities to Research Subjects describes requirements for protecting the
interests of human and animal subjects of research, not only during data collection
but also in the analysis, interpretation, and publication of the resulting findings.
5. Responsibilities to Research Team Colleagues addresses the mutual
responsibilities of professionals participating in multidisciplinary research teams.
6. Responsibilities to Other Statisticians or Statistical Practitioners notes the
interdependence of professionals doing similar work, whether in the same or
different organizations. Basically, they must contribute to the strength of their
professions overall by sharing nonproprietary data and methods, participating in
peer review, and respecting differing professional opinions.
7. Responsibilities Regarding Allegations of Misconduct addresses the
sometimes painful process of investigating potential ethical violations and treating
those involved with both justice and respect.
8. Responsibilities of Employers, Including Organizations, Individuals, Attorneys, or
Other Clients Employing Statistical Practitioners encourages employers and clients
to recognize the highly interdependent nature of statistical ethics and statistical
validity. Employers and clients must not pressure practitioners to produce a
particular "result," regardless of its statistical validity. They must avoid the potential
social harm that can result from the dissemination of false or misleading statistical
work.
So in dealing with data, not only must we be technically correct in
determining the type of data we have and matching the appropriate
descriptive statistics and graphical representations, we also must do so
in a manner that accurately represents our phenomena and does not allow
our own biases and perspectives to bend the data. Finally, as a data
consumer, you should become more aware of the possibilities of
misrepresentation of data. The material in this course will help you
learn to ask critical questions as you harness the incredible power and
influence of statistics.
