
5 Exploring Data – Summarizing Data

5.1 Sorting and Filtering a Table


When confronted with a large mass of information, we usually start by trying to condense it.
Summarizing data may start with grouping similar values together, but we will also want to find
simple “landmark” values that describe important characteristics of the data set.

However, before summarizing the data, it is advisable to simply explore the data to see what is
there. Play with the data as if you were in a sandbox, before you start forming opinions about
the data and prematurely drawing conclusions.

Exploring in Excel Tables

If your data is in a Table, then there are some functions that make basic exploration easy. In the
header row at the top of the Table, each variable name is followed with a down arrow. If you
click on it you can see that you can sort or filter your data. The first two sorts (smallest to largest
and largest to smallest for numeric variables and A to Z vs Z to A for text) are self-explanatory.

Open the file Student Survey Data 2010.xlsx.

Figure 5-1 Sorting and Filtering a numeric variable in a Table

5 Exploring Data – Summarizing Data Page 1 of 29


Figure 5-2 Sorting and Filtering a text variable in a Table

The third is labeled Sort by Color. This is a custom sort. You can sort at multiple levels. That is,
sort students by their Program, then within each Program you can sort by Gender, then among
each Gender-Program combination you can sort by GPA,… Note that the whole file is sorted, not
simply the variable that you selected. Normally in Excel, you have to select the data you wish to
sort.

Figure 5-3 Custom Sort



Figure 5-4 Custom Sort dialog box

You can also Filter your data. This could be by selecting the specific values you wish to examine,
or by using a variety of functions to select ranges of values. You can do a Custom Filter that
combines ranges by using logical operators, such as OR and AND. You can filter on several
different variables separately. This is useful when you want to explore specific sub-groups in
your data.
Suppose that we want to look just at the Business students. We can click on the down arrow by
Program and uncheck all programs except Business.

Maybe we want only the International Business students. Click on the down arrow by Home and
select International.

We can continue selecting just Females or just 1st year students. We can also use more complex
text filters. Or we can select just those that expect to earn over $50,000, or between $40,000
and $80,000…

Remember to take filters off to return to the full data set.

We can also Sort the data in a similar fashion. If you select Sort by Color you can then choose
Custom Sort, which allows you to sort by several variables concurrently.

When first exploring data, doing a simple sort on each numeric variable can be insightful. For
example, sort Salary, either smallest to largest or largest to smallest.

Look at the smallest values. Someone expects to earn $0? $500? $9,450?

Look at the largest values. Someone expects to earn $650,000? $560,000?



Are these students crazy or are these data entry errors? Looking at the extremes will give you
some sense of data quality issues. Should we keep these values when we do our analyses?
Maybe the $650,000 was supposed to be $65,000, but do we know for sure? Maybe we should
investigate this record more closely. Are there other responses for this student that may give us
insights? Maybe the student gave all erroneous answers and we should discard all responses for
the student.

Notice how many gave no answer. Why? Should we be concerned?

When you are done sorting, it is recommended that you return your data to the original sort
order. Initially, the data was likely sorted by the record ID. Re-sort the Table by ID, smallest to
largest. If your data set does not have an ID column, it is recommended that you create one.
Without a record ID column that is initially set as smallest to largest, you do not have any way to
restore your original sort order.

As we explore our data and come to understand it, we may start asking ourselves questions
about it.

Sorting and Filtering in Python

Like most actions on “objects” in Python, sorting and filtering are executed by using “methods”
attached to the object. If we wanted to sort the data frame by Program2 and then by Salary,
we would call the sort_values method and give it both column names, in that order.
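
A minimal sketch of such a sort is shown here. The data frame name df_Grad_Exp and the column names come from this chapter, but the six rows are made-up values used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [55000, 40000, 60000, 35000, 48000, 72000],
})

# Sort by Program2 first, then by Salary within each program (ascending by default).
sorted_df = df_Grad_Exp.sort_values(by=["Program2", "Salary"])
print(sorted_df)
```

Every column moves with its row, so the ID values show how the records were rearranged.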

The default sort is ascending, but this can be changed by setting ascending to False. Note that
when we sort on Program2 and Salary with a single ascending=False, the sort order will be
descending for BOTH of these variables.
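
The sketch below illustrates the descending sort on a small made-up data frame (the values are assumptions, not the survey data). It also shows that ascending accepts a list, which is one way to sort one variable ascending and the other descending.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [55000, 40000, 60000, 35000, 48000, 72000],
})

# A single ascending=False applies to both sort columns.
both_desc = df_Grad_Exp.sort_values(by=["Program2", "Salary"], ascending=False)

# A list gives one direction per column: Program2 A to Z, Salary largest first.
mixed = df_Grad_Exp.sort_values(by=["Program2", "Salary"], ascending=[True, False])
print(both_desc)
print(mixed)
```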



The sort_values method returns a sorted copy and leaves the original data frame unchanged. To
make the sort permanent, either assign the result to a new object, or use the inplace=True
option to overwrite the data frame.
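
The difference between the two options can be sketched as follows, again on a small made-up data frame (the values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [55000, 40000, 60000, 35000, 48000, 120000],
})
original_order = df_Grad_Exp["ID"].tolist()

# The sorted copy is discarded here, so the data frame itself is unchanged.
df_Grad_Exp.sort_values(by="Salary")
unchanged = df_Grad_Exp["ID"].tolist() == original_order

# Option 1: keep the sorted copy under a new name.
df_sorted = df_Grad_Exp.sort_values(by="Salary")

# Option 2: overwrite the data frame in place.
df_Grad_Exp.sort_values(by="Salary", inplace=True)
print(unchanged, df_Grad_Exp["ID"].tolist())
```

Sorting on the ID column afterwards restores the original order, which is one more reason to keep a record ID in the data.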

In the examples above, only the first and last 5 rows are displayed. To display the top (head) 10
rows, add the method .head(10)

And if you want just the bottom (tail) 10 rows, add the method .tail(10)

Observe that we can chain “methods” together.
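
As a sketch of chaining, the example below sorts and then keeps only the top or bottom rows. The data values are made up, and because the illustrative frame has only six rows it asks for 3 rows rather than 10.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [55000, 40000, 60000, 35000, 48000, 120000],
})

# Chain sort_values with head or tail: each method acts on the result of the previous one.
top3 = df_Grad_Exp.sort_values(by="Salary", ascending=False).head(3)
bottom3 = df_Grad_Exp.sort_values(by="Salary", ascending=False).tail(3)
print(top3)
print(bottom3)
```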

To filter cases, we must specify the condition(s) on which we wish to filter. For example, to
select only the observations for which Salary exceeds $100,000, we write the condition
Salary > 100000 inside square brackets after the name of the data frame.
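
A minimal sketch of this filter, using a made-up data frame (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [55000, 40000, 60000, 35000, 48000, 120000],
})

# The condition produces one True/False value per row ...
over_100k = df_Grad_Exp["Salary"] > 100000

# ... and indexing with it keeps only the rows marked True.
high_earners = df_Grad_Exp[over_100k]
print(high_earners)
```

As with sorting, the filter returns a new object. The original data frame still holds every row, so there is no filter to "take off" afterwards.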



When you want observations that equal a specific value, such as Age equal to 20 or Program2
equal to Arts, then you must write == (a single = means assignment). You can also combine
multiple conditions: wrap each condition in parentheses and join them with & (and) or | (or).
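
The sketch below shows one way to combine conditions in pandas. The rows are made up for illustration; only the column names Age, Program2 and Salary come from the chapter.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Age": [20, 22, 20, 19, 21, 20],
    "Salary": [55000, 40000, 60000, 35000, 48000, 120000],
})

# AND: Age equal to 20 and Salary above 50,000 (each condition in parentheses).
age20_high = df_Grad_Exp[(df_Grad_Exp["Age"] == 20) & (df_Grad_Exp["Salary"] > 50000)]

# OR: Arts students, or anyone expecting more than 100,000.
arts_or_high = df_Grad_Exp[(df_Grad_Exp["Program2"] == "Arts") | (df_Grad_Exp["Salary"] > 100000)]
print(age20_high)
print(arts_or_high)
```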

Using tab to save typing

Typing out long expressions, such as repeatedly typing df_Grad_Exp, can be simplified by
typing the first one or two letters and then pressing the tab key. Python will respond with a
list of all variables starting with these letters.

If you want to know what methods or functions you can apply to an object, type a period
after the object and then press tab.



Getting help with Python syntax

If you can’t remember the syntax associated with a function, press Shift+Tab and you will get
a help screen for the function.

5.2 Simple Summary Values


Although sorting gives you a sense of what are small and large values and whether they are just
extreme outliers or common, we also want some other “landmark” measures. Geographical
landmarks assist in giving you a sense of where you are by giving you reference points.
Statistical landmarks serve a similar purpose.

Summary Statistics in Excel


When in a Table, if you select the Design tab at the top of the screen, there is a checkbox labeled
Total Row. Click it.

Figure 5-5 - Adding a Total Row

A row will be added to the bottom of the table. Scroll to the last row. Select the entry in the
Salary column. A down arrow should appear. Click on the arrow.



Figure 5-6 Summary functions in Total Row

Select Count Numbers. 752 students reported a salary value. Look at Average, Max and Min.
The average salary was $52,914.76, with a maximum of $650,000 and a minimum of $0. The
$650,000 appears unrealistic. We noted before when Sorting, that some students gave values
that we thought were suspicious.

Go to the down arrow by Salary in the header row and select Number Filters. Select Between
and then select 20000 and 120000 as your limits. You now get an average of $51,480.40. It
would appear that the extremes at each end did not seriously distort our average.

Click on the ID cell in the Total row (row 813, cell A813). Click on the arrow and select Count.
811 students attempted the survey. This is all of the students. Click on the Total cell for Year.
802 told us the year they started. Repeat this for Home and for Home2. Why is the count 811 for
Home and 801 for Home2? When we applied VLOOKUP to Home, blanks were coded as #N/A.
This may be a problem.

What happens when you look at Gender and Gender2? 811 and 665? Why? Initially, we used
VLOOKUP to code 0 and 1 as Male and Female, but VLOOKUP treated a blank as zero = Male. We
added an IF statement to treat blanks as blanks. But, Excel coded blanks as blank text. So the
cells are no longer empty and they were counted.

Lesson: Recoding may have unintended consequences.

Question: why did 802 tell us when they started but only 665 tell us gender?

Lesson: Watch out for survey fatigue with long surveys.

Are salary expectations of Business students similar to those of Arts students? We know that we
have some strange salaries reported, so let us Filter on those between $20,000 and $120,000.

Note: do not insert $ and commas. Type 20000 and 120000.

The average salary is $51,480.80.



Now filter on Program and select Business. The average is $52,463.20. Filter on Program and
select Arts. The average is $48,185. If you filter on Science you will see they have the highest
expectations, at $57,774.77.

Although tedious, you can answer many simple questions with the Total row and Filters.

Video: Sorting, Filtering and Total in an Excel Table

https://youtu.be/rs_H0T1sY3U

Summary Statistics in Python


Common summary measures can be found by applying the appropriate method. For example,
mean() will give us the average.

If you do not specify a column, you will get the mean value for every numeric variable.
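
A short sketch of both uses of mean() follows, on made-up values. One caveat that is an addition to the text: in recent pandas versions (2.0 and later), calling mean() on a data frame that also holds text columns raises an error unless numeric_only=True is passed, so the sketch includes it.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Program2": ["Business", "Arts", "Science", "Arts", "Business"],
    "Age": [20, 22, 20, 19, 21],
    "Salary": [55000, 40000, 60000, None, 45000],  # None marks a missing answer
})

# Mean of one column; missing values are skipped by default.
salary_mean = df_Grad_Exp["Salary"].mean()

# Mean of every numeric column; the text column Program2 is left out.
all_means = df_Grad_Exp.mean(numeric_only=True)
print(salary_mean)
print(all_means)
```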



You can obtain all of the most common summary measures by using the describe() method.
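
As an illustration, describe() applied to a single numeric column returns the count, mean, standard deviation, minimum, quartiles and maximum. The values below are made up.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Salary": [55000, 40000, 60000, None, 45000],  # None marks a missing answer
})

# describe() returns the common summary measures as one labelled Series.
summary = df_Grad_Exp["Salary"].describe()
print(summary)
```

The count reported by describe() covers only the non-missing values, much like Count Numbers in the Excel Total Row.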

We will explain the meaning of each of these measures in the next chapter.

With string (text) data, we can count the values with value_counts(). This method does not give
us a count of the total observations, but counts the number of observations taking on each
unique value.
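
A minimal sketch of value_counts() on a text column, using made-up program values:

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Arts", None],
})

# One count per unique value, largest count first; missing values are left out by default.
program_counts = df_Grad_Exp["Program2"].value_counts()
print(program_counts)
```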

This last example is a good segue into our next topic of aggregating data.



5.3 Aggregating Data with Frequency Distributions
A frequency distribution is a tabular summary of how many values are in each group of values.
Suppose you take our breakdown of salaries from very low to unrealistic and count how many fit
each description.

Table 5-1 - Proposed grouping for Expected Salary data

Usually, frequency distributions do not have text descriptions, just the ranges of values.

We have looked at how to group data by making numeric values into categorical values. In
Excel, this was done with VLOOKUP using an approximate match. In Python, we used the cut
function in a similar manner. Both methods create categorical variables, but do not
automatically make a summary like the table above.

To build a frequency distribution, you must group the data.

What are the “best practices” in grouping?


The objectives for grouping data are to:
• Make the summary concise.
• Make the summary informative.

What does this mean? How do we do it well?


• Don’t have too many groups – not concise – normally limit to a maximum of 20.
• Don’t have too few groups – loss of information – have at least 4-5.
• Groups should be unambiguous (should not overlap so each value fits in only one group).
• The summary should include all the data.
• Use “nice numbers”. We like multiples of 2, 5, 10, 25, 50, 100 and not awkward starting or
ending values, such as 13 or 27 or 146.3.
• It is easier to compare groups that are of “equal size”. The proposed groups above went in
increments of $20,000, but “very high” had a range of $40,000. This is NOT a good practice.
• To satisfy all of the above, the 1st and/or last group may need to be larger to capture
outliers.



Frequency Distributions in Excel – Histograms

A histogram is a graphical depiction of a frequency distribution. There are several ways to create
a histogram in Excel. We will look at two methods in this section and a third method when we
look at pivot tables and charts.

If you wish to set up groupings of different sizes, then you will have to use the Analysis ToolPak
add-in in Excel.

Installing the Analysis ToolPak


In Excel for Windows:
• Go to the File tab.
• Select Options (bottom of list).
• Select Add Ins (near bottom of list).
• Select Manage |Excel Add Ins| Go (at the bottom of the screen).
• Click on Analysis ToolPak.

Figure 5-7 - Installing the Analysis ToolPak in Excel for Windows

If you are using Office 365 in the cloud, then installation is somewhat more complicated. The
Analysis ToolPak is not included in 365. Go to Insert and then select Add-Ins.



Figure 5-8 Excel 365 Add-ins

Select STORE from the options listed at the top of the pop-up screen. From the list of categories
at the left, select Data Analytics. Scroll down the list of suggestions until you see XLMiner.

Figure 5-9 XLMiner Add-in

The XLMiner add-in is free and behaves in a very similar way to the Analysis ToolPak.

If you are using a Mac, then you must go through a similar process to the above. You may need
to google to find installation instructions.



Using the Histogram Function

To use the Histogram function in Excel, you must define your groups. Excel calls the groups
“bins”.

Since the end of one bin is the start of the next, we only need to define the end of the bin. Note
that this differs from VLOOKUP that required us to define the start of each group. (confusing!)

Let us use a bin width of $10,000 and define bins up to $120,000. To keep our data sheet clean,
we recommend that you put the bins on a separate sheet, such as your LookUp Table
worksheet.

In column R, put a heading “Salary bins” in the 1st row and then enter 10000, 20000, 30000, … in
the 2nd, 3rd, 4th, … rows. Note: do not use commas. Write 10,000 as 10000.

Figure 5-10 - Salary bins for building Histograms

Go to the Data tab and on the far right, you should see Data Analysis. Click on it. A pop up
window appears. Select Histogram.



Figure 5-11 - Histogram function
A dialogue box opens.

Figure 5-12 - Histogram dialog box

The Input Range is the location of the data to be summarized, $L$1:$L$812.
The Bin Range is the location of the bins, 'Lookup Tables'!$R$1:$R$13.
Our columns have Labels, so check this box.
We want Chart Output, so check this box.
Ask for a New Worksheet, so we do not add clutter to our data sheet.



Figure 5-13 - Histogram output

The histogram is a simple column chart. You can format the labels on the chart to make it more
attractive. It is common practice to remove the space between columns in a histogram.

Click on any column and then right click.


Select Format Data Series.

Figure 5-14 - Formatting the histogram

To the right are several formatting options. Change the Gap Width to 0.



Figure 5-15 - Histogram without gaps

The More group on the right is deceptive. Excel makes all columns the same width. But this
group actually represents values from $120,000 to $650,000! The tail on the right should be very
very long, but that would distract us from looking at where most of the data is.

Figure 5-16 - Histogram with 65 bins

Don’t be distracted by outliers. They are important, but not the focus.
It would be nice to filter out extreme values. If you apply a filter and then construct your
histogram, Excel ignores the filter! The only way to filter data and then make a histogram is to
• Apply the filter.
• Copy the filtered data to a new sheet.
• Build the histogram using the copied data.



Video: Creating a Frequency Distribution and Histogram in Excel

https://youtu.be/3ysPbcSqDXE

Use the Chart Tools to create a Histogram

The most recent version of Excel has an alternate way to construct histograms.
• Select the Salary column.
• Go to Insert and select Recommended Charts.
• You may see an image of a histogram, but if not, select All Charts and select Histogram
from the list.

Figure 5-17 Histogram Chart



The result is very crude.

Figure 5-18 Histogram Chart Output

But, we can improve this significantly. Right click anywhere on the horizontal axis. Select
Format Axis. A dialog box appears. You will see that Automatic is checked. Let us select Bin
Width and enter 10000 (do not use a comma 10,000). We would also like to limit the number of
bins, but Excel only allows you to fix one setting.

Figure 5-19 Histogram Chart with Bin width set

The advantage of using an Excel Chart instead of the Analysis ToolPak is that we can use the
Filtering tools in Table. Select the down arrow beside Salary and select Number Filters and then
select Less than or equal to. Enter 120000.



Figure 5-20 Histogram Chart with Filtered values

This now looks similar to what we obtained with the Histogram function in the Analysis
ToolPak. As a Chart, we can change the title and label the axes. In general, it is recommended
that all charts have informative titles and labelled axes.

Figure 5-21 Formatted Histogram Chart



Frequency Distributions and Histograms in Python

As noted in a previous section, the value_counts() method will count the number of
observations taking a specific value of a categorical variable. For example, with Program2, we
had

If you wish to obtain relative frequencies, then set normalize=True.

Too many decimals are distracting and not informative. To round the results to 2 decimal places,
add the method round(2).
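
The two options can be chained, as in this sketch on made-up program values:

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Arts", "Arts"],
})

# normalize=True returns proportions instead of counts; round(2) trims the decimals.
rel_freq = df_Grad_Exp["Program2"].value_counts(normalize=True).round(2)
print(rel_freq)
```

Multiplying the result by 100 would express the relative frequencies as percentages.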

With numeric data, we saw that we can use the cut function to change numeric data to
categories.

In the example we looked at before, we defined the bins with a series of upper limits for the
bins.
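
The earlier example is not reproduced here, so the sketch below is a stand-in under stated assumptions: the bin edges and salary values are made up, and the column name Salary_Group is hypothetical. In pd.cut, each bin by default excludes its lower edge and includes its upper edge, which matches the idea of defining bins by their upper limits.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Salary": [15000, 35000, 40000, 55000, 58000, 75000, 110000],
})

# Bin edges in steps of 20,000; a salary of exactly 40,000 falls in (20000, 40000].
edges = [0, 20000, 40000, 60000, 80000, 100000, 120000]
df_Grad_Exp["Salary_Group"] = pd.cut(df_Grad_Exp["Salary"], bins=edges)

# Count how many observations fall in each bin (largest count first by default).
group_counts = df_Grad_Exp["Salary_Group"].value_counts()
print(group_counts)
```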



This looks very confusing because, by default, value_counts sorts the results from the largest
count to the smallest. We can set sort=False to turn this off, so that the bins are listed in order.
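
A sketch of the effect of sort=False, using made-up salaries and bin edges (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical salary values and bin edges (invented for illustration).
salary = pd.Series([15000, 35000, 40000, 55000, 58000, 75000, 110000], name="Salary")
edges = [0, 20000, 40000, 60000, 80000, 100000, 120000]
groups = pd.cut(salary, bins=edges)

# With sort=False the counts follow the order of the bins, and empty bins show as 0.
counts_in_order = groups.value_counts(sort=False)
print(counts_in_order)
```

Keeping the empty bin in the listing makes gaps in the distribution visible.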

We can also simply specify the number of bins we want and let Python create equal sized bins.
You lose control over whether the end points to the bins are “attractive” values and outliers can
seriously distort the groups. For example, if we were to split the Salary data into 20 bins, we
would get
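
The survey output is not reproduced here. As a small illustration of the same effect, the sketch below splits eight made-up salaries, one of them an extreme value, into 20 equal-width bins.

```python
import pandas as pd

# Hypothetical salaries with one extreme value (invented for illustration).
salary = pd.Series([30000, 40000, 45000, 50000, 52000, 60000, 75000, 650000], name="Salary")

# Passing an integer asks pandas to build that many equal-width bins.
auto_groups = pd.cut(salary, bins=20)
auto_counts = auto_groups.value_counts(sort=False)
print(auto_counts)
```

The single extreme value stretches the range so far that nearly every observation lands in the first bin, and the bin edges are no longer "nice numbers".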

Almost all the data is in the first 4 groups.

To make a frequency distribution into a histogram, we can use the plot method in Pandas, or we
can load Python’s library of plotting functions.

The quick approach would be to ask for a bar chart of the frequency distribution. Below is an
example with bins with widths of $10,000 and the last bin being all values above $100,000. We
can add axis labels and a title.
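
One way this could look is sketched below. The salary values, labels and title are made up, and the plotting relies on matplotlib being installed (pandas uses it behind the plot method). Setting width=1.0 removes the gaps between bars, similar to setting Gap Width to 0 in Excel.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical salaries (invented for illustration).
salary = pd.Series([15000, 35000, 42000, 48000, 55000, 58000,
                    75000, 95000, 110000, 650000], name="Salary")

# Bins of width 10,000 up to 100,000, plus one open-ended bin for everything above.
edges = list(range(0, 100001, 10000)) + [float("inf")]
counts = pd.cut(salary, bins=edges).value_counts(sort=False)

# Bar chart of the frequency distribution, with axis labels and a title.
ax = counts.plot(kind="bar", width=1.0, edgecolor="black")
ax.set_xlabel("Expected salary ($)")
ax.set_ylabel("Number of students")
ax.set_title("Expected Salary")
```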



If you plan to let Python build your bins, then you do not need to create categories and frequency
distributions to get a histogram. You can ask for a histogram directly. If you don’t specify the
number of bins, Python uses 10 bins. The example below uses 50 bins.
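
A minimal sketch of a direct histogram with 50 bins, using made-up salaries and assuming matplotlib is installed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical salaries (invented for illustration).
salary = pd.Series([15000, 35000, 42000, 48000, 55000, 58000,
                    62000, 75000, 95000, 110000], name="Salary")

# Ask for the histogram directly; pandas builds 50 equal-width bins.
ax = salary.plot(kind="hist", bins=50)
ax.set_xlabel("Expected salary ($)")
ax.set_title("Expected Salary")
```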



Frequency Distributions versus Histograms – a Table or a Chart

Before leaving this topic, we must ask why we want frequency distributions or histograms.
The rationale for a frequency distribution may be very different from that for a histogram. It
goes to the question of why/how we use tables versus charts.

In a table, we are trying to summarize information while still retaining some detail. We are
interested in how the groups (bins) are defined and the percentage of values within a group.
Readability is important.

In a chart, we are visually looking for patterns. You “read” a table, but “feel” a chart. In most
instances, you do not look at the scales on the chart axes. You don’t care if the vertical axis
shows frequency or percentage. But you also don’t pay attention to the labeling of the
horizontal axis. You may look for a sense of where the middle is, and what are low and high
values, but the exact numbers are not important. You get a sense of shape – symmetry or
skewness.

Although we present frequency distributions and histograms as a single topic, they serve very
different purposes. Defining groups is very important for frequency distributions (tables), but
may not be important for histograms (charts). This distinction in how we mentally process
information will be revisited several times throughout the text.

5.5 Comparing Histograms in Excel


What if we wanted to compare the pattern of salaries for Arts students to Business students?
This might give deeper insight than just looking at averages as done before. You will need to
Filter on Business, then copy the data to a new sheet. Then you would do it for Arts and copy.
Then again for Science.

You could then construct 3 histograms and compare them. But this comparison is hard to do
visually.



Figure 5-22 - Comparing Histograms

How do you compare these distributions? Arts and Business samples are similar size, but Science
is less than half the size. If comparing frequency distributions or histograms, you need
comparable scales. Convert each frequency to percentage of the total sample size. We call these
relative frequencies.

We must take the three frequency distributions and combine them in a single table, with
columns for each of Arts, Business and Science. For each, create three new columns of relative
frequencies, by dividing the frequencies by the corresponding sample sizes.
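
This section works in Excel, but for readers following the Python track the same calculation can be sketched with pd.crosstab. This is an alternative under stated assumptions, not the method used for the figures: the data values and bin edges are made up, and Salary_Group is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df_Grad_Exp = pd.DataFrame({
    "Program2": ["Arts", "Arts", "Arts", "Arts",
                 "Business", "Business", "Business", "Business",
                 "Science", "Science"],
    "Salary": [35000, 45000, 48000, 62000,
               42000, 55000, 58000, 75000,
               52000, 68000],
})
df_Grad_Exp["Salary_Group"] = pd.cut(df_Grad_Exp["Salary"], bins=[20000, 40000, 60000, 80000])

# Frequencies: one row per salary group, one column per program.
freq = pd.crosstab(df_Grad_Exp["Salary_Group"], df_Grad_Exp["Program2"])

# Relative frequencies: each column divided by that program's sample size.
rel_freq = pd.crosstab(df_Grad_Exp["Salary_Group"], df_Grad_Exp["Program2"],
                       normalize="columns").round(2)
print(freq)
print(rel_freq)
```

Because each column of relative frequencies sums to 1, programs with very different sample sizes can be compared on the same scale.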



Figure 5-23 Frequencies and Relative Frequencies

Select the relative frequency columns and Insert a Clustered Column Chart.

Figure 5-23 Relative Frequencies as a Clustered Column Chart

Format the chart with an informative title and labelled axes.



Figure 5-24 Salary Expectations by Program

This chart allows you to more easily compare the patterns among the programs. We will see in
the next chapter an alternative way to summarize data and build charts that is more flexible
than using the Histogram function.

In a later chapter, we will look at a graphical technique that is better suited to comparing the
distributions of many groups. Although histograms are simple charts, they still contain too much
detail for comparing multiple groups.

5.6 Common Shapes of Histograms (Distributions)


What are these distributions and graphs showing us?
• Where are most values located?
• What are small (large) values?
• Are there many small (large) values?
• Are values widely scattered or closely bunched?
• Are there lots of stragglers at the top or bottom?
• Is the distribution symmetrical? Bell shaped?
• Is the distribution skewed with a long tail to the left? or the right?

Classic patterns include the Bell shape with balanced tails on left and right – most common when
summarizing averages. This curve is known as the Normal Distribution and is the basis of much
of statistical analysis.



Figure 5-25 - Normal Distribution

Skewed distributions with a long tail to left or right are very common. Long right tails are
common with financial data.

Figure 5-26 - Skewed distributions

Distributions that are lumpy are often ones that describe two distinct groups. For example, if we
were to construct a histogram of summer earnings (q10 in the original data set) and we had the
first group ending in zero, we would get



Figure 5-27 - Lumpy distribution

The first group represents those students who did not have a summer job and so earned
nothing, and the rest of the distribution is those with jobs. What causes the spikes? Most
students rounded their estimate to the nearest $1,000, but the chart has groupings in $500
increments. A different grouping would result in a smoother histogram. However, unless your
sample is extremely large, most histograms have some lumpiness due to randomness in the
observations. This is to be expected and is not a deficiency in the quality of the data.

Image Citations:
Figures 5-1 to 5-24: Images courtesy of author using Microsoft Excel
Figure 5-25: Image courtesy of Shishirdasika under CC BY-SA 3.0
Figures 5-26 to 5-27: Images courtesy of author using Microsoft Excel

