Interactive Computational
Methods in Data Analysis
(Lecture 6)
Dr. Farid Zamani
Room: 207 | faridzamani@upm.edu.my
Department of Mathematics and Statistics
Faculty of Science
Response Transformation for linear regression
• Let’s look at some (fictional) salary data from the (fictional) company Initech.
• When the data are plotted on the log-transformed scale with the fitted line added, the relationship
again appears linear, and the variation about the fitted line looks constant.
Introduction: Data Transformation
• Data transformation is a key part of business data analysis and data science, and of the preparatory
work for artificial intelligence.
• The work of data science begins with a dataset. These datasets can be so large that any manual inspection
or review of them, say using editing software like TextEdit or Notepad++, becomes totally infeasible.
• To overcome this, data scientists rely on computational tools like R for working with datasets.
• Data scientists talk a lot about the importance of data cleaning, stating that without data cleaning no data
analysis results are meaningful.
• Some go further and say that data cleaning is the most important step in the data science life cycle
because, from their point of view, the analysis that follows data cleaning is largely routine.
• Another important aspect of working with datasets is transforming data, i.e., rendering the data
suitable for analysis.
• The R package that we will use here is tidyverse.
• Let us begin by looking at the Motor Trend Car Road Tests dataset (mtcars), which ships with base R
(in the datasets package), so it is available whether or not tidyverse is loaded. It was extracted from
the 1974 Motor Trend US magazine and contains data about fuel consumption and aspects of automobile
design for 32 car models.
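A minimal sketch of a first look at the data. It assumes the dplyr package (one of the tidyverse packages) is installed; loading all of tidyverse with library(tidyverse) would work equally well.

```r
# dplyr provides the data-transformation functions used in this lecture
library(dplyr)

# mtcars is a built-in data frame; each row is one car model
dim(mtcars)      # 32 rows and 11 columns
head(mtcars)     # first six rows
glimpse(mtcars)  # compact overview of every column and its type
```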
Filtering a Data Set
• We will want to filter a data set so that we only see the observations that satisfy a certain
condition. For example, suppose we want to see only the cars in the mtcars data set that have at
least 6-cylinder engines. The code for this is:
• To see the full filtered data set, assign it a name and then view it:
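A sketch of both steps using dplyr's filter function. The name six_plus is a hypothetical choice made for this illustration.

```r
library(dplyr)

# Keep only the rows where the number of cylinders is 6 or more
filter(mtcars, cyl >= 6)

# Assign the filtered data set a name, then display it
six_plus <- filter(mtcars, cyl >= 6)
print(six_plus)
# In RStudio, View(six_plus) opens the same data in a spreadsheet-style viewer
```

The first argument of filter is the data set; the second is a logical condition that each row must satisfy to be kept.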
• Logical statements are often expressed as equations or inequalities. For example, suppose we
make the assignments x <- 7 and y <- 9.
• The following table lists various ways to construct logical statements in terms of these two
variables.
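The standard R comparison operators, evaluated with these values of x and y, can be sketched as:

```r
x <- 7
y <- 9

x < y    # TRUE:  x is less than y
x > y    # FALSE: x is not greater than y
x <= y   # TRUE:  x is less than or equal to y
x >= y   # FALSE: x is not greater than or equal to y
x == y   # FALSE: x is not equal to y
x != y   # TRUE:  x differs from y
```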
• Notice that the double equal sign == is used to check equality. You can also negate statements or
form compound statements. For example:
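A short sketch of negation (!) and of compound statements built with "and" (&) and "or" (|), reusing the same x and y:

```r
x <- 7
y <- 9

!(x == y)        # TRUE:  negation of a FALSE statement
x < y & y < 10   # TRUE:  "and" needs both parts to be TRUE
x > y | y == 9   # TRUE:  "or" needs at least one part to be TRUE
x > y & y == 9   # FALSE: the first part is FALSE
```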
• The values TRUE and FALSE can actually be handled arithmetically in R.
• The value TRUE is given the numerical value 1, and FALSE is given 0. For example:
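A few one-line illustrations of logical values being treated as numbers:

```r
TRUE + TRUE                       # 2
TRUE + FALSE                      # 1
as.numeric(FALSE)                 # 0
sum(c(TRUE, FALSE, TRUE, TRUE))   # 3, the number of TRUE values
```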
• This is useful when you want to count how many values in a column satisfy a certain condition. For
example, suppose we want to know how many cars in mtcars have 4 gears.
• The condition mtcars$gear == 4 assigns a value of TRUE or FALSE to each of the values in the gear
column of mtcars depending on whether the gear value is 4 or not. (The gear column is numeric, so
the 4 needs no quotation marks.)
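The count can be sketched with sum, which adds up the TRUE values as 1s. The name four_gear_count is a hypothetical choice for this illustration.

```r
# Each car gets TRUE if it has 4 gears and FALSE otherwise;
# sum() then counts the TRUE values
four_gear_count <- sum(mtcars$gear == 4)
four_gear_count   # 12
```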
• We can also use the mean function to calculate the proportion of values in a column that satisfy
a condition (multiply by 100 for a percentage). For example, the proportion of cars in mtcars with
4 gears is calculated by:
• The reason this works is that mean finds the average of all of the 1’s and 0’s assigned to the
values of the gear column by the statement mtcars$gear == 4, and the average of 1’s and 0’s is
the fraction of 1’s.
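A sketch of the same idea with mean. The name four_gear_share is hypothetical.

```r
# mean() of a logical vector is the fraction of TRUE values
four_gear_share <- mean(mtcars$gear == 4)
four_gear_share         # 0.375, i.e. 12 of the 32 cars
four_gear_share * 100   # 37.5 percent
```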
Sorting a Data Set
• The arrange function sorts the observations (rows) of a data set. It takes a column name or a set of
column names as arguments.
• Given a single column name, it sorts the rows by the values in that column, and the other columns
move along with it. Given multiple column names, each additional column breaks ties in the values
of the preceding columns.
• For example, suppose we want to sort the observations in mtcars according to the gross
horsepower (hp):
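A minimal sketch with dplyr's arrange function. The name by_hp is a hypothetical choice.

```r
library(dplyr)

# Sort the rows from the lowest to the highest horsepower
by_hp <- arrange(mtcars, hp)
head(by_hp)

# A second column breaks ties: cars with equal cyl are ordered by hp
head(arrange(mtcars, cyl, hp))
```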
• By default, arrange sorts the observations in ascending order. To sort in descending order, do the
following:
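Descending order is requested by wrapping the column name in dplyr's desc helper. A sketch, with by_hp_desc as a hypothetical name:

```r
library(dplyr)

# Sort the rows from the highest to the lowest horsepower
by_hp_desc <- arrange(mtcars, desc(hp))
head(by_hp_desc)
```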
• For example, suppose we want to filter out the cars in mtcars that get less than 20 miles per
gallon (mpg) and then sort the remaining cars according to displacement (disp) in descending
order. There are a couple of naive ways to do this. One is:
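Two such approaches can be sketched as follows. The sketch assumes that "filter out" means removing the cars below 20 mpg, so the rows kept satisfy mpg >= 20. The names efficient, efficient_sorted and nested_sorted are hypothetical.

```r
library(dplyr)

# Way 1: store the intermediate result under its own name
efficient <- filter(mtcars, mpg >= 20)
efficient_sorted <- arrange(efficient, desc(disp))

# Way 2: nest one function call inside the other
nested_sorted <- arrange(filter(mtcars, mpg >= 20), desc(disp))

efficient_sorted
```

Both give the same result. The first leaves an extra intermediate object in the workspace, and the second has to be read from the inside out.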
Selecting Columns
• A very common procedure when analyzing a data set is to get rid of columns that are irrelevant
to your analysis.
• For example, suppose we want to pare down the mtcars data set so that the only variables
displayed are mpg, cyl, disp, and wt.
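A sketch with dplyr's select function, which keeps the listed columns in the order given. The name small is hypothetical.

```r
library(dplyr)

# Keep only the four columns of interest
small <- select(mtcars, mpg, cyl, disp, wt)
head(small)
```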
• It’s also often easier to specify the columns you don’t want. For example, suppose we want to
keep every column of mtcars except gear and carb:
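A sketch of dropping columns by putting a minus sign in front of them. The name no_gear_carb is hypothetical.

```r
library(dplyr)

# Keep everything except gear and carb
no_gear_carb <- select(mtcars, -c(gear, carb))
names(no_gear_carb)
```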
Renaming columns
• Columns are renamed with the rename function. The syntax is NEW_NAME = OLD_NAME. Below, we
replace the name wt with weight and cyl with cylinder.
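A sketch of this renaming with dplyr's rename function. The name renamed is a hypothetical choice. The original mtcars is left unchanged.

```r
library(dplyr)

# New name on the left, old name on the right
renamed <- rename(mtcars, weight = wt, cylinder = cyl)
names(renamed)
```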
Relocating column positions
• Sometimes you may want to change the order of columns by moving a column from the present
location to another.
• We can relocate a column using the relocate function by specifying which column should go
where. The syntax is relocate(DATA_NAME,ATTRIBUTE,NEW_LOCATION).
• The specification for the new location is either by .before=NAME or by .after=NAME, where
NAME is the name of a column.
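A sketch of both forms of the location argument. The choice of columns and the names moved_before and moved_after are hypothetical, made only for illustration.

```r
library(dplyr)

# Move wt so that it sits immediately before mpg (wt becomes the first column)
moved_before <- relocate(mtcars, wt, .before = mpg)
names(moved_before)

# Move hp so that it sits immediately after mpg
moved_after <- relocate(mtcars, hp, .after = mpg)
names(moved_after)
```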
ACTIVITY 1
These exercises require the nycflights13 package you installed earlier. Be sure the library is
loaded.
1. The nycflights13 package contains a data set called flights. Load this data set and read its
documentation: ?flights. How many observations does it have? How many variables?
2. Use filter to find all of the flights that
a) departed in February.
b) were operated by United or American Airlines.
c) departed in summer (June, July, and August).
d) arrived more than two hours late, but did not leave late.
e) were delayed by more than an hour, but made up over 30 minutes during the flight.
f) departed between midnight and 6am.
3. How many flights were canceled? (Think about how a canceled flight might be detected from
the data set.)
4. What was the on-time arrival rate for Delta Airlines during 2013? What was it during the
winter months (January, February, and December)? A flight that did not arrive, due to a
cancellation, crash, emergency landing, etc., should not figure into the on-time arrival rate.
ACTIVITY 2
These exercises require the flights data set from the nycflights13 package.