
MTH 4407

Interactive Computational
Methods in Data Analysis
(Lecture 6)
Dr. Farid Zamani
Room: 207 | faridzamani@upm.edu.my
Department of Mathematics and Statistics
Faculty of Science
Response Transformation for linear regression

• Let’s look at some (fictional) salary data from the (fictional) company Initech.

• We will try to model salary as a function of years of experience.

• The data can be found in initech.csv.


• This model appears significant, but does it meet the model assumptions?

• However, from the fitted versus residuals plot it appears there is non-constant variance.
Specifically, the variance increases as the fitted value increases.
• One remedy is to model the logarithm of salary instead of salary itself. Fitting this log-transformed
model in R requires only a minor modification to our formula specification.

• Plotting the data on the transformed log scale and adding the fitted line, the relationship again
appears linear, and we can already see that the variation about the fitted line looks constant.
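
• A minimal sketch of the two fits, under stated assumptions: initech.csv is not loaded here, so the data
frame initech_sim is a simulated stand-in, and the column names years and salary are assumptions. With the
real file, read.csv("initech.csv") would take the place of the simulation.

```r
# Simulated stand-in for initech.csv (NOT the real data).
# The column names `years` and `salary` are assumptions.
set.seed(42)
years  <- runif(100, min = 1, max = 25)
salary <- exp(10.5 + 0.08 * years + rnorm(100, sd = 0.2))
initech_sim <- data.frame(years = years, salary = salary)

# Untransformed model: salary as a linear function of years of experience.
fit <- lm(salary ~ years, data = initech_sim)

# Log-transformed response: only the left-hand side of the formula changes.
fit_log <- lm(log(salary) ~ years, data = initech_sim)

# Fitted-versus-residuals plots for both models, side by side.
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals",
     main = "salary ~ years")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log), xlab = "Fitted", ylab = "Residuals",
     main = "log(salary) ~ years")
abline(h = 0, lty = 2)
```

In the simulated data the spread of the residuals grows with the fitted value for the first model and stays
roughly even for the second, which is the pattern described above.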
Introduction: Data Transformation

• Data transformation is one of the key aspects of business data analysis, data science, and even the
preparatory work for artificial intelligence.

• The work of data science begins with a dataset. These datasets can be so large that any manual inspection
or review of them, say using editing software like TextEdit or Notepad++, becomes totally infeasible.

• To overcome this, data scientists rely on computational tools like R for working with datasets.

• Data scientists talk a lot about the importance of data cleaning, stating that without data cleaning no data
analysis results are meaningful.

• Some go further and say that data cleaning is the most important step in the data science life cycle
because, from their point of view, the analysis that follows data cleaning is largely routine.

• As such, another important aspect of working with datasets is transforming data, i.e., rendering data
suitable for analysis.
• The R package that we will use here is tidyverse.

• The tidyverse is an opinionated collection of R packages designed for data science.

• Let’s install and load the package first.

• Let us begin by looking at the Motor Trend Car Road Test dataset (mtcars), which ships with base R
in the datasets package. It was extracted from the 1974 Motor Trend US magazine and contains data about
fuel consumption and aspects of automobile design for 32 car models.
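
• A minimal sketch of the setup and a first look at the data. It attaches dplyr, the tidyverse package that
provides the transformation functions in this lecture; library(tidyverse) attaches it as well.

```r
# One-time installation (uncomment to run; needs an internet connection):
# install.packages("tidyverse")

# dplyr is one of the core tidyverse packages.
library(dplyr)

# mtcars is available as soon as R starts.
dim(mtcars)    # 32 rows (car models) and 11 columns (variables)
head(mtcars)   # first six rows
```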
Filtering a Data Set

• We will want to filter a data set so that we only see the observations that satisfy a certain
condition. For example, suppose we want to see only the cars in the mtcars data set that have at
least 6-cylinder engines. The code for this is:
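
• A minimal sketch of this filter, assuming dplyr is attached:

```r
library(dplyr)

# Keep only the rows of mtcars whose cyl value is 6 or more.
filter(mtcars, cyl >= 6)
```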

• We can see the full filtered data set by assigning it a name and then viewing it:
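
• A minimal sketch of storing the filtered result under a name and viewing it. The name mtcars_6plus is
hypothetical.

```r
library(dplyr)

# Store the filtered rows under a (hypothetical) name.
mtcars_6plus <- filter(mtcars, cyl >= 6)

# Print the whole result in the console.
mtcars_6plus

# In an interactive session, View() opens a spreadsheet-style viewer.
if (interactive()) View(mtcars_6plus)
```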

• In general, the syntax for the filter function is:
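
• A sketch of the general shape, with placeholder names in capitals, followed by one concrete call. The
second condition (mpg > 20) is chosen only for illustration.

```r
library(dplyr)

# General shape (placeholders, not runnable as written):
#   filter(DATA_NAME, CONDITION_1, CONDITION_2, ...)
# Conditions separated by commas must all hold for a row to be kept.

both <- filter(mtcars, cyl >= 6, mpg > 20)
both
```

Separating conditions with commas gives the same rows as joining them with &.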


• The condition arguments in filter are logical statements, meaning they are statements whose
values are either TRUE or FALSE.

• Logical statements are often expressed as equations or inequalities. For example, suppose we
make the assignments x <- 7 and y <- 9.

• The following table lists various ways to construct logical statements in terms of these two
variables.
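
• The comparison operators can be tried directly at the console with these two assignments:

```r
x <- 7
y <- 9

x == y   # FALSE (equal to)
x != y   # TRUE  (not equal to)
x < y    # TRUE  (less than)
x <= y   # TRUE  (less than or equal to)
x > y    # FALSE (greater than)
x >= y   # FALSE (greater than or equal to)
```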
• Notice that the double equal sign == is used to check equality. You can also negate statements or
form compound statements. For example:
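
• A short sketch of negation (!), "and" (&), and "or" (|), reusing x <- 7 and y <- 9. The comparison with 10
is chosen only for illustration.

```r
x <- 7
y <- 9

!(x == y)            # TRUE:  x == y is FALSE, and ! flips it
(x < y) & (y > 10)   # FALSE: both parts must be TRUE
(x < y) | (y > 10)   # TRUE:  at least one part is TRUE
```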
• The values TRUE and FALSE can actually be handled arithmetically in R.

• The value TRUE is given the numerical value 1, and FALSE is given 0. For example:
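
• A few one-line illustrations of this arithmetic:

```r
TRUE + TRUE                 # 2
TRUE * 5                    # 5
as.numeric(FALSE)           # 0
sum(c(TRUE, FALSE, TRUE))   # 2, the number of TRUE values
```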

• This is useful when you want to count how many values in a column satisfy a certain condition. For
example, suppose we want to know how many cars in mtcars have 4 gears.

• The condition mtcars$gear == "4" assigns a value of TRUE or FALSE to all of the values in the gear
column of mtcars depending on whether the gear value is “4” or not.
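
• A minimal sketch of the count. The gear column is stored as a number, so the comparison is written with 4
here; the quoted form "4" gives the same result because R converts the numbers to text before comparing.
The name is_four is hypothetical.

```r
# One TRUE/FALSE value per car.
is_four <- mtcars$gear == 4
head(is_four)

# Each TRUE counts as 1, so the sum is the number of 4-gear cars.
sum(is_four)   # 12
```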
• We can also use the mean function to calculate the proportion of values in a column that satisfy
a condition (multiply by 100 for a percentage). For example, the proportion of cars in mtcars with 4
gears is calculated by:

• The reason this works is that mean finds the average of all of the 1’s and 0’s assigned to the
values of the gear column by the statement mtcars$gear == "4"
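
• A minimal sketch of this calculation, written with the numeric value 4 since gear is a numeric column:

```r
# Average of the 1's and 0's: the share of cars with 4 gears.
mean(mtcars$gear == 4)         # 0.375

# The same share expressed as a percentage.
100 * mean(mtcars$gear == 4)   # 37.5
```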
Sorting a Data Set
• The function used to sort a data set is arrange.

• It arranges the observations in order and takes a column name or a set of column names as
arguments.

• For a single column name, the rows are sorted by that column’s values, and the other columns move
along with their rows. For multiple column names, each additional column breaks ties in the values of
the preceding columns.

• For example, suppose we want to sort the observations in mtcars according to the gross
horsepower (hp):
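
• A minimal sketch of this sort, assuming dplyr is attached. The name by_hp is hypothetical.

```r
library(dplyr)

# Sort the rows of mtcars by hp, smallest first.
by_hp <- arrange(mtcars, hp)
head(by_hp)
```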
• By default, arrange sorts the observations in ascending order. To sort in descending order, do the
following:
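
• A minimal sketch using dplyr's desc helper inside arrange. The name by_hp_desc is hypothetical.

```r
library(dplyr)

# Wrap the column in desc() to sort from largest to smallest.
by_hp_desc <- arrange(mtcars, desc(hp))
head(by_hp_desc)
```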

• We can also include a “tie-breaker” variable in arrange.
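
• A sketch of a tie-breaker. The choice of cyl as the main sort column and mpg as the tie-breaker is only an
illustration, and the name by_cyl_mpg is hypothetical.

```r
library(dplyr)

# Sort by cyl first; cars with the same cyl are then ordered by mpg.
by_cyl_mpg <- arrange(mtcars, cyl, mpg)
head(by_cyl_mpg, 12)
```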


The Pipe

• We will often want to use two or more transformation functions in succession.

• For example, suppose we want to remove the cars in mtcars that get less than 20 miles per
gallon (mpg), keeping those with at least 20, and then sort the remaining cars according to
displacement (disp) in descending order. There are a couple of naive ways to do this. One is:
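
• A sketch of three equivalent ways to write this, assuming "filter out" means keeping the cars with mpg of
at least 20. The names nested, kept, stepwise, and piped are hypothetical, and the pipe shown is %>%, which
dplyr makes available.

```r
library(dplyr)

# Naive way 1: nest one call inside the other (read from the inside out).
nested <- arrange(filter(mtcars, mpg >= 20), desc(disp))

# Naive way 2: store an intermediate result, then transform it again.
kept     <- filter(mtcars, mpg >= 20)
stepwise <- arrange(kept, desc(disp))

# With the pipe: each step passes its result on as the first argument
# of the next function, so the code reads top to bottom.
piped <- mtcars %>%
  filter(mpg >= 20) %>%
  arrange(desc(disp))

piped
```

All three produce the same rows in the same order; the pipe avoids both the nesting and the extra
intermediate name.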
Selecting Columns
• A very common procedure when analyzing a data set is to get rid of columns that are irrelevant
to your analysis.

• This is accomplished using the select function.

• The function selects columns by specifying their names or positions.

• For example, suppose we want to pare down the mtcars data set so that the only variables
displayed are mpg, cyl, disp, and wt.
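
• A minimal sketch of this selection, assuming dplyr is attached. The name small is hypothetical.

```r
library(dplyr)

# Keep only the four named columns, in the order given.
small <- select(mtcars, mpg, cyl, disp, wt)
head(small)
```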
• It’s also often easier to specify the columns you don’t want. For example, suppose we want to
keep every column of mtcars except gear and carb:
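
• A minimal sketch that drops columns by putting a minus sign in front of their names. The name trimmed is
hypothetical.

```r
library(dplyr)

# Keep every column except gear and carb.
trimmed <- select(mtcars, -gear, -carb)
head(trimmed)
```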
Renaming columns

• The rename function allows you to rename specific columns.

• The syntax is NEW_NAME = OLD_NAME. Below, we replace the name wt with weight and
cyl with cylinder.
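
• A minimal sketch of this renaming, assuming dplyr is attached. The name renamed is hypothetical.

```r
library(dplyr)

# NEW_NAME = OLD_NAME, one pair per column to rename.
renamed <- rename(mtcars, weight = wt, cylinder = cyl)
names(renamed)
```

The columns keep their positions and values; only the two names change.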
Relocating column positions
• Sometimes you may want to change the order of columns by moving a column from the present
location to another.

• We can relocate a column using the relocate function by specifying which column should go
where. The syntax is relocate(DATA_NAME,ATTRIBUTE,NEW_LOCATION).

• The specification for the new location is either by .before=NAME or by .after=NAME, where
NAME is the name of a column.
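
• A minimal sketch of both forms. Moving wt is chosen only for illustration, and the names wt_first and
wt_after_cyl are hypothetical.

```r
library(dplyr)

# Move wt so that it sits immediately before mpg (the first column).
wt_first <- relocate(mtcars, wt, .before = mpg)
names(wt_first)

# Move wt so that it sits immediately after cyl.
wt_after_cyl <- relocate(mtcars, wt, .after = cyl)
names(wt_after_cyl)
```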
ACTIVITY 1
These exercises require the nycflights13 package you installed earlier. Be sure the library is
loaded.

1. The nycflights13 package contains a data set called flights. Load this data set and read its
documentation: ?flights. How many observations does it have? How many variables?
2. Use filter to find all of the flights that
a) departed in February.
b) were operated by United or American Airlines.
c) departed in summer (June, July, and August).
d) arrived more than two hours late, but did not leave late.
e) were delayed by more than an hour, but made up over 30 minutes during the flight.
f) departed between midnight and 6am.

3. How many flights were canceled? (Think about how a canceled flight might be detected from
the data set.)

4. What was the on-time arrival rate for Delta Airlines during 2013? What was it during the
winter months (January, February, and December)? A flight that did not arrive, due to a
cancellation, crash, emergency landing, etc., should not figure into the on-time arrival rate.
ACTIVITY 2
These exercises require the flights data set from the nycflights13 library.

1. Where does arrange sort the NA values in a column? (Experiment with a data set that has missing
values.) How could you force arrange to sort all of the NA values to the top of the list? (Try
using is.na.)
2. What was the longest delay of any flight?
3. What flight left the earliest in the day?
4. What flight averaged the fastest speed while in the air? (Average
speed is the total distance traveled divided by the total time spent
in the air.) What flight averaged the slowest speed?
5. What flight traveled the farthest distance? Which one traveled the
shortest distance?
TERIMA KASIH / THANK YOU
www.upm.edu.my
