
What is said about ...

Data scientists spend from 50 to 80 percent of their time wrangling big data.
Source: NY Times
The remaining time (as little as 20 percent) goes to plotting and fitting
models, so we focus on the wrangling 80 percent.

Data Analysis using R February 26, 2016 2 / 43


Hukam Singh Rana Hitesh Kumar Sharma Data Analysis using R February 26, 2016 3 / 43
Data Analysis

Data analysis is a process that applies statistical methods to data to gain
knowledge and insight.
Steps in Data Analysis
• Store the data
• Transform the data
• Visualization
• Model fitting


Data wrangling packages in R

• tidyr - to make the data tidy
• plyr - split-apply-combine
• dplyr - the next iteration of plyr, focused on data frames
• reshape2 - to reshape the data

Tidy Data

Hadley Wickham defines tidy data as follows:

• Every value belongs to a variable and an observation.
• Each variable is stored in its own column.
• Each observation is stored in its own row.

Tidy Data

> load("D:/new/table1.rdata")
> table1
      country year  cases population
1 Afghanistan 1999    745   19987071
2 Afghanistan 2000   2666   20595360
3      Brazil 1999  37737  172006362
4      Brazil 2000  80488  174504898
5       China 1999 212258 1272915272
6       China 2000 213766 1280428583
cases refers to the number of people diagnosed with TB per country per year.
Question: calculate the rate of TB cases per country per year.

Solution

> rate<-table1$cases/table1$population
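
A minimal sketch of this solution, assuming table1 is rebuilt by hand from the values printed earlier (the original table1.rdata file is not available here). The rate_per_10k column is an illustrative addition and is not part of the slides.

```r
# table1 rebuilt by hand from the printed values (assumption: the
# original .rdata file holds exactly these rows)
table1 <- data.frame(
  country    = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year       = rep(c(1999, 2000), times = 3),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362,
                 174504898, 1272915272, 1280428583)
)

# Every variable sits in its own column, so the rate is a single
# vectorised division with no reshaping needed
table1$rate <- table1$cases / table1$population

# Hypothetical extra column: cases per 10,000 people, easier to read
table1$rate_per_10k <- round(table1$rate * 10000, 2)

print(table1[, c("country", "year", "rate_per_10k")])
```

The same one-line division is not possible with the other layouts of this data shown next, which is why they have to be tidied first.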

Tidy Data
> load("D:/new/table2.rdata")
> table2
       country year        key      value
1  Afghanistan 1999      cases        745
2  Afghanistan 1999 population   19987071
3  Afghanistan 2000      cases       2666
4  Afghanistan 2000 population   20595360
5       Brazil 1999      cases      37737
6       Brazil 1999 population  172006362
7       Brazil 2000      cases      80488
8       Brazil 2000 population  174504898
9        China 1999      cases     212258
10       China 1999 population 1272915272
11       China 2000      cases     213766
12       China 2000 population 1280428583
Question: calculate the rate of TB cases per country per year.

Tidy Data

> load("D:/new/table3.rdata")
> table3
      country year              rate
1 Afghanistan 1999      745/19987071
2 Afghanistan 2000     2666/20595360
3      Brazil 1999   37737/172006362
4      Brazil 2000   80488/174504898
5       China 1999 212258/1272915272
6       China 2000 213766/1280428583
Question: calculate the rate of TB cases per country per year.

Tidy Data

> load("D:/new/table4.rdata")
> load("D:/new/table5.rdata")
> table4
      country   1999   2000
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766
> table5
      country       1999       2000
1 Afghanistan   19987071   20595360
2      Brazil  172006362  174504898
3       China 1272915272 1280428583

1. tidyr package

The tidyr package is written by Hadley Wickham.
Important functions of this package are:

• spread(): transform a long table into a wide table
• gather(): transform a wide table into a long table
• separate(): split one column into multiple columns
• unite(): combine multiple columns into one column


spread()

• spread() returns a copy of your data set that has had the key and value
columns removed.

• In their place, spread() adds a new column for each unique value of the
key column.

• These unique values will form the column names of the new columns.

• spread() distributes the cells of the former value column across the
cells of the new columns and truncates any non-key, non-value
columns in a way that prevents duplication.
spread() turns a pair of key:value columns into a set of tidy columns.


> library(tidyr)
> spread(table2,key,value)
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
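
As a hedged illustration, the spread() call can be reproduced on a hand-built excerpt of table2 (its first eight rows; the .rdata file itself is assumed unavailable). In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), so the sketch shows both. The object names wide and wide_new are made up for this example.

```r
library(tidyr)

# Excerpt of table2 rebuilt by hand (assumption: same values as the slide)
table2 <- data.frame(
  country = rep(c("Afghanistan", "Brazil"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 2),
  key     = rep(c("cases", "population"), times = 4),
  value   = c(745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898)
)

# Each unique value of `key` becomes a column, filled from `value`
wide <- spread(table2, key, value)
print(wide)

# Same reshaping with the newer interface (requires tidyr >= 1.0.0)
wide_new <- pivot_wider(table2, names_from = key, values_from = value)

# Once the data is tidy, the rate is a one-line division again
wide$rate <- wide$cases / wide$population
```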

gather()
gather() is the reverse of spread(): it collects a set of columns into a key:value pair of columns.
> library(tidyr)
> gather(table1,key="KEYS",value="DataValue",3:4)
Source: local data frame [12 x 4]

       country  year       KEYS  DataValue
        (fctr) (int)      (chr)      (int)
1  Afghanistan  1999      cases        745
2  Afghanistan  2000      cases       2666
3       Brazil  1999      cases      37737
4       Brazil  2000      cases      80488
5        China  1999      cases     212258
6        China  2000      cases     213766
7  Afghanistan  1999 population   19987071
8  Afghanistan  2000 population   20595360
9       Brazil  1999 population  172006362
10      Brazil  2000 population  174504898
11       China  1999 population 1272915272
12       China  2000 population 1280428583
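
The claim that gather() reverses spread() can be checked with a round trip. This is a sketch under the assumption that table1 matches the values printed on the earlier slide; the columns are selected by name instead of by position (3:4), which is equivalent here. pivot_longer() is the newer interface in tidyr 1.0.0 and later.

```r
library(tidyr)

# table1 rebuilt by hand from the printed values (assumption)
table1 <- data.frame(
  country    = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year       = rep(c(1999, 2000), times = 3),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362,
                 174504898, 1272915272, 1280428583)
)

# Wide to long: the two measure columns collapse into KEYS / DataValue
long <- gather(table1, key = "KEYS", value = "DataValue", cases, population)

# Long back to wide: spread() undoes the gather()
back <- spread(long, KEYS, DataValue)
print(all(back$cases == table1$cases))

# Newer equivalent of the gather() call (requires tidyr >= 1.0.0)
long_new <- pivot_longer(table1, c(cases, population),
                         names_to = "KEYS", values_to = "DataValue")
```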
separate()

separate() splits one column into multiple columns. Usage:

separate(data, columnToBeSeparated, into = names of new columns, sep = regex of separator)

By default, sep matches any sequence of non-alphanumeric characters.
> table3
Source: local data frame [6 x 3]

      country  year              rate
       (fctr) (int)             (chr)
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583

separate()

> separate(table3,rate,into=c("cases","population"))
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (chr)      (chr)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583


separate()

> separate(table3,rate,into=c("cases","population"),sep="/")
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (chr)      (chr)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

separate()

By default the new columns are character vectors; convert=TRUE converts them to a suitable type (here, integer).
> t<-separate(table3,rate,into=c("cases","population"),sep="/",convert=TRUE)
> t
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
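
The difference between the character and integer results of separate() matters as soon as you compute with the new columns. A minimal sketch, assuming a three-row excerpt of table3 rebuilt by hand; the names as_text and as_int are illustrative only.

```r
library(tidyr)

# Excerpt of table3 rebuilt by hand (assumption: same values as the slide)
table3 <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  year    = c(1999, 1999, 1999),
  rate    = c("745/19987071", "37737/172006362", "212258/1272915272"),
  stringsAsFactors = FALSE
)

# Without convert the pieces stay character strings
as_text <- separate(table3, rate, into = c("cases", "population"), sep = "/")
print(class(as_text$cases))

# With convert = TRUE the pieces are converted to a numeric type
as_int <- separate(table3, rate, into = c("cases", "population"),
                   sep = "/", convert = TRUE)
print(class(as_int$cases))

# Arithmetic only works on the converted version
as_int$rate <- as_int$cases / as_int$population
```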

separate()
You can also pass an integer or a vector of integers to sep. separate() will
interpret the integers as positions to split at. Positive values start at 1 at the
far left of the strings; negative values start at -1 at the far right of the strings.
> t1<-separate(t,year,into=c("centuary","year"),sep=2)
> t1
Source: local data frame [6 x 5]

      country centuary  year  cases population
       (fctr)    (chr) (chr)  (int)      (int)
1 Afghanistan       19    99    745   19987071
2 Afghanistan       20    00   2666   20595360
3      Brazil       19    99  37737  172006362
4      Brazil       20    00  80488  174504898
5       China       19    99 212258 1272915272
6       China       20    00 213766 1280428583

unite()

unite() is the reverse of separate(): it combines multiple columns into one.

> ut<-unite(t1,"year1",centuary,year,sep="")
> ut
Source: local data frame [6 x 4]

      country year1  cases population
       (fctr) (chr)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

unite()

unite() is the reverse of separate().

> ut1<-unite(t1,"rate",cases,population,sep="/")
> ut1
Source: local data frame [6 x 4]

      country centuary  year              rate
       (fctr)    (chr) (chr)             (chr)
1 Afghanistan       19    99      745/19987071
2 Afghanistan       20    00     2666/20595360
3      Brazil       19    99   37737/172006362
4      Brazil       20    00   80488/174504898
5       China       19    99 212258/1272915272
6       China       20    00 213766/1280428583
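
A short round trip shows positional splitting and unite() working as inverses. This is only a sketch on a made-up two-row data frame (df, parts, and whole are hypothetical names); the year column is stored as character so the split positions are unambiguous.

```r
library(tidyr)

# Tiny illustrative data frame (not one of the slide tables)
df <- data.frame(
  year  = c("1999", "2000"),
  cases = c(745, 2666),
  stringsAsFactors = FALSE
)

# sep = 2 splits after the second character: "1999" -> "19" and "99"
parts <- separate(df, year, into = c("century", "yy"), sep = 2)
print(parts)

# sep = "" pastes the pieces back together with nothing in between
whole <- unite(parts, "year", century, yy, sep = "")
print(identical(whole$year, df$year))
```

Note that unite() inserts an underscore between the pieces when sep is left at its default, so an explicit sep is needed to rebuild the original strings.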
help

For more information, type:

> ?spread
> ?gather
> ?separate
> ?unite

2. dplyr: a grammar of data manipulation

dplyr is the next iteration of plyr (a tool for manipulating all kinds of data structures).
dplyr provides a flexible grammar of data manipulation.
It is a tool for manipulating data frames (tables).

Grammar of dplyr

• select: return a subset of the columns of a data frame
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder the rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarise: generate summary statistics of different variables in the data frame
• %>%: the "pipe" operator is used to connect multiple verb actions together into a pipeline

Common Properties of dplyr functions

• The first argument is a data frame.
• The subsequent arguments describe what to do with the data frame specified
  in the first argument, and you can refer to columns in the data frame
  directly without using the $ operator.
• The return result of a function is a new data frame.

Data set
How to convert a month number into a month name?
> airq<-airquality
> for(i in 5:9){l<-airq$Month==i
airq$Month[l]=month.abb[i]}
> str(airq)
'data.frame': 153 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : chr "May" "May" "May" "May" ...
 $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...
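
The loop works, but the same conversion can be written as a single vectorised lookup. The sketch below is an alternative, not the slides' method; the MonthF column is a hypothetical extra that keeps the months in calendar order.

```r
# Work on a copy of the built-in airquality data set
airq <- airquality

# month.abb is a built-in character vector ("Jan", "Feb", ...), so
# indexing it with the month numbers converts the whole column at once
airq$Month <- month.abb[airq$Month]

# Optional: a factor with explicit levels sorts May < Jun < ... < Sep
# instead of alphabetically (Aug, Jul, Jun, May, Sep)
airq$MonthF <- factor(airq$Month, levels = month.abb[5:9])

print(table(airq$MonthF))
```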
select()

select(): extract columns from a data frame

> library(dplyr)
> names(airq)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
> subset<-select(airq,Solar.R:Temp)
> head(subset)
  Solar.R Wind Temp
1     190  7.4   67
2     118  8.0   72
3     149 12.6   74
4     313 11.5   62
5      NA 14.3   56
6      NA 14.9   66


select()

> subset<-select(airq,starts_with("So"))
> subset[1:5,]
[1] 190 118 149 313  NA

select()

> subset<-select(airq,ends_with("mp"))
> subset[1:5,]
[1] 67 72 74 62 56
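
The select() calls above can be combined into one short sketch on the built-in airquality data. It also shows why subset[1:5,] printed a bare vector: a plain data frame with a single column drops to a vector unless drop = FALSE is given. The variable names are illustrative.

```r
library(dplyr)

# A range of adjacent columns, by name
mid_cols <- select(airquality, Solar.R:Temp)
print(names(mid_cols))

# A minus sign drops a column instead of keeping it
no_day <- select(airquality, -Day)
print(names(no_day))

# Helper functions match on the column name
temp_only <- select(airquality, ends_with("mp"))

# One-column plain data frame: row indexing drops to a vector ...
as_vector <- temp_only[1:5, ]
# ... unless drop = FALSE keeps the data frame structure
as_frame <- temp_only[1:5, , drop = FALSE]
print(as_frame)
```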
filter()

filter(): extract a subset of the rows of a given data frame.

> rsubset<-filter(airq,Temp>75)
> rsubset[1:5,]
  Ozone Solar.R Wind Temp Month Day
1    45     252 14.9   81   May  29
2   115     223  5.7   79   May  30
3    37     279  7.4   76   May  31
4    NA     286  8.6   78   Jun   1
5    NA     186  9.2   84   Jun   4

filter()

> rsubset<-filter(airq,Temp>70,Temp<80)
> rsubset[1:5,]
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72   May   2
2    12     149 12.6   74   May   3
3     7      NA  6.9   74   May  11
4    11     320 16.6   73   May  22
5   115     223  5.7   79   May  30

filter()
> rsubset<-filter(airq,Temp>70,Temp<80,Month %in%
c("May","Aug"))
> rsubset[1:5,]
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72   May   2
2    12     149 12.6   74   May   3
3     7      NA  6.9   74   May  11
4    11     320 16.6   73   May  22
5   115     223  5.7   79   May  30
> tail(rsubset)
   Ozone Solar.R Wind Temp Month Day
11    31     244 10.9   78   Aug  19
12    44     190 10.3   78   Aug  20
13    21     259 15.5   77   Aug  21
14     9      36 14.3   72   Aug  22
15    NA     255 12.6   75   Aug  23
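
A brief sketch of how filter() combines conditions, using the built-in airquality data (numeric Month) rather than the airq copy. Comma-separated conditions are combined with AND; rows where a condition evaluates to NA are dropped. The object names are illustrative.

```r
library(dplyr)

# Comma-separated conditions are ANDed together ...
with_commas <- filter(airquality, Temp > 70, Temp < 80)
# ... so this explicit & form selects the same rows
with_and <- filter(airquality, Temp > 70 & Temp < 80)
print(nrow(with_commas) == nrow(with_and))

# Use | for OR
hot_or_windy <- filter(airquality, Temp > 90 | Wind > 20)

# Rows where the condition is NA are not returned, so no missing
# Ozone values survive this filter
high_ozone <- filter(airquality, Ozone > 100)
print(anyNA(high_ozone$Ozone))

# %in% tests membership in a set of values (months 5 and 8 here)
may_aug <- filter(airquality, Month %in% c(5, 8))
```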
arrange()

arrange(): reorder rows according to one of the variables

> asub<-arrange(airq,Temp)
> asub[1:10,]
   Ozone Solar.R Wind Temp Month Day
1     NA      NA 14.3   56   May   5
2      6      78 18.4   57   May  18
3     NA      66 16.6   57   May  25
4     NA      NA  8.0   57   May  27
5     18      65 13.2   58   May  15
6     NA     266 14.9   58   May  26
7     19      99 13.8   59   May   8
8      1       8  9.7   59   May  21
9      8      19 20.1   61   May   9
10     4      25  9.7   61   May  23


arrange()

> asub<-arrange(airq,desc(Temp))
> asub[1:10,]
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 Aug 28
2 84 237 6.3 96 Aug 30
3 118 225 2.3 94 Aug 29
4 85 188 6.3 94 Aug 31
5 NA 259 10.9 93 Jun 11
6 73 183 2.8 93 Sep 3
7 91 189 4.6 93 Sep 4
8 NA 250 9.2 92 Jun 12
9 97 267 6.3 92 Jul 8
10 97 272 5.7 92 Jul 9
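
arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. A minimal sketch on the built-in airquality data (numeric Month); the name by_month_hot is made up for the example.

```r
library(dplyr)

# Sort by month first, then hottest day first within each month
by_month_hot <- arrange(airquality, Month, desc(Temp))

# The first row is therefore the hottest day of May (Month == 5)
print(head(by_month_hot, 3))
```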



rename()

rename(): rename a variable

> resub<-rename(airq,NewTemp=Temp)
> head(resub)
  Ozone Solar.R Wind NewTemp Month Day
1    41     190  7.4      67   May   1
2    36     118  8.0      72   May   2
3    12     149 12.6      74   May   3
4    18     313 11.5      62   May   4
5    NA      NA 14.3      56   May   5
6    28      NA 14.9      66   May   6


rename()

> resub<-rename(airq,NewTemp=Temp,NewWind=Wind)
> head(resub)
  Ozone Solar.R NewWind NewTemp Month Day
1    41     190     7.4      67   May   1
2    36     118     8.0      72   May   2
3    12     149    12.6      74   May   3
4    18     313    11.5      62   May   4
5    NA      NA    14.3      56   May   5
6    28      NA    14.9      66   May   6


mutate()

mutate(): add a new column to a data frame

> msub<-mutate(airq,mdTemp=Temp-mean(Temp,na.rm=T))
> head(msub)
  Ozone Solar.R Wind Temp Month Day     mdTemp
1    41     190  7.4   67   May   1 -10.882353
2    36     118  8.0   72   May   2  -5.882353
3    12     149 12.6   74   May   3  -3.882353
4    18     313 11.5   62   May   4 -15.882353
5    NA      NA 14.3   56   May   5 -21.882353
6    28      NA 14.9   66   May   6 -11.882353


transmute()

transmute(): similar to mutate(), but drops all non-transformed variables

> tmsub<-transmute(airq,mdTemp=Temp-mean(Temp,na.rm=T),WindSquare=(Wind*Wind))
> head(tmsub)
      mdTemp WindSquare
1 -10.882353      54.76
2  -5.882353      64.00
3  -3.882353     158.76
4 -15.882353     132.25
5 -21.882353     204.49
6 -11.882353     222.01
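
One property of mutate() and transmute() worth a small example: a column created earlier in the call can be used by a later expression in the same call. The sketch uses the built-in airquality data, whose Temp column is in degrees Fahrenheit; TempC and TempK are illustrative column names.

```r
library(dplyr)

# TempK is computed from TempC, which is created in the same call
with_units <- mutate(airquality,
                     TempC = (Temp - 32) * 5 / 9,
                     TempK = TempC + 273.15)
print(names(with_units))

# transmute() runs the same expressions but keeps only the new columns
only_units <- transmute(airquality,
                        TempC = (Temp - 32) * 5 / 9,
                        TempK = TempC + 273.15)
print(names(only_units))
```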

group_by()

group_by(): group a data frame by one or more variables.

> #scramble the rows
> sairq<-airq[sample(1:153),]
> sairq[1:10,]
    Ozone Solar.R Wind Temp Month Day
6      28      NA 14.9   66   May   6
85     80     294  8.6   86   Jul  24
99    122     255  4.0   89   Aug   7
80     79     187  5.1   87   Jul  19
68     77     276  5.1   88   Jul   7
89     82     213  7.4   88   Jul  28
12     16     256  9.7   69   May  12
86    108     223  8.0   85   Jul  25
153    20     223 11.5   68   Sep  30
73     10     264 14.3   73   Jul  12

group_by()
> grsub<-group_by(sairq,Month)
> summarize(grsub,tmean=mean(Temp,na.rm=T),
max(Wind),minSolar=min(Solar.R,na.rm=T))
Source: local data frame [5 x 4]

  Month    tmean max(Wind) minSolar
  (chr)    (dbl)     (dbl)    (int)
1   Aug 83.96774      15.5       24
2   Jul 83.90323      14.9        7
3   Jun 79.10000      20.7       31
4   May 65.54839      20.1        8
5   Sep 76.90000      16.6       14
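
A compact sketch of the same group-then-summarise pattern, run on the built-in airquality data (numeric Month, so the groups come out in calendar order). Naming every summary gives cleaner column headers than max(Wind); n() counts the rows in each group. The names by_month and stats are illustrative.

```r
library(dplyr)

# Group once, then compute several per-group summaries in one call
by_month <- group_by(airquality, Month)
stats <- summarise(by_month,
                   days     = n(),                        # rows per month
                   tmean    = mean(Temp),                 # mean temperature
                   maxWind  = max(Wind),                  # windiest day
                   minSolar = min(Solar.R, na.rm = TRUE)) # skip missing values
print(stats)
```

The result has one row per group, which is why the 153 daily observations collapse to five rows.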

%>% "pipe operator"

The pipe operator combines multiple functions in a sequence.

> f(x) %>% f(y)
> #equivalent to
> f(f(x),y)

> #if we use the placeholder
> f(x) %>% f(y,.)
> #equivalent to
> f(y,f(x))

> third(second(first(x)))
> #equivalent to
> first(x) %>% second() %>% third()



%>% "pipe operator"
> airq %>% group_by(Month) %>% summarise(
+ mtemp=mean(Temp,na.rm=T),
+ Smean=mean(Solar.R,na.rm=T))
Source: local data frame [5 x 3]

  Month    mtemp    Smean
  (chr)    (dbl)    (dbl)
1   Aug 83.96774 171.8571
2   Jul 83.90323 216.4839
3   Jun 79.10000 190.1667
4   May 65.54839 181.2963
5   Sep 76.90000 167.4333
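
To make the equivalence concrete, the sketch below writes one analysis twice on the built-in airquality data: once as nested calls, read inside out, and once as a pipeline, read top to bottom. The names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: the first step (filter) is buried in the middle
nested <- arrange(
  summarise(
    group_by(filter(airquality, !is.na(Ozone)), Month),
    ozone = mean(Ozone)
  ),
  desc(ozone)
)

# Piped form: each step receives the previous result as its first argument
piped <- airquality %>%
  filter(!is.na(Ozone)) %>%
  group_by(Month) %>%
  summarise(ozone = mean(Ozone)) %>%
  arrange(desc(ozone))

# Both forms produce the same table
print(isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped))))
```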
