Hukam Singh Rana Hitesh Kumar Sharma Data Analysis using R February 26, 2016 3 / 43
Data Analysis
Data analysis is the process of applying statistical methods to data to gain knowledge and insight.
Steps in Data Analysis
• Store the data
• Transform the data
• Visualization
• Model fitting
Tidy Data
> load("D:/new/table1.rdata")
> table1
      country year  cases population
1 Afghanistan 1999    745   19987071
2 Afghanistan 2000   2666   20595360
3      Brazil 1999  37737  172006362
4      Brazil 2000  80488  174504898
5       China 1999 212258 1272915272
6       China 2000 213766 1280428583
Here, cases is the number of people diagnosed with TB in each country in each year.
Question: Calculate the rate of TB cases per country per year.
Solution
> rate<-table1$cases/table1$population
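Because table1 keeps each variable in its own column, the rate is a single vectorised division. The following is a minimal sketch, assuming the six rows printed above are rebuilt by hand (the table1.rdata file is not assumed to be available) and that storing the result in new columns named rate and rate_per_10k, both hypothetical names, is acceptable:

```r
# Rebuild table1 from the rows printed on the slide.
table1 <- data.frame(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583),
  stringsAsFactors = FALSE
)

# Element-wise division pairs each row's cases with the same row's population.
table1$rate <- table1$cases / table1$population

# Scaling to cases per 10,000 people is only a readability choice.
table1$rate_per_10k <- table1$rate * 10000

print(table1)
```

Keeping the rate next to country and year, rather than in a free-standing vector, means each value stays attached to the observation it describes.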
Tidy Data
> load("D:/new/table2.rdata")
> table2
       country year        key      value
1  Afghanistan 1999      cases        745
2  Afghanistan 1999 population   19987071
3  Afghanistan 2000      cases       2666
4  Afghanistan 2000 population   20595360
5       Brazil 1999      cases      37737
6       Brazil 1999 population  172006362
7       Brazil 2000      cases      80488
8       Brazil 2000 population  174504898
9        China 1999      cases     212258
10       China 1999 population 1272915272
11       China 2000      cases     213766
12       China 2000 population 1280428583
Question: calculate the rate of TB cases per country per year.
Tidy Data
> load("D:/new/table3.rdata")
> table3
      country year              rate
1 Afghanistan 1999      745/19987071
2 Afghanistan 2000     2666/20595360
3      Brazil 1999   37737/172006362
4      Brazil 2000   80488/174504898
5       China 1999 212258/1272915272
6       China 2000 213766/1280428583
Question: calculate the rate of TB cases per country per year.
Tidy data
> load("D:/new/table4.rdata")
> load("D:/new/table5.rdata")
> table4
      country   1999   2000
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766
> table5
1. tidyr package
• spread() returns a copy of your data set that has had the key and value
columns removed.
• In their place, spread() adds a new column for each unique value of the
key column.
• These unique values will form the column names of the new columns.
• spread() distributes the cells of the former value column across the
cells of the new columns and truncates any non-key, non-value
columns in a way that prevents duplication.
spread()
spread() turns a pair of key:value columns into a set of tidy columns.
> table2
       country year        key      value
1  Afghanistan 1999      cases        745
2  Afghanistan 1999 population   19987071
3  Afghanistan 2000      cases       2666
4  Afghanistan 2000 population   20595360
5       Brazil 1999      cases      37737
6       Brazil 1999 population  172006362
7       Brazil 2000      cases      80488
8       Brazil 2000 population  174504898
9        China 1999      cases     212258
10       China 1999 population 1272915272
11       China 2000      cases     213766
12       China 2000 population 1280428583
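To make the behaviour of spread() concrete, here is a minimal sketch that rebuilds table2 by hand (an assumption, since the .rdata file is not available) and spreads its key and value columns. The object name wide and the added rate column are illustrative only. Recent tidyr releases also provide pivot_wider() as the successor to spread(), but spread() is used here to match the slides.

```r
library(tidyr)

# Rebuild table2 from the rows printed on the slide.
table2 <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 3),
  key     = rep(c("cases", "population"), times = 6),
  value   = c(745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898,
              212258, 1272915272, 213766, 1280428583),
  stringsAsFactors = FALSE
)

# Each unique entry in `key` becomes a column; `value` supplies the cells.
# The two rows per country and year collapse into one.
wide <- spread(table2, key, value)

# With one variable per column, the rate question is a simple division.
wide$rate <- wide$cases / wide$population

print(wide)
```

The twelve key:value rows become six rows, one per country and year, which is the same layout as table1.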
separate()
> separate(table3,rate,into=c("cases","population"))
Source: local data frame [6 x 4]
> separate(table3,rate,into=c("cases","population"),sep="/")
Source: local data frame [6 x 4]
separate()
> t<-separate(table3,rate,into=c("cases","population"),sep="/")
> t
Source: local data frame [6 x 4]
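By default separate() returns the pieces as character strings, so they cannot be divided directly. The sketch below rebuilds table3 by hand (an assumption) and adds convert = TRUE, an existing separate() argument that asks tidyr to convert the new columns to a suitable type. The object name parts is hypothetical.

```r
library(tidyr)

# Rebuild table3 from the rows printed on the slides.
table3 <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583"),
  stringsAsFactors = FALSE
)

# Split `rate` at the slash and convert the two pieces to numbers.
parts <- separate(table3, rate, into = c("cases", "population"),
                  sep = "/", convert = TRUE)

# The original `rate` column was removed by separate(), so the name is free again.
parts$rate <- parts$cases / parts$population

print(parts)
```

Without convert = TRUE the division would fail, because "745" and "19987071" would still be text.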
separate()
You can also pass an integer or a vector of integers to sep. separate() will
interpret the integers as positions to split at. Positive values start at 1 at the
far left of the strings; negative values start at -1 at the far right of the strings.
> t1<-separate(t,year,into=c("century","year"),sep=2)
> t1
Source: local data frame [6 x 5]
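The difference between positive and negative positions is easiest to see on a small input. This is a minimal sketch on a hypothetical two-row data frame named years. For four-character strings, splitting after position 2 from the left and splitting two characters from the right give the same pieces.

```r
library(tidyr)

# Hypothetical two-row input; years are stored as text so the split is explicit.
years <- data.frame(year = c("1999", "2000"), stringsAsFactors = FALSE)

# sep = 2 splits after the second character counted from the left.
from_left <- separate(years, year, into = c("century", "year"), sep = 2)

# sep = -2 splits two characters before the right-hand end.
from_right <- separate(years, year, into = c("century", "year"), sep = -2)

print(from_left)
print(from_right)
```

Negative positions are mainly useful when the strings differ in length but share a fixed-width suffix.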
unite()
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
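The command used on this slide is not shown, so the following is only a sketch of how unite() could rejoin split columns like the ones above. The column names century and year, the new name full_year, and the object names are all assumptions made for illustration.

```r
library(tidyr)

# Hypothetical reconstruction of the first three rows shown above.
split_years <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil"),
  century = c("19", "20", "19"),
  year    = c("99", "00", "99"),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362"),
  stringsAsFactors = FALSE
)

# unite(data, new_column, columns to paste, sep): the default separator is "_",
# so sep = "" is needed to get "1999" rather than "19_99".
rejoined <- unite(split_years, full_year, century, year, sep = "")

print(rejoined)
```

unite() is the inverse of separate(): the input columns are removed and replaced by the single pasted column.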
2. dplyr: a grammar of data manipulation
Grammar of dplyr
Common Properties of dplyr functions
Data set
How to convert month number into month name?
> airq<-airquality
> for(i in 5:9){l<-airq$Month==i
airq$Month[l]=month.abb[i]}
> str(airq)
'data.frame': 153 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : chr "May" "May" "May" "May" ...
 $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...
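The loop above works, but the same conversion can be written without a loop. This is a minimal sketch using only base R: month.abb is a built-in vector of the twelve abbreviated month names, so indexing it with the month numbers converts the whole column at once.

```r
# airquality ships with base R (package datasets); Month holds the integers 5 to 9.
airq <- airquality

# Index the built-in month.abb vector by month number: 5 -> "May", 6 -> "Jun", ...
airq$Month <- month.abb[airq$Month]

# Count the rows per month name to check the conversion.
print(table(airq$Month))
```

The vectorised form also avoids the intermediate state in the loop where the column holds a mix of converted names and not-yet-converted numbers.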
select()
> subset<-select(airq,starts_with("So"))
> subset[1:5,]
[1] 190 118 149 313 NA
select()
> subset<-select(airq,ends_with("mp"))
> subset[1:5,]
[1] 67 72 74 62 56
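The helpers starts_with() and ends_with() match column names, not values, and they can be combined or negated. A minimal sketch on the built-in airquality data follows; the object names are hypothetical.

```r
library(dplyr)

airq <- airquality

# Each helper selects the columns whose names match the given prefix or suffix.
solar <- select(airq, starts_with("So"))   # keeps Solar.R
temp  <- select(airq, ends_with("mp"))     # keeps Temp

# Several helpers can be listed together; a minus sign drops the matches instead.
both    <- select(airq, starts_with("So"), ends_with("mp"))
no_wind <- select(airq, -starts_with("Wi"))

print(names(both))
print(names(no_wind))
```

Note that select() always returns a data frame, even for a single column. The bare vectors printed on the slides come from indexing a one-column data frame with [1:5, ], which drops it to a vector.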
filter()
> rsubset<-filter(airq,Temp>70,Temp<80,Month %in%
c("May","Aug"))
> rsubset[1:5,]
Ozone Solar.R Wind Temp Month Day
1 36 118 8.0 72 May 2
3 7 NA 6.9 74 May 11
4 11 320 16.6 73 May 22
14 9 36 14.3 72 Aug 22
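In filter(), conditions separated by commas are combined with a logical AND. The sketch below, which assumes the month numbers have already been converted to names as in the earlier slide, checks that the comma form and the explicit & form select the same rows. The object names are hypothetical.

```r
library(dplyr)

airq <- airquality
airq$Month <- month.abb[airq$Month]   # month numbers to names, as on the earlier slide

# Comma-separated conditions must all hold for a row to be kept.
with_commas <- filter(airq, Temp > 70, Temp < 80, Month %in% c("May", "Aug"))

# The same selection written with explicit & operators.
with_and <- filter(airq, Temp > 70 & Temp < 80 & Month %in% c("May", "Aug"))

print(head(with_commas))
```

Use | inside a single condition when rows should match either of two tests, since commas cannot express OR.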
arrange()
> asub<-arrange(airq,desc(Temp))
> asub[1:10,]
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 Aug 28
2 84 237 6.3 96 Aug 30
3 118 225 2.3 94 Aug 29
4 85 188 6.3 94 Aug 31
5 NA 259 10.9 93 Jun 11
6 73 183 2.8 93 Sep 3
7 91 189 4.6 93 Sep 4
8 NA 250 9.2 92 Jun 12
9 97 267 6.3 92 Jul 8
10 97 272 5.7 92 Jul 9
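arrange() sorts in ascending order unless a column is wrapped in desc(), and additional columns act as tie-breakers. A minimal sketch on the built-in airquality data, with hypothetical object names:

```r
library(dplyr)

airq <- airquality

# Hottest days first.
hottest_first <- arrange(airq, desc(Temp))

# Sort by month, and within each month put the hottest day first.
by_month_then_temp <- arrange(airq, Month, desc(Temp))

print(head(hottest_first, 3))
print(head(by_month_then_temp, 3))
```

Rows that tie on every sort key keep no guaranteed meaning in their order, so add another key when the order among ties matters.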
rename(): rename a variable
> resub<-rename(airq,NewTemp=Temp)
> head(resub)
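rename() takes pairs of the form new_name = old_name and leaves every other column untouched. A minimal sketch on the built-in airquality data that checks the effect on the column names:

```r
library(dplyr)

airq <- airquality

# New name on the left, existing name on the right.
resub <- rename(airq, NewTemp = Temp)

# Only the renamed column changes; order and contents stay the same.
print(names(resub))
```

Unlike select(), which can also rename, rename() keeps all columns rather than only the ones mentioned.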
mdTemp WindSquare
1 -10.882353 54.76
2 -5.882353 64.00
3 -3.882353 158.76
4 -15.882353 132.25
5 -21.882353 204.49
6 -11.882353 222.01
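The command that produced the mdTemp and WindSquare columns printed above is not shown. The values are consistent with centring Temp on its mean and squaring Wind (for example, 7.4 squared is 54.76), so the following is only a sketch that assumes transmute() was used:

```r
library(dplyr)

airq <- airquality

# transmute() keeps only the newly computed columns;
# mutate() with the same arguments would append them to the existing ones.
derived <- transmute(airq,
                     mdTemp     = Temp - mean(Temp),
                     WindSquare = Wind^2)

print(head(derived))
```

Inside mutate() and transmute(), mean(Temp) is computed over the whole column, so every row is compared with the same overall mean.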
6 28 NA 14.9 66 May 6
85 80 294 8.6 86 Jul 24
99 122 255 4.0 89 Aug 7
80 79 187 5.1 87 Jul 19
68 77 276 5.1 88 Jul 7
89 82 213 7.4 88 Jul 28
12 16 256 9.7 69 May 12
86 108 223 8.0 85 Jul 25
153 20 223 11.5 68 Sep 30
73 10 264 14.3 73 Jul 12
group_by()
> grsub<-group_by(sairq,Month)
> summarize(grsub,tmean=mean(Temp,na.rm = T),
max(Wind),min(Solar.R, na.rm=T))
Source: local data frame [5 x 4]
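group_by() only marks the groups; the aggregation happens in summarize(), which returns one row per group. The definition of sairq is not shown on the slides, so this sketch assumes the plain airquality data instead and gives every summary column an explicit, hypothetical name:

```r
library(dplyr)

airq <- airquality

# Mark one group per month (five months in this data set).
by_month <- group_by(airq, Month)

# One output row per group; na.rm = TRUE skips missing readings.
monthly <- summarize(by_month,
                     tmean     = mean(Temp, na.rm = TRUE),
                     max_wind  = max(Wind),
                     min_solar = min(Solar.R, na.rm = TRUE))

print(monthly)
```

Naming each summary explicitly avoids column names such as max(Wind), which are awkward to refer to later.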
> #equivalent to
> f(f(x),y)
> #equivalent to
> f(y,f(x))
> third(second(first(x)))
> first(x) %>% second() %>% third()
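The pipe operator %>% passes the value on its left as the first argument of the call on its right, so x %>% f(y) is the same as f(x, y). The sketch below compares a nested call with its piped form on the built-in airquality data; the particular chain of verbs is an illustration, not taken from the slides.

```r
library(dplyr)   # dplyr re-exports %>% from magrittr

airq <- airquality

# Nested form: read from the inside out.
nested <- arrange(filter(select(airq, Month, Temp), Temp > 90), desc(Temp))

# Piped form: the same steps, read top to bottom.
piped <- airq %>%
  select(Month, Temp) %>%
  filter(Temp > 90) %>%
  arrange(desc(Temp))

print(piped)
```

Both forms run the same three calls in the same order; the pipe only changes how the code reads.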