(from Satyendra Srivastava, Student)
This study is based on predicting human activity from Samsung smartphone activity data. Thirty volunteers within an age bracket of 19-48 years performed activities of daily living while wearing a Samsung Galaxy S II on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz. This has uses in health care applications, e.g. the exploitation of Ambient Intelligence in daily activity monitoring for elderly people[i]. Based on this massive 'machine' data, this study tries to answer two questions: can one find the few variables which predict a subject's activity efficiently, and, after exploring various statistical techniques, can one choose a model for it? Such a challenge involves two steps: first, selecting a statistical technique, and second, selecting the features (variables) relevant to the outcome. While linear models work best for continuous outcomes, Tree and Random Forest are the recommended techniques for categorical outcomes, as is the case here. While a tree recursively partitions response variables into subsets based on their relationship to predictor variables[ii], Random Forest grows many such trees. Each tree then "votes" for a class, and the forest chooses the classification having the most votes (the mode). Selection of variables is important since, in 'big data', many variables may simply be adding noise or masking other variables.

Methods

Data Collection
The data provided by the Coursera Data Analysis course was reportedly pre-processed to make it easier to load into R. Its dimensions are 7352 records for 563 variables. For each record in the dataset, it provides:
- Triaxial acceleration from the accelerometer
- Triaxial angular velocity from the gyroscope
- Vectors with time and frequency domain variables
- Activity label (laying, sitting, standing, walk, walkdown, walkup - Column 563)
- An identifier of the subject who carried out the experiment (Column 562)

Exploratory Analysis
Explorations were done using R (ver 2.15.2) with the purpose of understanding the data and comparing various predictive models. There are no missing values in the entire dataset. There are many variables with the same names but non-identical values; therefore no attempt was made to remove these 'duplicate' variables. Instead, by importing 'samsungData' into a new data frame, all the variables were renamed (v1 to v561) and the problem was solved. It seems that some variables (e.g. fBodyAcc-bandsEnergy) are derived from, and hence related to, other variables; in a sense, such variables could be confounders. It is difficult to take a decision about excluding such variables without in-depth domain knowledge and a better grasp of statistical techniques. The data set does not include all the subjects from 1 to 30: there are no records for subjects 2, 4, 9, 10, 12, 13, 18, 20 and 24. But the existing subjects have ample records - on average 350 per subject (min 281, max 409). There are sufficient observations for each of the six activities, ranging from 13% to 19% of the total.
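These checks take only a few lines of R; a minimal sketch, assuming samsungData has been loaded as in the appendix:

load("samsungData.rda")                                   # as in the appendix
sum(is.na(samsungData))                                   # 0: no missing values anywhere
range(table(samsungData$subject))                         # 281 to 409 records per subject
round(mean(table(samsungData$subject)))                   # about 350 records per subject on average
round(100 * prop.table(table(samsungData$activity)), 1)   # each activity is 13%-19% of the total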
Sampling
The dataset was divided in three parts (Train 60%, Validate 20%, Test 20%) based on subjects, not observations. Random sampling was not done, to avoid mixing Test subjects with the other two groups and to preserve the subjects' integrity. Here is the composition and size of these three subsets:

Subsets                 Train                              Validate      Test
Observations (n=7352)   4373                               1494          1485
%                       59.5                               20.3          20.2
Subjects ID             1,3,5,6,7,8,11,14,15,16,17,19,21   22,23,25,26   27,28,29,30
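The split itself is shown in the appendix; a short sketch of how the composition above can be verified, assuming the subset names used there:

subsets <- list(train = train, validate = validate, test = test)
sapply(subsets, nrow)                                # 4373 1494 1485
round(100 * sapply(subsets, nrow) / nrow(data), 1)   # 59.5 20.3 20.2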
As it turned out, both validation and test sets were used for test purposes only, since Random Forest generated an OOB error on the Training data itself. All the analyses gave different error rates on the validation and test sets, indicating something 'special' about the four test subjects. In retrospect, dividing the data set in just two parts, train and test, would have been sufficient. Activity was converted to a factor before further exploration. The 'subject' variable was not useful in the prediction model, hence it was excluded while studying the effect of predictors on the outcome variable: the activity being performed by the subject.

Model Selection
Some clustering techniques (SVD, PCA etc.) were explored, but they were complex and did not help with prediction. The GLM method gave the response "algorithm did not converge", because of the non-linearity of the data, I think. The Tree method was used, but it gave a higher misclassification error rate (11.39%) on the Training data - yet, interestingly, lower error rates on the Validation data (0.53%) and Test data (2.42%).

Model           Variables Used   OOB error rate on   Miss-classification Error   Miss-classification Error
                                 Training Data (%)   on Validation (%)           on Test data (%)
Random Forest   561              1.85                10.78                       5.72
Tree            561              11.39               0.53                        2.42
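The Tree fit itself is not included in the appendix; a minimal sketch of how it could be reproduced, assuming the tree package (the package choice is my assumption; rpart would work similarly):

library(tree)
treeMod <- tree(activity ~ ., data = train)
summary(treeMod)                             # misclassification error rate and the variables actually used
pred <- predict(treeMod, validate, type = "class")
mean(pred != validate$activity)              # misclassification error on the Validation set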
It appears that the Tree model, as above, may not generalize well to new data. Also, its summary indicated that it used just 8 variables for classification. It became clear that for 561 variables Random Forest would be a more accurate and efficient approach, and hence it was selected instead of Tree[iii]. Random Forest is a 'bagging' type ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees[iv]. Its accuracy, its efficiency on large datasets and the fact that it can handle thousands of input variables without variable deletion set it apart from other machine learning algorithms[v]. On the other hand, its disadvantages are over-fitting on some datasets with noisy classification/regression tasks[vi] and that, unlike decision trees, the classifications made by random forests are difficult for humans to interpret[vii]. The error rates in Random Forest depend on two factors: 1) The correlation between any two trees in the forest: increasing the correlation increases the forest error rate. In other words, for better results every tree - e.g. the ones
voting for walking and walking up - should be different. 2) The strength of each individual tree in the forest: a tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate.

Variable Selection
The first run of Random Forest with the Training dataset included all the 561 variables. This generated a quite low OOB error of 1.85% but took considerable time. To understand the role of different variables in contributing to this accuracy, the Mean Decrease Accuracy and Mean Decrease Gini output were studied for the top 50 contributors (Fig 1). Mean Decrease Accuracy represents the difference between the prediction error on the out-of-bag data before and after permuting each predictor variable. Mean Decrease Gini is the total decrease in node impurities from splitting on the variable, averaged over all trees[viii].
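A sketch of how these measures can be read off the fitted forest, assuming fit here is that first 561-variable model trained with importance=T (the appendix only shows the later 40-variable fit):

imp <- importance(fit)                                                     # matrix with MeanDecreaseAccuracy and MeanDecreaseGini columns
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 50)   # top 50 by accuracy decrease
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ], 50)       # top 50 by Gini decrease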
Sixty-nine (69) variables out of 561 were selected on the basis of their importance. This set of variables was used since it performed better and faster: time was reduced to approximately one third. But I decided to try another method for selecting variables: a gradient boosted machine learning tool which uses the "gbm" package in R[ix]. This technique fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. Using the function below, a list of sixty variables was generated which gave better results than the 69 variables selected using Random Forest. I stuck with this list and explored it further, to find an optimum list. Eventually I found 40 variables[x] which gave efficient predictions on validation and test data with an acceptably low miss-classification error. The following function was used:
gbm(activity ~ ., train, n.trees=1000, shrinkage=0.01, distribution="gaussian", interaction.depth=7, bag.fraction=0.9, cv.folds=5, n.minobsinnode=50)
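The ranked list of variables then comes from the gbm summary. The exact selection step is not shown in the appendix, so the following is my reconstruction (taking the top 40 by relative influence is an assumption):

best.iter <- gbm.perf(gbmMod, method = "cv")      # optimal number of trees by cross-validation
iScore <- summary(gbmMod, n.trees = best.iter)    # variables ranked by relative influence
top40 <- as.character(iScore$var[1:40])           # hypothetical: keep the 40 most influential variables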
Fig. 4 below shows the relative influence of the variables, as determined by gbm on the training data. About 60-70 variables contribute the most to the variability in the data; they are concentrated in the top left corner of the chart.
If we look at the table below, comparing error rates for different sets of variables derived either from the GBM or the Random Forest model, this model (40 variables) gives lower misclassification errors on the validation data but a slightly higher error rate on the test data (7.14%). This rate appears to be acceptable for a predictive model. To avoid over-fitting, no more variables were taken into the design.
Variables used in    Variable selection   OOB error rate on   Miss-classification Error   Miss-classification Error
the randomForest     based on             Training Data (%)   on Validation (%)           on Test data (%)
All (561)            R Forest             1.85                10.78                       5.72
120                  GBM                  1.41                10.97                       6.26
69                   R Forest             1.46                10.58                       8.89
60                   GBM                  1.94                11.17                       7.47
20                   GBM                  2.15                12.31                       9.09
30                   GBM                  1.49                10.04                       7.41
50                   R Forest             1.55                10.24                       7.47
50                   GBM                  1.46                10.50                       7.00
40 (selected)        GBM                  1.49                9.10                        7.14
While exploring, I discovered that adding more variables did not always improve accuracy. This shows that some variables only add discordance and noise to the data. More exploration is needed to isolate these variables. So, here is the final formula for predicting "activity":
randomForest(activity ~ v54+v560+v42+v51+v38+v58+v505+v55+v70+v202+v210+v504+v452+v43+v56+v160+v75+v76+v53+v74+v77+v561+v303+v199+v559+v41+v59+v143+v130+v451+v40+v204+v299+v57+v140+v124+v186+v370+v39+v50+v469+v461+v207+v433+v427+v506, data=train, proximity=TRUE, importance=T, do.trace=100)
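The misclassification errors quoted in the tables above can be recovered from the confusion tables built in the appendix; a one-line sketch, assuming t is the observed-vs-predicted table:

1 - sum(diag(t)) / sum(t)    # overall misclassification error rate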
Error rates
Misclassification error reduces rapidly over the first 100 trees (Fig 2, Fig 3). It approaches zero quickly and stays at zero, without any noise, for "laying" (the red straight line at error=0). Other activities show some noise, the highest being for "standing", followed by "sitting" and "walkup". When a subject is standing, not much data is being generated, hence the greater margin for error, I think.
This is evident in the following table of prediction errors for the Test data. Accuracy is lowest (81.27%) for standing and highest for laying and walking, for obvious reasons.
Predictions for Test data (rows = observed, columns = predicted):

observed    laying       sitting      standing     walk         walkdown     walkup
laying      1.00000000   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000
sitting     0.00000000   0.85227273   0.14772727   0.00000000   0.00000000   0.00000000
standing    0.00000000   0.18727915   0.81272085   0.00000000   0.00000000   0.00000000
walk        0.00000000   0.00000000   0.00000000   1.00000000   0.00000000   0.00000000
walkdown    0.00000000   0.00000000   0.00000000   0.04000000   0.95000000   0.01000000
walkup      0.00000000   0.00000000   0.00000000   0.01388889   0.01388889   0.97222222
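The per-activity accuracies quoted above are the diagonal of this table; assuming t is the confusion table from the appendix:

diag(prop.table(t, 1))    # e.g. 0.8127 for standing, 1.0 for laying and walk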
The rMSE error rate in GBM for the selected 40 variables was as follows: 80.2% for the Training set, 64.6% for the Validation set and 5.74% for the Test set.
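As computed in the appendix, this figure is the root mean squared error normalized by the mean of the actual values, expressed as a percentage:

rMSEpercent <- sqrt(mean((actualValues - result)^2)) / mean(actualValues) * 100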
Reproducibility
The R code, though optional, is given in the appendix.

Results
Random Forest offers distinct advantages over Tree and GLM where a large number of variables are potential predictors. The selected 40 variables offer an acceptable level of accuracy (93%) and computational efficiency over the original 561 variables. The model offers accuracy ranging from 81% to 100% for predicting the various activities.

Conclusions
The proposed predictive model is based on the Random Forest technique of statistical classification. It is efficient and accurate but essentially a black box model and therefore subject to unforeseen prediction errors; hence the need for caution in real life scenarios. Out of 561 variables, 40 were selected using GBM. Perhaps variable selection could be fine-tuned further. It would also help to have some domain-specific knowledge about these variables and their measurement, to work further on the model. This predictive model for Human Activity Recognition, based on smartphones, could have many useful applications - in health, crime prevention etc. On the other hand, concerns have been raised about privacy issues[xi].

Appendix
Optional R code
# Checked on system: Ubuntu 11.04, RStudio 0.97.312, R version 2.15.2 (2012-10-26)
library(randomForest)
library(gbm)
setwd("~/Downloads/dataanalysis/assign2")
load("samsungData.rda")
data <- data.frame(samsungData)
# rename the 561 feature columns to v1..v561 to avoid duplicate, unwieldy names
numbered.var <- function(x){paste("v", x, sep="")}
cnames.orig <- colnames(data)
cnames.new <- c(numbered.var(seq(1:561)), c("subject", "activity"))
colnames(data) <- cnames.new
data$activity <- as.factor(samsungDataactivity <- as.factor(samsungData$activity)
# subject-based split: Train (13 subjects), Validate (4 subjects), Test (4 subjects); column 562 is 'subject'
train1 <- data[data$subject %in% c(1,3,5,6,7,8,11,14,15,16,17,19,21),]
train <- train1[-562]
validate1 <- data[data$subject %in% c(22,23,25,26),]
validate <- validate1[-562]
test1 <- data[data$subject %in% c(27,28,29,30),]
test <- test1[-562]
# Run GBM to select 40 variables
gbmMod <- gbm(activity ~ ., train, n.trees=1000, shrinkage=0.01, distribution="gaussian", interaction.depth=7, bag.fraction=0.9, cv.folds=5, n.minobsinnode=50)
best.iter <- gbm.perf(gbmMod, method="cv")
iScore <- summary(gbmMod, n.trees=best.iter)
names(iScore)[1] <- "attribute"
names(iScore)[2] <- "importance"
# to calculate error in gbm
actualValues <- as.numeric(factor(test$activity))
result <- predict(gbmMod, test, best.iter, type="response")
rMSEpercent <- sqrt(mean((actualValues - result)^2)) / mean(actualValues) * 100
# Random Forest with 40 selected variables on Train data
fit <- randomForest(activity ~ v54+v560+v42+v51+v38+v58+v505+v55+v70+v202+v210+v504+v452+v43+v56+v160+v75+v76+v53+v74+v77+v561+v303+v199+v559+v41+v59+v143+v130+v451+v40+v204+v299+v57+v140+v124+v186+v370+v39+v50+v469+v461+v207+v433+v427+v506, data=train, proximity=TRUE, importance=T, do.trace=100)
print(fit)
# fit with validate data
data.predict <- predict(fit, validate)
t <- table(observed=validate[,'activity'], predict=data.predict)
t
prop.table(t, 1)
# fit with test data
data.predict <- predict(fit, test)
t <- table(observed=test[,'activity'], predict=data.predict)
t
prop.table(t, 1)
# GRAPHS and to select variables from Train set thru random forest
varImpPlot(fit, sort=TRUE, n.var=min(60, nrow(fit$importance)), type=NULL, class=NULL, scale=TRUE, main="Fig 1. Importance of Variables", pch=19, col=4)
plot(fit, main="Fig 3. Model (40 selected variables)")

References
i. Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012. (http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones)
ii. http://plantecology.syr.edu/fridley/bio793/cart.html
iii. This decision was based on the class videos and this write-up: http://plantecology.syr.edu/fridley/bio793/cart.html
iv. http://en.wikipedia.org/wiki/Random_forest#Learning_algorithm
v. http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm
vi. Segal, Mark R. (April 14, 2004). Machine Learning Benchmarks and Random Forest Regression. Center for Bioinformatics & Molecular Biostatistics.
vii. Berthold, Michael R. (2010). Guide to Intelligent Data Analysis. Springer London.
viii. http://stat-www.berkeley.edu/users/breiman/RandomForests
ix. Inspired by this post: http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html
x. These forty variables are: v54, v560, v42, v51, v38, v58, v505, v55, v70, v202, v210, v504, v452, v43, v56, v160, v75, v76, v53, v74, v77, v561, v303, v199, v559, v41, v59, v143, v130, v451, v40, v204, v299, v57, v140, v124, v186, v370, v39, v50
xi. http://androidheadlines.com/2012/10/featured-samsung-wants-to-patent-your-daily-activity.html