Basics of Data Science

Welcome to the Online Session
on
Data Analytics using R
by
Mrs. Vrushali A. Patil
Contents of session
• Introduction to Data Science.
• Components of Data Science: Big Data, Data Mining, Machine
Learning, Data Analytics.
• Introduction to R
• Data Structures in R: Vector, Factor, Array, Matrix
• Strings in R
Data Science
• Data Science is about using data to solve a problem.
• The problem may involve decision making, prediction or recommendation.
• E.g. Forecasting the weather.
Predicting the winner of a sports series/competition.
Recommending products to customers during online shopping.
Finding the reason for a decline in sales of a product.
• It is an area of study in which vast amounts of data are analyzed with scientific methods and algorithms to extract useful information.
• The data used by the algorithms can be structured or unstructured.
• Data Science is also used to communicate results in the form of graphics, charts etc.
Components of Data Science
1. Big Data:
• Data that is huge in volume and keeps growing with time.
• E.g. Data generated by social media websites, emails etc.
Data generated through online education portals/shopping portals.
Data generated in stock exchanges, banking sectors etc.
Data generated by satellites.
• Big Data can be categorized into 3 types:
• Structured data: Highly organised data stored in spreadsheets or database tables.
• Unstructured data: Data with no predefined structure, such as images, videos, text etc.
• Semi-structured data: Loosely organised data stored in formats such as XML, HTML, JSON.
Characteristics of Big Data
1. Volume: It refers to the amount of data generated through social media, phones, satellites etc.
• Distributed DBMSs are used to store such large volumes of data.
2. Velocity: It refers to the speed at which large amounts of data are generated, collected and analyzed.
• Big data technology allows us to analyze data in real time, i.e. the data need not be stored in a database before it is processed.
3. Value: It refers to the worth of data being extracted.
• Big data is useless until we retrieve some useful information from it.
4. Variety: It refers to the different types of data and heterogeneous sources
from which data are being generated and collected.
5. Veracity: It refers to the quality or trustworthiness of data.
Components of Data Science(contd.)
2. Data Mining:
• It is the process of discovering or extracting knowledge from big data stored in multiple data sources such as files, databases, data warehouses etc.
• Knowledge extraction is the process of finding related, useful or previously unknown information in big data.
Architecture of Data Mining
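Fig. Layers of the data mining architecture, from top to bottom:
• Graphical User Interface
• Pattern Evaluation Module
• Data Mining Engine
• Database or Data Warehouse Server
• Data Cleaning, Integration & Selection
• Data sources: Web, Database, Data Warehouse, Other Sources
• Knowledge Base (drawn beside the Pattern Evaluation Module and the Data Mining Engine)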
Architecture of Data Mining(contd.)
1. Data Sources: Actual storage of data.
2. Different Processes:
• Data cleaning is the process of detecting and removing incomplete, corrupted or inaccurate data from the data source.
• Not all data is required for processing, so data selection is also necessary before the data is passed to the server.
3. Database or Data Warehouse Server:
• The server is responsible for retrieving relevant data based on the user's data mining request.
Architecture of Data Mining(contd.)
4. Data Mining Engine:
• It consists of software to obtain knowledge and judgements from the collected data.
5. Pattern Evaluation Module:
• It is used to find the required patterns in the mined data.
6. GUI:
• It is the communication medium between the user and the data mining system.
7. Knowledge Base:
• It contains user experiences that might be useful to make results more accurate.
• The Pattern Evaluation Module interacts with the knowledge base both to get inputs from it and to update it.
Types of Data Mining Architecture
1. No Coupling:
• The data mining system retrieves data from a source other than a database or data warehouse, processes it using data mining algorithms and stores the results in the file system.
• It does not take any advantage of database functionality.
2. Loose Coupling:
• The system retrieves data from a database or data warehouse, processes that data and stores the results back in the database or data warehouse.
3. Semi-tight Coupling:
• The system is linked to a database and data warehouse.
• It also uses several database features like indexing, sorting and aggregation to perform mining tasks.
4. Tight Coupling:
• The system uses all the features of the database and data warehouse to perform mining.
• It offers high scalability and high performance.
Data Mining Techniques
1. Decision Trees:
• It uses a tree representation to solve a problem. It creates a classifier model to classify the input.
• The root node represents a simple question or condition used to retrieve data and make a decision.
• Internal nodes also contain a simple question or condition.
• A leaf node represents the action to be performed.
Tree 1:
Employed?
  N: Credit Score?
    High: Accept
    Low: Reject
  Y: Income?
    High: Accept
    Low: Reject

Tree 2:
No. of family members > 3?
  N: Is Married?
    N: 1 BHK
    Y: Salary > 40000?
      N: 1 BHK
      Y: 2 BHK
  Y: Is Married?
    N: Salary > 40000?
      N: 1 BHK
      Y: 2 BHK
    Y: 3 BHK
Fig 1. Decision Tree
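A decision tree can be read as nested conditions: each internal node is a test and each leaf is an action. The sketch below encodes the first tree of Fig 1 in R. The function name decide() and its arguments are hypothetical, and the branch layout (N leads to Credit Score, Y leads to Income) is an assumption based on the left-to-right order of the figure.

```r
# Hypothetical helper: walks the first tree of Fig 1 from root to leaf.
decide <- function(employed, credit_score = NA, income = NA) {
  if (!employed) {
    # Branch N (assumed): the internal node asks about the credit score.
    if (credit_score == "High") "Accept" else "Reject"
  } else {
    # Branch Y (assumed): the internal node asks about income.
    if (income == "High") "Accept" else "Reject"
  }
}

r1 <- decide(employed = FALSE, credit_score = "High")  # "Accept"
r2 <- decide(employed = TRUE, income = "Low")          # "Reject"
```

In practice the tree is learned from data by a data mining algorithm rather than written by hand; the function only shows how a finished tree classifies one input.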

2. Association:
• Patterns are recognised based upon the relationship of items in a single transaction.
• E.g. A company may use the association technique to research customers' buying habits, based on historical data.
3. Clustering:
• Similar data items are grouped into classes (clusters) that are not predefined; the classes are created as per user requirements.
4. Classification:
• Input data items are classified into predefined classes.
5. Prediction:
• It is used to predict results from historical data.
Machine Learning
• Machine Learning is an application of Artificial Intelligence.
• It gives a computer the ability to learn automatically from past experience and to improve system functionality without being explicitly programmed.
• Machine learning algorithms enable computers to learn from a data set and improve themselves.
• It enables software applications to predict results more accurately.
• The main purpose of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an outcome.
• A machine learning algorithm uses either labelled or unlabelled data sets as input training data.
Machine Learning(contd.)
• There are two types of machine learning algorithms:
1. Supervised Learning:
• In this approach, the system is trained using well labelled data.
• Input data is tagged with the correct answer.
• After training, the system is tested by providing a test data set. A model is accurate if it produces the correct output.
• The system can easily classify new inputs/observations into pre-labelled classes by determining the features of the input data.
• It cannot handle complex data sets.
• It cannot produce correct output if the test data is different from the training dataset.
• Types of supervised machine learning algorithms:
1. Classification:
• In this model, input data is classified in predefined classes.
• Output is also from one of the predefined classes.
2. Regression:
• This technique is mainly used to establish relationship model between two
variables.
• It is used to predict the value of output variable Y from one or more input
variables X.
2. Unsupervised Learning:
• In this algorithm, system is trained using unlabelled data.
• No predefined classes are present.
• System creates classes by identifying similarities, differences and patterns
of input data.
• Types of unsupervised machine learning algorithms:
1. Clustering
2. Association
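The difference between the two kinds of learning can be sketched in a few lines of R. This is a minimal illustration, not a full workflow: the data are made up, lm() stands in for regression (supervised) and kmeans() stands in for clustering (unsupervised); both functions come from the stats package that ships with R.

```r
# Supervised learning (regression): every input x is labelled with a known y.
train <- data.frame(x = 1:10)
train$y <- 2 * train$x + 1               # made-up, noise-free labels
model <- lm(y ~ x, data = train)         # learn the relationship between X and Y
pred  <- predict(model, newdata = data.frame(x = 11))  # predict Y for a new X

# Unsupervised learning (clustering): no labels, the algorithm forms the groups.
set.seed(1)                              # kmeans() starts from random centers
values <- c(1, 2, 3, 101, 102, 103)      # made-up, clearly separated data
groups <- kmeans(values, centers = 2, nstart = 5)$cluster
```

Here pred is the model's estimate for x = 11, and groups assigns each value to one of two clusters found from the data alone.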
Data Analytics
• Data analytics is the process of analyzing and organizing data for deriving
important judgements.
• Types of Data Analytics:
1. Descriptive Analysis:
• It is used to identify what happened in the past.
• It is represented in the form of graphical visualizations.
2. Diagnostic Analysis:
• It is used to identify why it happened (root cause analysis).
3. Predictive Analysis:
• It predicts what is likely to happen in the future, using past data.
4. Prescriptive Analysis:
• It analyzes the outcomes of all the other analyses and then allows us to make decisions based on them.
Data Analytics problem example
• Last Saturday, I was travelling from source A to destination B.
• The distance between the two places is 20 km.
• Usually it takes 20-25 minutes to reach the destination.
• But last Saturday it took me 45 minutes to reach B.
• Descriptive: Last Saturday, it took more time than usual to reach the destination.
• Diagnostic: It happened due to heavy traffic.
• Predictive: If I follow the same route this Saturday, it will again take more time to reach the destination.
• Prescriptive: It will be better if I try another route.
Problem solving steps in Data Science
1. Define the Problem:
• Identify the problem to be solved.
2. Collecting data:
• Every data-driven problem solving approach needs data in hand.
• The data may be ready to use, or it may have to be gathered from different data sources.
3. Data Preparation:
• It involves different steps like data cleaning, data selection, data
integration, data analysis etc.
4. Model planning:
• Decide the type of machine learning algorithm to train your system for
processing.
• We may train system using different possible models to get the outcome.
Problem solving steps in Data Science (contd.)
5. Model Building:
• After verifying the system outcomes using different machine learning models, select the one that produces the most accurate results.
6. Deriving insights and generating reports:
• By analyzing the outcomes/predictions produced by the system, judgements can be made to solve the problem.
• Judgement reports are prepared to communicate the results.
7. Taking decisions based on insights:
• Based on the judgement reports, decisions can be taken to take care of the problem in future.
Job Roles in Data Science
1. Data Scientist
2. Data Analyst
3. Business Analyst
4. Statistician
5. Database Administrator
6. Data Engineer
7. Data Architect
8. Machine Learning Engineer
Installation of R
Following are the steps to download and install R:
• Go to www.r-project.org
• Click on the CRAN link to choose a CRAN mirror for downloading.
• Select any of the CRAN mirrors.
• In the Download and Install R section, select the link for your operating system.
• Click on install R for the first time.
• Click on Download R.
• On Windows, an .exe file will get downloaded. Run the .exe file and follow all instructions till finish.
• The R terminal is ready to use.
Installation of RStudio
Following are the steps to download and install RStudio:
• Go to www.rstudio.com
• Click on Download.
• Select RStudio Desktop to download.
• Run the downloaded .exe file.
• Follow all instructions till finish.
CRAN (Comprehensive R Archive Network):
• CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R.
• It is supported by the R Foundation.
R Package:
• A collection of R functions, sample data sets and compiled code.
• Stored under the directory "library".
• Some packages get installed during R installation.
• Other packages can be installed as per requirement.
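The points about packages can be checked from the R console. The sketch below uses only functions from base R and utils; the package name in the commented install.packages() call is just an example and is not taken from the slides.

```r
# Directories ("libraries") in which installed packages are stored.
lib_dirs <- .libPaths()

# Packages that ship with R itself, i.e. installed during R installation.
base_pkgs <- rownames(installed.packages(priority = "base"))

# Load an installed package into the current session.
library(stats)

# Install a further package from CRAN when required (needs internet access).
# "ggplot2" is only an example package name.
# install.packages("ggplot2")
```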
R Language
• R is a language and environment for statistical computing and
graphics.
• It is a GNU project.
• It is a different implementation of the S language, which was developed at Bell Laboratories by John Chambers and colleagues.
• R is available as free software.
• It is platform independent. It runs on all major operating systems (Windows, macOS, Linux/UNIX).
• It is extensible, i.e. developers can easily write their own software and distribute it in the form of R add-on packages.
• R is an interpreted language. There is no need to compile a program into object code.
• Every operation takes the form of a function call.
e.g. A<-2 is evaluated as the function call `<-`(A, 2)
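The claim that every operation is a function call can be tried directly: wrapping an operator in backticks lets it be called like any other function. The variable names below are arbitrary.

```r
A <- 2               # the usual assignment
`<-`(B, 2)           # the same assignment written as an explicit function call
total <- `+`(A, B)   # infix operators such as + are functions too
```

After these lines A and B both hold 2 and total holds 4.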
R Language (contd.)
• It provides a massive number of packages for statistical modelling, machine learning, visualization etc.
• It easily produces HTML and PDF reports.
• It has powerful metaprogramming facilities.
• # is used to give a single-line comment. R does not support multiline comments.
Data Structures in R
• R data structures are also called objects.
• They are organised by their dimensionality and by whether their contents are homogeneous or heterogeneous.
Dimension   Homogeneous     Heterogeneous
1-D         Atomic Vector   List
2-D         Matrix          Data frame
N-D         Array
• There is no scalar type of data object.
• A single number or character is a vector of length 1.
Data Structures in R
1. Vector:
• Basic data structure.
• Two types: Atomic Vector and List.
• Common Properties are:
• Type: typeof()
• Length: length()
• Atomic vectors are of four common types: Logical, Integer, Numeric/double, Character. (Complex and raw vectors also exist.)
• is.logical(), is.integer(), is.numeric(), is.character(): used to check whether a vector is of the specified type.
• c(): used to combine more than one vector.
• is.na(): used to check for NA values.
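The properties and helper functions above can be tried on a small vector. The values are arbitrary; the comments state what each call returns.

```r
v <- c(1.5, 2, NA)            # c() combines values into one vector
v_type   <- typeof(v)         # "double"
v_length <- length(v)         # 3
v_isnum  <- is.numeric(v)     # TRUE
v_na     <- is.na(v)          # FALSE FALSE TRUE

mixed <- c(1L, "a")           # mixing types coerces to the most general type
mixed_type <- typeof(mixed)   # "character"
```

Note that the function name is lower case: is.na(), not is.NA().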
Vector (contd.)
• Modifying vector elements:
• A[2]<-3 # set the 2nd element to 3
• A[A<5]<-0 # set every element smaller than 5 to 0
• Delete vector:
• A<-NULL # A becomes NULL; use rm(A) to remove the object itself
• Sorting vector:
• sort(A)
• sort(A, decreasing=TRUE)
• Assigning names to vector components:
• V1<-c(1,2,3)
• names(V1)<-c("first","second","third")
• V1["first"] # displays component with name "first" (character index)
• V2<-c(1,2,3,4,5,6,7,8,9,10)
• V2[1:5] # displays first 5 elements.
• V2[4:9] # displays elements at index positions 4 to 9.
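The operations above, run on concrete values so the effect of each step is visible. The starting values of A are made up for illustration.

```r
A <- c(7, 2, 9, 4)
A[2] <- 3                               # A is now 7 3 9 4
A[A < 5] <- 0                           # elements below 5 become 0: 7 0 9 0
sorted <- sort(A, decreasing = TRUE)    # 9 7 0 0 (A itself is unchanged)

V1 <- c(1, 2, 3)
names(V1) <- c("first", "second", "third")
first_value <- V1["first"]              # named element with value 1

V2 <- 1:10
middle <- V2[4:9]                       # 4 5 6 7 8 9
```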
Vector (contd.)
• seq():
• It is used to generate a sequence of values.
• E.g.: a<-seq(from=1, to=10) or a<-seq(1,10)
• a<-seq(from=1, to=10, by=2)
• a<-seq_len(10)
• a<-seq(from=-5, to=5)
• a<-1:10
• assign():
• It is used to assign a value to a named object.
• E.g. assign("a",10), assign("a","computer")
• rep(): Repeats vector values a specified number of times.
• rep(a,3) or rep(c(1,2),times=3)
• rep(a, each=3)
• rep(c(1,2),times=c(2,4))
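A short run of seq(), rep() and assign() with the results noted in comments; the variable names are arbitrary.

```r
odds   <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9
first5 <- seq_len(5)                       # 1 2 3 4 5

whole  <- rep(c(1, 2), times = 3)          # 1 2 1 2 1 2 (whole vector repeated)
each3  <- rep(c(1, 2), each = 3)           # 1 1 1 2 2 2 (each element repeated)
uneven <- rep(c(1, 2), times = c(2, 4))    # 1 1 2 2 2 2 (per-element counts)

assign("a", 10)                            # same effect as a <- 10
```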
Vector (contd.)
• rm():
• Used to remove a data object from the environment.
• E.g.: rm(a)
• scan():
• It is used to take input from the user.
• a<-scan() # read numeric/double values
• a<-scan(what=integer()) # read integer values
• a<-scan(what=character()) # read character values
• a<-scan(what=" ") # read string values
• Indexing with a vector of positions:
• A[c(2,4)] # display 2nd and 4th components
• A[-1] # skip 1st component in output.
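scan() normally waits for keyboard input, which cannot be shown in a static example. As a stand-in, the sketch below passes the input through scan()'s text argument; everything else follows the calls listed above.

```r
# Non-interactive stand-in for typed input: the text is parsed like keyboard input.
x <- scan(text = "10 20 30", quiet = TRUE)                       # numeric by default
w <- scan(text = "red green", what = character(), quiet = TRUE)  # character values

picked <- x[c(1, 3)]     # 1st and 3rd components: 10 30
rest   <- x[-1]          # everything except the 1st component: 20 30

n_words <- length(w)     # 2
rm(w)                    # remove the object "w" from the environment
```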
R Environment
• An environment is a virtual place to store data objects.
• The default environment of R is R_GlobalEnv.
• Some R commands to work with environments:
• environment(): returns the current environment.
• ls(): lists all objects created and stored in an environment.
• new.env(): creates a new environment.
• E.g. demo<-new.env()
• Creating data objects in the new environment:
• demo$a<-10 # create "a" in environment "demo" with value 10, or
• assign("a",10,envir=demo)
• ls(demo)
R Environment (contd.)
• Removing an object from an environment:
• rm(a, envir=demo) # "a" is the name of the object to be removed
• Checking whether an object is present in an environment:
• exists("a", envir=demo) # "a" is the object to be searched for in the environment.
• Displaying the value of a data object from any environment:
• demo$a or
• get("a", envir=demo)
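The environment commands fit together as follows. One detail is an addition to the slides: inherits = FALSE tells exists() to look only in the given environment, because by default it also searches the enclosing environments.

```r
demo <- new.env()                        # create a new environment
demo$a <- 10                             # create "a" inside demo
assign("b", "computer", envir = demo)    # create "b" inside demo

objs  <- sort(ls(demo))                  # "a" "b"
b_val <- get("b", envir = demo)          # "computer"

had_a <- exists("a", envir = demo, inherits = FALSE)   # TRUE
rm("a", envir = demo)                                   # remove "a" from demo
has_a <- exists("a", envir = demo, inherits = FALSE)   # FALSE
```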
Working Directory
• It is the default path in the computer system where R reads and stores data objects and work done.
• When R imports a dataset given by a relative path, it assumes that the data is stored in the working directory.
• There is a single working directory at a time, called the current working directory.
• getwd() is used to display the current working directory.
• setwd() is used to change the current working directory and set a new one.
• setwd(dir="path")
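A safe way to try getwd() and setwd() is to switch to a temporary directory and then switch back. tempdir() is used here only as an example path that is known to exist.

```r
old_dir <- getwd()      # remember the current working directory
setwd(tempdir())        # change to another directory (example path)
new_dir <- getwd()      # now reports the temporary directory
setwd(old_dir)          # restore the original working directory
```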
Factors in R
• A factor is a data object used to store categorical data.
• It takes only predefined values, called levels.
• Factor components are displayed as characters, but internally they are stored as integers with the levels associated with them.
• E.g. shifts<-factor(c("first","second"))
• typeof(shifts) : displays the storage type of the factor ("integer")
• levels(shifts) : displays the levels of the factor
• nlevels(shifts) : displays the number of levels
• Directions<-c("North","South","West")
• Directions<-factor(Directions, levels=c("North","South","West","East")) # levels are case sensitive; values that match no level become NA
• str(Directions) : displays the structure of the factor.
• as.numeric(shifts) : converts the factor to its integer codes
• as.character(shifts) : converts the factor to character.
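The factor functions above, applied to the directions example. The last two lines illustrate why the spelling of the levels matters: R compares values and levels case-sensitively.

```r
directions <- c("North", "South", "West")
f <- factor(directions, levels = c("North", "South", "West", "East"))

lv    <- levels(f)         # "North" "South" "West" "East"
n_lv  <- nlevels(f)        # 4, "East" is a level even though it never occurs
store <- typeof(f)         # "integer"
codes <- as.integer(f)     # 1 2 3
chars <- as.character(f)   # "North" "South" "West"

# Lower-case levels do not match "North", "South", "West": every value becomes NA.
bad <- factor(directions, levels = c("north", "south", "west", "East"))
all_missing <- all(is.na(bad))   # TRUE
```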
Array in R
• It is a multidimensional data structure.
• A 1D array behaves like a vector; a 2D array is a matrix.
• Creating array:
• Syntax:
• array_name<-array(data, dim=c(rowsize, columnsize, matrices), dimnames)
• The "array" function is used to create an array in R.
• "data" are the input values from which the array is created. It can be either a vector or a number sequence.
• "dim" specifies the dimensions of the array, i.e. number of rows, number of columns and number of matrices in the array.
• "dimnames" is used to specify the names of the dimensions.
• V1<-c(1,2,3,4,5)
• V2<-c(6,7,8,9)
• A<-array(c(V1,V2), dim=c(3,3)) : 3x3 array "A" (3 rows, 3 columns) created from vectors V1, V2.
Array in R (contd.)
• B<-array(c(V1,V2), dim=c(3,3,2))
• Array "B" is created with 3 rows, 3 columns and 2 matrices. The 9 values of V1 and V2 are recycled, so both 3x3 matrices contain the same values.
• C<-array(1:4, dim=c(2,2)) : 2x2 array "C" is created from the sequence 1:4
• Assigning names to rows, columns and matrices:
• column.names<-c("c1","c2","c3") : set column names
• row.names<-c("r1","r2","r3") : set row names
• matrix.names<-c("m1","m2") : set matrix names
• D<-array(c(V1,V2), dim=c(3,3,2), dimnames=list(row.names, column.names, matrix.names))
• The vectors from which an array is created can be of the same length or of different lengths.
• An array can be created from a single vector, two vectors or more than two vectors.
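The array examples above as one runnable sketch. The names rows, cols and mats replace row.names, column.names and matrix.names only to avoid reusing the name of R's row.names() function; the values are those from the slides.

```r
V1 <- c(1, 2, 3, 4, 5)
V2 <- c(6, 7, 8, 9)

A <- array(c(V1, V2), dim = c(3, 3))    # 3x3, filled column by column
a23 <- A[2, 3]                          # 8

rows <- c("r1", "r2", "r3")
cols <- c("c1", "c2", "c3")
mats <- c("m1", "m2")
D <- array(c(V1, V2), dim = c(3, 3, 2),
           dimnames = list(rows, cols, mats))   # the 9 values are recycled

d_dim <- dim(D)                         # 3 3 2
d_val <- D["r2", "c3", "m2"]            # 8, same as in the first matrix
```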
Array in R (contd.)
• Access/display array elements:
• Elements in an array can be accessed using row index, column index and matrix index/level.
• Syntax: array_name[row_index, column_index, matrix_level]
• Below, A is the 3x3 array and B is the 3x3x2 array created earlier.
• print(A[2,3]) : prints element at 2nd row and 3rd column.
• print(B[2,3,1]) : prints element at 2nd row and 3rd column from 1st matrix.
• print(B[2,3, ]) : prints element at 2nd row and 3rd column of both matrices.
• print(B[ , ,1]) : prints first matrix.
• print(B[1, , ]) : prints first row of both matrices.
• print(B[ ,2,1]) : prints 2nd column of matrix 1.
• print(B[c(1,2), ,1]) : prints 1st & 2nd rows of matrix 1.
• print(B[ ,c(2,3), ]) : prints 2nd & 3rd columns of both matrices.
• print(B[c(1,2),c(2,3),2]) : prints the values of 2nd & 3rd columns for 1st & 2nd rows of matrix 2.
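To make the indexing results easy to follow, the sketch below uses a 3x3x2 array filled with 1:18 (an assumption made for illustration, so that every element is distinct) instead of the recycled values used earlier.

```r
B <- array(1:18, dim = c(3, 3, 2))   # matrix 1 holds 1..9, matrix 2 holds 10..18

one_el   <- B[2, 3, 1]               # 8: row 2, column 3 of matrix 1
both_el  <- B[2, 3, ]                # 8 17: the same position in both matrices
first_m  <- B[, , 1]                 # the whole first matrix (3x3)
col2_m1  <- B[, 2, 1]                # 4 5 6: 2nd column of matrix 1
row1_all <- B[1, , ]                 # 3x2: first row of each matrix, one per column
block    <- B[c(1, 2), c(2, 3), 2]   # rows 1-2, columns 2-3 of matrix 2
```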
Array in R(contd.)
• Array Arithmetic:
• Arithmetic and logical operations can be performed on two or more arrays.
• The arrays on which we want to perform arithmetic operations must have the same dimensions.
• A<-array(1:4, dim=c(2,2))
• B<-array(5:8, dim=c(2,2))
• C<-array(9:12, dim=c(2,2))
• A+B+C
• A-B-C
• A*B*C
• A/B/C
• A<-array(1:4, dim=c(2,2,2))
• A[, , 1]+A[, , 2] : adds the 1st and 2nd matrices of the array.
• A[1, , 1]+A[1, , 2] : adds the 1st rows of both matrices in the array.
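Array in R (contd.)
• Logical operations and summary functions:
• A<-array(1:4, dim=c(2,2))
• B<-array(5:8, dim=c(2,2))
• A&B # element-wise AND
• A|B # element-wise OR (|| accepts only single values, not arrays)
• sum(A)
• min(A)
• max(A)
• Change dimensions of array:
• A<-array(1:12, dim=c(3,2,2))
• dim(A)<-c(2,3,2) # the new dimensions must hold the same number of elements (12)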
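A compact run of the array operations, including what happens when the dimensions do not match. The tryCatch() wrapper is only there so that the example keeps running after the expected error.

```r
A <- array(1:4, dim = c(2, 2))
B <- array(5:8, dim = c(2, 2))

added    <- A + B        # element-wise: 6 8 10 12
product  <- A * B        # element-wise: 5 12 21 32
both_set <- A & B        # element-wise AND: all TRUE (non-zero counts as TRUE)
total    <- sum(A)       # 10

# Arrays with different dimensions cannot be combined.
mismatch <- tryCatch(A + array(1:6, dim = c(2, 3)),
                     error = function(e) "error")

# Reshaping keeps the elements; only the dimensions change (3*2*2 = 2*3*2 = 12).
X <- array(1:12, dim = c(3, 2, 2))
dim(X) <- c(2, 3, 2)
```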
Matrix in R
• A matrix is a 2-dimensional array.
• Syntax:
• matrix_name<-matrix(data, nrow, ncol, byrow=TRUE, dimnames)
• The "matrix" function is used to create a matrix in R.
• "data" are the input values from which the matrix is created. It can be either a vector or a number sequence.
• "nrow" specifies the number of rows in the matrix.
• "ncol" specifies the number of columns in the matrix.
• "dimnames" is used to specify the names of rows and columns.
• By default matrix elements are arranged column-wise; setting byrow=TRUE arranges them row-wise.
• M1<-matrix(1:4, nrow=2, ncol=2, dimnames=list(c("r1","r2"),c("c1","c2")))
• M2<-matrix(5:8, nrow=2, ncol=2, dimnames=list(c("r1","r2"),c("c1","c2")))
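The effect of byrow is easiest to see by filling the same values both ways. M_col and M_row are illustrative names; M1 is the matrix from the slide.

```r
M_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: fill column by column
M_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill row by row

col_first_row <- M_col[1, ]    # 1 3 5
row_first_row <- M_row[1, ]    # 1 2 3

M1 <- matrix(1:4, nrow = 2, ncol = 2,
             dimnames = list(c("r1", "r2"), c("c1", "c2")))
named_el <- M1["r2", "c1"]     # 2
```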
Matrix Arithmetic(contd.)
• For element-wise arithmetic, the matrices must be of the same dimensions.
• M1+M2 M1-M2
• M1*M2 M1/M2 (element-wise multiplication and division)
• Scalar arithmetic:
• M1+2, M1*5
• Matrix Multiplication:
• M1 %*% M2 : e.g. M1 is a 2x3 matrix & M2 is a 3x2 matrix.
• It computes the matrix product. If the dimensions of the matrices are different, then the number of columns in the 1st matrix must be equal to the number of rows in the 2nd matrix.
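• Access Matrix elements:
• M1[1] # first element (elements are numbered column-wise)
• M3[c(1:3),c(2,3)] # rows 1 to 3 of columns 2 and 3 (M3 needs at least 3 rows and 3 columns)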
• Matrix Concatenation:
• M3<-cbind(M1,M2) #combine 2 matrices column wise
• M4<-rbind(M1,M2) #combine 2 matrices row wise
• M1<-cbind(M1,c(5,6)) # Add new column to existing matrix
• M2<-rbind(M2,c(9,10)) # Add new row to existing matrix
• Miscellaneous functions:
• sum(M1)
• rowSums(M1)
• colSums(M1)
• min(M1)
• max(M1)
• is.matrix(M1)
• nrow(M1)
• ncol(M1)
• Different ways to assign names to rows & columns:
• row<-c("r1","r2","r3")
• columns<-c("c1","c2","c3")
• M3<-matrix(1:9, nrow=3, dimnames=list(row, columns))
• M4<-matrix(1:9, nrow=3, ncol=3)
• rownames(M4)<-c("r1","r2","r3")
• colnames(M4)<-c("c1","c2","c3")
• Modify Matrix elements:
• M1[1,3]<-0
• M2[M2 %% 2 == 0]<-1 # replace all even elements by 1 (%% is the modulo operator).
• Transpose of matrix:
• M4<-t(M4)
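The matrix operations above on small concrete matrices, so that the dimension rule for %*% and the effect of each function can be verified. The matrices are made up for illustration.

```r
M1 <- matrix(1:6, nrow = 2)     # 2x3
M2 <- matrix(1:6, nrow = 3)     # 3x2
P  <- M1 %*% M2                 # columns of M1 (3) = rows of M2 (3), result is 2x2

S <- matrix(1:4, nrow = 2)      # 2x2: rows (1 3) and (2 4)
squares <- S * S                # element-wise product: 1 4 9 16
wide <- cbind(S, c(5, 6))       # add a column, result is 2x3
tall <- rbind(S, c(9, 10))      # add a row, result is 3x2
wide_t <- t(wide)               # transpose, result is 3x2

S[S %% 2 == 0] <- 1             # replace even elements by 1: rows (1 3) and (1 1)
row_totals <- rowSums(S)        # 4 2
col_totals <- colSums(S)        # 2 4
```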
Strings in R
• A sequence of characters is a string, written within single or double quotes.
• A<-"hello" B<-"computer"
• Concatenation of strings:
• paste(A,B)
• Read input from user:
• C<-readline("enter your name:")
• paste(A,C)
• paste(A,"world")
• nchar(A) # count no. of characters in string including blank spaces.
• Char_arr<-character(3) # create character vector of length 3.
• Char_arr[1]<-"first" # assign value to first index.
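The string functions in action. readline() is left out because it waits for keyboard input; the sep argument of paste() is an addition to the slides and controls the separator, which is a single space by default.

```r
A <- "hello"
B <- "computer"

greeting <- paste(A, B)                   # "hello computer"
joined   <- paste(A, "world", sep = "")   # "helloworld"
n_chars  <- nchar(greeting)               # 14, the space is counted

char_arr <- character(3)                  # three empty strings
char_arr[1] <- "first"                    # "first" "" ""
```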
