on
Data Analytics using R
by
Mrs. Vrushali A. Patil
Contents of session
• Introduction to Data Science.
• Components of Data Science: Big Data, Data Mining, Machine
Learning, Data Analytics.
• Introduction to R
• Data Structures in R: Vector, Factor, Array, Matrix
• Strings in R
Data Science
• Data Science is about using data to solve a problem.
• The problem may involve decision making, prediction or recommendation.
• E.g. Forecasting the weather.
Predicting the winner of a sports series/competition.
Recommending products to customers during online shopping.
Finding the reason for a decline in sales of a product.
• It is an area of study in which vast amounts of data are analyzed with scientific methods and algorithms to extract useful information.
• The data used by the algorithms can be structured or unstructured.
• Data Science is also used to communicate results in the form of graphics, charts etc.
Components of Data Science
1. Big Data:
• Data that is huge in volume and grows continuously over time.
• E.g. The data that is generated from social media websites, emails etc.
The data generated through online education portals/ shopping portals.
The data generated in Stock Exchanges, banking sectors etc.
The data generated by satellites.
• Big Data can be categorized into 3 types:
• Structured data: Highly Organised data stored in spreadsheets, database
tables.
• Unstructured data: Data with no predefined structure that contains images,
videos, text etc.
• Semi-Structured data: Loosely organised Data stored in XML, HTML, JSON
format.
Characteristics of Big Data
1. Volume: It refers to the amount of data generated through social media,
phones, satellites etc.
• A distributed DBMS is used to store such large volumes of data.
2. Velocity: It refers to the speed at which large amounts of data are generated, collected and analyzed.
• Big data tools allow us to analyze data in real time, i.e. without first storing it in a database for processing.
3. Value: It refers to the worth of data being extracted.
• Big data is useless until we retrieve some useful information from it.
4. Variety: It refers to the different types of data and heterogeneous sources
from which data are being generated and collected.
5. Veracity: It refers to the quality or trustworthiness of data.
Components of Data Science(contd.)
2. Data Mining:
• It is the process of discovering or extracting knowledge from big data stored in multiple data sources such as files, databases, data warehouses etc.
• Knowledge extraction is the process of finding related, useful or previously unknown information from big data.
Architecture of Data Mining
[Figure: data mining architecture, with the Graphical User Interface at the top and the data sources (Web, Database, Data Warehouse, Other Sources) at the bottom]
Architecture of Data Mining(contd.)
1. Data Sources: Actual storage of data.
2. Different Processes:
• Data Cleaning is the process of detecting & removing incomplete, corrupted or inaccurate data from the data source.
• Not all data is required for processing, so data selection is also necessary before passing it to the server.
6. GUI:
• It is the communication medium between the user and the data mining system.
7. Knowledge Base:
• It contains user experiences that might be useful to make results more accurate.
• The pattern evaluation engine interacts with the knowledge base both to get inputs from it and to update it.
Types of Data Mining Architecture
1. No Coupling:
• The data mining system retrieves data from a data source other than a database or data warehouse, processes the data using data mining algorithms and stores the results in the file system.
• It does not take any advantage of database functionality.
2. Loose Coupling:
• This system retrieves data from database or data warehouse, processes that data and
stores result back to the database or data warehouse.
3. Semi-tight Coupling:
• This system is linked to database and data warehouse.
• It also uses several database features like indexing, sorting, aggregation to perform
mining tasks.
4. Tight Coupling:
• This system uses all the features of database and data warehouse to perform mining.
• It offers high scalability and high performance.
Data Mining Techniques
1. Decision Trees:
• It uses a tree representation to solve a problem. It creates a classifier model to classify input.
• The root node represents a simple question or condition used to retrieve data and make a decision.
• Internal nodes also contain a simple question or condition.
• A leaf node represents the action to be performed.
[Figure: example decision tree with the root node "Employed" and branches N and Y]
3. Clustering:
• Classes are created as per user requirements.
4. Classification:
• Input data items are classified into predefined classes.
5. Prediction:
• It is used to predict results from historical data.
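The decision tree technique listed above can be pictured as nested conditions. The following is a hand-written sketch in R, not a model learned from data: only the root question "Employed" comes from the slide, while the function name `classify_applicant`, the `income` threshold and the leaf actions are hypothetical and chosen purely for illustration.

```r
# Hand-written decision tree (hypothetical example, not learned from data).
classify_applicant <- function(employed, income) {
  if (!employed) {            # root node: "Employed?" answered N
    return("reject")          # leaf node: action to perform
  }
  if (income >= 30000) {      # internal node: assumed second question
    return("approve")         # leaf node
  }
  "review manually"           # leaf node
}

classify_applicant(employed = TRUE, income = 45000)
classify_applicant(employed = FALSE, income = 45000)
```

In practice the questions and thresholds would be learned from training data rather than written by hand.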
Machine Learning
• Machine Learning is an application of Artificial Intelligence.
• It gives a computer the ability to learn automatically from past experience and to improve without being explicitly programmed.
• Machine learning algorithms enable computers to learn from a data set and improve themselves.
• It enables software applications to predict results more accurately.
• The main purpose of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an outcome.
• A machine learning algorithm uses either labelled or unlabelled data sets as input training data.
Machine Learning(contd.)
• There are two types of machine learning algorithms:
1. Supervised Learning:
• In this approach, the system is trained using well labelled data.
• Input data is tagged with the correct answer.
• After training, the system is tested by providing a test data set. A model is accurate if it produces the correct output.
• The system can classify new inputs/observations into pre-labelled classes by determining features of the input data.
• It cannot handle complex data sets.
• It cannot produce correct output if the test data is different from the training data set.
Machine Learning(contd.)
• Types of supervised machine learning algorithms:
1. Classification:
• In this model, input data is classified in predefined classes.
• Output is also from one of the predefined classes.
2. Regression:
• This technique is mainly used to model the relationship between two variables.
• It is used to predict the value of an output variable Y from one or more input variables X.
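As an illustration of regression, the sketch below fits a straight line with base R's `lm()` and predicts Y for new X values. The choice of the built-in `cars` data set (speed as X, stopping distance as Y) is an assumption made for this example; it is not part of the session material.

```r
# Simple linear regression on the built-in 'cars' data set:
# predict stopping distance (Y) from speed (X).
model <- lm(dist ~ speed, data = cars)
coef(model)   # intercept and slope of the fitted line

# Predict Y for new X values.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(model, newdata = new_speeds)
predicted
```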
Machine Learning(contd.)
2. Unsupervised Learning:
• In this algorithm, system is trained using unlabelled data.
• No predefined classes are present.
• System creates classes by identifying similarities, differences and patterns
of input data.
• Types of unsupervised machine learning algorithms:
1. Clustering
2. Association
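Clustering can be tried out with base R's `kmeans()`. The sketch below is only one possible illustration: the built-in `iris` data set, the choice of three clusters and the seed are assumptions made here, and the species labels are used only afterwards to inspect the result, never for training.

```r
# Cluster the four numeric columns of 'iris'; the Species label is NOT used.
set.seed(42)                       # k-means starts from random centres
features <- iris[, 1:4]
fit <- kmeans(features, centers = 3, nstart = 10)

fit$size                           # number of observations per cluster
table(fit$cluster, iris$Species)   # compare clusters with the known species
```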
Data Analytics
• Data analytics is the process of organizing and analyzing data in order to derive important judgements.
• Types of Data Analytics:
1. Descriptive Analysis:
• It identifies what happened in the past.
• Results are represented in the form of graphical visualizations.
2. Diagnostic Analysis:
• It identifies why it happened (root cause analysis).
3. Predictive Analysis:
• It predicts what is likely to happen in the future, using past data.
4. Prescriptive Analysis:
• It analyzes the outcomes of the other types of analytics and then allows us to make decisions based on them.
Data Analytics problem example
• Last Saturday, I was travelling from source A to destination B.
• The distance between the two places is 20 km.
• Usually it takes 20-25 minutes to reach the destination.
• But last Saturday it took me 45 minutes to reach B.
• Descriptive: Last Saturday, it took more time than usual to reach the destination.
• Diagnostic: It happened due to heavy traffic.
• Predictive: If I follow the same route this Saturday, it will again take more time to reach the destination.
• Prescriptive: It will be better if I try another route.
Problem solving steps in Data Science
1. Define the Problem:
• Identify the problem to be solved.
2. Collecting data:
• Every data-driven problem solving approach needs data in hand.
• The data may be ready to use, or we may need to gather it from different data sources.
3. Data Preparation:
• It involves different steps like data cleaning, data selection, data
integration, data analysis etc.
4. Model planning:
• Decide the type of machine learning algorithm to train your system for
processing.
• We may train system using different possible models to get the outcome.
Problem solving steps in Data Science(contd.)
5. Model Building:
• After verifying the system outcomes produced by the different machine learning models, select the one that produces the most accurate results.
6. Deriving insights and generating reports:
• By analyzing the outcomes/predictions produced by the system, judgements can be made to solve the problem.
• Judgement reports are prepared to communicate the results.
7. Taking decisions based on insights:
• Based on the judgement reports, decisions can be taken to handle the problem in future.
Job Roles in Data Science
1. Data Scientist
2. Data Analyst
3. Business analyst
4. Statistician
5. Database Administrator
6. Data Engineer
7. Data Architect
8. Machine Learning Engineer
Installation of R
Following are the steps to download and install R:
• Go to www.r-project.org
• Click on the CRAN link to choose a CRAN mirror for downloading.
• Select any of the CRAN mirrors.
• In the Download and Install R section, select the link for your operating system.
• On Windows, click on install R for the first time.
• Click on Download R.
• An .exe file will be downloaded. Run it and follow the instructions until the installation finishes.
• The R terminal is ready to use.
Installation of RStudio
Following are the steps to download RStudio:
• Go to www.rstudio.com
• Click on Download.
• Select RStudio Desktop to download.
• Run the downloaded .exe file.
• Follow the instructions until the installation finishes.
CRAN (Comprehensive R Archive Network):
• CRAN is a network of ftp and web servers around the world that store
identical, up-to-date, versions of code and documentation for R.
• It is supported by the R Foundation.
R Package:
• A collection of R functions, sample data sets and compiled code.
• Stored under the directory “library”.
• Some packages are installed along with R.
• Other packages can be installed as required.
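The points about packages can be checked from the R console. The sketch below uses only base R functions; the package name `ggplot2` in the commented-out line is just an example of a CRAN package and is an assumption of this illustration.

```r
# Directories in which installed packages are stored ("library" folders).
lib_dirs <- .libPaths()
lib_dirs

# Names of the packages that are currently installed.
installed <- rownames(installed.packages())
head(installed)

# Installing from CRAN needs network access, so it is left commented out.
# install.packages("ggplot2")

# Load an installed package; 'stats' ships with R itself.
library(stats)
```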
R Language
• R is a language and environment for statistical computing and
graphics.
• It is a GNU project.
• It is a different implementation of the S language, which was developed at Bell Laboratories by John Chambers and colleagues.
• R is available as free software.
• It is platform independent. It runs on all major operating systems.
• It is extensible, i.e. developers can easily write their own software and distribute it in the form of R add-on packages.
• R is an interpreted language, so there is no need to compile a program into object code.
• Every operation is a function call.
e.g. A<-2 is evaluated as the function call `<-`(A,2)
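The function-call form of an expression can be typed directly by wrapping the operator name in backticks. A minimal sketch follows; the variable names `B` and `total` are introduced here only for illustration.

```r
A <- 2            # usual assignment
`<-`(B, 2)        # the same assignment written as an explicit function call
identical(A, B)   # TRUE

# Arithmetic operators are functions too.
total <- `+`(1, 2)
total             # 3
```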
R Language(contd.)
• It provides a massive collection of packages for statistical modelling, machine learning, visualization etc.
• It easily produces HTML and PDF reports.
• It has powerful metaprogramming facilities.
• # is used to write a single-line comment. R does not support multiline comments.
Data Structures in R
• R data structures are also called objects.
• They are organised by their dimensionality and by whether their contents are homogeneous (all of one type) or heterogeneous:
Dimension    Homogeneous      Heterogeneous
1-D          Atomic Vector    List
2-D          Matrix           Data frame
N-D          Array            -
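One object of each kind from the table can be created as follows. This is a small sketch using base R constructors; the variable names and values are arbitrary choices for the illustration.

```r
v   <- c(1.5, 2.5, 3.5)                          # 1-D, homogeneous: atomic vector
l   <- list(1, "a", TRUE)                        # 1-D, heterogeneous: list
m   <- matrix(1:6, nrow = 2)                     # 2-D, homogeneous: matrix
df  <- data.frame(id = 1:2, name = c("x", "y"))  # 2-D, heterogeneous: data frame
arr <- array(1:24, dim = c(2, 3, 4))             # N-D, homogeneous: array

# A vector coerces mixed input to one common type; a list does not.
class(c(1, "a"))   # "character"
dim(arr)           # 2 3 4
```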
• scan():
• It is used to take input from the user.
• a<- scan() #read numeric/double values
• a<- scan(what=integer()) #read integer values
• a<- scan(what=character()) #read character values
• a<- scan(what="") #read string values
• A[c(2,4)] #display 2nd and 4th components
• A[-1] #Skip 1st component in output.
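Because `scan()` waits for keyboard input, the commands above are easiest to try with its `text` argument, which supplies the input as a string. The sketch below does that and then applies the two indexing forms; the variable names and the input values are assumptions made for the example.

```r
# scan() normally reads from the keyboard; the 'text' argument supplies the
# input here so that the example runs without user interaction.
a <- scan(text = "10 20 30 40", quiet = TRUE)                    # double
b <- scan(text = "1 2 3", what = integer(), quiet = TRUE)        # integer
s <- scan(text = "red green", what = character(), quiet = TRUE)  # character

a[c(2, 4)]   # 2nd and 4th components: 20 40
a[-1]        # everything except the 1st component: 20 30 40
```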
R Environment
• An environment is a virtual place to store data objects.
• The default environment of R is R_GlobalEnv.
• Some R commands to work with environments:
• environment(): returns the current environment.
• ls(): lists all objects created and stored in an environment.
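A short sketch of working with environments follows. In addition to `ls()`, it uses the base R functions `globalenv()`, `environmentName()`, `new.env()`, `assign()` and `get()`, which are not mentioned on the slide; the environment `e` and its contents are hypothetical.

```r
environmentName(globalenv())   # "R_GlobalEnv"

# Create a separate environment and store two objects in it.
e <- new.env()
assign("x", 10, envir = e)
assign("y", "text", envir = e)

ls(envir = e)          # "x" "y"
get("x", envir = e)    # 10
```

Calling `ls()` without arguments lists the objects of the environment it is called from, which at the console is the global environment.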
• Creating array:
• Syntax:
• Array_name=array(data, dim=c(rowsize, columnsize, matrices), dimnames)
• The “array” function is used to create an array in R.
• “data” are the input values from which the array is created. It can be either a vector or a number sequence.
• “dim” specifies the dimensions of the array, i.e. the number of rows, the number of columns and the number of matrices in the array.
• “dimnames” is used to specify the names of the dimensions.
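The syntax above can be tried with a small example. The sizes and the dimension names used here ("r1", "c1", "m1" and so on) are arbitrary choices for the illustration.

```r
arr <- array(
  1:12,
  dim = c(2, 3, 2),    # 2 rows, 3 columns, 2 matrices
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"), c("m1", "m2"))
)

arr["r1", "c2", "m2"]   # element in row 1, column 2 of the 2nd matrix: 9
dim(arr)                # 2 3 2
```

The values are filled column by column, one matrix after the other, so the first matrix holds 1 to 6 and the second holds 7 to 12.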
• Scalar arithmetic:
• M1+2, M1*5
• Matrix Multiplication:
• M1 %*% M2 : M1 is a 2x3 matrix & M2 is a 3x2 matrix.
• It computes the matrix product. The no. of columns in the 1st matrix must be equal to the no. of rows in the 2nd matrix.
• Miscellaneous functions:
• sum(M1)
• rowSums(M1)
• colSums(M1)
• min(M1)
• max(M1)
• is.matrix(M1)
• nrow(M1)
• ncol(M1)
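The operations listed above can be run on two small matrices. The contents of `M1` and `M2` below are assumptions made for the example; only their shapes (2x3 and 3x2) follow the text.

```r
M1 <- matrix(1:6, nrow = 2)   # 2x3, filled column by column
M2 <- matrix(1:6, nrow = 3)   # 3x2

M1 + 2                  # scalar arithmetic is applied to every element
P <- M1 %*% M2          # (2x3) %*% (3x2) gives a 2x2 matrix
dim(P)                  # 2 2

sum(M1)                 # 21
rowSums(M1)             # 9 12
colSums(M1)             # 3 7 11
c(nrow(M1), ncol(M1))   # 2 3
```

Note that `M1 * M2` would fail here, because `*` multiplies element by element and needs matrices of the same shape.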
Matrix Arithmetic(contd.)
• Different ways to assign names to rows & columns:
• row=c("r1","r2","r3")
• columns=c("c1","c2","c3")
• M3<-matrix(1:9,nrow=3,dimnames=list(row, columns))
• M4<-matrix(1:9,nrow=3, ncol=3)
• rownames(M4)=c("r1","r2","r3")
• colnames(M4)=c("c1","c2","c3")
• Transpose of matrix:
• M4<-t(M4)
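Once rows and columns are named, elements can be addressed by name, and the names move with the data when the matrix is transposed. A brief sketch follows; the variable `T4` is introduced here so that the original `M4` stays available for comparison.

```r
M4 <- matrix(1:9, nrow = 3, ncol = 3)
rownames(M4) <- c("r1", "r2", "r3")
colnames(M4) <- c("c1", "c2", "c3")

M4["r1", "c2"]    # 4

T4 <- t(M4)       # transpose: row and column names swap as well
T4["c2", "r1"]    # 4, the same element at its transposed position
rownames(T4)      # "c1" "c2" "c3"
```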
Strings in R
• A sequence of characters is a string, written within single or double quotes.
• A<-"hello"
• B<-"computer"
• Concatenation of strings:
• paste(A,B)
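The concatenation above can be tried as follows. Besides `paste()`, the sketch uses the base R functions `paste0()` and `nchar()`, which are additions made for this illustration.

```r
A <- "hello"
B <- "computer"

paste(A, B)              # "hello computer" (separated by a space by default)
paste(A, B, sep = "-")   # "hello-computer"
paste0(A, B)             # "hellocomputer" (no separator)
nchar(A)                 # 5
```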