You are on page 1of 91

Introduction to Data Science with R

Programming

Dr. D. Vimal Kumar


Associate Professor
Department of CS
Nehru Arts and Science College
Coimbatore
TABLE OF CONTENTS
 Vector
 Array
 Lists
 For Loop With Syntax and Examples
 Import Excel into Rstudio
Vectors
• Vector is a basic data structure in R. It contains
element of the same type. The data types can be
logical, integer, double, character, complex or raw.
A vector's type can be checked with the typeof()
function. Another important property of a vector is
its length.
• How do I remove a Vector
remove and rm can be used to remove objects.
Vector
a <- c(3,4,5,6,7,8, 9,101,102,204,209, 506)
print (length(a))
print (head(a,2))
print (tail(a , 3))
print (tail(a , 1))
Matrix
• Matrix is the R object in which the elements are arranged in a two-dimensional
rectangular layout.

Syntax
matrix(data, nrow, ncol, byrow, dimnames)
Where:
• data is the input vector which becomes the data elements of the matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical clue. If TRUE, then the input vector elements are arranged by
row.
• dimname is the names assigned to the rows and columns.
Example
Mymatrix <- matrix(c(1:25), nrow = 5,  ncol = 5, byrow = TRUE)
Arrays
Arrays in R are data objects which can be used to store data in more
than two dimensions. It takes vectors as input and uses the values
in the dim parameter to create an array.
The basic syntax for creating an array in R is −
array(data, dim, dimnames) Where:
• data is the input vector which becomes the data elements of the
array.
• dim is the dimension of the array, where you pass the number of
rows, column and the number of matrices to be created by
mentioned dimensions.
• dimname is the names assigned to the rows and columns.
Example - Myarray <- array( c(1:16), dim=(4,4,2))
Array
Example for Array
vector1 <- c(6,2,1)
vector2 <- c(10,16,22,43,15,25)
Result <-array(c(vector1,vector2),dim=c(3,3,2))
print (Result)
List
•What is a llist?
Lists are the R objects which contain elements of different
types like − numbers, strings, vectors and another list inside
it.
•Listcan containelementsofdifferentdata types
•Itcancontain numbers, vectorsorlistinsideitself
•Itiscreatedusinglist()function
Syntax
Listname <- list(values)
List
• Create a list
listdata <-list("Green","Red", c( 21, 32, 11),TRUE, 24.5, 11)
print (listdata)
print (listdata [1])
print (listdata [3])
• Assign names to the list
names(listdata) <- c("lst quarter","A_matrix","A Innerlist")
• Add element to some position in the list
listdata[4] <- "New element"
Cont…..
print (listdata[4])
• Remove the fourth element from list
listdata[4] <- NULL
• Print 4th Element
print(listdata[4])
• Update 3rd element
listdata[3] <- "Updated element"
print(listdata[3])
Difference

S.No Vectors List

1 Vector stores elements of the A list holds different data


same type or converts such as Numeric, Character,
implicitly. logical, etc

2  vector is not recursive Lists are recursive

3 The vector is one- list is a multidimensional


dimensional object
Data Frame
A Data Frame is a table or a two-dimensional array-like structure
in which each column contains values of one variable and each
row contains one set of values for each column. Below are
some of the characteristics of a Data Frame that needs to be
considered every time we work with them:
• The column names should be non-empty.
• Each column should contain the same amount of data items.
• The data stored in a data frame can be of numeric, factor or
character type.
• The row names should be unique.
emp_id = c(100:104)
emp_name = c("John","Henry","Adam","Ron","Gary")
dept = c("Sales","Finance","Marketing","HR","R & D")
emp.data <- data.frame(emp_id, emp_name, dept)
emp.data
Arithmetic Operators
Relational Operat0rs
Logical Operators
Conditional Statements
Example of IF statement
• num1=10
• num2=20
 
• if(num1<=num2){
• print("Num1 is less or equal to Num2")
Else Statement
• Else Statement: The else statement is used when all
the other expressions are checked and found invalid.
This will be the last statement that gets executed as
part of the If – Else if branch.
R Programming: Loops

• A loop statement allows us to execute a statement or


group of statements multiple times. There are mainly
3 types of loops in R:
repeat Loop: It repeats a statement or group of
statements while a given condition is TRUE. Repeat
loop is the best example of an exit controlled loop
where the code is first executed and then the
condition is checked to determine if the control
should be inside the loop or exit from it. Below is the
flow of control in
Flow Chart - Repeat
Example - Repeat
x<-2
repeat {
x= x^2
print(x)
if(x>1000) {
break
}
}
For Loop
For Loop: It is used to repeats a statement or group of
for a fixed number of times. Unlike repeat and while
loop, the for loop is used in situations where we are
aware of the number of times the code needs to
executed beforehand. It is similar to the while loop
where the condition is first checked and then only
does the code written inside get executed. Lets see
the flow of control of for loop now
Flowchart
For
Syntax
for (variable in vector)
{
Commands
}
Example – 1
for(i in 1:5)
print(i^2)
Example - 2
for(i in c(-3,6,2,5,9))
{
print(i^2)
}
Example - 3
x<-c(-3,6,2,5,9)
for(i in x)
print (c(i, i^2))
Example – 4
Storage <- numeric(5)
for(i in 1:5){
Storage[i] <- (i^2)}
print (Storage)
y<-mean(Storage)
print (y)
Cont...
x<-c(-3,6,2,5,9)
Storage 1<- numeric(5)
for(i in 1:5){
Storage1[i] <- (x[i])^2}
print (Storage1)
Fibonacci sequence:
# take input from the user
nterms = as.integer(readline(prompt="How many terms? "))
# first two terms
n1 = 0
n2 = 1
count = 2
# check if the number of terms is valid
if(nterms <= 0) {
print("Plese enter a positive integer")
} else {
if(nterms == 1) {
print("Fibonacci sequence:")
print(n1)
} else {
print("Fibonacci sequence:")
print(n1)
print(n2)
while(count < nterms) {
nth = n1 + n2
print(nth)
# update values
n1 = n2
n2 = nth
count = count + 1
}}}
Simple Calculator
# Program make a simple calculator that can add, subtract, multiply and divide using functions
add <- function(x, y) {
return(x + y)
}
subtract <- function(x, y) {
return(x - y)
}
multiply <- function(x, y) {
return(x * y)
}
divide <- function(x, y) {
return(x / y)
}
# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice = as.integer(readline(prompt="Enter choice[1/2/3/4]: "))
num1 = as.integer(readline(prompt="Enter first number: "))
num2 = as.integer(readline(prompt="Enter second number: "))
operator <- switch(choice,"+","-","*","/")
result <- switch(choice, add(num1, num2), subtract(num1, num2), multiply(num1, num2), divide(num1, num2))
print(paste(num1, operator, num2, "=", result))
VISUALIZING DATA IN R.
•Data-Visualization tools and techniques offer executives and other
knowledge workers new approaches to dramatically improve their ability
to grasp information hiding in their data.

•Data visualization is a general term that describes any effort to help


people understand the significance of data by placing it in a visual
context. Patterns, trends and correlations that might go undetected in
text-based data can be exposed and recognized easier with data
visualization software.
•It isn't just the attraction of the huge range of statistical analyses
afforded by R that attracts data people to R. The language has also
developed a rich ecosystem of charts, plots and visualizations over
the years.
ggplot2
ggplot2 is a data visualization package for the statistical programming
language R.

Created by Hadley Wickham in 2005, ggplot2 is an implementation of


Leland Wilkinson's Grammar of Graphics—a general scheme for data
visualization which breaks up graphs into semantic components such as
scales and layers.

ggplot2 can serve as a replacement for the base graphics in R and contains
a number of defaults for web and print display of common scales.

Since 2005, ggplot2 has grown in use to become one of the most popular
R packages. It is licensed under GNU GPL v2.
•Basic Visualization
•Histogram
•Bar / Line Chart
•Box plot
• Scatter plot

•Advanced Visualization
•Heat Map
•Mosaic Map
•Map Visualization
• 3D Graphs Correlogram
Histogram
• Histogram is a visual tool for presenting variable
data. It
• organises data to describe the process performance.
• Additionally histogram shows the amount and
pattern of the
• variation from the process.
• Histogram offers a snapshot in time of the process
performance.
When Are Histograms Used?
• Summarize large data sets graphically.
• Compare measurements to specifications.
• Communicate information to the team.
• Assist in decision making.
What are the parts of a Histogram?
Histogram is made up of Four parts:
(1).TITLE: The title briefly describes the information that is
contained in the Histogram.
(2).HORIZONTAL or X AXIS: The horizontal or X-axis shows you
the scale of values into which the measurements fit. These
measurements are generally grouped into intervals to help you
summarize large data sets.
(3).BARS: The bars have two important characteristics—height
and width. The height represents the number of times the values
within an interval occurred. The width represents the length of
the interval covered by the bar. It is the same for all bars.
(4).VERTICAL or Y AXIS: The vertical or Y-axis is the scale that
shows you the number of times the values within an interval
occurred.

37
Constructing a Histogram
• Step 1 - Count number of data points
• Step 2 - Summarize on a tally sheet
• Step 3 - Compute the range
• Step 4 - Determine number of intervals
• Step 5 - Compute interval width
• Step 6 - Determine interval starting points
• Step 7 - Count number of points in each interval
• Step 8 - Plot the data
• Step 9 - Add title

38
Advantages of Histogram

• Display large amount of data that are difficult to


interpret in a tabular form.
• Show the relative frequency of occurrences of the
various data values.
• Reveal the variation ,centering and distribution
shape of the data.
• Very useful when calculating capability of a
process.
• Helps predict future performance of process.
Disadvantages of Histogram

• Use only with continuous data.


• More difficult to compare two data sets.
• Cannot read exact values because data is
grouped into categories.
Histogram
Cont...
• The pictorial nature of the histogram enables us to
see patterns that are difficult to see in a table of
numbers.
hist(mtcars$mpg)
hist(mtcars$mpg, breaks=5, col="red")
• Data always have variation
• Variation have pattern
• Patterns can be seen easily when summarized
pictorially
Basic Elements for Construction of
Histogram
For constructing the histogram we need to know the
following:
• Lowest value of the data set
• Highest value of the data set
• Approximate number of cells histogram have
• Cell width
• Lower cell boundary of first cell
Histogram – Syntax
• The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
• Following is the description of the parameters used −
v is a vector containing numeric values used in
histogram. main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each
bar. xlab is used to give description of x-axis.
xlim is used to specify the range of values on
the x-axis.
ylim is used to specify the range of values on the y-
axis. breaks is used to mention the width of each bar.
Histogram – Example
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.


png(file = "histogram.png")

# Create the histogram.


hist(v,xlab = "Weight",col =
"yellow",border = "blue")

# Save the file.


dev.off()
Histogram – Example
Charts and Graphs supported
• Pie chart
• Bar chart
• Box plots
• Historams
• Line graphs
• Scatter plots
Pie charts
• R Programming language has numerous libraries
to create charts and graphs.
• A pie-chart is a representation of values as slices of
a circle with different colors.
• The slices are labeled and the numbers
corresponding to each slice is also represented in
the chart.
• In R the pie chart is created using the pie()
function which takes positive numbers as a vector
input.
• The additional parameters are used to control
labels, color, title etc.
Pie charts – Syntax
• The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
• Following is the description of the parameters used −
– x is a vector containing the numeric values used in the
pie chart.
– labels is used to give description to the slices.
– radius indicates the radius of the circle of the pie chart.
(value between −1 and +1).
– main indicates the title of the chart.
– col indicates the color palette.
– clockwise is a logical value indicating if the slices are
drawn clockwise or anti clockwise.
Pie charts – Example
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore",
"Mumbai")

# Give the chart file a name.


png(file = "city.png")

# Plot the chart.


pie(x,labels)

# Save the file.


dev.off()
Pie charts – Example
Pie chart example with colors
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("Pune",
"Nashik", "Aurangabad",
"Mumbai")

# Give the chart file a name.


png(file = "city_title_colours.png")

# Plot the chart with title and


rainbow color pallet.
pie(x, labels, main = "City pie chart", col =
rainbow(length(x)))

# Save the file.


dev.off()
Pie chart example with colors
Pie chart with colors and labels
# Create data for the graph.
x <- c(21, 62, 10,53)
labels <- c("London","New
York","Singapore","Mumbai")

piepercent<- round(100*x/sum(x), 1)
png(file = "city_percentage_legends.png")

# Plot the chart.


pie(x, labels = piepercent, main = "City pie chart",col =
rainbow(length(x)))
legend("topright", c("Pune","Nashik","Aurangabad","Mumbai"),
cex = 0.8, fill = rainbow(length(x)))

# Save the file.


dev.off()
Pie chart with colors and labels
3D Pie Chart
# Get the library.
library(plotrix)

# Create data for the graph.


x <-c(21, 62, 10,53)
lbl <-
c("Nashik","Aurangabad","Nav
i Mumbai","Nagpur")

png(file = "3d_pie_chart.png")

# Plot the chart.


pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of
Countries ")

dev.off()
3D Pie Chart
Bar charts
• A bar chart represents data in rectangular bars
with length of the bar proportional to the
value of the variable.
• R uses the function barplot() to create
bar charts.
• R can draw both vertical and horizontal bars
in the bar chart.
• In bar chart each of the bars can be
given different colors.
Bar charts – Syntax
• The basic syntax to create a bar-chart in R is −
barplot(H, xlab, ylab, main, names.arg, col)
• Following is the description of the parameters used −
– H is a vector or matrix containing numeric values used
in bar chart.
– xlab is the label for x axis.
– ylab is the label for y axis.
– main is the title of the bar chart.
– names.arg is a vector of names appearing under
each bar.
– col is used to give colors to the bars in the graph.
Bar charts – Example
# Create the data for the chart. H <-
c(7,12,28,3,41)

# Give the chart file a name. png(file =


"barchart.png")

# Plot the bar chart. barplot(H)

# Save the file. dev.off()


Bar charts – Example
Bar chart with attributes
# Create the data for the chart.
H <- c(7,12,28,3,41)
M <-
c("Mar","Apr","May","Jun","Jul")

# Give the chart file a name.


png(file =
"barchart_months_revenue.png")

# Plot the bar chart.


barplot(H,names.arg = M,xlab = "Month",ylab =
"Revenue",col = "blue", main = "Revenue chart",border
= "red")

dev.off()
Bar chart with attributes
Bar chart – Stacked
colors <- c("green","orange","brown")
months <- c("Mar","Apr","May","Jun","Jul")
regions <- c("East","West","North")

Values <-
matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11
),nrow =
3,ncol = 5,byrow = TRUE)

png(file = "barchart_stacked.png")

barplot(Values,main = "total revenue",names.arg = months,xlab =


"month",ylab = "revenue", col = colors)

legend("topleft", regions, cex = 1.3, fill = colors)

dev.off()
v <- c(2, 3, 8 , 7.5, 9,10,15)
barplot(v, col="red", horiz=TRUE)
group <- LETTERS[1:7]
barplot(v, col="red", names.arg = group)
Bar chart – Stacked
Boxplot
• Boxplots are a measure of how well distributed
is the data in a data set.
• It divides the data set into three quartiles. This
graph represents the minimum, maximum,
median, first quartile and third quartile in the
data set.
• It is also useful in comparing the distribution of
data across data sets by drawing boxplots for
each of them.
• Boxplots are created in R by using the
boxplot() function.
Boxplot – Syntax
• The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
• Following is the description of the parameters used −
– x is a vector or a formula
– data is the data frame.
– notch is a logical value. Set as TRUE to draw a notch.
– varwidth is a logical value. Set as true to draw width of
the box proportionate to the sample size.
– names are the group labels which will be printed
under each boxplot.
– main is used to give a title to the graph.
Boxplot – Example
• We use the data set "mtcars" available in the
R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in
mtcars.

input <- mtcars[,c('mpg','cyl')]


print(head(input))
Boxplot – Example
# Give the chart file a name.
png(file = "boxplot.png")

# Plot the chart.


boxplot(mpg ~ cyl, data = mtcars, xlab =
"Number of Cylinders", ylab = "Miles
Per Gallon", main = "Mileage Data")

# Save the file.


dev.off()
Boxplot – Example
Boxplot with notch
png(file = "boxplot_with_notch.png")

# Plot the chart.


boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data",
notch = TRUE,
varwidth = TRUE,
col =
c("green","yello
w","purple"),
names =
c("High","Medium
","Low")
)
Boxplot with notch
Line graph – Syntax
• The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
• Following is the description of the parameters used −
– v is a vector containing the numeric values.
– type takes the value "p" to draw only the points, "i"
to draw only the lines and "o" to draw both points
and lines.
– xlab is the label for x axis.
– ylab is the label for y axis.
– main is the Title of the chart.
– col is used to give colors to both the points and lines.
Line graph – Example
# Create the data for the chart.
v <- c(7,12,28,3,41)

# Give the chart file a name.


png(file = "line_chart.png")

# Plot the line graph.


plot(v,type = "o")

# Save the file.


dev.off()
Line graph – Example
Line graph – example.
# Create the data for the
chart. v <- c(7,12,28,3,41)

# Give the chart file a name.


png(file =
"line_chart_label_colored.png")

# Plot the bar chart.


plot(v,type = "o", col = "red", xlab =
"Month", ylab = "Rain fall", main = "Rain
fall chart")

# Save the file.


dev.off()
Line graph – example.
Multiple lines in chart
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)

# Give the chart file a name.


png(file = "line_chart_2_lines.png")

# Plot the bar chart.


plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain
fall", main = "Rain fall chart")

lines(t, type = "o", col = "blue")

dev.off()
Multiple lines in chart
ScatterPlot
• Scatterplots show many points plotted in
the Cartesian plane.
• Each point represents the values of
two variables.
• One variable is chosen in the horizontal axis
and another in the vertical axis.
• The simple scatterplot is created using
the plot() function.
ScatterPlot – Example
• We use the data set "mtcars" available in the
R environment to create a basic scatterplot.
• Let's use the columns "wt" and "mpg" in
mtcars.

input <- mtcars[,c('wt','mpg')]


print(head(input))
ScatterPlot – Example
# Get the input values.
input <- mtcars[,c('wt','mpg')]

png(file = "scatterplot.png")

# Plot the chart for cars with weight between 2.5 to 5 and
mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Milage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Milage"
)
dev.off()
ScatterPlot – Example
Scatter Plot matrices
• When we have more than two variables and we want
to find the correlation between one variable versus
the remaining ones we use scatterplot matrix.
• We use pairs() function to create matrices of
scatterplots.
– Syntax:
pairs(formula, data)
– Following is the description of the parameters used

formula represents the series of variables used in
pairs.
data represents the data set from which the variables
will be taken.
Scatter Plot matrices – Example
# Give the chart file a name.
png(file = "scatterplot_matrices.png")

# Plot the matrices between 4 variables giving 12


plots.

# One variable with 3 others and total 4


variables.

pairs(~wt+mpg+disp+cyl,data = mtcars,main =
"Scatterplot Matrix")

dev.off()
Scatter Plot matrices – Example
Useful resources
Thank you
This presentation is created using LibreOffice Impress 4.2.8.2, can be used freely as per GNU General Public License

Web Resources Blogs ht tp://digitallocha


http://mitu.co.in http: .blogspot.in
//tusharkute.com http://kyamputar.blogspot.in

tushar@tusharkute.com
Histogram
• A histogram represents the frequencies
of values of a variable bucketed into
ranges.
• Histogram is similar to bar chart but
the difference is it groups the values
into continuous ranges.
• Each bar in histogram represents the height
of the number of values present in that
range.
• R creates histogram using hist() function.
This function takes a vector as an input and
uses some more parameters to plot
Thank You
Any Queries

You might also like