You are on page 1of 49

CSE1006 – Foundations of Data Analytics

Module 2
Lecture#1: Introduction to R Programming

Dr.T.Sudhakar
Associate Professor
School of Computer Science & Engineering
VIT-AP University
Introduction to R Programming
• R is an open-source programming language that is
widely used as a statistical software and data analysis
tool.
• R generally comes with the Command-line interface.
• R is available across widely used platforms like
Windows, Linux, and macOS.
• Also, the R programming language is the latest
cutting-edge tool.
Introduction to R Programming
• R programming is used as a leading tool for machine
learning, statistics, and data analysis.
• Objects, functions, and packages can easily be
created by R.
• It’s a platform-independent language. This means it
can be applied to all operating system.
• It’s an open-source free language. That means
anyone can install it in any organization without
purchasing a license.
Introduction to R Programming
• R programming language is not only a statistic
package but also allows us to integrate with other
languages (C, C++).
• Thus, you can easily interact with many data sources
and statistical packages.
• The R programming language has a vast community of
users and it’s growing day by day.
• R is currently one of the most requested programming
languages in the Data Science job market that makes
it the hottest trend nowadays.
Installing R on Windows OS

• To install R on Windows OS:


• Go to the CRAN website.
• Click on "Download R for Windows".
• Click on "install R for the first time" link to
download the R executable (.exe) file.
• Run the R executable file to start installation, and
allow the app to make changes to your device.
• Select the installation language.
Additional R interfaces

• Other than the R GUI, the other ways to interface


with R include RStudio Integrated Development
Environment (RStudio IDE) and Jupyter Notebook.
• To run R on RStudio, you first need to install R on
your computer, while to run R on Jupyter Notebook,
you need to install an R kernel.
• RStudio and Jupyter Notebook provide an interactive
and friendly graphical interface to R that greatly
improves users’ experience.
Installing RStudio Desktop

• To install RStudio Desktop on your computer, do the


following:
• Go to the RStudio website.
• Click on "DOWNLOAD" in the top-right corner.
• Click on "DOWNLOAD" under the "RStudio Open
Source License".
• Download RStudio Desktop recommended for your
computer.
• Run the RStudio Executable file (.exe) for Windows OS
or the Apple Image Disk file (.dmg) for macOS X.
Basic operations in R using command line

R Comments (#)
• Comments are used to describe a piece of code
• They do not get interpreted or compiled and are
completely ignored during the execution of the
program.
• Example
# declare variable (comment line)
age = 24
R Variables and Constants
• In computer programming, a variable is a named memory
location where data is stored. For example, a=8.5
• Rules to Declare R Variables
• A variable name in R can be created using letters, digits, periods,
and underscores.
• You can start a variable name with a letter or a period, but not
with digits.
• If a variable name starts with a dot, you can't follow it with digits.
• R is case sensitive. This means that age and Age are treated as
different variables.
• Keywords or reserved words that cannot be used as variable
names.
R Data Types
• In R, there are 6 basic data types:
• logical
• numeric
• integer
• complex
• character
• Raw-A raw data type specifies values as raw
bytes
R Operators
The following unary and binary operators are defined. They are
listed in precedence groups, from highest to lowest.
Conditional and Looping
• Conditionals
– If
– If-else
• Looping
– While
– Repeat
– For
Output and Input functions
• Output:
Print() function
cat() function
• Input Function
realline()
scan()

Readlin()
name <- readline(prompt="Enter name: ")
Returns string. To convert to integer, we can use as.integer(name)
Scan()
This method reads data in the form of a vector or list. This method also uses to reads input
from a file also. 
x = scan()
scan() method is taking input continuously, to terminate the input process, need to
press Enter key 2 times on the console.
 
Scalars (Numbers) and Vectors
• R operates on named data structures.
• The simplest such structure is the numeric vector,
which is a single entity consisting of an ordered
collection of numbers.
• The function c() is used to create vectors.
• Example:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
• Assignment can also be made using the function
assign()
assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
Simple manipulations
• 1/x the reciprocals of the five values
• The further assignment
y <- c(x, 0, x)
would create a vector y with 11 entries
consisting of two copies of x with a zero in the
middle place.
Vector arithmetic
• Vectors can be used in arithmetic expressions,
in which case the operations are performed
element by element.
• x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
• x+2
• 12.4,7.6,5.1,8.4,23.7
• Vectors occurring in the same expression need not all be of the same
length.
• If they are not, the value of the expression is a vector with the same
length as the longest vector which occurs in the expression.
• Shorter vectors in the expression are recycled as often as need be
(perhaps fractionally) until they match the length of the longest vector.
• In particular a constant is simply repeated.
• So with the above assignments the command
• > v <- 2*x + y + 1
• generates a new vector v of length 11 constructed by adding together,
element by element, 2*x repeated 2.2 times, y repeated just once, and
1 repeated 11 times.
• The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to
a power.
• In addition all of the common arithmetic functions are available. log, exp, sin,
cos, tan, sqrt, and so on, all have their usual meaning.
• max and min select the largest and smallest elements of a vector respectively.
• range is a function whose value is a vector of length two, namely c(min(x),
max(x)).
• length(x) is the number of elements in x, sum(x) gives the total of the elements
in x, and prod(x) their product.
• Two statistical functions are mean(x) which calculates the sample mean, which
is the same as sum(x)/length(x), and
• var(x) which gives sum((x-mean(x))^2)/(length(x)-1) or sample variance.
• If the argument to var() is an n-by-p matrix the value is a p-by-p sample
covariance matrix got by regarding the rows as independent p-variate sample
vectors.
• To work with complex numbers, supply an
explicit complex part.
• Thus sqrt(-17) will give NaN and a warning,
• but sqrt(-17+0i) will do the computations as
complex numbers.
Generating regular sequences
• R has a number of facilities for generating commonly
used sequences of numbers. For example 1:30 is the
vector c(1, 2, ..., 29, 30).
• The colon operator has high priority within an
expression, so, for example 2*1:15 is the vector c(2,
4, ..., 28, 30).
• Put n <- 10 and compare the sequences 1:n-1 and 1:
(n-1).
• The construction 30:1 may be used to generate a
sequence backwards.
seq() function
• The function seq() is a more general facility for generating
sequences. It has five arguments.
• seq(2,10) is the same vector as 2:10.
• The first two arguments may be named from=value and
to=value; thus seq(1,30), seq(from=1, to=30) and seq(to=30,
from=1) are all the same as 1:30.
• The next two arguments to seq() may be named by=value and
length=value, which specify a step size and a length for the
sequence respectively.
• If neither of these is given, the default by=1 is assumed.
• For example > seq(-5, 5, by=.2) -> s3
• c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0)
• The fifth argument may be named along=vector, which is normally
used as the only argument to create the sequence 1, 2, ...,
length(vector), or the empty sequence if the vector is empty (as it
can be).
• A related function is rep() which can be used for replicating an
object in various complicated ways.
• The simplest form is
• > s5 <- rep(x, times=5) which will put five copies of x end-to-end in
s5.
• Another useful version is
• > s6 <- rep(x, each=5) which repeats each element of x five times
before moving on to the next.
Logical vectors
• Logical vectors are generated by conditions.
• For example
• > temp <- x > 13
• sets temp as a vector of the same length as x with
values FALSE corresponding to elements of x where
the condition is not met and TRUE where it is.
• Example if x is (5,20,10,15) then temp is
(FALSE,TRUE,FALSE,TRUE)
Missing values
• In some cases the components of a vector may not be completely
known.
• When an element or value is “not available” or a “missing value”
in the statistical sense, a place within a vector may be reserved
for it by assigning it the special value NA.
• In general any operation on an NA becomes an NA.
• The function is.na(x) gives a logical vector of the same size as x
with value TRUE if and only if the corresponding element in x is
NA.
• > z <- c(1:3,NA); ind <- is.na(z)
• Note that there is a second kind of “missing” values which are
produced by numerical computation, the so-called Not a Number,
NaN, values. Examples are > 0/0 or > Inf - Inf
Character vectors
• Character quantities and character vectors are used
frequently in R, for example as plot labels.
• e.g., "x-values", "New iteration results".
• Character vectors may be concatenated into a vector by the
c() function;
• The paste() function takes an arbitrary number of arguments
and concatenates them one by one into character strings.
• For example > labs <- paste(c("X","Y"), 1:10, sep="") makes
labs into the character vector
c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
Index vectors; selecting and modifying
subsets of a data set
• Index vectors can be any of four distinct types.
• 1. A logical vector. In this case the index vector is
recycled to the same length as the vector from which
elements are to be selected.
• Values corresponding to TRUE in the index vector are
selected and those corresponding to FALSE are
omitted.
• For example > y <- x[!is.na(x)] creates (or re-creates)
an object y which will contain the non-missing values
of x, in the same order.
Matrices
• Matrices are the R objects in which the elements are arranged in a two-
dimensional rectangular layout.
• A Matrix is created using the matrix() function.
• The basic syntax for creating a matrix in R is
matrix(data, nrow, ncol, byrow, dimnames)
• data is the input vector which becomes the data elements of the
matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical clue. If TRUE then the input vector elements are
arranged by row.
• dimname is the names assigned to the rows and columns.
Example

• M<-matrix(c(3:14), nrow=3,ncol=3)
Accessing Elements of a Matrix

• # Define the column and row names.


• rownames = c("row1", "row2", "row3", "row4")
• colnames = c("col1", "col2", "col3")
• # Create the matrix.
• P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames,
colnames))
• # Access the element at 3rd column and 1st row.
• print(P[1,3])
• # Access the element at 2nd column and 4th row. print(P[4,2])
• # Access only the 2nd row.
• print(P[2,])
• # Access only the 3rd column.
• print(P[,3])
Lists
• Lists are the R objects which contain elements of
different types like − numbers, strings, vectors and
another list inside it.
• A list can also contain a matrix or a function as its
elements.
• List is created using list() function.
• Creating a List
• list_data <- list("Red", "Green", c(21,32,11), TRUE,
51.23, 119.1) print(list_data)
Naming List Elements

• The list elements can be given names and they can


be accessed using these names.
• list_data <- list(c("Jan","Feb","Mar"),
matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
• names(list_data) <- c("1st Quarter", "A_Matrix", "A
Inner list")
• print(list_data)
Accessing List Elements

# Access the first element


print(list_data[1])
# Access the thrid element. As it is also a list, all its
elements will be printed.
print(list_data[3])
# Access the list element using the name of the element.

print(list_data$A_Matrix)
Manipulating List Elements

• We can add, delete and update list elements as shown below.


• We can add and delete elements only at the end of a list. But we
can update any element.
• # Add element at the end of the list.
• list_data[4] <- "New element" print(list_data[4])
• # Remove the last element.
• list_data[4] <- NULL
• # Print the 4th Element.
• print(list_data[4])
• # Update the 3rd Element.
• list_data[3] <- "updated element" print(list_data[3])
Merging Lists

• You can merge many lists into one list by placing all
the lists inside one list() function.
• # Create two lists.
• list1 <- list(1,2,3)
• list2 <- list("Sun","Mon","Tue")
• # Merge the two lists.
• merged.list <- c(list1,list2)
• # Print the merged list.
• print(merged.list)
Converting List to Vector

• we use the unlist() function. It takes the list as input and


produces a vector.
• # Create lists.
list1 <- list(1:5)
list2 <-list(10:14)
• # Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2) # Now add the vectors
• result <- v1+v2
• print(result)
R - Data Frames
• A data frame is a table or a two-dimensional array-like
structure in which each column contains values of one
variable and each row contains one set of values from
each column.
• Following are the characteristics of a data frame.
– The column names should be non-empty.
– The row names should be unique.
– The data stored in a data frame can be of numeric, factor or
character type.
– Each column should contain same number of data items.
• Create Data Frame
Create Data Frame

• emp.data <- data.frame(emp_id = c (1:5),


emp_name=c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-
15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)
• # Print the data frame.
print(emp.data)
Get the Structure of the Data Frame

• # Get the structure of the data frame.


• str(emp.data)
Summary of Data in Data Frame
• The statistical summary and nature of the data
can be obtained by applying summary()
function.
• # Print the summary.
• print(summary(emp.data))
Extract Data from Data Frame
• Extract Data from Data Frame
• Extract specific column from a data frame
using column name.
• # Extract Specific columns.
• result <-
data.frame(emp.data$emp_name,emp.data$s
alary)
• print(result)
Extract the first two rows and then all
columns
• # Extract first two rows.
• result <- emp.data[1:2,]
• print(result)
Extract 3rd and 5th row with 2nd and 4th
column
• # Extract 3rd and 5th row with 2nd and 4th
column.
• result <- emp.data[c(3,5),c(2,4)]
• print(result)
Expand Data Frame
• A data frame can be expanded by adding columns and
rows.
• Add Column
• Just add the column vector using a new column name.
• # Add the "dept" coulmn.
• emp.data$dept <-
c("IT","Operations","IT","HR","Finance")
• v <- emp.data
• print(v)
Add Row
• # Create the second data frame
• emp.newdata <- data.frame(
• emp_id = c (6:8),
• emp_name = c("Rasmi","Pranab","Tusar"),
• salary = c(578.0,722.5,632.8),
• start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
• dept = c("IT","Operations","Fianance"),
• stringsAsFactors = FALSE)
• # Bind the two data frames.
• emp.finaldata <- rbind(emp.data,emp.newdata)
• print(emp.finaldata)
Thank you.

You might also like