
Data Analysis

MAL 209
(Using R)

Lab Work Book


Dr. Anshu Malhotra

School of Engineering & Technology


Department of Applied Sciences

Name : Aman Gahlot

Roll No. 22BSM008

Branch : Applied Science

Batch : 2nd Sem

The NorthCap University


Gurugram, Haryana

Published by:
Data Analysis Laboratory Manual

SCHOOL OF ENGINEERING & TECHNOLOGY


Department of Applied Science

THE NORTHCAP UNIVERSITY

Gurugram

• Lab Manual is for internal circulation only.

© Copyright Reserved
No part of this manual may be reproduced, used, or stored without prior
permission.

Second Edition

March, 2022

Printed by:

PREFACE

It gives us great pleasure to present this manual to our esteemed readers.

This manual is written for beginners, with several goals in mind. It is
designed to provide all the essential information you will need to learn. It
presents the basic features of R and the different techniques used for data
analysis in the course Data Analysis.

Practical knowledge of any subject is an essential part of the curriculum for
an engineering student. During our teaching, we found that students work
under a very tight schedule and hardly find time to search various books to
understand the underlying principles and theory of experiments. We therefore
saw the need for a comprehensive and handy lab manual that gives all the
required theory for every experiment in one place. This encouraged the
authors to prepare an illustrative write-up, written to make the subject easy
for students to understand.
The authors express deep gratitude to the Members of the Governing Body-NCU
for their encouragement and motivation. We are highly thankful to VC-NCU and
Dr. Hukum Singh, HOD-Applied Sciences, for their valuable suggestions and
support.
In spite of all efforts, some errors may remain. We shall be grateful to
readers who bring them to our notice. Suggestions and comments for further
improvement of the manual will always be appreciated.

Authors

THE NORTHCAP UNIVERSITY

Gurugram March, 2022



CONTENTS

S. No. Details Page No.

Syllabus

1 Introduction

2 Lab Requirement

3 General Instructions

4 List of Experiments

5 Rubrics

6 Experiments

COURSE TEMPLATE
1. Department: Applied Sciences
2. Course Name: Data Analysis   3. Course Code: MAL-301/209   4. L-T-P: 3-0-2   5. Credits: 4

6. Type of Course (Check one): Programme Core ✓   Programme Elective   Open Elective

7. Frequency of offering (check one): Odd   Even ✓   Either semester   Every semester

Brief Syllabus: Data types and sources, visualizing and exploring data, organizational Interfaces,
precondition for data analysis, comparison among several samples, linear combinations and multiple
comparisons of means, linear regression: a model for the mean, multiple regression and inferential tools

8. Total lecture, Tutorial and Practical Hours for this course (Take 15 teaching weeks per semester)

Lectures: 45 hours Tutorials: 0 hours Practicals: 30 hours

9. Course Outcomes (COs)


Possible usefulness of this course after its completion, i.e. how this course will be practically useful to the
student once it is completed

CO 1 Students should be able to understand the introduction and basics of statistics, and the sources and types of data.

CO 2 Students should be able to visualize data through charts and graphs and learn how to explore data.

CO 3 Students should understand how to perform data preconditioning.

CO 4 Students should be able to perform linear combinations and multiple comparisons of means.

CO 5 Students should be able to apply linear correlation, independent and dependent variables and types of correlation.

11. UNIT WISE DETAILS No. of Units: ___5______

Unit Number: 1 No. of Lectures: 10 Title: Introduction to Data Analysis and Statistics

Content Summary:

An overview of Statistics, Branches of Statistics, Sources of Data, Data Classification- Levels of Measurement,
Data Collection and Experimental Design- Design of a statistical study, Data collection, Experimental design,
Sampling techniques.

Unit Number: 2 No. of Lectures: 9 Title: Exploring and Visualizing data


Content Summary:

Exploring data using Head, Tail and View, Types of Data Object, Plotting and Inspecting data, Summarizing data,
Visualizing data using charts and graphs- Pie charts, Bar charts, Boxplots, Histograms, Line graphs and
Scatter plots

Unit Number: 3 No. of Lectures: 8 Title: Organizational Interfaces and Data Preconditioning

Content Summary:

Interfacing Organizational Data – Web data, Databases, XML files, Binary files, Excel files, CSV files,
Preconditioning for feature selection and regression in high dimensional problems.

Unit Number: 4 No. of Lectures: 8 Title: Linear Combinations and Multiple Comparisons

Content Summary:

Linear Combinations and Multiple Comparisons of Means and their case studies

Unit Number: 5 No. of Lectures: 10 Title: Correlation and Regression

Content Summary:

Correlation- Independent and Dependent, Types of Correlation, Correlation Coefficient, Using a table to test a
population correlation coefficient, Hypothesis testing for a population correlation coefficient. Linear Regression-
Equation of a regression line, predicting y-values using regression equations. Multiple Regression- Finding a
multiple regression equation, predicting y-values

12. Brief Description of Self-learning component by students (through books/resource material etc.):

lms/ncuindia.edu; e-books
NPTEL (National Programme on Technology Enhanced Learning) provides e-learning through online Web and
Video courses in various streams. Please check the home page http://nptel.ac.in/ and choose the course for which
you wish to have web materials / video lectures. These materials are prepared and delivered by the faculty
members in IITs and IISc or great academicians serving elsewhere.

8. Books Recommended :

1. Elementary Statistics, Ron Larson and Betsy Farber, 5th Ed, Prentice Hall

2. An Introduction to Statistical Learning, Gareth James et al., Springer

3. Doing Data Science: Straight Talk from the Frontline, Cathy O’Neil and Rachel Schutt
Practical Content

Sr. No.  Title of the Experiment                                Software/Instrument/  Unit     Estimated
                                                                Component based       Covered  time

1        Installation and Introduction to R software            R                     1        90 min

2        R-Home, Overview, Environment Setup, Basic Syntax      R                     1        90 min

3        R-Data Types, Variables,                               R                     2        90 min
         R-Operators, Decision Making
         R-Loops, Functions

4        R-Strings, Vectors, Lists,                             R                     2        90 min
         R-Matrices, Arrays, Factors
         R-Data Frames, Package

5        R- Data Interfaces, CSV Files, Excel Files             R                     3        90 min

6        R Charts & Graphs                                      R                     3        90 min
         R-Pie Charts, Bar Charts, Boxplots, Histograms

7        R-Line Graphs, Scatterplots                            R                     4        90 min

8        R Statistics Examples                                  R                     4        90 min
         R-Mean, Median & Mode
         Linear Combinations and Multiple Comparisons of Means

9        R-Linear Regression, Multiple Regression               R                     5        90 min

10       Case Study                                             R                     5        90 min



L T P C: 3 0 2 4 (MAL 301)
Course: Data Analysis
Code: MAL301/209
Issue Date: July 2018
Revision Date:
Medium of Instruction: English Only

EVALUATION SCHEME (MAL 301)

• Major 70 Marks
• Minor 30 Marks
• Class Test /Assignment 20 Marks
• Online Test 10 Marks
• Practical 70 Marks
Total 200 Marks

Index of success: Minimum passing marks 40 % of total
GAAP

1. INTRODUCTION

How we analyse data has changed dramatically in recent years. With the
advent of personal computers and the internet, the sheer volume of data we
have available has grown enormously. Companies have terabytes of data on
the consumers they interact with, and governmental, academic, and private
research institutions have extensive archival and survey data on every
manner of research topic. Gleaning information (let alone wisdom) from
these massive stores of data has become an industry in itself. At the same
time, presenting the information in easily accessible and digestible ways has
become increasingly challenging.

The science of data analysis (statistics, psychometrics, econometrics,
machine learning) has kept pace with this explosion of data. Before personal
computers and the internet, new statistical methods were developed by
academic researchers who published their results as theoretical papers in
professional journals. It could take years for these methods to be adapted by
programmers and incorporated into the statistical packages widely available
to data analysts. Today, new methodologies appear daily. Statistical
researchers publish new and improved methods, along with the code to
produce them, on easily accessible websites.

The advent of personal computers had another effect on the way we analyse
data. When data analysis was carried out on mainframe computers, computer
time was precious and difficult to come by. Analysts would carefully set up
a computer run with all the parameters and options thought to be needed.
When the procedure ran, the resulting output could be dozens or hundreds of
pages long. The analyst would sift through this output, extracting useful
material and discarding the rest. Many popular statistical packages were
originally developed during this period and still follow this approach to
some degree.

With the cheap and easy access afforded by personal computers, modern
data analysis has shifted to a different paradigm. Rather than setting up a
complete data analysis at once, the process has become highly interactive,
with the output from each stage serving as the input for the next stage. At
any point, the cycles may include transforming the data, imputing missing
values, adding or deleting variables, and looping back through the whole
process again. The process stops when the analyst believes he or she
understands the data intimately and has answered all the relevant questions
that can be answered.

The advent of personal computers (and especially the availability of
high-resolution monitors) has also had an impact on how results are understood
and presented. A picture really can be worth a thousand words, and human
beings are very adept at extracting useful information from visual
presentations. Modern data analysis increasingly relies on graphical
presentations to uncover meaning and convey results.

To summarize, today’s data analysts need to be able to access data from a
wide range of sources (database management systems, text files, statistical
packages, and spreadsheets), merge the pieces of data together, clean and
annotate them, analyse them with the latest methods, present the findings in
meaningful and graphically appealing ways, and incorporate the results into
attractive reports that can be distributed to stakeholders and the public. As
you’ll see in the following pages, R is a comprehensive software package
that’s ideally suited to accomplish these goals.

The R environment

R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. It includes

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either on-screen or on
hardcopy, and
• a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.

2. LAB REQUIREMENTS

S. No.  Requirements           Details

1       Software Requirements  R/MATLAB

2       Operating System       Windows Operating System

3       Hardware Requirements  Windows and Linux: Intel 64/32 or AMD Athlon 64/32, or AMD Opteron processor
                               2 GB RAM
                               80 GB hard disk space

4       Required Bandwidth     NA

3. GENERAL INSTRUCTIONS

a. General discipline in the lab

● Students must arrive on time and contact the concerned faculty.
● Students will not be allowed to enter the lab late.
● Students will not leave the class until the period is over.
● Students should come prepared for their experiment by revising the
concepts taught in class.
● Programs implemented, with their output, should be entered in the lab
report format and certified/signed by the concerned faculty/lab instructor.
● Students should maintain silence while performing the experiments. If
a discussion is necessary, they should speak in a low voice without
disturbing the adjacent groups.
● Violating the above code of conduct may attract disciplinary action.
● Damaging lab equipment or removing any component from the lab may
invite penalties and strict disciplinary action.

b. Attendance

● Attendance in the class is compulsory.
● Students should not attend a different lab group/section other than the
one assigned at the beginning of the session.
● If a student misses his/her lab classes on account of illness or family
problems, he/she may be assigned a different group to make up the loss,
in consultation with the concerned faculty / lab instructor.
Alternatively, he/she may work in the lab during spare/extra hours to
complete the practicals. No attendance will be granted in such cases.

c. Preparation and Performance

● Students should come to the lab thoroughly prepared for the practicals
they are assigned to perform on that day. A brief introduction to each
experiment, with information about self-study references, is provided on
LMS.
● Students must bring the lab report to each practical class, with
written records of the last experiments performed, complete in all
respects.
● Each student is required to write a complete report of the experiment
he/she has performed and bring it to the next working lab class for
evaluation. Sufficient space is provided in the work book for
independent writing of theory, observation, calculation and
conclusion.
● Students should follow the zero tolerance policy for copying /
plagiarism. Zero marks will be awarded if work is found to be copied. A
repeat offence will lead to disciplinary action.

4. LIST OF EXPERIMENTS

Sr. No.  Title of the Experiment                                CO       Estimated
                                                                Covered  time

1        Installation and Introduction to R software            CO1      90 min

2        R-Home, Overview, Environment Setup, Basic Syntax      CO1      90 min

3        R-Data Types, Variables,                               CO2      90 min
         R-Operators, Decision Making
         R-Loops, Functions

4        R-Strings, Vectors, Lists,                             CO2      90 min
         R-Matrices, Arrays, Factors
         R-Data Frames, Package

5        R- Data Interfaces, CSV Files, Excel Files             CO3      90 min

6        R Charts & Graphs                                      CO3      90 min
         R-Pie Charts, Bar Charts, Boxplots, Histograms

7        R-Line Graphs, Scatterplots                            CO4      90 min

8        R Statistics Examples                                  CO4      90 min
         R-Mean, Median & Mode
         Linear Combinations and Multiple Comparisons of Means

9        R-Linear Regression, Multiple Regression               CO5      90 min

10       Case Study                                             CO4      90 min

5. RUBRICS

Marks Distribution

S. No.: 1
TYPE OF COURSE: Theory+Practical Course (L-T-P/L-T-0/L-0-P/L-0-0)
PASS CRITERIA: Must Secure 30% Marks Out of Combined Class

PARTICULAR                              ALLOTTED RANGE
End Term Project                        25%
Minor Test                              15%
Major Test                              30%
Class Test/ Assignment                  15%
Online Test                             5%
Class Participation Evaluation Through  10%
INDEX
S. No  Experiment                                             Page No.  Date of Experiment  Date of Submission  Marks  CO Covered  Signature

1      Installation and Introduction to R software

2      R-Home, Overview, Environment Setup, Basic Syntax

3      R-Data Types, Variables,
       R-Operators, Decision Making
       R-Loops, Functions

4      R-Strings, Vectors, Lists,
       R-Matrices, Arrays, Factors
       R-Data Frames, Package

5      R- Data Interfaces, CSV Files, Excel Files

6      R Charts & Graphs
       R-Pie Charts, Bar Charts, Boxplots, Histograms

7      R-Line Graphs, Scatterplots

8      R Statistics Examples
       R-Mean, Median & Mode
       Linear Combinations and Multiple Comparisons of Means

9      R-Linear Regression, Multiple Regression

10     Case Study

Practical No: 1

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: Introduction to R and installation of the software.


Introduction

R is not the only language that can be used for data analysis. Data analysis is inherently an interactive
process: what you see at one stage determines what you want to do next. Interactivity is important.

R has a fantastic mechanism for creating data structures. Obviously if you are doing data analysis, you
want to be able to put your data into a natural form. You don’t have to warp your data into a particular
structure because that is all that is available.

• Graphics should be central to data analysis. It is easy to produce graphs for exploring data.
The default graphs can be tweaked to get publication-quality graphs.
• Real data have missing values. Missing values are an integral part of the R language. Many
functions have arguments that control how missing values are to be handled.
• Functions, like mean and median, are objects that you can use like data. You can easily change
your analysis to use the median (or some strange estimate you make up on the spot) rather than
the mean.
• R has a package system that makes it extremely easy for people to add their own functionality
so it is indistinguishable from the central part of R. And people have. There are thousands of
packages that do all sorts of extraordinary things.

Obtaining and installing R

R is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.
Precompiled binaries are available for Linux, Mac OS X, and Windows. Follow the directions for
installing the base product on the platform of your choice.

Working with R

R is a case-sensitive, interpreted language. You can enter commands one at a time at the command
prompt (>) or run a set of commands from a source file. There are a wide variety of data types,
including vectors, matrices, data frames (similar to datasets), and lists (collections of objects).

Most functionality is provided through built-in and user-created functions, and all data objects are kept
in memory during an interactive session. Basic functions are available by default. Other functions are
contained in packages that can be attached to a current session as needed.

Statements consist of functions and assignments. R uses the symbol <- for assignments, rather than the
typical = sign. For example, the statement

x <- rnorm(7)

creates a vector object named x containing seven random deviates from a standard normal
distribution. Comments are preceded by the # symbol. Any text appearing after the # is ignored by the
R interpreter.
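
As a small illustration of assignment, case sensitivity and comments, the sketch below can be typed at the prompt. The seed value 42 and the second object name X are arbitrary choices made for this example only.

```r
# Fix the random seed so the same numbers are drawn on every run (42 is arbitrary).
set.seed(42)

# Assign seven standard normal deviates to x with the <- operator.
x <- rnorm(7)
print(length(x))   # number of elements stored in x

# R is case sensitive: X is a different object from x.
X <- "a separate object"
print(identical(x, X))
```

Because rnorm() draws random values, the numbers in x change from run to run unless a seed is set first.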

Getting started

If you’re using Windows, launch R from the Start Menu. On a Mac, double-click the R
icon in the Applications folder. For Linux, type R at the command prompt of a terminal
window. Any of these will start the R interface.

Important Questions

Q1. What are the steps to download the R software?


Step 1: Search for R on Google.
Step 2: Open the website cran.r-project.org.
Step 3: Click on the Download R 3.6.3 for Windows link.
Step 4: Run the downloaded installer. When the setup window appears, select
English as the language and click OK.
Step 5: Click Next until the Install option appears on the screen.
Step 6: After these steps, R is installed on the device.

Q2. Why is R an important programming language?

R is actively used for statistical computing and graphics. It is freely
available, has thousands of contributed packages, and is one of the most
widely used languages in data science. Large companies such as Google,
LinkedIn and Facebook are reported to use R for some of their operations.

Installation code

Practical No: 2

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To understand the basic commands, environment and operations.


R provides extensive help facilities, and learning to navigate them will help you significantly in your
programming efforts. The built-in help system provides details, references, and examples for any
function contained in a currently installed package. To get more information on a specific named
function, for example solve, the command is

> help(solve)

An alternative is
> ?solve

For a feature specified by special characters, the argument must be enclosed in double or single
quotes, making it a “character string”. This is also necessary for a few words with syntactic
meaning, including if, for and function.
> help("[[")
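
Two further base R helpers can be handy when the exact function name is not known. The sketch below is only an illustration; the search pattern "^solve" is an arbitrary choice for this example.

```r
# Show the argument list of solve without opening its help page.
print(args(solve))

# List the visible objects whose names start with "solve".
matches <- apropos("^solve")
print(matches)
```

args() gives a quick reminder of what a function expects, while apropos() searches object names by pattern.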
R Command Prompt
Once you have the R environment set up, start the R command prompt by typing the following
command at your system command prompt −
$ R
This launches the R interpreter and gives you a prompt > where you can start typing your program
as follows −
> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString and assigns it the string "Hello, World!";
the next statement uses print() to print the value stored in the variable myString.
R Script File
Usually, you will write your programs in script files and then execute those scripts at your command
prompt with the help of the R script runner called Rscript. Start by writing the following code in a
text file called test.R −
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in the file test.R and execute it at the Linux command prompt as given below.
The syntax remains the same on Windows or any other system.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"

Comments
Comments are explanatory text in your R program; the interpreter ignores them while executing the
program. A single-line comment is written using # at the beginning of the statement
as follows −
# My first program in R Programming
R does not support multi-line comments, but you can use the following trick −
if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a
   single OR double quote"
}

myString <- "Hello, World!"
print(myString)
[1] "Hello, World!"
Although the R interpreter parses the block above, the condition is FALSE, so its body is never
evaluated and it does not interfere with your actual program. Put such comments inside either
single or double quotes.

Important Questions

Q1. Explain basic commands for addition, subtraction of numbers.


R supports addition (+), subtraction (-), multiplication (*) and
division (/) through arithmetic operators, and the absolute value
through the abs() function. Numbers typed without an L suffix are
stored as doubles (floating-point values).
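
A minimal sketch of arithmetic on two numbers in R follows. The values 12.5 and 4 are arbitrary and chosen only for illustration.

```r
# Two example numbers (arbitrary values).
a <- 12.5
b <- 4

total      <- a + b        # addition
difference <- a - b        # subtraction
product    <- a * b        # multiplication
quotient   <- a / b        # division
distance   <- abs(b - a)   # absolute value of a negative result

print(c(total, difference, product, quotient, distance))
```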
Q2. Explain basic commands for addition, subtraction of matrices.
Matrices contain elements of the same atomic type. Though we can
create a matrix containing only characters or only logical values,
they are not of much use. We use matrices containing numeric
elements in mathematical calculations. Two matrices with the same
dimensions are added with + and subtracted with -, element by element.
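
The following sketch adds and subtracts two numeric matrices of the same size. The matrix entries are arbitrary example values.

```r
# Two 2 x 3 matrices; matrix() fills column by column unless byrow = TRUE.
m1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
m2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)

# + and - work element by element on matrices with equal dimensions.
total      <- m1 + m2
difference <- m1 - m2

print(total)
print(difference)
```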

Q3. Write code to solve system of linear equations.

> A <- matrix(c(1,1,2,3,2,4,2,3,-6), nrow=3, byrow=TRUE)
> A
> b <- matrix(c(6,9,3))
> b
> solve(A, b)
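
One way to check a solution returned by solve() is to multiply it back by the coefficient matrix and compare the result with the right-hand side. The sketch below reuses the same A and b; the name x for the solution is chosen here for illustration.

```r
# Coefficient matrix and right-hand side of the system A x = b.
A <- matrix(c(1, 1, 2, 3, 2, 4, 2, 3, -6), nrow = 3, byrow = TRUE)
b <- c(6, 9, 3)

# Solve the system.
x <- solve(A, b)
print(x)

# Check: A %*% x should reproduce b up to rounding error.
print(all.equal(as.vector(A %*% x), b))
```

all.equal() is used instead of == because floating-point results may differ from the exact values by a tiny rounding error.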

Code

Practical No: 3

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To understand the data types, variables, operators, decision making, loops
and functions.
In R, variables are not declared with a data type. A variable is assigned an R-object, and the
data type of the R-object becomes the data type of the variable. There are many types of R-objects.
The frequently used ones are −

• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector object, and there are six data types of these atomic vectors,
also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.

Data Type   Example

Logical     TRUE, FALSE

Numeric     12.3, 5, 999

Integer     2L, 34L, 0L

Complex     3 + 2i

Character   'a', "good", "TRUE", '23.4'

Raw         "Hello" is stored as 48 65 6c 6c 6f

In R programming, the very basic data types are the R-objects called vectors which hold elements of
different classes as shown above. Please note in R the number of classes is not confined to only the
above six types. For example, we can use many atomic vectors and create an array whose class will
become array.
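
The class of each kind of atomic vector can be inspected with class(). The sketch below uses values from the table above; the variable names are chosen only for this illustration.

```r
# One value of each atomic type.
v_logical   <- TRUE
v_numeric   <- 12.3
v_integer   <- 2L
v_complex   <- 3 + 2i
v_character <- "good"
v_raw       <- charToRaw("Hello")   # raw bytes of the string

# class() reports the class of each object.
print(class(v_logical))
print(class(v_numeric))
print(class(v_integer))
print(class(v_complex))
print(class(v_character))
print(class(v_raw))
print(v_raw)

# An array built from an atomic vector has class "array".
arr <- array(1:8, dim = c(2, 2, 2))
print(class(arr))
```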

Variable Assignment

The variables can be assigned values using leftward, rightward and equal to operator. The values of
the variables can be printed using print() or cat() function. The cat() function combines multiple
items into a continuous print output.

# Assignment using equal operator.

var.1 = c(0,1,2,3)

# Assignment using leftward operator.

var.2 <- c("learn","R")

# Assignment using rightward operator.

c(TRUE,1) -> var.3

print(var.1)

cat ("var.1 is ", var.1 ,"\n")

cat ("var.2 is ", var.2 ,"\n")

cat ("var.3 is ", var.3 ,"\n")

Types of Operators

We have the following types of operators in R programming −

• Arithmetic Operators

• Relational Operators

• Logical Operators

• Assignment Operators

• Miscellaneous Operators
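
A short sketch with one example from each operator family follows. The vectors v and w hold arbitrary values, and the result names are chosen only for this illustration.

```r
# Assignment operator: store two numeric vectors.
v <- c(2, 5.5, 6)
w <- c(8, 3, 4)

sum_vw   <- v + w               # arithmetic: element-wise addition
greater  <- v > w               # relational: element-wise comparison
both_big <- (v > 3) & (w > 3)   # logical: element-wise AND
found    <- 5 %in% 1:10         # miscellaneous: membership test
run      <- 2:5                 # miscellaneous: colon operator builds a sequence

print(sum_vw)
print(greater)
print(both_big)
print(found)
print(run)
```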

R provides the following types of decision making statements.

Sr. No. Statement & Description

1 if statement

An if statement consists of a Boolean expression followed by one or more statements.

2 if...else statement

An if statement can be followed by an optional else statement, which executes when
the Boolean expression is false.

3 switch statement

A switch statement allows a variable to be tested for equality against a list of values.
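
The three decision making statements can be sketched as follows. The marks value, the pass mark of 40 and the labels passed to switch() are hypothetical values chosen for this example.

```r
marks <- 72   # hypothetical marks

# if statement: the body runs only when the condition is TRUE.
if (marks >= 90) {
  print("Excellent")
}

# if...else statement: exactly one of the two branches runs.
if (marks >= 40) {
  result <- "pass"
} else {
  result <- "fail"
}
print(result)

# switch statement: pick the value whose name matches the first argument.
level <- switch("b", a = "first", b = "second", c = "third")
print(level)
```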

The R programming language provides the following kinds of loop to handle looping requirements.

Sr. No. Loop Type & Description

1 repeat loop

Executes a sequence of statements repeatedly until a break statement stops the loop.

2 while loop

Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.

3 for loop

Executes the loop body once for each element of a vector or sequence, assigning that
element to the loop variable.
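
A minimal sketch of the three loop types follows. The counters and limits are arbitrary values chosen for illustration.

```r
# repeat loop: runs until break is reached.
count <- 0
repeat {
  count <- count + 1
  if (count == 3) {
    break
  }
}
print(count)

# while loop: the condition is tested before each pass.
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i
  i <- i + 1
}
print(total)

# for loop: k takes each value of the sequence 1:4 in turn.
squares <- numeric(0)
for (k in 1:4) {
  squares <- c(squares, k^2)
}
print(squares)
```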

An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −

function_name <- function(arg_1, arg_2, ...) {

   Function body

}

Function Components

The different parts of a function are −

• Function Name − This is the actual name of the function. It is stored in R environment as an
object with this name.

• Arguments − An argument is a placeholder. When a function is invoked, you pass a value to
the argument. Arguments are optional; that is, a function may contain no arguments. Also
arguments can have default values.

• Function Body − The function body contains a collection of statements that defines what the
function does.

• Return Value − The return value of a function is the last expression in the function body to be
evaluated.

R has many built-in functions which can be called directly in a program without defining them first.
We can also create and use our own functions, referred to as user-defined functions.
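
The sketch below defines a small user-defined function with a default argument and calls it alongside a built-in function. The name power_sum and its arguments are hypothetical, chosen only for this example.

```r
# User-defined function: sum of the elements of x raised to the power p.
# p has a default value, so it may be omitted in the call.
power_sum <- function(x, p = 2) {
  sum(x^p)   # the last evaluated expression is the return value
}

print(power_sum(1:3))          # uses the default p = 2
print(power_sum(1:3, p = 1))   # overrides the default

# Built-in function called without any definition.
print(mean(c(2, 4, 6)))
```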

Important Questions
Q1. What is the difference between conditional and iterative loop?
A condition is any variable or expression that returns a Boolean value ( TRUE or
FALSE ). The iteration structure executes a sequence of statements repeatedly as long
as a condition holds true.

Q2. Write a code to print sequence of numbers.

print("Sequence of numbers from 10 to 20:")
print(seq(10, 20))
Q3. Write a code to display the grade of the student if marks of 5 subjects are
given.

Code:
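
One possible sketch for this question is given below. The five marks and the grade boundaries (80, 60 and 40) are hypothetical values assumed for this example; they are not taken from the course rubric.

```r
# Hypothetical marks in five subjects (out of 100 each).
marks <- c(78, 85, 62, 90, 71)

# Average mark across the five subjects.
average <- mean(marks)

# Assumed grade boundaries: 80 and above is A, 60 and above is B,
# 40 and above is C, anything lower is F.
grade <- if (average >= 80) {
  "A"
} else if (average >= 60) {
  "B"
} else if (average >= 40) {
  "C"
} else {
  "F"
}

cat("Average marks:", average, "\n")
cat("Grade:", grade, "\n")
```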

Practical No: 4

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To understand the strings, vectors, lists, data frames and packages.
Any value written within a pair of single quotes or double quotes in R is treated as a string. Internally, R
stores every string within double quotes, even when you create it with single quotes.

Rules Applied in String Construction

• The quotes at the beginning and end of a string should be both double quotes or both single
quotes. They cannot be mixed.

• Double quotes can be inserted into a string starting and ending with single quotes.

• A single quote can be inserted into a string starting and ending with double quotes.

• Double quotes cannot be inserted into a string starting and ending with double quotes, unless
each one is escaped with a backslash (\").

• A single quote cannot be inserted into a string starting and ending with single quotes, unless it
is escaped with a backslash (\').
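
The quoting rules can be tried out with a few valid strings. The variable names and the text of each string are arbitrary choices for this sketch.

```r
# Strings delimited by matching quotes.
s1 <- 'Start and end with single quote'
s2 <- "Start and end with double quotes"

# The other kind of quote can appear inside the string.
s3 <- "single quote ' in between double quotes"
s4 <- 'Double quotes " in between single quote'

# The same kind of quote has to be escaped with a backslash.
s5 <- "Escaped \" double quote"

print(s1)
print(s2)
print(s3)
print(s4)
cat(s5, "\n")   # cat() shows the string without the escape character
```

print() displays every string wrapped in double quotes, whichever quotes were used to create it.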

Vectors are the most basic R data objects and there are six types of atomic vectors. They are
logical, integer, double, complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the above vector types.
# Atomic vector of type character.
print("abc")

# Atomic vector of type double.
print(12.5)

Multiple Elements Vector
Using colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
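
Besides the colon operator, vectors are commonly built with seq() and c(). The sketch below is an illustration with arbitrary values; it also shows that mixing types in c() converts every element to a common type.

```r
# Colon operator: integers from 5 to 13.
v <- 5:13

# seq(): a sequence with a chosen step size.
s <- seq(5, 9, by = 0.4)

# c(): combine values; mixed types are coerced to character here.
mixed <- c("apple", 1, TRUE)

print(length(v))
print(s)
print(class(mixed))

# Elements are accessed by position, starting at 1.
print(v[2])
```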

A Matrix is created using the matrix() function.

Syntax

The basic syntax for creating a matrix in R is −

matrix(data, nrow, ncol, byrow, dimnames)

Following is the description of the parameters used −

• data is the input vector which becomes the data elements of the matrix.

• nrow is the number of rows to be created.

• ncol is the number of columns to be created.

• byrow is a logical value. If TRUE then the input vector elements are arranged by row.

• dimnames is the names assigned to the rows and columns.
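
The effect of each parameter can be seen in a small sketch. The values 3:14 and the row and column names are arbitrary choices for this example.

```r
# Names for the rows and columns (arbitrary labels).
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")

# Fill a 4 x 3 matrix row by row and attach the names.
M <- matrix(3:14, nrow = 4, ncol = 3, byrow = TRUE,
            dimnames = list(row_names, col_names))
print(M)

# Without byrow = TRUE the same data are filled column by column.
N <- matrix(3:14, nrow = 4, ncol = 3)
print(N)

# Elements can be picked by name or by position.
print(M["row2", "col3"])
print(N[2, 3])
```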

A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.

• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain same number of data items.

Create Data Frame


# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
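
Once a data frame exists, its structure can be inspected and its rows and columns selected. The sketch below builds a smaller frame with three of the rows above; the name emp and the salary threshold of 600 are arbitrary choices for illustration.

```r
# A small data frame with three employees.
emp <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary   = c(623.3, 515.2, 611.0),
  stringsAsFactors = FALSE
)

# Structure: column names, types and the first few values.
str(emp)

# Summary statistics of one column.
print(summary(emp$salary))

# Rows where the salary is above 600 (threshold chosen for the example).
high <- emp[emp$salary > 600, ]
print(high)

# Add a new column; it must have one value per row.
emp$dept <- c("IT", "Operations", "IT")
print(ncol(emp))
```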

R packages are collections of R functions, compiled code and sample data. They are stored under a
directory called "library" in the R environment. By default, R installs a set of packages during
installation. More packages are added later, when they are needed for some specific purpose. When we
start the R console, only the default packages are available. Other packages which are
already installed have to be loaded explicitly before an R program can use them.
The packages available for R are listed on CRAN (http://cran.r-project.org).
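
The sketch below shows where packages live and how an installed package is attached to a session. The tools package is used only because it ships with R but is not attached by default; the install.packages() call is left commented out because it needs an internet connection.

```r
# Folders in which R looks for installed packages.
print(.libPaths())

# Names of the packages installed on this machine.
installed <- rownames(installed.packages())
print(head(installed))

# Packages attached to the session before loading anything extra.
before <- search()

# Attach an installed package to the current session.
library(tools)
after <- search()
print(setdiff(after, before))

# Installing a new package from CRAN (needs internet access).
# install.packages("ggmap")
```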

Important Questions

Q1. What are the different packages available in R?

Examples of packages available for R include tidyquant, dygraphs, leaflet and ggmap.

Q2. Create a data frame of your class.


Q3. Write a code to display different types of strings.
Code

Practical No: 5
Date of Performance: Marks:
Faculty Signature: Remarks:

Aim: To understand the R data interfaces through CSV and Excel files.
In R, we can read data from files stored outside the R environment. We can also write data into files
which will be stored and accessed by the operating system. R can read and write various file
formats such as CSV, Excel and XML.
In this practical we will learn to read data from a CSV file and then write data into a CSV file. The file
should be present in the current working directory so that R can read it. We can also set our
own directory and read files from there.

Getting and Setting the Working Directory

You can check which directory the R workspace is pointing to using the getwd() function. You can
also set a new working directory using the setwd() function.
# Get and print current working directory.
print(getwd())

# Set current working directory.
setwd("/web/com")

# Get and print current working directory.
print(getwd())
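
The directory "/web/com" above will not exist on most machines, and setwd() stops with an error for a missing directory. A hedged variant is sketched below: it switches to R's temporary directory, which always exists for the session, and then restores the original directory. The variable name old.dir is our own.

```r
# Remember the current working directory.
old.dir <- getwd()

# Switch to a directory that is known to exist (R's session temp directory).
setwd(tempdir())
print(getwd())

# List the files R can see in the new working directory.
print(list.files())

# Return to the original directory.
setwd(old.dir)
print(getwd())
```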

Input as CSV File


A CSV file is a text file in which the values in the columns are separated by commas. Let's consider
the following data present in the file named input.csv.
You can create this file using Windows Notepad by copying and pasting this data. Save the file
as input.csv using the Save As, All Files (*.*) option in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT

Reading a CSV File


Following is a simple example of read.csv() function to read a CSV file available in your current
working directory −
data <- read.csv("input.csv")
print(data)
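
If input.csv has not been created by hand, the same rows can be written from R first. The sketch below assumes a temporary file path (csv.path is an illustrative name) so that it does not depend on the working directory, and also shows that date columns arrive as text and need an explicit conversion.

```r
# Write the sample rows to a temporary CSV file (path is illustrative).
csv.path <- file.path(tempdir(), "input.csv")
writeLines(c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  "5,Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT"
), csv.path)

# Read it back; the first line becomes the column names.
data <- read.csv(csv.path, stringsAsFactors = FALSE)
print(data)

# read.csv() leaves dates as text, so convert them explicitly.
data$start_date <- as.Date(data$start_date)
str(data)
```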

Analyzing the CSV File


By default the read.csv() function gives the output as a data frame. This can be easily checked as
follows. Also we can check the number of columns and rows.
data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
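
Because the result is a data frame, the usual data frame functions apply to it, and a filtered result can be written back to a new CSV file with write.csv(). The following sketch builds the data inline instead of reading input.csv, so it runs on its own; the names top, it.staff and out.path are illustrative.

```r
# A small data frame standing in for the contents of input.csv.
data <- data.frame(
  id = 1:6,
  name = c("Rick", "Dan", "Michelle", "Ryan", "Gary", "Nina"),
  salary = c(623.3, 515.2, 611, 729, 843.25, 578),
  dept = c("IT", "Operations", "IT", "HR", "Finance", "IT"),
  stringsAsFactors = FALSE
)

# Highest salary, and the row of the person who earns it.
top <- subset(data, salary == max(salary))
print(top)

# All employees in the IT department.
it.staff <- subset(data, dept == "IT")
print(it.staff)

# Write the subset to a new CSV file without the row-name column.
out.path <- file.path(tempdir(), "output.csv")
write.csv(it.staff, out.path, row.names = FALSE)

# Read the file back to confirm what was written.
print(read.csv(out.path))
```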

Microsoft Excel is the most widely used spreadsheet program, and it stores data in the .xls or .xlsx
format. R can read directly from these files using Excel-specific packages such as XLConnect,
xlsx and gdata. We will be using the xlsx package. R can also write into an Excel file using
this package.
Install xlsx Package
You can use the following command in the R console to install the "xlsx" package. R may ask to
install some additional packages on which this package depends. If so, run the same command with
the required package name to install them.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
# Verify the package is installed (exact match on the package name).
"xlsx" %in% rownames(installed.packages())

# Load the library into R workspace.
library("xlsx")
When the script is run we get the following output.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars

Input as xlsx File


Open Microsoft Excel. Copy and paste the following data in the worksheet named sheet1.
id  name      salary  start_date  dept
1   Rick      623.3   1/1/2012    IT
2   Dan       515.2   9/23/2013   Operations
3   Michelle  611     11/15/2014  IT
4   Ryan      729     5/11/2014   HR
5   Gary      843.25  3/27/2015   Finance
6   Nina      578     5/21/2013   IT
7   Simon     632.8   7/30/2013   Operations
8   Guru      722.5   6/17/2014   Finance
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name      city
Rick      Seattle
Dan       Tampa
Michelle  Chicago
Ryan      Seattle
Gary      Houston
Nina      Boston
Simon     Mumbai
Guru      Dallas
Save the Excel file as "input.xlsx". You should save it in the current working directory of the R
workspace.
Reading the Excel File
The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data
frame in the R environment.
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
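
The second worksheet can be read by name and combined with the first one. The sketch below is hedged in two ways: it only touches Excel files when the xlsx package (which needs Java) can be loaded, and it uses small inline data frames with the first three rows of each sheet so that the merge step runs either way. The file name input_demo.xlsx is illustrative.

```r
# Data frames mirroring the two worksheets described above (first rows only).
emp <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611),
  stringsAsFactors = FALSE
)
city <- data.frame(
  name = c("Rick", "Dan", "Michelle"),
  city = c("Seattle", "Tampa", "Chicago"),
  stringsAsFactors = FALSE
)

# Only touch Excel files when the xlsx package (and Java) is available.
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx.path <- file.path(tempdir(), "input_demo.xlsx")  # illustrative path
  xlsx::write.xlsx(emp, xlsx.path, sheetName = "sheet1", row.names = FALSE)
  xlsx::write.xlsx(city, xlsx.path, sheetName = "city",
                   row.names = FALSE, append = TRUE)

  # Read a worksheet by name instead of by index.
  city <- xlsx::read.xlsx(xlsx.path, sheetName = "city")
}

# Combine the two tables on their shared "name" column.
combined <- merge(emp, city, by = "name")
print(combined)
```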

Important Questions

Q1. Create a data frame of a student in excel and import the data.

Q2. Add a column of grade.



Q3. Compute the grade and update the excel.


Code:

Practical No: 6

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To analyse the data using Graphs and Charts.


The R programming language has numerous libraries to create charts and graphs. A pie chart is a
representation of values as slices of a circle with different colors. The slices are labeled and the
number corresponding to each slice can also be shown in the chart.
In R the pie chart is created using the pie() function, which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie chart (value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Example
A very simple pie-chart is created using just the input vector and labels. The below script will create
and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.
png(file = "city.png")

# Plot the chart.
pie(x, labels)

# Save the file.
dev.off()

When we execute the above code, the pie chart is saved as city.png in the working directory.
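
The remaining parameters (main, col, clockwise) and slice percentages can be added to the same data. This is a sketch only: the title text, the rainbow() palette and the temporary output file are our own choices, not part of the original example.

```r
x <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")

# Share of each slice as a whole-number percentage.
pct <- round(100 * x / sum(x))
slice.labels <- paste0(cities, " (", pct, "%)")

# Save to a temporary file (illustrative name) rather than the working directory.
out.file <- file.path(tempdir(), "city_pct.png")
png(file = out.file)
pie(x, labels = slice.labels, main = "City pie chart",
    col = rainbow(length(x)), clockwise = TRUE)
dev.off()
```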

A bar chart represents data in rectangular bars with the length of each bar proportional to the value of the
variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal
bars in the bar chart. Each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used −
• H is a vector or matrix containing numeric values used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
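
A short sketch of these parameters in use, reusing the city values from the pie chart example. The axis labels, title, color and temporary file name are illustrative assumptions. Passing horiz = TRUE to barplot() would draw horizontal bars instead.

```r
x <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")

out.file <- file.path(tempdir(), "city_bar.png")  # illustrative file name
png(file = out.file)

# barplot() returns the x position of each bar's midpoint.
mids <- barplot(x, names.arg = cities, xlab = "City", ylab = "Value",
                main = "City bar chart", col = "steelblue")

# Use the midpoints to print each value just above its bar.
text(mids, x, labels = x, pos = 3, xpd = TRUE)
dev.off()
```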

A boxplot shows how the values in a data set are distributed. It is built from the three
quartiles, which divide the data into four equal parts. This graph represents the minimum, maximum, median, first quartile and third
quartile in the data set. It is also useful in comparing the distribution of data across data sets by
drawing boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw width of the box proportionate to the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
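
As an illustration of the formula form, the sketch below draws one box per group using the built-in mtcars data set (an assumption; the manual does not name a data set here). The labels and the temporary file name are illustrative.

```r
out.file <- file.path(tempdir(), "boxplot.png")  # illustrative file name
png(file = out.file)

# One box of mpg values for each number of cylinders (4, 6, 8).
bp <- boxplot(mpg ~ cyl, data = mtcars, notch = FALSE, varwidth = TRUE,
              xlab = "Number of cylinders", ylab = "Miles per gallon",
              main = "Mileage data")
dev.off()

# Each column of bp$stats holds: lower whisker, first quartile, median,
# third quartile, upper whisker for one group.
print(bp$stats)
print(bp$names)
```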

Important Questions

Q1. Create a bar graph of the data from the pie-chart.


Q2. Create a data set and make bar graph and pie-chart for the same.

Q3. What is the difference between pie-chart and bar graph?

A pie chart can only be used if the individual parts add up to a
meaningful whole, and is built for visualizing how each part contributes to that whole.

Meanwhile, a bar chart can be used for a broader range of data types, not just for
breaking down a whole into components.

Code:

Practical No: 7

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To analyse the data using Line graphs, Scatter plots and Histogram.
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a
histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input and uses some
more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to specify the break points between bars (or the approximate number of bars), which sets the width of each bar.
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.
png(file = "histogram.png")

# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.
dev.off()
When we execute the above code, the histogram is saved as histogram.png in the working directory.
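
The xlim, ylim and breaks parameters can be tried on the same vector. In this sketch the break points, axis limits, colors and the temporary file name are our own choices. hist() also returns the bin counts, which makes the grouping into ranges visible.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

out.file <- file.path(tempdir(), "histogram_breaks.png")  # illustrative name
png(file = out.file)

# Explicit break points give bars of width 10 covering 0 to 50.
h <- hist(v, breaks = seq(0, 50, by = 10), xlim = c(0, 50), ylim = c(0, 5),
          xlab = "Weight", main = "Histogram of v",
          col = "green", border = "red")
dev.off()

# hist() also returns the bin edges and the count in each bin.
print(h$breaks)
print(h$counts)
```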

A line chart is a graph that connects a series of points by drawing line segments between them. These
points are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually
used for identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab,main)
Following is the description of the parameters used −
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the chart.
• col is used to give colors to both the points and lines.

Example
A simple line chart is created using the input vector and the type parameter as "o". The below script
will create and save a line chart in the current R working directory.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart.png")
# Plot the line chart.
plot(v,type = "o")
# Save the file.
dev.off()
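
More than one series can be drawn on the same chart by adding lines() after plot(). In the sketch below the second vector w, the axis labels, the title and the temporary file name are made up for illustration.

```r
v <- c(7, 12, 28, 3, 41)
w <- c(14, 7, 6, 19, 3)   # second series, made up for illustration

out.file <- file.path(tempdir(), "line_chart_two.png")  # illustrative name
png(file = out.file)

# First series sets up the axes; ylim must cover both series.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rainfall",
     main = "Rainfall chart", ylim = range(c(v, w)))

# lines() draws onto the existing plot instead of starting a new one.
lines(w, type = "o", col = "blue")
legend("topleft", legend = c("v", "w"), col = c("red", "blue"), lty = 1)
dev.off()
```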

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two
variables. One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
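
Here x and y are the two variables, main is the title, xlab and ylab are the axis labels, xlim and ylim limit the plotted ranges, and axes controls whether the axes are drawn. A minimal sketch follows, assuming the built-in mtcars data set (weight against mileage); the limits, labels and file name are illustrative.

```r
# Two numeric columns from the built-in mtcars data set.
input <- mtcars[, c("wt", "mpg")]

out.file <- file.path(tempdir(), "scatterplot.png")  # illustrative name
png(file = out.file)

# x and y are the two variables; xlim and ylim restrict the plotted ranges;
# axes = TRUE draws both axes.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30),
     main = "Weight vs Mileage", axes = TRUE)
dev.off()

# Correlation gives a one-number summary of the trend seen in the plot.
print(cor(input$wt, input$mpg))
```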

Important Questions

Q1. Create a scatterplot of the data from the histogram.



Q2. Create a data set and plot the data through a scatterplot. What analysis did you
obtain?

Q3. Discuss any one real time application with analysis by plotting.

Practical No: 8

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To analyse the data through Statistics by mean, median, mode, comparison
of means.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data
series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
• x is the input vector.
• trim is used to drop some observations from both ends of the sorted vector.
• na.rm is used to remove the missing values from the input vector.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result −
[1] 8.22
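
The trim and na.rm parameters can be tried on the same vector. This is a small sketch; the trim fraction 0.3 and the vector y with an added NA are chosen only for illustration.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# trim = 0.3 sorts the values and drops 30% of them (3 of 10) from each end
# before averaging, which reduces the pull of outliers such as 54 and -21.
print(mean(x, trim = 0.3))

# A single missing value makes the plain mean NA ...
y <- c(x, NA)
print(mean(y))

# ... unless na.rm = TRUE removes it first.
print(mean(y, na.rm = TRUE))
```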

Median
The middle most value in a data series is called the median. The median() function is used in R to
calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
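
Applied to the vector used for the mean, a short sketch looks like this. Sorting first is not required by median(); it is printed here only to make the middle values visible.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# With an even number of values, the median is the average of the two
# middle values of the sorted vector (here 4.2 and 7).
print(sort(x))
median.result <- median(x)
print(median.result)
```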

Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and
median, mode can be computed for both numeric and character data.
R does not have a standard built-in function to calculate the mode. So we have to create a user function to
calculate the mode of a data set in R.
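
One possible user function is sketched below. The name getmode and the sample vectors are our own, and other implementations are equally valid. When two values are tied for the highest count, this version returns the one that appears first.

```r
# User-defined helper (the name getmode is our own choice).
# Returns the most frequent value; on a tie, the value that appears first.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Numeric data.
print(getmode(c(3, 7, 7, 2, 7, 3)))

# Character data.
print(getmode(c("o", "it", "the", "it", "it")))
```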

Important Questions

Q1. Write command to calculate mean and median.

Mean - result.mean <- mean(x,na.rm = TRUE)
print(result.mean)
Median - median.result <- median(x)
print(median.result)

Q2. Given the marks of 10 students of a class in economics and statistics. Find
mean of both the subjects and analyze the result.

Economics 25 28 35 32 31 36 29 38 34 32
Statistics 43 46 49 41 36 32 31 30 33 39
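
One way to approach the comparison of means, shown as a sketch rather than a model answer: compute the mean of each subject, then their difference, with the standard deviations printed for context. The variable names are our own.

```r
economics  <- c(25, 28, 35, 32, 31, 36, 29, 38, 34, 32)
statistics <- c(43, 46, 49, 41, 36, 32, 31, 30, 33, 39)

mean.eco  <- mean(economics)
mean.stat <- mean(statistics)
print(c(Economics = mean.eco, Statistics = mean.stat))

# Difference of the two means, and the spread of each subject for context.
print(mean.stat - mean.eco)
print(c(Economics = sd(economics), Statistics = sd(statistics)))
```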

Code

Practical No: 9

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To obtain the relation between variables using linear and multiple
regression.
Regression analysis is a very widely used statistical tool to establish a relationship model between
two variables. One of these variables is called the predictor variable, whose value is gathered through
experiments. The other variable is called the response variable, whose value is derived from the predictor
variable.
In linear regression these two variables are related through an equation in which the exponent (power) of
both variables is 1. Mathematically, a linear relationship represents a straight line when plotted
as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a
curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients: a is the slope of the line and b is its intercept.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y, written as y ~ x.
• data is the data frame containing the variables on which the formula will be applied.
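
A minimal sketch of lm() for one predictor, assuming the built-in cars data set (stopping distance against speed), which the manual itself does not use. The new speed value 21 is made up for illustration.

```r
# Built-in data set: stopping distance (dist) against speed for 50 cars.
model <- lm(dist ~ speed, data = cars)

# The fitted coefficients: intercept b and slope a in y = ax + b.
b <- coef(model)[["(Intercept)"]]
a <- coef(model)[["speed"]]
print(c(slope = a, intercept = b))

# Predict the response for a new predictor value; newdata must be a
# data frame whose column name matches the predictor in the formula.
new.speed <- data.frame(speed = 21)
predicted <- predict(model, new.speed)
print(predicted)

# The prediction is just the line evaluated at x = 21.
print(a * 21 + b)
```

Calling summary(model) prints further details such as residuals and the R-squared value.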
Multiple regression is an extension of linear regression to the relationship between more than two
variables. In simple linear regression we have one predictor and one response variable, but in multiple
regression we have more than one predictor variable and one response variable.
The general mathematical equation for multiple regression is −
y = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −
• y is the response variable.
• a, b1, b2, ..., bn are the coefficients.
• x1, x2, ..., xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given
set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between the response variable and predictor variables.
• data is the data frame containing the variables on which the formula will be applied.
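
A sketch of the same call with three predictors, assuming the built-in mtcars data set. The choice of columns and the values for the hypothetical new car are illustrative only.

```r
# Predict mileage from displacement, horsepower and weight (built-in mtcars).
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# One intercept (a) plus one coefficient (b1, b2, b3) per predictor.
coefs <- coef(model)
print(coefs)

# Predict mpg for a hypothetical car; the values are made up for illustration.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted <- predict(model, new.car)
print(predicted)

# Same value from the equation y = a + b1x1 + b2x2 + b3x3.
manual <- coefs[["(Intercept)"]] + coefs[["disp"]] * 221 +
  coefs[["hp"]] * 102 + coefs[["wt"]] * 2.91
print(manual)
```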

Important Questions

Q1. What is the difference between linear and multiple regression?

The main difference between linear regression and multiple regression is the number of
independent variables being considered. Linear regression involves only one
independent variable, while multiple regression involves two or more independent
variables.

Q2. Differentiate between correlation and regression.

Correlation, as the name says, determines the interconnection or co-relationship
between the variables.

Regression explains how an independent variable is numerically associated with the dependent
variable.

Q3. Given the marks of 10 students of a class in economics and statistics. Find
correlation coefficient of both the subjects and analyze the result.

Economics 25 28 35 32 31 36 29 38 34 32
Statistics 43 46 49 41 36 32 31 30 33 39

Q4. Solve the above question and verify the result.

Code:

Practical No: 10

Date of Performance: Marks:


Faculty Signature: Remarks:

Aim: To perform any case study and implement all functions of statistics, plotting
etc.

Code:
