DADS301: Programming in Data Science Manipal University Jaipur (MUJ)

MASTER OF BUSINESS ADMINISTRATION


SEMESTER 3

DADS301
PROGRAMMING IN DATA SCIENCE

Unit 1: Introduction to R 1
Unit 1
Introduction to R
Table of Contents

SL No  Topic                                             Fig No / Table / Graph                           SAQ / Activity
1      Introduction                                      -                                                -
1.1    Learning Objectives                               -                                                -
2      History of R                                      1                                                -
3      Features of R                                     -                                                -
4      Organisations Using R                             2                                                -
5      How Popular is R?                                 -                                                -
6      Tidyverse Package for Data Wrangling              3, 4                                             -
7      ggplot2 Package for High-Quality Graphics         5, 6                                             -
8      How to Download and Install R for Windows         7, 8                                             -
9      RStudio                                           9                                                -
10     How to Download and Install RStudio for Windows   10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21   -
11     Introduction to Google Colab                      22, 23, 24, 25, 26, 27, 28                       -
12     Setting up R on Google Colab                      29                                               -
13     Installing R-Packages                             -                                                -
14     Summary                                           -                                                -
15     Self Assessment Questions                         -                                                -
16     Terminal Questions                                -                                                -
17     SAQ Answers                                       -                                                -
18     Terminal Q Answers                                -                                                -

1. INTRODUCTION
The R programming language is a very popular choice among practitioners of data science across
the world. R evolved from the S programming language, which was developed in 1976. The S
programming language was meant for statistical computation, with the goal of giving users an
interactive experience that focuses more on working with data than on programming.

The R programming language was developed in 1993 as an extended version of the S programming
language. R is an integrated set of software facilities, including the R programming language
itself, for data handling, analysis, and visualisation.

1.1 Learning Objectives

At the end of this topic, you will be able to:

❖ Explain the history, evolution, and current status of R
❖ Outline important features of R
❖ Identify the appropriate packages for graphics and data wrangling in R
❖ Outline the working of an R programming environment
❖ Explain the installation process and working of RStudio IDE for Windows
❖ Explain the working of Google Colab
❖ Run R code using Google Colab

2. HISTORY OF R
‘R’ is a powerful language used in data science. A brief history of the R programming
language is given below.
• R evolved from the S programming language, which was developed in 1976 at Bell
Laboratories (then part of AT&T, later part of Lucent Technologies, and now Nokia Bell
Labs). The S programming language, as the letter S indicates, was intended for
statistical computation. Its goal was to give the user an interactive experience of
statistical computing that focuses more on the interaction with data and relatively
less on the programming aspects.
• The R programming language was developed in 1993 as an advanced or extended version
of the S programming language. It can be thought of as a modern implementation of S
and supports the same statistical computations that ‘S’ supported. R was created by
the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand.
• R also offers advanced graphics capabilities for producing high-quality graphical
output and has packages designed specifically for data science applications. R itself
is open-source software, maintained by a core group of developers called the R Core
Team together with people from the R Foundation.
• The source code for R was originally written in programming languages such as C and
Fortran, which lets R run on multiple hardware platforms with good performance.
Another interesting aspect is that R also allows users to develop new packages by
programming directly in R itself. One of the popular graphics packages in R, ggplot2,
was in fact developed using R itself. We can say that R is an integrated suite of
software facilities for data handling, analysis, and visualisation.

Fig 1: Applications of R

3. FEATURES OF R
There are several features of R that make it a preferred choice among developers.
Some of the key features of R are:

1. Seamless data handling and storage: One of the strongest features of R is that it
enables users to handle data of various sizes and from various sources. R lets users
handle small data, which typically refers to data of around one gigabyte in size. It
enables users to read, store, and transform such data using readily available built-in
features. These features are available in the base R package, but there are more
advanced packages, such as the data.table package, which let users handle hundreds of
gigabytes of data efficiently with minimal syntax. This is what makes R a very popular
programming language as far as data handling is concerned.
2. Fast operators for arrays: Another important feature of R is the availability of fast
operators on objects known as arrays. Arrays hold the parts of the data that a user
extracts for further analysis and visualisation. R has efficient built-in libraries that
let a user perform these operations in a computationally efficient manner. One of the
features built on top of the fast operations on array-like objects is called ‘vectorised
computation’. Vectorised computation is a style of programming in which the same
operation is applied to every element of a vector or array in a single expression,
without an explicit loop. This allows users to write R code in a clear, concise, and
efficient manner. An R program written this way can be read as a clear exposition of
the user’s thought process.
3. Efficient tools for data wrangling and analysis: Another feature that makes R very
popular is the availability of tools for data wrangling and analysis. Data wrangling is the
process of transforming raw data into a usable format. This is especially useful when
dealing with unstructured data. There are libraries or packages built specifically for
this purpose in R, and they let users deal with data of all sizes, shapes, and levels of
complexity.
4. High-quality graphics for display or print: One of the standout features of R, in
addition to the features mentioned above, which are also available in other high-level
programming languages such as Python, is its ability to produce high-quality graphical
output. For work that requires the communication of information via high-quality
graphical output, R is often the first choice.
5. Traditional programming concepts - conditionals, loops, user-defined functions, and
input-output facilities: Besides all these high-level features, R also has traditional
programming features such as if-statements, for-loops, and user-defined functions.
These low-level features let users write simple code for debugging or for specific
purposes.
6. Others: R is platform independent. It also supports seamless integration with various
third-party tools.
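
The contrast between vectorised computation (feature 2) and traditional constructs (feature 5) can be illustrated with a minimal sketch. The data values and the helper function to_metres() are hypothetical and used only for illustration.

```r
# Vectorised computation: one expression acts on every element at once.
heights_cm <- c(150, 162, 171, 180)      # hypothetical sample data
heights_m  <- heights_cm / 100           # element-wise division, no loop needed
tall       <- heights_m > 1.65           # element-wise comparison gives a logical vector

# The same conversion written with traditional constructs:
# a user-defined function, a for-loop, and an if-statement.
to_metres <- function(cm) {
  cm / 100
}

heights_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  heights_loop[i] <- to_metres(heights_cm[i])
}

if (all(heights_loop == heights_m)) {
  print("Loop and vectorised results match")
}
```

Both versions produce the same numbers; the vectorised form is shorter and usually faster, while the loop form is handy when each step needs to be inspected during debugging.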

4. ORGANISATIONS USING R
R is one of those rare programming languages embraced by a multitude of organisations
across the domains of technology, pharmaceuticals, media, and academia.

Fig 2: Organisations Using R

All these organisations use R to communicate graphically the information hidden within
their data. R offers powerful yet concise features to deal with small data, big data, and
what is known as streaming data. Due to the advantages discussed above, R is often the
first choice for applications in fields such as geospatial analysis, bioinformatics, and
genetics. R also helps users create interactive web apps using the Shiny package. These
apps let end-users interact with a sophisticated display of graphics in the browser while
all the complicated calculations run behind the scenes on an R engine.
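
As a rough illustration of how a Shiny app separates what the end-user sees from the calculations done by R, here is a minimal sketch. It assumes the shiny package is installed; the input name "n", the output name "histogram", and the titles are hypothetical choices made for this example.

```r
library(shiny)  # assumes the shiny package is installed

# User interface: what the end-user sees in the browser.
ui <- fluidPage(
  titlePanel("Random sample explorer"),
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("histogram")
)

# Server logic: runs on the R engine behind the scenes.
server <- function(input, output) {
  output$histogram <- renderPlot({
    hist(rnorm(input$n), main = "Histogram of a random sample", xlab = "Value")
  })
}

# Bundle the two parts into an app object; call runApp(app) to launch it.
app <- shinyApp(ui = ui, server = server)
```

Moving the slider changes input$n, and Shiny re-runs the plotting code on the R side and sends the new picture to the browser.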

5. HOW POPULAR IS R?
Table 1: TIOBE Programming Community Index

Source: https://www.tiobe.com/tiobe-index/

As shown in the TIOBE Programming Community Index (Table 1), R is the number-two choice
among high-level programming languages for data science applications.

The number-one choice is Python, which remains the most popular programming language for
data science applications. However, it is not uncommon to see data science practitioners
using both Python and R, depending on the purpose at hand. So, it is a good idea to become
skilful in both languages by learning one at a time and diving into the core topics of
each.

6. TIDYVERSE PACKAGE FOR DATA WRANGLING


What is data wrangling?
Data wrangling is the process of transforming raw data or unstructured data into a more
usable form. In other words, it is the art of turning data into a useful form for visualisation
and analysis.

There are three important steps in data wrangling:

1. The first step is to import data from wherever it is available.
2. The second step is called the tidy step; the name of the package, tidyverse, originates
from this term. The tidy step corresponds to storing the data in a consistent format
suited to further transformation, visualisation, and analysis.
3. The third step is to transform, visualise, and analyse the data. This is the step in
which the data is interpreted.

Fig 3: Important Steps in Data Wrangling


Source: https://ggplot2-book.org/index.html

Transformation refers to operations such as reordering the rows of data, adding new
variables, or renaming existing variables; data resulting from such steps is said to be
tidied up. The tidyverse, which is a collection of R packages that share a common design,
supports all three steps with a simple and consistent syntax that helps the user
communicate data effectively. It is probably one of the most popular tools for organising
data in a consistent way.
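
The import and transform steps described above can be sketched as follows. This is a minimal example, assuming the dplyr package (part of the tidyverse) is installed; the customer data and the column names are hypothetical, and the CSV text stands in for a file that would normally be read from disk.

```r
library(dplyr)  # part of the tidyverse; assumes it is installed

# Step 1: import. The CSV text stands in for a file on disk (hypothetical data).
csv_text <- "cust_id,age,spend
3,41,120.5
1,29,80.0
2,35,200.0"
raw <- read.csv(text = csv_text)

# Steps 2 and 3: transform the imported data with dplyr verbs.
customers <- raw %>%
  rename(customer_id = cust_id) %>%        # rename a variable
  arrange(customer_id) %>%                 # reorder the rows
  mutate(high_spender = spend > 100)       # add a new variable

print(customers)
```

Each verb takes a data frame and returns a new one, so the steps read from top to bottom in the order in which they are applied.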

Fig 4: Tidy form of Data


Source: https://ggplot2-book.org/index.html

The goal of the tidyverse package is to put data in what is known as a tidy form. Each
column in the data refers to a variable or a feature. For example, the first column could
be age, the second column could be gender, etc.

Each row in the data refers to a specific observation or sample. The first row could
correspond to the first customer or to the first patient, etc. Each value should occupy
its own cell. So, for example, the cell in the first row of the first column holds the age
of the first customer or the first patient. This is known as organising data in a tidy
format. It may seem simple, but data is not always available in a tidy format. The
tidyverse package helps users organise data in a format which is easy to analyse and
visualise later.
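
To make the idea of tidy form concrete, here is a minimal sketch that reshapes an untidy table into a tidy one. It assumes the tidyr package (part of the tidyverse) is installed; the patients, weights, and column names are hypothetical.

```r
library(tidyr)  # part of the tidyverse; assumes it is installed

# Untidy layout: one variable (year) is spread across column names.
# The patients and weights are hypothetical.
wide <- data.frame(
  patient     = c("A", "B"),
  weight_2021 = c(70, 82),
  weight_2022 = c(72, 80)
)

# Tidy layout: each column is a variable, each row is one observation.
tidy <- pivot_longer(
  wide,
  cols         = starts_with("weight_"),
  names_to     = "year",
  names_prefix = "weight_",
  values_to    = "weight_kg"
)

print(tidy)
```

After reshaping, the table has one row per patient per year, with the columns patient, year, and weight_kg.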

7. GGPLOT2 PACKAGE FOR HIGH-QUALITY GRAPHICS


ggplot2 was developed by coding directly in the R programming language. The ggplot2
package lets users build a high-level description of a plot from R constructs such as the
data, the mapping of variables to visual properties, and the layers to be drawn. The user
only has to specify how the graphic should appear; a rendering engine then takes a
holistic view of all these plot objects and renders the final graphic on a monitor or some
other medium.
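
The high-level description mentioned above can be sketched with a small example. It assumes the ggplot2 package is installed and uses mtcars, a data set that ships with R; the choice of variables, labels, and theme is only illustrative.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Describe the plot: data, aesthetic mapping, then layers and labels.
# mtcars is a data set that ships with R.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    title  = "Fuel efficiency against weight",
    x      = "Weight (1000 lbs)",
    y      = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# Nothing is drawn until the plot object is printed or saved.
print(p)
```

The object p only stores the description; the drawing happens when it is printed, which is why a plot can be built up piece by piece with the + operator.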

Fig 5: Data Visualisation Using ggplot2


Source: https://web.stanford.edu/class/bios221/book/Chap-Graphics.html

The ggplot2 package makes R stand out among other popular high-level programming
languages such as Python. It enables users to create high-quality, publication-level
graphics such as the one shown in Figure 5, which shows the expression level of genes
(rows) across several samples (columns). This example demonstrates the capabilities of
the ggplot2 package for creating high-quality graphical output.

The ggplot2 package is also used extensively in the media industry. Popular publication
houses such as the New York Times use ggplot2 to generate graphical output such as the one
shown in Figure 6 to communicate data that is of social and economic relevance.

Fig 6: Example of ggplot2 Visualisation in Media


Source: https://www.nytimes.com/

8. HOW TO DOWNLOAD AND INSTALL R FOR WINDOWS


Step 1: To install R on Windows, go to the Comprehensive R Archive Network (CRAN)
website https://cran.r-project.org/ and click the “Download R for Windows” link.

Fig 7: Web Page of the Comprehensive R Archive Network

Step 2: Click the “base” link, which leads to a download link such as “Download R-4.2.1
for Windows”, where 4.2.1 is replaced by the latest version of R.

Fig 8: Finding the Link to Download the Latest Version of R

Step 3: Download the installer program, execute it, and follow the default prompts to install
the latest version of R for Windows. Make sure the administrator privileges for installing
new software are enabled.

9. RSTUDIO
So, how do we write and run programs in the R programming language?

The most popular choice is to use an application called RStudio.

Fig 9: RStudio
Source: https://www.rstudio.com/blog/announcing-rstudio-1-4/

RStudio is an open-source integrated development environment (IDE) designed specifically
for the R programming language. It offers a visual editor as shown in Figure 9; on the
left-hand side, you can see a piece of R code, and on the right-hand side there is the
graphical output from that code. We can also run commands using the console. RStudio can
be easily integrated with Python. It can be installed as a desktop application or used on
the Web through a cloud interface. There is also a paid version with additional features
for the web-based interface of RStudio.

The other popular IDEs are:

• Visual Studio Code
• Atom
• Sublime Text
• Rider

10. HOW TO DOWNLOAD AND INSTALL RSTUDIO FOR WINDOWS


RStudio is an IDE for managing packages and writing code in R.

Step 1: To download RStudio, go to https://www.rstudio.com/products/rstudio/ and select
the “RStudio Desktop” version, then select “Open Source Edition” and click “Download
RStudio Desktop”.

Fig 10: Downloading RStudio Desktop Version

Step 2: In the downloads window, click “RStudio Desktop Free download version”.

Fig 11: Downloading RStudio for Windows

Step 3: Click “Download RStudio for Windows” for the installer.

Fig 12: Downloading RStudio Desktop Version

Step 4: Execute the installer with administrator privileges and follow the prompts to install
RStudio.

Step 5: Once installed, run RStudio.

Fig 13: Welcome Page of RStudio Desktop Version

Step 6: As RStudio starts, it shows the major components of its interface, which include:
• Console: Place to enter commands and check interactively whether the right set of
packages is installed
• Terminal: A Windows terminal used for executing Windows commands
• Jobs: Displays the ongoing processes
• Environment: Displays the current R environment and the variables in use
• History: Keeps a record of all commands entered in the IDE console
• Connections: Helps to connect to external data sources through their drivers
• Tutorial: Powered by the “learnr” package, which connects to RStudio Education tutorials
• Files: Displays the files accessible on the system
• Plots: Displays graphical output
• Packages: Lists the R packages that are installed and available for use
• Help: Provides details about the various R resources and RStudio community support
• Viewer: Allows users to view local web content

Step 7: Check that the console is working by running a few commands such as “R.version”,
which reports the current version of the R environment.
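
A minimal sketch of what can be typed at the console for this check is given below. Note that R.version is a list rather than a function, so it is entered without parentheses; the variable name version_number is a hypothetical choice for this example.

```r
# R.version is a list, so it is typed without parentheses.
print(R.version.string)                       # one-line summary of the version

# Individual fields can be picked out by name.
version_number <- paste(R.version$major, R.version$minor, sep = ".")
print(version_number)
```

The exact text printed depends on the version of R installed on your machine.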

Fig 14: Command for R.version

Step 8: To clear the console environment, press Ctrl + L.

Step 9: Before starting to work, it is important to set the working directory. Click
‘Session’ -> ‘Set Working Directory’ -> ‘Choose Directory’ and select the working directory.

Fig 15: Setting up of Working Directory

Fig 16: Console View of setwd() Command

Step 10: Once set, we can verify the working directory with the command “getwd()”.
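
The working directory can also be set and checked from the console. The sketch below uses tempdir() purely as a stand-in folder that exists on every machine; in practice you would pass the path of your own project folder to setwd(). The variable name old_dir is hypothetical.

```r
old_dir <- getwd()            # remember the current working directory

# tempdir() is used as a stand-in; replace it with your own project folder.
setwd(tempdir())
print(getwd())                # confirm that the change took effect

setwd(old_dir)                # restore the original working directory
```

Restoring the original directory at the end keeps the example from affecting any later work in the same session.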

Fig 17: Console View of getwd() Command

Step 11: One can set the default working directory by going to Tools -> Global Options.
After the respective location is selected, click Apply and then click OK.

Fig 18: Setting up of Default Working Directory (Step I)

Fig 19: Setting up of Default Working Directory (Step II)

Fig 20: Setting up of Default Working Directory (Step III)

Step 12: Because R has a very large collection of packages, one typically needs to check
whether a specific package is installed before starting to work. This can be done with the
“require()” command in the console. If the package is not installed, it must be installed
using the command “install.packages()”, which can also be run from the console. For
example, install.packages("calendR").
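
The check-then-install pattern from Step 12 can be sketched as follows. The package name "stats" is used here only because it ships with R, so nothing is downloaded when the example runs; replace it with the package you need, for example "calendR". The variable name pkg is hypothetical.

```r
# require() returns TRUE or FALSE instead of stopping with an error,
# so it can be used as a test. "stats" ships with R; replace it with
# the package you need, for example "calendR".
pkg <- "stats"

if (!require(pkg, character.only = TRUE)) {
  install.packages(pkg)                  # download and install from CRAN
  library(pkg, character.only = TRUE)    # load it once installed
}
```

The argument character.only = TRUE tells require() and library() that pkg is a variable holding the package name rather than the name itself.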

Packages can also be installed by selecting “Tools” -> “Install Packages”. This opens a
prompt, as shown below, which can be used to download and install the required packages.

Fig 21: Package Installation

11. INTRODUCTION TO GOOGLE COLAB


Google Colaboratory (Colab for short) is an interactive programming environment offered by
Google. It provides a cloud-based virtual machine that runs Python code through an
interface similar to Jupyter.

To access Google Colab, go to https://colab.research.google.com/ . Please note that a
Google account (Gmail ID) is required to access the Google Colab environment.

Fig 22: Accessing Codes from Google Drive

These steps are needed only when accessing Google Colab for the first time. Once the first
notebook is created, you can access your code from Google Drive.

Fig 23: Creating a New Notebook

To create the first notebook, use the “New notebook” button on the screen, as highlighted
in Figure 23. In the drive, you can also see other files which are compatible with Google
Colab or the files you have created.

Once you click “New notebook”, you will be redirected to the page shown in Figure 24:

Fig 24

This is called an Interactive Notebook. Here you can write code, write descriptive
comments, and much more. The advantage of using Google Colab is that the code runs on
Google’s hardware and not on your own computer.

Once you have created the notebook, you can go to your Google Drive and find a folder named
“Colab Notebooks” as shown in Figure 25. This folder will hold all the notebooks that you
create later.

Fig 25

Open the folder and create a new folder named “RCourse” in which you will store all the
codes that you will be writing in this course.

Fig 26

Step 1: Right click to get the menu as shown in Figure 26.

Fig 27

Step 2: Write the folder name (RCourse) and click ‘Create’ as shown in Figure 27.

Let us explore the different options in the interactive notebook.

Fig 28

The boxes numbered 1 and 2 in Figure 28 highlight the words “code” and “text”. To create a
cell that executes code, click the “code” button; to create a text cell, click the “text”
button. We will use both code and text cells to document the code that we write. Documents
saved via Google Colab are in the so-called “.ipynb” format, which refers to an
Interactive Python Notebook.

12. SETTING UP R ON GOOGLE COLAB


Google Colab can run Python code directly. However, a default notebook cannot run R code
as such. To run R, use the empty notebook titled “EmptyRNotebook.ipynb”, which is provided
in the LMS. This empty notebook has specific instructions coded into it that enable the
Google Colab virtual machine to use the R interpreter instead of the default Python
interpreter. Each time you want to create and run R code, make a copy of this notebook,
rename it appropriately, and open the copy using Google Colab for further coding.

To open a file from Google Drive in Google Colab, right-click the file, hover over “Open
with”, and then choose Google Colaboratory (Figure 29).

Fig 29

After opening the file in Google Colab, you can start writing and executing R codes.
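
A first code cell along the following lines can help confirm that the notebook is really using the R interpreter. This is only a sketch; the vector marks is hypothetical data, and the operating system reported in the last line depends on the machine that runs the code.

```r
# A first cell to confirm that the notebook is running R, not Python.
print(R.version$language)         # "R" when the R interpreter is active

# A small computation to check that code executes.
marks <- c(72, 85, 90)            # hypothetical data
print(mean(marks))

# The operating system of the machine running the code
# (on Colab this is Google's virtual machine, not your computer).
print(Sys.info()[["sysname"]])
```

If the cell fails with a Python error instead, the notebook is still attached to the default Python interpreter and the R-enabled notebook should be opened instead.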

13. INSTALLING R-PACKAGES


To work with R code, we need a few packages that help us work more effectively. To install
a package in R, the syntax is:

install.packages("package_name")

While working with Google Colab, you can work with the packages that come pre-installed in
its virtual machines. To work with a package that is not pre-installed, for example the
ggplot2 package, you will have to install it by typing the following code in a code cell.

install.packages("ggplot2")

To execute the code cell, click the “play” button which you can see on the left-hand side of
the cell.

Once you execute the code, it will give the following output:

The path in the output is located inside the virtual machine where the package gets installed.

To use the installed package, use the “library()” function, for which the syntax is as follows:

library(package_name)

For example:

library("ggplot2")

Executing the above code loads the ggplot2 package, which allows us to use the built-in
functions of that package for plotting and visualisation.

Unlike in RStudio, the packages installed in Google Colab might be lost once the tab is
closed, because the virtual machine is shut down. This means that when you open the file
later, you will have to execute the “install.packages” code cell again to re-install a
specific package in the virtual machine. This is a major disadvantage of using Google Colab
when compared to RStudio.
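
One way to reduce this inconvenience is to keep a setup cell at the top of the notebook that installs only the packages that are missing and then loads them all. The sketch below is one possible approach, not a required one; the names needed and to_install are hypothetical, and base packages are listed so that nothing is downloaded when the example runs. In practice the list might contain "ggplot2" or other packages used in the notebook.

```r
# Packages this notebook needs (adjust the list for your own work).
needed <- c("stats", "utils")   # base packages used here so nothing is downloaded

# Install only those that are not already present in the virtual machine.
to_install <- setdiff(needed, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Load every package in the list.
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```

Running this cell first in every new session restores the required packages without re-installing those that are already available.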

14. SUMMARY
Here is a quick recap of what we have learnt so far:
• The R programming language evolved as a powerful, interactive, and intuitive tool for
analysing and visualising data.
• The “tidyverse” library in R has a powerful set of tools for data wrangling.
• The “ggplot2” library, which is a part of the “tidyverse” library, is a collection of state-
of-the-art visualisation tools.
• RStudio is a powerful IDE for running R codes locally or on the cloud.

15. SELF ASSESSMENT QUESTIONS


1. From which programming language did R evolve?
2. Name a popular IDE that is used for development in the R language.
3. Name the library in R that is used for data wrangling.
4. Name the library in R that is used for data visualisation.
5. R code can be directly executed in the Google Colab environment. True/False?

16. TERMINAL QUESTIONS


1. What is data wrangling? Name the package used for data wrangling in R and describe
some of its features.
2. Describe how R is used in the visualisation of data.
3. Name the major components of RStudio.

17. SAQ ANSWERS


1. S programming language
2. RStudio
3. tidyverse
4. ggplot2
5. False

18. TERMINAL QUESTION ANSWERS


1. Refer to Section 6.
2. Refer to Section 7.
3. Refer to Step 6 of Section 10.
