October 18-19, 2023
Date         Session  Time           Topic                          Instructor
18-Oct-2023  I        09:00 - 10:30  R - Introduction               Dr. Prasanna Samuel / Ms. Divya
                      10:30 - 11:00  Refreshment
                      11:00 - 12:30  Interactive Practical Session
                      12:30 - 13:30  Lunch
Contents
Chapter 1 R software
1. Introduction to R software
1.1. Evolution of R programming
1.2. The R environment
1.3. Why R programming?
1.4. Download and Install R software
2. RStudio software
2.1. What is RStudio?
2.2. Some of the features in RStudio software
2.3. Download and Install RStudio software
3. Basic R commands, data types, operators and data structures
3.1. Learning R
3.2. R commands
3.3. Types of data
3.4. Operators
3.5. Types of data structures
3.5.1. Vectors
3.5.2. Matrices
3.5.3. Arrays
3.5.4. Lists
3.5.5. Data frames
3.5.6. Factors
4. Import a CSV file
5. Display data and data structure
6. Important Points
3.1. Rename the variables in the data set
3.2. Adding variable labels
3.3. Recoding variables in the data set
3.4. Value labels
4. Basic column operations with dplyr
5. Basic row operations with dplyr
6. Data Cleaning
6.1. Find and remove the duplicates
6.2. Find and replace the missing values
6.3. Find and replace the outliers
7. The join() functions
8. Summary statistics
8.1. Creating summary statistics by using dplyr
9. Import a MS excel file (optional)
10. Reading data from different sources (optional)
Chapter 4 Statistical reports generation
1. Summary statistics for publication
2. The gtsummary package
2.1. Install gtsummary
2.2. Summary table
3. Statistical tests and outputs
3.1. Comparing two groups
3.2. Comparing more than 2 groups
3.3. Association between categorical groups
3.4. Summary statistics with statistical tests
4. Merge two or more gtsummary objects
5. Formatting the gtsummary table
6. Export gtsummary table
Challenges:
Day - 1 (Session - 1)
Day - 1 (Session - 2)
Day - 2 (Session - 1)
Day - 2 (Session - 2)
Chapter 1 R software
1. Introduction to R software
• R is a programming language and computing environment used for reporting, graphical representation,
and statistical analysis. Ross Ihaka and Robert Gentleman developed R at the University of
Auckland in New Zealand. R can be considered a different implementation of the S programming
language, which was developed in 1976 at Bell Laboratories. S later formed the basis of the commercial
S-PLUS system, which adds GUI (Graphical User Interface) features, though the fundamentals have not
changed over time.
• R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The
S language is often the vehicle of choice for research in statistical methodology, and R provides an
Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor
design choices in graphics, but the user retains full control.
R is available as Free (freedom to use, adapt, redistribute & improve) Software under the terms
of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs
on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and
macOS.
1.1. Evolution of R programming
• R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland, New Zealand. R made its first appearance in 1993.
• R can be regarded as an implementation of the S language which was developed at Bell Laboratories
by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-Plus systems.
• In June 1995, statistician Martin Mächler convinced Ihaka and Gentleman to make R free and
open-source under the GNU General Public License. Mailing lists for the R project began on 1 April 1997,
preceding the release of version 0.50. R officially became a GNU project on 5 December 1997, when
version 0.60 was released. The first official 1.0 version was released on 29 February 2000.
• A large group of individuals has contributed to R by sending code and bug reports. Since mid-1997
there has been a core group (the “R Core Team”) who can modify the R source code archive. The R
Development Core Team is currently responsible for maintaining this programming language.
• Comprehensive R Archive Network (CRAN) is a network of ftp and web servers around the world that
store identical, up-to-date, versions of code and documentation for R. We can download and install
packages and software from the CRAN platform.
• As of September 2023, CRAN hosts 19925 packages, including widely used tools for medical
research.
1.2. The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It
includes:
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals, loops,
user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an
incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis
software.
1.3. Why R programming?
• In contrast to much other statistical software, R is free and open-source, which makes it simple
to collaborate with other people, and its readable syntax helps to achieve reproducibility. We can
integrate R programming with other tools such as Python, MySQL, REDCap and Hadoop.
• R excels in data visualization with packages like ggplot2, which allows you to create highly customizable
and publication-quality graphics.
• R is specifically designed for statistical analysis. It provides a wide range of statistical tests, models,
and methods for data exploration, hypothesis testing, and modeling. Its rich statistical functions and
libraries make it a go-to tool for statisticians.
1.4. Download and Install R software
To download the most recent version of R, go to https://CRAN.R-project.org and
download the version of R that matches your operating system. On this website, choose
Download R for (Windows / macOS / Linux) -> Sub directories (base / install R for the first time).
The software’s interface looks like this when it is opened after installation (see Figure 1).
What is the console?
In programming, a console is a text-based interface that allows you to interact with the computer’s operating
system.
2. RStudio software
Even though R is widely used for statistical programming, it can be challenging to manage
different panels, scripts, variables, and objects in the basic R software window. We therefore turn to RStudio,
which provides a more convenient interface for R programming.
2.1. What is RStudio?
RStudio is an integrated development environment (IDE) specifically designed for working with the R
programming language. We can use R without RStudio, but RStudio provides various features that are quite
helpful when writing R syntax.
Figure 1: R software
2.2. Some of the features in RStudio software
• RStudio provides a user-friendly and intuitive interface for working with R. It includes a script editor,
a console, a workspace browser, and integrated plotting capabilities, making it easier to write, run,
and manage your R code.
• The code editor in RStudio offers syntax highlighting, code autocompletion, and error checking, which
helps you write code more efficiently and identify errors quickly. This can improve your productivity
and reduce coding mistakes.
• RStudio provides tools for managing your R workspace, including viewing loaded datasets, objects,
and functions. This helps you keep track of your data and variables during your analysis.
• RStudio supports Markdown and R Markdown, which are useful for creating dynamic and reproducible
reports and documents. You can combine code, text, and graphics in a single document, making it
easy to communicate your analysis results.
• RStudio has built-in project management features, allowing you to organize your work into separate
projects. Projects help keep your files, scripts, data, and settings organized, making it easier to
collaborate and maintain reproducibility.
The company behind RStudio has changed its name from RStudio to Posit; the software itself is still called
RStudio. Hadley Alexander Wickham is the chief scientist at Posit.
He developed many packages that make R easier to use.
2.3. Download and Install RStudio software
To download RStudio, go to https://posit.co/download/rstudio-desktop/ and download the version of RStudio
that matches your operating system.
Note: RStudio works only if R is installed. R and RStudio are installed separately, so ensure that both are installed.
Figure 2: RStudio software panel
3. Basic R commands, data types, operators and data structures
3.1. Learning R
• Learning R can be difficult and frustrating, and that is normal. The main aim is to learn how to use R
(its basic syntax), not to memorize functions.
3.2. R commands
Note: To execute R commands, select the code and click the Run button in the toolbar’s upper right corner.
Alternatively, select the code and press Ctrl + Enter to run the command.
# R as calculator
2 + 2 # Addition
> [1] 4
10 - 2 # Subtraction
> [1] 8
4 * 5 # Multiplication
> [1] 20
10 / 2 # Division
> [1] 5
10 ^ 2 # Powers
> [1] 100
# Order of operations
3+5*2
> [1] 13
(3+5)*2
> [1] 16
# Assignment with =
x = 5
x
> [1] 5
# R- way of assignment
x <- 5
x
> [1] 5
print(x)
> [1] 5
Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less than”) and ‘-’
(“minus”) occurring strictly side-by-side, and it ‘points’ to the object receiving the value of the expression.
In most contexts the ‘=’ operator can be used as an alternative.
3.3. Types of data
3.4. Operators
Operators are used to perform operations on variables and values. R operators are grouped into arithmetic
operators, comparison operators and logical operators.
Arithmetic operators
• The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power.
Operator  Description                                                   Example
==        Equal to                                                      a==b
!=        Not equal to                                                  a!=b
<         Less than                                                     a<b
<=        Less than or equal to                                         a<=b
>         Greater than                                                  a>b
>=        Greater than or equal to                                      a>=b
&         Element-wise logical AND: TRUE when both A and B are TRUE
|         Element-wise logical OR: TRUE when A or B is TRUE
::        Accesses a specific function from a specific package
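The operators in the table can be tried directly in the console. The following is a minimal sketch; the values of a, b and x are arbitrary and chosen only for illustration.

```r
# Illustrative values (not from the workshop data)
a <- 5
b <- 8

a == b   # FALSE: a is not equal to b
a != b   # TRUE
a < b    # TRUE: a is less than b
a >= b   # FALSE

# Element-wise logical operators work on whole vectors
x <- c(2, 7, 10)
x > 3 & x < 9    # FALSE TRUE FALSE
x < 3 | x > 9    # TRUE FALSE TRUE

# :: calls a function from a named package without loading the package
stats::median(x) # 7
```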
3.5. Types of data structures
• Vectors
• Arrays
• Matrices
• Lists
• Data frames
• Factors
3.5.1. Vectors
#Integer-valued vector (stored as numeric; add the L suffix, e.g. 35L, for true integers)
age <- c(35, 31, 29, 33, 28)
age
> [1] 35 31 29 33 28
#Numeric vector
weight <- c(71.7, 77.2, 53.5, 62.1, 29.2)
weight
#Character vector
name <- c("A", "B", "C", "D", "E")
gender <- c("Male","Female","Male","Male","Female")
Note: c() stands for combine. This function combines the values given as its arguments into a single
vector.
Vector Operations
• Vectors can be used in arithmetic expressions, in which case the operations are performed element by
element. The two vectors should therefore have the same length; if they do not, R recycles the
shorter vector and gives a warning when the longer length is not a multiple of the shorter one.
# Vector operations
age + 3
> [1] 38 34 32 36 31
age + c(5, 9, 11, 7) # Vectors of unequal length
> Warning in age + c(5, 9, 11, 7): longer object length is not a multiple of
> shorter object length
> [1] 40 40 40 40 33
• Vectors in R are 1-indexed, unlike C, Python and similar languages, where indexing starts from
0.
Example:
#Extracting elements in a vector
#We can extract an element using [], as in the following R command
bmi[2:3] # from 2 to 3
#Logical vector
# Vector which contains only TRUE, FALSE or NA.
bmi_cat <- bmi<30
bmi_cat
bmi[bmi_cat]
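The definition of bmi is not shown above, so the following self-contained sketch uses hypothetical BMI values (an assumption, for illustration only) to show 1-based indexing and subsetting with a logical vector.

```r
# Hypothetical BMI values, for illustration only
bmi <- c(22.4, 31.0, 27.8, 35.2, 19.6)

bmi[1]        # first element (indexing starts at 1)
bmi[2:3]      # elements 2 to 3
bmi[c(1, 5)]  # elements 1 and 5
bmi[-1]       # everything except the first element

# A logical vector keeps only the elements where it is TRUE
bmi_cat <- bmi < 30
bmi_cat       # TRUE FALSE TRUE FALSE TRUE
bmi[bmi_cat]  # 22.4 27.8 19.6
```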
• In addition, all of the common arithmetic functions are available: log, exp, sin, cos, tan, sqrt, and so
on all have their usual meaning.
• max and min select the largest and smallest elements of a vector respectively. range is a function whose
value is a vector of length two, namely c(min(x), max(x)).
• length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their
product.
min(age) #Smallest element in age
> [1] 28
max(age) #Largest element in age
> [1] 35
length(age) #Number of elements in age
> [1] 5
> [1] 31
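As a compact recap, the functions described above can be applied to the age vector defined earlier. This is a sketch for practice; it is not necessarily the exact sequence of commands used in the session.

```r
age <- c(35, 31, 29, 33, 28)

min(age)      # 28
max(age)      # 35
range(age)    # 28 35
length(age)   # 5
sum(age)      # 156
mean(age)     # 31.2
median(age)   # 31
sqrt(age[1])  # square root of the first element
```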
3.5.2. Matrices
• Matrices are R objects in which the elements are arranged in a two-dimensional rectangular layout.
They contain elements of the same data type. Matrices containing numeric elements are
used in mathematical calculations.
> Outcome == Yes Outcome == NO
> Treatment 10 40
> Control 7 43
## [1] 7
crosstab[2,]
#The whole row can be accessed if you specify a comma after the number in the bracket
crosstab[,2]
## Treatment Control
## 40 43
#The whole column can be accessed if we specify a comma before the number in the bracket
Matrix operations
Use rowSums() for row sums and colSums() for column sums.
Calculation of Relative Risk for the cross-tab
## Treatment Control
## 50 50
## [1] 10
n1 <- rtotal[1] # Extracting the number of participants in the treatment group
n1
## Treatment
## 50
## Treatment
## 0.2
## [1] 7
## Control
## 50
## Control
## 0.14
## Treatment
## 1.428571
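Several commands in this calculation are missing above, so here is one way the whole computation might be written. The counts are copied from the printed cross-tab; the object names n2, p1, p2 and rr are assumptions introduced for this sketch.

```r
# Rebuild the 2 x 2 table from the printed output
crosstab <- matrix(c(10, 40,
                     7, 43),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Treatment", "Control"),
                                   c("Outcome == Yes", "Outcome == NO")))

rtotal <- rowSums(crosstab)   # row totals: 50 and 50

n1 <- rtotal[1]               # number in the treatment group
p1 <- crosstab[1, 1] / n1     # risk in the treatment group: 0.2

n2 <- rtotal[2]               # number in the control group
p2 <- crosstab[2, 1] / n2     # risk in the control group: 0.14

rr <- p1 / p2                 # relative risk: about 1.43
rr
```

A relative risk above 1 means the outcome was more frequent in the treatment group than in the control group.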
3.5.3. Arrays
• Arrays are R data objects which can store data in more than two dimensions. For example, if we
create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3
columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim
parameter to create an array.
Example:
> , , 1
>
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 7
> [2,] 9 19
strattab
> , , confounder = M
>
> outcome
> group Y N
> I 11 22
> C 13 42
>
> , , confounder = F
>
> outcome
> group Y N
> I 12 7
> C 9 19
#For example,
strattab[,,2] # the second matrix
> outcome
> group Y N
> I 12 7
> C 9 19
> Y N
> 12 7
> [1] 12
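The command that created strattab is not shown above. A minimal sketch that reproduces the printed array, assuming the counts and dimension names are as displayed, could look like this:

```r
# Counts copied from the printed output; arrays are filled column by column
strattab <- array(c(11, 13, 22, 42,   # confounder = M
                    12, 9, 7, 19),    # confounder = F
                  dim = c(2, 2, 2),
                  dimnames = list(group = c("I", "C"),
                                  outcome = c("Y", "N"),
                                  confounder = c("M", "F")))

strattab[, , 2]    # the second matrix (confounder = F)
strattab[1, , 2]   # first row of the second matrix: Y = 12, N = 7
strattab[1, 1, 2]  # a single cell: 12
dim(strattab)      # 2 2 2
```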
3.5.4. Lists
There is no particular need for the components of a list to be of the same mode or type. For example, a list
could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function,
and so on. Here is a simple example of how to make a list:
Example:
> [[1]]
> [1] 11 22 13 42
>
> [[2]]
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> [[3]]
> , , 1
>
> [,1] [,2]
> [1,] 11 13
> [2,] 22 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 9
> [2,] 7 19
lst <- list(vector1,matrix1,array, coln, rown)
lst
> [[1]]
> [1] 11 22 13 42
>
> [[2]]
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> [[3]]
> , , 1
>
> [,1] [,2]
> [1,] 11 13
> [2,] 22 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 9
> [2,] 7 19
>
>
> [[4]]
> [1] "Outcome == Yes" "Outcome == NO"
>
> [[5]]
> [1] "Treatment" "Control"
#For example,
lst[[1]] #Extract first vector of the list
## [1] 11 22 13 42
lst[[1]][2] #Extract the 2nd element of the first vector in the list
## [1] 22
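The definitions of the list components are not shown above. The sketch below rebuilds them from the printed output; the name array1 (used here instead of array, to avoid masking the array() function) and the component names are assumptions.

```r
vector1 <- c(11, 22, 13, 42)
matrix1 <- matrix(vector1, nrow = 2, byrow = TRUE)
array1  <- array(c(11, 22, 13, 42, 12, 7, 9, 19), dim = c(2, 2, 2))
coln    <- c("Outcome == Yes", "Outcome == NO")
rown    <- c("Treatment", "Control")

# Naming the components is optional, but makes them easier to extract
lst <- list(vector1 = vector1, matrix1 = matrix1, array1 = array1,
            coln = coln, rown = rown)

lst[[1]]       # first component: 11 22 13 42
lst[[1]][2]    # second element of the first component: 22
lst$rown       # components can also be extracted by name
length(lst)    # 5
```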
3.5.5. Data frames
• Data frames are generic R data objects that are used to store tabular data. Data
frames can also be interpreted as matrices in which each column can be of a different data
type.
Example:
#For example,
data[,2] #Extract second entire column
## [1] 35 31 29 33 28
## [1] 35
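The command that created data is not shown above. One plausible sketch builds it from the vectors defined in the Vectors section; the column order is an assumption, chosen so that age is the second column as in the output.

```r
name   <- c("A", "B", "C", "D", "E")
age    <- c(35, 31, 29, 33, 28)
weight <- c(71.7, 77.2, 53.5, 62.1, 29.2)
gender <- c("Male", "Female", "Male", "Male", "Female")

data <- data.frame(name, age, weight, gender)

data[, 2]     # second column (age): 35 31 29 33 28
data[1, 2]    # first row, second column: 35
data$weight   # a column can also be extracted by name
str(data)     # structure: 5 observations of 4 variables
```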
3.5.6. Factors
• Factors are the data objects which are used to categorize the data and store it as levels. They can
store both strings and integers. They are useful in data analysis for statistical modeling. Factors are
created using the factor() function by taking a vector as input.
Example:
#Change the order of levels
factor(gender, levels = c("Male","Female"))
Note: When a variable is declared to be a factor, R orders the levels alphabetically by default. To order
the levels as we see fit, we must specify them manually with the levels argument.
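A short sketch of the difference between the default and a manually specified level order, using the gender vector defined earlier (the object names f_default and f_ordered are assumptions):

```r
gender <- c("Male", "Female", "Male", "Male", "Female")

f_default <- factor(gender)
levels(f_default)      # "Female" "Male" (alphabetical)

f_ordered <- factor(gender, levels = c("Male", "Female"))
levels(f_ordered)      # "Male" "Female"

table(f_ordered)       # counts follow the level order: Male 3, Female 2
as.integer(f_ordered)  # underlying integer codes: 1 2 1 1 2
```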
4. Import a CSV file
Thus far, we have created different R objects: vectors, lists, arrays, and data frames. Now, we will look at
how we can import an external data set saved as a comma-separated values (CSV) file. CSV files are plain-text
files that are easy, quick and efficient to use and take up less space than other formats.
The R command for importing a CSV file is read.csv(File path).
Example: read.csv("S:/Users/R-Workshop/Example_data.csv")
Note: The forward slash (/) should be used in the file path. On Windows, the backslash (\) is the default
character for file paths, but R treats it as an escape character, so we must change the separators manually.
The file name must include the .csv extension.
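To try read.csv() without access to the workshop files, the following sketch first writes a tiny CSV to a temporary file and then reads it back. The file contents and the commented Windows paths are hypothetical.

```r
# Write a small CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex",
             "1,35,Male",
             "2,31,Female"),
           csv_path)

example_data <- read.csv(csv_path)
example_data

# On Windows, either of these (hypothetical) path styles works;
# a single backslash does not
# read.csv("C:/folder/file.csv")
# read.csv("C:\\folder\\file.csv")
```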
5. Display data and data structure
> [1] 90 6
> 4 Illiterate Urban settings
> 5 Secondary education Rural settings
> 6 Basic education Urban settings
6. Important Points
• To get more information on any specific named function, for example mean, the command is
help(mean); an alternative is ?mean.
• We can set the working directory using setwd() function.
• For checking the current working directory use getwd()
Chapter 2 Data management
1. Packages in R
• Packages in R programming language are a set of R functions, compiled code, and sample data. These
are stored under a directory called “library” within the R environment.
• By default, R installs a group of packages during installation (base R). When we start the R console,
only these default packages are available.
• Other packages that are already installed need to be loaded explicitly before the R program
can use them.
For installing R Package from CRAN we need the name of the package and use the following command:
install.packages("package name")
Installing Package from CRAN is the most common and easiest way as we just have to use only one command.
In order to install more than one package at a time, we write them as a character vector in the
first argument of the install.packages() function:
Example:
install.packages(c("dplyr", "openxlsx"))
To check what packages are installed on your computer, type this command: installed.packages()
To update all the packages, type this command: update.packages()
To update a specific package, type this command: install.packages("PACKAGE NAME")
To remove a specific package, type this command: remove.packages("PACKAGE NAME")
Installing Packages Using RStudio user interface: In RStudio go to Tools -> Install Packages, and
a pop-up window appears in which we type the name of the package we want to install.
Installing a package is a one-time task, but you have to load it with library() every time you begin a new
session before you can use its features. If you only need to occasionally use specific functions or data from
a package, you can access them without loading it by using the package::function notation.
library("PACKAGE NAME")
Example: library(dplyr)
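The install, load and package::function steps can be sketched together as follows. The stats package is used here only because it ships with R, so the example runs without an internet connection; in practice pkg would be a CRAN package such as "dplyr".

```r
# requireNamespace() returns TRUE if the package is already installed
pkg <- "stats"   # stats ships with R; replace with e.g. "dplyr"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

library(stats)                 # load the package for this session
sd(c(2, 4, 6))                 # functions can now be called directly

stats::median(c(2, 4, 6))      # package::function works without library()
```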
Suppose a serological survey was conducted as part of a formal study to investigate dengue IgG, NS1
marker, and hemoglobin levels across different demographic characteristics. We will be conducting several
data cleaning procedures, visualizations, and preparing tables with statistical tests using this dataset. In the
session, we will be experimenting with the following variables.
• Participant.ID
• Sex at birth
• Age (in years)
• Occupation
• Education level
• Locality
• Hemoglobin
• Dengue IgG
• Dengue IgG results
• NS1 bio-marker
#Example
#Demographic data
dem_data <- read.csv("S:/CMC Dept. of Biostatistics/R Workshop/Example_data_1.csv")
#Lab data
lab_data <- read.csv("S:/CMC Dept. of Biostatistics/R Workshop/Example_data_2.csv")
2. dplyr Package
• dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the
most common data manipulation challenges.
• It is a set of tools for a common set of problems:
%>% is a special operator in R found in the magrittr and dplyr packages. %>% lets us pass objects to
functions elegantly. This pipe operator helps us to make our code more readable.
The RStudio keyboard shortcut for the pipe operator is Ctrl + Shift + M.
Note: When we start using more than two functions for the data cleaning and processing, the role of the
pipe operator will become more important.
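To see why the pipe helps once several functions are chained, compare the nested and piped versions of the same steps. This is a sketch on a small hypothetical data frame (df is not part of the workshop data) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- data.frame(id = 1:4, age = c(36, 27, 34, 40))

# Without the pipe: nested calls are read from the inside out
head(arrange(filter(df, age > 30), age), 2)

# With the pipe: the same steps are read from top to bottom
result <- df %>%
  filter(age > 30) %>%   # keep rows with age above 30
  arrange(age) %>%       # sort by age
  head(2)                # keep the first two rows
result
```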
3. Change the variable and values’ names and labels
3.1. Rename the variables in the data set
The variable names are often quite lengthy and challenging to use for subsequent commands. Therefore, we
can rename these variables for easier and more convenient use in further analyses.
The syntax for renaming variables is rename(NEW VARIABLE NAME = OLD VARIABLE NAME).
#Example
dem_data1 <- dem_data %>% #Store the modified data in another data frame
rename(age=Age..in.years.,
gender=Sex.at.birth,
ocp=Occupation,
edu=Education.Level,
loc=Locality)
3.2. Adding variable labels
• Variable labels provide a detailed and comprehensive description of each variable, making it easier to
understand their meanings and purposes in the dataset.
• Using descriptive variable labels enhances the clarity and interpretability of the data, facilitating more
effective analysis and communication of results.
• With this description it is easier to remember what those variable names refer to.
The R labelled package can be used to label variables. Load the package after installation.
R syntax is set_variable_labels(data, list(labels))
#Example
dem_data1 <- dem_data1 %>% set_variable_labels(Participant.ID="Unique ID",
gender="Gender",
age ="Age (in years)",
ocp="Occupational status",
edu="Educational status",
loc="Locality")
dem_data1 %>% var_label()
> $Participant.ID
> [1] "Unique ID"
>
> $gender
> [1] "Gender"
>
> $age
> [1] "Age (in years)"
>
> $ocp
> [1] "Occupational status"
>
> $edu
> [1] "Educational status"
>
> $loc
> [1] "Locality"
3.3. Recoding variables in the data set
Recoding refers to the process of transforming or reassigning values of variables within a dataset. Recoding
can involve converting numerical values into categories, aggregating data into different groups, or changing
the scale of measurement.
#Data recode
#Re-coding an existing variable with the help of the recode() function.
#Re-code syntax is, recode(VAR_NAME, Value="New assign")
> 2 Rural settings 2
> 3 Rural settings 3
> 4 Urban settings 3
> 5 Rural settings 3
> 6 Urban settings 3
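The recoding commands themselves are not shown above. The sketch below illustrates the recode() syntax on hypothetical data; using cut() to build the age_group codes is an assumption for illustration, not necessarily how the workshop data were recoded.

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- data.frame(gender = c("Male", "Female", "Female", "Male"),
                 age    = c(8, 27, 34, 15))

df2 <- df %>%
  mutate(
    # recode(): old value on the left, new value on the right
    gender = recode(gender, "Male" = 1, "Female" = 2),
    # cut(): convert a numeric variable into coded categories
    age_group = cut(age,
                    breaks = c(0, 10, 20, 30, Inf),
                    labels = FALSE) - 1   # codes 0, 1, 2, 3
  )
df2
```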
3.4. Value labels
• Value labels provide descriptive meanings or interpretations for the numeric or coded values within a
variable. By assigning value labels such as “Extremely poor” to 1 and “Excellent” to 7, it eliminates
the need to remember the numeric representations of these categories.
• This practice enhances the readability and interpretability of the dataset, making it easier for
researchers and analysts to understand the meaning behind the coded values without having to reference
a separate codebook or manual.
set_value_labels() could also be used to modify all the value labels attached to a vector.
#Example
dem_data3 <- dem_data2 %>%
set_value_labels(gender=c(Male=1,Female=2),
age_group=c("00-10"=0,
"11-20"=1,
"21-30"=2,
"30+"=3))
> $Participant.ID
> NULL
>
> $gender
> Male Female
> 1 2
>
> $age
> NULL
>
> $ocp
> NULL
>
> $edu
> NULL
>
> $loc
> NULL
>
> $age_group
> 00-10 11-20 21-30 30+
> 0 1 2 3
The functions described above are essential for providing context and understanding of the variables
in the dataset. They form a part of dataset documentation and collectively enrich its metadata.
4. Basic column operations with dplyr
The select() operation allows you to choose and extract columns (or variables) of interest from the dataset.
5. Basic row operations with dplyr
The filter() operation allows you to choose and extract rows of interest from the dataset.
> 2 2
> 3 3
> 4 3
> 5 1
> 6 2
> 3 N000168053 1 51 Shop keeper Basic education Urban settings
> 4 N000168054 1 11 Shop keeper Basic education Urban settings
> age_group
> 1 3
> 2 3
> 3 3
> 4 1
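Since the select() and filter() commands are not shown above, here is a minimal sketch of both operations on a small hypothetical data frame (the data and the name urban_adults are assumptions).

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- data.frame(id     = c("P1", "P2", "P3", "P4"),
                 gender = c(1, 2, 2, 1),
                 age    = c(36, 27, 34, 11),
                 loc    = c("Rural", "Rural", "Urban", "Urban"))

# Column operation: keep only id and age
df %>% select(id, age)

# Column operation: drop a column with a minus sign
df %>% select(-loc)

# Row operation: keep rows that satisfy all the conditions
urban_adults <- df %>% filter(loc == "Urban", age >= 18)
urban_adults
```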
In this case, we performed certain row and column actions. In order to do computations on
already-existing data, data cleansing is necessary.
6. Data Cleaning
In RStudio, after importing the dengue IgG data from a CSV file into a data frame, you can use the View()
function to display the contents of the data frame in a new panel.
There are two separate CSV files. Demographic data is on the left, and lab IgG results are on the right. A
few of the duplicates, outliers, and missing data points are shown in the figure above. We must therefore
clean this dataset before we start our analysis.
6.1. Find and remove the duplicates
• Identifying and removing duplicate rows from a dataset is crucial to maintain data accuracy and prevent
redundancy. Duplicate values can distort statistical analyses and lead to incorrect conclusions.
• In R, you can use the dplyr package to easily identify and remove duplicate rows from a data frame.
• We can use the duplicated() function to find which values in a vector are duplicates; wrapping it in sum() counts them.
#Run the R command below if you also want to print the entire sets of duplicates.
dem_data3 %>%
filter(duplicated(Participant.ID) |
duplicated(Participant.ID, fromLast = TRUE))
> 3 N000168020 1 23 Government employee Illiterate Urban settings
> 4 N000168020 1 23 Government employee Illiterate Urban settings
> 5 N000168020 1 23 Government employee Illiterate Urban settings
> 6 N000168039 2 46 Daily wages Basic education Urban settings
> 7 N000168039 2 46 Daily wages Basic education Urban settings
> age_group
> 1 3
> 2 3
> 3 2
> 4 2
> 5 2
> 6 3
> 7 3
#Remove duplicates
#To remove duplicates we can use distinct() function.
> [1] 86 7
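The distinct() command is not shown above. A minimal sketch on hypothetical data, assuming duplicates are identified by Participant.ID, might be:

```r
library(dplyr)

# Hypothetical data with one repeated participant
df <- data.frame(Participant.ID = c("N01", "N02", "N02", "N03"),
                 age            = c(36, 27, 27, 34))

sum(duplicated(df$Participant.ID))   # number of repeated IDs: 1

# distinct() keeps the first occurrence of each ID;
# .keep_all = TRUE retains the other columns
df_clean <- df %>% distinct(Participant.ID, .keep_all = TRUE)
dim(df_clean)                        # 3 rows, 2 columns
```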
6.2. Find and replace the missing values
• A common task in data analysis is dealing with missing values. In R, missing values are often represented
by NA.
• is.na() will work on vectors, lists, matrices, and data frames.
The function is.na() returns a logical value (TRUE or FALSE) for every element of a variable. Combining it
with the colSums() function gives the total number of missing values for each variable.
lab_data %>%
is.na() %>% #Logical check for missing or not
colSums() #Each column total (i.e., each variable count)
#Replace missing value with mean/median
#Use replace() function to replace the missing values.
#Syntax is replace(data, list, replace with)
lab_data1 <- lab_data %>%
mutate(hem=replace(hem,
is.na(hem),
median(hem, na.rm = TRUE)))
#Verification
lab_data1 %>%
is.na() %>%
colSums()
6.3. Find and replace the outliers
• During data analysis, one of the most crucial steps is to identify and account for outliers:
observations that differ essentially in nature from most other observations. Their presence can
lead to untrustworthy conclusions.
• To detect and remove outliers from a data frame, we use the reference range. Let us suppose we have
outliers in the Dengue IgG values. Dengue IgG has a reference range like the one in the table below.
• If we have the source document for the values, we may cross-check the errors and fix them; otherwise,
we can use the replace() function to change those values.
lab_data1 %>%
filter(d_igg<0 | d_igg>5)
> Participant.ID hem d_igg igg_cat ns1
> 1 N000168003 7.2 44.09 Positive Absent
> 2 N000168018 5.9 200.82 Equivocal Present
> 3 N000168037 4.8 333.25 Positive Present
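One way to handle such values with replace() is sketched below on hypothetical data. The 0 to 5 range is taken from the filter condition above; setting out-of-range values to NA is an assumption, and a corrected value from the source document should be preferred when it is available.

```r
library(dplyr)

# Hypothetical lab values; 44.09 lies outside the 0 to 5 reference range
lab <- data.frame(Participant.ID = c("N01", "N02", "N03"),
                  d_igg          = c(2.51, 44.09, 1.57))

lab_fixed <- lab %>%
  mutate(d_igg = replace(d_igg,
                         d_igg < 0 | d_igg > 5,   # which values to replace
                         NA))                     # what to replace them with
lab_fixed
```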
7. The join() functions
Cleaning has been completed. We now begin working with charts. But first, we must build a single data
frame with all the variables in it (demographic and lab data). So, let us use a join function to merge the two
tables.
• Join functions add the columns from the second data frame to the first data frame, matching the observations
based on the keys.
• There are various join functions available: inner_join(), left_join(),
right_join() and full_join(). Most often we use inner_join().
> Participant.ID gender age ocp edu
> 1 N000168001 1 36 Daily wages Higher or University
> 2 N000168002 2 27 Daily wages Basic education
> 3 N000168003 2 34 Daily wages Illiterate
> 4 N000168004 1 40 Government employee Illiterate
> 5 N000168005 2 35 Daily wages Secondary education
> 6 N000168006 1 50 Shop keeper Basic education
> loc age_group hem d_igg igg_cat ns1
> 1 Rural settings 3 8.5 2.51 Equivocal Absent
> 2 Rural settings 2 12.1 1.57 Negative Present
> 3 Rural settings 3 7.2 2.41 Positive Absent
> 4 Urban settings 3 6.0 2.92 Positive Present
> 5 Rural settings 3 5.3 2.42 Equivocal Present
> 6 Urban settings 3 5.4 3.51 Positive Present
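The join command is not shown above. The following sketch contrasts the four join functions on two small hypothetical tables that share the key Participant.ID.

```r
library(dplyr)

# Hypothetical tables sharing the key Participant.ID
dem <- data.frame(Participant.ID = c("N01", "N02", "N03"),
                  age            = c(36, 27, 34))
lab <- data.frame(Participant.ID = c("N02", "N03", "N04"),
                  hem            = c(12.1, 7.2, 6.0))

inner_join(dem, lab, by = "Participant.ID")  # only IDs in both tables (2 rows)
left_join(dem, lab, by = "Participant.ID")   # all rows of dem (3 rows)
right_join(dem, lab, by = "Participant.ID")  # all rows of lab (3 rows)
full_join(dem, lab, by = "Participant.ID")   # all IDs from either table (4 rows)
```

With inner_join(), participants that appear in only one of the tables are dropped, so it is worth checking the row count after the join.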
8. Summary statistics
We can use the summary() function to see the summary statistics of all variables. For numeric vectors,
the summary displays the minimum, 25% quartile, median, mean, 75% quartile and maximum, along with the
number of missing values.
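A minimal sketch of summary() on a small data frame is shown below. The toy values are assumptions for illustration and are not taken from the workshop dataset.

```r
# Hypothetical toy data with one missing value
toy_data <- data.frame(
  age = c(36, 27, 34, 40, NA),
  loc = c("Rural", "Rural", "Urban", "Urban", "Rural")
)

# Summary of every column; numeric columns also report the NA count
summary(toy_data)

# The same summary for a single numeric vector
age_summary <- summary(toy_data$age)
age_summary
```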
data_dengu_igg %>%
group_by(loc) %>% #Group the locality category
summarise(N=n()) #Count based on locality category
## # A tibble: 2 x 2
## loc N
## <chr> <int>
## 1 Rural settings 35
## 2 Urban settings 40
The summarise() function creates a new column N containing the count for each location.
Similarly, we can include the mean, the median, and the 25% and 75% percentiles of the data.
data_dengu_igg %>%
  group_by(loc) %>%
  summarise(N=n(),
            Mean=mean(d_igg, na.rm=TRUE),
            Median=median(d_igg, na.rm=TRUE),
            Q1=quantile(d_igg, probs=0.25, na.rm=TRUE), #na.rm avoids an error if values are missing
            Q3=quantile(d_igg, probs=0.75, na.rm=TRUE))
## # A tibble: 2 x 6
## loc N Mean Median Q1 Q3
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Rural settings 35 2.38 2.41 1.64 3.31
## 2 Urban settings 40 2.57 2.41 1.97 3.28
Base R has no built-in function for importing an Excel file. To import one, we must install and load the
openxlsx package. The R command for importing a .xlsx file is,
data <- read.xlsx("PATH/FILE NAME.xlsx", sheet = "SHEET NAME")
• In R, we can import data from different sources. SPSS, SAS and STATA files can be imported
using the {haven} package. Install and load the package.
• Import SPSS data using read_spss(). Use as_factor() to convert the data value labels, so that the data
reflect the label values.
• Import SAS data using read_sas(), and use read_dta() to read STATA data.
• One of the helpful packages for extracting tables from HTML is the {rvest} package.
• To extract tables from a PDF file, use the {tabulizer} package.
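The SPSS import described above can be sketched as follows. Because no SPSS file ships with this text, the example first writes a small hypothetical file to a temporary location and then reads it back; the data, the labels and the file path are all assumptions for illustration.

```r
library(haven)

# Hypothetical data with a labelled variable (1 = Male, 2 = Female)
toy_data <- data.frame(id = c("P001", "P002", "P003"))
toy_data$gender <- labelled(c(1, 2, 2), labels = c(Male = 1, Female = 2))

# Write a temporary SPSS file so the example is self-contained
sav_path <- tempfile(fileext = ".sav")
write_sav(toy_data, sav_path)

# Import the file; gender comes back as labelled numeric codes
spss_data <- read_sav(sav_path)

# as_factor() converts the codes to their value labels
spss_labelled <- as_factor(spss_data)
spss_labelled$gender
```

read_sas() and read_dta() are called the same way, with the path to a SAS or STATA file.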
Chapter 3 Data Visualization
Data visualization is the representation of information and data in a pictorial or graphical format
(for example, charts, graphs, and maps). Data visualization tools provide an accessible way to see and
understand trends, patterns, and outliers in data.
• Data can be represented graphically in a variety of ways, including histograms, line charts, pie charts,
scatter plots, and more.
• Typically, a distribution can be described as normal, right-skewed, left-skewed,
or uniform.
1. The ggplot2
Data visualization is implemented in R using the ggplot2 package written by Hadley Wickham. We will use this
package to create various plots, understand the ggplot2 mechanics, and develop quality data visualizations.
Install the ggplot2 package using the function install.packages() and load it using
library().
1.2. Work with ggplot2
• With ggplot2, the function ggplot() defines a plot object to which layers can be added.
• The first argument of ggplot() is the dataset to use in the graph, so ggplot(data = dataset) creates
an empty graph that is primed to display the given data. Since we have not yet told it how to visualize
the data, the graph is empty for now.
• Essentially, we develop plots by layering elements on top of each other, in the order data, aesthetics,
geoms, facets, themes.
[Figure: empty plot showing only the x and y axes]
• The graph above shows where x will be displayed (on the x-axis) and where y will be displayed (on
the y-axis), but no observations appear. This is because we have not yet specified, in our code, how to
represent the observations from our data frame on the plot.
• To do so, we need to define a geom: the geometrical object that a plot uses to represent data. These
geometric objects are made available in ggplot2 through functions that start with geom_. For example,
bar charts use bar geoms (geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot
geoms (geom_boxplot()), scatterplots use point geoms (geom_point()), and so on.
• A few questions of interest that we can explore using visualizations:
What is the frequency of IgG positivity?
What is the distribution of IgG values?
Does the frequency of IgG positivity differ by occupation?
Does the distribution of IgG values differ by location?
What is the relationship between age and IgG values? And does it vary by occupation?
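The layering idea described above can be sketched as follows. The toy data frame is an assumption for illustration; the column names age and d_igg mirror the workshop dataset.

```r
library(ggplot2)

# Hypothetical toy data
toy_data <- data.frame(age = c(20, 30, 40, 50), d_igg = c(1.8, 2.4, 2.9, 3.3))

# Data + aesthetics only: the axes are drawn, but no observations
p_empty <- ggplot(data = toy_data, mapping = aes(x = age, y = d_igg))

# Adding a geom layer tells ggplot2 how to draw the observations
p_points <- p_empty + geom_point()

p_empty
p_points
```

Storing the base plot in an object makes it easy to try different geoms on the same data and aesthetics.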
A variable is categorical if it can only take one of a small set of distinct values. To examine the distribution
of a categorical variable, we can use a bar chart. The height of the bars displays how many observations
occurred with each x value.
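A minimal sketch of such a bar chart is given below, assuming a small hypothetical data frame with an igg_cat column. coord_flip() is one way to draw the same bars horizontally.

```r
library(ggplot2)

# Hypothetical toy data; igg_cat mirrors the workshop variable name
toy_data <- data.frame(
  igg_cat = c("Positive", "Negative", "Positive", "Equivocal", "Positive", "Equivocal")
)

# Vertical bar chart: geom_bar() counts the rows in each category
p_bar <- ggplot(toy_data, aes(x = igg_cat)) +
  geom_bar()

# Horizontal version of the same chart
p_bar_h <- p_bar + coord_flip()

p_bar
p_bar_h
```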
[Figure: bar chart of count by igg_cat (Equivocal, Negative, Positive)]
• Based on the above graph, the Negative category has fewer observations than both the Positive and Equivocal categories.
[Figure: horizontal bar chart of count by igg_cat (Equivocal, Negative, Positive)]
2.2. A numerical variable
A variable is continuous if it can take any value within its range. One commonly used visualization for distributions
of continuous variables is a boxplot. We will plot the dengue IgG values as a boxplot.
The box plot is one of the most easily understood plots. It shows the 25th percentile, the median, and the
75th percentile, and it typically allows us to identify outliers.
#Box plot
ggplot(data_dengu_igg, aes(y=d_igg))+ # For vertical view
geom_boxplot()
[Figure: box plot of d_igg]
#Density plot
ggplot(data_dengu_igg, aes(x = d_igg)) +
geom_density()
[Figure: density plot of d_igg (x-axis: d_igg, y-axis: density)]
3.1. Bar charts (Two categorical variables)
If both variables are categorical, we can develop various bar plots. Let’s generate a stacked bar chart,
a grouped bar chart, and a segmented bar chart for the clean_data dataset’s dengue IgG category and
occupational status.
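The three bar chart variants differ only in the position argument of geom_bar(). The sketch below illustrates this on a small hypothetical data frame; the column names igg_cat and ocp mirror the workshop dataset, and the values are made up.

```r
library(ggplot2)

# Hypothetical toy data
toy_data <- data.frame(
  igg_cat = c("Positive", "Positive", "Negative", "Equivocal", "Equivocal", "Negative"),
  ocp     = c("Farmer", "Daily wages", "Farmer", "Farmer", "Daily wages", "Daily wages")
)

# Shared data and aesthetics: one bar per IgG category, filled by occupation
base_plot <- ggplot(toy_data, aes(x = igg_cat, fill = ocp))

# Stacked: occupations are placed on top of each other (counts)
p_stacked <- base_plot + geom_bar(position = "stack")

# Grouped: occupations are placed side by side (counts)
p_grouped <- base_plot + geom_bar(position = "dodge")

# Segmented: each bar is rescaled to 1, showing proportions
p_segmented <- base_plot + geom_bar(position = "fill") + labs(y = "Proportion")

p_stacked
p_grouped
p_segmented
```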
[Figure: stacked bar chart of count by igg_cat, filled by ocp (Daily wages, Farmer, Government employee, None employed, Shop keeper)]
[Figure: grouped bar chart of count by igg_cat, filled by ocp]
[Figure: segmented bar chart of proportion by igg_cat, filled by ocp]
• A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers
(vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are
plotted as points representing outliers.
• One of the advantages of boxplots is that their widths are not usually meaningful. This allows us to
compare the distribution of many groups in a single graph.
ggplot(data_dengu_igg,
aes(x = ocp,
y = d_igg)) +
geom_boxplot()
[Figure: box plots of d_igg by ocp]
The box plot shows that the median is close to the 25th percentile, which means that the quarter of
observations lying between these two values is tightly clustered. As always, 50% of the data lie at or below the median.
For two-dimensional numerical data, the method geom_point() adds a layer of points to plot and creates a
scatterplot. Let’s examine the correlation between dengue IgG and age using the provided data.
ggplot(data = data_dengu_igg,
mapping = aes(x=age,
y=d_igg))+
geom_point()
[Figure: scatter plot of d_igg against age]
According to this graph, there is a positive association between dengue IgG and age.
Including a line in the scatter plot
ggplot(data_dengu_igg,
aes(x=age,
y=d_igg)) +
geom_point()+
geom_smooth(method = "lm")
[Figure: scatter plot of d_igg against age with a fitted line]
This graph is similar to the previous scatter plot, with a fitted line added. The trend is increasing;
geom_smooth() with method = "lm" fits a linear model and draws the line of best fit.
4. Multivariate graphs
We can include a categorical variable as a grouping variable in bivariate graphs. Let's start with the
scatter plot of age and dengue IgG values plotted earlier, and add location as the grouping variable.
#Scatter plot with group
ggplot(data = data_dengu_igg,
mapping = aes(x=age,
y=d_igg,
color=loc))+
geom_point()
[Figure: scatter plot of d_igg against age, coloured by loc (Rural settings, Urban settings)]
When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of
the aesthetic to each unique level of the variable (each of the location group), a process known as scaling.
ggplot2 will also add a legend that explains which values correspond to which levels.
4.2. Faceting
Another multivariate graph technique is faceting. In faceting, a graph consists of several separate plots or
small multiples, one for each level of a third variable, or combination of variables.
For example, let’s take a bar graph of dengue IgG results category and occupation from our dataset and
add locality as a third variable. facet_wrap() can be used to divide the graph into smaller panels in order to
achieve this.
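A minimal sketch of faceting, using a hypothetical toy data frame whose column names (igg_cat, ocp, loc) mirror the workshop dataset:

```r
library(ggplot2)

# Hypothetical toy data
toy_data <- data.frame(
  igg_cat = c("Positive", "Negative", "Positive", "Equivocal"),
  ocp     = c("Farmer", "Farmer", "Daily wages", "Daily wages"),
  loc     = c("Rural settings", "Rural settings", "Urban settings", "Urban settings")
)

# One panel per locality; each panel is a grouped bar chart
p_facet <- ggplot(toy_data, aes(x = igg_cat, fill = ocp)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ loc)

p_facet
```

facet_wrap(~ loc, ncol = 1) stacks the panels vertically instead of placing them side by side.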
[Figure: bar chart of count with ocp legend, in panels for Rural settings and Urban settings]
[Figure: plot of d_igg, in panels for Rural settings and Urban settings]
Clear labelling and annotations: Instead of using variable names for the x and y axes when plotting the
graph, we should give the axes meaningful names. We should also give the graph a title and adjust the colour,
shape, and range of the axes where needed.
To enhance colour and labelling, we can use additional options, some of which appear in the chart below:
• geom_text adds value labels to the graph
• xlim and ylim set the limits of the axes
• scale_y_continuous modifies the y-axis tick mark labels
• labs provides a title and changes the labels for the x and y axes and the legend
• scale_fill_brewer changes the fill colour scheme
• theme_minimal removes the grey background and changes the grid colour
• theme adjusts the position of the legend and modifies the font style of the graph
ggplot(data_dengu_igg,
       aes(x = loc,
           fill = igg_cat)) +
  geom_bar(position = "fill") + #Segmented (100% stacked) bar chart
  labs(y = "Proportion",
       title = "Dengue IgG results category by locality") +
  #Label for y-axis; and title for the graph
  theme_minimal() #Background theme without gray color.
[Figure: segmented bar chart of proportion by loc (Rural settings, Urban settings), filled by igg_cat (Equivocal, Negative, Positive)]
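Several of the other options listed earlier (geom_text, scale_y_continuous, scale_fill_brewer, theme) can be sketched in the same way. The toy data, the "Set2" palette, the axis breaks and the label texts below are assumptions for illustration, not the settings used in the workshop.

```r
library(ggplot2)

# Hypothetical toy data
toy_data <- data.frame(
  loc     = c("Rural settings", "Rural settings", "Rural settings",
              "Urban settings", "Urban settings"),
  igg_cat = c("Positive", "Negative", "Positive", "Positive", "Equivocal")
)

p_custom <- ggplot(toy_data, aes(x = loc, fill = igg_cat)) +
  geom_bar(position = "dodge") +
  # Value labels above each bar, computed from the bar counts
  geom_text(stat = "count", aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  # Control the y-axis tick marks and range
  scale_y_continuous(breaks = 0:3, limits = c(0, 3)) +
  # Change the fill colour scheme
  scale_fill_brewer(palette = "Set2") +
  # Meaningful names for the axes and the legend
  labs(x = "Locality", y = "Number of participants", fill = "Dengue IgG result") +
  theme_minimal() +
  # Move the legend below the plot
  theme(legend.position = "bottom")

p_custom
```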
We can distinguish the points by their shapes as well as their colour. Using a shape aesthetic for the
location grouping gives a clearer representation of the data.
ggplot(data = data_dengu_igg,
mapping = aes(x=age,
y=d_igg,
color=loc, #Color of the group
shape=loc))+ #Provide the shape of the group
geom_point()
[Figure: scatter plot of d_igg against age, with colour and shape by loc (Rural settings, Urban settings)]
We can control the limits of x-axis and y-axis by using xlim and ylim.
ggplot(data = data_dengu_igg,
       mapping = aes(x=age,
                     y=d_igg,
                     color=loc,
                     shape=loc))+
  geom_point()+
  xlim(c(20,45))+ #X-axis (Age) value limits between 20 to 45
  ylim(c(0,5)) #Y-axis (Dengue IgG) value limits between 0 to 5
[Figure: scatter plot of d_igg against age by loc, with age limited to 20-45 and d_igg limited to 0-5]
[Figure: plot of d_igg against age by loc (Rural settings, Urban settings)]
6. Saving graphs
Graphs in RStudio are simple to export. The Export option sits above the plot in the Plots panel in the
lower right corner (Plots panel -> Export -> Save as Image or Save as PDF). It lets us
export in the format that we need; the recommended format for exporting is .jpeg.
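Graphs can also be saved from code with ggplot2's ggsave() function, as an alternative to the Export menu. The sketch below is an illustration under assumptions: the toy data, the file name and the plot size are hypothetical, and a PDF in a temporary folder is used only so that the example runs anywhere.

```r
library(ggplot2)

# Hypothetical toy data and plot
toy_data <- data.frame(age = c(20, 30, 40, 50), d_igg = c(1.8, 2.4, 2.9, 3.3))
p <- ggplot(toy_data, aes(x = age, y = d_igg)) +
  geom_point()

# Hypothetical output location; replace with a path in your project folder
out_file <- file.path(tempdir(), "scatter_age_igg.pdf")

# Width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```

ggsave() picks the output format from the file extension, so changing the name to end in .jpeg or .png saves an image instead; the dpi argument sets the resolution for image formats.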
Chapter 4 Statistical reports generation
R offers several packages that make it efficient to generate publication-ready tables with demographic
information and statistical test results. These packages allow you to create visually appealing and customizable
tables directly from your R code.
The {gtsummary} package provides an elegant and flexible way to create publication-ready analytical and
summary tables. The {gtsummary} package summarizes data sets, regression models, and more, using
sensible defaults with highly customizable capabilities.
data_dengu_igg %>%
tbl_summary(include = c(age_group, gender, loc, ocp, igg_cat))
Characteristic              N = 75
Age group
    00-10                   0 (0%)
    11-20                   4 (5.3%)
    21-30                   25 (33%)
    30+                     46 (61%)
Sex at birth
    Male                    31 (41%)
    Female                  44 (59%)
Locality
    Rural settings          35 (47%)
    Urban settings          40 (53%)
Occupational status
    Daily wages             23 (31%)
    Farmer                  16 (21%)
    Government employee     20 (27%)
    None employed           6 (8.0%)
    Shop keeper             10 (13%)
Dengue IgG results
    Equivocal               31 (41%)
    Negative                15 (20%)
    Positive                29 (39%)
Let us obtain a summary table for age, gender, location and occupation by dengue IgG levels. We can then
add customization options, such as hiding the number of missing observations (missing = "no") and adding
the overall count to the grouped table (add_n()).
data_dengu_igg %>%
tbl_summary(by = igg_cat,
include = c(age_group, gender, loc, ocp),
missing = "no") %>%
add_n() #Adding the overall count to the table
Note: By default, gtsummary treats Yes/No and True/False variables as dichotomous and therefore presents
only one row (Yes or No) in the output table. To show both levels in the output, we need to modify
the type argument so that the variables are treated as categorical and not dichotomous.
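The note above can be illustrated with a minimal sketch. The data frame and the variable fever are hypothetical; the type argument of tbl_summary() is used to request the "categorical" summary type.

```r
library(gtsummary)

# Hypothetical toy data with a Yes/No variable
toy_data <- data.frame(
  fever = c("Yes", "No", "Yes", "Yes", "No"),
  loc   = c("Rural", "Urban", "Rural", "Urban", "Urban")
)

# Default behaviour: a Yes/No variable is summarised as dichotomous
tbl_default <- tbl_summary(toy_data, include = fever)

# Forcing "categorical" shows both the Yes and the No rows
tbl_both <- tbl_summary(toy_data, include = fever,
                        type = list(fever ~ "categorical"))

tbl_default
tbl_both
```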
gtsummary also allows performing a few commonly used statistical tests and enables exporting the results as
tables.
3.1. Comparing two groups
For example, let’s do a t-test to see if the average hemoglobin levels differ by NS1 status.
data_dengu_igg %>%
  tbl_summary(by=ns1, include = hem) %>%
  add_p(~ "t.test") #t-test comparing hemoglobin between the two NS1 groups
Let us add other variables, such as age and IgG levels, to see if their average levels differ by NS1 status.
data_dengu_igg %>%
tbl_summary(by=ns1, include = c(age, hem, d_igg)) %>%
add_p(~ "t.test")
If there are more than two groups, we use the ANOVA test for normally distributed data and the
Kruskal-Wallis test for non-normal data.
The syntax is similar to the above, but the name of the test must be changed within the add_p() function.
For example, to compare hemoglobin values across the three categories of IgG status:
data_dengu_igg %>%
tbl_summary(by=igg_cat, include = hem) %>%
add_p(~ "aov") # Comparing the hemoglobin for various IgG categories
To test for significant differences or associations between categorical groups, the Chi-square test is used.
For example, we may want to apply chi-square tests to assess whether occupation or location is associated with
IgG positivity.
data_dengu_igg %>%
tbl_summary(by=igg_cat, include = c(ocp, loc)) %>%
add_p(~"chisq.test") #Comparing association between two categorical variables
In some tables, one may want to add results from statistical tests next to summary or descriptive statistics.
This can be done as follows:
data_dengu_igg %>%
tbl_summary(by = igg_cat, include = c(age_group,gender, loc, ocp)) %>%
add_n() %>%
add_p() #Here, we do not mention the test
If we omit the test name, gtsummary chooses a default test: the Wilcoxon rank sum test for comparing a
continuous variable between two groups, the Kruskal-Wallis test for comparisons involving more than two groups,
and Pearson’s Chi-squared test or Fisher’s exact test for associations between categorical variables.
4. Merge two or more gtsummary objects
It is also possible to merge two different tables into a single table. For example, we may want to merge
summary statistics for two different outcomes (say, IgG and NS1) in a single table.
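A minimal sketch of merging two summary tables with gtsummary's tbl_merge() is shown below. The toy data frame and the spanning header texts are assumptions for illustration; the column names igg_cat, ns1 and loc mirror the workshop dataset.

```r
library(gtsummary)

# Hypothetical toy data
toy_data <- data.frame(
  igg_cat = c("Positive", "Negative", "Positive", "Negative", "Positive", "Negative"),
  ns1     = c("Present", "Absent", "Absent", "Present", "Present", "Absent"),
  loc     = c("Rural", "Urban", "Rural", "Urban", "Urban", "Rural")
)

# One summary table per outcome
tbl_igg <- tbl_summary(toy_data, by = igg_cat, include = loc)
tbl_ns1 <- tbl_summary(toy_data, by = ns1, include = loc)

# Place the two tables side by side under spanning headers
tbl_combined <- tbl_merge(
  tbls = list(tbl_igg, tbl_ns1),
  tab_spanner = c("**Dengue IgG**", "**NS1**")
)

tbl_combined
```

The tables being merged should summarise the same variables so that their rows line up.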
gtsummary provides a high level of customization, allowing you to create publication-ready tables for research
papers. Here’s how you can add titles, bold headers and footers, and change the theme of the table:
data_dengu_igg %>%
tbl_summary(by = igg_cat,
include = c(ocp, loc)) %>%
add_n() %>%
modify_header(label = "**Variable**") %>%
modify_footnote(all_stat_cols() ~ "n (%)") %>%
#To change the footnote of the table
modify_caption("**Patient Characteristics** (N = {N})")
A gtsummary table is not a standard data frame, and exporting the summary table to a Word
document requires additional steps. The gt package, which is specifically designed for creating beautiful and
highly customizable tables, can be used to export the gtsummary summary table to a Word document.
The syntax for exporting the table to a Word document is: table %>% as_gt() %>% gtsave(filename="__________.docx")
For example, to export the above table to a Word document:
data_dengu_igg %>%
tbl_summary(by = igg_cat,
include = c(gender,age_group, loc, ocp)) %>%
add_n() %>%
as_gt() %>% #Convert the table to gt format
gtsave(filename = "Table 1.docx")
#Save the file as Table 1 word document
Challenges:
Day - 1 (Session - 1)
Challenge - I
Create vectors named height and weight using the following data:
height : 160.3, 134.2, 159, 149, 145, and 147.1
weight : 83.8, 37.2, 71.7, 72.8, 50.5, and 42.9.
Challenge - II
Create a matrix using the following table, and answer the following questions using matrix operations
(Hint rowSums(___))
(Hint rowSums(___))
A/(A+B)
e) Risk ratio of CHD ( C/(C+D) ) __________.
Challenge - III
Represent the following table using array, and answer the following using array operations
Challenge - IV
4) Create a list that contains results of overall risk ratio (Challenge II), rural risk ratio (Challenge IIIc)
and urban risk ratio (challenge IIIf)
Challenge - V
Challenge - VI
b) Suppose mat is a 2x2 matrix. To extract the element in the 2nd row and 1st column, will the command mat(2;1) work?
Day - 1 (Session - 2)
In this hypothetical study, data from 25 individuals have been collected to explore the relationship between
demographic factors, systolic blood pressure, hypertension, and the effectiveness of two types of drugs, A
and B.
Let's work through these questions to carry out the data cleaning process.
1) Import the exercise data from the directory (File name is Exercise_data-Day1.csv)
i) How many variables are there in the datasheet? __________ (Hint ____ %>% dim())
2) Give the variables new names as the following (Hint ___ %>% rename())
i) “Height.in.cms” as height
3) Give the variables labels as the following (Hint ___ %>% set_variable_labels())
4) Recode the values of the following variables (Hint ___ %>% recode())
6) How many people from urban areas participated in the study? __________ (Hint ___ %>% filter( ))
7) How many individuals took drug A? __________ (Hint ___ %>% filter( ))
8) How many individuals took drug B? __________ (Hint ___ %>% filter( ))
9) Find the duplicates. How many identical pairs did you find? __________
10) Find the missing data for the variable Systolic Blood pressure (mmHg). (Hint filter(is.na(-----)))
11) Identify the outliers in Systolic Blood Pressure (mmHg). (Hint use the range 80-160)
12) Prepare summary table by drug type for diastolic blood pressure with count, mean and median, and
SD (Hint ___ %>% group_by(___) %>% summarise(___))
Day - 2 (Session - 1)
Let us create some data visualizations to understand how effective the drug is in the treatment of blood pressure,
and to see if there are any baseline differences and differences in outcomes (hypertension, systolic and diastolic
BP).
1. Use the ggplot2 package to plot the bar graph for hypertension response (Univariate bar graph). Which
response has the most frequency? __________
2. Could you add drug type in the bar chart for hypertension? (Bivariate grouped bar chart). How many
people who indicated they had hypertension also took drug A? __________
3. Could you now add the dwelling type to the previous bar graph? To include the location in the
bar graph, use the facet_wrap() function. What type of distribution does the graph look like for the large city?
__________
(Hint facet_wrap(~____))
4. Draw a density chart for systolic blood pressure (Univariate chart). What type of distribution does
the graph look like? __________
a) Right skewed-distribution
b) Left skewed-distribution
c) Normal distribution
d) Uniform distribution
5. Create a box plot to represent systolic blood pressure by drug type (Bivariate box plot). What is the
median blood pressure for each drug type? __________
6. Using facet_wrap(), add the type of dwelling to the previous graph. Which sort of dwelling has the
highest blood pressure when using drug B? __________
7. Use a scatter chart to plot the graph for systolic and diastolic pressure (Bivariate graph).
What is the relationship between systolic and diastolic blood pressure? __________
a) No association
b) Positive association
c) Negative association
Day - 2 (Session - 2)
Create summary tables for the following conditions. Then, fill in the blanks.
Variable n(%)
Gender
- Male __________
- Female __________
Location
- Town __________
Drug type
- Type A __________
- Type B __________
2) Prepare summary statistics for the following variables by type of drug, sex, dwelling, hyper. Include
statistical tests.
Location __________
Hypertension __________
- No __________ __________
3) Prepare the summary statistics for the numerical vectors systolic and diastolic blood pressure by drug
type. Include statistical tests.