
Statistical Computing using R-Software

Department of Biostatistics, CMC Vellore

October 18-19, 2023

Date         Session  Time           Topic                                                    Instructor

18-Oct-2023  I        09:00 - 10:30  R - Introduction                                         Dr. Prasanna Samuel / Ms. Divya
                      10:30 - 11:00  Refreshment
                      11:00 - 12:30  Interactive Practical Session
                      12:30 - 13:30  Lunch
             II       13:30 - 15:00  Data management using dplyr package                      Ms. Nagayazhini / Mr. Saravanaraj
                      15:00 - 15:30  Refreshment
                      15:30 - 17:00  Interactive Practical Session

19-Oct-2023  I        09:00 - 10:30  Data visualization using ggplot package                  Ms. Maya P G
                      10:30 - 11:00  Refreshment
                      11:00 - 12:30  Interactive Practical Session
                      12:30 - 13:30  Lunch
             II       13:30 - 15:00  Statistical reports generation using gtsummary package   Mr. Saravanaraj
                      15:00 - 15:30  Refreshment
                      15:30 - 17:00  Interactive Practical Session

Contents

Chapter 1 R software
1. Introduction to R software
  1.1. Evolution of R programming
  1.2. The R environment
  1.3. Why R programming?
  1.4. Download and Install R software
2. RStudio software
  2.1. What is RStudio?
  2.2. Some of the features in RStudio software
  2.3. Download and Install RStudio software
3. Basic R commands, data types, operators and data structures
  3.1. Learning R
  3.2. R commands
  3.3. Types of data
  3.4. Operators
  3.5. Types of data structures
    3.5.1. Vectors
    3.5.2. Matrices
    3.5.3. Arrays
    3.5.4. Lists
    3.5.5. Data frames
    3.5.6. Factors
4. Import a CSV file
5. Display data and data structure
6. Important Points

Chapter 2 Data management
1. Packages in R
  1.1. R package installation
  1.2. Update, Remove and Check Installed Packages in R
  1.3. Load packages in R
  1.4. Data for demonstration
2. dplyr Package
  2.1. The %>% operator (Pipe operator)
3. Change the variable and values’ names and labels
  3.1. Rename the variables in the data set
  3.2. Adding variable labels
  3.3. Recoding variables in the data set
  3.4. Value labels
4. Basic column operations with dplyr
5. Basic row operations with dplyr
6. Data Cleaning
  6.1. Find and remove the duplicates
  6.2. Find and replace the missing values
  6.3. Find and replace the outliers
7. The join() functions
8. Summary statistics
  8.1. Creating summary statistics by using dplyr
9. Import a MS excel file (optional)
10. Reading data from different sources (optional)

Chapter 3 Data Visualization
1. The ggplot2
  1.1. Install ggplot2 package
  1.2. Work with ggplot2
  1.3. Graph panel
2. Univariate graphical representation
  2.1. A categorical variable
  2.2. A numerical variable
3. Bivariate graphical representation
  3.1. Bar charts (Two categorical variables)
  3.2. Box plots (One categorical, one numerical variable)
  3.3. Scatter plot (Two numerical variables)
4. Multivariate graphs
  4.1. Scatter plot with categorical group
  4.2. Faceting
5. Preparing graphs for publication with ggplot2
6. Saving graphs

Chapter 4 Statistical reports generation
1. Summary statistics for publication
2. The gtsummary package
  2.1. Install gtsummary
  2.2. Summary table
3. Statistical tests and outputs
  3.1. Comparing two groups
  3.2. Comparing more than 2 groups
  3.3. Association between categorical groups
  3.4. Summary statistics with statistical tests
4. Merge two or more gtsummary objects
5. Formatting the gtsummary table
6. Export gtsummary table

Challenges:
  Day - 1 (Session - 1)
  Day - 1 (Session - 2)
  Day - 2 (Session - 1)
  Day - 2 (Session - 2)
Chapter 1 R software

1. Introduction to R software

• R is a programming language and computing environment used for statistical analysis, graphical
representation, and reporting. Ross Ihaka and Robert Gentleman developed R at the University of
Auckland in New Zealand. R can be considered a different implementation of the S programming
language, which was developed at Bell Laboratories from 1976. S-PLUS, a commercial implementation
of S first released in 1988, added GUI (Graphical User Interface) features, though the fundamentals
have not changed over time.
• R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-
series analysis, classification, clustering, etc.,) and graphical techniques, and is highly extensible. The
S language is often the vehicle of choice for research in statistical methodology, and R provides an
Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor
design choices in graphics, but the user retains full control.
R is available as Free (freedom to use, adapt, redistribute & improve) Software under the terms
of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs
on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and
MacOS.

1.1. Evolution of R programming

• R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland, New Zealand. R made its first appearance in 1993.

• R can be regarded as an implementation of the S language which was developed at Bell Laboratories
by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-Plus systems.
• In June 1995, statistician Martin Mächler convinced Ihaka and Gentleman to make R free and
open-source under the GNU General Public License. Mailing lists for the R project began on 1 April 1997,
preceding the release of version 0.50. R officially became a GNU project on 5 December 1997, when
version 0.60 was released. The first official 1.0 version was released on 29 February 2000.
• A large group of individuals has contributed to R by sending code and bug reports. Since mid-1997
there has been a core group (the “R Core Team”) who can modify the R source code archive. The R
Development Core Team is currently responsible for maintaining this programming language.
• Comprehensive R Archive Network (CRAN) is a network of ftp and web servers around the world that
store identical, up-to-date, versions of code and documentation for R. We can download and install
packages and software from the CRAN platform.
• As of September 2023, CRAN hosts 19925 packages, including highly rated tools for medical
research.

1.2. The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It
includes,

• an effective data handling and storage facility,

• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals, loops, user-
defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an
incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis
software.

1.3. Why R programming?

• In contrast to many other statistical software packages, R is free and open source, which makes it
simple to collaborate with other people, to read and share syntax, and to achieve reproducibility.
R can also be integrated with other tools such as Python, MySQL, REDCap and Hadoop.
• R excels in data visualization with packages like ggplot2, which allows you to create highly
customizable and publication-quality graphics.

• R is specifically designed for statistical analysis. It provides a wide range of statistical tests, models,
and methods for data exploration, hypothesis testing, and modeling. Its rich statistical functions and
libraries make it a go-to tool for statisticians.

1.4. Download and Install R software

To download the most recent version of R, go to https://CRAN.R-project.org and choose the version
that matches your operating system. On this website, choose
Download R for (Windows / macOS / Linux) -> Sub directories (base / install R for the first time).
The software’s interface looks like this when it is opened after installation (see Figure 1).
What is Console?
In programming, a console is a text-based interface that allows you to interact with the computer’s operating
system. In R, the console is where commands are typed and their results are printed.

2. RStudio software

Even though R is widely used for statistical programming, it can be challenging to manage
different panels, scripts, variables, and objects in the basic R interface. RStudio, an interface
built for R programming, makes this easier.

2.1. What is RStudio?

RStudio is an integrated development environment (IDE) specifically designed for working with the R
programming language. We can use R without RStudio, but RStudio provides various features that are quite
helpful when writing R syntax.

Figure 1: R software

2.2. Some of the features in RStudio software

• RStudio provides a user-friendly and intuitive interface for working with R. It includes a script editor,
a console, a workspace browser, and integrated plotting capabilities, making it easier to write, run,
and manage your R code.
• The code editor in RStudio offers syntax highlighting, code autocompletion, and error checking, which
helps you write code more efficiently and identify errors quickly. This can improve your productivity
and reduce coding mistakes.
• RStudio provides tools for managing your R workspace, including viewing loaded datasets, objects,
and functions. This helps you keep track of your data and variables during your analysis.
• RStudio supports Markdown and R Markdown, which are useful for creating dynamic and reproducible
reports and documents. You can combine code, text, and graphics in a single document, making it
easy to communicate your analysis results.
• RStudio has built-in project management features, allowing you to organize your work into separate
projects. Projects help keep your files, scripts, data, and settings organized, making it easier to
collaborate and maintain reproducibility.

2.3. Download and Install RStudio software

The company behind RStudio has changed its name to Posit; the software itself is still called RStudio.
Hadley Alexander Wickham is the chief scientist at Posit. He developed many packages that make R easier
to use.
To download RStudio, go to https://posit.co/download/rstudio-desktop/ and choose the version that
matches your operating system.
Note: RStudio works only if R is already installed. Therefore, ensure that both R and RStudio are
installed; they are separate installations.

Figure 2: RStudio software panel

3. Basic R commands, data types, operators and data structures

3.1. Learning R

• Learning R can be difficult and frustrating, and that is normal. The main aim is to learn how to use R
(its basic syntax), not to memorize functions.

• Learn R via structured projects


• Try all the exercises by yourself and work through the error messages. If you are stuck, then ask for
help!
• Build projects on your own

3.2. R commands

Note: To execute R commands, select the code and click the Run button in the upper right corner of the
script editor’s toolbar. Alternatively, select the code and press Ctrl + Enter.

# R as calculator

2 + 2 # Addition

> [1] 4

10 - 2 # Subtraction

> [1] 8

4 * 5 # Multiplication

> [1] 20

10 / 2 # Division

> [1] 5

10 ^ 2 # Powers

> [1] 100

3+5*2

> [1] 13

(3+5)*2

> [1] 16

# Variable assignment; assign value 5 to variable x

x = 5
x # An assignment prints nothing; typing the name shows the value

> [1] 5

# R way of assignment

x <- 5

print(x)

> [1] 5

Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less than”) and ‘-’
(“minus”) occurring strictly side-by-side, and that it ‘points’ to the object receiving the value of the
expression. In most contexts the ‘=’ operator can be used as an alternative.

3.3. Types of data

R’s basic data types are,

• Numeric ( 1.2, 5, 7, 3.14159 )
• Integer ( 1L, 2L, 3L, 4L, 5L ); the L suffix stores a whole number as an integer rather than as a numeric
• Complex ( 4 + 1i )
• Logical ( TRUE / FALSE )
• Character ( “a”, “apple” )
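
The types listed above can be checked with class(). The following is a minimal sketch; the variable names are illustrative only and do not come from the workshop data.

```r
# Check how R classifies a few literal values
num   <- 3.14159   # numeric (stored as a double)
whole <- 5         # still numeric: a bare 5 is a double
int   <- 5L        # the L suffix makes it an integer
cplx  <- 4 + 1i    # complex number
flag  <- TRUE      # logical
txt   <- "apple"   # character

class(num)
class(whole)
class(int)
class(cplx)
class(flag)
class(txt)
```

Note that whole numbers typed without the L suffix are reported as "numeric", which matters when a function insists on integer input.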

3.4. Operators

Operators are used to perform operations on variables and values. R divides the operators as Arithmetic
operators, Comparison operators and Logical operators.
Arithmetic operators

• The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power.

Comparison and Logical operators

Operator   Description                                                  Example
==         Equal to                                                     a == b
!=         Not equal to                                                 a != b
<          Less than                                                    a < b
<=         Less than or equal to                                        a <= b
>          Greater than                                                 a > b
>=         Greater than or equal to                                     a >= b
&          Element-wise AND logical operator. TRUE when condition A
           AND condition B are both TRUE.
|          Element-wise OR logical operator. TRUE when condition A
           OR condition B is TRUE.
::         Access operator (neither comparison nor logical) which       package::function
           helps to access a specific function from a specific
           package.
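
The operators in the table can be tried directly in the console. This is a small sketch with made-up values for a, b, x and y.

```r
a <- 5
b <- 8

a == b   # FALSE
a != b   # TRUE
a < b    # TRUE: a is less than b
a >= b   # FALSE

# Element-wise AND / OR on logical vectors
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)
x & y    # TRUE FALSE FALSE
x | y    # TRUE TRUE FALSE

# package::function calls a function without loading the package first
stats::median(c(a, b))   # 6.5
```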

3.5. Types of data structures

• Vectors
• Arrays
• Matrices
• Lists
• Dataframes
• Factors

3.5.1. Vectors

• Vector is a single entity consisting of an ordered collection of numbers (or) characters.

#Numeric vector of whole-number ages (R stores these as doubles; 35L would be a true integer)
age <- c(35, 31, 29, 33, 28)
age

> [1] 35 31 29 33 28

#Numeric vector
weight <- c(71.7, 77.2, 53.5, 62.1, 29.2)
weight

> [1] 71.7 77.2 53.5 62.1 29.2

#Character vector
name <- c("A", "B", "C", "D", "E")
gender <- c("Male","Female","Male","Male","Female")

Note: c() stands for combine. It joins the values given as its arguments into a single vector.
Vector Operations

• Vectors can be used in arithmetic expressions, in which case the operations are performed element by
element. The vectors should therefore normally have the same length; if they do not, R recycles the
values of the shorter vector and warns when the longer length is not a multiple of the shorter one.

# Vector operations
age + 3

> [1] 38 34 32 36 31

# Vector operations - recycling

age + c(5, 9,11, 7)

> Warning in age + c(5, 9, 11, 7): longer object length is not a multiple of
> shorter object length

> [1] 40 40 40 40 33

• Vectors in R use 1-based indexing, unlike languages such as C and Python, where indexing starts
from 0.

Example:

height <- c(150.9, 145.1, 138.5, 133.0, 123.7)


weight <- c(71.7, 77.2, 53.5, 62.1, 29.2)

bmi <- weight/(height/100)^2 #Weight (kg) / Height (m)^2


bmi

> [1] 31.48768 36.66760 27.89037 35.10656 19.08286

#Extracting elements in a vector
#We can extract an element using [], as in the following R commands

bmi[4] #Fourth element from bmi numeric vector.

> [1] 35.10656

bmi[2:3] # from 2 to 3

> [1] 36.66760 27.89037

bmi[-3] # except 3rd element

> [1] 31.48768 36.66760 35.10656 19.08286

#Logical vector
# A vector which contains only TRUE, FALSE or NA.
bmi_cat <- bmi<30
bmi_cat

> [1] FALSE FALSE TRUE FALSE TRUE

bmi[bmi_cat]

> [1] 27.89037 19.08286

Common arithmetic functions

• In addition all of the common arithmetic functions are available. log, exp, sin, cos, tan, sqrt, and so
on, all have their usual meaning.
• max and min select the largest and smallest elements of a vector respectively. range is a function whose
value is a vector of length two, namely c(min(x), max(x)).
• length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their
product.
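
The functions named above can be tried on a small vector. This is a sketch with hypothetical values; the second part illustrates, as an assumption about typical use, how the na.rm argument handles a missing value (NA).

```r
marks <- c(4, 9, 16, 25)   # hypothetical values

sqrt(marks)     # 2 3 4 5
range(marks)    # 4 25
sum(marks)      # 54
prod(marks)     # 14400
length(marks)   # 4

# A missing value (NA) makes the result NA unless na.rm = TRUE drops it
with_na <- c(4, 9, NA, 25)
sum(with_na)                 # NA
sum(with_na, na.rm = TRUE)   # 38
```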

age <- c(35, 31, 29, 33, 28)

min(age) #Minimum from the age vector

> [1] 28

max(age) #Maximum from the age vector

> [1] 35

length(age) #Number of elements in age

> [1] 5

mean(age, na.rm=TRUE) #Mean of age

> [1] 31.2

median(age, na.rm=TRUE) #Median of age

> [1] 31

sd(age, na.rm=TRUE) #Standard deviation of age

> [1] 2.863564

var(age, na.rm=TRUE) #Variance of age

> [1] 8.2

3.5.2. Matrices

• Matrices are R objects in which the elements are arranged in a two-dimensional rectangular layout.
All elements of a matrix have the same data type. Matrices containing numeric elements are used in
mathematical calculations.

A Matrix is created using the matrix() function.


The basic syntax for creating a matrix in R is - matrix(data, nrow, ncol, byrow, dimnames)
Example:

crosstab <- matrix(nrow = 2, ncol= 2)


crosstab <- matrix(c(10, 40, 7, 43), nrow=2, ncol=2, byrow=TRUE)

#By setting byrow=TRUE, the vector will be assigned row-wise.


crosstab #2X2 contingency table.

> [,1] [,2]


> [1,] 10 40
> [2,] 7 43

coln <- c("Outcome == Yes", "Outcome == NO")


rown <- c("Treatment","Control")

crosstab <- matrix(c(10,40, 7, 43),nrow = 2, byrow = TRUE,


dimnames = list(rown,coln))
crosstab

> Outcome == Yes Outcome == NO
> Treatment 10 40
> Control 7 43

Extracting elements from the matrix


We can access the items by using [Row,Column] brackets. The first number in the bracket specifies the
row-position, while the second number specifies the column-position.

crosstab[2,1] #Second row and first column

## [1] 7

crosstab[2,] #A comma after the row number returns the whole row

## Outcome == Yes Outcome == NO
## 7 43

crosstab[,2] #A comma before the column number returns the whole column

## Treatment Control
## 40 43

Matrix operations
rowSums() for row sum and colSums() for column sum.
Calculation of Relative Risk for the cross-tab

rtotal <- rowSums(crosstab) # no in the treatment & control group


rtotal

## Treatment Control
## 50 50

ctotal <- colSums(crosstab) # no with and without outcome


ctotal

## Outcome == Yes Outcome == NO


## 17 83

# Calculation of risk in the intervention group: a / (a + b)


a <- crosstab[1,1] #Extracting those with outcome from the intervention
a

## [1] 10

n1 <- rtotal[1] # Extracting no in the treatment group
n1

## Treatment
## 50

#Risk of developing the outcome in the exposed group: a / (a + b)


r1 <- a/n1
r1

## Treatment
## 0.2

# Calculation of risk in the control group: c / (c + d)


c <- crosstab[2,1] #Extracting those with outcome from the control group
c

## [1] 7

n2 <- rtotal[2] # Extraction no in the control group


n2

## Control
## 50

# Risk in the control group: c / (c + d)


r2 <- c/n2
r2

## Control
## 0.14

#Risk ratio (Risk in exposed/Risk in not exposed)


rr <- r1/r2
rr

## Treatment
## 1.428571
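
The step-by-step calculation above can also be wrapped in a small function. This is only a sketch: risk_ratio is a hypothetical helper name, and it assumes the table is laid out with the exposure groups in rows (treatment first) and the outcome in columns (yes first), as in crosstab.

```r
# Hypothetical helper: risk ratio from a 2x2 table laid out as
#   rows    = exposure group (row 1 = treatment, row 2 = control)
#   columns = outcome (column 1 = yes, column 2 = no)
risk_ratio <- function(tab) {
  risks <- tab[, 1] / rowSums(tab)   # a/(a+b) and c/(c+d)
  unname(risks[1] / risks[2])        # drop the row name from the result
}

tab <- matrix(c(10, 40, 7, 43), nrow = 2, byrow = TRUE)
rr_check <- risk_ratio(tab)
rr_check
```

With the same counts as the worked example, the function returns the same risk ratio (about 1.43).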

3.5.3. Arrays

• Arrays are R data objects which can store data in more than two dimensions. For example, if we
create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3
columns. Arrays can store only one data type.

An array is created using the array() function. It takes vectors as input and uses the values in the dim
parameter to create an array.

Example:

tab1 <- c(11,22,13,42)


crosstab1 <- matrix(tab1, nrow = 2, byrow = TRUE)

tab2 <- c(12,7,9,19)


crosstab2 <- matrix(tab2, nrow=2,byrow=TRUE)

strattab <- array(c(crosstab1,crosstab2), dim = c(2,2,2))


strattab

> , , 1
>
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 7
> [2,] 9 19

dimnames(strattab)<-list(group=c("I", "C"), outcome=c("Y", "N"),


confounder=c("M", "F"))

strattab

> , , confounder = M
>
> outcome
> group Y N
> I 11 22
> C 13 42
>
> , , confounder = F
>
> outcome
> group Y N
> I 12 7
> C 9 19

### Access the array elements


#We can access array elements in the same way as matrix elements, with one additional index.
#R command for accessing array element is

#variable[ROW, COLUMN, MATRIX]

#For example,
strattab[,,2] # the second matrix

> outcome

> group Y N
> I 12 7
> C 9 19

strattab[1,,2] # first row of the second matrix

> Y N
> 12 7

strattab[1,1,2] # first row, first column of the second matrix

> [1] 12

3.5.4. Lists

• An R list is an object consisting of an ordered collection of objects known as its components.

There is no particular need for the components to be of the same mode or type; for example, a list
could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function,
and so on. Here is a simple example of how to make a list:
Example:

vector1 <- c(11,22,13,42)


matrix1 <- matrix(vector1, nrow = 2, byrow = TRUE)

array <- array(c(tab1,tab2), dim = c(2,2,2))

lst <- list(vector1,matrix1,array)


lst

> [[1]]
> [1] 11 22 13 42
>
> [[2]]
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> [[3]]
> , , 1
>
> [,1] [,2]
> [1,] 11 13
> [2,] 22 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 9
> [2,] 7 19

lst <- list(vector1,matrix1,array, coln, rown)
lst

> [[1]]
> [1] 11 22 13 42
>
> [[2]]
> [,1] [,2]
> [1,] 11 22
> [2,] 13 42
>
> [[3]]
> , , 1
>
> [,1] [,2]
> [1,] 11 13
> [2,] 22 42
>
> , , 2
>
> [,1] [,2]
> [1,] 12 9
> [2,] 7 19
>
>
> [[4]]
> [1] "Outcome == Yes" "Outcome == NO"
>
> [[5]]
> [1] "Treatment" "Control"

Extract elements from the list


To extract elements, we can use the “[[ ]]” indexing method.
Syntax for extracting: Listname[[INDEX NUMBER]]

#For example,
lst[[1]] #Extract first vector of the list

## [1] 11 22 13 42

#To extract a single element from the vector above:

lst[[1]][2] #Extracts the 2nd element of the first vector in the list.

## [1] 22
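
List components can also be given names and then extracted by name. The sketch below uses a hypothetical list called study; none of its names or values come from the workshop data.

```r
# A hypothetical named list; the names are for illustration only
study <- list(ages = c(35, 31, 29), site = "Site A", blinded = TRUE)

study$ages            # extract by name; same as study[["ages"]]
study[["site"]]
second_age <- study$ages[2]   # second element of the ages vector
second_age

# Single brackets keep the result as a list of length one
sub <- study["ages"]
class(sub)

names(study)
```

Double brackets (or $) return the component itself, while single brackets return a shorter list, which is a frequent source of confusion.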

3.5.5. Data frames

• Data frames in R language are generic data objects of R that are used to store tabular data. Data
frames can also be interpreted as matrices where each column of a matrix can be of different data
types.

Example:

name <- c("A", "B", "C", "D", "E")


age <- c(35, 31, 29, 33, 28)
gender <- c("Male","Female","Male","Male","Female")
height <- c(150.9, 145.1, 138.5, 133.0, 123.7)
weight <- c(71.7, 77.2, 53.5, 62.1, 29.2)

data <- data.frame(name, age, gender, height, weight)


data

> name age gender height weight


> 1 A 35 Male 150.9 71.7
> 2 B 31 Female 145.1 77.2
> 3 C 29 Male 138.5 53.5
> 4 D 33 Male 133.0 62.1
> 5 E 28 Female 123.7 29.2

Extract elements from a dataframe


The nth row or column of a data frame is extracted using the same procedure as when extracting elements
from matrices.

#For example,
data[,2] #Extract second entire column

## [1] 35 31 29 33 28

data[4,] #Extract fourth entire row

## name age gender height weight


## 4 D 33 Male 133 62.1

data[1,2] #Extracting first row, second column element.

## [1] 35
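
Columns can also be selected by name, and rows by a logical condition. The following sketch rebuilds a smaller version of the example data frame so that it runs on its own; the object names older and females are illustrative.

```r
data <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  age    = c(35, 31, 29, 33, 28),
  gender = c("Male", "Female", "Male", "Male", "Female")
)

data$age          # a column as a vector, selected with $
data[, "age"]     # the same column, selected by name

# Rows where age exceeds 30
older <- data[data$age > 30, ]
older

# Name and age of the female participants only
females <- data[data$gender == "Female", c("name", "age")]
females
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.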

3.5.6. Factors

• Factors are the data objects which are used to categorize the data and store it as levels. They can
store both strings and integers. They are useful in data analysis for statistical modeling. Factors are
created using the factor() function by taking a vector as input.

Example:

gender <- c("Male","Female","Male","Male","Female")


gender <- factor(gender)
gender

> [1] Male Female Male Male Female


> Levels: Female Male

#Change the order of levels
factor(gender, levels = c("Male","Female"))

> [1] Male Female Male Male Female


> Levels: Male Female

#We can specify the reference value by using relevel().


relevel(gender, ref = "Male")

> [1] Male Female Male Male Female


> Levels: Male Female

Note: When a variable is declared to be a factor, R orders its levels alphabetically by default. To order
the levels differently, we must specify them manually.
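
A few helper functions make the level ordering visible. This is a minimal sketch; the severity variable is hypothetical and is only there to show an ordered factor.

```r
gender <- factor(c("Male", "Female", "Male", "Male", "Female"))

levels(gender)                 # "Female" "Male" (alphabetical default)
counts <- table(gender)        # counts per level
counts
codes <- as.integer(gender)    # underlying integer codes
codes

# An ordered factor for a hypothetical severity variable
severity <- factor(c("Mild", "Severe", "Moderate"),
                   levels = c("Mild", "Moderate", "Severe"),
                   ordered = TRUE)
below_severe <- severity < "Severe"   # comparisons follow the level order
below_severe
```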

4. Import a CSV file

Thus far, we have created different R objects: vectors, matrices, arrays, lists, data frames, and factors.
Now, we will look at how to import an external data set saved as a comma-separated values (CSV) file.
CSV files are plain-text files that are easy and quick to read and usually take up less space than other
formats.
The R command for importing a CSV file is read.csv(File path).
Example: read.csv("S:/Users/R-Workshop/Example_data.csv")
Note: Use the forward slash (/) in the file path. On Windows, the backslash (\) is the default separator
in file paths, but R treats a backslash inside a string as the start of an escape sequence, so a copied path
must be edited by hand (replace each \ with / or with \\). The file name must include the .csv extension.
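
To try read.csv() without the workshop file, a small CSV can be written to a temporary location first. This is a sketch under that assumption; csv_path and df_small are illustrative names, and the commented paths are placeholders rather than real files.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex",
             "1,36,Male",
             "2,27,Female"), csv_path)

df_small <- read.csv(csv_path)   # header = TRUE is the default
df_small
dim(df_small)                    # 2 rows, 3 columns

# On Windows, write paths with forward slashes or doubled backslashes:
# read.csv("C:/some/folder/file.csv")
# read.csv("C:\\some\\folder\\file.csv")
```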

5. Display data and data structure

df <- read.csv("S:/CMC Dept. of Biostatistics/R Workshop/Example_data_1.csv")

dim(df) #Dimension of the data (row and column)

> [1] 90 6

head(df) #First 6 observations of the data

> Participant.ID Sex.at.birth Age..in.years. Occupation


> 1 N000168001 Male 36 Daily wages
> 2 N000168002 Female 27 Daily wages
> 3 N000168003 Female 34 Daily wages
> 4 N000168004 Male 40 Government employee
> 5 N000168005 Female 35 Daily wages
> 6 N000168006 Male 50 Shop keeper
> Education.Level Locality
> 1 Higher or University Rural settings
> 2 Basic education Rural settings
> 3 Illiterate Rural settings

> 4 Illiterate Urban settings
> 5 Secondary education Rural settings
> 6 Basic education Urban settings

tail(df) #Last 6 observations of the data

> Participant.ID Sex.at.birth Age..in.years. Occupation


> 85 N000168081 Female 44 Farmer
> 86 N000168082 Male 48 Government employee
> 87 N000168083 Female 32 Shop keeper
> 88 N000168084 Female 53 Daily wages
> 89 N000168085 Male 22 Government employee
> 90 N000168086 Female 56 Shop keeper
> Education.Level Locality
> 85 Illiterate Rural settings
> 86 Secondary education Urban settings
> 87 Secondary education Rural settings
> 88 Illiterate Rural settings
> 89 Secondary education Urban settings
> 90 Secondary education Rural settings

str(df) #Structure of the data frame

> ’data.frame’: 90 obs. of 6 variables:


> $ Participant.ID : chr "N000168001" "N000168002" "N000168003" "N000168004" ...
> $ Sex.at.birth : chr "Male" "Female" "Female" "Male" ...
> $ Age..in.years. : int 36 27 34 40 35 50 50 19 22 20 ...
> $ Occupation : chr "Daily wages" "Daily wages" "Daily wages" "Government employee" ...
> $ Education.Level: chr "Higher or University" "Basic education" "Illiterate" "Illiterate" ...
> $ Locality : chr "Rural settings" "Rural settings" "Rural settings" "Urban settings" ...

class(df) #Class of the data frame

> [1] "data.frame"

6. Important Points

• To get more information on any specific named function, for example mean, the command is
help(mean); an alternative is ?mean
• We can set the working directory using setwd() function.
• For checking the current working directory use getwd()

• Use View() in RStudio to view the data table as a separate panel.


• The $ operator extracts a single named component from a data object in R. Example: df$age
#Extracts the age vector from the df data frame.
• Technically, R is an expression language with a very simple syntax. It is case sensitive, as are most
UNIX-based packages, so A and a are different symbols and refer to different variables.

• Examples of valid and invalid variable names are given below,

Variable Name           Validity
var_name2.              Valid
var_name%               Invalid
2var_name               Invalid
.var_name, var.name     Valid
.2var_name              Invalid
_var_name               Invalid
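The validity of a name can also be checked in R itself. The sketch below relies on the base R function make.names(), which converts any string into a syntactically valid name; a name that comes back unchanged was valid to begin with. The helper is_valid_name() is our own illustration, not a built-in function.

```r
# Hypothetical helper (not part of base R): a name is valid if
# make.names() returns it unchanged.
is_valid_name <- function(x) {
  x == make.names(x)
}

candidates <- c("var_name2.", "var_name%", "2var_name",
                ".var_name", "var.name", ".2var_name", "_var_name")

# Named logical vector: TRUE = valid name, FALSE = invalid name
validity <- setNames(is_valid_name(candidates), candidates)
print(validity)
```

Because make.names() also alters reserved words such as if or TRUE, the same check flags those as invalid names.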

Chapter 2 Data management

1. Packages in R

• A package in R is a collection of R functions, compiled code, and sample data. Installed packages
are stored in a directory called “library” within the R environment.
• R installs a set of standard packages along with base R. When we start the R console,
only these default packages are available.
• Any other installed package must be loaded explicitly before an R program can use it.

1.1. R package installation

To install a package from CRAN, we need the name of the package and the following command:
install.packages("package name")
Installing from CRAN is the most common and easiest way, as it takes only one command.
To install more than one package at a time, pass their names as a character vector in the
first argument of the install.packages() function:
Example:
install.packages(c("dplyr", "openxlsx"))

1.2. Update, Remove and Check Installed Packages in R

To check which packages are installed on your computer, type this command: installed.packages()
To update all the packages, type this command: update.packages()
To update a specific package, type this command: install.packages("PACKAGE NAME")
To remove a specific package, type this command: remove.packages("PACKAGE NAME")
Installing packages using the RStudio user interface: In RStudio, go to Tools -> Install Packages. A
pop-up window opens in which we type the name of the package we want to install.

1.3. Load packages in R

Installing a package is a one-time task, but the package has to be loaded again every time you begin a
new session:
library("PACKAGE NAME")
Example: library(dplyr)
If you only occasionally need a specific function or data set from an installed package, you can access it
without loading the package by using the package::name notation, for example dplyr::filter().
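The install-once, load-every-session routine can be sketched as follows. The helper missing_packages() is a hypothetical name of our own; it only reports which packages are not yet installed, and the actual installation line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("dplyr", "openxlsx")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)

# The :: notation calls one function without loading its package.
# The stats package ships with R, so this line works in any session.
middle <- stats::median(c(3, 1, 2))
print(middle)
```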

1.4. Data for demonstration

Suppose a serological survey was conducted as part of a formal study to investigate dengue IgG, NS1
marker, and hemoglobin levels across different demographic characteristics. We will be conducting several
data cleaning procedures, visualizations, and preparing tables with statistical tests using this dataset. In the
session, we will be experimenting with the following variables.

• Participant.ID
• Sex at birth
• Age (in years)
• Occupation
• Education level
• Locality
• Hemoglobin
• Dengue IgG
• Dengue IgG results
• NS1 bio-marker

#Example
#Demographic data
dem_data <- read.csv("S:/CMC Dept. of Biostatistics/R Workshop/Example_data_1.csv")

#Lab data
lab_data <- read.csv("S:/CMC Dept. of Biostatistics/R Workshop/Example_data_2.csv")

2. dplyr Package

• dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the
most common data manipulation challenges.
• It is a set of tools for a common set of problems:
(1) split up a data frame,
(2) apply a function, and
(3) combine all the results back together.

2.1. The %>% operator (Pipe operator)

%>% is a special operator in R, provided by the magrittr package and made available when dplyr is
loaded. %>% lets us pass an object to a function elegantly: x %>% f() is equivalent to f(x). This pipe
operator helps us to make our code more readable.
The RStudio keyboard shortcut for the pipe operator is Ctrl + Shift + M.

# Without the %>% operator


colnames(dem_data)

> [1] "Participant.ID" "Sex.at.birth" "Age..in.years." "Occupation"


> [5] "Education.Level" "Locality"

# By using the %>% operator


dem_data %>% colnames()

> [1] "Participant.ID" "Sex.at.birth" "Age..in.years." "Occupation"


> [5] "Education.Level" "Locality"

Note: When we start using more than two functions for the data cleaning and processing, the role of the
pipe operator will become more important.
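To illustrate the note above, here is a small sketch that chains three functions, written once as nested calls and once with the pipe. The toy data frame is made up for this illustration and is not the workshop data.

```r
library(dplyr)

# Toy data (illustrative values only)
toy <- data.frame(id  = 1:6,
                  age = c(36, 27, 34, 40, 35, 50))

# Nested calls are read from the inside out
nested <- head(arrange(filter(toy, age > 30), desc(age)), n = 3)

# The piped version is read from top to bottom, one step per line
piped <- toy %>%
  filter(age > 30) %>%    # keep participants older than 30
  arrange(desc(age)) %>%  # sort from oldest to youngest
  head(n = 3)             # show the first three rows

identical(nested, piped)
```

Both versions perform the same steps in the same order; only the way the code reads is different.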

3. Change the variable and values’ names and labels

3.1. Rename the variables in the data set

The variable names are often quite lengthy and challenging to use for subsequent commands. Therefore, we
can rename these variables for easier and more convenient use in further analyses.
The syntax for renaming variables is rename(NEW_VARIABLE_NAME = OLD_VARIABLE_NAME)

#Example
dem_data1 <- dem_data %>% #Store the modified data in another data frame
rename(age=Age..in.years.,
gender=Sex.at.birth,
ocp=Occupation,
edu=Education.Level,
loc=Locality)

dem_data1 %>% head(n=5)

## Participant.ID gender age ocp edu


## 1 N000168001 Male 36 Daily wages Higher or University
## 2 N000168002 Female 27 Daily wages Basic education
## 3 N000168003 Female 34 Daily wages Illiterate
## 4 N000168004 Male 40 Government employee Illiterate
## 5 N000168005 Female 35 Daily wages Secondary education
## loc
## 1 Rural settings
## 2 Rural settings
## 3 Rural settings
## 4 Urban settings
## 5 Rural settings

3.2. Adding variable labels

• Variable labels provide a detailed and comprehensive description of each variable, making it easier to
understand their meanings and purposes in the dataset.
• Using descriptive variable labels enhances the clarity and interpretability of the data, facilitating more
effective analysis and communication of results.
• With this description it is easier to remember what those variable names refer to.

The labelled package can be used to label variables. Install the package once, then load it with library(labelled).
The R syntax is set_variable_labels(data, VARIABLE_NAME = "label", ...)

#Example
dem_data1 <- dem_data1 %>% set_variable_labels(Participant.ID="Unique ID",
gender="Gender",
age ="Age (in years)",
ocp="Occupational status",
edu="Educational status",
loc="Locality")
dem_data1 %>% var_label()

> $Participant.ID
> [1] "Unique ID"
>
> $gender
> [1] "Gender"
>
> $age
> [1] "Age (in years)"
>
> $ocp
> [1] "Occupational status"
>
> $edu
> [1] "Educational status"
>
> $loc
> [1] "Locality"

3.3. Recoding variables in the data set

Recoding refers to the process of transforming or reassigning values of variables within a dataset. Recoding
can involve converting numerical values into categories, aggregating data into different groups, or changing
the scale of measurement.

#Data recode
#Re-code an existing variable with the help of the recode() function.
#Re-code syntax is recode(VAR_NAME, "old value" = new value)

dem_data2 <- dem_data1 %>%
  mutate(gender=recode(gender, "Male"=1,"Female"=2))

#mutate() creates a new variable or modifies an existing variable

#To re-code a continuous variable into groups, we can use case_when()
#case_when syntax is case_when(condition ~ value, ...)

dem_data2 <- dem_data2 %>%
  mutate(age_group=case_when(age<=10~0,
                             age>10 & age<=20~1,
                             age>20 & age<=30~2,
                             age>30~3))

dem_data2 %>% head()

> Participant.ID gender age ocp edu


> 1 N000168001 1 36 Daily wages Higher or University
> 2 N000168002 2 27 Daily wages Basic education
> 3 N000168003 2 34 Daily wages Illiterate
> 4 N000168004 1 40 Government employee Illiterate
> 5 N000168005 2 35 Daily wages Secondary education
> 6 N000168006 1 50 Shop keeper Basic education
> loc age_group
> 1 Rural settings 3
> 2 Rural settings 2
> 3 Rural settings 3
> 4 Urban settings 3
> 5 Rural settings 3
> 6 Urban settings 3
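As an alternative to case_when(), the same age bands can be sketched with the base R function cut(). This is a swapped-in technique rather than the approach used in this workshop, and it returns a labelled factor instead of the numeric codes 0 to 3. The ages and the label text below are illustrative assumptions.

```r
# Illustrative ages (not the workshop data)
age <- c(8, 10, 20, 27, 36, 50)

# Intervals are closed on the right: (-Inf,10], (10,20], (20,30], (30,Inf]
age_group <- cut(age,
                 breaks = c(-Inf, 10, 20, 30, Inf),
                 labels = c("00-10", "11-20", "21-30", "31+"))

data.frame(age, age_group)
table(age_group)  # number of participants in each band
```

Because the intervals are closed on the right, an age of exactly 10 falls in the first band and an age of exactly 20 in the second, which matches the conditions used in case_when() above.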

3.4. Value labels

• Value labels provide descriptive meanings for the numeric or coded values within a
variable. Assigning value labels such as “Extremely poor” to 1 and “Excellent” to 7 eliminates
the need to remember the numeric representations of these categories.
• This practice enhances the readability and interpretability of the dataset, making it easier for
researchers and analysts to understand the meaning behind the coded values without having to reference
a separate codebook or manual.

set_value_labels() from the labelled package attaches value labels to one or more variables, replacing
any value labels already attached to them.

#Example
dem_data3 <- dem_data2 %>%
set_value_labels(gender=c(Male=1,Female=2),
age_group=c("00-10"=0,
"11-20"=1,
"21-30"=2,
"30+"=3))

dem_data3 %>% val_labels()

> $Participant.ID
> NULL
>
> $gender
> Male Female
> 1 2
>
> $age
> NULL
>
> $ocp
> NULL
>
> $edu
> NULL
>
> $loc
> NULL
>
> $age_group
> 00-10 11-20 21-30 30+
> 0 1 2 3

The functions described above give context to the variables in the dataset. Variable labels and value labels
form part of the dataset documentation and together enrich its metadata.

4. Basic column operations with dplyr

The select() operation allows you to choose and extract columns (or variables) of interest from the dataset.

# Select one or more columns with select()


dem_data3 %>%
select(Participant.ID, age,loc) %>%
head(n=5) #Show only the first 5 rows, because the data set is long.

> Participant.ID age loc


> 1 N000168001 36 Rural settings
> 2 N000168002 27 Rural settings
> 3 N000168003 34 Rural settings
> 4 N000168004 40 Urban settings
> 5 N000168005 35 Rural settings

# Select columns based on start characters


dem_data3 %>%
select(starts_with("age")) %>%
head(n=5)

> age age_group


> 1 36 3
> 2 27 2
> 3 34 3
> 4 40 3
> 5 35 3

Other select() options

To select the columns whose names end with the same characters, use %>% select(ends_with("____")).
To select the columns whose names contain a particular character, use %>% select(contains("a")).
To exclude a particular variable, use %>% select(-c("VAR")).
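The options above can be tried on a small data frame. The sketch below uses a toy data frame with column names borrowed from dem_data3; the values are illustrative only.

```r
library(dplyr)

# Toy data with a few of the column names used in this chapter
toy <- data.frame(Participant.ID = c("P1", "P2"),
                  age            = c(36, 27),
                  age_group      = c(3, 2),
                  loc            = c("Rural settings", "Urban settings"))

# Columns whose names end with "group"
ends <- toy %>% select(ends_with("group")) %>% names()

# Columns whose names contain the letter "a" (matching ignores case by default)
has_a <- toy %>% select(contains("a")) %>% names()

# All columns except age_group
dropped <- toy %>% select(-age_group) %>% names()

print(ends)
print(has_a)
print(dropped)
```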

5. Basic row operations with dplyr

The filter() operation allows you to choose and extract rows of interest from the dataset.

# Filter rows on one condition


dem_data3 %>%
filter(loc=="Rural settings") %>% head() #For a character vector

> Participant.ID gender age ocp edu loc


> 1 N000168001 1 36 Daily wages Higher or University Rural settings
> 2 N000168002 2 27 Daily wages Basic education Rural settings
> 3 N000168003 2 34 Daily wages Illiterate Rural settings
> 4 N000168005 2 35 Daily wages Secondary education Rural settings
> 5 N000168007 2 19 Shop keeper Basic education Rural settings
> 6 N000168008 2 22 Daily wages Basic education Rural settings
> age_group
> 1 3
> 2 2
> 3 3
> 4 3
> 5 1
> 6 2

#Filter rows on one condition with a comparison operator


dem_data3 %>%
filter(age<30) %>% #For a numeric vector
head()

> Participant.ID gender age ocp edu loc


> 1 N000168002 2 27 Daily wages Basic education Rural settings
> 2 N000168007 2 19 Shop keeper Basic education Rural settings
> 3 N000168008 2 22 Daily wages Basic education Rural settings
> 4 N000168009 1 20 Farmer Basic education Rural settings
> 5 N000168011 2 26 Farmer Secondary education Rural settings
> 6 N000168013 1 12 None employed Illiterate Urban settings
> age_group
> 1 2
> 2 1
> 3 2
> 4 1
> 5 2
> 6 1

# Filter rows that meet at least one of two conditions (OR)


dem_data3 %>%
filter(ocp=="Shop keeper" | age<40) %>% head()

> Participant.ID gender age ocp edu loc


> 1 N000168001 1 36 Daily wages Higher or University Rural settings
> 2 N000168002 2 27 Daily wages Basic education Rural settings
> 3 N000168003 2 34 Daily wages Illiterate Rural settings
> 4 N000168005 2 35 Daily wages Secondary education Rural settings
> 5 N000168006 1 50 Shop keeper Basic education Urban settings
> 6 N000168006 1 50 Shop keeper Basic education Urban settings
> age_group
> 1 3
> 2 2
> 3 3
> 4 3
> 5 3
> 6 3

# Filter rows that meet both of two conditions (AND)


dem_data3 %>%
filter(ocp=="Shop keeper" & gender==1) %>% head()

> Participant.ID gender age ocp edu loc


> 1 N000168006 1 50 Shop keeper Basic education Urban settings
> 2 N000168006 1 50 Shop keeper Basic education Urban settings
> 3 N000168053 1 51 Shop keeper Basic education Urban settings
> 4 N000168054 1 11 Shop keeper Basic education Urban settings
> age_group
> 1 3
> 2 3
> 3 3
> 4 1

# Sort rows by values in a column in ascending order


dem_data3 %>%
arrange(age) %>% head()

> Participant.ID gender age ocp edu loc


> 1 N000168054 1 11 Shop keeper Basic education Urban settings
> 2 N000168013 1 12 None employed Illiterate Urban settings
> 3 N000168040 2 17 Farmer Illiterate Rural settings
> 4 N000168007 2 19 Shop keeper Basic education Rural settings
> 5 N000168009 1 20 Farmer Basic education Rural settings
> 6 N000168055 2 21 Shop keeper Basic education Rural settings
> age_group
> 1 1
> 2 1
> 3 1
> 4 1
> 5 1
> 6 2

# Sort rows by values in a column in descending order


dem_data3 %>%
arrange(desc(age)) %>% head()

> Participant.ID gender age ocp edu


> 1 N000168023 2 64 Government employee Secondary education
> 2 N000168050 2 61 Daily wages Illiterate
> 3 N000168024 2 60 Shop keeper Higher or University
> 4 N000168070 1 60 Government employee Illiterate
> 5 N000168076 2 58 Government employee Higher or University
> 6 N000168033 2 56 Shop keeper Secondary education
> loc age_group
> 1 Urban settings 3
> 2 Rural settings 3
> 3 Rural settings 3
> 4 Urban settings 3
> 5 Urban settings 3
> 6 Rural settings 3

So far, we have performed some basic row and column operations. Before we compute anything from the
data, however, the data must be cleaned.
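The row operations above can be compared side by side on a small example. The data frame below is a toy illustration (the IDs and values are made up), chosen so that the OR and AND filters return visibly different rows.

```r
library(dplyr)

# Toy data (illustrative values only)
toy <- data.frame(id     = c("P1", "P2", "P3", "P4", "P5"),
                  age    = c(36, 27, 50, 19, 44),
                  ocp    = c("Farmer", "Shop keeper", "Shop keeper",
                             "Farmer", "Daily wages"),
                  gender = c(1, 2, 1, 2, 1))

# OR: shop keepers, or anyone younger than 30
either <- toy %>% filter(ocp == "Shop keeper" | age < 30)

# AND: shop keepers who are also coded as gender 1
both <- toy %>% filter(ocp == "Shop keeper" & gender == 1)

# Sort from oldest to youngest
sorted <- toy %>% arrange(desc(age))

either$id
both$id
sorted$age
```

The OR filter can never return fewer rows than the AND filter built from the same two conditions.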

6. Data Cleaning
In RStudio, you can use the View() function to display the contents of a data frame in a new panel, for
example after importing the dengue IgG data from a CSV file into a data frame.

[Figure: the demographic data (left) and the lab data (right) as displayed by View()]

There are two separate CSV files. Demographic data is on the left, and lab IgG results are on the right. A
few of the duplicates, outliers, and missing data points are shown in the figure above. We must therefore
clean this dataset before we start our analysis.

6.1. Find and remove the duplicates

• Identifying and removing duplicate rows from a dataset is crucial to maintain data accuracy and prevent
redundancy. Duplicate values can distort statistical analyses and lead to incorrect conclusions.
• In R, you can use the dplyr package to easily identify and remove duplicate rows from a data frame.
• The duplicated() function returns TRUE for every element of a vector that repeats an earlier element, so it can be used to find the duplicate values.

#Check the participant IDs for duplicates

dem_data3 %>%
  filter(duplicated(Participant.ID)) #Prints only the second and later copies of each duplicated ID

> Participant.ID gender age ocp edu loc


> 1 N000168006 1 50 Shop keeper Basic education Urban settings
> 2 N000168020 1 23 Government employee Illiterate Urban settings
> 3 N000168020 1 23 Government employee Illiterate Urban settings
> 4 N000168039 2 46 Daily wages Basic education Urban settings
> age_group
> 1 3
> 2 2
> 3 2
> 4 3

#Run the R command below if you also want to print the entire sets of duplicates.
dem_data3 %>%
filter(duplicated(Participant.ID) |
duplicated(Participant.ID, fromLast = TRUE))

> Participant.ID gender age ocp edu loc


> 1 N000168006 1 50 Shop keeper Basic education Urban settings
> 2 N000168006 1 50 Shop keeper Basic education Urban settings
> 3 N000168020 1 23 Government employee Illiterate Urban settings
> 4 N000168020 1 23 Government employee Illiterate Urban settings
> 5 N000168020 1 23 Government employee Illiterate Urban settings
> 6 N000168039 2 46 Daily wages Basic education Urban settings
> 7 N000168039 2 46 Daily wages Basic education Urban settings
> age_group
> 1 3
> 2 3
> 3 2
> 4 2
> 5 2
> 6 3
> 7 3

#Remove duplicates
#distinct() keeps one copy of each set of rows that are identical in every column.

dem_data3 <- dem_data3 %>% distinct()

dem_data3 %>% dim()

> [1] 86 7

There are 86 observations remaining after the duplicates are deleted.
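The difference between the duplicate checks above can be seen on a toy data frame (made-up IDs and ages). The sketch also shows distinct() with a column name and .keep_all = TRUE, which keeps the first row for each ID even when the other columns differ; whether that is appropriate for a real dataset is a judgment for the analyst.

```r
library(dplyr)

# Toy data: P2 appears twice and P3 three times (illustrative values)
toy <- data.frame(id  = c("P1", "P2", "P2", "P3", "P3", "P3"),
                  age = c(36, 50, 50, 23, 23, 23))

# Second and later copies only
later_copies <- toy %>% filter(duplicated(id))

# Every row that belongs to a duplicated ID, including the first copy
all_copies <- toy %>%
  filter(duplicated(id) | duplicated(id, fromLast = TRUE))

# One row per fully identical row
deduped <- toy %>% distinct()

# One row per ID, keeping all other columns of the first occurrence
one_per_id <- toy %>% distinct(id, .keep_all = TRUE)

c(nrow(later_copies), nrow(all_copies), nrow(deduped), nrow(one_per_id))
```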

6.2. Find and replace the missing values

• A common task in data analysis is dealing with missing values. In R, missing values are
represented by NA.
• is.na() works on vectors, lists, matrices, and data frames.

The function is.na() returns a logical value (TRUE or FALSE) for every element, where TRUE marks a
missing value. Combining it with the colSums() function gives the total number of missing values for each variable.

#R command for counting missing values for each variable

lab_data %>%
is.na() %>% #Logical check for missing or not
colSums() #Each column total (i.e., each variable count)

> Participant.ID hem d_igg igg_cat ns1


> 0 3 0 0 0

#Filter the missing value in a particular vector


lab_data %>%
filter(is.na(hem)) #Haemoglobin

> Participant.ID hem d_igg igg_cat ns1


> 1 N000168010 NA 3.31 Positive Absent
> 2 N000168028 NA 1.54 Negative Present
> 3 N000168036 NA 3.98 Positive Present

#Replace missing values with the mean/median
#Use the replace() function to replace the missing values.
#Syntax is replace(x, list, values)
lab_data1 <- lab_data %>%
  mutate(hem=replace(hem,
                     is.na(hem),
                     median(hem, na.rm = TRUE)))

lab_data1 %>% head()

> Participant.ID hem d_igg igg_cat ns1


> 1 N000168001 8.5 2.51 Equivocal Absent
> 2 N000168002 12.1 1.57 Negative Present
> 3 N000168003 7.2 44.09 Positive Absent
> 4 N000168004 6.0 2.92 Positive Present
> 5 N000168005 5.3 2.42 Equivocal Present
> 6 N000168006 5.4 3.51 Positive Present

#Verification
lab_data1 %>%
is.na() %>%
colSums()

> Participant.ID hem d_igg igg_cat ns1


> 0 0 0 0 0
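The counting and replacing steps can also be sketched in base R. The toy data frame below is made up for illustration; the extra hem_imputed column is our own addition, meant to keep a record of which values were filled in.

```r
# Toy lab data with two missing haemoglobin values (illustrative only)
lab_toy <- data.frame(id    = c("P1", "P2", "P3", "P4", "P5"),
                      hem   = c(8.5, NA, 7.2, 6.0, NA),
                      d_igg = c(2.51, 1.57, 3.31, 2.92, 2.42))

# Number and percentage of missing values per variable
n_missing   <- colSums(is.na(lab_toy))
pct_missing <- round(100 * colMeans(is.na(lab_toy)), 1)

# Median of the observed values only
hem_median <- median(lab_toy$hem, na.rm = TRUE)

# Flag the rows that will be filled in, then replace the NAs
lab_toy$hem_imputed <- is.na(lab_toy$hem)
lab_toy$hem[lab_toy$hem_imputed] <- hem_median

n_missing
pct_missing
lab_toy
```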

6.3. Find and replace the outliers

• During data analysis, one of the most crucial steps is to identify and account for outliers:
observations that differ substantially in nature from most other observations. Their presence can
lead to untrustworthy conclusions.
• To detect outliers in a data frame, we use the reference range of the variable. Suppose the dengue IgG
values contain outliers. In the commands below, dengue IgG values below 0 or above 5 are treated as
lying outside the reference range.

• If we have the source document for the values, we may cross-check the errors and fix them; otherwise,
we can use the replace() function to change those values.

#Identifying the outliers

lab_data1 %>%
filter(d_igg<0 | d_igg>5)

> Participant.ID hem d_igg igg_cat ns1
> 1 N000168003 7.2 44.09 Positive Absent
> 2 N000168018 5.9 200.82 Equivocal Present
> 3 N000168037 4.8 333.25 Positive Present

data_clean <- lab_data1 %>%
  mutate(d_igg=replace(d_igg,
                       d_igg<0 | d_igg>5,
                       median(d_igg, na.rm=TRUE)))

data_clean %>% head()

> Participant.ID hem d_igg igg_cat ns1


> 1 N000168001 8.5 2.51 Equivocal Absent
> 2 N000168002 12.1 1.57 Negative Present
> 3 N000168003 7.2 2.41 Positive Absent
> 4 N000168004 6.0 2.92 Positive Present
> 5 N000168005 5.3 2.42 Equivocal Present
> 6 N000168006 5.4 3.51 Positive Present
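One detail worth noticing is that the median used as the replacement value can itself be computed with or without the out-of-range values. The base R sketch below, on made-up values, shows both; using the in-range median is a variation of our own, not the command used above.

```r
# Illustrative dengue IgG values, two of them outside the 0 to 5 range
d_igg <- c(2.51, 1.57, 44.09, 2.92, 200.82)

out_of_range <- d_igg < 0 | d_igg > 5

# Median of all values, outliers included
median_all <- median(d_igg)

# Median of the in-range values only
median_in_range <- median(d_igg[!out_of_range])

# Replace the out-of-range values with the in-range median
cleaned <- replace(d_igg, out_of_range, median_in_range)

median_all
median_in_range
cleaned
```

With only a few outliers in a large dataset the two medians are usually close, but in a small sample they can differ noticeably.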

Cleaning has been completed. Before we begin working with charts, we must combine all the variables
(demographic and lab data) into a single data frame. So, let's use a join function to merge the two
tables.

7. The join() functions

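• Join functions add the columns from the second data frame to the first data frame, matching the
observations based on the keys.

• There are various join functions available: inner_join(), left_join(), right_join() and full_join().
inner_join() keeps only the observations whose key appears in both data frames, left_join() keeps all
observations of the first data frame, right_join() keeps all observations of the second, and full_join()
keeps all observations of both. Most often we use inner_join().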

#Syntax inner_join(x=first data, y=second data, by= "Unique ID")

data_dengu_igg <- inner_join(x=dem_data3, y=data_clean, by="Participant.ID")

data_dengu_igg %>% head()

> Participant.ID gender age ocp edu
> 1 N000168001 1 36 Daily wages Higher or University
> 2 N000168002 2 27 Daily wages Basic education
> 3 N000168003 2 34 Daily wages Illiterate
> 4 N000168004 1 40 Government employee Illiterate
> 5 N000168005 2 35 Daily wages Secondary education
> 6 N000168006 1 50 Shop keeper Basic education
> loc age_group hem d_igg igg_cat ns1
> 1 Rural settings 3 8.5 2.51 Equivocal Absent
> 2 Rural settings 2 12.1 1.57 Negative Present
> 3 Rural settings 3 7.2 2.41 Positive Absent
> 4 Urban settings 3 6.0 2.92 Positive Present
> 5 Rural settings 3 5.3 2.42 Equivocal Present
> 6 Urban settings 3 5.4 3.51 Positive Present
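How the join functions differ is easiest to see when the two tables do not contain exactly the same participants. The sketch below uses two toy data frames (made-up IDs and values); anti_join() is added as a convenient way to list the participants that an inner join would drop.

```r
library(dplyr)

# Toy demographic and lab tables that share only P2 and P3
dem_toy <- data.frame(Participant.ID = c("P1", "P2", "P3"),
                      age            = c(36, 27, 34))
lab_toy <- data.frame(Participant.ID = c("P2", "P3", "P4"),
                      d_igg          = c(1.57, 2.41, 2.92))

inner <- inner_join(dem_toy, lab_toy, by = "Participant.ID")  # P2, P3
left  <- left_join(dem_toy, lab_toy, by = "Participant.ID")   # P1, P2, P3
full  <- full_join(dem_toy, lab_toy, by = "Participant.ID")   # P1 to P4

# Participants in the demographic table with no lab record
unmatched <- anti_join(dem_toy, lab_toy, by = "Participant.ID")

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
unmatched
```

Rows without a match in the other table receive NA in the columns that come from that table, so a left or full join can introduce new missing values.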

8. Summary statistics

We can use the summary() function to see the summary statistics of all variables. For numeric vectors, the
summary displays the minimum, 25% quartile, median, mean, 75% quartile, and maximum, along with the
number of missing values, if any. For character vectors, it displays only the length, class, and mode.

data_dengu_igg %>% summary()

> Participant.ID gender age ocp


> Length:75 Min. :1.000 Min. :11.0 Length:75
> Class :character 1st Qu.:1.000 1st Qu.:28.0 Class :character
> Mode :character Median :2.000 Median :33.0 Mode :character
> Mean :1.587 Mean :35.6
> 3rd Qu.:2.000 3rd Qu.:43.5
> Max. :2.000 Max. :61.0
> edu loc age_group hem
> Length:75 Length:75 Min. :1.00 Min. : 4.800
> Class :character Class :character 1st Qu.:2.00 1st Qu.: 5.300
> Mode :character Mode :character Median :3.00 Median : 5.685
> Mean :2.56 Mean : 6.306
> 3rd Qu.:3.00 3rd Qu.: 6.700
> Max. :3.00 Max. :12.600
> d_igg igg_cat ns1
> Min. :0.470 Length:75 Length:75
> 1st Qu.:1.770 Class :character Class :character
> Median :2.410 Mode :character Mode :character
> Mean :2.481
> 3rd Qu.:3.310
> Max. :4.900

8.1. Creating summary statistics by using dplyr

group_by() and summarise()

The group_by() function groups the rows of a data frame. group_by() alone does not produce any summary
output. It should be followed by the summarise() function with an appropriate action to perform.
For example, to prepare a summary of the dengue IgG values by location,

data_dengu_igg %>%
group_by(loc) %>% #Group the locality category
summarise(N=n()) #Count based on locality category

## # A tibble: 2 x 2
## loc N
## <chr> <int>
## 1 Rural settings 35
## 2 Urban settings 40

Here summarise() creates a new column N that holds the count for each location.
Similarly, we can include the mean, median, and 25% and 75% percentiles of the data.

data_dengu_igg %>%
group_by(loc) %>%
summarise(N=n(),
Mean=mean(d_igg, na.rm=TRUE),
Median=median(d_igg, na.rm=TRUE),
Q1=quantile(d_igg, probs=0.25, na.rm=TRUE),
Q3=quantile(d_igg, probs=0.75, na.rm=TRUE))

## # A tibble: 2 x 6
## loc N Mean Median Q1 Q3
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Rural settings 35 2.38 2.41 1.64 3.31
## 2 Urban settings 40 2.57 2.41 1.97 3.28
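The grouped summary can be checked by hand on a very small example. The toy values below are chosen so that the group means and medians are easy to verify mentally; the SD column is an extra statistic added for illustration.

```r
library(dplyr)

# Toy data: two rural and three urban observations (illustrative values)
toy <- data.frame(loc   = c("Rural", "Rural", "Urban", "Urban", "Urban"),
                  d_igg = c(1, 3, 2, 4, 6))

by_loc <- toy %>%
  group_by(loc) %>%                  # one group per locality
  summarise(N      = n(),            # number of rows in the group
            Mean   = mean(d_igg),
            Median = median(d_igg),
            SD     = sd(d_igg))

by_loc
```

Each group produces exactly one row in the result, so the output has as many rows as there are localities.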

9. Import a MS excel file (optional)

Base R has no built-in function for importing an Excel file. To import an Excel file, we can install and
load the openxlsx package. The R command for importing a .xlsx file is,
data <- read.xlsx("PATH/FILE NAME.xlsx", sheet = "SHEET NAME")

10. Reading data from different sources (optional)

• In R, we can import data from different sources. SPSS, SAS and Stata files can be imported using
the {haven} package. Install and load the package.
• Import SPSS data using read_sav() (or read_spss()). Use as_factor() to convert the labelled values to
factors, so that the data shows the value labels instead of the codes.
• Import SAS data using read_sas(), and use read_dta() to read Stata data.
• One of the helpful packages to extract tables from HTML is the {rvest} package.
• To extract tables from a PDF file, use the {tabulizer} package.

Chapter 3 Data Visualization
Data visualization is the graphical representation of information and data in a pictorial or graphical
format (for example, charts, graphs, and maps). Data visualization tools provide an accessible way to see
and understand trends, patterns, and outliers in data.

• Data can be represented graphically in a variety of ways, including histograms, line charts, pie charts,
scatter plots, and more.
• Typically, a distribution can be described as normal, right-skewed, left-skewed,
or uniform.

• A scatter plot can show a positive association, a negative association, or no association between two variables.

1. The ggplot2

In this chapter, data visualization is implemented in R using the ggplot2 package written by Hadley Wickham.
We will use this package to create various plots, understand the ggplot2 mechanics, and develop quality data visualizations.

1.1. Install ggplot2 package

Install the ggplot2 package using the function install.packages() and load the package using
library().

1.2. Work with ggplot2

• With ggplot2, the function ggplot() defines a plot object to which layers can be added.
• The first argument of ggplot() is the dataset to use in the graph and so ggplot(data = dataset) creates
an empty graph that is primed to display the given data, but since we haven’t told it how to visualize
it yet, for now it’s empty.
• Essentially, we develop plots by layering elements on top of each other, beginning with the data and
followed by aesthetics, geoms, facets, and themes.

1.3. Graph panel

R syntax is ggplot(data, mapping=aes(x,y,...)) + geom_*()
Layers are added to a plot with the + operator, not with the %>% pipe operator.

For example,

ggplot(data=graph_data, #Print the empty panel
       mapping = aes(x,y)) #Panel with x and y axis.

[Figure: empty plot panel with x on the horizontal axis and y on the vertical axis]

• From the above graph, it is clear where x will be displayed (on the x-axis) and where y will be displayed
(on the y-axis), but the panel itself is empty. This is because we have not yet articulated, in our code,
how to represent the observations from our data frame on our plot.

• To do so, we need to define a geom: the geometrical object that a plot uses to represent data. These
geometric objects are made available in ggplot2 with functions that start with geom_. For example,
bar charts use bar geoms (geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot
geoms (geom_boxplot()), scatterplots use point geoms (geom_point()), and so on.
• A few questions of interest that we can explore with visualizations:

What is the frequency of IgG positivity?
What is the distribution of IgG values?
Does the frequency of IgG positivity differ by occupation?
Does the distribution of IgG values differ by location?
What is the relationship between age and IgG values? And does it vary by occupation?
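The layering idea described in this section can be sketched in a few lines. The toy data frame and the object names below are our own; the point is only that a plot object starts as an empty panel and each geom added with + draws another layer on it.

```r
library(ggplot2)

# Toy data (illustrative values only)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 2.9, 4.2, 4.8, 6.1))

# Step 1: data and aesthetics only, which gives an empty panel with axes
base_plot <- ggplot(data = toy, mapping = aes(x = x, y = y))

# Step 2: add a layer of points
points_plot <- base_plot + geom_point()

# Step 3: add a second layer on top of the first
line_plot <- points_plot + geom_line()

# Typing the name of a plot object at the console draws it, e.g. line_plot
```

Storing the intermediate plots in objects makes it easy to reuse the same base plot with different geoms.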

2. Univariate graphical representation

2.1. A categorical variable

A variable is categorical if it can only take one of a small set of distinct values. To examine the distribution
of a categorical variable, we can use a bar chart. The height of the bars displays how many observations
occurred with each x value.

ggplot(data_dengu_igg, aes(x = igg_cat)) +


geom_bar()

[Figure: bar chart of count (y-axis) by igg_cat (x-axis: Equivocal, Negative, Positive)]

• Based on the above graph, the Negative category has fewer observations than both the Positive and the Equivocal categories.

#Transform the graph from x-y axis to y-x axis


ggplot(data_dengu_igg, aes(x = igg_cat)) +
geom_bar()+
coord_flip() #Transform the graph axis

[Figure: horizontal bar chart of count (x-axis) by igg_cat (y-axis: Equivocal, Negative, Positive)]
2.2. A numerical variable

A variable is continuous if it can take any value within its range. One commonly used visualization for the
distribution of a continuous variable is a box plot. We will draw a box plot of the dengue IgG values.
The box plot is one of the most easily understood plots. It shows the 25th percentile, the median, and the
75th percentile, and it typically allows us to identify outliers.

#Box plot
ggplot(data_dengu_igg, aes(y=d_igg))+ # For vertical view
geom_boxplot()

[Figure: vertical box plot of d_igg]

An alternative visualization for the distribution of a numerical variable is a density plot.
A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous
data that comes from an underlying smooth distribution.
Let's try it on the same data.

#Density plot
ggplot(data_dengu_igg, aes(x = d_igg)) +
geom_density()

[Figure: density plot of d_igg (x-axis, 1 to 5) against density (y-axis, 0.0 to 0.4)]

The density plot suggests that the dengue IgG values are approximately normally distributed.

3. Bivariate graphical representation


Bivariate graphs display the relationship between two variables. The type of graph will depend on the
measurement level of the variables (categorical or quantitative).

3.1. Bar charts (Two categorical variable)

If both variables are categorical, we can draw various bar plots. Let’s generate a stacked bar chart,
a grouped bar chart, and a segmented bar chart for the dengue IgG category and occupational status in
the data_dengu_igg dataset.

# Stacked bar chart


ggplot(data_dengu_igg,
aes(x = igg_cat,
fill = ocp)) +
geom_bar(position = "stack") #Position for the stacked bar chart.

[Figure: stacked bar chart of count by igg_cat (Equivocal, Negative, Positive), filled by ocp (Daily wages, Farmer, Government employee, None employed, Shop keeper)]

# Grouped bar chart


ggplot(data_dengu_igg,
aes(x = igg_cat,
fill = ocp)) +
geom_bar(position = "dodge") #Position for the grouped bar chart.

[Figure: grouped bar chart of count by igg_cat (Equivocal, Negative, Positive), with side-by-side bars for each ocp category]

# Segmented bar chart


ggplot(data_dengu_igg,
aes(x = igg_cat,
fill = ocp)) +
geom_bar(position = "fill") + #Position for the segmented bar chart.
labs(y = "Proportion")

[Figure: segmented bar chart of Proportion (0 to 1) by igg_cat (Equivocal, Negative, Positive), filled by ocp]

3.2. Box plots (One categorical one numerical variable)

• A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers
(vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are
plotted as points representing outliers.
• One of the advantages of boxplots is that their widths are not usually meaningful. This allows us to
compare the distribution of many groups in a single graph.
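The rule behind the whiskers and the outlier points can be reproduced by hand. By default, geom_boxplot() extends each whisker to the most extreme observation that lies within 1.5 times the interquartile range (IQR) of the box, and plots anything beyond that as a point. The base R sketch below applies this rule to made-up values; the variable names are our own.

```r
# Illustrative values with one unusually large observation
x <- c(1.2, 1.8, 2.1, 2.4, 2.6, 3.0, 3.3, 9.5)

q1  <- quantile(x, 0.25, names = FALSE)  # 25th percentile
q3  <- quantile(x, 0.75, names = FALSE)  # 75th percentile
iqr <- q3 - q1                           # height of the box

# Fences: observations beyond these limits are drawn as outlier points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Here only the value 9.5 lies beyond the fences, so it is the only observation that would be plotted as a separate point.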

ggplot(data_dengu_igg,
aes(x = ocp,
y = d_igg)) +
geom_boxplot()

[Figure: box plots of d_igg (y-axis) by ocp (x-axis: Daily wages, Farmer, Government employee, None employed, Shop keeper)]

The box plot shows that the median is close to the 25th percentile. Half of the observations lie below the
median, so a quarter of them are packed into the narrow range between the 25th percentile and the median.

3.3. Scatter plot (Two numerical variable)

For two numerical variables, the function geom_point() adds a layer of points to the plot and creates a
scatter plot. Let’s examine the relationship between dengue IgG and age using the provided data.

ggplot(data = data_dengu_igg,
mapping = aes(x=age,
y=d_igg))+
geom_point()

[Figure: scatter plot of d_igg (y-axis) against age (x-axis, 10 to 60)]

According to this graph, there is a positive association between dengue IgG and age.
Including a line in the scatter plot

ggplot(data_dengu_igg,
aes(x=age,
y=d_igg)) +
geom_point()+
geom_smooth(method = "lm")

> `geom_smooth()` using formula = 'y ~ x'

[Figure: scatter plot of d_igg (y-axis) against age (x-axis, 10 to 60) with a fitted straight line]

This graph is the same scatter plot with a fitted line added. The trend is increasing, and geom_smooth()
with method = "lm" draws the line of best fit from a linear model, together with its confidence band.

4. Multivariate graphs

We can include an additional categorical variable as a grouping variable in bivariate graphs. So, let's start
with the scatter plot of age and dengue IgG values drawn earlier and add location to it.

4.1. Scatter plot with categorical group

#Scatter plot with group
ggplot(data = data_dengu_igg,
mapping = aes(x=age,
y=d_igg,
color=loc))+
geom_point()

[Figure: scatter plot of d_igg (y-axis) against age (x-axis, 10 to 60), points coloured by loc (Rural settings, Urban settings)]

When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of
the aesthetic to each unique level of the variable (each of the location group), a process known as scaling.
ggplot2 will also add a legend that explains which values correspond to which levels.

4.2. Faceting

Another multivariate graph technique is faceting. In faceting, a graph consists of several separate plots or
small multiples, one for each level of a third variable, or combination of variables.
For example, let’s take the bar graph of dengue IgG results category and occupation from our dataset and
add locality as a further grouping. facet_wrap() can be used to divide the graph into one panel per
locality.

#Facet into a single row


ggplot(data_dengu_igg,
mapping = aes(x=igg_cat,
fill=ocp)) +
geom_bar(position = "dodge") +
facet_wrap(~loc, nrow = 1) #Divide the graph into small groups

[Figure: grouped bar chart of count by igg_cat, filled by ocp, in two side-by-side panels: Rural settings and Urban settings]

#Facet into a single column


ggplot(data_dengu_igg,
aes(x = ocp,
y = d_igg)) +
geom_boxplot() +
facet_wrap(~loc, ncol = 1) #Divide the graph into small groups

[Figure: box plots of d_igg by ocp in two stacked panels: Rural settings (top) and Urban settings (bottom)]

5. Preparing graphs for publication with ggplot2

Clear labelling and annotations: Instead of showing raw variable names on the x and y axes, we should give
the axes meaningful names. We should also give the graph a title and, where needed, alter the colours,
shapes, and axis ranges.
Additional options help to enhance the colours and labelling:

• geom_text adds value labels to the graph
• xlim and ylim set the limits of the axes
• scale_y_continuous modifies the y-axis tick mark labels
• labs provides a title and changes the labels for the x and y axes and the legend
• scale_fill_brewer changes the fill color scheme
• theme_minimal removes the grey background and changes the grid color
• theme adjusts the position of the legend and modifies the font style of the graph

Let’s try some of these options on the previous graph.

ggplot(data_dengu_igg,
       aes(x = loc,
           fill = igg_cat)) +
  geom_bar(position = "fill") +   #Stacked bar chart scaled to proportions
  labs(y = "Proportion",
       title = "Dengue IgG results category by locality wise") +
  #Label for y-axis; and title for the graph
  theme_minimal()   #Background theme without gray color.

[Figure: "Dengue IgG results category by locality wise": proportional stacked bar chart of igg_cat (Equivocal, Negative, Positive) by loc (Rural settings, Urban settings); y-axis: Proportion]
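
The other options listed at the start of this section (geom_text, scale_fill_brewer, theme) can be combined in the same way. The sketch below is illustrative only: the data frame toy is a hypothetical construction that holds the IgG-by-locality counts as pre-computed numbers, and the axis labels, palette and legend position are arbitrary choices.

```r
library(ggplot2)

# Hypothetical pre-counted data: one row per locality and IgG category
toy <- data.frame(
  loc     = rep(c("Rural settings", "Urban settings"), each = 3),
  igg_cat = rep(c("Equivocal", "Negative", "Positive"), times = 2),
  n       = c(13, 10, 12, 18, 5, 17)
)

p <- ggplot(toy, aes(x = loc, y = n, fill = igg_cat)) +
  geom_col(position = position_dodge(width = 0.9)) +
  # Value labels placed just above each bar
  geom_text(aes(label = n, group = igg_cat),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_fill_brewer(palette = "Set2") +          # ColorBrewer fill palette
  labs(x = "Locality",
       y = "Number of participants",
       fill = "Dengue IgG result",
       title = "Dengue IgG results by locality") +
  theme_minimal() +
  theme(legend.position = "bottom",              # Move the legend below the plot
        plot.title = element_text(face = "bold"))
p
```

geom_col() is used instead of geom_bar() here because the counts are already computed; with raw data, geom_bar() would count the rows itself.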

We can also distinguish the points by their shape in addition to their colour. Mapping the location
variable to the shape aesthetic gives each location group its own point shape.

ggplot(data = data_dengu_igg,
       mapping = aes(x = age,
                     y = d_igg,
                     color = loc,     #Color of the group
                     shape = loc)) +  #Provide the shape of the group
  geom_point()

[Figure: scatter plot of d_igg against age, with colour and point shape set by loc (Rural settings, Urban settings)]

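We can control the limits of the x-axis and y-axis by using xlim and ylim. Observations that fall outside
the limits are dropped from the plot, which is why ggplot2 prints the warning shown after the code.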

#Example for xlim and ylim
ggplot(data = data_dengu_igg,
       mapping = aes(x = age,
                     y = d_igg,
                     color = loc,
                     shape = loc)) +
  geom_point() +
  xlim(c(20, 45)) +   #X-axis (Age) value limits between 20 to 45
  ylim(c(0, 5))       #Y-axis (Dengue IgG) value limits between 0 to 5

> Warning: Removed 18 rows containing missing values (‘geom_point()‘).

[Figure: scatter plot of d_igg against age by loc, with the x-axis (age) limited to 20-45 and the y-axis (d_igg) limited to 0-5]

#Example for scale_y_continuous
ggplot(data = data_dengu_igg,
       mapping = aes(x = age,
                     y = d_igg,
                     color = loc,
                     shape = loc)) +
  geom_point() +
  scale_y_continuous(breaks = c(2.5, 3, 3.5, 4))   #Choose where the axis ticks appear

[Figure: scatter plot of d_igg against age by loc, with y-axis ticks at 2.5, 3.0, 3.5 and 4.0]

6. Saving graphs

Graphs are simple to export from RStudio. The Export option sits above the plot in the Plots panel in the
lower right corner (Plots panel -> Export -> Save as Image or Save as PDF). It lets us choose the
export format that we need; the recommended format for exporting is .jpeg.
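
Graphs can also be saved from code, which makes the export reproducible. The sketch below uses ggsave() from ggplot2; the toy data, the file name and the plot size are assumptions made for illustration, and the file is written to a temporary folder so that nothing in the working directory is overwritten.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy <- data.frame(age = c(22, 35, 41, 28), d_igg = c(2.1, 3.4, 2.8, 3.9))
p <- ggplot(toy, aes(x = age, y = d_igg)) +
  geom_point()

# The file extension decides the format (.pdf here; .jpeg or .png also work)
out_file <- file.path(tempdir(), "scatter_age_igg.pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)   # Size in inches
```

For image formats such as .jpeg or .png, the dpi argument of ggsave() sets the resolution.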

Chapter 4 Statistical reports generation

1. Summary statistics for publication

R offers several packages that make it efficient to generate publication-ready tables with demographic
information and statistical test results. These packages allow you to create visually appealing and
customizable tables directly from your R code.

2. The gtsummary package

The {gtsummary} package provides an elegant and flexible way to create publication-ready analytical and
summary tables. The {gtsummary} package summarizes data sets, regression models, and more, using
sensible defaults with highly customizable capabilities.

2.1. Install gtsummary

To install the gtsummary package, run install.packages("gtsummary"). Load it in each session with library(gtsummary).

2.2. Summary table

Use tbl_summary() to summarize a data frame.

Let’s begin with the data that we used for the graphs. Suppose the main aim is to compare the
dengue IgG results across several baseline characteristics.
We start with a basic information table.

library(gtsummary)
library(labelled)   #Provides to_factor()

#Convert labelled variables to factors
data_dengu_igg <- data_dengu_igg %>% to_factor()

#Preparing summary table for selected variables
data_dengu_igg %>%
  tbl_summary(include = c(age_group, gender, loc, ocp, igg_cat))

Characteristic            N = 75
Age group
    00-10                 0 (0%)
    11-20                 4 (5.3%)
    21-30                 25 (33%)
    30+                   46 (61%)
Sex at birth
    Male                  31 (41%)
    Female                44 (59%)
Locality
    Rural settings        35 (47%)
    Urban settings        40 (53%)
Occupational status
    Daily wages           23 (31%)
    Farmer                16 (21%)
    Government employee   20 (27%)
    None employed         6 (8.0%)
    Shop keeper           10 (13%)
Dengue IgG results
    Equivocal             31 (41%)
    Negative              15 (20%)
    Positive              29 (39%)

#To create summary table

Let us obtain a summary table for age group, gender, location and occupation by dengue IgG category.
Several customization options are available: for example, missing = "no" leaves out the rows that count
missing observations, and add_n() adds a column with the overall count for each variable.

data_dengu_igg %>%
  tbl_summary(by = igg_cat,
              include = c(age_group, gender, loc, ocp),
              missing = "no") %>%
  add_n()   #Adding overall count in the table

Characteristic            N    Equivocal, N = 31   Negative, N = 15   Positive, N = 29
Age group                 75
    00-10                      0 (0%)              0 (0%)             0 (0%)
    11-20                      0 (0%)              4 (27%)            0 (0%)
    21-30                      14 (45%)            11 (73%)           0 (0%)
    30+                        17 (55%)            0 (0%)             29 (100%)
Sex at birth              75
    Male                       10 (32%)            6 (40%)            15 (52%)
    Female                     21 (68%)            9 (60%)            14 (48%)
Locality                  75
    Rural settings             13 (42%)            10 (67%)           12 (41%)
    Urban settings             18 (58%)            5 (33%)            17 (59%)
Occupational status       75
    Daily wages                12 (39%)            5 (33%)            6 (21%)
    Farmer                     6 (19%)             4 (27%)            6 (21%)
    Government employee        9 (29%)             2 (13%)            9 (31%)
    None employed              1 (3.2%)            1 (6.7%)           4 (14%)
    Shop keeper                3 (9.7%)            3 (20%)            4 (14%)

Note: By default, gtsummary treats Yes/No, TRUE/FALSE and 0/1 variables as dichotomous and therefore
presents only one row (for example, Yes) in the output table. To show both levels in the output, set
the type argument so that the variables are treated as categorical and not dichotomous.
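
A minimal sketch of this setting is given below. The data frame toy and its variable diabetes are hypothetical and are not part of the workshop dataset.

```r
library(gtsummary)

# Hypothetical Yes/No variable (made-up values)
toy <- data.frame(
  diabetes = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")
)

# Treat the Yes/No variable as categorical so that both levels are displayed
tbl <- tbl_summary(toy,
                   include = diabetes,
                   type = list(diabetes ~ "categorical"))
tbl
```

Without the type argument, the same call would be expected to show a single row for the Yes level only.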

3. Statistical tests and outputs

gtsummary also allows us to perform a few commonly used statistical tests and to export the results as
tables.

3.1. Comparing two groups

For example, let’s do a t-test to see if the average level of hemoglobin differs by NS1 status.

data_dengu_igg %>%
  tbl_summary(by = ns1, include = hem) %>%
  add_p(~ "t.test")   #t-test comparing hemoglobin between the two NS1 groups

Characteristic    Absent, N = 33       Present, N = 42      p-value
Hemoglobin        6.00 (5.40, 6.80)    5.50 (5.23, 5.98)    0.057

#Use "wilcox.test" instead of "t.test" if the observations are non-normal
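
To see what the two test names refer to, the same comparison can be sketched with the base R functions t.test() and wilcox.test(). The haemoglobin values below are made up for illustration and are not the workshop data.

```r
# Hypothetical haemoglobin values for two NS1 groups (made-up numbers)
toy <- data.frame(
  hem = c(6.1, 5.8, 6.9, 6.4, 7.0, 5.2, 5.5, 5.9, 5.4, 5.1),
  ns1 = factor(rep(c("Absent", "Present"), each = 5))
)

# Welch two-sample t-test: compares the group means
t_res <- t.test(hem ~ ns1, data = toy)

# Wilcoxon rank sum test: rank-based alternative for non-normal data
w_res <- wilcox.test(hem ~ ns1, data = toy)

t_res$p.value
w_res$p.value
```

Note that tbl_summary() reports the median and quartiles for continuous variables by default, whereas a t-test compares means, so the summary statistic shown in the table and the test do not describe the same quantity unless the statistic is changed.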

Let us add other variables, such as age and IgG levels, to see if their average levels differ by NS1 status.

data_dengu_igg %>%
  tbl_summary(by = ns1, include = c(age, hem, d_igg)) %>%
  add_p(~ "t.test")

Characteristic    Absent, N = 33       Present, N = 42      p-value
Age (in years)    34 (30, 43)          32 (27, 44)          0.3
Hemoglobin        6.00 (5.40, 6.80)    5.50 (5.23, 5.98)    0.057
Dengue IgG        2.51 (2.30, 3.31)    2.35 (1.65, 3.20)    0.3

3.2. Comparing more than 2 groups

If there are more than two groups, we use the ANOVA test for normally distributed data and the
Kruskal-Wallis test for non-normal data.
The syntax is similar to the above, but the name of the test must be changed within the add_p() function.
For example, to compare hemoglobin values across the three categories of IgG status:

data_dengu_igg %>%
  tbl_summary(by = igg_cat, include = hem) %>%
  add_p(~ "aov")   #Comparing the hemoglobin for various IgG categories

Characteristic    Equivocal, N = 31    Negative, N = 15     Positive, N = 29     p-value
Hemoglobin        5.50 (5.30, 6.65)    5.90 (5.25, 7.30)    5.69 (5.40, 6.80)    0.3

#Use "kruskal.test" instead of "aov", if the observations are non-normal.
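
The two tests can also be run directly with the base R functions aov() and kruskal.test(). The sketch below uses made-up haemoglobin values for three hypothetical groups of four observations each.

```r
# Hypothetical haemoglobin values for three IgG categories (made-up numbers)
toy <- data.frame(
  hem     = c(5.5, 5.9, 6.2, 5.1, 6.8, 7.1, 6.4, 6.0, 5.3, 5.7, 6.6, 5.8),
  igg_cat = factor(rep(c("Equivocal", "Negative", "Positive"), each = 4))
)

# One-way ANOVA: compares the group means
aov_res <- summary(aov(hem ~ igg_cat, data = toy))
aov_p   <- aov_res[[1]][["Pr(>F)"]][1]   # p-value for the igg_cat term

# Kruskal-Wallis test: rank-based alternative for non-normal data
kw_res <- kruskal.test(hem ~ igg_cat, data = toy)

aov_p
kw_res$p.value
```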

3.3. Association between categorical groups

To test for significant differences or associations between categorical groups, the Chi-square test is used.
For example, we may want to apply Chi-square tests to assess whether occupation or location is associated
with IgG positivity.

data_dengu_igg %>%
  tbl_summary(by = igg_cat, include = c(ocp, loc)) %>%
  add_p(~ "chisq.test")   #Comparing association between two categorical variables

Characteristic            Equivocal, N = 31   Negative, N = 15   Positive, N = 29   p-value
Occupational status                                                                 0.6
    Daily wages           12 (39%)            5 (33%)            6 (21%)
    Farmer                6 (19%)             4 (27%)            6 (21%)
    Government employee   9 (29%)             2 (13%)            9 (31%)
    None employed         1 (3.2%)            1 (6.7%)           4 (14%)
    Shop keeper           3 (9.7%)            3 (20%)            4 (14%)
Locality                                                                            0.2
    Rural settings        13 (42%)            10 (67%)           12 (41%)
    Urban settings        18 (58%)            5 (33%)            17 (59%)

3.4. Summary statistics with statistical tests

In some tables, one may want to add results from statistical tests next to summary or descriptive statistics.
This can be done as follows:

data_dengu_igg %>%
  tbl_summary(by = igg_cat, include = c(age_group, gender, loc, ocp)) %>%
  add_n() %>%
  add_p()   #Here, we do not mention the test

Characteristic            N    Equivocal, N = 31   Negative, N = 15   Positive, N = 29   p-value
Age group                 75                                                             <0.001
    00-10                      0 (0%)              0 (0%)             0 (0%)
    11-20                      0 (0%)              4 (27%)            0 (0%)
    21-30                      14 (45%)            11 (73%)           0 (0%)
    30+                        17 (55%)            0 (0%)             29 (100%)
Sex at birth              75                                                             0.3
    Male                       10 (32%)            6 (40%)            15 (52%)
    Female                     21 (68%)            9 (60%)            14 (48%)
Locality                  75                                                             0.2
    Rural settings             13 (42%)            10 (67%)           12 (41%)
    Urban settings             18 (58%)            5 (33%)            17 (59%)
Occupational status       75                                                             0.6
    Daily wages                12 (39%)            5 (33%)            6 (21%)
    Farmer                     6 (19%)             4 (27%)            6 (21%)
    Government employee        9 (29%)             2 (13%)            9 (31%)
    None employed              1 (3.2%)            1 (6.7%)           4 (14%)
    Shop keeper                3 (9.7%)            3 (20%)            4 (14%)

If we omit the test name, add_p() chooses a test automatically based on the type of variable: the Wilcoxon
rank sum test for a continuous variable compared between two groups, the Kruskal-Wallis test for a
continuous variable compared across more than two groups, and Pearson’s Chi-squared test or Fisher’s
exact test for categorical variables.
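
The choice between the Chi-squared test and Fisher’s exact test depends on the expected cell counts; as far as the gtsummary documentation describes it, Fisher’s exact test is used when any expected count is below 5. The sketch below checks this by hand with base R, using the locality by IgG category counts from the tables above. The object names are chosen for illustration.

```r
# Locality by IgG category counts, copied from the summary table above
tab <- matrix(c(13, 10, 12,
                18,  5, 17),
              nrow = 2, byrow = TRUE,
              dimnames = list(loc     = c("Rural settings", "Urban settings"),
                              igg_cat = c("Equivocal", "Negative", "Positive")))

# Pearson's Chi-squared test without continuity correction
chi_res <- chisq.test(tab, correct = FALSE)

# Expected counts under independence; small values favour Fisher's exact test
min_expected <- min(chi_res$expected)

# Fisher's exact test, the alternative when expected counts are small
fis_res <- fisher.test(tab)

min_expected
chi_res$p.value
fis_res$p.value
```

Here all expected counts are at least 5, so the Chi-squared test is appropriate, and its p-value rounds to the 0.2 reported for Locality in the table above.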

4. Merge two or more gtsummary objects

It is also possible to merge two different tables into a single table. For example, we may want to merge
summary statistics for two different outcomes (say, IgG and NS1) in a single table.

table_1 <- data_dengu_igg %>%
  tbl_summary(include = c(ocp, edu),
              by = igg_cat,
              missing = "no") %>%
  add_n()

table_2 <- data_dengu_igg %>%
  tbl_summary(include = c(ocp, edu),
              by = ns1,
              missing = "no") %>%
  add_n()

tbl_merge(tbls = list(table_1, table_2),   #Merge two tables as list format
          tab_spanner = c("**Dengue IgG results**", "**NS1 bio markers**"))

                               Dengue IgG results                                         NS1 bio markers
Characteristic            N    Equivocal, N = 31   Negative, N = 15   Positive, N = 29   N    Absent, N = 33   Present, N = 42
Occupational status       75                                                             75
    Daily wages                12 (39%)            5 (33%)            6 (21%)                 12 (36%)         11 (26%)
    Farmer                     6 (19%)             4 (27%)            6 (21%)                 8 (24%)          8 (19%)
    Government employee        9 (29%)             2 (13%)            9 (31%)                 6 (18%)          14 (33%)
    None employed              1 (3.2%)            1 (6.7%)           4 (14%)                 2 (6.1%)         4 (9.5%)
    Shop keeper                3 (9.7%)            3 (20%)            4 (14%)                 5 (15%)          5 (12%)
Educational status        75                                                             75
    Basic education            8 (26%)             9 (60%)            9 (31%)                 12 (36%)         14 (33%)
    Higher or University       7 (23%)             1 (6.7%)           5 (17%)                 7 (21%)          6 (14%)
    Illiterate                 5 (16%)             3 (20%)            11 (38%)                6 (18%)          13 (31%)
    Secondary education        11 (35%)            2 (13%)            4 (14%)                 8 (24%)          9 (21%)

5. Formatting the gtsummary table

gtsummary provides a high level of customization, allowing you to create publication-ready tables for research
papers. Here’s how you can add titles, bold headers and footers, and change the theme of the table:

data_dengu_igg %>%
  tbl_summary(by = igg_cat,
              include = c(ocp, loc)) %>%
  add_n() %>%
  modify_header(label = "**Variable**") %>%
  #To change the header of the table
  modify_footnote(all_stat_cols() ~ "n (%)") %>%
  #To change the footnote of the table
  modify_caption("**Patient Characteristics** (N = {N})")
  #To add a caption to the table

Table 11: Patient Characteristics (N = 75)

Variable                  N    Equivocal, N = 31   Negative, N = 15   Positive, N = 29
Occupational status       75
    Daily wages                12 (39%)            5 (33%)            6 (21%)
    Farmer                     6 (19%)             4 (27%)            6 (21%)
    Government employee        9 (29%)             2 (13%)            9 (31%)
    None employed              1 (3.2%)            1 (6.7%)           4 (14%)
    Shop keeper                3 (9.7%)            3 (20%)            4 (14%)
Locality                  75
    Rural settings             13 (42%)            10 (67%)           12 (41%)
    Urban settings             18 (58%)            5 (33%)            17 (59%)

6. Export gtsummary table

A gtsummary table is not a standard data frame, so exporting the summary table to a Word document
requires an additional step. The gt package, which is designed for creating highly customizable tables,
can be used to export the gtsummary summary table to a Word document.
Syntax for exporting the table to a Word document: table %>% as_gt() %>% gtsave(filename="__________.docx")
For example, to export the above table to a Word document:

data_dengu_igg %>%
  tbl_summary(by = igg_cat,
              include = c(gender, age_group, loc, ocp)) %>%
  add_n() %>%
  as_gt() %>%   #Convert the table to gt format
  gt::gtsave(filename = "Table 1.docx")
  #Save the file as Table 1 word document; gtsave() comes from the gt package

Challenges:

Day - 1 (Session - 1)

Challenge - I

Create vectors named height and weight using the following data:
height : 160.3, 134.2, 159, 149, 145, and 147.1
weight : 83.8, 37.2, 71.7, 72.8, 50.5, and 42.9.

i) Based on the above vectors, answer the following questions:

a) The average height is __________


b) The variance of height is __________

c) The SD of height is __________

d) The average weight is __________


e) The variance of weight is __________

f) The SD of weight is __________

ii) Extract the 4th element in weight and height vector

a) 4th element in weight __________


b) 4th element in height __________

iii) Based on the above vector, calculate BMI

a) Calculate BMI __________


b) Extract the 4th element in BMI vector __________

Challenge - II

Create a matrix using the following table, and answer the following questions using matrix operations

a) The total number of smokers __________.

(Hint rowSums(___))

b) The total number of non smokers __________.

(Hint rowSums(___))

c) Incidence of CHD among smokers ( A/(A+B) ) __________.

d) Incidence of CHD among non smokers ( C/(C+D) ) __________.

e) Risk ratio of CHD ( [A/(A+B)] / [C/(C+D)] ) __________.

Challenge - III

Represent the following table using array, and answer the following using array operations

a) In rural, incidence of CHD among smokers __________.

b) In rural, incidence of CHD among non smokers __________.

c) In rural, what is the risk ratio of CHD __________.

d) In urban, incidence of CHD among smokers __________.

e) In urban, incidence of CHD among non smokers __________.

f) In urban, risk ratio of CHD __________.

Challenge - IV

4) Create a list that contains results of overall risk ratio (Challenge II), rural risk ratio (Challenge IIIc)
and urban risk ratio (challenge IIIf)

Challenge - V

5) Write an R program to create a data frame for the following data:

unique_id vector C10001, C10002, and C10003;
treatment vector A, B and C;
age vector 29, 30 and 28.

Then extract the entire 3rd row.

Challenge - VI

Find the error in each of the following pieces of R code

a) temp <- c(99.4, 102.3; 100.3)

b) Suppose mat is a 2x2 matrix. To extract the element in the 2nd row and 1st column, will the command mat(2;1) work?

c) hba1c% <- c(16.4, 11.0, 10.3, 12.4)

d) vector <- c(13, 7A, 11, 30)

e) R command to view the last 6 rows of dataframe df is str(df).

Day - 1 (Session - 2)

In this hypothetical study, data from 25 individuals have been collected to explore the relationship between
demographic factors, systolic blood pressure, hypertension, and the effectiveness of two types of drugs, A
and B.

Let’s work through these questions to carry out the data cleaning process.

1) Import the exercise data from the directory (File name is Exercise_data-Day1.csv)

i) How many variables are there in the datasheet? __________ (Hint ____ %>% dim())

ii) The datasheet has how many observations? __________

2) Give the variables new names as the following (Hint ___ %>% rename())

i) “Height.in.cms” as height

ii) “Weight.in.kgs” as weight

iii) “Type.of.drug.given” as drugType

3) Give the variables labels as the following (Hint ___ %>% set_variable_labels())

i) height as Height (in Cms)

ii) weight as Weight (in Kgs)

iii) drugType as Type of Drug

4) Recode the values of the following variables (Hint ___ %>% recode())

i) Hypertension, Yes=1 and No=0

ii) Gender, Male=1 and Female=2

5) Assign value labels to the following (Hint ___ %>% set_value_labels())

i) In Hypertension 0 as “NO” and 1 as “YES”

ii) In Gender 1 as “Male” and 2 as “Female”

6) How many people participated in the study from urban? __________ (Hint ___ %>% filter( ))

7) How many individuals took drug A? __________ (Hint ___ %>% filter( ))

8) How many individuals took drug B? __________ (Hint ___ %>% filter( ))

9) Find the duplicates. How many duplicate pairs did you find? __________

(Hint ___ %>% filter(duplicated(-----)))

10) Find the missing data for the variable Systolic Blood pressure (mmHg). (Hint filter(is.na(-----)))

How many missing values were discovered?__________

11) Identify the outliers in Systolic Blood Pressure (mmHg). (Hint use the range 80-160)

How many outliers were found? __________

12) Prepare summary table by drug type for diastolic blood pressure with count, mean and median, and
SD (Hint ___ %>% group_by(___) %>% summarise(___))

i) What is the mean diastolic blood pressure for A __________

ii) What is the median diastolic blood pressure for B __________

Day - 2 (Session - 1)

Let us create some data visualizations to understand how drug is effective in treatment of blood pressure,
and see if there are any baseline differences, and differences in outcomes - hypertension, systolic and diastolic
BP.

Import exercise data (Exercise_data-Day2.csv) from the directory.

1. Use the ggplot2 package to plot the bar graph for hypertension response (Univariate bar graph). Which
response has the most frequency? __________

2. Could you add drug type in the bar chart for hypertension? (Bivariate grouped bar chart). How many
people who indicated they had hypertension also took drug A? __________

(Hint ggplot(aes(x=____, fill=____)))

3. Could you now add the dwelling type to the previous bar graph? To include the location in the
bar graph, use the facet_wrap() function. What type of distribution does the graph look like in the
large city? __________

(Hint facet_wrap(~____))

4. Draw a density chart for systolic blood pressure (Univariate chart). What type of distribution does
the graph look like? __________

a) Right skewed-distribution

b) Left skewed-distribution
c) Normal distribution
d) Uniform distribution

5. Create a box plot to represent systolic blood pressure by drug type (Bivariate box plot). What is the
median blood pressure for each drug type? __________

6. Using facet_wrap(), add the type of dwelling to the previous graph. Which sort of dwelling has the
highest blood pressure when using drug B? __________

7. Use a scatter chart to plot the graph for systolic and diastolic pressure (Bivariate graph).

What is the relationship between systolic and diastolic blood pressure? __________

a) No association
b) Positive association

c) Negative association

Day - 2 (Session - 2)

Create summary tables for the following conditions. Then, fill in the blanks.

Import exercise data (Exercise_data-Day2.csv) from the directory.

1) Prepare summary statistics for the variables, sex, dwelling, drugType.

Variable n(%)
Gender

- Male __________

- Female __________

Location

- Small city __________

- Large city __________

- Town __________

Drug type

- Type A __________

- Type B __________

2) Prepare summary statistics for the following variables by type of drug: sex, dwelling, hyper. Include
statistical tests.

Variable Type A Type B p-value


Gender __________

- Male __________ __________

- Female __________ __________

Location __________

- Small city __________ __________

- Large city __________ __________

- Town __________ __________

Hypertension __________

- Yes __________ __________

- No __________ __________

3) Prepare the summary statistics for the numerical vectors systolic and diastolic blood pressure by drug
type. Include statistical tests.

Variable Type A Type B p-value

Systolic BP __________ __________ __________

Diastolic BP __________ __________ __________
