
1.importing Data From External Files

The document provides a comprehensive guide on importing and exporting data in R, covering methods for text files, databases, and handling missing data. It details functions for reading and writing data, including read.table, write.table, and database connection packages like RODBC and DBI. Additionally, it discusses data cleaning techniques, such as identifying and removing missing values, and the importance of data transformation.

Uploaded by

neithhel

MODULE 2

Importing data from Text files and other software, Exporting data, importing data from
databases- Database Connection packages, Missing Data - NA, NULL. Combining data
sets, Transformations, Binning Data, Subsets, summarizing functions. Data Cleaning,
Finding and removing Duplicates, Sorting.

1. Importing data from external files

R can import data from text files, other statistics software, and even spreadsheets. You don’t even
need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you
over the Internet.

1.1 Text Files
Most text files containing data are formatted similarly: each line of a text file represents an
observation (or record).
Each line contains a set of different variables associated with that observation. Sometimes,
different variables are separated by a special character called the delimiter.
Other times, variables are differentiated by their location on each line.

1.1.1 Delimited files


R includes a family of functions for importing delimited text files into R,
based on the read.table function:
read.table(file, header, sep = , quote = , dec = , row.names, col.names, as.is = , na.strings ,
colClasses , nrows =, skip = , check.names = , fill = , strip.white = , blank.lines.skip = ,
comment.char = , allowEscapes = , flush = , stringsAsFactors = , encoding = )

• The read.table function reads a text file into R and returns a data.frame object.
• Each row in the input file is interpreted as an observation.
• Each column in the input file represents a variable.
• The read.table function expects each field to be separated by a delimiter.
For example, suppose that you had a file called top.5.salaries.csv that contained the
following text (and only this text):
name.last,name.first,team,position,salary
"Manning","Peyton","Colts","QB",18700000
"Brady","Tom","Patriots","QB",14626720
"Pepper","Julius","Panthers","DE",14137500
"Palmer","Carson","Bengals","QB",13980000
"Manning","Eli","Giants","QB",12916666

The data is encoded as follows:


• The first row contains the column names.
• Each text field is encapsulated in quotes.
• Each field is separated by commas.
To load this file into R, you would specify that the first row contained column names
(header=TRUE), that the delimiter was a comma (sep=","), and that quotes were used
to encapsulate text (quote="\""). Here is an R statement that loads in this file:

top.5.salaries <- read.table("top.5.salaries.csv", header=TRUE, sep=",", quote="\"")

The most important options are sep and header.

In most cases, you will find that you can use read.csv for comma-separated files or read.delim
for tab-delimited files without specifying any other options.
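
As a minimal sketch of the two approaches described above, the following writes a small comma-separated file to a temporary location and reads it back both ways. The column names (last_name, first_name) and the temporary path are assumptions made for illustration only.

```r
# Write a small sample file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'last_name,first_name,team,position,salary',
  '"Manning","Peyton","Colts","QB",18700000',
  '"Brady","Tom","Patriots","QB",14626720'
), csv_path)

# Spell out the options with read.table ...
salaries_a <- read.table(csv_path, header = TRUE, sep = ",", quote = "\"")

# ... or rely on read.csv, whose defaults already assume a header and commas.
salaries_b <- read.csv(csv_path)

str(salaries_a)
```

For a simple file like this one, both calls should produce the same data frame; read.csv is just the shorter way to write it.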

1.1.2. Fixed-width files
To read a fixed-width format text file into a data frame, you can use the
read.fwf function:
read.fwf(file, widths, header = , sep = , skip = , row.names, col.names, n = , buffersize = , ...)

Note that read.fwf can also take many arguments used by read.table, including as.is,
na.strings, colClasses, and strip.white.
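
A small sketch of reading a fixed-width file follows. The file layout (a 10-character name, a 2-character position, an 8-character salary) and the column names are assumptions chosen for illustration.

```r
# Each line: name padded to 10 characters, position (2), salary (8).
fwf_path <- tempfile(fileext = ".txt")
writeLines(c(
  "Brady     QB14626720",
  "Manning   QB18700000"
), fwf_path)

# widths gives the number of characters in each field, in order.
# strip.white is passed through to read.table and trims the padding.
players <- read.fwf(fwf_path,
                    widths = c(10, 2, 8),
                    col.names = c("name", "position", "salary"),
                    strip.white = TRUE)
print(players)
```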

1.1.3. Other functions to parse data


• We load text files into R with the read.table function.
• Sometimes, we might be provided with a file that cannot be read correctly with
this function.
• For example, observations in the file might span multiple lines.
• To read data into R one line at a time, use the function readLines:
readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown")
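
As a brief illustration, readLines returns a character vector with one element per line, and n limits how many lines are read. The temporary file and its contents are made up for this sketch.

```r
# Create a three-line file to read back.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), txt_path)

all_lines <- readLines(txt_path)         # every line in the file
first_two <- readLines(txt_path, n = 2)  # only the first two lines

length(all_lines)
first_two
```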

1.1.4. scan
Another useful function for reading more complex file formats is scan:
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "", skip = 0, nlines = 0,
na.strings = "NA", blank.lines.skip = TRUE, allowEscapes = FALSE, encoding = "unknown")
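
The sketch below shows how the what argument controls the type of value scan returns. To keep it self-contained it reads from literal strings through the text argument instead of a file; the sample values are invented.

```r
# Default: whitespace-separated numbers become a numeric vector.
nums <- scan(text = "1.5 2.5 4", quiet = TRUE)

# what = character() reads strings; sep sets the delimiter.
words <- scan(text = "a,b,c", what = character(), sep = ",", quiet = TRUE)

# A list for what describes one record with several fields.
people <- scan(text = "Alice 30\nBob 25",
               what = list(name = "", age = 0), quiet = TRUE)
people$name
people$age
```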

Example: CSV files typically have one record per line. If a quoted field contains a line
break, however, a single observation spans multiple lines, and the file structure is less
straightforward, as in the following file:
Name, Salary
John, 50000
"Alice,
Cooper",60000
Bob, 55000
"Emma,
Watson", 62000
Michael, 58000

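In this case, the quoted names "Alice, Cooper" and "Emma, Watson" each span two lines.
Suppose the file is saved as employees.csv (a placeholder name); reading it with read.table:
employees <- read.table("employees.csv", header=TRUE, sep=",", quote="\"")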

OUTPUT

Name Salary
1 John 50000
2 Alice,\n 60000
3 Bob 55000
4 Emma,\n 62000
5 Michael 58000

1.2 Other Software

2. Exporting Data
• R can also export R data objects (usually data frames and matrices) as text files.

• To export data to a text file, use the write.table function:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n", na =
"NA", dec = ".", row.names = TRUE, col.names = TRUE)

There are wrapper functions for write.table that call write.table with different defaults.
These are useful if you want to create a file of comma-separated values, for example, to
import into Microsoft Excel:

write.csv(...)
write.csv2(...)
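
As a rough sketch of the export step, the following writes a small data frame with write.csv and reads it back to confirm that the round trip preserves the values. The data frame and the temporary path are invented for illustration.

```r
scores <- data.frame(id = 1:3, score = c(90.5, NA, 72))

out_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(scores, out_path, row.names = FALSE)

# Missing values are written as NA and read back as NA.
scores_back <- read.csv(out_path)
print(scores_back)
```

Leaving row.names at its default of TRUE would add a leading unnamed column to the file, which is usually not wanted when the file is meant for a spreadsheet.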

3. Importing Data From Databases

• Importing data from databases into R involves several steps that allow you to
connect to a database, retrieve data, and bring it into R for analysis.
• This process is essential for data analysis workflows, as databases often store large
volumes of structured data.
• Here’s a comprehensive guide to understanding and performing data importation
from databases in R.

3.1. Export Then Import

• One of the best approaches for working with data from a database is to export the
data to a text file and then import the text file into R.
• When dealing with very large data sets, you can often import data into R
at a much faster rate from text files than from database connections.
• If you plan to extract a large amount of data once and then analyze it, this approach
is often best.
• If you are using R to produce regular reports or to repeat an analysis many times, then it
might be better to import data into R directly through a database connection.

3.2. Database Connection Packages


In order to connect directly to a database from R, you will need to install some optional
packages.
The packages you need depend on the database(s) to which you want to connect and the
connection method you want to use.
There are two sets of database interfaces available in R:
3.2.1. RODBC.
The RODBC package allows R to fetch data from ODBC (Open DataBase Connectivity)
connections.
ODBC provides a standard interface for different programs to connect to databases.
3.2.2. DBI.
The DBI package allows R to connect to databases using native database drivers or JDBC drivers.
This package provides a common database abstraction for R software.
You must install additional packages to use the native drivers for each database.
3.3. Importing Data from Databases Using RODBC
The `RODBC` package in R provides a robust interface for connecting to and interacting with
databases that comply with the Open Database Connectivity (ODBC) standard. This package
allows you to establish a connection to various database management systems (DBMS),
execute SQL queries, fetch data, and perform data manipulation tasks.
Key Concepts of RODBC
- ODBC (Open Database Connectivity): A standard API for accessing different DBMS.
ODBC drivers are available for many database systems, including SQL Server, MySQL,
Oracle, and others.
- DSN (Data Source Name): A data structure that contains the information about a specific
database connection.
3.3.1. Steps to Import Data Using RODBC
1. Install and Load the RODBC Package
install.packages("RODBC")
library(RODBC)
2. Set Up ODBC DSN
Set up an ODBC Data Source Name (DSN) for your database. This can be done through
your operating system’s ODBC Data Source Administrator tool.
3. Connect to the Database

Establish a connection to the database using the `odbcConnect` function, specifying the
DSN, user ID, and password.
conn <- odbcConnect("DSN_name", uid="your_username", pwd="your_password")
4. List Available Tables
You can list all the tables in the connected database to get an overview of its structure.
tables <- sqlTables(conn)
print(tables)
5. Fetch Data from a Table
Use the `sqlFetch` function to import an entire table into R.
data <- sqlFetch(conn, "your_table_name")
6. Execute SQL Queries
To run specific SQL queries and retrieve the results, use the `sqlQuery` function.
query_result <- sqlQuery(conn, "SELECT * FROM your_table_name WHERE
some_column = 'some_value'")
7. Perform Data Transformation and Cleaning
After importing the data, you might need to clean and transform it using functions from
packages like `dplyr` or `tidyverse`.
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)

8. Close the Connection


Always close the database connection when you are done to free up resources.
close(conn)
Example Workflow

The following example workflow demonstrates how to use RODBC to connect to a database,
retrieve data, and perform some basic operations.

# Install and load the RODBC package
install.packages("RODBC")
library(RODBC)

# Establish connection to the database
conn <- odbcConnect("DSN_name", uid="your_username", pwd="your_password")

# List tables in the database
tables <- sqlTables(conn)
print(tables)

# Fetch data from a specific table
data <- sqlFetch(conn, "your_table_name")

# Run a specific query to retrieve data
query_result <- sqlQuery(conn, "SELECT * FROM your_table_name WHERE some_column
= 'some_value'")

# Perform data cleaning and transformation
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)

# Close the database connection
close(conn)

# View the cleaned data
print(cleaned_data)

Advantages of Using RODBC


• Flexibility: Supports a wide range of databases via ODBC drivers.
• SQL Integration: Allows direct execution of SQL queries, leveraging the power of SQL
for data manipulation.
• Interoperability: Can be used on different operating systems, making it versatile for
various deployment environments.

3.4 Importing Data from Databases Using DBI

• The `DBI` package in R provides a standardized interface for communication with
various database management systems (DBMS).
• It abstracts the specifics of different databases, allowing users to interact with
databases in a consistent manner.
• When combined with database-specific backends, `DBI` enables powerful and
flexible data importation and manipulation capabilities.
• DBI Interface: A set of functions for database operations that are consistent across
different DBMS.
• Database Backend: Specific packages that implement the DBI interface for particular
databases, such as `RMySQL`, `RPostgreSQL`, and `RSQLite`.

3.4.1. Steps to Import Data Using DBI

1. Install and Load Necessary Packages


Depending on the database you are working with, install and load `DBI` along with the
relevant backend package. For instance, if you are connecting to a PostgreSQL database:
install.packages("DBI")
install.packages("RPostgreSQL")
library(DBI)
library(RPostgreSQL)

2. Establish a Connection
Create a connection to the database using the `dbConnect` function from the specific
backend package.
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "your_database_name",
host = "your_host", port = 5432, user = "your_username", password =
"your_password")

3. List Available Tables


You can list all the tables in the connected database using `dbListTables`.
tables <- dbListTables(con)
print(tables)

4. Fetch Data from a Table


Use `dbReadTable` to import an entire table into R.
data <- dbReadTable(con, "your_table_name")

5. Execute SQL Queries


To run specific SQL queries and retrieve the results, use `dbGetQuery`.
query_result <- dbGetQuery(con, "SELECT * FROM your_table_name WHERE
some_column = 'some_value'")

6. Perform Data Transformation and Cleaning

After importing the data, you might need to clean and transform it using functions from
packages like `dplyr` or `tidyverse`.
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)

7. Close the Connection


Always close the database connection when you are done to free up resources.
dbDisconnect(con)

Example
# Install and load the necessary packages
install.packages("DBI")
install.packages("RPostgreSQL")

library(DBI)
library(RPostgreSQL)

# Establish connection to the database
con <- dbConnect(RPostgreSQL::PostgreSQL(),
dbname = "your_database_name",
host = "your_host",
port = 5432,
user = "your_username",
password = "your_password")

# List tables in the database
tables <- dbListTables(con)
print(tables)

# Fetch data from a specific table
data <- dbReadTable(con, "your_table_name")

# Run a specific query to retrieve data
query_result <- dbGetQuery(con, "SELECT * FROM your_table_name WHERE
some_column = 'some_value'")

# Perform data cleaning and transformation
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)

# Close the database connection
dbDisconnect(con)

# View the cleaned data
print(cleaned_data)
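
The workflow above needs a running database server. To try the same DBI calls without one, a sketch using an in-memory SQLite database is shown below. It assumes the RSQLite backend is installed (it is not the backend used in the example above), and the table name players and its contents are invented for illustration. The query passes its filter value through params rather than pasting it into the SQL string.

```r
# Runs only if both packages are available; otherwise it is skipped.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # ":memory:" creates a temporary database that disappears on disconnect.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Hypothetical table used only for this sketch.
  players <- data.frame(name = c("Brady", "Pepper", "Favre"),
                        position = c("QB", "DE", "QB"),
                        salary = c(14626720, 14137500, 12800000))
  DBI::dbWriteTable(con, "players", players)

  print(DBI::dbListTables(con))

  # Parameterized query: the "?" placeholder is filled from params.
  qbs <- DBI::dbGetQuery(con,
                         "SELECT name, salary FROM players WHERE position = ?",
                         params = list("QB"))
  print(qbs)

  DBI::dbDisconnect(con)
}
```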

Advantages of Using DBI

• Consistency: Provides a consistent interface for different databases, making it easier to
switch between DBMS.
• Flexibility: Can be combined with various backend packages to connect to a wide range
of databases.
• Efficiency: Supports efficient data operations and query execution.

3.4.2 TSDBI

• TSDBI is an interface specifically designed for time series data.


• There are TSDBI packages for many popular databases

4. MISSING DATA

Dealing with missing values is an important part of data preprocessing and analysis in R.
Missing values, often represented as NA (Not Available) in R, can occur due to various
reasons such as data entry errors, sensor failures, or incomplete observations. Here are
some common ways to handle missing values in R:

1. Identify Missing Values


You can use the `is.na()` function to check for missing values in a vector, matrix, or data
frame. For example:

# Check for missing values in a vector
x <- c(1, 2, NA, 4, 5)
is.na(x) # Returns TRUE for missing values

# Check for missing values in a data frame
data <- data.frame(x = c(1, 2, NA, 4), y = c(5, NA, 7, 8))
is.na(data) # Returns a matrix of TRUE/FALSE for each element

2. Remove Missing Values

If the number of missing values is relatively small and their removal does not significantly
affect the analysis, you can remove them using the `na.omit()` function:

# Remove rows with missing values
data_clean <- na.omit(data)
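
Before removing rows, it usually helps to see how many values are missing and where. The following sketch, using an invented data frame named survey, counts missing values per column, flags complete rows, and compares na.omit with the na.rm argument that many summary functions accept.

```r
survey <- data.frame(x = c(1, 2, NA, 4), y = c(5, NA, 7, 8))

# Number of missing values in each column.
missing_per_column <- colSums(is.na(survey))

# TRUE for rows that contain no missing values.
complete_rows <- complete.cases(survey)

# Drop every row that has at least one NA.
survey_clean <- na.omit(survey)

# Alternatively, keep all rows and skip NAs inside a single calculation.
mean_x <- mean(survey$x, na.rm = TRUE)

print(missing_per_column)
print(survey_clean)
```

Note that na.omit discards a whole row even if only one column is missing, so na.rm = TRUE can preserve more data when only one variable is being summarized.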

In R, `NA` and `NULL` are both used to represent missing or non-existent values, but they
are different in their meaning and usage.

4.1. NA (Not Available)


• `NA` is used to represent missing or unknown values in vectors, matrices, data frames, and
other data structures.
• It is a logical constant of length 1 that indicates a missing value.
• `NA` has typed variants (e.g., `NA_real_` for numeric, `NA_integer_` for integer,
`NA_character_` for character; plain `NA` is logical).
• `NA` values can propagate through calculations and operations, leading to `NA` results.
• `NA` values are treated differently from other values in many functions and operations.

4.2. NULL
• `NULL` is used to represent an empty or non-existent object.
• It is a reserved word in R that represents the null object.
• `NULL` has its own type ("NULL") and a length of 0.
• `NULL` is often used to initialize or reset an object to an empty state.
• Operations involving `NULL` objects may lead to errors or unexpected results,
depending on the operation.

Examples:
# NA examples
x <- c(1, 2, NA, 4, 5) # Vector with an NA value
is.na(x) # Returns TRUE for the NA element
y <- data.frame(a = c(1, 2, NA), b = c("a", "b", NA)) # Data frame with NA values
sum(y$a, na.rm = TRUE) # Ignores NA values in the sum
# NULL examples
z <- NULL # Assign NULL to an object
length(z) # Returns 0 (NULL has length 0)
my_list <- list(a = 1, b = NULL) # List with a NULL element
is.null(my_list$b) # Returns TRUE
# Comparing NA and NULL
is.na(NULL) # Returns logical(0) (NULL has no elements to test)
is.null(NA) # Returns FALSE (NA is not NULL)

In general, `NA` is used to represent missing or unknown values within data structures,
while `NULL` is used to represent the absence of an object or an uninitialized state.
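
Two further differences are worth seeing in code: NA propagates through most calculations, while NULL simply disappears when combined into a vector. The variable names below are illustrative only.

```r
# NA propagates: the result of arithmetic or comparison is unknown too.
na_sum <- 1 + NA
na_cmp <- NA > 1

# Exception: the answer is FALSE regardless of what the missing value is.
na_and_false <- NA & FALSE

# NULL takes up no position in a vector, NA does.
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
length(with_null)
length(with_na)

# Plain NA is logical; typed variants exist for other vector types.
na_types <- c(typeof(NA), typeof(NA_real_), typeof(NA_character_))
na_types
```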

5. Coercion
• When you call a function with an argument of the wrong type, R will try to coerce values
to a different type so that the function will work.
• There are two types of coercion that occur automatically in R:
➢ coercion with formal objects
➢ coercion with built-in types.

• R will look for a suitable method. If no exact match exists, then R will search for a coercion
method that converts the object to a type for which a suitable method does exist.
• R will automatically convert between built-in object types when appropriate.
• R will convert from more specific types to more general types.

For example, suppose that you define a vector x as follows:

x <- c(1, 2, 3, 4, 5)
x
typeof(x)
class(x)

OUTPUT
[1] 1 2 3 4 5
[1] "double"
[1] "numeric"

Let’s change the second element of the vector to the word “hat.” R will change the
object class to character and convert all the elements in the vector to character strings:

x[2] <- "hat"
x
typeof(x)
class(x)

OUTPUT
[1] "1" "hat" "3" "4" "5"
[1] "character"
[1] "character"

Coercion rules:

• Logical values are converted to numbers: TRUE is converted to 1 and FALSE to 0.
• Values are converted to the simplest type required to represent all information.
• The ordering is roughly logical < integer < numeric < complex < character < list.
• Objects of type raw are not converted to other types.
• Object attributes are dropped when an object is coerced from one type to
another.
• You can inhibit coercion when passing arguments to functions by using the AsIs
function (or, equivalently, the I function).
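
The rules above can be checked directly with typeof. The sketch below combines values of different types and shows one use of I(), here keeping a list as a single data frame column; the data frame and its column names are assumptions for illustration.

```r
# Combining values moves everything to the more general type.
t1 <- typeof(c(TRUE, 2L))    # logical + integer
t2 <- typeof(c(1L, 2.5))     # integer + double
t3 <- typeof(c(1, "a"))      # double + character
t4 <- typeof(c(1, list(2)))  # anything + list
c(t1, t2, t3, t4)

# Logical values count as 1 and 0 in arithmetic.
n_true <- sum(c(TRUE, FALSE, TRUE))

# I() marks an object "as is", so data.frame() keeps the list as one column
# instead of trying to expand it into several columns.
df_as_is <- data.frame(id = 1:2, tags = I(list(c("a", "b"), "c")))
str(df_as_is)
```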

6. COMBINING DATASETS IN R

• Combining datasets is a common task in data analysis and manipulation using R.


• There are several ways to combine datasets in R, depending on the structure of the data
and the desired result.

6.1 Pasting Together Data Structures


R provides several functions that allow you to paste together multiple data structures
into a single structure.

6.1.1. paste
• The simplest of these functions is paste.
• The paste function allows you to concatenate multiple character vectors into a single
vector. (If you concatenate a vector of another type, it will be coerced to a character
vector first.)
x <- c("a", "b", "c", "d", "e")
y <- c("A", "B", "C", "D", "E")
paste(x,y)

OUTPUT
[1] "a A" "b B" "c C" "d D" "e E"
By default, values are separated by a space; you can specify another separator with the
sep argument:
paste(x, y, sep="-")

OUTPUT
[1] "a-A" "b-B" "c-C" "d-D" "e-E"

• If you would like all of the values in the returned vector to be concatenated with one
another, then specify a value for the collapse argument.
• The value of collapse will be used as the separator in this value:

paste(x, y, sep="-", collapse="#")


OUTPUT
[1] "a-A#b-B#c-C#d-D#e-E"

6.1.2. rbind and cbind


• Sometimes you need to bind together multiple data frames or matrices.
• The cbind function will combine objects by adding columns.
• The rbind function will combine objects by adding rows.

For example, let’s start with the data frame for the top five salaries:

top.5.salaries
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666

Now let’s create a new data frame with two more columns (a year and a rank):
year <- c(2008, 2008, 2008, 2008, 2008)
rank <- c(1, 2, 3, 4, 5)
more.cols <- data.frame(year, rank)
more.cols

OUTPUT
  year rank
1 2008    1
2 2008    2
3 2008    3
4 2008    4
5 2008    5

Combining these two data frames:

cbind(top.5.salaries, more.cols)

OUTPUT
  name.last name.first     team position   salary year rank
1   Manning     Peyton    Colts       QB 18700000 2008    1
2     Brady        Tom Patriots       QB 14626720 2008    2
3    Pepper     Julius Panthers       DE 14137500 2008    3
4    Palmer     Carson  Bengals       QB 13980000 2008    4
5   Manning        Eli   Giants       QB 12916666 2008    5

As an rbind example, suppose that you had a data frame with the top five salaries (as shown
above) and a second data frame with the next three salaries:

top.5.salaries
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666

next.three
  name.last name.first    team position   salary
6     Favre      Brett Packers       QB 12800000
7    Bailey      Champ Broncos       CB 12690050
8  Harrison     Marvin   Colts       WR 12000000

You could combine these into a single data frame using the rbind function:

rbind(top.5.salaries, next.three)
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666
6     Favre      Brett  Packers       QB 12800000
7    Bailey      Champ  Broncos       CB 12690050
8  Harrison     Marvin    Colts       WR 12000000
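
The outputs above come from data frames that are not constructed in the text. A smaller, self-contained sketch of the same two operations follows; the data frames top_two, extra_cols and next_one are cut-down stand-ins invented for illustration.

```r
top_two <- data.frame(name = c("Manning", "Brady"),
                      salary = c(18700000, 14626720))

# cbind: same number of rows, columns are placed side by side.
extra_cols <- data.frame(year = c(2008, 2008), rank = 1:2)
wide <- cbind(top_two, extra_cols)

# rbind: same column names, rows are stacked.
next_one <- data.frame(name = "Pepper", salary = 14137500)
long <- rbind(top_two, next_one)

dim(wide)
dim(long)
```

cbind does not match rows by any key; it assumes the rows are already in the same order. When rows must be matched on a shared field, merge is the safer tool.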

6.2. Merging Data by Common Fields


• Merging data by common fields, also known as joining, is a common task in data analysis
and manipulation.
• In R, there are several functions and packages that can help you merge data based on one or
more common fields or columns.

1. merge()
• The `merge()` function in base R is used to merge two data frames by one or more
common columns or keys.
• It performs a relational database-style join operation.

df1 <- data.frame(id = c(1, 2, 3), x = c(4, 5, 6))
df2 <- data.frame(id = c(2, 3, 4), y = c(7, 8, 9))

# Inner join (only matching rows)
merged_data <- merge(df1, df2, by = "id")

# Left join (all rows from df1, matching rows from df2)
merged_data <- merge(df1, df2, by = "id", all.x = TRUE)

# Right join (all rows from df2, matching rows from df1)
merged_data <- merge(df1, df2, by = "id", all.y = TRUE)

# Full join (all rows from both datasets)
merged_data <- merge(df1, df2, by = "id", all = TRUE)
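
To make the difference between the join types concrete, the sketch below runs the same merges on the two small data frames from above and keeps each result under its own name (inner, left, full are names chosen here for illustration).

```r
df1 <- data.frame(id = c(1, 2, 3), x = c(4, 5, 6))
df2 <- data.frame(id = c(2, 3, 4), y = c(7, 8, 9))

inner <- merge(df1, df2, by = "id")                # ids present in both
left  <- merge(df1, df2, by = "id", all.x = TRUE)  # every id from df1
full  <- merge(df1, df2, by = "id", all = TRUE)    # every id from either

print(inner)
print(left)
print(full)
```

Rows without a match on the other side are kept by the left and full joins, with NA filled in for the columns that have no value.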

7. TRANSFORMATIONS
• This section explains how to change a variable in a data frame.

7.1 Reassigning Variables


• In R, reassigning a variable refers to the process of changing the value or data type of
an existing variable.
• It is done using the assignment operator <-, which overwrites the previous value or data
type of the variable with a new one.
• By using the assignment operator `<-`, we can reassign the result of type conversion
functions like `as.numeric()`, `as.character()`, `as.integer()`, etc., to the same variable.

Example of changing the type of a variable in R using the assignment operator:

# Create a character variable
x <- "10"
# Check the class of x (x is character)
class(x)
Output:
[1] "character"

# Convert x to a numeric variable
x <- as.numeric(x)
# Check the class of x again (x is now numeric)
class(x)
Output:
[1] "numeric"

# Print the value of x
x
Output:
[1] 10

In this example:
1. We create a character variable `x` and assign it the string value `"10"`.
2. We check the class of `x` using `class(x)`, and it returns `"character"`.
3. We use the `as.numeric()` function to convert the character variable `x` to a numeric
variable. We reassign the result of `as.numeric(x)` back to `x` using the assignment operator:
`x <- as.numeric(x)`.
4. We check the class of `x` again using `class(x)`, and it now returns `"numeric"`.

7.2. The Transform Function


• The `transform()` function in R is used to create a new data frame by modifying or
adding columns to an existing data frame.
• It allows you to apply transformations or expressions to the existing columns and create
new columns in the resulting data frame.
syntax:
transform(data, new_column_name = expression, ...)
• data: The input data frame.
• new_column_name = expression: A named expression defining a column to modify or
create. The `expression` can involve existing columns and operations. If the name matches
an existing column, that column is replaced; otherwise a new column is added.
• ...: Further named expressions of the same form.

Example
For example, suppose that we wanted to perform two transformations: changing the
Date column to a Date format, and adding a new midpoint variable. We could do this
with transform using the following expression:

dow30.transformed <- transform(dow30, Date=as.Date(Date), mid = (High + Low) / 2)
names(dow30.transformed)

output
[1] "symbol" "Date" "Open" "High" "Low"
[6] "Close" "Volume" "Adj.Close" "mid"

class(dow30.transformed$Date)

output
[1] "Date"
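
The dow30 data set is not defined in these notes, so the same two transformations are sketched below on a tiny stand-in data frame. The name prices and its values are invented for illustration.

```r
# Hypothetical price data with dates stored as text.
prices <- data.frame(Date = c("2009-09-21", "2009-09-22"),
                     High = c(10, 12),
                     Low  = c(8, 9),
                     stringsAsFactors = FALSE)

# Replace Date with a real Date column and add a midpoint column.
prices_t <- transform(prices,
                      Date = as.Date(Date),
                      mid  = (High + Low) / 2)

class(prices_t$Date)
prices_t$mid
```

transform returns a new data frame and leaves prices unchanged, so the result has to be assigned to keep it.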

7.3 APPLYING A FUNCTION TO EACH ELEMENT OF AN OBJECT
• In R, you can apply a function to each element of an object, such as a vector, matrix, or
data frame, using various functions like `apply`, `lapply`, `sapply`, `tapply`, and
`mapply`.
• These functions are part of the "apply" family and are designed to work with different
data structures.
• These functions are powerful tools for working with data structures in R, allowing you
to apply functions to elements or subsets of data in a concise and efficient manner.
• The choice of function depends on the structure of your input data and the desired
output format.

7.3.1. apply():
• This function applies a function over the margins of an array (matrix or data frame).
• It is commonly used to apply a function to rows or columns of a matrix or data frame.

Example:

# Apply sum function to each row of a matrix
# matrix() fills column by column, so the rows are (1, 4, 7), (2, 5, 8), (3, 6, 9)
my_matrix <- matrix(1:9, nrow = 3)
row_sums <- apply(my_matrix, 1, sum)
print(row_sums)

Output:
[1] 12 15 18

7.3.2. lapply():
• This function applies a function to each element of a list and returns a list of the same
length as the input.

Example:

# Apply sqrt function to each element of a list


my_list <- list(4, 9, 16, 25)
sqrt_list <- lapply(my_list, sqrt)
print(sqrt_list)

Output:
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

[[4]]
[1] 5

7.3.3. sapply():
• This function is a wrapper around `lapply` and tries to simplify the output by returning a
vector or matrix instead of a list, if possible.

Example:

# Apply sqrt function to each element of a list, returning a vector


my_list <- list(4, 9, 16, 25)
sqrt_vector <- sapply(my_list, sqrt)
print(sqrt_vector)

Output:
[1] 2 3 4 5

7.3.4 mapply()

• `mapply()` applies a function to multiple list or vector arguments.
• It is a multivariate version of `sapply()`: it calls `FUN` with the first elements of each
argument, then with the second elements, and so on.

The syntax for `mapply()` is:

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

• FUN: The function to be applied to each element of the input vectors or lists.
• ...: The vector or list arguments to which the function `FUN` should be applied.
• MoreArgs: A list of other arguments to be passed to `FUN`.
• SIMPLIFY: A logical value indicating whether the result should be simplified to a
vector or array if possible.
• USE.NAMES: A logical value indicating whether the names of the first `...` argument
should be used for the output if it has names.

`mapply()` is useful when you need to apply a function to multiple arguments that are vectors
or lists of the same length. It can be particularly handy when working with data frames or lists
of different data types.

Example 1: Applying a function to multiple vectors

# Define vectors
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Apply the sum function to corresponding elements
result <- mapply(sum, x, y)
print(result)

Output:
[1] 5 7 9

In this example, `mapply()` applies the `sum` function to the corresponding elements of `x`
and `y`, resulting in a vector `[1] 5 7 9`.

Example 2: Applying a function to a data frame

# Create a data frame


df <- [Link](
x = c(1, 2, 3),
y = c(4, 5, 6),
z = c(7, 8, 9)
)

# Apply a function to each row of the data frame


result <- mapply(function(a, b, c) a + b + c, df$x, df$y, df$z)
print(result)

Output:
[1] 12 15 18

In this example, `mapply()` applies an anonymous function to each row of the data frame
`df`, summing the values in the `x`, `y`, and `z` columns for each row.
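
`tapply` is listed at the start of this section but has no example above. A minimal sketch follows: it applies a function to groups of values defined by a second vector. The salary figures (in millions) and positions are illustrative values, and a column-wise `apply` call is included for contrast with the row-wise example earlier.

```r
salary   <- c(18.7, 14.6, 14.1, 12.7)
position <- c("QB", "QB", "DE", "CB")

# Mean salary within each position group.
mean_by_position <- tapply(salary, position, mean)
print(mean_by_position)

# apply with margin 2 works on columns instead of rows.
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)
print(col_sums)
```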

8. BINNING DATA

• Binning data, also known as data binning or bucketing, is the process of partitioning a
continuous variable into discrete intervals or categories, called bins or buckets.
• In other words, binning groups a set of individual values or observations into a smaller
number of bins based on a range or interval of values.
• The main purpose of binning data is to transform a continuous variable, which can
take on any value within a range, into a categorical or ordinal variable with a finite
number of categories or levels.
• Each bin or bucket represents a range of values from the original continuous variable.
Binning data can be useful for several reasons:

• Discretization
• Data Exploration and Visualization
• Reducing Noise and Smoothing Data
• Feature Engineering
• Data Compression

8.1 Shingles

• Shingles are a way to represent intervals in R.
• They can be overlapping, like roof shingles.
• They are used extensively in the lattice package, when you want to use a numeric
value as a conditioning value.

To create shingles in R, use the shingle function:

shingle(x, intervals=sort(unique(x)))

• To specify where to separate the bins, use the intervals argument.
• You can use a numeric vector to indicate the breaks or a two-column matrix, where
each row represents a specific interval.
• To create shingles where the number of observations is the same in each bin, you
can use the equal.count function:

equal.count(x, ...)

8.2 Cut

The function cut is useful for taking a continuous variable and splitting it into discrete
pieces.
Here is the default form of cut for use with numeric vectors:

# numeric form
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3,
ordered_result = FALSE, ...)

There is also a version of cut for manipulating Date objects:

# Date form
cut(x, breaks, labels = NULL, start.on.monday = TRUE, right = FALSE, ...)

The cut function takes a numeric vector as input and returns a factor.
Each level in the factor corresponds to an interval of values in the input vector
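
As a small sketch of binning with cut, the following groups ages into three labeled bins. The ages, break points, and labels are invented for illustration. With the default right = TRUE, each interval includes its upper bound, so 17 falls in the first bin and 18 in the second.

```r
ages <- c(3, 17, 18, 42, 65, 80)

# Intervals: (0, 17], (17, 64], (64, Inf]
age_group <- cut(ages,
                 breaks = c(0, 17, 64, Inf),
                 labels = c("child", "adult", "senior"))

# Count how many observations landed in each bin.
bin_counts <- table(age_group)
print(age_group)
print(bin_counts)
```

A value equal to the lowest break (0 here) would become NA unless include.lowest = TRUE is set.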

8.3 Combining Objects with a Grouping Variable
Sometimes you would like to combine a set of similar objects into a single data frame,
with a column labeling the source.
You can do this with the make.groups function in the lattice package:

library(lattice)
make.groups(...)

For example, let’s combine three different vectors into a data frame:

hat.sizes <- seq(from=6.25, to=7.75, by=.25)
pants.sizes <- c(30, 31, 32, 33, 34, 36, 38, 40)
shoe.sizes <- seq(from=7, to=12)
make.groups(hat.sizes, pants.sizes, shoe.sizes)

OUTPUT

              data       which
hat.sizes1    6.25   hat.sizes
hat.sizes2    6.50   hat.sizes
hat.sizes3    6.75   hat.sizes
hat.sizes4    7.00   hat.sizes
hat.sizes5    7.25   hat.sizes
hat.sizes6    7.50   hat.sizes
hat.sizes7    7.75   hat.sizes
pants.sizes1 30.00 pants.sizes
pants.sizes2 31.00 pants.sizes
pants.sizes3 32.00 pants.sizes
pants.sizes4 33.00 pants.sizes
pants.sizes5 34.00 pants.sizes
pants.sizes6 36.00 pants.sizes
pants.sizes7 38.00 pants.sizes
pants.sizes8 40.00 pants.sizes
shoe.sizes1   7.00  shoe.sizes
shoe.sizes2   8.00  shoe.sizes
shoe.sizes3   9.00  shoe.sizes
shoe.sizes4  10.00  shoe.sizes
shoe.sizes5  11.00  shoe.sizes
shoe.sizes6  12.00  shoe.sizes

9. SUBSETS

• In R, subsets refer to a way of selecting or extracting specific elements from objects such as vectors, matrices, data frames, or lists.
• Subsetting is a fundamental operation in R and is used extensively for data manipulation and analysis.

9.1. Subset Function:

One way to take a subset of a data set is the bracket notation described in Section 9.2. Another is the `subset()` function, which creates subsets of vectors, matrices, or data frames based on logical conditions or expressions. It has the following syntax:

subset(x, subset, select, ...)

• x: the data frame or matrix from which you want to create a subset.
• subset: a logical expression that specifies the rows to be included in the subset.
• select: an optional expression specifying the columns to be included in the subset.

Example:

# Create a data frame
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# Subset rows where x is greater than 5 and select columns x and z
new_df <- subset(df, x > 5, select = c(x, z))

9.2. Bracket Notation:


• In R, bracket notation is used to subset vectors, matrices, data frames, and lists.
• There are two main types of bracket notation:

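a. `[` (single bracket): Used to extract one or more elements. The result has the same type as the original object (a subset of a list is still a list).
b. `[[` (double bracket): Used to extract a single element from a list, or a single column from a data frame, as the element itself.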

Example:

# Subset a vector
x <- c(1, 2, 3, 4, 5)
x[c(1, 3, 5)]

Output: 1 3 5

# Subset a data frame
df <- data.frame(x = 1:5, y = 6:10)
df[df$x > 2, ] # Subset rows where x is greater than 2
df[, c("x", "y")] # Subset columns x and y
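The difference between the two bracket types is easiest to see by checking the class of each result. The list person below is a hypothetical object used only for illustration.

```r
# Hypothetical list used only for illustration
person <- list(name = "Alice", scores = c(90, 85, 77))

single <- person["scores"]    # single bracket: a list of length 1
double <- person[["scores"]]  # double bracket: the numeric vector itself

class(single)   # "list"
class(double)   # "numeric"

# The same distinction applies to data frames
df <- data.frame(x = 1:5, y = 6:10)
class(df["x"])    # "data.frame" (a one-column data frame)
class(df[["x"]])  # "integer" (the column as a plain vector)
```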

9.3. Random Sampling:


• In some cases, you may want to create a random subset of your data for various purposes,
such as cross-validation or bootstrap sampling.
• R provides several functions for random sampling, including `sample()` and
`slice_sample()` (from the `dplyr` package).

Example using `sample()`:

# Create a vector
x <- 1:100

# Take a random sample of 10 elements from x


random_sample <- sample(x, size = 10)

Random sampling is useful when you need to work with a representative subset of your
data, especially in cases where analyzing the entire dataset is computationally expensive
or when you need to perform resampling techniques like cross-validation or
bootstrapping.
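The same idea extends to data frames by sampling row indices. The sketch below assumes a made-up data frame df and an arbitrary seed, and shows one sample without replacement and one bootstrap-style sample with replacement.

```r
set.seed(42)  # fix the random number generator so the sample is reproducible

# Hypothetical data frame with 100 rows
df <- data.frame(id = 1:100, value = rnorm(100))

# Sample 10 distinct row indices, then subset with bracket notation
rows <- sample(nrow(df), size = 10)
train_sample <- df[rows, ]

# Bootstrap-style sample: same size as the data, drawn with replacement
boot_rows <- sample(nrow(df), size = nrow(df), replace = TRUE)
boot_sample <- df[boot_rows, ]
```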

10. SUMMARIZING FUNCTIONS
10.1. `tapply()`:
The `tapply()` function applies a function (e.g., `mean`, `sum`, `min`, `max`) over subsets of a vector, where the subsets are defined by one or more grouping factors.

# Calculate mean of values grouped by a factor
data <- c(1, 2, 3, 4, 5, 6)
group <- c("A", "A", "B", "B", "A", "B")
means <- tapply(data, group, mean)
print(means)

Output: A 2.666667 B 4.333333

This indicates that the mean of the values in data for group "A" (1, 2, and 5) is 2.666667, and the mean of the values in data for group "B" (3, 4, and 6) is 4.333333.

10.2. `aggregate()`:
The `aggregate()` function splits data into subsets, computes summary statistics for each
subset, and returns the results in a convenient data frame.

# Calculate mean and standard deviation of mpg grouped by cylinder count
aggregate(mtcars$mpg, by = list(mtcars$cyl),
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
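`aggregate()` also accepts a formula, which often reads more naturally and keeps the original column names in the result. This is an alternative sketch using the built-in mtcars data set, not a replacement for the call above.

```r
# mtcars ships with R; mpg ~ cyl means "mpg grouped by cyl"
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mean_mpg   # one row per cylinder count, with columns cyl and mpg

# Several variables can be summarised at once with cbind()
aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
```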

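10.3. Aggregating Tables with `rowsum()`:

The base R function `rowsum()` adds up the rows of a matrix or data frame within each level of a grouping variable. For plain row or column totals with no grouping, use `rowSums()` or `colSums()` instead.

# Create a 3 x 3 matrix filled row by row: (1 2 3), (4 5 6), (7 8 9)
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
# Row totals: 1+2+3 = 6, 4+5+6 = 15, 7+8+9 = 24
rowSums(mat)

Output:
[1] 6 15 24

# Sum the rows within groups: rows 1 and 2 form group "a", row 3 is group "b"
rowsum(mat, group = c("a", "a", "b"))

Output:
  [,1] [,2] [,3]
a    5    7    9
b    7    8    9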

10.4. Counting Values:


You can use functions like `table()`, `count()` (from `plyr`), or `count()` (from `dplyr`) to
count the occurrences of values in a vector or a column of a data frame.

# Count occurrences of values in a vector
table(c("A", "B", "A", "C", "A"))

Output:
A B C
3 1 1

There are 3 A's, 1 B, and 1 C.
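The same approach works on a column of a data frame. As a small sketch using the built-in mtcars data, the counts can also be converted into a data frame for further processing. The column names cyl and n are choices made for this example.

```r
counts <- table(mtcars$cyl)   # mtcars ships with R
counts

# Convert the table to a data frame with one row per distinct value
count_df <- as.data.frame(counts, responseName = "n")
names(count_df)[1] <- "cyl"
count_df
```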

10.5. Reshaping Data:


• Reshaping data involves transforming the layout or structure of a data set, such as
converting between wide and long formats.
• The `reshape2` package provides functions like `melt()` and `dcast()` for this purpose.

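library(reshape2)
# Example wide-format data (illustrative values): one row per id
wide_data <- data.frame(id = 1:2, height = c(150, 160), weight = c(50, 60))
# Convert wide format to long format
long_data <- melt(wide_data, id.vars = "id")

# Convert long format to wide format
wide_data <- dcast(long_data, id ~ variable, value.var = "value")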

11. Data Cleaning


Data cleaning in R is a crucial step in the data analysis process, and it involves several tasks
such as handling missing values, removing duplicates, correcting data types, and formatting
data. Here are some common data cleaning techniques in R:

1. Handling Missing Values:
• Use `is.na()` to identify missing values in a data frame or vector.
• Remove rows with missing values using `na.omit()` or `complete.cases()`.
• Replace missing values with a specific value (e.g., mean, median) using functions like `mean()`, `median()`, or the `replace()` function.
• Impute missing values using advanced techniques like k-nearest neighbors or regression models.
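The first three techniques can be sketched in base R as follows. The data frame df and its values are hypothetical, and mean imputation is shown only as the simplest option.

```r
# Hypothetical data frame with missing values
df <- data.frame(id = 1:5, score = c(10, NA, 30, NA, 50))

is.na(df$score)        # TRUE where a value is missing
sum(is.na(df$score))   # number of missing values

# Keep only rows with no missing values (both approaches keep the same rows)
clean1 <- na.omit(df)
clean2 <- df[complete.cases(df), ]

# Replace missing scores with the mean of the observed scores
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(df$score, na.rm = TRUE)
```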

2. Removing Duplicates:
• Use `duplicated()` or `unique()` to identify and remove duplicate rows or columns.

3. Correcting Data Types:


• Use functions like `as.numeric()`, `as.character()`, `as.factor()` to convert data types.
• Check and correct data types using `str()`, `sapply()`, and `lapply()`.
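A minimal sketch of checking and converting column types, assuming a made-up data frame in which numbers were read in as text:

```r
# Hypothetical data frame where numbers were read in as text
df <- data.frame(age = c("25", "32", "18"),
                 city = c("Pune", "Delhi", "Pune"),
                 stringsAsFactors = FALSE)

str(df)            # both columns are character
sapply(df, class)  # class of every column at a glance

df$age  <- as.numeric(df$age)   # text -> number
df$city <- as.factor(df$city)   # text -> categorical variable

sapply(df, class)
```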

4. Formatting Data:
• Use functions like `tolower()`, `toupper()`, `trimws()`, `gsub()` for string manipulation and formatting.
• Use the `lubridate` package for working with dates and times.
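The string functions can be chained to normalise messy text. The vector raw below is invented for illustration.

```r
# Hypothetical messy strings
raw <- c("  Alice SMITH ", "bob   jones", " Carol  White")

cleaned <- trimws(raw)                # drop leading and trailing spaces
cleaned <- gsub(" +", " ", cleaned)   # collapse runs of spaces into one
cleaned <- tolower(cleaned)           # normalise case

cleaned
```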

5. Handling Outliers:
• Identify outliers using visualizations (boxplots, scatter plots) or statistical measures (z-scores, IQR).
• Replace, remove, or cap outliers based on your analysis requirements.
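One common version of the IQR rule flags values more than 1.5 times the IQR beyond the quartiles. The sketch below assumes that rule and a made-up vector x. The 1.5 multiplier is a convention, not a requirement.

```r
# Hypothetical measurements with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 95)

q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr   # lower fence
upper <- q[2] + 1.5 * iqr   # upper fence

is_outlier <- x < lower | x > upper
x[is_outlier]               # values flagged as outliers

# Cap values at the fences instead of removing them
x_capped <- pmin(pmax(x, lower), upper)
```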

6. Reshaping Data:
• Use `reshape2` or `tidyr` packages for reshaping data (e.g., wide to long format or vice
versa).

12. FINDING AND REMOVING DUPLICATES

This section shows how to find and remove duplicate rows or columns in a data frame in R:

12.1 Finding Duplicates


1. Finding Duplicate Rows: Using `duplicated()` function:

# Load the dataset
data <- read.csv("your_data.csv")
# Find duplicate rows
duplicate_rows <- duplicated(data)

The `duplicated()` function returns a logical vector indicating which rows are
duplicates.

2. Finding Duplicate Columns: Using `duplicated()` function:

# Load the dataset
data <- read.csv("your_data.csv")
# Find duplicate columns
duplicate_cols <- duplicated(t(data))

Here, we first transpose the data frame using `t()`, and then apply `duplicated()` to find
duplicate columns.

12.2 Removing Duplicates
1. Removing Duplicate Rows: Using `duplicated()` function:

# Load the dataset
data <- read.csv("your_data.csv")
# Remove duplicate rows
data_unique <- data[!duplicated(data), ]

By negating the result of `duplicated()` (`!duplicated(data)`), you can subset the data
frame to keep only the unique rows.

2. Removing Duplicate Columns: Using `duplicated()` on the transposed data frame together with the `!` operator:

# Load the dataset
data <- read.csv("your_data.csv")
# Remove duplicate columns (t() turns columns into rows so duplicated() compares columns)
data_unique <- data[, !duplicated(t(data))]
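A self-contained sketch of both removals, using a small made-up data frame instead of a CSV file. For columns it uses `duplicated(as.list(df))`, an alternative to transposing that avoids converting every value to a common type.

```r
# Hypothetical data frame: row 3 repeats row 1, and column b copies column a
df <- data.frame(a = c(1, 2, 1), b = c(1, 2, 1), c = c("x", "y", "x"))

duplicated(df)                         # FALSE FALSE TRUE
rows_unique <- df[!duplicated(df), ]   # same rows as unique(df)

# Compare columns as list elements, which avoids the type coercion of t()
dup_cols <- duplicated(as.list(df))
cols_unique <- df[, !dup_cols, drop = FALSE]
```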

13. SORTING
• Sorting data in R is a common task and there are several ways to accomplish it.
• The primary function for sorting is `sort()`, but there are also other functions and
methods that can be used depending on the type of data structure and your specific
requirements.

13.1. Sorting Vectors:

- `sort(x, decreasing = FALSE)`:

• This function sorts the elements of a vector `x` in ascending order by default.
• Setting `decreasing = TRUE` will sort the elements in descending order.

- `order(x, decreasing = FALSE)`:

• This function returns the indices of the sorted elements, rather than the sorted values themselves.
• It can be useful when you want to rearrange rows or columns of a data frame based on the order of a vector.

Example:

# Create a vector
x <- c(5, 3, 8, 2, 1)

# Sort in ascending order
sort(x)

Output: 1 2 3 5 8

# Sort in descending order
sort(x, decreasing = TRUE)

Output: 8 5 3 2 1
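To see how `order()` differs from `sort()`, the sketch below reuses the same vector and applies the resulting indices to a second, hypothetical vector labels.

```r
x <- c(5, 3, 8, 2, 1)

idx <- order(x)   # positions of the elements in ascending order
idx               # 5 4 2 1 3
x[idx]            # same result as sort(x)

# Use the ordering of one vector to rearrange another
labels <- c("e", "c", "h", "b", "a")
labels[idx]       # "a" "b" "c" "e" "h"
```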

13.2. Sorting Data Frames:

- `data[order(data$column), ]`: This syntax sorts the rows of a data frame `data` based on the values in the specified `column` in ascending order.

- `data[order(-data$column), ]`: Adding a `-` before a numeric column sorts the rows in descending order. For character columns, use `order(data$column, decreasing = TRUE)` instead.

- `data[order(data$col1, data$col2), ]`: You can specify multiple columns to sort by, with the leftmost column taking precedence.

Example:
# Create a data frame
df <- data.frame(name = c("Alice", "Bob", "Charlie", "David", "Eva"),
                 age = c(25, 32, 18, 40, 28))

# Sort by age in ascending order
df[order(df$age), ]

OUTPUT
#      name age
# 3 Charlie  18
# 1   Alice  25
# 5     Eva  28
# 2     Bob  32
# 4   David  40

# Sort by name in descending order
# (unary minus does not work on character columns, so use decreasing = TRUE)
df[order(df$name, decreasing = TRUE), ]

OUTPUT
#      name age
# 5     Eva  28
# 4   David  40
# 3 Charlie  18
# 2     Bob  32
# 1   Alice  25
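Sorting by several columns with mixed directions can be sketched as follows. The data frame is hypothetical, and negating `rank()` is one possible way to reverse a character key inside `order()`.

```r
# Hypothetical data with ties in the first sort key
df <- data.frame(dept = c("B", "A", "B", "A"),
                 name = c("Dan", "Eve", "Amy", "Bob"),
                 age  = c(31, 28, 40, 28))

# dept ascending, then age descending within each dept
by_dept_age <- df[order(df$dept, -df$age), ]

# dept ascending, then name descending: negate the rank of the character key
by_dept_name_desc <- df[order(df$dept, -rank(df$name)), ]
```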

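13.3. Sorting Lists:
`sort()` works only on atomic vectors, so calling it directly on a list raises an error. To sort a list of single values, compute the ordering with `order(unlist(list))` and use it to index the list.

Example:
# Create a list
my_list <- list(c = 3, a = 1, b = 2)

# Sort the list by its values
my_list[order(unlist(my_list))]

Output: a list whose elements appear in the order a = 1, b = 2, c = 3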
