1. Importing Data From External Files
Importing data from Text files and other software, Exporting data, importing data from
databases- Database Connection packages, Missing Data - NA, NULL. Combining data
sets, Transformations, Binning Data, Subsets, summarizing functions. Data Cleaning,
Finding and removing Duplicates, Sorting.
1.1. Text Files
Most text files containing data are formatted similarly: each line of a text file represents an
observation (or record).
Each line contains a set of different variables associated with that observation. Sometimes,
different variables are separated by a special character called the delimiter.
Other times, variables are differentiated by their location on each line.
• The read.table function reads a text file into R and returns a data.frame object.
• Each row in the input file is interpreted as an observation.
• Each column in the input file represents a variable.
• The read.table function expects each field to be separated by a delimiter.
For example, suppose that you had a file called top.5.salaries.csv that contained the
following text (and only this text). The first line holds the column names:
name.last,name.first,team,position,salary
"Manning","Peyton","Colts","QB",18700000
"Brady","Tom","Patriots","QB",14626720
"Pepper","Julius","Panthers","DE",14137500
"Palmer","Carson","Bengals","QB",13980000
"Manning","Eli","Giants","QB",12916666
You could read this file into R with the read.table function:
> top.5.salaries <- read.table("top.5.salaries.csv", header=TRUE, sep=",", quote="\"")
1.1.2. Fixed-width files
To read a fixed-width format text file into a data frame, you can use the
read.fwf function:
read.fwf(file, widths, header = FALSE, sep = "\t", skip = 0, row.names, col.names, n = -1, buffersize = 2000, ...)
The main arguments to read.fwf are file (the file name or connection) and widths (an integer
vector giving the width of each field).
Note that read.fwf can also take many arguments used by read.table, including as.is,
na.strings, colClasses, and strip.white.
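As a minimal sketch of read.fwf, the example below writes a small fixed-width file to a temporary location and reads it back. The file contents, the column widths (10, 2, 8), and the column names are assumptions made up for illustration.

```r
# Write a tiny fixed-width file: 10 chars for name, 2 for position, 8 for salary
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("Manning   QB18700000",
             "Brady     QB14626720"), fwf_path)

# widths gives the number of characters in each field
players <- read.fwf(fwf_path,
                    widths = c(10, 2, 8),
                    col.names = c("name", "position", "salary"),
                    strip.white = TRUE)   # drop the padding spaces
print(players)
```

Because the fields are located by position, no delimiter character is needed in the file.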
1.1.4. scan
Another useful function for reading more complex file formats is scan:
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "", skip = 0, nlines = 0,
na.strings = "NA", multi.line = TRUE, allowEscapes = FALSE, encoding = "unknown")
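A short sketch of scan reading records with mixed types may help here. The file contents and the field names (name, salary) are hypothetical; the `what` argument is a list that acts as a template for the type of each field.

```r
# Create a small whitespace-separated file in a temporary location
scan_path <- tempfile(fileext = ".txt")
writeLines(c("John 50000", "Bob 55000"), scan_path)

# what = list(...) tells scan to read a character field, then a numeric field
records <- scan(scan_path, what = list(name = "", salary = 0), quiet = TRUE)
str(records)
```

Unlike read.table, scan returns a list (or a vector), not a data frame.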
Example: CSV files normally hold one record per line, but a quoted field can contain a line
break, so a single observation may span several lines of the file. In that case the file
structure is less straightforward to read.
Name, Salary
John, 50000
"Alice,
Cooper",60000
Bob, 55000
"Emma,
Watson", 62000
Michael, 58000
In this case, "Alice Cooper" and "Emma Watson" have values that span multiple lines.
[Link] <- [Link]("[Link]",header=TRUE, sep=",", quote="\"")
OUTPUT
Name Salary
1 John 50000
2 Alice,\n 60000
3 Bob 55000
4 Emma,\n 62000
5 Michael 58000
2. Exporting Data
• R can also export R data objects (usually data frames and matrices) as text files.
• To export data to a text file, use the write.table function:
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n", na =
"NA", dec = ".", row.names = TRUE, col.names = TRUE)
There are wrapper functions for write.table that call write.table with different defaults.
These are useful if you want to create a file of comma-separated values: write.csv and write.csv2.
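The export step can be sketched as a round trip: write a data frame with write.csv, then read it back with read.csv. The data frame and the temporary file path are assumptions chosen only for illustration.

```r
# A small data frame to export (made-up values)
salaries <- data.frame(name = c("Manning", "Brady"),
                       salary = c(18700000, 14626720))

# row.names = FALSE keeps the row numbers out of the file
csv_path <- tempfile(fileext = ".csv")
write.csv(salaries, csv_path, row.names = FALSE)

# Read the file back to confirm that the export worked
restored <- read.csv(csv_path)
print(restored)
```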
3. Importing Data From Databases
• Importing data from databases into R involves several steps that allow you to
connect to a database, retrieve data, and bring it into R for analysis.
• This process is essential for data analysis workflows, as databases often store large
volumes of structured data.
• The sections below describe how to connect to a database from R and retrieve data.
• One of the best approaches for working with data from a database is to export the
data to a text file and then import the text file into R.
• With very large data sets, importing data into R from text files is often much
faster than importing it through a database connection.
• Exporting to a text file is often the best approach if you plan to extract a large
amount of data once and then analyze it.
• If you are using R to produce regular reports or to repeat an analysis many times, it might
be better to import data into R directly through a database connection.
3.3. Importing Data from Databases Using RODBC
The `RODBC` package in R provides a robust interface for connecting to and interacting with
databases that comply with the Open Database Connectivity (ODBC) standard. This package
allows you to establish a connection to various database management systems (DBMS),
execute SQL queries, fetch data, and perform data manipulation tasks.
Key Concepts of RODBC
- ODBC (Open Database Connectivity): A standard API for accessing different DBMS.
ODBC drivers are available for many database systems, including SQL Server, MySQL,
Oracle, and others.
- DSN (Data Source Name): A data structure that contains the information about a specific
database connection.
3.3.1. Steps to Import Data Using RODBC
1. Install and Load the RODBC Package
install.packages("RODBC")
library(RODBC)
2. Set Up ODBC DSN
Set up an ODBC Data Source Name (DSN) for your database. This can be done through
your operating system’s ODBC Data Source Administrator tool.
3. Connect to the Database
Establish a connection to the database using the `odbcConnect` function, specifying the
DSN, user ID, and password.
conn <- odbcConnect("DSN_name", uid="your_username", pwd="your_password")
4. List Available Tables
You can list all the tables in the connected database to get an overview of its structure.
tables <- sqlTables(conn)
print(tables)
5. Fetch Data from a Table
Use the `sqlFetch` function to import an entire table into R.
data <- sqlFetch(conn, "your_table_name")
6. Execute SQL Queries
To run specific SQL queries and retrieve the results, use the `sqlQuery` function.
query_result <- sqlQuery(conn, "SELECT * FROM your_table_name WHERE
some_column = 'some_value'")
7. Perform Data Transformation and Cleaning
After importing the data, you might need to clean and transform it using functions from
packages like `dplyr` or `tidyverse`.
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)
Example workflow: using RODBC to connect to a database, retrieve data, and perform some
basic operations.
tables <- sqlTables(conn)
print(tables)
3.4.1. Steps to Import Data Using DBI
2. Establish a Connection
Create a connection to the database using the `dbConnect` function from the specific
backend package.
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "your_database_name",
host = "your_host", port = 5432, user = "your_username", password =
"your_password")
After importing the data, you might need to clean and transform it using functions from
packages like `dplyr` or `tidyverse`.
library(dplyr)
cleaned_data <- query_result %>%
filter(some_column == "some_value") %>%
select(column1, column2, column3)
Example
# Install and load the necessary packages
install.packages("DBI")
install.packages("RPostgreSQL")
library(DBI)
library(RPostgreSQL)
Advantages of Using DBI
3.4.2 TSDBI
4. MISSING DATA
Dealing with missing values is an important part of data preprocessing and analysis in R.
Missing values, often represented as NA (Not Available) in R, can occur due to various
reasons such as data entry errors, sensor failures, or incomplete observations. Here are
some common ways to handle missing values in R:
If the number of missing values is relatively small and their removal does not significantly
affect the analysis, you can remove them using the `na.omit()` function.
In R, `NA` and `NULL` are both used to represent missing or non-existent values, but they
are different in their meaning and usage.
4.2. NULL
• `NULL` is used to represent an empty or non-existent object.
• It is a reserved word in R that represents the null object.
• `NULL` has its own type ("NULL") and a length of 0.
• `NULL` is often used to initialize or reset an object to an empty state.
• Operations involving `NULL` objects may lead to errors or unexpected results,
depending on the operation.
examples
# NA examples
x <- c(1, 2, NA, 4, 5) # Vector with an NA value
is.na(x) # Returns TRUE for the NA element
y <- data.frame(a = c(1, 2, NA), b = c("a", "b", NA)) # Data frame with NA values
sum(y$a, na.rm = TRUE) # Ignores NA values in the sum
# NULL examples
z <- NULL # Assign NULL to an object
length(z) # Returns 0 (NULL has length 0)
my_list <- list(a = 1, b = NULL) # List with a NULL element
is.null(my_list$b) # Returns TRUE
# Comparing NA and NULL
is.na(NULL) # Returns logical(0) (NULL has no elements to test)
is.null(NA) # Returns FALSE (NA is not NULL)
In general, `NA` is used to represent missing or unknown values within data structures,
while `NULL` is used to represent the absence of an object or an uninitialized state.
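The common ways of handling NA values can be sketched in a few lines. The vector `scores` is a made-up example; mean imputation is shown only as one simple option, not as a recommendation.

```r
scores <- c(10, NA, 30, NA, 50)

# Locate the missing values
missing_flags <- is.na(scores)

# Option 1: ignore NAs inside a summary function
avg <- mean(scores, na.rm = TRUE)

# Option 2: drop the NAs altogether
complete_scores <- scores[!missing_flags]

# Option 3: replace each NA with the mean of the observed values
imputed <- replace(scores, missing_flags, avg)
print(imputed)
```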
5. Coercion
• When you call a function with an argument of the wrong type, R will try to coerce values
to a different type so that the function will work.
• There are two types of coercion that occur automatically in R:
➢ coercion with formal objects
➢ coercion with built-in types.
• R will look for a suitable method. If no exact match exists, then R will search for a coercion
method that converts the object to a type for which a suitable method does exist.
• R will automatically convert between built-in object types when appropriate.
• R will convert from more specific types to more general types.
x <- c(1, 2, 3, 4, 5)
x
typeof(x)
class(x)
OUTPUT
[1] 1 2 3 4 5
[1] "double"
[1] "numeric"
Let’s change the second element of the vector to the word “hat.” R will change the
object class to character and convert all the elements in the vector to character strings:
x[2] <- "hat"
x
typeof(x)
class(x)
OUTPUT
[1] "1" "hat" "3" "4" "5"
[1] "character"
[1] "character"
Coercion rules:
• Logical values are converted to numbers: TRUE is converted to 1 and FALSE to 0.
• Values are converted to the simplest type required to represent all information.
• The ordering is roughly logical < integer < numeric < complex < character < list.
• Objects of type raw are not converted to other types.
• Object attributes are dropped when an object is coerced from one type to
another.
• You can inhibit coercion when passing arguments to functions by wrapping them in the
I function, which marks an object with the class AsIs.
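The coercion rules above can be checked with a few small vectors. The variable names are illustrative, and the data frame at the end is a made-up case showing that I() keeps a column marked as AsIs.

```r
# logical and integer values are promoted to double when mixed with a double
mixed_num <- c(TRUE, 2L, 3.5)
num_type <- typeof(mixed_num)

# a single character value forces the whole vector to character
mixed_chr <- c(1, "hat")
chr_type <- typeof(mixed_chr)

# TRUE counts as 1 and FALSE as 0 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))

# I() protects an object from conversion when building a data frame
kept <- data.frame(id = 1:2, label = I(c("a", "b")))
label_is_asis <- inherits(kept$label, "AsIs")
```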
6. COMBINING DATASETS IN R
6.1. Paste
• The simplest of these functions is paste.
• The paste function allows you to concatenate multiple character vectors into a single
vector. (If you concatenate a vector of another type, it will be coerced to a character
vector first.)
x <- c("a", "b", "c", "d", "e")
y <- c("A", "B", "C", "D", "E")
paste(x,y)
OUTPUT
[1] "a A" "b B" "c C" "d D" "e E"
By default, values are separated by a space; you can specify another separator with the
sep argument:
paste(x, y, sep="-")
OUTPUT
[1] "a-A" "b-B" "c-C" "d-D" "e-E"
• If you would like all of the values in the returned vector to be concatenated with one
another, then specify a value for the collapse argument.
• The value of collapse will be used as the separator between the pasted elements.
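A brief sketch of the collapse argument, reusing the vectors x and y defined earlier in this section. The separator characters "-" and "#" are arbitrary choices.

```r
x <- c("a", "b", "c", "d", "e")
y <- c("A", "B", "C", "D", "E")

# sep joins the paired elements; collapse then joins the pairs into one string
joined <- paste(x, y, sep = "-", collapse = "#")
print(joined)
```

With collapse set, the result is a single character string instead of a vector of five strings.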
Now let’s create a new data frame with two more columns (a year and a rank):
year <- c(2008, 2008, 2008, 2008, 2008)
rank <- c(1, 2, 3, 4, 5)
more.cols <- data.frame(year, rank)
more.cols
OUTPUT
year rank
1 2008 1
2 2008 2
3 2008 3
4 2008 4
5 2008 5
Combining the two data frames column by column with cbind(top.5.salaries, more.cols) gives:
OUTPUT
name.last name.first team position salary year rank
1 Manning Peyton Colts QB 18700000 2008 1
2 Brady Tom Patriots QB 14626720 2008 2
3 Pepper Julius Panthers DE 14137500 2008 3
4 Palmer Carson Bengals QB 13980000 2008 4
5 Manning Eli Giants QB 12916666 2008 5
As an rbind example, suppose that you had a data frame with the top five salaries (as shown
above) and a second data frame with the next three salaries:
top.5.salaries
name.last name.first team position salary
1 Manning Peyton Colts QB 18700000
2 Brady Tom Patriots QB 14626720
3 Pepper Julius Panthers DE 14137500
4 Palmer Carson Bengals QB 13980000
5 Manning Eli Giants QB 12916666
next.three
name.last name.first team position salary
6 Favre Brett Packers QB 12800000
7 Bailey Champ Broncos CB 12690050
8 Harrison Marvin Colts WR 12000000
You could combine these into a single data frame using the rbind function:
rbind(top.5.salaries, next.three)
name.last name.first team position salary
1 Manning Peyton Colts QB 18700000
2 Brady Tom Patriots QB 14626720
3 Pepper Julius Panthers DE 14137500
4 Palmer Carson Bengals QB 13980000
5 Manning Eli Giants QB 12916666
6 Favre Brett Packers QB 12800000
7 Bailey Champ Broncos CB 12690050
8 Harrison Marvin Colts WR 12000000
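The two operations can be sketched on smaller, self-contained data frames. The object names (top_rows, extra_cols, next_rows) are hypothetical and stand in for the salary tables above.

```r
top_rows <- data.frame(name = c("Manning", "Brady"),
                       salary = c(18700000, 14626720))
extra_cols <- data.frame(year = c(2008, 2008), rank = c(1, 2))

# cbind adds columns: both inputs must have the same number of rows
wide <- cbind(top_rows, extra_cols)

# rbind adds rows: both inputs must have the same column names
next_rows <- data.frame(name = "Favre", salary = 12800000)
tall <- rbind(top_rows, next_rows)

print(dim(wide))
print(dim(tall))
```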
1. merge()
• The `merge()` function in base R is used to merge two data frames by one or more
common columns or keys.
• It performs a relational database-style join operation.
# Left join (all rows from df1, matching rows from df2)
merged_data <- merge(df1, df2, by = "id", all.x = TRUE)
# Right join (all rows from df2, matching rows from df1)
merged_data <- merge(df1, df2, by = "id", all.y = TRUE)
# Full join (all rows from both datasets)
merged_data <- merge(df1, df2, by = "id", all = TRUE)
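The join types can be compared on two small data frames. The contents of df1 and df2 below are assumptions for illustration; only the shared key column `id` matters.

```r
df1 <- data.frame(id = c(1, 2, 3), name = c("Ann", "Ben", "Cy"))
df2 <- data.frame(id = c(2, 3, 4), score = c(80, 90, 70))

# Inner join (default): only ids present in both data frames
inner <- merge(df1, df2, by = "id")

# Left join: every row of df1, with NA where df2 has no match
left <- merge(df1, df2, by = "id", all.x = TRUE)

# Full join: every id from either data frame
full <- merge(df1, df2, by = "id", all = TRUE)

print(left)
```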
7. TRANSFORMATIONS
• Explains how to change a variable in a data frame
In this example:
1. We create a character variable `x` and assign it the string value `"10"`.
2. We check the class of `x` using `class(x)`, and it returns `"character"`.
3. We use the `as.numeric()` function to convert the character variable `x` to a numeric
variable. We reassign the result back to `x` using the assignment operator:
`x <- as.numeric(x)`.
4. We check the class of `x` again using `class(x)`, and it now returns `"numeric"`.
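The four steps described above can be written out as follows. This is a reconstruction from the description, so the helper variable names are assumptions.

```r
x <- "10"                    # step 1: a character value
class_before <- class(x)     # step 2: "character"

x <- as.numeric(x)           # step 3: convert and reassign
class_after <- class(x)      # step 4: "numeric"

# After conversion, arithmetic works on x
total <- x + 5
```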
Example
For example, suppose that we wanted to perform the two transformations changing the
Date column to a Date format, and adding a new midpoint variable. We could do this
with transform using the following expression:
output
[1] "symbol" "Date" "Open" "High" "Low"
[6] "Close" "Volume" "Adj.Close" "mid"
class(dow30.transformed$Date)
output
[1] "Date"
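A minimal sketch of the same two transformations with transform, using a small made-up data frame named `prices` in place of the stock data discussed above. The midpoint is assumed here to be the average of High and Low.

```r
prices <- data.frame(Date = c("2024-01-02", "2024-01-03"),
                     High = c(12, 14),
                     Low = c(10, 11),
                     stringsAsFactors = FALSE)

# Convert Date from character to Date, and add a midpoint column
prices_t <- transform(prices,
                      Date = as.Date(Date),
                      mid = (High + Low) / 2)

print(names(prices_t))
print(class(prices_t$Date))
```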
7.3 APPLYING A FUNCTION TO EACH ELEMENT OF AN OBJECT
• In R, you can apply a function to each element of an object, such as a vector, matrix, or
data frame, using various functions like `apply`, `lapply`, `sapply`, `tapply`, and
`mapply`.
• These functions are part of the "apply" family and are designed to work with different
data structures.
• These functions are powerful tools for working with data structures in R, allowing you
to apply functions to elements or subsets of data in a concise and efficient manner.
• The choice of function depends on the structure of your input data and the desired
output format.
7.3.1. apply():
• This function applies a function over the margins of an array (matrix or data frame).
• It is commonly used to apply a function to rows or columns of a matrix or data frame.
Example:
Output:
[1] 6 15 24
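One way to obtain the output shown above is sketched below. The matrix is an assumption: it is filled row by row so that its rows are 1 2 3, 4 5 6, and 7 8 9.

```r
# byrow = TRUE fills the matrix one row at a time
mat <- matrix(1:9, nrow = 3, byrow = TRUE)

# MARGIN = 1 applies the function to each row
row_totals <- apply(mat, 1, sum)

# MARGIN = 2 applies the function to each column
col_totals <- apply(mat, 2, sum)

print(row_totals)
```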
7.3.2. lapply():
• This function applies a function to each element of a list and returns a list of the same
length as the input.
Example:
Output:
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
[1] 5
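A short sketch that produces a list like the one shown above. The input list and the add-one function are assumptions chosen to match that output.

```r
numbers <- list(1, 2, 3, 4)

# lapply always returns a list with one element per input element
incremented <- lapply(numbers, function(v) v + 1)
print(incremented)
```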
7.3.3. sapply():
• This function is a wrapper around `lapply` and tries to simplify the output by returning a
vector or matrix instead of a list, if possible.
Example:
Output:
[1] 2 3 4 5
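The same add-one operation with sapply is sketched below, again with an assumed input list. The only difference from lapply is that the result is simplified to a vector.

```r
numbers <- list(1, 2, 3, 4)

# sapply simplifies the list of results to a numeric vector
incremented_vec <- sapply(numbers, function(v) v + 1)
print(incremented_vec)
```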
7.3.4 mapply()
• FUN: The function to be applied to each element of the input vectors or lists.
• ...: The vector or list arguments to which the function `FUN` should be applied.
• MoreArgs: A list of other arguments to be passed to `FUN`.
• SIMPLIFY: A logical value indicating whether the result should be simplified to a
vector or array if possible.
• USE.NAMES: A logical value indicating whether the names of the output components
should be used if possible.
`mapply()` is useful when you need to apply a function to multiple arguments that are vectors
or lists of the same length. It can be particularly handy when working with data frames or lists
of different data types.
# Define vectors
x <- c(1, 2, 3)
y <- c(4, 5, 6)
result <- mapply(sum, x, y)
print(result)
Output:
[1] 5 7 9
In this example, `mapply()` applies the `sum` function to the corresponding elements of `x`
and `y`, resulting in a vector `[1] 5 7 9`.
Output:
[1] 12 15 18
In this example, `mapply()` applies an anonymous function to each row of the data frame
`df`, summing the values in the `x`, `y`, and `z` columns for each row.
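The row-wise example described above can be sketched like this. The data frame `df` is an assumption chosen so that the row sums come out as 12, 15, and 18.

```r
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# mapply walks through the three columns in parallel, one row at a time
row_sums <- mapply(function(x, y, z) x + y + z, df$x, df$y, df$z)
print(row_sums)
```

For plain row sums, rowSums(df) gives the same values; mapply becomes more useful when the function is not a simple sum.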
8. BINNING DATA
• Discretization
• Data Exploration and Visualization
• Reducing Noise and Smoothing Data
• Feature Engineering
• Data Compression
8.1 Shingles
shingle(x, intervals=sort(unique(x)))
equal.count(x, ...)
8.2. Cut
The function cut is useful for taking a continuous variable and splitting it into discrete
pieces.
Here is the default form of cut for use with numeric vectors:
# numeric form
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3,
ordered_result = FALSE, ...)
The cut function takes a numeric vector as input and returns a factor.
Each level in the factor corresponds to an interval of values in the input vector.
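As a minimal sketch of cut, the example below bins ages into three groups. The ages, break points, and labels are made up for illustration.

```r
ages <- c(15, 25, 40, 70)

# Intervals are (0,18], (18,65], (65,Inf] because right = TRUE by default
age_group <- cut(ages,
                 breaks = c(0, 18, 65, Inf),
                 labels = c("minor", "adult", "senior"))

print(age_group)
print(table(age_group))   # number of observations in each bin
```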
8.3. Combining Objects with a Grouping Variable
Sometimes you would like to combine a set of similar objects into a single data frame, with a
column labeling the source.
You can do this with the make.groups function in the lattice package:
library(lattice)
make.groups(...)
For example, let’s combine three different vectors into a data frame:
OUTPUT
              data       which
hat.sizes1    6.25   hat.sizes
hat.sizes2    6.50   hat.sizes
hat.sizes3    6.75   hat.sizes
hat.sizes4    7.00   hat.sizes
hat.sizes5    7.25   hat.sizes
hat.sizes6    7.50   hat.sizes
hat.sizes7    7.75   hat.sizes
pants.sizes1 30.00 pants.sizes
pants.sizes2 31.00 pants.sizes
pants.sizes3 32.00 pants.sizes
pants.sizes4 33.00 pants.sizes
pants.sizes5 34.00 pants.sizes
pants.sizes6 36.00 pants.sizes
pants.sizes7 38.00 pants.sizes
pants.sizes8 40.00 pants.sizes
shoe.sizes1   7.00  shoe.sizes
shoe.sizes2   8.00  shoe.sizes
shoe.sizes3   9.00  shoe.sizes
shoe.sizes4  10.00  shoe.sizes
shoe.sizes5  11.00  shoe.sizes
shoe.sizes6  12.00  shoe.sizes
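A sketch of how a table like the one above could be produced, assuming the lattice package is installed. The three vectors are reconstructed from the values in the output, so their definitions are assumptions.

```r
library(lattice)

hat.sizes <- seq(from = 6.25, to = 7.75, by = 0.25)
pants.sizes <- c(30, 31, 32, 33, 34, 36, 38, 40)
shoe.sizes <- seq(from = 7, to = 12)

# Each vector becomes a block of rows; "which" records the source vector
grouped <- make.groups(hat.sizes, pants.sizes, shoe.sizes)
print(head(grouped))
```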
9. SUBSETS
9.1. Bracket Notation
One way to take a subset of a data set is to use the bracket notation.
Subset Function:
The `subset()` function is used to create subsets of data frames or matrices based on logical
conditions or expressions. It has the following syntax:
subset(x, subset, select, ...)
• x: the data frame or matrix from which you want to create a subset.
• subset: a logical expression that specifies the rows to be included in the subset.
• select: an optional expression specifying the columns to be included in the subset.
Example:
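A minimal sketch of subset() on a small, made-up data frame named `people`, together with the equivalent bracket notation for comparison.

```r
people <- data.frame(name = c("Alice", "Bob", "Charlie", "David", "Eva"),
                     age = c(25, 32, 18, 40, 28))

# Keep rows where age is above 26, and only the name and age columns
older <- subset(people, age > 26, select = c(name, age))

# The same selection written with bracket notation
older_brackets <- people[people$age > 26, c("name", "age")]

print(older)
```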
a. `[` (single bracket): Used to extract one or more elements; the result has the same type
as the original object (for example, a list stays a list).
b. `[[` (double bracket): Used to extract a single element from a list, or a single column
from a data frame.
Example:
# Subset a vector
x <- c(1, 2, 3, 4, 5)
x[c(1, 3, 5)]
Output:
[1] 1 3 5
# Create a vector
x <- 1:100
Random sampling is useful when you need to work with a representative subset of your
data, especially in cases where analyzing the entire dataset is computationally expensive
or when you need to perform resampling techniques like cross-validation or
bootstrapping.
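Random sampling from the vector x can be sketched with the sample function. The seed value and the sample sizes are arbitrary choices made for reproducibility of the illustration.

```r
set.seed(42)          # fix the random number generator so results repeat
x <- 1:100

# Draw 10 distinct elements (sampling without replacement)
random_subset <- sample(x, 10)

# Draw a bootstrap sample: same size as x, with replacement
bootstrap_sample <- sample(x, length(x), replace = TRUE)

print(random_subset)
```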
10. SUMMARIZING FUNCTIONS
10.1. `tapply()`:
The `tapply()` function applies a function (e.g., `mean`, `sum`, `min`, `max`) over subsets
of a vector, defined by a second factor or ragged array.
Output:
A B
2.666667 4.333333
This indicates that the mean of the values in data for the group "A" (1, 2, and 5) is 2.666667,
and the mean of the values in data for the group "B" (3, 4, and 6) is 4.333333.
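A sketch of a tapply call consistent with the groups described above. The vectors `data` and `groups` are reconstructed from that description (group A holds 1, 2, 5 and group B holds 3, 4, 6), so they are assumptions.

```r
data <- c(1, 2, 3, 4, 5, 6)
groups <- c("A", "A", "B", "B", "A", "B")

# Compute the mean of data separately for each group label
group_means <- tapply(data, groups, mean)
print(group_means)
```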
10.2. `aggregate()`:
The `aggregate()` function splits data into subsets, computes summary statistics for each
subset, and returns the results in a convenient data frame.
aggregate(mtcars$mpg, by = list(mtcars$cyl), FUN = function(x) c(mean = mean(x), sd
= sd(x)))
library(matrixStats)
# Calculate row sums of a matrix (base R's rowSums(mat) gives the same result)
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
# Rows of mat and their sums:
# 1 2 3   1+2+3=6
# 4 5 6   4+5+6=15
# 7 8 9   7+8+9=24
rowSums2(mat)
Output:
[1] 6 15 24
Output:
A B C
3 1 1
(No. of A's = 3, No. of B's = 1, No. of C's = 1)
library(reshape2)
# Convert wide format to long format
long_data <- melt(wide_data, id.vars = "id")
1. Handling Missing Values:
• Use `is.na()` to identify missing values in a data frame or vector.
• Remove rows with missing values using `na.omit()` or `complete.cases()`.
• Replace missing values with a specific value (e.g., mean, median) using functions like
`mean()`, `median()`, or the `replace()` function.
• Impute missing values using advanced techniques like k-nearest neighbors or regression
models.
2. Removing Duplicates:
• Use `duplicated()` or `unique()` to identify and remove duplicate rows or columns.
4. Formatting Data:
• Use functions like `tolower()`, `toupper()`, `trimws()`, `gsub()` for string manipulation and
formatting.
• Use `lubridate` package for working with dates and times.
5. Handling Outliers:
• Identify outliers using visualizations (boxplots, scatter plots) or statistical measures
(z-scores, IQR).
• Replace, remove, or cap outliers based on your analysis requirements.
6. Reshaping Data:
• Use `reshape2` or `tidyr` packages for reshaping data (e.g., wide to long format or vice
versa).
The `duplicated()` function returns a logical vector indicating which rows are
duplicates.
data <- read.csv("your_data.csv")
# Find duplicate columns
duplicate_cols <- duplicated(t(data))
Here, we first transpose the data frame using `t()`, and then apply `duplicated()` to find
duplicate columns.
12.2. Removing Duplicates
1. Removing Duplicate Rows: Using `duplicated()` function:
# Load the dataset
data <- read.csv("your_data.csv")
# Remove duplicate rows
data_unique <- data[!duplicated(data), ]
By negating the result of `duplicated()` (`!duplicated(data)`), you can subset the data
frame to keep only the unique rows.
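The same idea can be tried without an external file. The data frame `records` below is a made-up example in which the third row repeats the second.

```r
records <- data.frame(id = c(1, 2, 2, 3),
                      city = c("Pune", "Delhi", "Delhi", "Goa"))

# TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(records)

# Keep only the first occurrence of each row
unique_records <- records[!dup_flags, ]
print(unique_records)
```

Calling unique(records) is a shorter way to get the same rows.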
13. SORTING
• Sorting data in R is a common task and there are several ways to accomplish it.
• The primary function for sorting is `sort()`, but there are also other functions and
methods that can be used depending on the type of data structure and your specific
requirements.
Example:
# Create a vector
x <- c(5, 3, 8, 2, 1)
# Sort in ascending order
sort(x)
Output: 1 2 3 5 8
# Sort in descending order
sort(x, decreasing = TRUE)
Output: 8 5 3 2 1
`data[order(data$column), ]`:
This syntax sorts the rows of a data frame `data` based on the values in the specified
`column` in ascending order.
`data[order(-data$column), ]`: Adding a `-` before a numeric column will sort the rows in
descending order.
`data[order(data$col1, data$col2), ]`: You can specify multiple columns to sort by, with
the leftmost column taking precedence.
Example:
# Create a data frame
df <- [Link](name = c("Alice", "Bob", "Charlie", "David", "Eva"),
age = c(25, 32, 18, 40, 28))
OUTPUT (rows sorted by age in ascending order)
# name age
# 3 Charlie 18
# 1 Alice 25
# 5 Eva 28
# 2 Bob 32
# 4 David 40
OUTPUT (rows sorted by name in descending order)
# name age
# 5 Eva 28
# 4 David 40
# 3 Charlie 18
# 2 Bob 32
# 1 Alice 25
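The two orderings shown above can be sketched with order(). The exact calls used in the original example are not shown in the text, so the ones below are assumptions that yield the same row orders.

```r
df <- data.frame(name = c("Alice", "Bob", "Charlie", "David", "Eva"),
                 age = c(25, 32, 18, 40, 28))

# Ascending by age: order() returns the row positions in sorted order
by_age <- df[order(df$age), ]

# Descending by name: decreasing = TRUE works for character columns too
by_name_desc <- df[order(df$name, decreasing = TRUE), ]

# Descending by a numeric column using the minus sign
by_age_desc <- df[order(-df$age), ]

print(by_age)
```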
13.3. Sorting Lists:
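`sort()` works only on atomic vectors, so it cannot be applied to a list directly. To sort a list
of single values, flatten it with `unlist()` first, for example `sort(unlist(my_list))`.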
Example:
# Create a list
my_list <- list(c = 3, a = 1, b = 2)
Output: 1 2 3
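A short sketch of two ways to sort the list above. Both rely on unlist() to turn the list into a plain vector first; the result names are illustrative.

```r
my_list <- list(c = 3, a = 1, b = 2)

# Option 1: flatten to a named vector and sort the values
sorted_values <- sort(unlist(my_list))

# Option 2: keep the list structure, but reorder its elements
sorted_list <- my_list[order(unlist(my_list))]

print(sorted_values)
```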