• One of the challenges of working with massive datasets is gathering, preparing, and otherwise managing data from a variety of sources.

Saving, loading, and removing R data structures
• To save a data structure to a file that can be reloaded later or transferred to another system, use the save() function.
• The save() function writes one or more R data structures to the location specified by the file parameter.
• Suppose you have three objects named x, y, and z that you would like to save in a permanent file:
> save(x, y, z, file = "mydata.RData")
• The load() command recreates any data structures that have been saved to an .RData file. To load the mydata.RData file saved in the preceding code, type:
> load("mydata.RData")
• After working in an R session for some time, you may have accumulated a number of data structures.
• The ls() listing function returns a vector of the names of all the data structures currently in memory:
> ls()
[1] "blood"        "flu_status"   "gender"       "m"
[5] "pt_data"      "subject_name" "subject1"     "symptoms"
[9] "temperature"
• R automatically removes these from memory when the session ends, but for large data structures you may want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For example, to eliminate the m and subject1 objects, type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector of the object names to be removed. Combined with ls(), this clears the entire R session:
> rm(list = ls())

Importing and saving data from CSV files
• The most common tabular text file format is the CSV (Comma-Separated Values) file, which, as the name suggests, uses the comma as a delimiter.
• CSV files can be imported to and exported from many common applications.
• A CSV file representing the medical dataset constructed previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:
> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset.
• If a CSV file does not have a header, specify the option header = FALSE, as shown in the following command, and R will assign default feature names of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = FALSE)
• To save a data frame to a CSV file, use the write.csv() function. If your data frame is named pt_data, enter:
> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)

Exploring and understanding data
• After collecting data and loading it into R's data structures, the next step in the machine learning process is to examine the data in detail.
• We will explore the usedcars.csv dataset, which contains actual data about used cars.
• Since the dataset is stored in CSV form, we can use the read.csv() function to load the data into an R data frame:
> usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)

Exploring the structure of data
• One of the first questions to ask is how the dataset is organized.
• The str() function displays the structure of R data structures such as data frames, vectors, or lists. It can be used to create the basic outline for our data dictionary:
> str(usedcars)
• With this one simple command, we learn a wealth of information about the dataset: the number of observations and features, and the name, type, and first few values of each feature.
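The save(), load(), rm(), write.csv(), read.csv(), and str() calls covered so far can be tried end to end without any external files. The following is a minimal sketch, not part of the original material: the object names pt_demo, x, and pt_from_csv and the use of tempfile() paths are assumptions made for illustration.

```r
# Toy objects; the names and values are made up for this illustration.
pt_demo <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
x <- 1:3

# Temporary files keep the working directory clean.
rdata_path <- tempfile(fileext = ".RData")
csv_path   <- tempfile(fileext = ".csv")

# save() and load() round-trip one or more named objects.
save(pt_demo, x, file = rdata_path)
rm(pt_demo, x)     # remove both objects from memory
load(rdata_path)   # both objects are recreated under their original names

# write.csv() and read.csv() round-trip a single data frame as text.
write.csv(pt_demo, file = csv_path, row.names = FALSE)
pt_from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# str() outlines the column names, types, and first few values.
str(pt_from_csv)
```

Note the difference between the two formats: an .RData file restores objects under the names they were saved with, while a CSV file holds only one table and the result of read.csv() must be assigned to a name.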
Exploring numeric variables
• To investigate the numeric variables in the used car data, we will employ a common set of measurements for describing values, known as summary statistics.
• The summary() function displays several common summary statistics. Let's take a look at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to obtain summary statistics for several numeric variables at the same time:
> summary(usedcars[c("price", "mileage")])

Measuring the central tendency – mean and median
• Measures of central tendency are a class of statistics used to identify a value that falls in the middle of a set of data.
• You are most likely already familiar with one common measure of center: the average, known in statistics as the mean. In common use, when something is deemed average, it falls somewhere between the extreme ends of the scale.
• R provides a mean() function, which calculates the mean for a vector of numbers, such as three salaries:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output listed mean values for the price and mileage variables. The means suggest that the typical used car in this dataset was listed at a price of $12,962 and had a mileage of 44,261.
• Another commonly used measure of central tendency is the median, which is the value that occurs halfway through an ordered list of values.
• As with the mean, R provides a median() function, which we can apply to our salary data, as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000

Measuring spread – quartiles and the five-number summary
• To measure the diversity of the data, we need another type of summary statistic that is concerned with the spread of the data, or how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that roughly depict the spread of a feature's values:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
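The measures of center and the five-number summary listed above can be checked by hand on the three-salary vector used earlier. This is a small sketch; the variable names salaries and five are assumptions introduced here, and quantile() is used with its default settings.

```r
# The three salaries from the mean() and median() examples.
salaries <- c(36000, 44000, 56000)

mean(salaries)     # arithmetic average of the three values
median(salaries)   # middle value of the sorted vector

# quantile() with no extra arguments returns the five-number summary:
# minimum (0%), Q1 (25%), median (50%), Q3 (75%), and maximum (100%).
five <- quantile(salaries)
five

# summary() reports the same five numbers plus the mean.
summary(salaries)

# The first and last entries agree with min() and max().
min(salaries) == five[["0%"]]
max(salaries) == five[["100%"]]
```

With only three values, Q1 and Q3 fall between observed salaries, so R interpolates them; on a dataset the size of usedcars the quartiles are far less sensitive to this detail.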
• Minimum and maximum are the most extreme feature values, indicating the smallest and largest values, respectively.
• R provides the min() and max() functions to calculate these values on a vector of data.
• In R, the range() function returns both the minimum and the maximum value:
> range(usedcars$price)
• Combining range() with the diff() difference function gives the width of the range in a single command:
> diff(range(usedcars$price))
• The quartiles divide a dataset into four portions.
• The seq() function generates vectors of evenly spaced values. Passing such a vector to quantile() makes it easy to obtain other slices of the data, such as the quintiles (five groups), as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0

Exploring categorical variables
• The used car dataset has three categorical variables: model, color, and transmission.
• Additionally, we might consider treating the year variable as categorical; although it has been loaded as a numeric (int) type vector, each year is a category that could apply to multiple cars.
• A table that presents a single categorical variable is known as a one-way table.
• The table() function can be used to generate one-way tables for our used car data:
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the nominal variable and a count of the number of values falling into each category.
• R can also calculate table proportions directly, by using the prop.table() command on a table produced by the table() function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined with other R functions to transform the output. For example, to display the color proportions as percentages rounded to one decimal place:
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)

Exploring relationships between variables
• So far, we have examined variables one at a time, calculating only univariate statistics.
• We can also examine bivariate relationships, which consider the relationship between two variables.
• Relationships of more than two variables are called multivariate relationships.

Data preprocessing
• Data preprocessing is the initial phase of machine learning, in which data is prepared for machine learning models.

Steps in Data Preprocessing
• Step 1: Importing the dataset
• Step 2: Handling the missing data
• Step 3: Encoding categorical data
• Step 4: Splitting the dataset into the training and test sets (training_set and test_set)
• Step 5: Feature scaling (applied to training_set and test_set)

Step 1: Importing the dataset
• Here is how to achieve this:
Dataset = read.csv('data.csv')
• This code imports our data stored in CSV format.
• We can have a look at our data using the View() function:
View(Dataset)

Step 2: Handling the missing data
• Real datasets often contain missing values. This problem needs to be solved before we implement our machine learning models; otherwise it can seriously degrade them. It is therefore our responsibility to handle the missing data using the most appropriate technique.
• Here are two techniques we can use to handle missing data:
1. Delete the observation reporting the missing data: This technique is suitable for big datasets with very few missing values; for example, deleting one row from a dataset with thousands of observations is unlikely to affect the quality of the data. When the dataset reports many missing values, this technique can be very dangerous, because deleting many rows can lead to the loss of crucial information contained in the data.
To avoid this, we need a technique that does not harm the quality of the data.
2. Replace the missing data with the average of the feature in which the data is missing: This is a simple and widely used way to deal with missing values, and it is often preferred over the first technique when deletion would discard many rows.
Dataset$Age = ifelse(is.na(Dataset$Age),
                     ave(Dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     Dataset$Age)
• What does the code above really do?
• Dataset$Age: takes the Age column from our dataset.
• In the Age column we need to replace the missing data and, at the same time, keep the data that is not missing.
• This objective is achieved with the ifelse() function, which takes three parameters:
• The first parameter is the condition to test.
• The second parameter is the value returned where the condition is true.
• The third parameter is the value returned where the condition is false.
• Our condition is is.na(Dataset$Age). This tells us whether each value in Dataset$Age is missing. It returns a logical vector: TRUE where a value is missing and FALSE where it is not.
• The second parameter uses the ave() function, which here returns the mean of the Age column for every row.
• Because this column contains NA values, we need to exclude them from the calculation of the mean; otherwise the mean itself would be NA.
• This is the reason we pass na.rm = TRUE to the mean() function: it tells R to drop the NA values when calculating the mean of the Age vector.
• The third parameter is the value returned when the value in the Age column is not missing, that is, the original value.
• As a result, the missing value that was in the Age column of our dataset is replaced with the mean of the same column.
• We do the same for the Salary column.

Step 3: Encoding categorical data
• Encoding refers to transforming text data into numeric data.
• Encoding categorical data means transforming data that fall into categories into numeric data.
• In our dataset, the Country column is categorical data with 3 levels: France, Spain, and Germany. The Purchased column is categorical data as well, with 2 categories: YES and NO.
• The machine learning models we build on our dataset are based on mathematical equations, and those equations only take numbers.
• To transform a categorical variable into numeric codes, we use the factor() function.
• Let's start by encoding the Country column.
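Steps 2 and 3 can be tried together on a small data frame. The following is a hedged sketch rather than the original course code: the six rows are invented, the "Yes"/"No" spelling of the Purchased values is an assumption, and the numeric labels (1, 2, 3 for Country and 0, 1 for Purchased) are one possible choice among many.

```r
# Toy data frame; the values are invented for illustration only.
Dataset <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany", "France"),
  Age       = c(44, 27, 30, 38, NA, 35),
  Salary    = c(72000, 48000, 54000, NA, 61000, 58000),
  Purchased = c("No", "Yes", "No", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Step 2: replace each NA with the mean of the non-missing values in its column.
Dataset$Age <- ifelse(is.na(Dataset$Age),
                      ave(Dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      Dataset$Age)
Dataset$Salary <- ifelse(is.na(Dataset$Salary),
                         ave(Dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         Dataset$Salary)

# Step 3: encode the text columns as factors with numeric labels.
# The label values are an assumption; any distinct codes would do.
Dataset$Country <- factor(Dataset$Country,
                          levels = c("France", "Spain", "Germany"),
                          labels = c(1, 2, 3))
Dataset$Purchased <- factor(Dataset$Purchased,
                            levels = c("No", "Yes"),
                            labels = c(0, 1))

str(Dataset)
```

The str() output shows that Country and Purchased are now factors, not numeric vectors. The labels look like numbers, but R still treats these columns as categorical, which matters for any later function that expects numeric input.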
Step 4: Splitting the dataset into the training and test set
• In machine learning, we split data into two parts:
• Training set: The part of the data that we train our machine learning model on.
• Test set: The part of the data that we evaluate the performance of our machine learning model on.
library(caTools)  # required library for data splitting
set.seed(123)
# sample.split() returns TRUE if an observation goes to the training set
# and FALSE if it goes to the test set.
split = sample.split(Dataset$Purchased, SplitRatio = 0.8)
# Creating the training set and test set separately
training_set = subset(Dataset, split == TRUE)
test_set = subset(Dataset, split == FALSE)
training_set
test_set

Step 5: Feature scaling
• In most datasets, the features (also known as inputs) are not on the same scale. Many machine learning models are based on Euclidean distance.
• When the Euclidean distance is calculated, features with large units dominate those with small units, and it is as if the features with small units did not exist.
• To ensure this does not occur, we need to scale our features so that they are on a comparable scale. There are several ways to scale features. The two most used are standardization and normalization.
• Standardization rescales a feature to mean 0 and standard deviation 1; for roughly normally distributed data, most standardized values fall between -3 and 3. Normalization (min-max scaling) rescales a feature to the range 0 to 1. Neither technique requires normally distributed data, but normalization is more sensitive to extreme values because it depends on the minimum and maximum.
• The formulas for these two techniques are:
Standardization: x_stand = (x - mean(x)) / sd(x)
Normalization: x_norm = (x - min(x)) / (max(x) - min(x))
• The scale() function standardizes the columns it is given:
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
training_set
test_set
• Note that we scale only the numeric columns (columns 2 and 3 here). If we try to scale the whole data frame, R will show us an error:
training_set = scale(training_set)  # returns an error
• The reason is that our encoded columns are factors, which R does not treat as numeric entries.
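Steps 4 and 5 can be sketched on a toy data frame as follows. Several things here are assumptions rather than the original course code: the ten rows are invented, the split uses base R sample() as a stand-in for caTools::sample.split() so that the example runs without extra packages (sample.split() additionally preserves the class ratio of Purchased, which this stand-in does not), and the helper min_max() is a hypothetical name. The sketch also scales the test set with the training set's means and standard deviations, a common variation on scaling each set separately.

```r
set.seed(123)  # make the random split reproducible

# Toy data frame with already-encoded factor columns; values are invented.
Dataset <- data.frame(
  Country   = factor(c(1, 2, 3, 2, 3, 1, 2, 1, 3, 1)),
  Age       = c(44, 27, 30, 38, 40, 35, 52, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, 63000, 58000,
                79000, 83000, 67000, 52000),
  Purchased = factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
)

# Step 4 (base R stand-in for sample.split): draw 80% of the row indices.
n_train      <- round(0.8 * nrow(Dataset))
train_idx    <- sample(seq_len(nrow(Dataset)), size = n_train)
training_set <- Dataset[train_idx, ]
test_set     <- Dataset[-train_idx, ]

# Step 5: standardize only the numeric columns; factors cannot be scaled.
num_cols     <- c("Age", "Salary")
train_scaled <- scale(training_set[, num_cols])
train_center <- attr(train_scaled, "scaled:center")  # training means
train_scale  <- attr(train_scaled, "scaled:scale")   # training standard deviations

training_set[, num_cols] <- train_scaled
# Reuse the training means and standard deviations for the test set.
test_set[, num_cols] <- scale(test_set[, num_cols],
                              center = train_center,
                              scale  = train_scale)

# Min-max normalization for comparison (hypothetical helper).
min_max  <- function(x) (x - min(x)) / (max(x) - min(x))
age_norm <- min_max(Dataset$Age)

training_set
test_set
range(age_norm)
```

Reusing the training parameters means the test set is transformed exactly as new, unseen data would be. Calling scale() separately on each set, as in the code above this sketch, also runs, but it gives the two sets slightly different transformations.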