
Assignment 2

Explain the following

a. List the commands used to remove a single object or several objects.
b. Explain in detail the treatment of missing values in R

Subject: Financial Analytics


Submitted
To
Dr. Jaleel Ahmed
By
Irfan Haider
MMS203009

Ans.(a)
Data Cleaning is the process of transforming raw data into consistent data that can be
analyzed. It is aimed at improving the content of statistical statements based on the data as
well as their reliability.
Two versions of the basic function

The function that removes objects from the global environment comes in two forms. One is
remove(..., list = character()) and the other is rm(..., list = character()). Both behave
identically; the second is just quicker to type.
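
As a small illustration that the two forms do the same job, the sketch below removes one object with each. The object names x1 and x2 are made up for this example.

```r
# Two throwaway objects (hypothetical names, for illustration only)
x1 <- 10
x2 <- "text"

rm(x1)       # short form
remove(x2)   # long form, same behaviour

# Neither name is bound in the current environment any more
x1_gone <- !exists("x1", inherits = FALSE)
x2_gone <- !exists("x2", inherits = FALSE)
x1_gone
x2_gone
```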

Partial clearing with rm()

You can use the rm() function to remove one or more variables. The most straightforward
way is to pass the object names directly, as shown below.

# clear all in r tutorial


> a = 5
> b = c(1, 2, 3)
> d = c(x=5, y=4, z=10, t=2)
>
> a
[1] 5

> b
[1] 1 2 3

> d
 x  y  z  t
 5  4 10  2
>
> rm(a, b)
>
> a
Error: object 'a' not found

> b
Error: object 'b' not found

> d
 x  y  z  t
 5  4 10  2

The second way is to use the list argument. Because it takes a character vector of object
names, it can be given a vector that is defined elsewhere in the code, which makes it more
flexible. Different variables can then be deleted inside a loop or wherever the same section
of code is reused.
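
The sketch below shows why the list argument is flexible: the names to delete can sit in an ordinary character vector, or can be selected by a pattern with ls(). All object names here (tmp_a, tmp_b, tmp_c, keep_me, to_drop) are invented for the example.

```r
tmp_a   <- 1
tmp_b   <- 2
keep_me <- 3

# The names to delete are held in an ordinary character vector
to_drop <- c("tmp_a", "tmp_b")
rm(list = to_drop)

# Alternative: pick names by regular expression with ls(pattern = )
tmp_c <- 4
rm(list = ls(pattern = "^tmp_"))

# No tmp_ objects are left, while keep_me is untouched
left_over  <- ls(pattern = "^tmp_")
still_here <- exists("keep_me", inherits = FALSE)
left_over
still_here
```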

# clear all in r tutorial with list argument


> a = 5
> b = c(1, 2, 3)
> d = c(x=5, y=4, z=10, t=2)
>
> a
[1] 5

> b
[1] 1 2 3

> d
 x  y  z  t
 5  4 10  2
>
> rm(list = c("a", "b"))
>
> a
Error: object 'a' not found

> b
Error: object 'b' not found

> d
 x  y  z  t
 5  4 10  2

Clear all with rm()

In this example, rm() is called with the list argument set to the result of ls(), which
returns a character vector of the names of the objects in the global environment.
# clear all in r with rm()
> data(Titanic)
> a = 5
> b = c(1, 2, 3)
> d = c(x=5, y=4, z=10, t=2)
>
> a
[1] 5

> b
[1] 1 2 3

> d
 x  y  z  t
 5  4 10  2
>
> rm(list = ls())
>
> a
Error: object 'a' not found

> b
Error: object 'b' not found

> d
Error: object 'd' not found

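When this is done, every object that ls() reports in the global environment is deleted,
including datasets loaded with data(), such as Titanic above. Objects whose names begin
with a dot are kept unless ls(all.names = TRUE) is used. The call does not unload packages
or delete files on disk. It only removes objects from the workspace, and the memory they
used can then be reclaimed by R's garbage collector.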
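
The clear-all idiom can be tried safely with the sketch below. It uses a scratch environment in place of the global environment, which is an assumption made only so that the demonstration does not wipe a real workspace. The names scratch, a, b and .hidden are illustrative.

```r
# A scratch environment stands in for the global environment here
scratch <- new.env()
assign("a", 5, envir = scratch)
assign("b", c(1, 2, 3), envir = scratch)
assign(".hidden", "still here", envir = scratch)

# ls() skips names that start with a dot, so .hidden survives this call
rm(list = ls(envir = scratch), envir = scratch)
remaining <- ls(envir = scratch, all.names = TRUE)
remaining

# all.names = TRUE includes the dot-prefixed objects as well
rm(list = ls(envir = scratch, all.names = TRUE), envir = scratch)
n_left <- length(ls(envir = scratch, all.names = TRUE))
n_left

# Trigger garbage collection so the freed memory can be reclaimed
invisible(gc())
```

For the real workspace the same calls are written without the envir argument, for example rm(list = ls(all.names = TRUE)).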

In all the above examples, rm() can be replaced with remove() and the result is the same,
because they are two names for the same function. It is useful for clearing variables that
you are done using, either a few at a time or all at once, and it saves the time of removing
them from the global environment one by one.
Ans.(b)
Missing Values
A missing value, represented by NA in R, is a placeholder for a datum of which the type is
known but its value isn't. Therefore, it is impossible to perform statistical analysis on data
where one or more values in the data are missing. One may choose to either omit elements
from a dataset that contain missing values or to impute a value, but missingness is something
to be dealt with prior to any analysis.
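
The two choices, omitting or imputing, can be sketched as follows. The vector score and its values are made up, and mean imputation is used only because it is the simplest option to show, not because it is the recommended method.

```r
score <- c(23, 16, NA, 31)

# Option 1: omit the missing elements
score_omitted <- score[!is.na(score)]

# Option 2: impute the mean of the observed values (simple illustrative choice)
score_imputed <- score
score_imputed[is.na(score_imputed)] <- mean(score, na.rm = TRUE)

score_omitted
score_imputed
```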
In practice, analysts, and also commonly used numerical software, may confuse a missing
value with a default value or category. For instance, in Excel 2010, adding the
contents of a field containing the number 1 to an empty field results in 1. This behaviour is
most definitely unwanted, since Excel silently imputes '0' where it should have said
something along the lines of 'unable to compute'. It should be up to the analyst to decide how
empty values are handled, since a default imputation may yield unexpected or erroneous
results for reasons that are hard to trace. Another commonly encountered mistake is to
confuse an NA in categorical data with the category unknown. If unknown is indeed a
category, it should be added as a factor level so it can be appropriately analyzed. Consider as
an example a categorical variable representing place of birth. Here, the category unknown
means that we have no knowledge about where a person is born. In contrast, NA indicates
that we have no information to determine whether the birth place is known or not.
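
The difference between the category unknown and NA can be seen in a small factor. The place names below are invented sample data.

```r
# Made-up places of birth: one explicit "unknown" category and one NA
birthplace <- factor(c("Town A", "unknown", NA, "Town B"))

# "unknown" is a real factor level; NA is not a level at all
place_levels <- levels(birthplace)
place_levels

# Only the NA entry is flagged as missing
n_missing <- sum(is.na(birthplace))
n_missing

# table() counts "unknown" as a category and reports NA separately on request
table(birthplace, useNA = "ifany")
```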
The behaviour of R's core functionality is completely consistent with the idea that the analyst
must decide what to do with missing data. A common choice, namely 'leave out records with
missing data', is supported by many base functions through the na.rm option.
age <- c(23, 16, NA)
mean(age)
## [1] NA
mean(age, na.rm = TRUE)
## [1] 19.5
Functions such as sum, prod, quantile, sd and so on all have this option. Functions
implementing bivariate statistics such as cor and cov offer options to include complete or
pairwise complete values.
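
A short sketch of these options is given below. The vectors x, y and z are made-up data. For cor() the missing-value handling is chosen through its use argument.

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)
z <- c(5, 4, 3, 2, 1)

# na.rm = TRUE drops the NA before computing
total_x <- sum(x, na.rm = TRUE)
sd_x    <- sd(x, na.rm = TRUE)

# Without a 'use' setting, a single NA makes the correlation NA
cor_default <- cor(x, y)

# Only rows where both x and y are observed: (1,2), (2,4), (5,10)
cor_complete <- cor(x, y, use = "complete.obs")

# For a matrix, each pair of columns uses its own set of complete rows
cor_pairwise <- cor(cbind(x, y, z), use = "pairwise.complete.obs")

total_x
cor_default
cor_complete
```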
Besides the is.na function, R comes with a few other functions that facilitate NA handling.
The complete.cases function detects rows in a data.frame that do not contain any missing
value.
print(person)
##   age height
## 1  21    6.0
## 2  42    5.9
## 3  18   5.7*
## 4  21   <NA>
complete.cases(person)
## [1]  TRUE  TRUE  TRUE FALSE
The resulting logical vector can be used to remove incomplete records from the data.frame.
Alternatively, the na.omit function does the same.
(persons_complete <- na.omit(person))
##   age height
## 1  21    6.0
## 2  42    5.9
## 3  18   5.7*
na.action(persons_complete)
## 4
## 4
## attr(,"class")
## [1] "omit"
The result of the na.omit function is a data.frame where incomplete rows have been deleted.
The row.names of the removed records are stored in an attribute called na.action.
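
The steps above can be put together in one runnable sketch. The person data frame is rebuilt here from the printed output, with height stored as text, which is an assumption about how the original data were created.

```r
# Rebuilt from the printed example; storing height as character is an assumption
person <- data.frame(
  age    = c(21, 42, 18, 21),
  height = c("6.0", "5.9", "5.7*", NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_counts <- colSums(is.na(person))
na_counts

# Keep only the complete rows by logical indexing
keep <- complete.cases(person)
person_subset <- person[keep, ]

# na.omit() removes the same row and records which one it dropped
person_omitted <- na.omit(person)
dropped <- attr(person_omitted, "na.action")
names(dropped)
```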
