LECTURE 3

DATA MANIPULATION
Dr. Sara Elhishi
Faculty of Computers and Information Sciences
Mansoura University, Egypt
Sara_shaker2008@mans.edu.eg
Introduction

Previously, we introduced how to collect and import your data

Today, we begin to address the next steps in the Data Science Process:
manipulating and exploring our data.
Data Cleaning

- There are lots of ways to describe the general idea of manipulating data.

- You may hear terms like data cleansing, data pre-processing, and data wrangling.

- The general idea is that your data needs to be in a usable format before you can perform
analyses and modeling on it.

- For instance, real-world data often does not come in a format that is immediately usable,
especially if the data entry is performed by humans.
Take this example

- The names are in multiple cases, have special characters, are in varying orders, and have a
varying number of spaces

- The post codes have a varying number of digits

- The survey column has multiple values that correspond to the same answer, such as “Yes” and “Y”

- The date column has dates that are in many different formats
MEPS

Luckily for us, our MEPS data has already gone through pretty
thorough data cleansing and validation processes, and is
available to us in a relatively clean format, so we don’t have to
worry about a lot of these data pre-processing issues.
Dplyr Introduction

- One of the most important packages from the tidyverse: dplyr.

- dplyr will also help us begin to explore and summarize our data.

- Make sure you have the Tidyverse loaded in your R environment
Dplyr Main Functions

Filter       used to pick observations/rows in a data frame based on their values

Arrange      used to reorder the rows in a data frame

Select       used to pick variables/columns in a data frame

Mutate       used to create new variables/columns in the data frame

Slice        used to choose rows based on their location in the data frame

Summarize    used to collapse many values down to a single summary value
SELECT
Using ‘select()’ to pick specific columns

- It’s common in actuarial practice to work with sets of data that have
hundreds of fields, and maybe you know you’re only interested in five
of them for your analysis.
- In this case, it may be useful to select only this handful of columns
and store them in a new dataframe to make the data more
manageable.
GET COLUMN NAMES
Let’s first look at all the possible columns in the MEPS data using the
colnames() function.
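
As a minimal sketch, the call looks like this. The real MEPS file is not reproduced here, so the block assumes a tiny made-up data frame named meps with a few MEPS-style column names; all values are invented.

```r
# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101"),
  PANEL    = c(22, 22, 23),
  AGE42X   = c(25, 61, 34),
  SEX      = c(1, 2, 2),
  FAMINC18 = c(32000, 32000, 120500),
  REGION42 = c(1, 1, 3),
  TOTEXP18 = c(540, 8120, 0)
)

# List every column name in the data frame
colnames(meps)
```

On the full MEPS data the same call returns a much longer vector of names.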
Using ‘select()’ to pick specific columns

- Let’s say you’re only interested in the columns DUPERSID, PANEL, AGE42X, FAMINC18,
REGION42, and TOTEXP18.

- To retrieve only these columns and store them in a new dataframe called meps_mini, you
would code the following:
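
A minimal sketch of that select() call, assuming a small made-up data frame named meps in place of the real MEPS data (the values are invented; only the column names come from the slide):

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101"),
  PANEL    = c(22, 22, 23),
  AGE42X   = c(25, 61, 34),
  SEX      = c(1, 2, 2),
  FAMINC18 = c(32000, 32000, 120500),
  REGION42 = c(1, 1, 3),
  TOTEXP18 = c(540, 8120, 0)
)

# Keep only the six columns of interest and store them in meps_mini
meps_mini <- select(meps, DUPERSID, PANEL, AGE42X, FAMINC18, REGION42, TOTEXP18)

# Preview the first rows of the new data frame
head(meps_mini)
```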
Use head() to preview dataframe
The Pipe %>% Operator

- The pipe operator allows you to combine multiple commands in a row, such that the output of
one dplyr function becomes the input of the next.

- Think of it as being synonymous with the word “then.”

- You start with a dataframe (i.e. the MEPS data), “then” you apply a dplyr verb to transform
that data, “then” you apply another dplyr verb to transform the data further.
Example

- Say you wanted to start with the original MEPS data, “then” you wanted to retrieve the same
six columns as above, and “then” you decide you only want the DUPERSID and PANEL fields.
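
One way this chain might be written with the pipe is sketched below. It assumes a small made-up data frame named meps; the result name meps_ids is hypothetical and chosen only for this illustration.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101"),
  PANEL    = c(22, 22, 23),
  AGE42X   = c(25, 61, 34),
  SEX      = c(1, 2, 2),
  FAMINC18 = c(32000, 32000, 120500),
  REGION42 = c(1, 1, 3),
  TOTEXP18 = c(540, 8120, 0)
)

# Start with meps, THEN keep the columns listed earlier, THEN keep only two of them
meps_ids <- meps %>%
  select(DUPERSID, PANEL, AGE42X, FAMINC18, REGION42, TOTEXP18) %>%
  select(DUPERSID, PANEL)

head(meps_ids)
```

Each %>% passes the data frame on its left as the first argument of the function on its right, which is why meps is not repeated inside the select() calls.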
Using ‘arrange()’ to reorder the rows of a dataframe
- The arrange() verb is used to sort dataframes on one or multiple columns.

- Let’s create a new dataframe called “meps_sorted” in which we’ll use the select() function to pick
a few columns, and then sort the dataframe based on one of the columns.

- Use the head() and tail() functions to look at the first 6 and last 6 rows of the new dataframe.
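
The slide does not say which columns are picked or sorted on, so the sketch below assumes DUPERSID, AGE42X, and FAMINC18 are selected and AGE42X is the sort column. The toy meps data frame and its values are made up.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(25, 61, 34, 47),
  FAMINC18 = c(32000, 32000, 120500, 58000),
  TOTEXP18 = c(540, 8120, 0, 1300)
)

# Pick a few columns, then sort the rows by age (ascending by default)
meps_sorted <- meps %>%
  select(DUPERSID, AGE42X, FAMINC18) %>%
  arrange(AGE42X)

head(meps_sorted)  # first rows (up to 6)
tail(meps_sorted)  # last rows (up to 6)
```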
Sort descending

- The default of the arrange() function is to sort in ascending order. If we want to sort in
descending order, we’ll use the desc() function within the arrange() function.
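
A minimal sketch of a descending sort, again on a made-up meps data frame. The sort column AGE42X and the result name meps_sorted_desc are assumptions for this example.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(25, 61, 34, 47),
  FAMINC18 = c(32000, 32000, 120500, 58000)
)

# Wrap the column in desc() to sort from largest to smallest
meps_sorted_desc <- meps %>%
  arrange(desc(AGE42X))

head(meps_sorted_desc)
```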
Sort on Multiple Columns

- Let’s create a new dataframe called “meps_multi_sort” that is sorted first in descending order by
FAMINC18, and then in ascending order by AGE42X
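
This could be sketched as follows, assuming a made-up meps data frame in which two people share the same family income so that the second sort key matters:

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(61, 25, 34, 47),
  FAMINC18 = c(32000, 32000, 120500, 58000)
)

# Sort by family income (highest first); ties are broken by age (youngest first)
meps_multi_sort <- meps %>%
  arrange(desc(FAMINC18), AGE42X)

head(meps_multi_sort)
```

Columns listed earlier in arrange() take priority; later columns only decide the order among rows that tie on the earlier ones.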
FILTER
Using ‘filter()’ to subset rows in a data frame
- The filter() function allows us to subset or “pick” specific rows out of a data frame based
on certain conditions

- Let’s say we wanted a new data frame that only had rows in which a person’s age was equal to
25.

- You can use the count() function to get insight into how many people are 25
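
A minimal sketch of both steps, assuming a made-up meps data frame; the name meps_age_25 is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(25, 61, 25, 47),
  FAMINC18 = c(32000, 32000, 120500, 58000)
)

# Keep only the rows where age equals 25 (note the double == for comparison)
meps_age_25 <- meps %>%
  filter(AGE42X == 25)

# Count how many rows (people) remain; the result has a single column n
meps_age_25 %>% count()
```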
Using ‘filter()’ to subset rows in a data frame
- Let’s create a new data frame called fam_income_100k_plus which will only include rows in
which a person’s family income is greater than or equal to $100,000.
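
A sketch of this filter, assuming FAMINC18 holds family income in dollars and using a made-up meps data frame:

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(25, 61, 34, 47),
  FAMINC18 = c(32000, 100000, 120500, 58000)
)

# Keep rows with family income greater than or equal to 100,000
fam_income_100k_plus <- meps %>%
  filter(FAMINC18 >= 100000)

head(fam_income_100k_plus)
```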
Using ‘filter()’ to subset rows in a data frame
- What about if we wanted a data frame that included all regions except the Northeast?

- First, we’d look in the MEPS codebook for the REGION42 variable

- This tells you if you want to remove individuals in the Northeast, you need to remove
individuals with a REGION42 value of “1”.
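
A minimal sketch using the "not equal" operator. The meps data frame is made up, and the result name meps_no_northeast is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102", "2290003101"),
  AGE42X   = c(25, 61, 34, 47, 52),
  REGION42 = c(1, 2, 3, 1, 4)
)

# Keep every row whose region code is NOT 1 (Northeast)
meps_no_northeast <- meps %>%
  filter(REGION42 != 1)

head(meps_no_northeast)
```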
Logical Operators with Filter
AND – OR – NOT
Logical Operators with Filter

- Let’s say we wanted to filter our MEPS data to include only the rows where the Region is the
Northeast AND the age of the individuals is at least 50.
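
A sketch of the AND condition using the & operator, on a made-up meps data frame. The result name meps_ne_50_plus is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102"),
  AGE42X   = c(25, 61, 55, 70),
  REGION42 = c(1, 1, 2, 3)
)

# Both conditions must hold: Northeast (code 1) AND age at least 50
meps_ne_50_plus <- meps %>%
  filter(REGION42 == 1 & AGE42X >= 50)

head(meps_ne_50_plus)
```

Separating the two conditions with a comma inside filter() has the same effect as &.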
Logical Operators with Filter

- Now let’s suppose you wanted a data frame that includes either Northeast OR Midwest
individuals, and no others.
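
A sketch of the OR condition using the | operator. The slides only state that Northeast is code 1; the code 2 for Midwest is an assumption here and should be checked against the MEPS codebook. The meps data frame is made up, and the name meps_ne_or_mw is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101", "2290002102", "2290003101"),
  AGE42X   = c(25, 61, 55, 70, 43),
  REGION42 = c(1, 2, 3, 4, 2)
)

# Either condition may hold: Northeast (1) OR Midwest (assumed to be code 2)
meps_ne_or_mw <- meps %>%
  filter(REGION42 == 1 | REGION42 == 2)

head(meps_ne_or_mw)
```

An equivalent and often shorter form is filter(REGION42 %in% c(1, 2)).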
Challenging Example

- Let’s say you are asked to return a data frame that has the following requirements:

1. Includes only records in which the person is FEMALE, and

2. Includes only records in which the person is at least 35 years old but less than 38 years old,
and

3. Includes only records in which the person has a perceived mental health status of “Excellent”
or “Very Good”, and

4. Does not include any records in which the Region is inapplicable or is in the South or West
regions.
Challenging Example: Solution

- First, let’s look at the MEPS codebook to make sure we know which values to use from each
column.

- From this, we know that in order to select females, we’ll need to filter on a SEX value of 2.
Additionally, to filter on individuals with an Excellent or Very Good perceived mental health
status, we’ll need to filter MNHLTH42 to include the values of 1 OR 2.
Challenging Example: Solution
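
One possible solution is sketched below. The SEX and MNHLTH42 codes come from the codebook lookup described above; the region codes -1 (inapplicable), 3 (South), and 4 (West) are assumptions that should be confirmed in the codebook. The meps data frame, its IDs, and the name meps_challenge are made up for this example.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("A", "B", "C", "D", "E", "F"),
  SEX      = c(2, 2, 1, 2, 2, 2),
  AGE42X   = c(35, 37, 36, 38, 36, 36),
  MNHLTH42 = c(1, 2, 1, 1, 3, 2),
  REGION42 = c(1, 2, 1, 1, 2, 3)
)

meps_challenge <- meps %>%
  filter(
    SEX == 2,                        # 1. female
    AGE42X >= 35 & AGE42X < 38,      # 2. at least 35 but less than 38
    MNHLTH42 %in% c(1, 2),           # 3. Excellent or Very Good mental health
    !(REGION42 %in% c(-1, 3, 4))     # 4. NOT inapplicable, South, or West (assumed codes)
  )

meps_challenge
```

In the toy data only persons A and B satisfy all four requirements; each of the others fails exactly one of them.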
SLICE
slice() - choose rows based on location

- slice() allows you to pick out or select specific rows by their row-number in a dataset.

- Let’s suppose you were interested in creating a new data frame that only included rows 5
through 10 of the meps dataset.
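
A minimal sketch, assuming a made-up 12-row meps data frame; the IDs, values, and the name meps_rows_5_10 are hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data: 12 made-up rows with invented IDs.
meps <- data.frame(
  DUPERSID = sprintf("ID%02d", 1:12),
  TOTEXP18 = seq(100, 1200, by = 100)
)

# Keep rows 5 through 10 by position
meps_rows_5_10 <- meps %>%
  slice(5:10)

meps_rows_5_10
```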
slice() to select minimum or maximum

- Another common use of the slice() verb is to select minimum or maximum values out of a
data frame using slice_min() or slice_max().

- Let’s say you wanted to find the 10 rows with the highest dollar amount for total medical
expenditures in 2018.
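
A sketch using slice_max(), assuming TOTEXP18 holds total 2018 medical expenditures. The 12-row meps data frame is made up, and the name top_10_totexp is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data: 12 made-up rows with invented IDs and amounts.
meps <- data.frame(
  DUPERSID = sprintf("ID%02d", 1:12),
  TOTEXP18 = c(0, 250, 12000, 430, 98000, 15, 7600, 3100, 540, 21000, 880, 64000)
)

# Keep the 10 rows with the largest TOTEXP18, ordered from highest to lowest
top_10_totexp <- meps %>%
  slice_max(TOTEXP18, n = 10)

top_10_totexp
```

By default slice_max() keeps ties, so it can return more than n rows when several rows share the cutoff value; with_ties = FALSE forces exactly n.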
MUTATE
Modifying currently existing fields

- Let’s say we wanted to simplify the family income values by rounding them to the nearest one
thousand. Additionally, we want these new, rounded values to overwrite what was previously in
the family income field.

- To do this, we need to mutate the FAMINC18 field by rounding it to the nearest one-thousand.
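
A minimal sketch of the overwrite, on a made-up meps data frame. Passing a negative digits value to base R round() rounds to the left of the decimal point, so -3 means the nearest thousand.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101"),
  FAMINC18 = c(32499, 58750, 120800)
)

# Reusing the existing column name overwrites FAMINC18 with the rounded values
meps <- meps %>%
  mutate(FAMINC18 = round(FAMINC18, -3))

meps
```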
Adding New Column

- Let’s say you know that prices rose 20% from the year 2000 to 2018 due to inflation, and so you
want to try to convert the 2018 medical expenditure dollars to their year-2000 equivalent by
adjusting for inflation.
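
A sketch of adding the adjusted column, reading the 20% figure as "2018 prices are 1.2 times year-2000 prices", so that 2018 dollars are divided by 1.2. The meps data frame is made up, and the column name TOTEXP18_2000 is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("2290001101", "2290001102", "2290002101"),
  TOTEXP18 = c(1200, 0, 6000)
)

# A new column name adds a field and leaves TOTEXP18 unchanged
meps_adj <- meps %>%
  mutate(TOTEXP18_2000 = TOTEXP18 / 1.2)

meps_adj
```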
case_when()

- A case_when() statement allows you to create a new column or modify an existing column using
multiple if-then criteria.

- Let’s say you’re working with the MEPS data, and you’re tired of going back and forth in the
codebook to look up what the values of different fields mean.

- One solution would be to use a case_when() statement to override the values of a given field
manually.
case_when()

- For example, let’s recode the REGION42 field so that its values correspond to the actual
regions instead of integers.
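
A sketch of such a recode. Only Northeast = 1 is stated in these slides; the remaining code-to-label pairs (2 Midwest, 3 South, 4 West, -1 Inapplicable) are assumptions to verify against the MEPS codebook. The meps data frame is made up, and the name meps_region is hypothetical.

```r
library(dplyr)

# Toy stand-in for the MEPS data; every value here is made up for illustration.
meps <- data.frame(
  DUPERSID = c("A", "B", "C", "D", "E"),
  REGION42 = c(1, 2, 3, 4, -1)
)

# Conditions are checked top to bottom; the first one that is TRUE supplies the value
meps_region <- meps %>%
  mutate(REGION42 = case_when(
    REGION42 == 1  ~ "Northeast",
    REGION42 == 2  ~ "Midwest",       # assumed code
    REGION42 == 3  ~ "South",         # assumed code
    REGION42 == 4  ~ "West",          # assumed code
    REGION42 == -1 ~ "Inapplicable",  # assumed code
    TRUE           ~ NA_character_    # anything unexpected becomes missing
  ))

meps_region
```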
Task
1. Create a new data frame, called meps_ex_1, which only includes the SEX, RACETHX, and
ADBMI42 fields from the meps dataset by using the select() function along with the pipe
operator, %>%.
2. Create a new data frame, called meps_ex_2, which includes only the ADBMI42 field and sorts it
in descending order by ADBMI42.
3. Create a new data frame, called meps_ex_3, which filters the meps dataset to only include
members who have an ADBMI42 greater than or equal to 25, and have a TOTEXP18 value
between 5,000 and 10,000.
4. Create a new data frame, called meps_ex_4, which takes the meps data and creates a new field
called totexp_hundreds that is equal to TOTEXP18 divided by 100 and rounded to the nearest
one-hundred
THANKS