
MATH1324 

Applied Analytics
Week 3 - Class Worksheet
Descriptive Statistics through Visualisation

Required Packages
The following packages will be required or may come in handy.
library(readr)
library(magrittr)
library(dplyr)

Data: Behavioral Risk Factor Surveillance Data
The cdc.csv data (under the data repository) comes from the Behavioral Risk Factor Surveillance System
(BRFSS) survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify
risk factors in the adult population and report emerging health trends. For example, respondents are asked
about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level
of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains
a complete description of the survey, including the research questions that motivate the study and many
interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there
are over 200 variables in this data set, we will work with a small subset.

Variables
The cdc.csv data frame has 20,000 observations on the following 9 variables:

genhlth: a categorical vector indicating general health, with categories excellent, very good, good, fair, and
poor.

exerany: a categorical vector, 1 if the respondent exercised in the past month and 0 otherwise

hlthplan: a categorical vector, 1 if the respondent has some form of health coverage and 0 otherwise

smoke100: a categorical vector, 1 if the respondent has smoked at least 100 cigarettes in their entire life and
0 otherwise

height: a numerical vector, respondent’s height in inches

weight: a numerical vector, respondent’s weight in pounds

wtdesire: a numerical vector, respondent’s desired weight in pounds

age: a numerical vector, respondent’s age in years

gender: a categorical vector, respondent’s gender

Use this data set to complete the following exercises.

Exercises:
Exercise 1 Download the cdc.csv data (available under the data repository). Import the
data into RStudio and assign the appropriate labels to the categorical factors:
exerany, hlthplan and smoke100 as 1: Yes, 0: No, and gender as m: Male, f: Female.
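
The labelling step can be sketched on a few made-up rows. This is a minimal illustration and not the worksheet solution: the data frame `toy` and its values are invented, and the commented import line assumes cdc.csv sits in the working directory.

```r
# Importing the real file might look like this (path is an assumption):
# cdc <- readr::read_csv("cdc.csv")

# Toy stand-in for a few rows of the data; the values are made up.
toy <- data.frame(
  exerany = c(1, 0, 1, 1),
  gender  = c("m", "f", "f", "m")
)

# factor() pairs each raw code in `levels` with the label at the same position.
toy$exerany <- factor(toy$exerany, levels = c(1, 0), labels = c("Yes", "No"))
toy$gender  <- factor(toy$gender, levels = c("m", "f"), labels = c("Male", "Female"))

str(toy)
```

The same pattern would apply to hlthplan and smoke100.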

Exercise 2 Check whether the labels of the genhlth variable are in the correct order. If not,
re-order the labels as excellent, very good, good, fair, and poor.
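
As a hedged sketch of the re-ordering idea: when a character vector is turned into a factor without explicit levels, R sorts the levels alphabetically, so the order usually has to be supplied by hand. The vector `health` below is made up for illustration.

```r
# Made-up responses; factor() without `levels` sorts the labels alphabetically.
health <- factor(c("good", "poor", "excellent", "fair", "very good", "good"))
levels(health)

# Rebuild the factor with the intended order; the observations do not change.
health <- factor(
  health,
  levels = c("excellent", "very good", "good", "fair", "poor")
)
levels(health)
```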

Exercise 3 Explore the categorical variables genhlth, exerany, hlthplan, smoke100
and gender using frequency tables with proportions and bar graphs.
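
One possible way to build a frequency table with proportions and a bar graph in base R is shown below, using a made-up factor `smoke` in place of a real column.

```r
# Made-up values standing in for one categorical column.
smoke <- factor(c("Yes", "No", "No", "Yes", "No"), levels = c("Yes", "No"))

counts <- table(smoke)        # frequency of each category
props  <- prop.table(counts)  # counts divided by the total, so they sum to 1

counts
round(props, 2)

# Bar graph of the proportions.
barplot(props, ylab = "Proportion", main = "smoke (toy data)")
```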

Exercise 4 Does the general health status depend on gender? Investigate using a
cross-table and a bar graph.

Exercise 5 Does the general health status depend on smoking status (smoke100)?
Investigate using a cross-table and a bar graph.
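
A cross-table of two categorical variables, with proportions taken within each column, can be sketched as follows. The data frame `toy` is invented and uses only two health categories to keep the output small.

```r
# Made-up data: 4 males and 4 females with a two-level health rating.
toy <- data.frame(
  gender  = factor(c("Male", "Male", "Male", "Male",
                     "Female", "Female", "Female", "Female")),
  genhlth = factor(c("good", "good", "good", "poor",
                     "good", "good", "poor", "poor"),
                   levels = c("good", "poor"))
)

# Rows are health categories, columns are genders.
xt <- table(toy$genhlth, toy$gender)

# margin = 2 divides each column by its total, so each gender sums to 1.
col_props <- prop.table(xt, margin = 2)
col_props

# Clustered bar graph: one group of bars per gender.
barplot(col_props, beside = TRUE, legend.text = rownames(col_props),
        ylab = "Proportion within gender")
```

Comparing proportions within each group, instead of raw counts, keeps the comparison fair when the groups differ in size.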

Exercise 6 Generate a new variable called Body Mass Index (bmi) using the formula:

$$\mathrm{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^{2}} \times 703$$
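
A worked instance of the formula, on made-up heights and weights, may help to check the arithmetic. The data frame `toy` is hypothetical; on the real data the same expression would be applied to the height and weight columns.

```r
# Made-up respondents: height in inches, weight in pounds.
toy <- data.frame(height = c(70, 64), weight = c(175, 140))

# BMI = weight (lb) / height (in)^2 * 703, computed for every row at once.
toy$bmi <- toy$weight / toy$height^2 * 703

# First row: 175 / 70^2 * 703 = 703 / 28, about 25.11.
toy
```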

Exercise 7 Produce a histogram and box plot of body mass index. What do you notice with
the plots?

Exercise 8 Calculate descriptive statistics for body mass index. Answer the following
questions:

1. What are the mean and median?
2. Which one is a better indicator of central tendency for body mass index? Why?
3. What are the quartiles, IQR and standard deviation?
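
The base R functions for these statistics can be tried on a small made-up vector first. The values in `bmi` below are invented, and the quartiles use R's default quantile method (type 7), which may differ slightly from hand calculations.

```r
# Made-up BMI values, including one unusually large value.
bmi <- c(20, 22, 24, 26, 48)

mean(bmi)     # arithmetic mean
median(bmi)   # middle value
quantile(bmi, probs = c(0.25, 0.5, 0.75))  # quartiles (default type 7)
IQR(bmi)      # third quartile minus first quartile
sd(bmi)       # sample standard deviation (divides by n - 1)
```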

Exercise 9 Identify and remove outliers by applying a filter to the body mass index.
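
The worksheet does not say which outlier rule to use. The sketch below assumes the common box plot convention, where values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers. The vector `bmi` is made up.

```r
# Made-up BMI values; 48 is far from the rest.
bmi <- c(20, 22, 24, 26, 48)

# Quartiles and the 1.5 * IQR fences (an assumed rule, not stated in the worksheet).
q     <- quantile(bmi, probs = c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences, and the data with them removed.
outliers     <- bmi[bmi < lower | bmi > upper]
bmi_filtered <- bmi[bmi >= lower & bmi <= upper]

outliers
bmi_filtered
```

With a data frame, the same condition could be passed to dplyr's filter(), for example filter(df, bmi >= lower, bmi <= upper), where `df` is a hypothetical data frame holding a bmi column.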

Exercise 10 Produce a side-by-side box plot of BMI among general health categories using
the filtered data.
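
A side-by-side box plot can be drawn in base R with the formula interface, which splits a numeric variable by a factor. The data frame `toy` below is made up and uses only two health categories.

```r
# Made-up BMI values in two health categories.
toy <- data.frame(
  bmi     = c(21, 23, 22, 27, 29, 31),
  genhlth = factor(c("excellent", "excellent", "excellent",
                     "fair", "fair", "fair"),
                   levels = c("excellent", "fair"))
)

# bmi ~ genhlth draws one box per level of genhlth, in level order.
bp <- boxplot(bmi ~ genhlth, data = toy,
              xlab = "General health", ylab = "BMI")
```

The boxes follow the order of the factor levels, so a correctly ordered genhlth gives a plot that reads from excellent to poor.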
