For this assignment use the data set senic.xlsx, which is on Blackboard. This data set consists of
a random sample of 113 hospitals. The objective is to study the infection risk and what factors
influence it. The variables from the data set are:
Variable Name                      Description
Identification number
Length of stay                     Average length of stay in hospital (in days)
Age                                Average age of patients (in years)
Infection risk                     Average estimated probability of acquiring infection in
                                   hospital (in percent)
Routine culturing                  Ratio of number of cultures performed to number of patients
                                   without signs or symptoms of pneumonia, times 100
Routine chest X-ray                Ratio of number of X-rays performed to number of patients
                                   without signs or symptoms of pneumonia, times 100
Number of beds                     Average number of beds in hospital
Medical school                     0 = Yes, 1 = No
Average daily census               Average number of patients in hospital per day
Number of nurses                   Average number of full-time licensed practical nurses
Available facilities and services  Percent of 35 potential facilities and services that are
                                   provided by the hospital

The goal is to fit the best multiple regression model to the response (infection risk).
Do an analysis using the first 108 observations.

Use the stepwise regression method to see which model is best. Repeat using best subsets
regression. Do the two methods agree?
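
As a rough illustration of how the two searches can be compared, here is a minimal sketch using NumPy only. It rests on several assumptions that are not part of the assignment: AIC is used as the selection criterion (statistical packages often use F-to-enter/F-to-remove p-values for stepwise, and Mallows' Cp or adjusted R-squared for best subsets), the function names `fit_rss`, `aic`, `stepwise`, and `best_subset` are made up for this example, and synthetic data stands in for senic.xlsx.

```python
from itertools import combinations

import numpy as np


def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def aic(X, y, cols):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(coefficients)."""
    n = len(y)
    return n * np.log(fit_rss(X, y, cols) / n) + 2 * (len(cols) + 1)


def stepwise(X, y):
    """Bidirectional stepwise search: try every single add or drop, keep the best move."""
    selected = []
    best = aic(X, y, selected)
    while True:
        candidates = []
        for j in range(X.shape[1]):
            # Toggle predictor j: drop it if present, add it otherwise.
            trial = [c for c in selected if c != j] if j in selected else selected + [j]
            candidates.append((aic(X, y, trial), sorted(trial)))
        score, trial = min(candidates)
        if score >= best - 1e-9:  # no move improves AIC, so stop
            return sorted(selected), best
        selected, best = trial, score


def best_subset(X, y):
    """Exhaustive search over all non-empty predictor subsets, scored by the same AIC."""
    p = X.shape[1]
    scored = [(aic(X, y, list(c)), list(c))
              for k in range(1, p + 1) for c in combinations(range(p), k)]
    score, cols = min(scored)
    return cols, score


# Synthetic stand-in for the first 108 hospitals (NOT the SENIC data):
# only columns 0 and 2 truly affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 5))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=108)

sw_cols, sw_aic = stepwise(X, y)
bs_cols, bs_aic = best_subset(X, y)
print("stepwise    :", sw_cols, round(sw_aic, 2))
print("best subsets:", bs_cols, round(bs_aic, 2))
```

Because the exhaustive search examines every subset, its AIC can never be worse than the stepwise result; the two methods agree whenever the stepwise path happens to reach the same subset.
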
Are there any outliers in the data? Look for x-outliers, y-outliers, and high-influence points.
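
The three kinds of unusual points can be screened with leverage values (x-outliers), studentized deleted residuals (y-outliers), and Cook's distance (influence). The sketch below is one possible way to compute them with NumPy. The function name `influence_measures`, the synthetic data, and the cutoffs (leverage above 2p/n, absolute studentized deleted residual above 3, Cook's distance above 4/n) are assumptions chosen for illustration; a formal analysis would typically compare against t and F critical values instead.

```python
import numpy as np


def influence_measures(X, y):
    """Leverage, studentized deleted residuals, and Cook's distance for OLS with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]
    # Diagonal of the hat matrix: with A = QR, h_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta
    sse = e @ e
    mse = sse / (n - p)
    # Studentized deleted residual: t_i = e_i * sqrt((n-p-1) / (SSE*(1-h_ii) - e_i^2))
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))
    # Cook's distance: D_i = e_i^2 / (p*MSE) * h_ii / (1-h_ii)^2
    cooks = e**2 / (p * mse) * h / (1 - h) ** 2
    return h, t, cooks


# Synthetic stand-in data (NOT the SENIC data) with two planted unusual points.
rng = np.random.default_rng(1)
X = rng.normal(size=(108, 3))
X[0] = [8.0, 8.0, 8.0]  # planted x-outlier: far from the other rows
y = 1.0 + X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.3, size=108)
y[1] += 5.0             # planted y-outlier: response far from the fitted surface

h, t, cooks = influence_measures(X, y)
n, p = len(y), X.shape[1] + 1
print("high leverage (h > 2p/n):", np.where(h > 2 * p / n)[0])
print("large |t| (> 3)         :", np.where(np.abs(t) > 3)[0])
print("large Cook's D (> 4/n)  :", np.where(cooks > 4 / n)[0])
```
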
Come up with one model that you think best describes the data and can be used for future
predictions. Show the residual plot for this model. Does the model seem appropriate?
Use this model to predict y, with prediction intervals, for the last 5 observations of the data,
and check whether the model predicts them well.
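
The hold-out check can be sketched as follows: fit on the first 108 rows, then form a prediction interval for each of the last 5 rows using the standard error of prediction, s{pred} = sqrt(MSE * (1 + x_h'(X'X)^{-1} x_h)). This is a minimal illustration under stated assumptions: the function name `prediction_intervals` is made up, synthetic data stands in for senic.xlsx, and the t critical value is passed in by the caller (about 1.98 for a 95% interval with roughly 100 degrees of freedom) rather than computed.

```python
import numpy as np


def prediction_intervals(X_train, y_train, X_new, t_crit):
    """Predictions and prediction limits for new rows under an OLS fit with intercept.

    t_crit is the t critical value with n - p degrees of freedom, supplied by the caller.
    """
    n = len(y_train)
    A = np.column_stack([np.ones(n), X_train])
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    resid = y_train - A @ beta
    mse = resid @ resid / (n - p)
    XtX_inv = np.linalg.inv(A.T @ A)
    # Quadratic form x_h' (X'X)^{-1} x_h for every new row at once.
    quad = np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)
    se_pred = np.sqrt(mse * (1.0 + quad))
    y_hat = A_new @ beta
    return y_hat, y_hat - t_crit * se_pred, y_hat + t_crit * se_pred


# Synthetic stand-in for the 113 hospitals (NOT the SENIC data).
rng = np.random.default_rng(2)
X = rng.normal(size=(113, 3))
y = 4.0 + X @ np.array([0.6, 0.2, -0.4]) + rng.normal(scale=0.5, size=113)

X_train, y_train = X[:108], y[:108]  # first 108 observations: model fitting
X_new, y_new = X[108:], y[108:]      # last 5 observations: held out

# 1.98 approximates the 0.975 t quantile for 108 - 4 = 104 degrees of freedom.
y_hat, lower, upper = prediction_intervals(X_train, y_train, X_new, t_crit=1.98)
covered = (lower <= y_new) & (y_new <= upper)
print(np.column_stack([y_new, y_hat, lower, upper]).round(2))
print("observed values inside their interval:", int(covered.sum()), "of", len(y_new))
```

If SciPy is available, the critical value could instead be obtained with `scipy.stats.t.ppf(0.975, df)`. With 95% intervals, most of the five held-out values should fall inside their limits if the model is adequate, although with only five points a single miss is not strong evidence against it.
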