
CSC367- Spring 2018, SPSS practice exercises

SPSS tutorial #2

Problem 1 (Data Preprocessing):

The dataset stored under cpu_problem.xls (posted under the course documents for week 3)
contains 8 attributes (6 predictive attributes, 2 non-predictive) used to predict the relative CPU
performance (the ninth attribute in the dataset). The description of the attributes is as follows:
v1. vendor name: 30
(adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec,
dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson,
microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry,
sratus, wang)
v2. Model Name: many unique symbols
v3. MYCT: machine cycle time in nanoseconds (integer)
v4. MMIN: minimum main memory in kilobytes (integer)
v5. MMAX: maximum main memory in kilobytes (integer)
v6. CACH: cache memory in kilobytes (integer)
v7. CHMIN: minimum channels in units (integer)
v8. CHMAX: maximum channels in units (integer)
v9. PRP: published relative performance (integer)

a) Import the Excel file into SPSS and make sure that the types of the variables in SPSS
match the types given in the description of the attributes above (if they do not, use
the Variable View to make the appropriate changes; also add labels to your variables
using the description above).
b) Visualize and interpret the data

I. Use bar graphs for V1 and V2

II. Use box plots and histograms for the other variables


c) Perform binning for V3
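
As a rough illustration of what binning does to V3 (MYCT), here is a minimal Python sketch of equal-width binning. It is not the SPSS procedure itself; the function name `equal_width_bins`, the choice of three bins, and the sample cycle times are all assumptions made for this example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything belongs to a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land on the upper edge, so clamp it
        # into the last bin.
        bins.append(min(idx, n_bins - 1))
    return bins

# Hypothetical cycle times in nanoseconds, NOT taken from cpu_problem.xls.
myct = [17, 25, 50, 125, 200, 480, 1500]
print(equal_width_bins(myct, 3))
```

With skewed values like these, equal-width bins put almost every case in the first bin, which is one reason to also consider equal-frequency (percentile-based) bins.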


d) Calculate the distances among cases and identify the most dissimilar two cases
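
To make the idea in part d) concrete, the following Python sketch computes pairwise Euclidean distances and reports the pair of cases that are farthest apart. It assumes Euclidean distance on z-scored numeric attributes; the exercise does not prescribe a distance measure, and the helper names and sample cases are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so that no single attribute dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_dissimilar(rows):
    """Return (i, j, distance) for the pair of cases that are farthest apart."""
    z = standardize(rows)
    return max(
        ((i, j, euclidean(z[i], z[j])) for i, j in combinations(range(len(z)), 2)),
        key=lambda t: t[2],
    )

# Hypothetical cases (MYCT, MMIN, MMAX), NOT taken from cpu_problem.xls.
cases = [
    [125, 256, 6000],
    [29, 8000, 32000],
    [1500, 768, 1000],
    [140, 2000, 4000],
]
i, j, d = most_dissimilar(cases)
print(f"Most dissimilar cases: {i} and {j} (distance {d:.2f})")
```

Standardizing first is a design choice: without it, MMAX (in the thousands) would swamp the contribution of the other attributes.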

e) Perform a correlation analysis. Interpret the correlation matrix and summarize the relationships
among the variables based on this analysis. Are there any variables strongly correlated
(correlation greater than 0.8)?
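
The check for strongly correlated variables can be sketched in Python as follows. This is only an illustration: it uses the Pearson coefficient, treats strongly negative correlations as strong too (an assumption; the exercise only says "greater than 0.8"), and runs on made-up values rather than the real dataset.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strong_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    out = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            out.append((a, b, round(r, 3)))
    return out

# Hypothetical values, NOT taken from cpu_problem.xls.
data = {
    "MMIN": [256, 512, 1024, 2048, 4096],
    "MMAX": [1024, 2048, 4096, 8192, 16384],
    "CHMIN": [3, 1, 4, 1, 5],
}
pairs = strong_pairs(data)
print(pairs)
```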

Problem 2 (Feature selection through regression analysis)


The Forest Fires dataset from the University of California at Irvine Machine Learning
Repository (http://archive.ics.uci.edu/ml/datasets/Forest+Fires) provides the following attributes,
considered important when predicting the burned area of forest fires in the northeast region of
Portugal from meteorological and other data:

Attribute Information:

1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month - month of the year: 'jan' to 'dec'
4. day - day of the week: 'mon' to 'sun'
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
6. DMC - DMC index from the FWI system: 1.1 to 291.3
7. DC - DC index from the FWI system: 7.9 to 860.6
8. ISI - ISI index from the FWI system: 0.0 to 56.10
9. temp - temperature in Celsius degrees: 2.2 to 33.30
10. RH - relative humidity in %: 15.0 to 100
11. wind - wind speed in km/h: 0.40 to 9.40
12. rain - outside rain in mm/m2 : 0.0 to 6.4
13. area - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very
skewed towards 0.0, thus it may make sense to model with the logarithm transform).
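
Because the area values include exact zeros, a plain logarithm is undefined for many cases. A common workaround, shown in the Python sketch below, is ln(1 + area). Treating this as the intended transform is an assumption, and all sample values except the stated maximum of 1090.84 are hypothetical.

```python
from math import expm1, log1p

# Hypothetical burned areas in hectares; only 1090.84 (the stated maximum)
# comes from the attribute description.
area = [0.0, 0.0, 0.36, 6.43, 1090.84]

log_area = [log1p(a) for a in area]      # ln(1 + a), defined at a = 0
restored = [expm1(v) for v in log_area]  # inverse transform back to hectares

for a, v in zip(area, log_area):
    print(f"{a:8.2f} ha -> {v:.3f}")
```

If the regression is fitted on the transformed target, predictions must be mapped back with the inverse transform before they are read as hectares.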

a) Perform a forward selection on the dataset and analyze the regression model and the
selected features when predicting the burned area (attribute 13, area).
b) Perform a backward selection on the dataset and analyze the regression model and the
selected features when predicting the burned area (attribute 13, area).
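
The logic of forward selection can be sketched in plain Python as below. This is a simplified illustration, not the SPSS procedure: it adds the feature that lowers the residual sum of squares (RSS) most and stops when the relative improvement falls under a hypothetical `min_gain` threshold, whereas statistical packages typically use an F-test entry criterion. All function names and data values are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(cols, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus cols."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def forward_select(features, y, min_gain=0.05):
    """Greedily add the feature that lowers RSS most; stop on small gains."""
    chosen, best = [], rss([], y)
    while len(chosen) < len(features):
        name, score = min(
            ((n, rss([features[c] for c in chosen + [n]], y))
             for n in features if n not in chosen),
            key=lambda t: t[1],
        )
        if best == 0 or (best - score) / best < min_gain:
            break
        chosen.append(name)
        best = score
    return chosen

# Hypothetical predictors and response, NOT taken from the Forest Fires data.
features = {"temp": [5, 10, 15, 20, 25, 30], "wind": [4, 2, 6, 3, 5, 1]}
y = [11.1, 20.9, 31.2, 40.8, 51.1, 60.9]
print(forward_select(features, y))
```

Backward selection follows the same pattern in reverse: start from the full model and repeatedly drop the feature whose removal hurts the fit least.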

Problem 3 (Feature extraction through PCA)

Perform Principal Component Analysis on the data provided for Problem 2.
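
To show what a principal component is, the sketch below finds the first component of a small dataset by power iteration on the covariance matrix. This is one simple way to approximate the leading eigenvector and is not necessarily how a statistics package computes it; the function names and the perfectly collinear sample data are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(cov, iters=500):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    p = len(cov)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(row[k] * v[k] for k in range(p)) for row in cov]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v (v has unit length) gives the eigenvalue.
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(p) for b in range(p))
    return v, eigenvalue

# Hypothetical two-variable data in which the second column is exactly
# twice the first, so one component carries all of the variance.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
cov = covariance_matrix(rows)
v, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[k][k] for k in range(len(cov)))
print(v, explained)
```

Standardizing each column before computing the covariance matrix is equivalent to analyzing the correlation matrix, which matters when attributes are measured on very different scales.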

Problem 4 (Dimensionality Reduction)

Repeat Problems 2 and 3 on the Auto MPG data from:
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
