
Hands-on Workshop: SAS Visual Analytics - CASE: Big Organics

Preparing Data Using SAS Visual Analytics

Data
The Big Organics data set contains 13 variables and 111,115 observations.

Name | Type | Class | Description
Affluence Grade | Numeric | Measure | Affluence Grade is a grade measured on a scale from 1 to 34
Age | Numeric | Measure | Age is the age in years
Customer Loyalty ID | Character | Category | Customer Loyalty Identification number
Gender | Character | Category | Customer gender (M = male, F = female, U = unknown)
Geographic Region | Character | Category | Geographic Region (5 regions in UK: South East, South West, Midlands, North, Scottish)
Loyalty Card Tenure | Numeric | Measure | Loyalty Card Tenure is the time as a loyalty card member (in months: 0-39)
Loyalty Status | Character | Category | Status of the loyalty card (Tin, Silver, Gold, Platinum)
Neighborhood Cluster-55 Level | Character | Category | Micro segment of the neighborhood - externally acquired
Neighborhood Cluster-7 Level | Character | Category | Macro segment of the neighborhood - externally acquired
Organics Purchase Count | Numeric | Measure | TARGET (discrete) - Number of Organic Products Purchased
Organics Purchase Indicator | Numeric | Measure | TARGET (binary) - Organic Products Purchased? (1 = yes, 0 = no)
Television Region | Character | Category | Regional TV broadcasting (see details in this link)
Total Spend | Numeric | Measure | Total amount spent (previously)

Goals
• Explore the data.
• Create a customer segmentation based predominantly on demographics.
• Use predictive models to determine which customers are most likely to buy organic products.


Open SAS Visual Analytics.

Click Data and select the BIGORGANICS data source in the TUNDATA directory. Click OK.

Now you are ready to start!


Exploration
Our target variable, Organics Purchase Indicator, is classified as a Measure, but it is more convenient to
treat it as a (binary) Category. Change its classification to Category.

To start, let’s check the distribution of Gender: drop the data item Gender onto your page.

How many customers purchased Organic products?

Select Organics Purchase Indicator from Data – Measures and drop it at the bottom of your visualization
window (you will see + Auto Chart). A new chart is created for our target variable.
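
For readers who want to see what the chart is counting, the frequency of each level of the binary target can be sketched in plain Python. This is only an illustration outside SAS Visual Analytics: the `indicator` values are made up, not taken from BIGORGANICS, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy values of Organics Purchase Indicator (1 = yes, 0 = no).
# Made up for illustration; not taken from the BIGORGANICS data.
indicator = [0, 1, 0, 0, 1, 0, 1, 0]


def frequency_table(values):
    """Return {level: (count, percent)} for a variable treated as a category."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, 100.0 * n / total) for level, n in sorted(counts.items())}


table = frequency_table(indicator)
for level, (n, pct) in table.items():
    # One line per level: the level, its count, and its share of all rows.
    print(f"{level}: {n} ({pct:.1f}%)")
```

Treating the 0/1 values as category levels rather than as numbers to be summed is the reason for changing the classification from Measure to Category.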

Hands-on Exercise 1:

1) Create charts to explore Loyalty Status, Geographic Region, Affluence Grade, Age, and Loyalty Card
Tenure. Explore other variables as you wish.


2) On Data, open the measure details and quickly verify which variables have missing values.
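
The check in step 2 amounts to counting missing values per variable. A minimal Python sketch of that count follows, outside SAS Visual Analytics. The `rows` records are made-up toy values, `None` stands in for a missing value, and `missing_counts` is a hypothetical helper name.

```python
# Hypothetical toy records; None marks a missing value.
# Values are made up for illustration, not taken from BIGORGANICS.
rows = [
    {"Age": 45, "Gender": "F", "Loyalty Card Tenure": 5},
    {"Age": None, "Gender": "M", "Loyalty Card Tenure": None},
    {"Age": 61, "Gender": None, "Loyalty Card Tenure": 12},
    {"Age": None, "Gender": "U", "Loyalty Card Tenure": 3},
]


def missing_counts(records):
    """Count missing (None) values for each variable across all records."""
    counts = {}
    for record in records:
        for name, value in record.items():
            # (value is None) is True/False, which adds as 1/0.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


missing = missing_counts(rows)
for name, n in missing.items():
    print(f"{name}: {n} missing")
```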

Transform and Modify


As you saw, some variables have missing values. Some modeling techniques, such as decision trees, can
handle them, but others, such as regressions, cannot.

NOTE: In Linear Regression, Logistic Regression, and Generalized Linear Model, by default, all
observations that contain a missing value in any assigned role variable are dropped. In some cases, the
fact that an observation contains a missing value provides relevant information. Selecting the
Informative Missingness property handles missing values automatically: for measure variables, missing
values are imputed with the observed mean and an indicator variable is created; for category variables,
missing values are treated as a distinct level.

So, to avoid discarding observations containing missing values and to deal with cases where the automatic
process would not be appropriate, we can treat these cases ourselves. This process is usually called imputation.
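
The behavior described in the NOTE on Informative Missingness can be sketched in plain Python. This only approximates the description above and is not SAS's implementation. The function names, the `"(missing)"` level label, and the toy values are all assumptions made for illustration.

```python
from statistics import mean


def impute_measure(values):
    """Mean-impute a measure; return (imputed_values, missing_indicator)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # mean of the observed (non-missing) values only
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]  # 1 flags an imputed row
    return imputed, indicator


def impute_category(values, missing_level="(missing)"):
    """Treat missing category values as their own distinct level."""
    return [missing_level if v is None else v for v in values]


# Hypothetical toy values; None marks a missing value.
ages = [40, None, 60, 50]
statuses = ["Gold", None, "Tin"]

ages_imp, ages_flag = impute_measure(ages)
statuses_imp = impute_category(statuses)
print(ages_imp, ages_flag, statuses_imp)
```

The indicator keeps the information that a value was missing, which would otherwise be lost once the gap is filled with the mean.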

First, Gender. On Data, click ADD and select Add custom category. Name the new categorical variable
Gender_IMP (for imputed), base it on Gender, replace the missing values with “U” (already
used for unknown), and set Remaining Values to Show as is.

It should look like this:


Now we need to impute the measures. Let’s treat Loyalty Card Tenure by replacing its missing values
with 0. Add a calculated item called LC Tenure_IMP that replaces the missing values of
Loyalty Card Tenure with 0, then check its distribution.


Hint: The SAS code for this is:


IF ( 'Loyalty Card Tenure'n Missing )
RETURN 0
ELSE 'Loyalty Card Tenure'n
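
The logic of the expression above (return 0 when the value is missing, otherwise return the value unchanged) can also be sketched in plain Python. This is an illustration outside SAS Visual Analytics: `impute_constant` is a hypothetical helper, `None` stands in for a missing value, and the `tenure` values are made up.

```python
def impute_constant(value, replacement=0):
    """Return `replacement` when `value` is missing (None), else `value` unchanged."""
    return replacement if value is None else value


# Hypothetical toy values of Loyalty Card Tenure in months; None marks a missing value.
tenure = [5, None, 12, None, 0]

# Equivalent in spirit to the calculated item LC Tenure_IMP.
lc_tenure_imp = [impute_constant(v) for v in tenure]
print(lc_tenure_imp)
```

Note that a genuine tenure of 0 and an imputed 0 become indistinguishable after this step, which is worth keeping in mind when checking the distribution.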


Hands-on Exercise 2:

Impute the missing values of these other variables:

1) Age: replace the missing values with 50.

2) Affluence Grade: replace the missing values with 0.
