
http://www.sv-europe.com/crisp-dm-methodology/

https://courses.bigdatauniversity.com/courses/course-v1:BigDataUniversity+PA0101EN+2016/courseware/a70fec8899fd464289ee11117f2600ce/1096ce7eb267434f86019372585d06fb/

General Information

•  This course is free.

•  It is self-paced.

•  It can be taken at any time.

•  It can be taken as many times as you wish.

•  Labs can be performed by downloading the free trial version of IBM SPSS Modeler.

•  There is only ONE chance to pass the course, but multiple attempts per question (see the Grading Scheme section for details).

Prerequisites

•  None

Recommended skills prior to taking this course

•  Basic knowledge of business statistics (recommended but not required)

Learning Objectives
In this course you will learn about:

•  Introduction to Data Mining

•  CRISP-DM Methodology

•  Introduction to IBM SPSS Modeler - predictive data mining workbench

•  SPSS Modeler interface

Syllabus

Lesson 1 - Introduction to Data Mining

•  Introduction to Data Mining

•  CRISP-DM Methodology

•  Introduction to SPSS Modeler - predictive data mining workbench

•  SPSS Modeler Interface

Lesson 2 - The Data Mining Process

•  Business Understanding

•  Data Understanding

•  Data Preparation

Lesson 3 - Modeling Techniques

•  Introduction to Common Modeling Techniques

•  Cluster Analysis (Unsupervised Learning)

•  Classification & Prediction (Supervised Learning)

•  Classification - Training & Testing

•  Sampling Data in Classification

•  Predictive Modeling Algorithms in SPSS Modeler

•  Automated Selection of Algorithms

Lesson 4 - Model Evaluation

•  Metrics for Performance Evaluation

•  Accuracy as Performance Evaluation tool

•  Overcoming Limitations of Accuracy Measure

•  ROC Curves

Lesson 5 - Deployment on IBM Bluemix

•  Scoring new data

•  Deployment of the Predictive Model

•  What is IBM Bluemix?

•  Predictive Modeling service: Deployment in the Cloud

•  SPSS Collaboration and Deployment Services

About the Software

IBM SPSS Modeler is a comprehensive predictive analytics platform designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. The following video provides an overview of the product.

IBM SPSS MODELER: THE POWER OF PREDICTIVE INTELLIGENCE

Trial software
Register for the free trial of IBM SPSS Modeler software that can be used in
this course

Community Support
Visit the Predictive Analytics community for up-to-date information about
Predictive Analytics such as blogs, discussions, and more!

Grading scheme
1. The minimum passing mark for the course is 70%, with the following weights:

•  50% - All Review Questions

•  50% - The Final Exam

2. Although the Review Questions and the Final Exam are each worth 50%, the only grade that matters is the overall grade for the course. For example, 80% on the Review Questions and 64% on the Final Exam gives an overall grade of 0.5 × 80 + 0.5 × 64 = 72%, which passes.

3. Review Questions have no time limit. You are encouraged to review the course material to find the answers. Please remember that the Review Questions are worth 50% of your final mark.

4. The Final Exam has a 1-hour time limit.

5. Attempts are counted per question in both the Review Questions and the Final Exam:

•  One attempt - for True/False questions

•  Two attempts - for any question other than True/False

6. There are no penalties for incorrect attempts.

7. Clicking the "Final Check" button, when it appears, means your submission is FINAL. You will NOT be able to resubmit your answer for that question again.

8. Check your grades in the course at any time by clicking the "Progress" tab.

Certificate and Badge Information

When you pass this course you will receive:

•  An online downloadable completion certificate.

This will be enabled immediately after passing the course (see the section at the bottom titled "Completion Certificate and Badge").

Change Log
This course was last updated on May 2nd, 2016:

•  The course code was changed from DS101EN to PA0101EN.

Copyrights and Trademarks

IBM®, the IBM logo, and ibm.com® are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at: ibm.com/legal/copytrade.shtml

References to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

Netezza®, Netezza Performance Server®, and NPS® are trademarks or registered trademarks of IBM International Group B.V., an IBM Company.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other product, company or service names may be trademarks or service marks of others.

About this course

Module 1: Introduction to Data Mining

Learning Objectives

Introduction to Data Mining (11:20)

IBM SPSS Modeler Interface (4:52)

Lab 1 - SPSS Installation (3:35)

Review Questions

Learning objectives
In this lesson you will learn about:

•  Introduction to Data Mining

•  CRISP-DM Methodology

•  Introduction to IBM SPSS Modeler - predictive data mining workbench

•  SPSS Modeler Interface

THE DATA MINING PROCESS (12:06)


Hello, and welcome back to Predictive Modeling Fundamentals I. Your hosts for this lesson are a product marketing manager and a product manager for IBM Predictive Analytics. We will pick up where we left off and discuss the first steps of the CRISP-DM methodology, then data preparation and data preprocessing, and finally we will perform data preparation in SPSS Modeler. Our agenda: we start with the first phase of CRISP-DM, Business Understanding, and introduce the use case that we will use throughout the rest of this course; we then go over Data Understanding and the tools for data exploration available in Modeler; then we move on to data preparation and preprocessing and review the Modeler tools that support it; and then we go into a lab.

The first phase of CRISP-DM is Business Understanding. Without knowing exactly what the objectives and requirements of the project are, the chances of success are slim. Before attempting any data mining task you have to know what you are going to do, what you are going to look for, and what data you are going to need and why. In our case we are looking at the sinking of the Titanic. In this use case, the Titanic collided with an iceberg and most people died, largely because of the lack of lifeboats and life jackets. The passengers most likely to survive were women, children, and the upper class. Our project challenge is to analyze the data and predict, per passenger, whether they were likely to survive.

The dataset comes from Kaggle (kaggle.com), a site that hosts data mining competitions and makes a lot of datasets publicly available; it is a great way to learn and explore. For our case the data is already separated at the source into a training dataset and a testing dataset, so we do not have to split it ourselves.

Some of the tools available in SPSS Modeler for exploring data: the Data Audit node gives you a quick grasp of the data, and there are graph nodes for plotting distributions and histograms as well as web graphs that show relationships across categorical values. These are tools we will be using throughout the course.

Data preparation is where we create the initial dataset for modeling, and oftentimes the preparation tasks take a long time; these steps can be repeated multiple times, because real-world data is often dirty. It can be incomplete (attribute values are missing - for example, an age of "nothing", which is obviously not correct), it can be noisy (it contains errors or outliers), and it can be inconsistent (codes or names disagree between different tables). We have to account for all of that before we can run a model that actually makes sense.

The key tasks in data preprocessing are: data cleansing - filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies; data integration - combining data from different databases and sources; data transformation - normalizing and scaling attributes (for example, if you have income in dollars, its large values can dominate the other fields, so we normalize it, say to a 0-1 range); and data reduction - for example Principal Component Analysis or factor analysis, where you obtain a reduced representation of the data that captures as much of the variability and information as possible in a smaller size, which is especially necessary when you are working with extremely large data.

There are several ways to handle noisy data. One is binning: you sort the data and partition it into bins. You can also use clustering to see how the data points group together, which helps us identify outliers. Manual inspection by a subject matter expert is also crucial: an algorithm can detect potential outliers, but the expert who actually knows the data can decide what they mean. Outliers are data objects that are very different from the general representation of the data; sometimes they are errors and have to be corrected, but sometimes they carry valuable information and have to be included in the model, and this is where subject matter expertise really comes into play.

When it comes to data transformation, smoothing removes noise from the data; we can aggregate data; we can normalize it, for example with min-max scaling (everything from 0 to 1) or z-score normalization; and we can create new features or attributes, for example with principal components analysis.

Some of the tools available to us in Modeler for data preparation: the Type node lets us define the type of each attribute - continuous (for example income or age), categorical string values, or a flag such as yes/no - and its role. A field can be an Input, which is used as a predictor in the model; a Target, which is the outcome we are trying to predict; sometimes Both, which is used specifically for association rules; None, which excludes it; and sometimes Partition, which means we are using this field to separate the data into training, testing, and validation sets.

Another important issue to deal with when preparing data is handling missing values. It is important to understand and handle them, because the quality of the data determines the quality of the model - garbage in, garbage out - and missing data can produce misleading results if it is not identified before the analysis. There are different kinds of missing values: a null or no value in a numeric field, a blank or white space in a string field, and user-specified missing values - for example, an age coded as 1999 to signal that the real value is unknown. The Data Audit node helps you handle these data preparation tasks: it helps you analyze the data, identify missing values, and normalize fields if you want to, and it takes care of some of the tedious work, which can otherwise be very time consuming.

A number of record operations are available in Modeler as we continue the extract-transform-load (ETL) process: you can sample the data, sort it, balance it when a certain class is under-represented, and aggregate it. Field operations let you derive new variables, fill values, reclassify fields, partition the data into training and testing sets, and restructure a categorical attribute into a series of dummy variables. A powerful way to perform these operations is the Derive node with the Expression Builder, which can be used, for example, to compute revenue from other fields; you can do almost everything with point-and-click, check whether a field is an integer, convert a variable, and so on - there are a lot of tools available in the expression builder.

In the lab we are going to load the Titanic data, explore it, and prepare it for modeling, and we will see you in the course.
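The lecture performs these data-preparation tasks (filling missing values, normalization, binning, dummy variables) with Modeler nodes. For readers who want to experiment outside Modeler, here is a minimal sketch of the same ideas in Python with pandas; the file name and column names are assumed from the Kaggle Titanic data and are not part of the course labs.

import pandas as pd

# Load the Kaggle Titanic training data (file name is an assumption).
df = pd.read_csv("train.csv")

# Data cleansing: fill missing Age values with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Data transformation: z-score normalization of Fare so its large values
# do not dominate the other fields.
df["Fare_z"] = (df["Fare"] - df["Fare"].mean()) / df["Fare"].std()

# Min-max scaling of Age onto a 0-1 range.
df["Age_01"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())

# Handling noisy data by binning: sort Age into a few discrete bands.
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 12, 18, 40, 60, 100],
                       labels=["child", "teen", "adult", "middle", "senior"])

# Restructuring: expand the categorical Sex field into dummy (flag) columns.
df = pd.get_dummies(df, columns=["Sex"])

print(df[["Age", "Age_01", "Fare_z", "AgeBand"]].head())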

LAB 2 - DATA EXPLORATION AND PREPARATION (10:48)

www.kaggle.com/wayoflores

In this tutorial for Predictive Modeling Fundamentals I, we are going to get the data for our project, bring it into SPSS Modeler, and start doing some preparation.

The first step is getting our dataset. There is a link in the tutorial document that goes along with the series that directs you to the training and testing datasets; the page is kaggle.com/c/titanic/data, and you should see something like what is on the screen. This is a dataset of the passengers on the Titanic with some information about them, such as what they paid for a ticket, their sex, their age, where they were staying on the ship, and so on. Our class variable, the one we are trying to predict, is whether they survived or died when the Titanic sank. There are two CSV files to download, the training file and the testing file; download them to wherever you work out of, and then we will be ready to move forward.

Once you have those downloaded, go back to SPSS Modeler and start a new stream. Under the Sources palette there is a Var. File node, which is what you use for importing text files into Modeler. The dialog is very easy: just to the right of the file field there is a button that opens a browser to select the file to load, so navigate to where you have your training file, because that is the one we want to bring in first.

A nice feature is that we can preview the data. When we do this we can see the attributes we have in the dataset, and it shows the top 10 rows of the CSV file. Notice that the names are not being read correctly: we can see a passenger's last name, but the rest of the name has spilled over as if it were part of another column. That happens sometimes when a CSV file wraps values in quotes and the reader does not know exactly how to handle it. The fix is the Double quotes setting at the bottom of the dialog: switch it from "Discard" to "Pair and discard". With that change, preview again and everything is read as expected: the passenger ID, whether they survived (0 or 1), the passenger class, the name, the sex, the age, and a few other details. Click OK, and notice that the node has been renamed to the name of the file, so we know this is our training data.

Next, from the Output palette, add a Table node. I did that by double-clicking it with the training node selected, which adds and connects it automatically. A different way is to drag the node onto the canvas, right-click your input node, click Connect, and then click the node you are trying to connect to; you can get rid of connections by right-clicking them and deleting. Another way that is a little quicker, if your mouse has a scroll wheel, is to click down on the wheel over one node and drag: you can see that it draws a line, and by releasing it on the node you want to connect to it makes the connection. It is a nice shortcut for making connections between nodes.

The Table node is basically a visual way to get a grid of the data within Modeler. After adding it and making the connection, right-click on it and click Run, and instead of just the top 10 rows we see all of our data in a nice table. This is good for getting a sense of the data you are working with, but it is not really meant for making adjustments, and you can already see that we have some null values. So this is a first step, and we will continue with more analysis of the data.

Next comes exploration. Go to the Output palette and add a Data Audit node, connect the input data to it, and you can see right away that it detects 12 fields. Double-click it and click Run, and you get information about the data; it is a nice way to do some initial exploratory analysis. We see graphs and histograms; the Age attribute is roughly normally distributed, and among the categorical variables Sex is pretty well balanced. You also get descriptive statistics for the continuous variables, and for the categorical ones the number of categories: Sex has two, while Cabin has 148 categories, so it may not be a good predictor for us. We see that the mean age is 29.67. The quality summary shows the percentage of complete fields and complete records. This is definitely something you want to do every time you start a project, to check that your data is of high quality. So that is the exploratory analysis.

The next step is some preparation of our data. Go to the Field Ops palette, where you will see the Type node; add it by double-clicking so it lands on the canvas. The Type node lets us set, for each field (attribute, or variable, whichever you want to call it), its measurement type - continuous, categorical/nominal, or flag - and the role that it has. At this point we say that the Survived variable is our Target; that means that as we build models, Modeler automatically knows that we want to predict Survived, that is, whether the passenger survived or not. We are also going to remove some fields by giving them the role None. This is probably intuitive: things like Name are not going to be good predictors for us, so switch Name to None, and do the same for Ticket and Cabin. The remaining fields keep the role Input, so the models will take our inputs and try to predict our target. Click Apply and then OK.

The next thing we want to do is some more data preparation, this time using the Auto Data Prep node, which can also be found in Field Ops. Drop it on the canvas and connect the two nodes. Notice that there is a red triangle on the node; you will see that it changes once we complete this process. In the Objectives tab, keep the setting "Balance speed and accuracy". Under Settings, go to "Prepare Inputs and Targets"; there are several checkboxes, and most of them are fine for what we are doing. Uncheck "Reorder nominal fields", because we want to keep the fields in the same order to make things easier, and leave the other boxes checked. The other thing to do on this screen is to uncheck the option to transform continuous fields: if we left it checked, this node would rescale our continuous variables based on the standard deviation and mean it calculates, and to keep things simple we do not want any rescaling here, so leave it unchecked. Click Apply, then click "Analyze Data" to run the node. We get some newly prepared fields, including a transformed class and a transformed age, so the node did its work for us. Click OK, and that completes our data preparation steps for this lab.
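As a rough, non-Modeler companion to this lab, the sketch below does the equivalent of the Var. File, Table, Data Audit, and Type nodes in Python with pandas. It is only an illustration under the assumption that the Kaggle train.csv file is in the working directory; the course itself uses the SPSS Modeler nodes described above.

import pandas as pd

# Var. File node equivalent: pandas handles the quoted Name field automatically.
train = pd.read_csv("train.csv")

# Table node equivalent: look at the raw rows.
print(train.head(10))

# Data Audit node equivalent: field count, types, descriptive statistics,
# and completeness per field.
print(train.shape)                 # number of records and fields
print(train.dtypes)
print(train.describe(include="all"))
print(train.isna().mean())         # fraction of missing values per field

# Type node equivalent: pick the target and the input fields, giving the
# role "None" to fields that are unlikely to be useful predictors.
target = train["Survived"]
inputs = train.drop(columns=["Survived", "Name", "Ticket", "Cabin", "PassengerId"])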

Learning objectives
In this lesson you will learn about:

•  Introduction to Common Modeling Techniques

•  Cluster Analysis (Unsupervised Learning)

•  Classification & Prediction (Supervised Learning)

•  Classification - Training & Testing

•  Sampling Data in Classification

•  Predictive Modeling Algorithms in SPSS Modeler

•  Automated Selection of Algorithms

MODELING TECHNIQUES (9:06)

Welcome to the Predictive Modeling Fundamentals class. Your hosts are a product marketing manager and a product manager for IBM Predictive Analytics. In this session we are going to introduce some common modeling techniques, discuss the difference between supervised and unsupervised learning, and look at the algorithms that are available in IBM SPSS Modeler. The agenda: we will introduce the two families of techniques, cover unsupervised learning (cluster analysis), discuss supervised learning (classification and prediction), go further into training and testing, revisit the Titanic use case from the previous lecture, and talk about sampling data, the predictive modeling algorithms in SPSS Modeler, and the automated selection of algorithms.

The common modeling techniques available to us break down into two buckets. The first is supervised learning, which describes and distinguishes classes in order to make predictions for future data based on a training set. Common methods include decision trees, regression analysis, k-nearest neighbors, and neural networks. For example, suppose you have historical data about customers who have churned. You build a model trained on that data, and then you can apply it to a new customer, one we do not yet know will churn or not, and make a prediction for that customer. That is an example of supervised learning.

In unsupervised learning, by contrast, we analyze data where the class labels are unknown in order to create groups of objects that are similar to each other within a group and dissimilar to objects in other groups; this is cluster analysis. Some of the common methods are K-Means clustering, hierarchical clustering, and TwoStep clustering. There are also association rules, where we analyze events or items that occur together: for example, chips and beer are commonly purchased together, and toothpaste and toothbrushes are also purchased together, so we look for such patterns; Apriori is one of the common algorithms in this field. So in unsupervised learning you might take a collection of customer behavior data and partition it into clusters in which the data points are similar to one another within the same cluster but dissimilar to the points in other clusters. Cluster analysis lets us group a dataset into these clusters when the classes are not provided: we do not know in advance what the groups or outcomes are, so we group the records based on their similarity and dissimilarity to each other. This is learning by observation rather than learning by example.

When it comes to supervised learning - classification and prediction - we do two things. Classification predicts categorical outcomes: churn or no churn, fraud or no fraud, purchase yes or no. It does not have to be binary; there can be multiple outcomes, for example predicting whether customers will buy fewer than three items, more than three items, or more than five but fewer than ten. We construct a classification model based on the training set and use it to classify new data. Prediction, on the other hand, models continuous values, for example predicting an amount or filling in a missing numeric value. So classification is for categorical outcomes and numeric prediction is for continuous ones.

In supervised learning in general, training and testing are very important. We want to split the data into a training set and a testing set before we build the model - for example 75% for training and 25% for testing. First we train the model on the training dataset, where the existing classes are known, and then we test it on the remainder of the data that was not included in training. This allows us to evaluate the model and compare its accuracy - the percentage of cases correctly classified by the model - between the training and the testing data. We want to see high accuracy not just for training but also for testing. Sometimes we do really well on the training data but the model does not do well on the testing data; that is an overfitting problem. The model does not generalize to new data, which usually means it is too complex and has started to capture noise and peculiarities of the training set. At that point we have to go back, revisit the model, and perhaps simplify it so that it transfers to future, unknown objects.

Why do we sample data? Because we often want to work with a smaller subset of a really big dataset that is still representative of the population. There are different approaches. You can take a simple random sample, say 30% of the original data, but sometimes that is not appropriate, for example for unbalanced data. Suppose we are trying to predict whether tumors are benign or malignant: most cases will be benign, only some will be malignant, and it is important for us to accurately predict both classes. With a simple random sample we might do really well at predicting the benign cases but fail to classify the malignant cases correctly. So we have other sampling methods: cluster sampling, which uses naturally occurring groups or clusters, and stratified sampling, where we select samples independently within non-overlapping strata - for example men and women, or certain regions or socioeconomic groups - so that these groups are appropriately represented and we maintain the original proportions of those variables.

SPSS Modeler has a large range of algorithms available for all of these needs. For classification and prediction there are different decision tree algorithms, regression analysis, neural networks, generalized linear models, Cox regression for survival data, support vector machines, and k-nearest neighbors. For clustering we have K-Means, Kohonen, and TwoStep, as well as an anomaly detection algorithm that helps identify potential outliers, and there are association algorithms such as Apriori and CARMA.

So what happens if you do not know which algorithm to pick - which one is right for you? In Modeler there are automated modeling nodes that select the best algorithms for your project given what you are trying to predict: an Auto Classifier for categorical targets, an Auto Numeric node for predicting continuous values, an Auto Cluster node, and time series nodes for forecasting. With that, let's go ahead with the third lab, where we are going to build a logistic regression model for the Titanic data and then also use the automated modeling feature.
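To make the train/test split and the overfitting check concrete, here is a minimal sketch in Python with scikit-learn rather than SPSS Modeler. The 75/25 split and the stratify option mirror the partitioning and stratified-sampling ideas from the lecture; the feature choices and file name are assumptions, not the course's lab steps.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")
X = pd.get_dummies(df[["Pclass", "Sex", "Age", "Fare"]], columns=["Sex"])
X["Age"] = X["Age"].fillna(X["Age"].median())
y = df["Survived"]

# 75% / 25% split, stratified so both partitions keep the original
# class proportions (the stratified-sampling idea from the lecture).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Supervised learning: train a classifier on the training partition only.
model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# Compare training vs. testing accuracy; a large gap suggests overfitting.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))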

MODEL EVALUATION (9:03)

Learning objectives
In this lesson you will learn about:

•  Metrics for Performance Evaluation

•  Accuracy as Performance Evaluation tool

•  Overcoming Limitations of Accuracy Measure

•  ROC Curves

Hello and welcome to Predictive Modeling Fundamentals I. Your hosts are a product marketing manager and a product manager for IBM Predictive Analytics. In this session we are going to look at the common metrics for classification model evaluation. We will pick up where we left off in the previous class, talk about how to evaluate a model after it has been applied, and then go into a lab where we compare models' performance and accuracy.

Let's begin by reviewing some of the concepts from the previous lectures - data mining techniques, training and testing, and sampling data - and then move on to metrics for performance evaluation. We will discuss accuracy as a performance evaluation tool, other measures that overcome the limitations of accuracy, and ROC curves for measuring how well a model performs, and then we will go into a lab.

As we saw previously, modeling techniques break up into main categories. Supervised learning - classification and prediction - is where we look at historical data and attempt to predict outcomes or classify items into groups based on that historical data; we describe and distinguish classes for future prediction. Classification deals with categorical outcomes and prediction with continuous ones; we employ decision trees, k-nearest neighbors, and neural networks for classification, and regression for numeric prediction. In unsupervised learning (clustering), the class labels are unknown and we create groups of objects, clusters, that have high similarity within the cluster - the objects are similar to each other - but are dissimilar to the objects in other clusters; common clustering methods are K-Means, Kohonen, and TwoStep. There are also association rules, where we look for events that occur together, such as items that are commonly bought together; Apriori is a common method.

For classification and prediction we evaluate model performance by splitting the data into a training set and a testing set, putting the larger portion, roughly 60 to 75 percent, into training and the remainder into testing. We train the model on the training section of the historical data, where the existing classes are known (that is why it is "supervised"), and then we score the testing data that was not included in training. This comparison allows us to see how the model performed on the training set versus the testing set, and to make sure the model is generalizable to new data, so that we do not have an overfitting problem where the model does really well on training but poorly on testing. We want to see high accuracy for both training and testing, and then we can use the model for new data whose classes are unknown.

Sometimes it is important for us to sample the data, because the dataset is so large that we need to work with a smaller subset, chosen in such a way that the subset is representative of the larger population. We can sometimes take a simple random sample, but that may not be appropriate for unbalanced data, where we have many records from one class and only a few from another, rarer class that is nevertheless important for us to predict accurately. We resolve this with complex samples such as stratified samples, where we maintain the original proportions of the classes: for example, if the original data contains 90 percent positive and 10 percent negative cases, the stratified sample also contains 90 percent positive and 10 percent negative cases.

Once we have built a model - say one that predicts which customers and prospects are likely to churn - we want to see how well it does. One way to do that is accuracy, and we generally get there by looking at the confusion matrix, the matrix of how our cases were classified. It contains the true positives and true negatives, the cases accurately classified as either yes or no, and the misclassified cases, the false negatives and false positives. It does not have to be a binary outcome; the matrix can have multiple classes, but here we will use it to compare binary classifiers.

Accuracy as a performance evaluation measure looks at how many of the total cases were classified correctly: the true positives plus the true negatives divided by the total number of cases. It has its limitations, even though it is a very popular tool. For example, suppose you have very few negative cases and it is extremely important to predict those negatives accurately. A model that simply predicts everything to be positive would have very high accuracy, and it would look like the model did really well, but in reality it failed on exactly the cases that mattered. This is especially important for applications such as predicting tumors, where identifying the rare cases is critical; with accuracy alone it is easy to overlook that.

There are other measures that deal with this. Precision is true positives divided by the sum of true positives and false positives: of the cases we predicted to be positive, how many really are positive. Sensitivity is true positives divided by the sum of true positives and false negatives: of the outcomes that actually are positive, how many did we predict to be positive. Specificity is true negatives divided by the sum of true negatives and false positives. With these measures we can compare models' performance - for example, if we ran one model with logistic regression and another with a decision tree, we can see how they perform in comparison to each other.

We can also use an ROC curve (receiver operating characteristic curve), which represents the performance of a binary classification model and its ability to distinguish true positives from false positives; it is a graph of the true positive rate against the false positive rate. In the sample shown we have built two models, one line representing a decision tree and the other a logistic regression; the model whose curve is closer to the upper-left corner has done better. That is where we want our model's curve to be: bowed toward the upper-left corner as much as possible. So with that, let's move on to the next lab, and we will see you in the next class.
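Continuing the earlier scikit-learn sketch (the model, X_test, and y_test from the modeling-techniques section), the following shows how the metrics described above fall out of the confusion matrix. It is an illustration of the formulas, not of Modeler's evaluation nodes.

from sklearn.metrics import confusion_matrix, roc_auc_score

# Predicted classes and predicted probabilities for the positive class.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # correctly classified / total cases
precision   = tp / (tp + fp)                   # of predicted positives, how many are right
sensitivity = tp / (tp + fn)                   # of actual positives, how many we caught
specificity = tn / (tn + fp)                   # of actual negatives, how many we caught

print(accuracy, precision, sensitivity, specificity)
print("area under the ROC curve:", roc_auc_score(y_test, y_prob))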

MODELING FOR PREDICTIONS (7:33)

Welcome back to Predictive Modeling Fundamentals I. In this video we are going over scoring test data. In this tutorial we keep building on the stream we have been working on in the past few labs. In the last lab we finally built some models and did some evaluation, so now what we really want to do is actually use the model we created: we will score the testing data and limit our predictions to only those the model is at least 80% confident about. After that, the next step is using our model in a web application, which will be the next tutorial. So the work here is scoring the testing data in Modeler and limiting our predictions.

You should have a stream on your canvas that looks something like the one shown. If you are just jumping in, go back to the first tutorials and labs and either watch the videos or go through the steps on your own to get to this point; here we just continue moving forward.

To get started, take the testing CSV file that you downloaded in the second lab - the testing dataset for the Titanic data - and add it to Modeler, exactly as we did the first time. Go to the Sources palette at the bottom of your screen, click and drag the Var. File node onto the canvas, and double-click it. It is the same process as before: select the testing CSV file and click Preview, and we see once again that the names are parsed incorrectly. That is all right, because we know how to handle it: in the Double quotes setting at the bottom of the dialog box, switch from "Discard" to "Pair and discard", and with just that one change you can see that the names are no longer being read incorrectly; it looks good. Click Apply and then OK.

Now we are set up to do something similar to what we did with the training data, just with the testing data. At this point we have already built our model and told the stream what fields to look for, so as long as the testing data has the same fields in place it is easy to apply the model.

Something I have not pointed out so far in the SPSS Modeler environment: on the far right side, at the top, there are manager tabs, and they come in handy. If you are working on multiple projects you can have multiple streams open and quickly jump between them by clicking there; another tab collects all the tables, graphs, and previews you have created, which is a quick way to jump back to what you have been working on; and Models is the third tab, which is what we will use now. Since our automated classification model performed better, let's go with that one. You can treat the model nugget just like something from the palettes at the bottom: click it and drop it on the canvas, then make a connection directly between the testing data source and the model. To get an output and visualize things, add a Table node as well, then click Run to see if this works.

That was really quick. If you saw the last video, when we first created the model - especially the auto classification, which runs through a big dataset and tries many different models - it can take some time. The advantage now is that rather than building a model again, we simply apply it to the testing dataset, and the results come back almost instantaneously. You can see that we now have a predicted class and a confidence. Recall that the original testing dataset did not have an actual class telling us whether the passenger survived or died; all we had was PassengerId through Embarked. Just by scoring this data with our model we now have a predicted value and the probability for that prediction.

Something you might want to do in this kind of work is say: that is good, but I only want to keep a prediction when the model is over a certain percentage confident in its result. That is what we will do now. Under Record Ops there is a Select node, and it is really powerful, because it lets you select only the rows, or instances, that meet your criteria. Connect the model output to the Select node and open it; the dialog has an expression builder, a bit like a calculator, that lets you build your condition, and you can also type the expression in manually. Here we use the $XFC-Survived field - the probability column shown in the table - as our variable, and we keep only rows where it is greater than 0.8, so that we are at least 80% confident in our results. Click Check at the bottom; it will throw an error to let you know if something is wrong, but this looks good, so click OK.

Now add another table, make the connection, and run it. Our previous table had all the outputs, 418 rows; in the table we just produced with the Select node we are left with only 273. Looking at the far-right column and doing a spot check, every value is higher than 0.8, which is what we want, because these are the predictions we feel confident about. That is it for this tutorial. In the next one we will set up a Bluemix account and use our model in a web application, so we can score dynamically on a website. We will see you then.
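The Select node condition above ($XFC-Survived > 0.8) can be mimicked outside Modeler as shown below. The sketch continues the earlier scikit-learn example; the file and column names are assumptions based on the Kaggle Titanic test file, not the course's own stream.

import pandas as pd

# Score the Kaggle test file with the model from the earlier sketches.
test = pd.read_csv("test.csv")
X_new = pd.get_dummies(test[["Pclass", "Sex", "Age", "Fare"]], columns=["Sex"])
X_new["Age"] = X_new["Age"].fillna(X_new["Age"].median())
X_new["Fare"] = X_new["Fare"].fillna(X_new["Fare"].median())

scored = test[["PassengerId"]].copy()
scored["predicted"] = model.predict(X_new)
scored["confidence"] = model.predict_proba(X_new).max(axis=1)

# Keep only the predictions the model is at least 80% confident about,
# the same idea as the Select node condition in the lab.
confident = scored[scored["confidence"] > 0.8]
print(len(scored), "rows scored,", len(confident), "rows kept")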

DEPLOYMENT ON IBM BLUEMIX (6:51)

This session covers deployment and deployment services. On the agenda: why deployment matters, how to deploy a model, the Predictive Analytics service and deployment in the cloud, and finally SPSS Collaboration and Deployment Services.

When we work on a data mining project, we split our data between training and testing. Usually you divide the data so that you use, for example, 80% to train the model and keep the rest for testing; the more data you use to train the model, the more accurate the resulting model is likely to be. The model is built using the training portion and then scored against the testing portion, and the predictions are compared with the observed values. If the dataset is not big enough, you may end up with the wrong balance; for example, with around a thousand records an 80/20 split gives roughly eight hundred records for training and two hundred for testing.

Going back to the CRISP-DM methodology we saw before, there are six steps. First is business understanding: figuring out what we are trying to do in this project. Second is data understanding: what data is available and how can I get it into my environment. Next is data preparation, which is basically getting the data ready for modeling. Then we come to modeling and evaluation. Actually creating the model is generally not the end of the project, because what we want to do is deploy the model into a business environment so we can get some benefit from it.

For deployment there are different solutions provided by IBM that work well with SPSS Modeler: SPSS Solution Publisher, the Predictive Analytics service available in the cloud on IBM Bluemix, and IBM SPSS Collaboration and Deployment Services. I am going to focus on the cloud offering, IBM Bluemix. For those who do not know it, IBM Bluemix is a cloud platform, a self-service application-hosting environment. The value is that you can deploy applications without spending weeks on setup: application developers focus only on their business logic, and you do not have to worry about how to install or manage the runtimes, frameworks, and libraries. Bluemix offers a catalog of different services that are easy to manage, and you can scale as you grow. It is based on Cloud Foundry, which is open source with a very strong and growing community, so you can find plenty of material about it online.

The service we are interested in is the Predictive Analytics service, which lets developers integrate predictive capabilities into their applications. Basically, you develop your model in SPSS Modeler, you upload the stream to the service, and the service provides scoring REST APIs, so you are able to call the model from your deployed applications. It is very simple: you create the service, upload the file, and the service provides the scoring endpoint directly.

Finally, there is SPSS Collaboration and Deployment Services, a larger on-premises deployment solution, which we will not cover in detail in this training. In the lab that follows we will deploy the model into the IBM Cloud and use it from an application. Thank you very much.
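The lecture describes uploading an SPSS stream to the Predictive Analytics service and then calling its scoring REST API from an application. The snippet below is purely illustrative of that call pattern: the URL, payload fields, and response shape are assumptions, not the documented Bluemix API, so consult the service's own documentation for the real endpoint and request format.

import requests

SCORING_URL = "https://example-predictive-service.mybluemix.net/score"  # hypothetical endpoint

# Hypothetical request body: one row of passenger data to be scored by
# the uploaded SPSS stream.
payload = {
    "tablename": "scoreInput",
    "header": ["Pclass", "Sex", "Age", "Fare"],
    "data": [[3, "male", 29, 7.25]],
}

resp = requests.post(SCORING_URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # the service would return the predicted class and its confidence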

QUIZ 1

REVIEW QUESTION 1 (1/1 point)

Which of the following applications would require the use of data mining? Select all that apply.

•  Predicting the outcome of flipping a fair coin
•  Determining which products in a store are likely to be purchased together
•  Predicting future stock prices using historical records
•  Determining the total number of products sold by a store
•  Sorting a student database by gender

Correct: Determining which products in a store are likely to be purchased together; Predicting future stock prices using historical records
You have used 2 of 2 submissions

REVIEW QUESTION 2 (1/1 point)

Which of the following is NOT a section of the Modeler Interface?

•  Stream Canvas
•  Stream, Outputs, and Model Manager
•  Nodes
•  Palettes
•  All of the above are sections of the Modeler Interface

Correct: All of the above are sections of the Modeler Interface
You have used 2 of 2 submissions

REVIEW QUESTION 3 (1/1 point)

Which of the following is NOT a part of the Cross-Industry Process for Data Mining?

•  Data Storage
•  Modeling
•  Data Preparation
•  Business Understanding
•  Evaluation

Correct: Data Storage
You have used 2 of 2 submissions

QUIZ 2

REVIEW QUESTION 1 (1 point possible)

Which phase of the data mining process focuses on understanding the project requirements and objectives?

•  Data Preprocessing
•  Data Understanding
•  Data Exploration
•  Business Understanding
•  Data Preparation

Answered: Data Exploration - incorrect
You have used 2 of 2 submissions

REVIEW QUESTION 2 (1/1 point)

Which Data Preprocessing task focuses on removing outliers and filling in missing values?

•  Data Reduction
•  Data Transformation
•  Data Integration
•  Data Cleaning
•  None of the above

Correct: Data Cleaning
You have used 2 of 2 submissions

REVIEW QUESTION 3 (1/1 point)

IBM SPSS Modeler supports which data type?

•  Ordinal
•  Categorical
•  Continuous
•  Nominal
•  All of the above

Correct: All of the above

QUIZ 3

REVIEW QUESTION 1 (1/1 point)

Which of the following methods are commonly used for supervised learning tasks? Select all that apply.

•  Neural Networks
•  Decision Trees
•  K-Means
•  CARMA
•  Regression

Correct: Neural Networks, Decision Trees, Regression
You have used 2 of 2 submissions

REVIEW QUESTION 2 (1 point possible)

Classification is a subset of supervised learning that focuses on modeling continuous variables. True or false?

•  True
•  False

Answered: True - incorrect


You have used 1 of 1 submissions

REVIEW QUESTION 3 (1/1 point)

Which of the following algorithms is NOT supported by SPSS Modeler?

•  K-Means
•  Logistic Regression
•  CARMA
•  Apriori
•  All of the above algorithms are supported

Correct: All of the above algorithms are supported

QUIZ 4

REVIEW QUESTION 1 (1 point possible)

What is the term for a negative data point that is incorrectly classified as positive?

•  True Negative
•  False Positive
•  True Positive
•  False Negative
•  None of the above

Answered: None of the above - incorrect
You have used 2 of 2 submissions

REVIEW QUESTION 2 (1/1 point)

Which of the following is NOT a cost-sensitive performance metric?

•  Specificity
•  Precision
•  Sensitivity
•  Accuracy
•  All of the above metrics are cost-sensitive

Correct: Accuracy
You have used 2 of 2 submissions

REVIEW QUESTION 3 (1/1 point)

What is the formula for the precision metric?

•  (True Positive) / (True Positive + False Negative)
•  (True Negative) / (True Negative + False Positive)
•  (False Positive) / (True Positive + False Positive)
•  (True Positive) / (True Positive + False Positive)
•  (False Positive) / (True Negative + True Positive)

Correct: (True Positive) / (True Positive + False Positive)

QUIZ 5

REVIEW QUESTION 1 (1/1 point)

In general, the testing dataset should be significantly larger than the training dataset. True or false?

•  True
•  False

Correct: False


You have used 1 of 1 submissions

REVIEW QUESTION 2 (1/1 point)

Which of the following is NOT a model deployment solution?

•  SPSS Solution Publisher
•  IBM Collaboration and Deployment Services
•  CRISP-DM
•  Bluemix
•  All of the above are model deployment solutions

Correct: CRISP-DM
You have used 1 of 2 submissions

REVIEW QUESTION 3 (1 point possible)

Which of the following statements are true of IBM Bluemix? Select all that apply.

•  Bluemix generally takes about a week to deploy an app
•  Bluemix is supported by a growing community
•  Bluemix is closed-source
•  Bluemix provides a self-service application-hosting environment
•  Bluemix provides built-in load-balancing capabilities
