
Data Science Methodology - DS0103EN

https://www.coursera.org/learn/data-science-methodology/home/week/2

From the above link, download the notebook for each module to work with.

Welcome! (4:01)
https://youtu.be/WKpRykMW8lU
1. Welcome to Data Science Methodology 101!

2. This is the beginning of a story, one that you'll be telling others about for years

3. to come.

4. It won't be in the form you experience here, but rather through the stories you'll be

5. sharing with others, as you explain how your understanding of a question resulted in an

6. answer that changed the way something was done.

7. Despite the recent increase in computing power and access to data over the last couple of

8. decades, our ability to use the data within the decision-making process is either lost

9. or not maximized as, all too often, we don't have a solid understanding of the questions

10. being asked and how to apply the data correctly to the problem at hand.

11. Here is a definition of the word methodology.

12. It's important to consider it because all too often there is a temptation to bypass

13. methodology and jump directly to solutions.

14. Doing so, however, hinders our best intentions in trying to solve a problem.

15. This course has one purpose, and that is to share a methodology that can be used within

16. data science, to ensure that the data used in problem solving is relevant and properly

17. manipulated to address the question at hand.


18. The data science methodology discussed in this course has been outlined by John Rollins,

19. a seasoned and senior data scientist currently practising at IBM. This course is built on

20. his experience and expresses his position on the importance of following a methodology

21. to be successful.

22. In a nutshell, the Data Science Methodology aims to answer 10 basic questions in a prescribed

23. sequence. As you can see from this slide, there are two questions designed to define

24. the issue and thus determine the approach to be used; then there are four questions

25. that will help you get organized around the data you will need, and finally there are

26. four additional questions aimed at validating both the data and the approach that gets designed.

27. Please take a moment now to familiarize yourself with the ten questions, as they will be vital

28. to your success.

29. This course is comprised of several components:

30. There are five modules, each going through two stages of the methodology, explaining

31. the rationale as to why each stage is required. Within the same module, a case study is

32. shared that supports what you have just learned. There's also a hands-on lab, which helps

33. to apply the material. And finally, there are 3 review questions

34. to test your understanding of the concepts.

35. When you're ready, take the final exam.

36. The case study included in the course, highlights how the data science methodology can be applied

37. in context.

38. It revolves around the following scenario: There is a limited budget for providing healthcare

39. to the public. Hospital readmissions for recurring problems can be seen as a sign of failure

40. in the system to properly address the patient's condition prior to the initial patient discharge.

41. The core question is: What is the best way to allocate these funds to maximize their

42. use in providing quality care? As you'll see, if the new data science pilot

43. program is successful, it will deliver better patient care by giving physicians new tools

44. to incorporate timely, data-driven information into patient care decisions.

45. The case study sections display these icons at the top right-hand corner of your screen to help you
46. differentiate theory from practice within each module.

47. A glossary of data science terms is also provided to assist with clarifying key terms used within

48. the course.

49. While participating in this course, if you come across challenges, or have questions,

50. then please explore the discussion and wiki sessions.

51. So, now that you're all set, adjust your headphones and let's get started!

Learning Objectives
In this course you will learn about:

o The major steps involved in tackling a data science problem.

o The major steps involved in practicing data science, from forming a concrete business or
research problem, to collecting and analyzing data, to building a model, and understanding the
feedback after model deployment.

o How data scientists think through tackling interesting real-world examples. 

Syllabus
Module 1: From Problem to Approach

o Business Understanding

o Analytic Approach

Module 2: From Requirements to Collection 

o Data Requirements

o Data Collection

Module 3: From Understanding to Preparation 


o Data Understanding

o Data Preparation

Module 4: From Modeling to Evaluation

o Modeling

o Evaluation

Module 5: From Deployment to Feedback

o Deployment

o Feedback
Module 1 - From Problem to Approach

Learning Objectives
In this lesson you will learn about:

o Why we are interested in data science.

o What a methodology is, and why data scientists need a methodology.

o The data science methodology and its flowchart.

o How to apply business understanding and the analytic approach to any data science problem.

Business Understanding (5:02)


https://youtu.be/-emu36N2SPA
1. Welcome to Data Science Methodology 101 From Problem to Approach Business Understanding!

2. Has this ever happened to you?

3. You've been called into a meeting by your boss, who makes you aware of an important

4. task, one with a very tight deadline that absolutely has to be met.

5. You both go back and forth to ensure that all aspects of the task have been considered

6. and the meeting ends with both of you confident that things are on track.
7. Later that afternoon, however, after you've spent some time examining the various issues

8. at play, you realize that you need to ask several additional questions in order to truly

9. accomplish the task.

10. Unfortunately, the boss won't be available again until tomorrow morning.

11. Now, with the tight deadline still ringing in your ears, you start feeling a sense of

12. uneasiness.

13. So, what do you do?

14. Do you risk moving forward or do you stop and seek clarification?

15. Data science methodology begins with spending the time to seek clarification, to attain

16. what can be referred to as a business understanding.

17. Having this understanding is placed at the beginning of the methodology because getting

18. clarity around the problem to be solved, allows you to determine which data will be used to

19. answer the core question.

20. Rollins suggests that having a clearly defined question is vital because it ultimately directs

21. the analytic approach that will be needed to address the question.

22. All too often, much effort is put into answering what people THINK is the question, and while

23. the methods used to address that question might be sound, they don't help to solve

24. the actual problem.

25. Establishing a clearly defined question starts with understanding the GOAL of the person

26. who is asking the question.

27. For example, if a business owner asks: "How can we reduce the costs of performing an activity?"

28. We need to understand, is the goal to improve the efficiency of the activity?

29. Or is it to increase the business's profitability?

30. Once the goal is clarified, the next piece of the puzzle is to figure out the objectives

31. that are in support of the goal.

32. By breaking down the objectives, structured discussions can take place where priorities

33. can be identified in a way that can lead to organizing and planning on how to tackle the

34. problem.
35. Depending on the problem, different stakeholders will need to be engaged in the discussion

36. to help determine requirements and clarify questions.

37. So now, let's look at the case study related to applying "Business Understanding"

38. In the case study, the question being asked is: What is the best way to allocate the limited

39. healthcare budget to maximize its use in providing quality care?

40. This question is one that became a hot topic for an American healthcare insurance provider.

41. As public funding for readmissions was decreasing, this insurance company was at risk of having

42. to make up for the cost difference, which could potentially increase rates for its customers.

43. Knowing that raising insurance rates was not going to be a popular move, the insurance

44. company sat down with the health care authorities in its region and brought in IBM data scientists

45. to see how data science could be applied to the question at hand.

46. Before even starting to collect data, the goals and objectives needed to be defined.

47. After spending time to determine the goals and objectives, the team prioritized "patient

48. readmissions" as an effective area for review.

49. With the goals and objectives in mind, it was found that approximately 30% of individuals

50. who finish rehab treatment would be readmitted to a rehab center within one year; and that

51. 50% would be readmitted within five years.

52. After reviewing some records, it was discovered that the patients with congestive heart failure

53. were at the top of the readmission list.

54. It was further determined that a decision-tree model could be applied to review this scenario,

55. to determine why this was occurring.

56. To gain the business understanding that would guide the analytics team in formulating and

57. performing their first project, the IBM data scientists proposed and delivered an on-site

58. workshop to kick things off.

59. The key business sponsor's involvement throughout the project was critical, in that the sponsor:

60. Set overall direction

61. Remained engaged and provided guidance.

62. Ensured necessary support, where needed.


63. Finally, four business requirements were identified for whatever model would be built.

64. Namely:

65. Predicting readmission outcomes for those patients with Congestive Heart Failure

66. Predicting readmission risk.

67. Understanding the combination of events that led to the predicted outcome

68. Applying an easy-to-understand process to new patients, regarding their readmission

69. risk.

70. This ends the Business Understanding section of this course.

71. Thanks for watching!

Analytic Approach (3:23)


https://youtu.be/kE6Ee4z4wDI
1. Welcome to Data Science Methodology 101 From Problem to Approach Analytic Approach!

2. Selecting the right analytic approach depends on the question being asked.

3. The approach involves seeking clarification from the person who is asking the question,

4. so as to be able to pick the most appropriate path or approach.

5. In this video we'll see how the second stage of the data science methodology is applied.

6. Once the problem to be addressed is defined, the appropriate analytic approach for the

7. problem is selected in the context of the business requirements.

8. This is the second stage of the data science methodology.

9. Once a strong understanding of the question is established, the analytic approach can

10. be selected.

11. This means identifying what type of patterns will be needed to address the question most

12. effectively.

13. If the question is to determine probabilities of an action, then a predictive model might

14. be used.

15. If the question is to show relationships, a descriptive approach may be required.

16. This would be one that would look at clusters of similar activities based on events and
17. preferences.

18. Statistical analysis applies to problems that require counts.

19. For example, if the question requires a yes/no answer, then a classification approach

20. to predicting a response would be suitable.

21. Machine Learning is a field of study that gives computers the ability to learn without

22. being explicitly programmed.

23. Machine Learning can be used to identify relationships and trends in data that might otherwise not

24. be accessible or identified.

25. In the case where the question is to learn about human behaviour, then an appropriate

26. response would be to use clustering and association approaches.

27. So now, let's look at the case study related to applying Analytic Approach.

28. For the case study, a decision tree classification model was used to identify the combination

29. of conditions leading to each patient's outcome.

30. In this approach, examining the variables in each of the nodes along each path to a

31. leaf, led to a respective threshold value.

32. This means the decision tree classifier provides both the predicted outcome, as well as the

33. likelihood of that outcome, based on the proportion of the dominant outcome, yes or no, in each

34. group.

35. From this information, the analysts can obtain the readmission risk, or the likelihood of

36. a yes for each patient. If the dominant outcome is yes, then the risk

37. is simply the proportion of yes patients in the leaf.

38. If it is no, then the risk is 1 minus the proportion of no patients in the leaf.
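The risk computation just described maps directly onto how a scikit-learn decision tree reports probabilities: predict_proba returns the class proportions of the training patients in the same leaf. Below is a minimal Python sketch of that idea (not the course lab code); the features and data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient-level data: binary flags and a yes/no readmission outcome.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))   # e.g. prior_admission, diabetes, hypertension
y = rng.integers(0, 2, size=200)        # 1 = readmitted ("yes"), 0 = not readmitted

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# For each new patient, predict_proba gives the proportion of "no" and "yes"
# training patients in that patient's leaf; the "yes" column is the readmission risk.
new_patients = np.array([[1, 0, 1], [0, 0, 0]])
readmission_risk = tree.predict_proba(new_patients)[:, 1]
print(readmission_risk)
```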

39. A decision tree classification model is easy for non-data scientists to understand and

40. apply, to score new patients for their risk of readmission.

41. Clinicians can readily see what conditions are causing a patient to be scored as high-risk

42. and multiple models can be built and applied at various points during the hospital stay.

43. This gives a moving picture of the patient's risk and how it is evolving with the various

44. treatments being applied. For these reasons, the decision tree classification
45. approach was chosen for building the Congestive Heart Failure readmission model.

46. This ends the Analytic Approach section for this course.

47. Thanks for watching!


Lab - From Problem to Approach in Python



Lab Instructions

This course uses Cognitive Class Labs (formerly known as BDU Labs), an online virtual lab
environment to help you get hands-on experience without the hassle of installing and configuring
the tools. You will get access to popular open-source data science tools right inside your
browser, like Jupyter Notebooks, which you will use to get hands-on practice with
Spark in this lab.

How to start the hands-on session for this module:

Click the Start lab buttons below, follow the instructions in the notebook and start
learning! :)

Still unable to get the lab to load? We want to know right away! Contact us via
the Support button on the right-hand side of this window, so we can get it fixed for you
and all the other students.

From Problem to Approach - Exercise Lab (External resource)
Skills Network Labs (SN Labs) is a virtual lab environment used in this course. Your Username
and email will be passed to Skills Network Labs and will only be used for communicating
important information to enhance your learning experience.

Start Lab 

Lab Issue Workaround


From a course discussion post by emma_leigh_king_lund: If Start Lab isn't working for you, go to Coursera and find this course at https://www.coursera.org/learn/data-science-methodology?specialization=ibm-data-science and click "Enroll". Ignore the pricing, scroll to the bottom, and click "Audit course". Find the corresponding lab and go to that page. DO NOT click "Open Tool" because it has the same issue as the lab here. However, Coursera has a link to download the notebook (.ipynb format), so you should be able to open it in Jupyter. Obviously this isn't an ideal fix, but it at least allows you to complete the lab!

Module 2 - From Requirements to Collection

Learning Objectives
In this lesson you will learn about:

o Data requirements and data understanding.

o What occurs during data collection.

o How to apply data requirements and data collection to any data science problem.

Data Requirements (3:28)


https://youtu.be/ffucHZU-oco
1. Welcome to Data Science Methodology 101 From Requirements to Collection Data Requirements!

2. If your goal is to make a spaghetti dinner but you don't have the right ingredients

3. to make this dish, then your success will be compromised.

4. Think of this section of the data science methodology as cooking with data.

5. Each step is critical in making the meal.

6. So, if the problem that needs to be resolved is the recipe, so to speak, and data is an

7. ingredient, then the data scientist needs to identify:

8. which ingredients are required, how to source or collect them,

9. how to understand or work with them, and how to prepare the data to meet the desired
10. outcome.

11. Building on the understanding of the problem at hand, and then using the analytical approach

12. selected, the Data Scientist is ready to get started.

13. Now let's look at some examples of the data requirements within the data science methodology.

14. Prior to undertaking the data collection and data preparation stages of the methodology,

15. it's vital to define the data requirements for decision-tree classification.

16. This includes identifying the necessary data content, formats and sources for initial data

17. collection.

18. So now, let's look at the case study related to applying "Data Requirements".

19. In the case study, the first task was to define the data requirements for the decision tree

20. classification approach that was selected.

21. This included selecting a suitable patient cohort from the health insurance provider's

22. member base.

23. In order to compile the complete clinical histories, three criteria were identified

24. for inclusion in the cohort.

25. First, a patient needed to be admitted as an in-patient within the provider's service area,

26. so they'd have access to the necessary information.

27. Second, they focused on patients with a primary diagnosis of congestive heart failure during

28. one full year.

29. Third, a patient must have had continuous enrollment for at least six months, prior

30. to the primary admission for congestive heart failure, so that complete medical history

31. could be compiled.

32. Congestive heart failure patients who also had been diagnosed as having other significant

33. medical conditions, were excluded from the cohort because those conditions would cause

34. higher-than-average re-admission rates and, thus, could skew the results.

35. Then the content, format, and representations of the data needed for decision tree classification

36. were defined.

37. This modeling technique requires one record per patient, with columns representing the
38. variables in the model.

39. To model the readmission outcome, there needed to be data covering all aspects of the patient's

40. clinical history.

41. This content would include admissions, primary, secondary, and tertiary diagnoses, procedures,

42. prescriptions, and other services provided either during hospitalization or throughout

43. patient/doctor visits.

44. Thus, a particular patient could have thousands of records, representing all their related

45. attributes.

46. To get to the one record per patient format, the data scientists rolled up the transactional

47. records to the patient level, creating a number of new variables to represent that information.

48. This was a job for the data preparation stage, so thinking ahead and anticipating subsequent

49. stages is important.
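As a rough illustration of that roll-up, the pandas sketch below aggregates hypothetical transactional claims to one record per patient, creating new summary columns along the way; the table and column names are assumptions, not the case study's actual data.

```python
import pandas as pd

# Hypothetical transactional records: several rows per patient.
claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "service":    ["admission", "lab", "rx", "admission", "lab"],
    "cost":       [5000, 120, 45, 7000, 90],
})

# Roll up to one record per patient, with new variables summarizing the history.
per_patient = claims.groupby("patient_id").agg(
    n_records=("service", "size"),
    n_admissions=("service", lambda s: (s == "admission").sum()),
    total_cost=("cost", "sum"),
).reset_index()
print(per_patient)
```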

50. This ends the Data Requirements section for this course.

51. Thanks for watching!

Data Collection (2:54)


https://youtu.be/RtKbyTpp6dw
1. Welcome to Data Science Methodology 101 From Requirements to Collection Data Collection!

2. After the initial data collection is performed, an assessment by the data scientist takes

3. place to determine whether or not they have what they need.

4. As is the case when shopping for ingredients to make a meal, some ingredients might be

5. out of season and more difficult to obtain or cost more than initially thought.

6. In this phase the data requirements are revised and decisions are made as to whether or not

7. the collection requires more or less data.

8. Once the data ingredients are collected, then in the data collection stage, the data scientist

9. will have a good understanding of what they will be working with.

10. Techniques such as descriptive statistics and visualization can be applied to the data

11. set, to assess the content, quality, and initial insights about the data.
12. Gaps in data will be identified and plans to either fill or make substitutions will

13. have to be made.

14. In essence, the ingredients are now sitting on the cutting board.

15. Now let's look at some examples of the data collection stage within the data science methodology.

16. This stage is undertaken as a follow-up to the data requirements stage.

17. So now, let's look at the case study related to applying "Data Collection".

18. Collecting data requires that you know the source, or know where to find the data elements

19. that are needed.

20. In the context of our case study, these can include:

21. demographic, clinical and coverage information of patients,

22. provider information, claims records, as well as

23. pharmaceutical and other information related to all the diagnoses of the congestive heart

24. failure patients.

25. For this case study, certain drug information was also needed, but that data source was

26. not yet integrated with the rest of the data sources.

27. This leads to an important point: It is alright to defer decisions about unavailable data,

28. and attempt to acquire it at a later stage.

29. For example, this can even be done after getting some intermediate results from the predictive

30. modeling.

31. If those results suggest that the drug information might be important in obtaining a good model,

32. then the time to try to get it would be invested.

33. As it turned out though, they were able to build a reasonably good model without this

34. drug information.

35. DBAs and programmers often work together to extract data from various sources, and then

36. merge it.

37. This allows for removing redundant data, making it available for the next stage of the methodology,

38. which is data understanding.
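A minimal pandas sketch of that extract-merge-deduplicate step, using hypothetical tables joined on a patient key; the real work would of course involve database extracts rather than small in-memory frames.

```python
import pandas as pd

# Hypothetical extracts from two sources.
demographics = pd.DataFrame({"patient_id": [1, 2, 3], "age": [72, 65, 80]})
claims = pd.DataFrame({"patient_id": [1, 1, 2, 3], "claim_amount": [500, 500, 300, 250]})

# Drop redundant claim rows, then join the sources on the patient key.
merged = (
    claims.drop_duplicates()
          .merge(demographics, on="patient_id", how="left")
)
print(merged)
```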

39. At this stage, if necessary, data scientists and analytics team members can discuss various ways
40. to better manage their data, including automating certain processes in the database, so that

41. data collection is easier and faster.

42. Thanks for watching!

Lab - From Requirements to Collection in R

Lab - From Requirements to Collection in Python

Module 3 - From Understanding to Preparation 

Learning Objectives
In this lesson you will learn about:

o What it means to understand data.

o What it means to prepare or clean data.

o Ways in which data is prepared.

o How to apply data understanding and data preparation to any data science problem.

Data Understanding (3:16)


No video link; refer to the downloaded video.

1. Welcome to Data Science Methodology 101 From Understanding to Preparation Data Understanding!

2. Data understanding encompasses all activities related to constructing the data set.

3. Essentially, the data understanding section of the data science methodology answers the

4. question: Is the data that you collected representative of the problem to be solved?

5. Let's apply the data understanding stage of our methodology, to the case study we've

6. been examining.

7. In order to understand the data related to congestive heart failure admissions, descriptive

8. statistics needed to be run against the data columns that would become variables in the

9. model.

10. First, these statistics included univariate statistics on each variable, such as mean,

11. median, minimum, maximum, and standard deviation.

12. Second, pairwise correlations were used, to see how closely certain variables were related,

13. and which ones, if any, were very highly correlated, meaning that they would be essentially redundant,

14. thus making only one relevant for modeling.

15. Third, histograms of the variables were examined to understand their distributions.

16. Histograms are a good way to understand how values of a variable are distributed, and

17. which sorts of data preparation may be needed to make the variable more useful in a model.

18. For example, for a categorical variable that has too many distinct values to be informative

19. in a model, the histogram would help them decide how to consolidate those values.
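The three checks just described (univariate statistics, pairwise correlations, and histograms) correspond to standard pandas calls. The sketch below runs them on a small hypothetical table of candidate model variables.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical candidate variables for the model.
df = pd.DataFrame({
    "age": [70, 65, 80, 75, 68],
    "prior_admissions": [1, 0, 3, 2, 1],
    "length_of_stay": [4, 2, 9, 6, 3],
})

print(df.describe())   # univariates: mean, min, max, std, etc. per variable
print(df.corr())       # pairwise correlations; near-1 pairs are essentially redundant

df.hist(bins=10)       # histograms to inspect each variable's distribution
plt.tight_layout()
plt.show()
```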

20. The univariate statistics and histograms are also used to assess data quality.

21. From the information provided, certain values can be re-coded or perhaps even dropped if
22. necessary, such as when a certain variable has many missing values.

23. The question then becomes, does "missing" mean anything?

24. Sometimes a missing value might mean "no", or "0" (zero), or at other times it simply

25. means "we don't know". Or, if a variable contains invalid or misleading values, such

26. as a numeric variable called "age" that contains 0 to 100 and also 999, where that

27. "triple-9" actually means "missing", but would be treated as a valid value unless

28. we corrected it.
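A minimal sketch of re-coding such a sentinel value so it is treated as missing rather than as a valid age; the column and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 71, 999, 58, 999]})

# Re-code the "triple-9" sentinel as a proper missing value.
df["age"] = df["age"].replace(999, np.nan)
print(df["age"].isna().sum(), "missing ages after re-coding")
```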

29. Initially, the meaning of congestive heart failure admission was decided on the basis

30. of a primary diagnosis of congestive heart failure.

31. But working through the data understanding stage revealed that the initial definition

32. was not capturing all of the congestive heart failure admissions that were expected, based

33. on clinical experience.

34. This meant looping back to the data collection stage and adding secondary and tertiary diagnoses,

35. and building a more comprehensive definition of congestive heart failure admission.

36. This is just one example of the iterative processes in the methodology.

37. The more one works with the problem and the data, the more one learns and therefore the

38. more refinement that can be done within the model, ultimately leading to a better solution

39. to the problem.

40. This ends the Data Understanding section of this course.

41. Thanks for watching!


Data Preparation - Concepts (3:02)



This section consists of two videos:

1. Data Preparation - Concepts.

2. Data Preparation - Case Study.

You can navigate through the videos using the tabs above!
Data Preparation - Concepts (3:02)
No video link; refer to the downloaded video.
1. Welcome to Data Science Methodology 101 From Understanding to Preparation Data Preparation

2. - Concepts!

3. In a sense, data preparation is similar to washing freshly picked vegetables in so far

4. as unwanted elements, such as dirt or imperfections, are removed.

5. Together with data collection and data understanding, data preparation is the most time-consuming

6. phase of a data science project, typically taking seventy percent and even up to

7. ninety percent of the overall project time.

8. Automating some of the data collection and preparation processes in the database, can

9. reduce this time to as little as 50 percent.

10. This time savings translates into increased time for data scientists to focus on creating

11. models.

12. To continue with our cooking metaphor, we know that the process of chopping onions to

13. a finer state will allow its flavours to spread through a sauce more easily than

14. would be the case if we were to drop the whole onion into the sauce pot.

15. Similarly, transforming data in the data preparation phase is the process of getting the data into

16. a state where it may be easier to work with.

17. Specifically, the data preparation stage of the methodology answers the question: What

18. are the ways in which data is prepared?

19. To work effectively with the data, it must be prepared in a way that addresses missing

20. or invalid values and removes duplicates, toward ensuring that everything is properly

21. formatted.

22. Feature engineering is also part of data preparation.

23. It is the process of using domain knowledge of the data to create features that make the

24. machine learning algorithms work.

25. A feature is a characteristic that might help when solving a problem.


26. Features within the data are important to predictive models and will influence the results

27. you want to achieve.

28. Feature engineering is critical when machine learning tools are being applied to analyze

29. the data.

30. When working with text, text analysis steps for coding the data are required to be able

31. to manipulate the data.

32. The data scientist needs to know what they're looking for within their dataset to address

33. the question.

34. The text analysis is critical to ensure that the proper groupings are set, and that the

35. programming is not overlooking what is hidden within.

36. The data preparation phase sets the stage for the next steps in addressing the question.

37. While this phase may take a while to do, if done right the results will support the project.

38. If this is skipped over, then the outcome will not be up to par and may have you back

39. at the drawing board.

40. It is vital to take your time in this area, and use the tools available to automate common

41. steps to accelerate data preparation.

42. Make sure to pay attention to the detail in this area.

43. After all, it takes just one bad ingredient to ruin a fine meal.

44. This ends the Data Preparation section of this course, in which we've reviewed key concepts.

45. Thanks for watching!

Data Preparation - Case Study (4:14)


https://youtu.be/ZpDcerPCbRw

1. Welcome to Data Science Methodology 101 From Understanding to Preparation Data Preparation

2. - Case Study!
3. In a sense, data preparation is similar to washing freshly picked vegetables insofar

4. as unwanted elements, such as dirt or imperfections, are removed.

5. So now, let's look at the case study related to applying Data Preparation concepts.

6. In the case study, an important first step in the data preparation stage was to actually

7. define congestive heart failure.

8. This sounded easy at first, but defining it precisely was not straightforward.

9. First, the set of diagnosis-related group codes needed to be identified, as congestive

10. heart failure implies certain kinds of fluid buildup.

11. We also needed to consider that congestive heart failure is only one type of heart failure.

12. Clinical guidance was needed to get the right codes for congestive heart failure.

13. The next step involved defining the re-admission criteria for the same condition.

14. The timing of events needed to be evaluated in order to define whether a particular congestive

15. heart failure admission was an initial event, which is called an index admission, or a congestive

16. heart failure-related re-admission.

17. Based on clinical expertise, a time period of 30 days was set as the window for readmission

18. relevant for congestive heart failure patients, following the discharge from the initial admission.
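A minimal pandas sketch of applying that 30-day window: an admission counts as a readmission when it starts within 30 days of the patient's previous discharge, otherwise it is an index admission. The table and column names are hypothetical.

```python
import pandas as pd

adm = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "admit_date": pd.to_datetime(["2021-01-05", "2021-01-28", "2021-03-01"]),
    "discharge_date": pd.to_datetime(["2021-01-10", "2021-02-02", "2021-03-06"]),
}).sort_values(["patient_id", "admit_date"])

# Days between this admission and the same patient's previous discharge.
prev_discharge = adm.groupby("patient_id")["discharge_date"].shift(1)
days_since = (adm["admit_date"] - prev_discharge).dt.days

# First admissions have no prior discharge and therefore come out False (index admissions).
adm["readmission_30d"] = days_since <= 30
print(adm)
```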

19. Next, the records that were in transactional format were aggregated, meaning that the data

20. included multiple records for each patient.

21. Transactional records included professional provider facility claims submitted for physician,

22. laboratory, hospital, and clinical services.

23. Also included were records describing all the diagnoses, procedures, prescriptions,

24. and other information about in-patients and out-patients.

25. A given patient could easily have hundreds or even thousands of these records, depending

26. on their clinical history.

27. Then, all the transactional records were aggregated to the patient level, yielding a single record

28. for each patient, as required for the decision-tree classification method that would be used for

29. modeling.

30. As part of the aggregation process, many new columns were created representing the information
31. in the transactions.

32. For example, frequency and most recent visits to doctors, clinics and hospitals with diagnoses,

33. procedures, prescriptions, and so forth.

34. Co-morbidities with congestive heart failure were also considered, such as diabetes, hypertension,

35. and many other diseases and chronic conditions that could impact the risk of re-admission

36. for congestive heart failure.

37. During discussions around data preparation, a literature review on congestive heart failure

38. was also undertaken to see whether any important data elements were overlooked, such as co-morbidities

39. that had not yet been accounted for.

40. The literature review involved looping back to the data collection stage to add a few

41. more indicators for conditions and procedures.

42. Aggregating the transactional data at the patient level, meant merging it with the other

43. patient data, including their demographic information, such as age, gender, type of

44. insurance, and so forth.

45. The result was the creation of one table containing a single record per patient, with many columns

46. representing the attributes about the patient in his or her clinical history.

47. These columns would be used as variables in the predictive modeling.

48. Here is a list of the variables that were ultimately used in building the model.

49. The dependent variable, or target, was congestive heart failure readmission within 30 days following

50. discharge from a hospitalization for congestive heart failure, with an outcome of either yes

51. or no.

52. The data preparation stage resulted in a cohort of 2,343 patients meeting all of the criteria

53. for this case study.

54. The cohort was then split into training and testing sets for building and validating the

55. model, respectively.
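A minimal scikit-learn sketch of that split; the columns and the 70/30 ratio are assumptions, since the exact proportions used in the case study are not given here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the prepared one-record-per-patient table.
cohort = pd.DataFrame({
    "age": [70, 65, 80, 75, 68, 59, 82, 77],
    "prior_admissions": [1, 0, 3, 2, 1, 0, 4, 2],
    "readmitted_30d": [1, 0, 1, 1, 0, 0, 1, 0],
})

X = cohort.drop(columns="readmitted_30d")
y = cohort["readmitted_30d"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```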

56. This ends the Data Preparation section of this course, in which we applied the key concepts

57. to the case study.


58. Thanks for watching!

Python code is not printed in these notes; use the lab notebook (.ipynb file) to see the code and experiment.

Module 4 - From Modeling to Evaluation 

Learning Objectives
In this lesson you will learn about:

o What the purpose of data modeling is.

o Some characteristics of the modeling process.

o What it means to evaluate a model.

o Ways in which a model is evaluated.

o How to apply modeling and model evaluation to any data science problem.

Modeling - Concepts (2:55)


This section consists of two videos:
1. Modeling - Concepts.

2. Modeling - Case Study.

You can navigate through the videos using the tabs above!

https://youtu.be/aOfgFFE7g4s
1. Welcome to Data Science Methodology 101 From Modeling to Evaluation Modeling - Concepts!

2. Modelling is the stage in the data science methodology where the data scientist has the

3. chance to sample the sauce and determine if it's bang on or in need of more seasoning!

4. This portion of the course is geared toward answering two key questions:

5. First, what is the purpose of data modeling, and

6. second, what are some characteristics of this process?

7. Data Modelling focuses on developing models that are either descriptive or predictive.

8. An example of a descriptive model might examine things like: if a person did this,

9. then they're likely to prefer that.

10. A predictive model tries to yield yes/no, or stop/go type outcomes.

11. These models are based on the analytic approach that was taken, either statistically driven

12. or machine learning driven.

13. The data scientist will use a training set for predictive modelling.

14. A training set is a set of historical data in which the outcomes are already known.

15. The training set acts like a gauge to determine if the model needs to be calibrated.

16. In this stage, the data scientist will play around with different algorithms to ensure

17. that the variables in play are actually required.

18. The success of data compilation, preparation and modelling, depends on the understanding

19. of the problem at hand, and the appropriate analytical approach being taken.

20. The data supports the answering of the question, and like the quality of the ingredients in

21. cooking, sets the stage for the outcome.

22. Constant refinement, adjustments and tweaking are necessary within each step to ensure the
23. outcome is one that is solid.

24. In John Rollins' descriptive Data Science Methodology, the framework is geared to do

25. 3 things: First,

26. understand the question at hand. Second,

27. select an analytic approach or method to solve the problem, and

28. third,

29. obtain, understand, prepare, and model the data.

30. The end goal is to move the data scientist to a point where a data model can be built

31. to answer the question.

32. With dinner just about to be served and a hungry guest at the table, the key question

33. is: Have I made enough to eat?

34. Well, let's hope so.

35. In this stage of the methodology, model evaluation, deployment, and feedback loops ensure that

36. the answer is near and relevant.

37. This relevance is critical to the data science field

38. overall, as it is a fairly new field of study, and we are interested in the possibilities

39. it has to offer.

40. The more people that benefit from the outcomes of this practice, the further the field will

41. develop.

42. This ends the Modeling to Evaluation section of this course, in which we reviewed the key

43. concepts related to modeling. Thanks for watching!


Modeling - Case Study (3:56)


https://youtu.be/9br6fZUYxdc
1. Welcome to Data Science Methodology 101 From Modeling to Evaluation Modeling - Case Study!

2. Modelling is the stage in the data science methodology where the data scientist has the

3. chance to sample the sauce and determine if it's bang on or in need of more seasoning!
4. Now, let's apply the case study to the modeling stage within the data science methodology.

5. Here, we'll discuss one of the many aspects of model building, in this case, parameter

6. tuning to improve the model.

7. With a prepared training set, the first decision tree classification model for congestive heart

8. failure readmission can be built.

9. We are looking for patients with high-risk readmission, so the outcome of interest will

10. be congestive heart failure readmission equals "yes".

11. In this first model, overall accuracy in classifying the yes and no outcomes was 85%.

12. This sounds good, but only 45% of the "yes" outcomes, the actual readmissions,

13. are correctly classified, meaning that the model is not very accurate.

14. The question then becomes: How could the accuracy of the model be improved in predicting the

15. yes outcome?

16. For decision tree classification, the best parameter to adjust is the relative cost of

17. misclassified yes and no outcomes.

18. Think of it like this:

19. When a true non-readmission is misclassified, and action is taken to reduce that patient's

20. risk, the cost of that error is the wasted intervention.

21. A statistician calls this a type I error, or a false-positive.

22. But when a true readmission is misclassified, and no action is taken to reduce that risk,

23. then the cost of that error is the readmission and all its attendant costs, plus the trauma

24. to the patient.

25. This is a type II error, or a false-negative.

26. So we can see that the costs of the two different kinds of misclassification errors can be quite

27. different.

28. For this reason, it's reasonable to adjust the relative weights of misclassifying the

29. yes and no outcomes.

30. The default is 1-to-1, but the decision tree algorithm allows the setting of a higher

31. value for yes.
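The course does not name the modeling tool, but the same relative-cost idea can be expressed in scikit-learn through a decision tree's class_weight parameter, as in this hypothetical sketch of the 9-to-1 setting discussed next.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the prepared training set.
rng = np.random.default_rng(1)
X_train = rng.integers(0, 2, size=(100, 4))   # patient feature flags
y_train = rng.integers(0, 2, size=100)        # 1 = readmitted ("yes"), 0 = "no"

# Treat misclassifying a "yes" as nine times as costly as misclassifying a "no".
weighted_tree = DecisionTreeClassifier(
    class_weight={0: 1, 1: 9}, max_depth=5, random_state=0
)
weighted_tree.fit(X_train, y_train)
```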


32. For the second model, the relative cost was set at 9-to-1.

33. This is a very high ratio, but gives more insight to the model's behaviour.

34. This time the model correctly classified 97% of the yes, but at the expense of a very low

35. accuracy on the no, with an overall accuracy of only 49%.

36. This was clearly not a good model.

37. The problem with this outcome is the large number of false-positives, which would recommend

38. unnecessary and costly intervention for patients, who would not have been re-admitted anyway.

39. Therefore, the data scientist needs to try again to find a better balance between the

40. yes and no accuracies.

41. For the third model, the relative cost was set at a more reasonable 4-to-1.

42. This time, 68% accuracy was obtained on the yes, called sensitivity by statisticians,

43. and 85% accuracy on the no, called specificity, with an overall accuracy of 81%.
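Sensitivity and specificity fall straight out of the confusion matrix; the sketch below uses made-up labels and predictions in place of the case study's test set.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual readmissions (1 = yes)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # accuracy on the "yes" outcomes
specificity = tn / (tn + fp)   # accuracy on the "no" outcomes
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```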

44. This is the best balance that can be obtained with a rather small training set through adjusting

45. the relative cost of misclassified yes and no outcomes parameter.

46. A lot more work goes into the modeling, of course, including iterating back to the data

47. preparation stage to redefine some of the other variables, so as to better represent

48. the underlying information, and thereby improve the model.

49. This concludes the Modeling section of the course, in which we applied the Case Study

50. to the modeling stage within the data science methodology.

51. Thanks for watching!

Evaluation (3:57)
https://youtu.be/Bzq0u5GGPUM
1. Welcome to Data Science Methodology 101 From Modeling to Evaluation - Evaluation!

2. A model evaluation goes hand-in-hand with model building; as such, the modeling and

3. evaluation stages are done iteratively.

4. Model evaluation is performed during model development and before the model is deployed.

5. Evaluation allows the quality of the model to be assessed but it's also an opportunity
6. to see if it meets the initial request.

7. Evaluation answers the question: Does the model used really answer the initial question

8. or does it need to be adjusted?

9. Model evaluation can have two main phases.

10. The first is the diagnostic measures phase, which is used to ensure the model is working

11. as intended.

12. If the model is a predictive model, a decision tree can be used to evaluate if the answer

13. the model can output is aligned with the initial design.

14. It can be used to see where there are areas that require adjustments.

15. If the model is a descriptive model, one in which relationships are being assessed, then

16. a testing set with known outcomes can be applied, and the model can be refined as needed.

17. The second phase of evaluation that may be used is statistical significance testing.

18. This type of evaluation can be applied to the model to ensure that the data is being

19. properly handled and interpreted within the model.

20. This is designed to avoid unnecessary second guessing when the answer is revealed.

21. So now, let's go back to our case study so that we can apply the "Evaluation" component

22. within the data science methodology.

23. Let's look at one way to find the optimal model through a diagnostic measure based on

24. tuning one of the parameters in model building.

25. Specifically we'll see how to tune the relative cost of misclassifying yes and no outcomes.

26. As shown in this table, four models were built with four different relative misclassification

27. costs.

28. As we see, each value of this model-building parameter increases the true-positive rate,

29. or sensitivity (the accuracy in predicting yes), at the expense of lower accuracy in predicting

30. no, that is, an increasing false-positive rate.

31. The question then becomes, which model is best based on tuning this parameter?

32. For budgetary reasons, the risk-reducing intervention could not be applied to most or all congestive

33. heart failure patients, many of whom would not have been readmitted anyway.
34. On the other hand, the intervention would not be as effective in improving patient care

35. as it should be if not enough high-risk congestive heart failure patients were targeted.

36. So, how do we determine which model was optimal?

37. As you can see on this slide, the optimal model is the one giving the maximum separation

38. between the blue ROC curve and the red baseline.

39. We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the 4 models.

40. And just in case you were wondering, ROC stands for receiver operating characteristic curve,

41. which was first developed during World War II to detect enemy aircraft on radar.

42. It has since been used in many other fields as well.

43. Today it is commonly used in machine learning and data mining.

44. The ROC curve is a useful diagnostic tool in determining the optimal classification

45. model.

46. This curve quantifies how well a binary classification model performs in classifying the yes and

47. no outcomes when some discrimination criterion is varied.

48. In this case, the criterion is a relative misclassification cost.

49. By plotting the true-positive rate against the false-positive rate for different values

50. of the relative misclassification cost, the ROC curve helped in selecting the optimal

51. model.
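A minimal scikit-learn/matplotlib sketch of the plot described above: the true-positive rate against the false-positive rate, compared with the diagonal baseline; the labels and risk scores are synthetic.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]                        # actual outcomes
y_score = [0.9, 0.2, 0.7, 0.6, 0.3, 0.55, 0.8, 0.1, 0.65, 0.4]  # predicted risk of "yes"

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, color="blue", label=f"model (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], color="red", linestyle="--", label="baseline")
plt.xlabel("False-positive rate")
plt.ylabel("True-positive rate (sensitivity)")
plt.legend()
plt.show()
```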

52. This ends the Evaluation section of this course.

53. Thanks for watching!


Module 5 - From Deployment to Feedback 

Learning Objectives
In this lesson you will learn about:

o What happens when a model is deployed.

o Why model feedback is important.

Deployment - Concepts & Case Study (3:31)


https://youtu.be/fKNA3XjJJLE
1. Welcome to Data Science Methodology 101 From Deployment to Feedback - Deployment!

2. While a data science model will provide an answer, the key to making the answer relevant

3. and useful to address the initial question, involves getting the stakeholders familiar

4. with the tool produced.

5. In a business scenario, stakeholders have different specialties that will help make

6. this happen, such as the solution owner, marketing, application developers, and IT administration.

7. Once the model is evaluated and the data scientist is confident it will work, it is deployed

8. and put to the ultimate test.

9. Depending on the purpose of the model, it may be rolled out to a limited group of users

10. or in a test environment, to build up confidence in applying the outcome for use across the board.

11. So now, let's look at the case study related to applying Deployment"

12. In preparation for solution deployment, the next step was to assimilate the knowledge

13. for the business group who would be designing and managing the intervention program to reduce

14. readmission risk.

15. In this scenario, the business people translated the model results so that the clinical staff

16. could understand how to identify high-risk patients and design suitable intervention

17. actions.

18. The goal, of course, was to reduce the likelihood that these patients would be readmitted within
19. 30 days after discharge.

20. During the business requirements stage, the Intervention Program Director and her team

21. had wanted an application that would provide automated, near real-time risk assessments

22. of congestive heart failure.

23. It also had to be easy for clinical staff to use, and preferably through a browser-based

24. application on a tablet that each staff member could carry around.

25. This patient data was generated throughout the hospital stay.

26. It would be automatically prepared in a format needed by the model and each patient would

27. be scored near the time of discharge.

28. Clinicians would then have the most up-to-date risk assessment for each patient, helping

29. them to select which patients to target for intervention after discharge.
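A minimal sketch of that scoring step, assuming a decision tree trained earlier on the patient-level table; the features, data, and the 0.5 "high-risk" threshold are all hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the model built during the modeling stage.
train = pd.DataFrame({
    "age": [70, 65, 80, 75, 68, 59],
    "prior_admissions": [1, 0, 3, 2, 1, 0],
    "readmitted_30d": [1, 0, 1, 1, 0, 0],
})
features = ["age", "prior_admissions"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(train[features], train["readmitted_30d"])

# A newly prepared record for a patient nearing discharge.
new_patient = pd.DataFrame({"age": [78], "prior_admissions": [2]})
risk = tree.predict_proba(new_patient[features])[0, 1]
print(f"Predicted 30-day readmission risk: {risk:.0%}")
if risk >= 0.5:                      # hypothetical threshold for targeting
    print("Flag patient for post-discharge intervention.")
```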

30. As part of solution deployment, the Intervention team would develop and deliver training for

31. the clinical staff.

32. Also, processes for tracking and monitoring patients receiving the intervention would

33. have to be developed in collaboration with IT developers and database administrators,

34. so that the results could go through the feedback stage and the model could be refined over

35. time.

36. This map is an example of a solution deployed through a Cognos application.

37. In this case, the case study was hospitalization risk for patients with juvenile diabetes.

38. Like the congestive heart failure use case, this one used decision tree classification

39. to create a risk model that would serve as the foundation for this application.

40. The map gives an overview of hospitalization risk nationwide, with an interactive analysis

41. of predicted risk by a variety of patient conditions and other characteristics.

42. This slide shows an interactive summary report of risk by patient population within a given

43. node of the model, so that clinicians could understand the combination of conditions for

44. this subgroup of patients.

45. And this report gives a detailed summary on an individual patient, including the patient's

46. predicted risk and details about the clinical history, giving a concise summary for the
47. doctor.

48. This ends the Deployment section of this course.

49. Thanks for watching!

Feedback (3:08)
https://youtu.be/cAHagn7SOMU
1. Welcome to the Data Science Methodology 101 From Deployment to Feedback - Feedback!

2. Once in play, feedback from the users will help to refine the model and assess it for

3. performance and impact.

4. The value of the model will be dependent on successfully incorporating feedback and making

5. adjustments for as long as the solution is required.

6. Throughout the Data Science Methodology, each step sets the stage for the next.

7. Making the methodology cyclical ensures refinement at each stage in the game.

8. The feedback process is rooted in the notion that, the more you know, the more that you'll

9. want to know.

10. That's the way John Rollins sees it and hopefully you do too.

11. Once the model is evaluated and the data scientist is confident it'll work, it is deployed

12. and put to the ultimate test: actual, real-time use in the field.

13. So now, let's look at our case study again, to see how the Feedback portion of the methodology is

14. applied.

15. The plan for the feedback stage included these steps:

16. First, the review process would be defined and put into place, with overall responsibility

17. for measuring the results of applying the risk model to the congestive heart failure

18. risk population.

19. Clinical management executives would have overall responsibility for the review process.

20. Second, congestive heart failure patients receiving intervention would be tracked and

21. their re-admission outcomes recorded.

22. Third, the intervention would then be measured to determine how effective it was in reducing
23. re-admissions.

24. For ethical reasons, congestive heart failure patients would not be split into control

25. and treatment groups.

26. Instead, readmission rates would be compared before and after the implementation of the

27. model to measure its impact.
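A minimal sketch, with made-up numbers, of that before/after comparison; in practice both counts would come from the tracked patient cohorts.

```python
# Hypothetical readmission counts for comparable periods before and after deployment.
readmits_before, patients_before = 310, 1000
readmits_after,  patients_after  = 250, 1000

rate_before = readmits_before / patients_before
rate_after = readmits_after / patients_after
print(f"Readmission rate: {rate_before:.1%} before vs {rate_after:.1%} after "
      f"({rate_before - rate_after:.1%} absolute reduction)")
```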

28. After the deployment and feedback stages, the impact of the intervention program on re-admission

29. rates would be reviewed after the first year of its implementation.

30. Then the model would be refined, based on all of the data compiled after model implementation

31. and the knowledge gained throughout these stages.

32. Other refinements included: Incorporating information about participation

33. in the intervention program, and possibly refining the model to incorporate

34. detailed pharmaceutical data.

35. If you recall, data collection was initially deferred because the pharmaceutical data was

36. not readily available at the time.

37. But after feedback and practical experience with the model, it might be determined that

38. adding that data could be worth the investment of effort and time.

39. We also have to allow for the possibility that other refinements might present themselves

40. during the feedback stage.

41. Also, the intervention actions and processes would be reviewed and very likely refined

42. as well, based on the experience and knowledge gained through initial deployment and feedback.

43. Finally, the refined model and intervention actions would be redeployed, with the feedback

44. process continued throughout the life of the Intervention program.

45. This is the end of the Feedback portion of this course.

46. Thanks for watching!


Course Summary (3:25)

Video is not available; check whether the video is available on Coursera.
1. Welcome to Data Science Methodology 101 Course Summary!

2. We've come to the end of our story, one that we hope you'll share.

3. You've learned how to think like a data scientist, including taking the steps involved

4. in tackling a data science problem and applying them to interesting, real-world examples.

5. These steps have included:

6. forming a concrete business or research problem, collecting and analyzing data,

7. building a model, and understanding the feedback after model deployment.

8. In this course, you've also learned methodical ways of moving from problem to approach, including

9. the importance of

10. understanding the question, the business goals and objectives, and

11. picking the most effective analytic approach to answer the question and solve the problem.

12. You've also learned methodical ways of working with the data, specifically,

13. determining the data requirements, collecting the appropriate data,

14. understanding the data, and then preparing the data for modeling!

15. You've also learned how to model the data by using the appropriate analytic approach,

16. based on the data requirements and the problem that you were trying to solve. Once the approach

17. was selected, you learned the steps involved in

18. evaluating and deploying the model, getting feedback on it, and

19. using that feedback constructively so as to improve the model.

20. Remember that the stages of this methodology are iterative!

21. This means that the model can always be improved for as long as the solution is needed, regardless

22. of whether the improvements come from constructive feedback, or from examining newly available

23. data sources.


24. Using a real case study, you learned how data science methodology can be applied in context,

25. toward successfully achieving the goals that were set out in the business requirements

26. stage.

27. You also saw how the methodology contributed additional value to business units by incorporating

28. data science practices into their daily analysis and reporting functions.

29. The success of this new pilot program that was reviewed in the case study was evident

30. by the fact that physicians were able to deliver better patient care by using new tools to

31. incorporate timely data-driven information into patient care decisions.

32. And finally, you learned, in a nutshell, the true meaning of a methodology!

33. That its purpose is to explain how to look at a problem, work with data in support of

34. solving the problem, and come up with an answer that addresses the root problem.

35. By answering 10 simple questions methodically, we've taught you that a methodology can

36. help you solve not only your data science problems, but also any other problem.

37. Your success within the data science field depends on your ability to apply the right

38. tools, at the right time, in the right order, to address the right problem.

39. And that is the way John Rollins sees it!

40. We hope you've enjoyed taking the Data Science Methodology course and found it to be a valuable

41. experience, one that you'll share with others!

42. And, of course, we also hope that you'll review and take other data science courses

43. in the Data Science Fundamentals learning path!

44. Now, if you're ready and up to the challenge, please take the final exam.

45. Thanks for watching!
