
Data Science Project Workflow

The Software Engineering side that often gets overlooked

Annie Ying, David Meyer, Mike Claes, Saranya HV


4 September 2019

#DataSymposium2019
Are you a data scientist, in a
leadership role on a data science
team, a customer of data science,
or in another role?

30 seconds

What are some successful outcomes of a data science project?

1 minute
© 2018 Cisco and/or its affiliates. All rights reserved. Cisco Confidential #DataSymposium2019
When you think of Data Science,
what tasks do you think of?
1 minute

Of these tasks, which ones do you find the most difficult to accomplish at Cisco?

2 minutes

Data Science Project Outcomes
and Assets for Stakeholders
• Products (e.g., features on Cisco Ready)
• Models, Insights, Experiments (scripts, Jupyter Notebooks)
• Prototypes (API/Tableau/Barebone Web App)

Data Science Project Workflow

Most data scientists tend to focus their effort on the data and the science.

But other phases and the Software Engineering (SE) side are equally important for successful project outcomes:
• Provide relevant insights
• Deploy data science solutions
• Do so efficiently & effectively

[Figure: project workflow diagram; recoverable labels: "SE" (repeated), "Maintenance"]

1. What tools to support the authoring of code and models?
2. How can we coordinate and share work between members in a data science team?
3. How do we ensure the quality of the models? When do we know a model is ready for deployment?
4. How can we leverage DevOps and ML Infrastructure to allow data scientists to focus on what they do best (the modeling)?
5. What software engineering roles do we need to support our work?

https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management

Data Science Project Workflow

1. What tools to support the authoring of code and models?

Tools for authoring code and models depend on the work

[Figure: workflow diagram with outcome labels: Product features; Models, Insights, Experiments; Prototypes]

Data heavy experimentation: Jupyter Notebooks
• (+) Interactivity
• (+) Graphing
• (+) Quick (and dirty)
• Caveats
• (-) Coding tool support
• (-) Discourages extensive code structure
• (-) Out of order execution

Beyond experimentation (Modeling, Feature Engineering): Editors / IDEs
• (+) Encourages code structure
• (+) Coding tool support (like auto-completion, type checking, refactoring)
• Caveats
• (-) Less interactivity
• (-) Less support for graphs

Data Science Project Workflow

2. How can we coordinate and share work between members in a data science team?

Multiple members in the team: potentially all these activities

[Figure: workflow diagram with outcome labels: Product features; Models, Insights, Experiments; Prototypes]

Basic processes for >1 member teams

[Figure: outcome labels: Models, Experiments; Prototypes; Product]

• Version control (git repo on GitHub)
• Changes to notebooks are difficult to track and review
• Dependency management for replicating the environment (e.g., using
requirements.txt in Python)
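
One common way to make notebook changes easier to track and review is to strip cell outputs and execution counts before committing, so that diffs show only source changes. The sketch below assumes the standard Jupyter notebook JSON layout (nbformat 4, with a top-level `cells` list); the function name `strip_outputs` and the sample notebook are illustrative and not part of any team tooling.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed.

    Assumes nbformat 4: a top-level "cells" list in which code cells
    carry "outputs" and "execution_count" keys.
    """
    cleaned = json.loads(json.dumps(notebook))  # deep copy of JSON-only data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical notebook with one markdown cell and one executed code cell.
sample = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Churn EDA"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

stripped = strip_outputs(sample)
print(json.dumps(stripped, indent=1, sort_keys=True))
```

Existing tools such as nbstripout apply the same idea as a Git filter, which avoids having to remember a manual step before each commit.
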
Beyond initial modelling: ongoing DS project

[Figure: Git branching diagram with Models/Experiment, Prototype, and Product branches. Recoverable labels: bug fixes, model enhancements; code review skills: DS; code review skills: SE & DS; roles: DS, DS/SE, SE]

• Iterative DS process involving various roles in the team is error prone
• Proper Git workflow templates can assist this iterative process
• Git activities at the sync points are crucial

Standardizing How We Access Data

• Steps to ‘re-create’ someone else’s work
• Clone Git repo
• Install requirements (pip install -r requirements.txt)
• Now you need to access required data:
• pd.read_csv('mydata.csv')
• Where is mydata.csv? How can I re-create it?

Standardizing How We Access Data

https://wwwin-github.cisco.com/Data-Analytics/pyciscodb
https://wwwin-github.cisco.com/EDSO-CI/ciscodb_R

• Common methodology for accessing necessary data
• Now others know exactly which table(s) are required and
how to access them.
• OK to build a personal ‘cache’ to speed up access,
however, this makes it clear how that cache was created.
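
A minimal sketch of the idea, not the pyciscodb or ciscodb_R API: a single helper records which table is needed and how any personal cache was built. The function `load_table`, the `subscriptions` table, and the in-memory SQLite database standing in for the shared data source are all assumptions made for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_table(conn, table, cache_dir):
    """Return the rows of `table` as a list of dicts, via a CSV cache.

    The cache file is always created by this function, so anyone reading
    the code can see exactly how the cached copy was produced.
    """
    if not table.isidentifier():
        # Table names cannot be bound as SQL parameters, so validate them.
        raise ValueError(f"unexpected table name: {table!r}")
    cache_file = Path(cache_dir) / f"{table}.csv"
    if not cache_file.exists():
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cursor.description]
        with cache_file.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(columns)
            writer.writerows(cursor)
    # Always read from the cache so both paths return identical rows
    # (values come back as strings because CSV is untyped).
    with cache_file.open(newline="") as fh:
        return list(csv.DictReader(fh))


# Stand-in for the shared database (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (customer_id INTEGER, churned INTEGER)")
conn.executemany("INSERT INTO subscriptions VALUES (?, ?)", [(1, 0), (2, 1)])

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_table(conn, "subscriptions", cache_dir)   # queries the source
    second = load_table(conn, "subscriptions", cache_dir)  # served from cache

print(first)
```

Because the cache can only be created through `load_table`, deleting the cache directory and re-running the code is enough to re-create the data.
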

Data Science Project Workflow

3. How do we ensure the quality of the models? When do we know a model is ready for deployment?

When is the model ready?

Who has deployed a model with lower performance than your cross-validation results?

When is the model ready?

• Intrinsic evaluation
• Cross validation
• Extrinsic evaluation
• Use unseen dataset for
evaluation
• Instrumentation in the
POC
• UI to allow explicit
feedback collection
• Can ultimately use this as
input to the ML!
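
A minimal sketch of keeping an unseen dataset separate from cross-validation: reserve a holdout split first, then build the cross-validation folds only from the remaining rows. The function names, the 20% holdout fraction, and the fixed seed are illustrative assumptions; in practice, library routines such as scikit-learn's `train_test_split` and `KFold` would typically replace these helpers.

```python
import random


def holdout_split(n_rows, holdout_fraction=0.2, seed=0):
    """Shuffle row indices and reserve a fraction as the unseen holdout set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Intrinsic evaluation uses only dev_indices; holdout_indices stay untouched
# until the final, extrinsic check.
dev_indices, holdout_indices = holdout_split(n_rows=100)
folds = list(k_fold(dev_indices, k=5))

for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {fold_number}: {len(train)} train rows, {len(validation)} validation rows")
print(f"holdout: {len(holdout_indices)} rows never seen during cross-validation")
```

Scoring the final model once on the holdout rows gives a number that can be compared against the cross-validation average before deployment.
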
Data Science Prototype → Production
Desired Outcomes | SE Solutions
• Product Quality | • Thorough Testing
• Data quality → Model quality → Result quality | • Unit testing, e.g.: how are we handling data errors?
 | • Dev and regression testing: metrics monitoring
• Efficient Process | • Good Architecture
• Faster iterations | • Loosely coupled code
• Performance optimization | • Distributed training, non-blocking code
• Continuous automatic updates: online learning | • CI/CD to monitor metrics and automate model updates
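
A hedged sketch of two of the testing ideas in the Desired Outcomes / SE Solutions pairs above: a unit test that pins down how data errors are handled, and a regression check that compares a candidate model's metric against the deployed baseline before an automated update. The record fields, the choice to drop malformed rows, and the 0.01 tolerance are assumptions for illustration only.

```python
import math


def clean_records(records):
    """Keep records that have a customer_id and a finite, non-negative usage."""
    cleaned = []
    for record in records:
        try:
            usage = float(record["usage"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric usage: drop the row
        if record.get("customer_id") is None:
            continue  # cannot attribute the row to a customer
        if math.isnan(usage) or usage < 0:
            continue  # NaN or negative usage is treated as a data error
        cleaned.append({"customer_id": record["customer_id"], "usage": usage})
    return cleaned


def passes_regression_gate(candidate_metric, baseline_metric, tolerance=0.01):
    """Return True if the candidate is no worse than baseline minus tolerance."""
    return candidate_metric >= baseline_metric - tolerance


# Test functions follow pytest naming so a test runner can collect them.
def test_bad_rows_are_dropped():
    raw = [
        {"customer_id": 1, "usage": "12.5"},
        {"customer_id": 2, "usage": "n/a"},    # non-numeric value
        {"customer_id": None, "usage": "3"},   # missing identifier
        {"customer_id": 4},                    # missing field
        {"customer_id": 5, "usage": -1},       # negative value
        {"customer_id": 6, "usage": "nan"},    # parses, but is not a number
    ]
    assert clean_records(raw) == [{"customer_id": 1, "usage": 12.5}]


def test_regression_gate():
    assert passes_regression_gate(0.845, 0.85)      # within tolerance
    assert not passes_regression_gate(0.80, 0.85)   # clear regression


# Called directly here so the sketch also runs as a plain script.
test_bad_rows_are_dropped()
test_regression_gate()
print("all checks passed")
```

In a CI/CD pipeline, a check like `passes_regression_gate` would typically run after retraining and block the automated model update when it returns False.
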
Data Science Project Workflow

4. How can we leverage DevOps and ML Infrastructure to allow data scientists to focus on what they do best (the modeling)?

How can we leverage DevOps and Agile
methodologies to execute more quickly?
Version Control System + Project Management + Communications + CI/CD Pipeline + Security

Model Development
• Onboarding
• Data Connections
• EDA + Visualization
• Feature Engineering
• Model Training
• Model Evaluation & Validation

Model Production
• Job Scheduling: ETL, Training, Evaluation, Alerts + Logging
• Documentation & Publishing

Model Maintenance
• SLA: Model Performance Metrics & Reporting, Response Time / Refresh Rate, Support Structure (RACI)

Leveraging templates for production

• How to review production architecture b/w segregated teams? Better architecture by utilizing standardized templates and in-house packages
• Re-usable packages: invest in authoring and hosting in-house packages
• Code templates: maintain project templates for different use cases, e.g., deep learning vs. conventional ML

[Figure: diagram labels: ML DevOps; DS team (three times); Re-usable packages; Code templates]

Data Science Project Workflow

5. What software engineering roles do we need to support our work?

Hardest part about ML isn’t ML

Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.

Slide reference from a webinar on MLFlow by
Become Self-Sufficient

Data Scientist: Experiment Design; Feature Creation; Model Creation; Model Validation & Testing; Visualization & Storytelling
Data Engineer: Data Curation & ETL; Infrastructure Architecture; DevOps & CI/CD; Software Engineering
Project Manager: Sprint Planning; Communications; Integration Coordination
Business SME: Domain Knowledge; Hypothesis Generation; Acceptance Testing

Team Structure
• Distinguish between role of centralized and distributed teams (tooling, standards, enterprise projects vs. domain focused projects)
• Expertise needed in machine learning, data engineering, infrastructure & software integration, project management
• Limit size & scope of project teams and apply principle of least privilege
• Small is fast, small is safe
• Standardize within your team on the tech stack you will leverage

Real world scenario: Data Scientist and ML Engineer

Data Scientist
• Initial EDA, prototyping, viz, selling the solution
• Work with product manager, evaluate, freeze the prototype for production
• Iteratively prototyping enhancements along with step 2.

ML Engineer
• Optimize, scale the prototype based on underlying architecture
• Multithreading, multiprocessing, data pipelining, distributed training
• Follow dev templates, evaluate models over time in production

Putting it all together: Example Use Case

[Figure: project workflow diagram annotated with the five workflow questions]

Example Use Case
Multi-cloud Architecture for Subscription Churn
https://subscription-churn.cisco.com/

• Primarily developed & maintained by team of 4
• Exploration & Development: used BigQuery & Notebooks in GCP
• Data Prep & Training: scheduled using Composer/Airflow
• Model Tracking and Versioning: used MLFlow to store metrics & serialized models
• Model Deployment: used batch scoring and stored results in BigQuery & Snowflake
• Visualization: created custom Webapp to display interactive, actionable model results
• Maintenance: used CI/CD Pipeline to deploy latest version to CAE

Conclusion

Most data scientists tend to focus their effort on the data and the science.

But other phases and the Software Engineering side are equally important for successful project outcomes:
• Provide relevant insights
• Deploy data science solutions
• Do so efficiently & effectively
