Data Science Project Workflow: The Software Engineering Side That Often Gets Overlooked

Annie Ying, David Meyer, Mike Claes, Saranya HV

4 September 2019
#DataSymposium2019
Are you a data scientist, a leader of a data science team, a customer of data science, or in another role?
30 seconds
What are some successful outcomes of a data science project?
1 minute
© 2018 Cisco and/or its affiliates. All rights reserved. Cisco Confidential #DataSymposium2019
When you think of Data Science,
what tasks do you think of?
1 minute
Of these tasks, which ones do you find the most difficult to accomplish at Cisco?
2 minutes
Data Science Project Outcomes and Assets for Stakeholders
• Products (e.g., features on Cisco Ready)
• Models, Insights, Experiments (scripts, Jupyter Notebooks)
• Prototypes (API/Tableau/Barebone Web App)
Data Science Project Workflow
Most data scientists tend to focus their effort on the data and the science.
But other phases and the Software Engineering (SE) side are equally important for successful project outcomes:
• Provide relevant insights
• Deploy data science solutions
• Do so efficiently & effectively
[Diagram: project workflow phases, including Maintenance, with "SE" marking the phases that involve software engineering]
1. What tools to support the authoring of code and models?
2. How can we coordinate and share work between members in a data science team?
3. How do we ensure the quality of the models? When do we know a model is ready for deployment?
4. How can we leverage DevOps and ML Infrastructure to allow data scientists to focus on what they do best (the modeling)?
5. What software engineering roles do we need to support our work?
https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management
1. What tools to support the authoring of code and models?
Tools for authoring code and models depend on the work
• Product features
• Models, Insights
• Prototypes
• Experiments (data-heavy experimentation)
Data-heavy experimentation: Jupyter Notebooks
• (+) Interactivity
• (+) Graphing
• (+) Quick (and dirty)
• Caveats
  • (-) Coding tool support
  • (-) Discourages extensive code structure
  • (-) Out-of-order execution
Beyond experimentation (modeling, feature engineering): Editors / IDEs
• (+) Encourages code structure
• (+) Coding tool support (auto-completion, type checking, refactoring)
• Caveats
  • (-) Less interactivity
  • (-) Less support for graphs
2. How can we coordinate and share work between members in a data science team?
Multiple members in the team: potentially all of these activities (product features, models, insights)
Basic processes for teams with more than one member
Applies to models, experiments, and product prototypes.
• Version control (git repo on GitHub)
  • Changes to notebooks are difficult to track and review
• Dependency management for replicating the environment (e.g., using requirements.txt in Python)
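One common way to make notebook changes easier to track and review is to strip cell outputs before committing, so that diffs show only code and text. The sketch below is illustrative only: it assumes the standard nbformat 4 JSON layout (`cells`, `cell_type`, `outputs`, `execution_count`), and `strip_outputs` is a hypothetical helper name, not a tool mentioned in the slides.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and plots
            cell["execution_count"] = None  # drop run-order counters
    return cleaned


# Tiny in-memory notebook standing in for a real .ipynb file (assumption).
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Churn EDA"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "execution_count": 7,
                    "metadata": {},
                }
            ],
        },
    ],
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][1], indent=1, sort_keys=True))
```

In practice the same function could be applied to a file loaded with `json.load` before a commit; the original notebook dict is left unchanged.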
Beyond initial modelling: ongoing DS project
[Diagram: Git branching workflow. Models/Experiment branch (DS): bug fixes and model enhancements, code review skills: DS. Product/Prototype branch (DS/SE, SE): code review skills: SE & DS.]
• The iterative DS process, which involves various roles in the team, is error-prone
• Proper Git workflow templates can assist this iterative process
• Git activities at the sync points are crucial
Standardizing How We Access Data
• Steps to "re-create" someone else's work
  • Clone Git repo
  • Install requirements (pip install -r requirements.txt)
  • Now you need to access required data:
    • pd.read_csv('mydata.csv')
    • Where is mydata.csv? How can I re-create it?
Standardizing How We Access Data
https://wwwin-github.cisco.com/Data-Analytics/pyciscodb
https://wwwin-github.cisco.com/EDSO-CI/ciscodb_R
• Common methodology for accessing necessary data
  • Now others know exactly which table(s) are required and how to access them.
  • It is OK to build a personal "cache" to speed up access; however, this makes it clear how that cache was created.
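The idea of a shared access method with a traceable personal cache can be sketched as follows. This is a minimal illustration under stated assumptions, not the pyciscodb or ciscodb_R API (which the slides do not show): an in-memory SQLite database stands in for the shared data source, and `load_rows`, the `subscriptions` table, and the `.meta.json` sidecar file are hypothetical names.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def load_rows(conn, query, cache_dir, name):
    """Run `query`, caching rows as CSV plus a JSON note on how they were built."""
    cache_dir = Path(cache_dir)
    data_file = cache_dir / f"{name}.csv"
    meta_file = cache_dir / f"{name}.meta.json"
    if data_file.exists() and meta_file.exists():
        recorded = json.loads(meta_file.read_text())
        if recorded["query"] == query:  # reuse the cache only for the same query
            with data_file.open(newline="") as fh:
                return list(csv.reader(fh))
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    rows = [header] + [[str(v) for v in row] for row in cursor.fetchall()]
    cache_dir.mkdir(parents=True, exist_ok=True)
    with data_file.open("w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    meta_file.write_text(json.dumps({"query": query}))  # provenance of the cache
    return rows


# In-memory SQLite database standing in for the shared data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (customer TEXT, renewed INTEGER)")
conn.executemany("INSERT INTO subscriptions VALUES (?, ?)", [("a", 1), ("b", 0)])

with tempfile.TemporaryDirectory() as tmp:
    query = "SELECT customer, renewed FROM subscriptions ORDER BY customer"
    first = load_rows(conn, query, tmp, "subscriptions")   # hits the database
    second = load_rows(conn, query, tmp, "subscriptions")  # served from the cache
    print(first == second)
```

Because the query is stored next to the cached file, a teammate who finds the CSV can see which table it came from and rebuild it.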
3. How do we ensure the quality of the models? When do we know a model is ready for deployment?
When is the model ready?
Who has deployed a model that performed worse than its cross-validation results?
When is the model ready?
• Intrinsic evaluation
  • Cross validation
• Extrinsic evaluation
  • Use unseen dataset for evaluation
  • Instrumentation in the POC
  • UI to allow explicit feedback collection
    • Can ultimately use this as input to the ML!
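The difference between the two kinds of evaluation can be sketched in a few lines. The example below is a toy illustration only: the data is synthetic, the `shift` parameter is an assumed stand-in for drift between development and production data, and the nearest-centroid "model" and all function names are hypothetical, chosen to keep the sketch dependency-free.

```python
import random
from statistics import mean


def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, xs, ys):
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in zip(xs, ys))


def cross_val_accuracy(xs, ys, k=5):
    """Intrinsic evaluation: average accuracy over k train/validation splits."""
    scores = []
    for fold in range(k):
        val_idx = set(range(fold, len(xs), k))
        pairs = list(enumerate(zip(xs, ys)))
        train = [pair for i, pair in pairs if i not in val_idx]
        val = [pair for i, pair in pairs if i in val_idx]
        model = fit_centroids(*zip(*train))
        scores.append(accuracy(model, *zip(*val)))
    return mean(scores)


rng = random.Random(0)


def sample(n, shift):
    """Synthetic 1-D data; `shift` mimics drift between development and production."""
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y + shift, 1.0) for y in ys]
    return xs, ys


dev_x, dev_y = sample(200, shift=0.0)        # data available during modeling
unseen_x, unseen_y = sample(200, shift=0.7)  # unseen data with a shifted distribution

cv_score = cross_val_accuracy(dev_x, dev_y)               # intrinsic evaluation
final_model = fit_centroids(dev_x, dev_y)
unseen_score = accuracy(final_model, unseen_x, unseen_y)  # extrinsic evaluation
print(f"cross-validation accuracy: {cv_score:.2f}")
print(f"unseen-data accuracy:      {unseen_score:.2f}")
```

The two numbers can differ whenever the unseen data does not follow the distribution the model was developed on, which is why cross-validation alone may not show that a model is ready.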
Data Science Prototype → Production
Desired outcome: Product Quality
• Data quality → Model quality → Result quality
SE solution: Thorough Testing
• Unit testing, e.g., how are we handling data errors?
• Dev and regression testing: metrics monitoring
Desired outcome: Efficient Process
• Faster iterations
• Performance optimization
• Continuous automatic updates: online learning
SE solution: Good Architecture
• Loosely coupled code
• Distributed training, non-blocking code
• CI/CD to monitor metrics and automate model updates
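A unit test for data-error handling might look like the sketch below. It is a hypothetical example: `clean_usage` and its rules (skip missing or unparseable readings, reject negative ones) are assumptions made for illustration, not a function from the project described in the slides.

```python
import math
import unittest


def clean_usage(values):
    """Drop missing or non-numeric readings and reject negative ones."""
    cleaned = []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # skip None, empty strings, and other unparseable input
        if math.isnan(number):
            continue  # skip explicit NaN markers
        if number < 0:
            raise ValueError(f"negative usage value: {number}")
        cleaned.append(number)
    return cleaned


class CleanUsageTest(unittest.TestCase):
    def test_skips_missing_and_malformed_values(self):
        self.assertEqual(clean_usage(["3", None, "", "n/a", 4.5]), [3.0, 4.5])

    def test_skips_nan(self):
        self.assertEqual(clean_usage([float("nan"), 1]), [1.0])

    def test_rejects_negative_values(self):
        with self.assertRaises(ValueError):
            clean_usage([2, -1])


# Run the tests explicitly so the example also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanUsageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Writing down the expected behavior for bad input as tests makes the data-quality assumptions explicit and lets a CI pipeline catch regressions when the cleaning code changes.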
4. How can we leverage DevOps and ML Infrastructure to allow data scientists to focus on what they do best (the modeling)?
How can we leverage DevOps and Agile
methodologies to execute more quickly?
Foundation: Version Control System, Project Management, Communications, CI/CD Pipeline, Security
Model Development
• Onboarding
• Data Connections
• EDA + Visualization
• Feature Engineering
• Model Training
• Model Evaluation & Validation
Model Production
• Job Scheduling
  • ETL
  • Training
  • Evaluation
  • Alerts + Logging
• Documentation & Publishing
Model Maintenance
• SLA
  • Model Performance Metrics & Reporting
  • Response Time / Refresh Rate
  • Support Structure (RACI)
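As a rough illustration of how model performance metrics, alerts, and logging can fit together in a scheduled job, consider the sketch below. The function name `check_metric`, the metric name, and the 0.05 tolerance are all assumptions chosen for the example, not values from the slides.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("model_monitor")


def check_metric(name, current, baseline, max_drop=0.05):
    """Return True if `current` is within `max_drop` of `baseline`; log either way."""
    drop = baseline - current
    if drop > max_drop:
        # A real pipeline might page the support contact here; this sketch only logs.
        logger.warning("%s dropped by %.3f (baseline %.3f, current %.3f)",
                       name, drop, baseline, current)
        return False
    logger.info("%s ok (baseline %.3f, current %.3f)", name, baseline, current)
    return True


healthy = check_metric("auc", current=0.81, baseline=0.83)   # small drop: passes
degraded = check_metric("auc", current=0.70, baseline=0.83)  # large drop: alerts
```

A check like this could run after each scheduled evaluation, with its boolean result deciding whether a CI/CD pipeline promotes or holds back a newly trained model.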
Leveraging templates for production
How to review production architecture between segregated teams?
• Re-usable packages: invest in authoring and hosting in-house packages
• Code templates: maintain project templates for different use cases, e.g., deep learning vs. conventional ML
• Better architecture by utilizing standardized templates and in-house packages
[Diagram: ML DevOps and DS teams]
5. What software engineering roles do we need to support our work?
Hardest part about ML isn't ML
Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.
Slide reference from a webinar on MLFlow by
Become Self-Sufficient
Data Scientist
• Experiment Design
• Feature Creation
• Model Creation
• Model Validation & Testing
• Visualization & Storytelling
Data Engineer
• Data Curation & ETL
• Infrastructure Architecture
• DevOps & CI/CD
• Software Engineering
Project Manager
• Sprint Planning
• Communications
• Integration Coordination
Business SME
• Domain Knowledge
• Hypothesis Generation
• Acceptance Testing
Team Structure
• Distinguish between the roles of centralized and distributed teams (tooling, standards, enterprise projects vs. domain-focused projects)
• Expertise needed in machine learning, data engineering, infrastructure & software integration, project management
• Limit size & scope of project teams and apply the principle of least privilege
• Small is fast, small is safe
• Standardize within your team on the tech stack you will leverage
Real-world scenario: Data Scientist and ML Engineer
Data Scientist
• Initial EDA, prototyping, viz, selling the solution
• Work with product manager, evaluate, freeze the prototype for production
• Iteratively prototyping enhancements along with step 2
ML Engineer
• Optimize and scale the prototype based on the underlying architecture
• Multithreading, multiprocessing, data pipelining, distributed training
• Follow dev templates, evaluate models over time in production
Putting it all together: Example Use Case
Example Use Case: Multi-cloud Architecture for Subscription Churn
https://subscription-churn.cisco.com/
• Primarily developed & maintained by a team of 4
• Exploration & Development
  • Used BigQuery & Notebooks in GCP
• Data Prep & Training
  • Scheduled using Composer/Airflow
• Model Tracking and Versioning
  • Used MLFlow to store metrics & serialized models
• Model Deployment
  • Used batch scoring and stored results in BigQuery & Snowflake
• Visualization
  • Created custom web app to display interactive, actionable model results
• Maintenance
  • Used CI/CD pipeline to deploy latest version to CAE
Conclusion
Most data scientists tend to focus their effort on the data and the science.
But other phases and the Software Engineering side are equally important for successful project outcomes:
• Provide relevant insights
• Deploy data science solutions
• Do so efficiently & effectively
