#DataSymposium2019
Are you a data scientist, a leader of a data science team, a customer of data science, or in another role?
30 seconds
1 minute
© 2018 Cisco and/or its affiliates. All rights reserved. Cisco Confidential #DataSymposium2019
When you think of Data Science,
what tasks do you think of?
1 minute
2 minutes
Data Science Project Outcomes and Assets for Stakeholders
• Products (e.g., features on Cisco Ready)
• Models, Insights, Experiments (scripts, Jupyter Notebooks)
• Prototypes (API/Tableau/Barebone Web App)
Most data scientists tend to focus their effort on the data and the science
[Figure: project workflow diagram with stages labeled SE, plus Maintenance]
1. What tools support the authoring of code and models?
https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management
Tools for authoring code and models depend on the work
[Figure: project outcomes (Product features; Models, Insights, Experiments; Prototypes) and Maintenance, each labeled SE, with a callout for data-heavy experimentation]
Data-heavy experimentation: Jupyter Notebooks
[Figure: project outcomes (Product features; Models, Insights, Experiments; Prototypes)]
• (+) Interactivity
• (+) Graphing
• (+) Quick (and dirty)
• Caveats
• (-) Coding tool support
• (-) Discourages extensive code structure
• (-) Out-of-order execution
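The out-of-order execution caveat can be illustrated with a small simulation. This is only a sketch: each function stands in for one notebook cell, the dictionary stands in for the kernel's shared state, and all cell names and values are hypothetical.

```python
# Each function plays the role of one notebook cell; `ns` plays the role of
# the kernel namespace that all cells share. Names and values are hypothetical.

def cell_1_load(ns):
    ns["prices"] = [10, 20, 30]

def cell_2_scale(ns):
    # Mutates shared state in place, so running it twice changes the result.
    ns["prices"] = [p * 2 for p in ns["prices"]]

def cell_3_report(ns):
    return sum(ns["prices"])

def run(cells):
    """Execute cells in the given order against a fresh namespace."""
    ns = {}
    result = None
    for cell in cells:
        result = cell(ns)
    return result

# Top-to-bottom execution, as a reader of the notebook would expect.
in_order = run([cell_1_load, cell_2_scale, cell_3_report])

# The scaling cell is accidentally executed twice before the report cell.
rerun = run([cell_1_load, cell_2_scale, cell_2_scale, cell_3_report])

print(in_order, rerun)
```

The two runs report different totals even though the notebook's source is identical, which is why a notebook's saved output may not be reproducible from a clean top-to-bottom run.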
Beyond experimentation
• Modeling
• Feature Engineering
[Figure: project outcomes (Product features; Models, Insights, Experiments; Prototypes) and Maintenance, each labeled SE]
Beyond experimentation (Modeling, Feature Engineering): Editors / IDEs
[Figure: project outcomes (Product features; Models, Insights, Experiments; Prototypes)]
• (+) Encourages code structure
• (+) Coding tool support (auto-completion, type checking, refactoring)
• Caveats
• (-) Less interactivity
• (-) Less support for graphs
Data Science Project Workflow
[Figure: project workflow diagram with stages labeled SE, plus Maintenance]
2. How can we coordinate and share work between members in a data science team?
https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management
Multiple members in the team: potentially all these activities
[Figure: project outcomes (Product features; Models, Insights, Experiments; Prototypes) and Maintenance, each labeled SE]
Basic processes for >1 member teams
[Figure: Git workflow across three branches: Experiment branch (models, experiments; skills: DS), Prototype branch (prototypes; skills: SE & DS), and Product branch (bug fixes, model enhancements; DS/SE), with code reviews between branches]
• An iterative DS process involving various roles in the team is error-prone
• Proper Git workflow templates can assist this iterative process
• Git activities at the sync points are crucial
Standardizing How We Access Data
• pd.read_csv('mydata.csv')
• Where is mydata.csv? How can I re-create it?
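One way to answer both questions is to load data through a small catalog that records where each dataset lives and how to regenerate it. The sketch below is illustrative only: `DatasetSpec`, `load_dataset`, the source name, and the query are hypothetical, and it uses the standard-library `csv` module rather than any particular in-house package.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical catalog entry: where the data lives and how to rebuild it."""
    path: Path
    source: str    # upstream system or table the file was extracted from
    recreate: str  # query or command that regenerates the file

def load_dataset(name, catalog):
    """Resolve a logical dataset name through the catalog and read its rows."""
    spec = catalog[name]
    if not spec.path.exists():
        raise FileNotFoundError(
            f"{name!r} not found at {spec.path}; recreate with: {spec.recreate}"
        )
    with spec.path.open(newline="") as handle:
        return list(csv.DictReader(handle))

# Usage with a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "mydata.csv"
    data_path.write_text("id,value\n1,3.5\n2,4.0\n")
    catalog = {
        "mydata": DatasetSpec(
            path=data_path,
            source="hypothetical_db.orders",
            recreate="SELECT id, value FROM orders",
        )
    }
    rows = load_dataset("mydata", catalog)

print(rows)
```

Team members then refer to the logical name `"mydata"` instead of a local file path, and a missing file produces an error that says how to rebuild it.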
Standardizing How We Access Data
https://wwwin-github.cisco.com/Data-Analytics/pyciscodb
https://wwwin-github.cisco.com/EDSO-CI/ciscodb_R
Data Science Project Workflow
[Figure: project workflow diagram with stages labeled SE, plus Maintenance]
When is the model ready?
• Intrinsic evaluation
• Cross validation
• Extrinsic evaluation
• Use unseen dataset for evaluation
• Instrumentation in the POC
• UI to allow explicit feedback collection
• Can ultimately use this as input to the ML!
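The first two checks in the list, cross validation and evaluation on an unseen dataset, can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic data, the toy threshold model, and the helper names. In practice a library such as scikit-learn would usually handle the splitting and scoring.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices and yield (train, validation) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def fit_threshold(xs, ys):
    """Toy model: a threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def accuracy(threshold, xs, ys):
    """Fraction of samples where 'x above threshold' matches the label."""
    return mean(int((x > threshold) == bool(y)) for x, y in zip(xs, ys))

# Synthetic two-class data (hypothetical): class 1 is centered higher.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(50)]
rng.shuffle(data)

# Set aside an unseen dataset before any model development happens.
dev, holdout = data[:80], data[80:]
xs = [x for x, _ in dev]
ys = [y for _, y in dev]

# Cross validation on the development set only.
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(dev), k=5):
    threshold = fit_threshold([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    fold_scores.append(
        accuracy(threshold, [xs[i] for i in val_idx], [ys[i] for i in val_idx])
    )
cv_accuracy = mean(fold_scores)

# Final check: refit on all development data, score once on the unseen set.
final_threshold = fit_threshold(xs, ys)
holdout_accuracy = accuracy(
    final_threshold, [x for x, _ in holdout], [y for _, y in holdout]
)

print(f"cross-validation accuracy: {cv_accuracy:.2f}")
print(f"held-out accuracy: {holdout_accuracy:.2f}")
```

Keeping the held-out set out of the cross-validation loop is the point of the design: the held-out score is only meaningful if that data never influenced model choices.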
Data Science Prototype → Production
Desired Outcomes
• Product quality
• Data quality → Model quality → Result quality
• Continuous automatic updates: online learning
SE Solutions
• Thorough testing
• Unit testing, e.g., how are we handling data errors?
• Dev and regression testing: metrics monitoring
• CI/CD to monitor metrics and automate model updates
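As an illustration of the unit-testing question above ("how are we handling data errors?"), the sketch below tests a small cleaning function against missing, non-numeric, and NaN inputs. The function `clean_record` and its rule of returning `None` for bad input are hypothetical choices, not a prescribed policy.

```python
import math
import unittest

def clean_record(record):
    """Hypothetical cleaning step: return a float feature, or None if unusable."""
    raw = record.get("value")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Missing field (None) or text that is not a number.
        return None
    if math.isnan(value) or math.isinf(value):
        return None
    return value

class CleanRecordTests(unittest.TestCase):
    def test_valid_number_is_parsed(self):
        self.assertEqual(clean_record({"value": "3.5"}), 3.5)

    def test_missing_field_is_rejected(self):
        self.assertIsNone(clean_record({}))

    def test_non_numeric_text_is_rejected(self):
        self.assertIsNone(clean_record({"value": "n/a"}))

    def test_nan_is_rejected(self):
        self.assertIsNone(clean_record({"value": "nan"}))

# Run the suite programmatically so the result can be inspected, for example
# by a CI job that fails the build when any test fails.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanRecordTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test pins down one decision about bad data, so a later change to the cleaning rules shows up as a failing test rather than a silent shift in model inputs.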
Data Science Project Workflow
[Figure: project workflow diagram with stages labeled SE, plus Maintenance]
Model Development
• Onboarding
• Data Connections
• EDA + Visualization
• Feature Engineering
• Model Training
• Model Evaluation & Validation
Model Production
• Job Scheduling: ETL, Training, Evaluation, Alerts + Logging
• Documentation & Publishing
Model Maintenance
• SLA: Model Performance Metrics & Reporting, Response Time / Refresh Rate, Support Structure (RACI)
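The "Alerts + Logging" and "Model Performance Metrics" items can be sketched as a scheduled check that compares the current metric with a baseline and logs a warning when it drops too far. The function name, the tolerance, and the metric values below are hypothetical.

```python
import logging

logger = logging.getLogger("model_monitor")

def check_metric(name, current, baseline, max_drop=0.05):
    """Return True if the metric is within tolerance of its baseline.

    Logs a warning (the alert) when the drop exceeds `max_drop`.
    """
    drop = baseline - current
    if drop > max_drop:
        logger.warning(
            "%s dropped by %.3f (baseline %.3f, current %.3f)",
            name, drop, baseline, current,
        )
        return False
    logger.info("%s within tolerance (current %.3f)", name, current)
    return True

# Hypothetical values: a small dip passes, a large drop raises an alert.
small_dip_ok = check_metric("accuracy", current=0.91, baseline=0.92)
large_drop_ok = check_metric("accuracy", current=0.80, baseline=0.92)

print(small_dip_ok, large_drop_ok)
```

A scheduler or CI/CD job could call such a check after each evaluation run and use the return value to decide whether to page the support contact or block an automated model update.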
Leveraging templates for production
• How to review production architecture between segregated teams?
• Better architecture by utilizing standardized templates and in-house packages
• Re-usable packages: invest in authoring and hosting in-house packages
• Code templates: maintain project templates for different use cases, e.g., deep learning vs. conventional ML
[Figure: ML DevOps and DS teams sharing re-usable packages and code templates]
Data Science Project Workflow
[Figure: project workflow diagram with stages labeled SE, plus Maintenance]
5. What software engineering roles do we need to support our work?
https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management
Hardest part about ML isn’t ML
Slide reference from a webinar on MLFlow by
Become Self-Sufficient
• Data Curation & ETL
• Infrastructure Architecture
• DevOps & CI/CD
• Software Engineering
• Domain Knowledge
• Hypothesis Generation
• Experiment Design
• Feature Creation
• Model Creation
• Model Validation & Testing
• Visualization & Storytelling
• Acceptance Testing
• Sprint Planning
• Communications
• Integration Coordination
Team Structure
• Distinguish between the roles of centralized and distributed teams (tooling, standards, enterprise projects vs. domain-focused projects)
• Expertise needed in machine learning, data engineering, infrastructure & software integration, project management
• Limit size & scope of project teams and apply the principle of least privilege
• Small is fast, small is safe
• Standardize within your team on the tech stack you will leverage
Real-world scenario: Data Scientist and ML Engineer
[Figure: project outcomes labeled SE: Models, Insights, Experiments (scripts, Jupyter Notebooks); Prototypes (API/Tableau/Barebone Web App)]
Most data scientists tend to focus their effort on the data and the science
[Figure: project workflow diagram with stages labeled SE]
https://wiki.cisco.com/display/DANALYTICS/Model+Lifecycle+Management