
Synapse Intelligence

Findings summary

Date: November 2020
Who: Veronika Monohan, Matthew Hicks, Santosh Balasubramanian
Objective
Identify the main pain points in the process of building a cloud-based data analytics solution and reveal opportunities for introducing optimizations and automated actions in the spirit of "intelligent workspaces", which would allow customers to benefit from intelligent detection and actions regarding solution patterns, opportunities, problems, and recommendations.

Background
Methodology
Hybrid. Exploratory interviews and concept testing.
Participants
10 external data professionals who have experience building cloud-based data analytics solutions for their work, including:
• Designing the solution;
• Configuring and deploying the solution;
• Optimizing the cost and security aspects of the solution;
• Managing the access and resources used by the solution.

Deck outline

• Main challenges
• Most expensive and time-consuming aspects
• Specific challenges related to monitoring, cost control, access control, and data
• Ideal solution, based on participants' unaided responses to open-ended questions
• Desired optimizations, recommendations, and automations
• Feedback on feature concepts
Main challenges



Solution architecture and design challenges
Based on participants' current projects

• Choosing the combination of products and services for the solution is challenging, because it takes a lot of time to stay informed about all the new products available on the market and to learn how to integrate and use them together.
• In addition, finding an optimal solution in terms of cost and performance is not straightforward. Estimating cost and performance requires building a proof-of-concept project and experimenting with different settings until reaching a satisfactory outcome.
• Another challenge for participants is switching between platforms when they decide that a different platform would meet their project needs better.
Experience and learning challenges
Based on participants' current projects

• Participants mentioned that lack of experience with new technology is another struggle for them. They want to utilize new services and their benefits, but it's quite time-consuming to stay informed about all available solutions, because there are too many of them on the market and some of them have a steep learning curve.
• Setting up a new service that takes too many steps and requires reading long, unclear, or incomplete documentation is a major pain point. E.g., when deploying ADF in Azure DevOps it is really hard to set up CI/CD, so P7 has to read the "really big tutorial" every time just to deploy one service, which he finds frustrating.
• Another challenge caused by using certain technologies is finding qualified data professionals who are already experienced with them when recruiting. E.g., P10 had a hard time finding employees who can work with a columnar database engine. (DE Leads)
Data related challenges
Based on participants' current projects

• Exploring data and trying to find patterns in new datasets that have not been used before, especially large ones, is among the main challenges, because it takes time and has a high level of ambiguity. For P4 it takes staring at the data from different views, defining something which is certain, and exploring it in a narrow focus to identify initial patterns.
• Detecting changes in the source data coming from the business is tricky and requires P10 to run hourly schema integrity checks as part of their monitoring (a minimal sketch of such a check follows this list).
• P8 wants the ability to transform the data while migrating it by swapping some of the old tables with new ones. Currently, he has to take three separate steps – load the data into Snowflake, use SQL to conduct the transformations, and then orchestrate it in Ruby. He wishes he could blend the first two steps.
• The main data-related challenge for P7 is the extra time he needs to spend verifying the results of data transformations in Databricks, which take significantly longer compared to SQL. (E.g., getting results in SQL takes 3-5 seconds and in Databricks, 25-30 seconds.)
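The hourly schema integrity check P10 describes can be approximated with a small script. Below is a minimal Python sketch, assuming a DB-API-style connection and a hand-maintained baseline schema; the table, the column names, and the sqlite3 stand-in are illustrative, not details from the study.

```python
# Minimal sketch of an hourly schema integrity check of the kind P10 described.
# Assumes a DB-API connection and a hand-maintained baseline; all names are illustrative.
import sqlite3  # stand-in for the actual warehouse driver

EXPECTED_SCHEMA = {
    "orders": {"order_id": "INTEGER", "customer_id": "INTEGER", "amount": "REAL"},
}

def check_schema(conn) -> list[str]:
    """Return a list of human-readable schema drift findings."""
    findings = []
    for table, expected_cols in EXPECTED_SCHEMA.items():
        actual_cols = {
            row[1]: row[2]  # rows are (cid, name, type, notnull, dflt_value, pk)
            for row in conn.execute(f"PRAGMA table_info({table})")
        }
        for col, col_type in expected_cols.items():
            if col not in actual_cols:
                findings.append(f"{table}: column '{col}' is missing")
            elif actual_cols[col].upper() != col_type:
                findings.append(f"{table}.{col}: type changed to {actual_cols[col]}")
        for col in actual_cols.keys() - expected_cols.keys():
            findings.append(f"{table}: unexpected new column '{col}'")
    return findings

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # hypothetical local stand-in
    for finding in check_schema(conn):
        print("SCHEMA DRIFT:", finding)  # in practice this would feed the team's alerting channel
```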

Connectivity challenges
Based on participants' current projects

• P4 had connection problems when running Jupyter notebooks on a remote server using JupyterHub and JupyterLab; the cause was never identified. In his opinion, using a managed instance instead of having to maintain the Jupyter server would have been a better option.
• P9 faced connectivity issues caused by firewalls between their secure systems and the cloud products.
• When working with Databricks, if you want to use DevOps to track your code, both must be in the same subscription. Alternatively, you can use a VPN connection, but it is highly unstable and inefficient because of access credentials that you need to fill in multiple times. (P7)
Most expensive and time-consuming aspects


Cost

Most expensive aspects
When asked what the most expensive aspects of their solution are, participants named compute, infrastructure and services cost, and developer time and salary. These were their top-of-mind, unaided responses.

Number of mentions (out of 10):
• Compute – 9 (DE, DS, DA, SA)
• Infrastructure and services – 8 (DE, DS, DA, SA)
• Developer time and salary – 5 (DS, DE, SA)
• Storage – 1 (DE)

*DE – data engineer, DS – data scientist, DA – data analyst, SA – solution architect
Time

Most time-consuming aspects (number of mentions)
• Collecting and translating business requirements (4)
• Transforming the incoming data to the right format (4)
• Solving data quality issues, cleaning data (2)
• Exploring and understanding the data (2)
• Setting up the data pipeline and scheduling (2)
• Validating, optimizing, and preparing the data output for stakeholders (4)
• Optimizing job performance (2)
• Optimizing compute (1)
• Subject matter expertise gaps (5)
Specific behaviors and gaps related to monitoring, data, cost & access control


Monitoring

The main success criterion for monitoring is knowing about failures and being able to fix them before your stakeholders start complaining.

Behavior patterns
• 5/10 use at least one dedicated monitoring tool, e.g., Sentry, Airflow, SnowPlow, or CloudWatch. (DS, DA, DE)
• 6/10 have a dashboard as part of their monitoring solution. (DS, DA, DE)
• 5/10 have set up additional alerts, e.g., an email or Slack message, and like to get notified when there is an anomaly so they can take quick action to fix it if needed (a minimal sketch of such an alert follows this list). (Mostly DE, but also DS and DA)
• 2/10 find it important to be able to manually set up thresholds for the monitoring alerts. (DS, SA)

Gaps
• Monitoring is harder for serverless than for provisioned compute, because the jobs run for a longer time.
• When looking at a job, P8 wants to see "the separate executions for the job instead of only seeing a green dot" once the job has run.
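For illustration, here is a minimal Python sketch of the kind of threshold-based Slack alert participants describe. The webhook URL, the metric, and the threshold are hypothetical, not taken from any participant's setup; Slack incoming webhooks accept a JSON payload with a `text` field.

```python
# Minimal sketch: check a metric against a manually configured threshold and notify Slack.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical webhook
FAILED_ROWS_THRESHOLD = 100  # manually set threshold, as 2/10 participants asked for

def notify_slack(text: str) -> None:
    """Post a plain-text message to a Slack channel via an incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def check_pipeline_run(run_id: str, failed_rows: int) -> None:
    """Alert only when the anomaly crosses the configured threshold."""
    if failed_rows > FAILED_ROWS_THRESHOLD:
        notify_slack(
            f"Pipeline run {run_id}: {failed_rows} failed rows "
            f"(threshold {FAILED_ROWS_THRESHOLD})"
        )
```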

Cost control

The main success criterion for cost control is when the incurred cost matches the initial estimate.

Most of the participants said that it's tricky to predict the cost, because it is hard to estimate the exact inputs your solution needs. Therefore, most of them apply a trial-and-error approach.
"If there is an unexpected cost, there isn't much you can do. You can only post-fix it." – P1

Cost control strategies
• Setting spending caps is one of the most common cost control strategies, mentioned by 5 participants. While it is not an ideal solution for them, it helps them prevent unexpectedly high bills.
• Others try to estimate the cost of the different components of the solution using the online calculators provided by the cloud platforms, and then monitor the actual cost on a daily or weekly basis (a minimal sketch of this estimate-then-monitor check follows this list). This strategy was mentioned by participants with more flexible budgets.
• Only one participant shared that he uses a tool called Cloudability, which shows AWS cost broken down by resource level and doesn't require setting up spending limits. (P10)

* 6/10 participants were involved in the cost control of their project.
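As a rough illustration of the estimate-then-monitor strategy above, the sketch below compares cumulative daily spend against a prorated monthly estimate and flags overruns. The monthly figure and the daily numbers are made up; in practice the daily costs would come from the cloud provider's billing export.

```python
# Minimal sketch of the estimate-then-monitor cost strategy described above.
MONTHLY_ESTIMATE_USD = 3000.0   # hypothetical figure from an online pricing calculator
DAYS_IN_MONTH = 30

def check_daily_spend(daily_costs: list[float]) -> None:
    """Warn whenever cumulative spend runs ahead of the prorated estimate."""
    budget_per_day = MONTHLY_ESTIMATE_USD / DAYS_IN_MONTH
    cumulative = 0.0
    for day, cost in enumerate(daily_costs, start=1):
        cumulative += cost
        allowed_so_far = budget_per_day * day
        if cumulative > allowed_so_far:
            print(f"Day {day}: spent {cumulative:.2f} USD vs {allowed_so_far:.2f} USD budgeted so far")

check_daily_spend([90.0, 110.0, 140.0, 95.0])  # example daily figures, not real billing data
```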

Cost control

Most participants said that their organization is not currently tracking the cost incurred by separate users or processes. Some said that they try to get project-level estimates instead.
Only one participant reported that his company's DevOps team is trying to track cost by service and API call. Yet the way to achieve this is cumbersome, because it requires the developers to enter many tags, which is very time-consuming (one way to reduce this repetition is sketched below). (P3)

"The biggest annoyance is that I have to enter up to 10 or 15 different tags that are basically the same and I have to enter them in like as many as like 10-20 different spots where technically it's the same service, so it could be easier." – P3

* 6/10 participants were involved in the cost control of their project.
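One way to ease the repetitive tagging P3 describes is to define the tag set once and apply it programmatically to every resource. This is a minimal sketch, not something participants reported doing; the tag names and the commented-out SDK call are hypothetical.

```python
# Minimal sketch: define cost-tracking tags once and merge them onto each resource,
# instead of re-entering the same 10-15 tags by hand in every spot.
COMMON_TAGS = {
    "team": "data-platform",
    "project": "analytics",
    "cost-center": "1234",
    "environment": "prod",
}

def tag_resource(resource_id, extra_tags=None):
    """Merge the shared tags with resource-specific ones; the actual SDK call is omitted."""
    tags = {**COMMON_TAGS, **(extra_tags or {})}
    # e.g., cloud_sdk.apply_tags(resource_id, tags)  # hypothetical call
    return tags

print(tag_resource("my-etl-job", {"service": "ingestion"}))
```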

Access control

Granting and requesting access
• Most of the participants said that in their company there is no way to discover who needs access to data and resources other than getting a request from the particular user.
• For smaller teams this is done via direct communication, e.g., verbal, email, or message. For larger teams, the person who needs access submits a ticket, which is directed to the specific admin or team in charge of the data/resource.

Ways of setting up access (number of participants):
• Manual way of setting up access – 6
• Centralized way of setting up access – 2
Access control

Challenges
• When granting access roles manually, P6 has no way of checking whether the person who requests the role is supposed to get it.
• The main access control challenge for P7 is the integration between Key Vault and Azure Databricks. Accessing keys stored in Key Vault from Databricks is hard, because he needs to go through 8 steps and there is no simple one-or-two-click path.
• P10 recently had to take on the role of managing access, and his main struggle was learning how to do it.
Data challenges

• The incoming data needs to be transformed. There can be different reasons:
  • data which comes in the wrong format (P8);
  • event data which needs to be standardized (P2, P4);
  • data that has lots of invalid values – P2 had to set boundaries on the value ranges in order to filter it (P2).
• It is hard to find suitable data sources and to know whether they will still be available in the future. (P3)
• Understanding the business background well so you can model the data properly. (P10)
• When working with a real-time system, P10 has to update the data every 3 hours, but the updates disturb many of the normal operational processes and the end-user experience.
The ideal solution



Magic wand

Ideal solution: Simplified setup and deployment process
• 4/10 want to remove the setup process and get started with their actual data-related work. The current process has too many steps, such as manually setting up the server endpoints, storage, VMs, and instances, which takes several days. Ideally, participants would like to have an automated one-click deployment with perfect security around it.
"I'm not getting paid for building the infrastructure." – P7
• In addition, they want a template so they can automate the deployment of different environments in a CI/CD process and don't have to do the same setup all over again; they might have anywhere between 3-4 and 20-30 environments (a minimal sketch follows this list). Using the template is also the best way to replicate the solution.
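For illustration, here is a minimal sketch of template-driven deployment across environments, assuming an existing ARM/Bicep template and the Azure CLI; the template file name, resource-group naming, and parameter name are hypothetical, not from any participant's setup.

```python
# Minimal sketch: deploy the same template to every environment instead of repeating manual setup.
import subprocess

ENVIRONMENTS = ["dev", "test", "prod"]  # participants mentioned anywhere from 3-4 to 20-30

for env in ENVIRONMENTS:
    subprocess.run(
        [
            "az", "deployment", "group", "create",
            "--resource-group", f"analytics-{env}",   # hypothetical naming scheme
            "--template-file", "solution.bicep",      # hypothetical template
            "--parameters", f"environmentName={env}",
        ],
        check=True,  # stop if one environment fails to deploy
    )
```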

Magic wand

Ideal solution: Getting clean, reliable data
4/10 want to skip the process of cleaning and transforming the data into the right format so they can jump straight into the part of the work they enjoy most, which is analyzing it. The most common reasons for transforming the data are:
• There is corrupt data which needs to be removed because it causes processes to fail. (P10)
• The data comes in a wrong/different format and needs to be converted/standardized. (P3, P4, P8)
• Dealing with upstream dependencies. (P3)

"If I can just get to a point where I can start plotting things, analyzing things, making predictions, looking for anomalies, looking for trending data, making forecasts… That part, at least for me, that's the part I enjoy. However, everything leading up to that feels like routine, automatable work that still ends up consuming an inordinate amount of time for me." – P3
Magic wand

Ideal solution: Further ideas
• P4 wants the ability to do data exploration without having to write code, so he can get a quick glimpse of some of the basic data stats. (P4)
• P5 would remove the part of the ETL process where he has to load the data from the source into the data warehouse, because he finds it time-consuming and frustrating. Today it is a necessary step, because the data needs to be loaded in order to be used for analysis, but it doesn't add much to their business.
• P8 wants to skip the process of gathering requirements from stakeholders.
Optimizations

Smart data source recommendation
P3 would like the ability to identify suitable data sources by doing "forensic" data analysis using some sort of AI. He could ask a question in natural language and get suggestions from the system on what would be a suitable data source for it, as well as general info about the data there.
"If it can do some NLP and basically understand my question and then go look at the columns and figure out: 'Hey, does it look like maybe some of the column names overlap at least with your question? Do the columns seem well populated? Does that seem clean? Is there recent data there? How often does the data get updated? What's the granularity of the data there?'" – P3
Optimizations

Self-serve reports for business stakeholders
P6 wants the ability to provide his business users with self-serve reports where they can drag, drop, and create their own insights. This would save time because the business users would not have to reach out to the developers to get reports.
"End-2-end self-serve reports." – P6
Optimizations

Decoupling of compute and storage; decoupling of reads and writes
P10 wants decoupling of compute and storage, because they have hit the storage limit of the cluster they currently use but are far from its compute capacity. Having the option to select storage and compute separately would potentially improve the cost-efficiency of their project.
He also wants to decouple reads and writes: when they process new writes, their users' normal read processes get disrupted.
Optimizations

Automated ETL
P5 wants an affordable one-click ETL solution.
"Here is my data source. Here is where I want to get it to. That's all I need to do." – P5
Recommendations

• About cost savings – queries that are rarely used, or how to optimize queries. (3)
"We've noticed no one ever queries this data that's being loaded. Do you need to load it? If not, then you can like turn it off and you don't need to pay for it." – P5
• System recommendations about better resource management based on the actual usage in terms of storage and resources. (P1, P2)
• Recommendations about the most appropriate storage or folder structure based on the data and its features, e.g., buckets or blob-type storage. (P1)
• Tips about problematic areas in the solution, e.g., code which can cause it to break. (P2)
• Recommendations about security concerns or how to improve security. (P2, P7)
• Data cleansing and prep. (P3)
• How to improve query performance, e.g., in terms of structuring the queries. (P5, P6, P7)
• Reports which can be reused. (P6)
Recommendations

• Exploratory data analysis takes a lot of time for new data – scatterplots and stats, how many values are 0; it would be nice to get a solid view of basic stats so he can know very quickly when a data field is empty too often, plus edge cases and outliers.
• Reuse of existing results.
• Automatic email alerts with built-in analytics about system performance in case of high throughput.
• Which tables are not used, so he can remove or offload them. This way he can save time and cost and improve performance. (P10)
• Switching from one service to another that would save cost.
• How to optimize code that might cause breakdowns.
Automatic actions

• Evaluating data quality, setting up pre-processing, and deploying the logic.
• Automatic adjustment of the cluster based on the queries he wants to use.
• Basic stats run by the system so he can see max, min, mode, mean, etc. – auto exploratory stats. (P4)
• Schema integrity checks, updates, reloading and refreshing the table. (P5, P10)
• Automated review of access permissions for each account and information about user activity. This would be useful for security audits. (P8)
Intelligent features



Features budget

Specific features/ideas
• 9/10 want Automatic alerting and Intelligent result caching.
• 7/10 want Data profiling, Central solution monitoring and exploration, and Predictions, recommendations, and optimizations.
• 1/10 doesn't want Central monitoring but would still pay for it.

Total sum allocated per feature, out of $1000:
• Automatic alerting – 245
• Intelligent result caching – 240
• Data profiling – 195
• Central solution monitoring and exploration – 185
• Predictions, recommendations, and optimizations – 135
Features budget

Average feature budget per persona, out of $100:

Feature / Persona | Data engineer | Data scientist | Data analyst | Solution architect
Automatic alerting | 26 | 12 | 23 | 9
Intelligent result caching | 24 | 20 | 13 | 26
Data profiling | 18 | 37 | 30 | 29
Central solution monitoring and exploration | 17 | 21 | 26 | 17
Predictions, recommendations, and optimizations | 14 | 10 | 8 | 19
Features budget

Average feature budget, Spark vs SQL users, out of $100:

Feature | Spark users | SQL users
Automatic alerting | 14 | 32
Intelligent result caching | 23 | 25
Data profiling | 29 | 13
Central solution monitoring and exploration | 13 | 23
Predictions, recommendations, and optimizations | 23 | 8
Automatic alerting

Concept: configure alerts for a range of automatically detected conditions, such as changes, trends, or anomalies in your data, code, resources, or user activity.

Most appealing to: data engineers & SQL users

9/10 participants would like to have automatic alerting as part of their solution. Participants found that automatic alerting would bring them the following benefits:
• Spot anomalies in data or query performance.
• Spot changes in the incoming data.
• Save troubleshooting time.
• Increase availability and improve their customers' satisfaction.

"I'm ranking these (features) based on how I feel they would contribute to the relationship we have with our customers. So, this would help us build our trust most." – P8

This feature would resolve some of the main pain points for P8 and P10.
"This would be fantastic, because it covers what I mentioned before – a recurring issue that happens several times a week and takes several hours to fix, caused by significant changes in the data." – P8
"My bread and butter." – P10
P10's team built a whole framework for alerting that covers changes, trends, and anomalies. Their custom solution doesn't cover code, resources, or user activity, which would also be very useful.
Automatic alerting

Most appealing to: data engineers & SQL users

The main concern about this feature is that it might alert them too much. Ideally, they would like to be able to set up the thresholds themselves. (3/9)
"Great, but careful, it might give lots of false positives, nice to have a framework to this but let the devs select the thresholds." – P3
Result caching

Concept: if you have recurring pipelines, jobs, or queries in your analytics solution, you can save time and money by having the system automatically reuse the results of any redundant computations that are run in each recurrence. The system caches partial computations for reuse in future iterations of recurring jobs/pipelines. (A simplified sketch follows below.)

Appealing to: data engineers, solution architects, SQL & Spark users

9/10 participants found intelligent result caching helpful because it would:
• Save time and cost.
• Speed up processing and improve performance.

P1 and P2 said this is their favorite feature. Other participants with large workloads and scenarios with duplicated processing and many recurring queries also highly appreciated intelligent result caching.
"Awesome, because we run lots of the same queries!" – P1
"This would save a lot of time and cost." – P6
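For illustration only, here is a much simpler whole-result cache keyed by the query text, not the partial-computation reuse the concept describes; the cache location and the run_query callback are hypothetical.

```python
# Minimal sketch of result caching for recurring queries, in the spirit of the concept above.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".query_cache")  # hypothetical local cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached_query(sql: str, run_query):
    """Return the cached result for `sql` if present, otherwise run it and cache the result."""
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # reuse the redundant computation
    result = run_query(sql)                        # e.g., a warehouse round trip
    cache_file.write_text(json.dumps(result))
    return result
```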

Result caching

Appealing to: data engineers, solution architects, SQL & Spark users

4/9 said they think this type of feature already exists in some other services/providers.
"Fantastic, but most backends already have this, absolutely essential feature." – P3
"It's currently available in Redshift, not sure it works the same way, but it caches the computation for recurring queries; the first time running a query always takes much longer than later." – P10
Data profiling

Concept: get insights into the data you are exploring, ingesting, or using in your analytics solution, such as data skew, size, format, and schema. (A simplified sketch follows below.)

Most appealing to: data scientists, data analysts, solution architects & Spark users

This feature got the maximum budget investment of $80, from P3.
7/10 said that having data profiling, especially at the beginning of a project or when loading new data, would be helpful.
According to participants, the feature would increase their efficiency and reduce cost. Some said that knowing more about their data skew, size, etc. would help them set up their schemas and tune their queries proactively.
P3 specified that this is his "number one" feature of all, because he trusts that it can be easily automated, and currently he wastes a lot of time exploring his data.
P4 described this feature as one of his dream solutions: "This is what I was talking about - basic exploratory stats, I want that."
To make this feature even better, participants also want to see info about data volume and what data is missing (rows, columns).
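For illustration, here is a minimal pandas sketch of the kind of profile participants describe (size, schema, missing values, skew), assuming the dataset fits in memory; the input file name is hypothetical.

```python
# Minimal sketch of basic data profiling: dtype, missing values, distinct counts, and skew per column.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic profiling stats per column."""
    stats = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,  # how often a field is empty
        "distinct": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    stats["skew"] = numeric.skew()              # NaN for non-numeric columns
    return stats

if __name__ == "__main__":
    df = pd.read_csv("new_dataset.csv")         # hypothetical incoming file
    print(f"rows={len(df)}, columns={df.shape[1]}, memory={df.memory_usage(deep=True).sum()} bytes")
    print(profile(df))
```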

Data profiling

Most appealing to: data scientists, data analysts, solution architects & Spark users

3/10 were not so excited about data profiling. They said that they would do it anyway by running several simple statements, or that they prefer to handle the data structure themselves. They did not report challenges around data structure in the beginning.
Central monitoring

Concept: a central UI/experience to monitor all elements of the analytics solution, allowing you to observe and explore the solution and its status from high to low altitudes, and to navigate between compute resources, activity history, and code artifacts based on how they are related and used in the solution.

Appealing to: data scientists, data analysts & SQL users

6/10 are excited about adding central solution monitoring and exploration to their solution. Participants find it useful because it would help them better understand insights and trends about their solution that might otherwise go unnoticed, and it would reduce the time they currently spend checking different logs to obtain the same information.
4/10 compared this type of monitoring to what tools like SnowPlow and AWS CloudWatch do. P5 said that "AWS has tons of monitoring for different products, but it's not centralized around Analytics." He found that monitoring specific to analytics would be particularly useful for his project.
"That's currently like a missing part of our kernel solutions." – P5
"Great, we could gather helpful insights from something like this, we can have shards, metrics." – P2
Central monitoring

Appealing to: data scientists, data analysts & SQL users

4/10 did not find that their work would benefit from central monitoring. According to P10, alerts are more helpful for day-to-day management, but higher-level managers might find this type of monitoring helpful.
"Not sure why we need a central solution - each individual team can monitor their solution. Why do I need to have a central place in my project? I can check the analytics of my resources, what's the point of having a central place? I'm not concerned how another team is using resources." – P6
"Not too bad but I don't want it." – P9
"Good BUT more relevant for higher management level than day to day management level. Day to day - alert when something fails or an issue happens." – P10
Predictions, recommendations, and optimizations

Concept: receive contextual guidance, predictions, and recommendations to help you build your solution, predict computational failures, or optimize data based on certain conditions/indicators.

Appealing to: data engineers, solution architects & Spark users

7/10 participants found it useful but were skeptical that it can actually work. They can trust it for components that the system can automatically keep track of, such as computational failures, disk size, or memory problems, as well as for predictions of the data size that will accumulate over time and of query performance.
"Really good, because I have to do this manually, this would be very helpful and save time." – P7
"If it's in terms of queries - it would be helpful, if it is on a table level - not sure if it would be useful - not sure how the guidance would work here." – P6
Predictions, recommendations, and optimizations

Appealing to: data engineers, solution architects & Spark users

However, 4/10 participants were "skeptical" and had mixed feelings about the solution architecture recommendations, because the usage of data is very different across companies, and they were not sure how well such recommendations would fit their case.
"Very skeptical of recommendations on building solutions, but predicting failures and … could be really helpful." – P4
"This would cost us more, I wouldn't like to increase the cost because of it." – P9
"Sounds like science fiction, but I would love to have something like this! It would speed up every project. In the beginning we usually have freeze time when trying to identify the potential architecture of the solution based on the problem at hand. This can not only save time but uncover something I didn't think about. Mostly useful for a new project." – P10
One participant said this feature is similar to a feature which currently exists in BigQuery.
Thanks!

© Microsoft Corporation. All rights reserved.
Compute

Estimating needed resources is based on:
• Trial and error
• Provider recommendations

Compute choice across participants: 6 use only provisioned compute, 2 use only serverless, and 2 use a combination of both.

Provisioned compute
Reasons to choose:
• Session time limit concern: "What if the user session is longer and the compute session expires?" – P1
• Want more control over the configuration in terms of GPU and CPU
• The tool they chose uses provisioned compute (Airflow)
Challenges:
• Region where the machine has to be hosted
• Connectivity problems
• Scaling up
• Lack of automated deployment
• You always have to pay for the time spent spinning up the server, even to run one query
• The storage and compute limits are not optimal for their solution – they hit the storage limit while staying far from the compute limit

Serverless compute
Reasons to choose:
• Easier to use, since it's a managed service. "Everything is taken care of." – P6
• Small team
• Automated scaling
Challenges:
• Access control is role-based or database-based access for bigger teams
• No control over the configuration
