
Big Data Analytics Seng_6251

Software Analytics
Data Analytics for Software Engineering

• Gebeyehu B. (Dr. of Eng.)

gebeyehu2009@gmail.com

BDU: Bahir Dar Institute of Technology: Computing Faculty


New era … Software itself is changing
• Software big data are data sets too large to be handled efficiently by common database management systems. They come in structured, unstructured, and constant-flow (streaming) formats.

[Figure: software and services]

• Software big data represent software attributes such as code, design, flow, and usability.
New era … Software itself is changing
• Sources of software big data include:
– The software lifecycle itself
– Intelligent devices
– Usability data
– Practitioners, developers, and users
– Satellite remote sensing and aerial surveying
– Radar and sensor networks
– Digital cameras
– RFID location readings
– The Internet of Things, and others


How people use Software is Changing …

• From: individual, isolated use, with not much data/content generation
• To: social, collaborative use, with a huge amount of data/artifacts generated anywhere, anytime

How Software is built & operated is changing …

• Code centric → data pervasive
• In-lab testing → debugging in the large
• Experience & gut feeling → informed decision making
• Centralized development → distributed development
• Long product cycle → continuous release
• …

Software big data analytics

Software Analytics
• Software analytics enables software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

• A wealth of data exists in the software lifecycle, including source code, feature specifications, bug reports, test cases, execution traces/logs, and real-world user feedback.

• Data plays an essential role in modern software development, because hidden in the data is information about the quality of software and services as well as the dynamics of software development.

• Various analytical and computing technologies, such as pattern recognition, machine learning, data mining, information visualization, and large-scale data computing and processing, are applied to perform effective and efficient data exploration and analysis when engineering software and services.

• Software Systems
– Depending on scale and complexity, the spectrum of software systems can span from operating systems for devices to large networked systems that consist of thousands of servers.

– System quality, such as reliability, performance, and security, is the key to the success of modern software systems.

– As system scale and complexity greatly increase, larger amounts of data, e.g., run-time traces and logs, are generated, and data has become a critical medium for monitoring, analyzing, understanding, and improving system quality.

• Software Users
– Users are always right because ultimately they pay for the software and services in various ways.

– Therefore, it is important to continuously create the best user experience.

– Usage data collected from the real world reveals how users interact with software and services.

– The data is incredibly valuable for software practitioners to better understand their customers and gain insights on how to improve user experience accordingly.

• Development Process
– Software development has evolved from its traditional form to exhibit different characteristics.

– The process is more agile and engineers are more collaborative.

– Analytics on software development data provides a powerful mechanism that we can leverage to achieve higher development productivity.

Research topics – the trinity view

The trinity: Software System, Software Users, Software Development Process

• Covering different areas of the software domain
• Throughout the entire development cycle
• Enabling practitioners to obtain insights

The goal: to solve software related problems
• Solutions come from software data analytics over:
– Runtime traces, program logs, system events, performance counters
– Usage logs, user surveys
– Source code, bug history, check-in history, test cases

• Software data records must be analyzed with real-time updating: software data changes from day to day, which demands updates on a daily basis.
• For each research topic, we have to locate the data features that make the data suitable for computational analysis, as proposed.
• We organize the data so that an analyst wishing to study trends related to our research goals and interests can narrow the data down and do the necessary analytics to gain value.
• Keep in mind that the data come in varied formats (numbers, addresses (x-y), text, databases, video, audio).
Output – insightful information
• Conveys meaningful and useful understanding or knowledge
towards completing the target task

• Not easily attainable via directly investigating raw data without aid of
analytics technologies

• Examples
– It is easy to count the number of re-opened bugs, but how to find out the
primary reasons for these re-opened bugs?
– When the availability of an online service drops below a threshold, how to
localize the problem?
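
The second example can be illustrated with a minimal sketch. It assumes request logs are available as (component, succeeded) pairs, which is a hypothetical format; the function names and the 0.99 threshold are also assumptions. It only lists the components whose availability falls below the threshold, worst first. Real problem localization needs far richer analysis than this first narrowing step.

```python
from collections import defaultdict

def availability_by_component(requests):
    """Return {component: fraction of successful requests}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for component, succeeded in requests:
        totals[component] += 1
        if succeeded:
            successes[component] += 1
    return {c: successes[c] / totals[c] for c in totals}

def components_below(requests, threshold=0.99):
    """List (component, availability) pairs under the threshold, worst first."""
    availability = availability_by_component(requests)
    low = [(c, a) for c, a in availability.items() if a < threshold]
    return sorted(low, key=lambda item: item[1])

# Hypothetical sample: 'auth' fails half the time, 'search' never fails.
sample = [("auth", True), ("auth", False), ("search", True), ("search", True)]
print(components_below(sample))
```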
Output – actionable information
• Enables software practitioners to come up with concrete solutions
towards completing the target task

• Examples
– Why were bugs re-opened?
  • A list of bug groups, each with the same reason for re-opening
– Why did the availability of online services drop?
  • A list of problematic areas with associated confidence values
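
The first example, a list of bug groups sharing a re-opening reason, can be sketched as below. The record fields (`id`, `reopened`, `reason`) and the sample data are hypothetical. The sketch also assumes each bug already carries a reason label; in practice, inferring that reason from raw bug reports is the hard analytics problem.

```python
from collections import defaultdict

def group_reopened_bugs(bugs):
    """Group re-opened bug IDs by reason, largest group first."""
    groups = defaultdict(list)
    for bug in bugs:
        if bug["reopened"]:
            groups[bug["reason"]].append(bug["id"])
    # Sort by group size (descending), then by reason for a stable order.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

# Hypothetical bug records.
bugs = [
    {"id": 1, "reopened": True, "reason": "incomplete fix"},
    {"id": 2, "reopened": False, "reason": ""},
    {"id": 3, "reopened": True, "reason": "cannot reproduce"},
    {"id": 4, "reopened": True, "reason": "incomplete fix"},
]
for reason, ids in group_reopened_bugs(bugs):
    print(reason, ids)
```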

Connection to practice
• Software Analytics is naturally tied with software development
practice

• Getting real: real data, real problems, real users, real tools

Performance issues in the real world
• One of top user complaints

• Impacting large number of users every day

• High impact on usability and productivity

[Figure: high disk I/O; high CPU consumption]

• As modern software systems grow more and more complex, and time and resources before release are limited, development-site testing and debugging become increasingly insufficient to ensure satisfactory software performance.
Performance debugging in the large

[Diagram: performance debugging workflow]
Components: trace collection, network, trace storage, trace analysis, bug filing, bug database, problematic pattern repository, pattern matching, bug update

Annotations:
• Key to issue discovery
• Bottleneck of scalability
• How many issues are still unknown?
• Which trace file should I investigate first?

Problem definition
• Given OS traces collected from tens of thousands (potentially millions) of users,
• help domain experts identify impactful program execution patterns (those that cause the most impactful underlying performance problems)
• with limited time and resources.
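
One simple way to read "impactful" is sketched below, as an assumption rather than the method used in the original work: rank each execution pattern by the total wait time it accounts for, then by the number of traces it appears in. The trace format, pattern names, and `rank_patterns` are hypothetical.

```python
from collections import defaultdict

def rank_patterns(traces):
    """Rank patterns by total wait time, then by number of traces affected."""
    total_wait = defaultdict(float)
    affected = defaultdict(set)
    for trace_id, events in traces.items():
        for pattern, wait_ms in events:
            total_wait[pattern] += wait_ms
            affected[pattern].add(trace_id)
    ranked = [(p, total_wait[p], len(affected[p])) for p in total_wait]
    # Highest total wait first; ties broken by traces affected, then name.
    return sorted(ranked, key=lambda r: (-r[1], -r[2], r[0]))

# Hypothetical traces: trace ID -> list of (pattern, wait time in ms).
traces = {
    "t1": [("disk_io>av_scan", 900.0), ("ui>net_wait", 100.0)],
    "t2": [("disk_io>av_scan", 700.0)],
    "t3": [("ui>net_wait", 150.0)],
}
for pattern, wait_ms, trace_count in rank_patterns(traces):
    print(pattern, wait_ms, trace_count)
```

A ranking like this gives domain experts a starting order for investigation when time and resources are limited.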

Challenges
Large-scale trace data
• TBs of trace files and increasing
• Millions of events in a single trace stream

Highly complex analysis
• Numerous program runtime combinations triggering performance problems
• Multi-layer runtime components, from application to kernel, being intertwined

Combination of expertise
• Generic machine learning tools without domain knowledge guidance do not work well

Technical highlights
• Machine learning for system domain
– Formulate the discovery of problematic execution patterns as callstack mining & clustering
– Systematic mechanism to incorporate domain knowledge

• Interactive performance analysis system
– Parallel mining infrastructure based on HPC + MPI
– Visualization aided interactive exploration
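
To give a feel for callstack clustering, here is a minimal sketch. It is not the algorithm of the tool described here: it swaps in a simple greedy clustering based on Jaccard similarity of stack frames, and the frame names, `cluster_callstacks`, and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of frames."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cluster_callstacks(stacks, min_similarity=0.5):
    """Greedy single-pass clustering.

    Each stack joins the first cluster whose representative (its first
    stack) is similar enough; otherwise it starts a new cluster.
    """
    clusters = []
    for stack in stacks:
        for cluster in clusters:
            if jaccard(stack, cluster[0]) >= min_similarity:
                cluster.append(stack)
                break
        else:
            clusters.append([stack])
    return clusters

# Hypothetical callstacks, outermost frame first.
stacks = [
    ("app!Open", "fs!Read", "disk!Wait"),
    ("app!Save", "fs!Read", "disk!Wait"),
    ("app!Render", "gpu!Submit"),
]
clusters = cluster_callstacks(stacks)
for cluster in clusters:
    print(len(cluster), cluster[0])
```

Domain knowledge could enter a scheme like this through the similarity function, for example by weighting frames from known components more heavily.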

Impact
“We believe that the MSRA tool is highly valuable and much more efficient for
mass trace (100+ traces) analysis. For 1000 traces, we believe the tool saves us
4-6 weeks of time to create new signatures, which is quite a significant
productivity boost.”

Highly effective new issue discovery on Windows mini-hang

Continuous impact on future Windows versions

Incident Management: Workflow
Detect a service issue → Alert On-Call Engineers (OCEs) → Investigate the problem → Restore the service → Fix root cause via postmortem analysis

Incident Management: Characteristics
Shrink-Wrapped Software Debugging    Online Service Incident Management
Root Cause and Fix                   Workaround
Debugger                             No Debugger
Controlled Environment               Live Data

Incident Management: Challenges
Large volume and noisy data

Highly complex problem space

Knowledge not well organized

No knowledge of entire system

Conclusion

Vertical (research topics): Software System, Software Users, Software Development Process
Horizontal (technologies): Information Visualization, Data Analysis Algorithms, Large-scale Computing


• Software big data refers to software data sets too large for ordinary data management systems to handle.

• Software big data is data about software and its services, to which common analytics techniques, mapping, and software analytics can be applied.

• Software big data methods allow multidimensional screening and “data mining” to locate the parts of the mass that show interesting relationships, trends, or comparisons.

• Those interesting parts of a software big data set can be sorted into small data sets, to which the more powerful traditional analysis methods can be applied.
Question:

How will software big data affect organizational processes?

End
