
CROSSINGS: The Journal of Business Transformation

DATA QUALITY FOR ANALYTICS: clean input drives better decisions

Organizations are increasingly relying on analytics and advanced data visualization techniques to deliver incremental business value. However, when their efforts are hampered by data quality issues, the credibility of their entire analytics strategy comes into question. Because analytics is traditionally seen as a presentation of a broad landscape of data points, it is often assumed that data quality issues can be ignored since they would not impact broader trends. But should bad data be ignored to allow analytics to proceed? Or should analytics stall so that data quality issues can be addressed? In this article, Niko Papadakos, Mohit Sharma, Mohit Arora and Kunal Bahl use a shipping industry scenario to highlight the dependence on quality data and discuss how companies can address data quality in parallel with the deployment of their analytics platforms to deliver even greater business value.

AN ANALYTICS USE CASE: FUEL CONSUMPTION IN THE SHIPPING INDUSTRY


Shipping companies are increasingly analyzing the financial and operational performance of their vessels against
competitors, industry benchmarks and other vessels within their fleet. A three-month voyage, such as a round trip
from the US West Coast to the Arabian Gulf, can generate a large volume of operational data, most of which is manually
collected and reported by the onboard crew.

Fuel is one of the largest cost components for a shipping company. Optimum fuel consumption in relation to the speed
of the vessel is a tough balancing act for most companies. The data collected daily by the fleet is essential to analyze
the best-fit speed and consumption curve. Figure 1 demonstrates an example of a speed versus fuel consumption
exponential curve plotted to determine the optimum speed range at which the ships should operate. With only a
few errors made by the crew in entering the data (such as an incorrect placement of a decimal point), the analysis
presented is unusable for making decisions. The poor quality of data makes it impossible to determine the relationship
between a change in speed and the proportional change in fuel consumption as presented in Figure 1.
[Figure: "Speed-Consumption Curve" scatter chart - fuel consumption (0-10,000) versus speed (20-120) for Vessels A, C and F, ballast and laden.]
Figure 1: Speed – Fuel consumption curves (including data quality issues).
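To make the effect of a single bad entry concrete, the sketch below fits a power-law speed-consumption curve to a small set of noon-report figures, once with a misplaced decimal point in one record and once with the record corrected. The data values, the power-law form of the fit and the function names are illustrative assumptions, not figures from the voyages described above.

```python
import numpy as np

# Illustrative noon-report data (assumed values): speed in knots,
# daily fuel consumption in metric tonnes.
speed = np.array([10.5, 11.0, 11.8, 12.5, 13.2, 14.0, 14.6, 15.3])
fuel = np.array([22.0, 24.5, 28.1, 31.9, 36.4, 41.8, 46.0, 52.3])

# One crew entry with a misplaced decimal point: 36.4 t/day keyed in as 364.0.
fuel_dirty = fuel.copy()
fuel_dirty[4] = 364.0

def fit_power_curve(v, c):
    """Fit c = a * v**b by linear regression in log-log space."""
    b, log_a = np.polyfit(np.log(v), np.log(c), 1)
    return np.exp(log_a), b

a_clean, b_clean = fit_power_curve(speed, fuel)
a_dirty, b_dirty = fit_power_curve(speed, fuel_dirty)

print(f"clean fit: consumption ~ {a_clean:.3f} * speed^{b_clean:.2f}")
print(f"dirty fit: consumption ~ {a_dirty:.3f} * speed^{b_dirty:.2f}")
# A single bad record noticeably shifts the fitted exponent, which is what
# makes a curve like the one in Figure 1 unusable for decisions.
```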

If the outliers are removed, the analysis shown in Figure 2 provides a clear correlation between the speed of the vessel and its fuel consumption.

[Figure: "Speed-Consumption Curve" scatter chart - fuel consumption (10-120) versus speed (8-17) for Vessels A, C and F, ballast and laden.]
Figure 2: Speed – Fuel consumption curves (cleaned data by removing outliers).



As shown in these examples, most analytics programs are designed based on the belief that removing outliers is all
that is needed to make sense of the data, and there are many data analysis tools available that can help with that.
However, what if some of those outliers are not outliers at all, but rather the result of a scenario that needs to be considered?
For instance, in the example, what if some of the outliers were actual fuel consumption points captured when the ship
encountered inclement weather? By ignoring these data points, users can make assumptions without considering
important dimensions—and that could lead to very different decisions. This approach not only makes the analysis
dubious, but also often leads to incorrect conclusions.
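One way to avoid silently discarding meaningful points is to flag suspected outliers and review them against additional context, such as a sea-state or weather indicator, before deciding whether to exclude them. The sketch below uses illustrative noon-report values and a hypothetical heavy-weather flag; the 50 percent threshold and the cube-law baseline are likewise assumptions.

```python
import numpy as np

# Illustrative noon reports: (speed in knots, fuel in t/day, heavy_weather flag).
# The weather flag is a hypothetical field standing in for whatever context
# the crew or other systems can supply.
reports = [
    (12.0, 30.1, False), (12.4, 32.0, False), (13.0, 35.2, False),
    (13.5, 38.4, False), (13.1, 64.0, True),   # high burn in heavy weather
    (14.0, 4.27, False),                       # likely decimal-point error
    (14.2, 44.8, False), (15.0, 50.9, False),
]

speed = np.array([r[0] for r in reports])
fuel = np.array([r[1] for r in reports])
weather = np.array([r[2] for r in reports])

# Robust baseline: median ratio of fuel to speed**3 (cube-law assumption).
ratio = fuel / speed**3
baseline = np.median(ratio)
deviation = np.abs(ratio - baseline) / baseline

for r, dev, bad_wx in zip(reports, deviation, weather):
    if dev > 0.5:  # more than 50% off the robust baseline
        reason = ("heavy weather - keep, analyse separately" if bad_wx
                  else "no known cause - route to data steward for correction")
        print(f"speed={r[0]:5.1f} fuel={r[1]:6.2f} flagged: {reason}")
```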

In some cases, the practice of removing outliers can lead to the deletion of a significant number of data points from the
analysis. But can users get the answer they are looking for by ignoring 40 percent of the data set? Companies need to
be able to determine the speed at which vessels are most efficient with a lot more certainty. Data quality issues only
reduce the confidence in the analysis conducted. In the shipping example, a difference in speed of 1 to 2 knots can
potentially result in a difference of $500,000 to $700,000 in fuel consumption for a round trip US West Coast to Arabian
Gulf voyage at the current bunker price.
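The order of magnitude of that sensitivity can be checked with a back-of-the-envelope model: daily consumption rises roughly with the cube of speed while days at sea fall only linearly with it, so the voyage fuel bill grows quickly with speed. The distance, baseline consumption and bunker price below are illustrative assumptions rather than the inputs behind the article's estimate.

```python
# Back-of-the-envelope voyage fuel cost model. All inputs are illustrative
# assumptions, not the figures behind the estimate quoted in the text.
DISTANCE_NM = 23_000            # assumed round-trip distance, US West Coast - Arabian Gulf
BASE_SPEED_KN = 14.0            # reference speed
BASE_BURN_TPD = 70.0            # assumed tonnes/day at the reference speed
BUNKER_USD_PER_TONNE = 600.0    # assumed bunker price

def voyage_fuel_cost_usd(speed_kn: float) -> float:
    """Fuel cost for the round trip, using a cube-law consumption model."""
    days_at_sea = DISTANCE_NM / (speed_kn * 24)
    burn_tpd = BASE_BURN_TPD * (speed_kn / BASE_SPEED_KN) ** 3
    return days_at_sea * burn_tpd * BUNKER_USD_PER_TONNE

for v in (12.0, 13.0, 14.0, 15.0):
    print(f"{v:4.1f} kn: ~${voyage_fuel_cost_usd(v):,.0f}")
# Under these assumptions, a one- to two-knot change shifts the round-trip fuel
# bill by several hundred thousand dollars, which is why a handful of bad data
# points can hide the optimum operating range.
```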

Does this mean that data needs to be validated 100 percent before it can be used for analytics? Does the entire
universe of data need to be clean before it is useful for analytics? Absolutely not. In fact, companies should only clean
the data they intend to use. The right approach can help to determine which issues should be addressed to manage
data quality.

DATA USED FOR ANALYTICS: WHERE SHOULD I USE MY CLEANSING TOOLS?


Analytics use cases have specific needs in terms of which pieces of data are critical to the analysis. For each piece
of data, the rules or standards required to make it suitable for the analysis must also be defined. But not all data
standards have equal priority.

For instance, in the shipping example above, it might be more important to ensure that the data used for analysis
is accurate as compared to ensuring that all the data is available. In other words, using 80 percent of 100 percent
accurate data to generate the trend is better than using 100 percent of data that is only 80 percent accurate. An
organization should focus most of its energy on data used by high-impact business processes.

To manage the quality of data, organizations need a robust data quality management framework. This will enable them
to control, monitor and improve data as it relates to various analytics use cases.

APPROACH TO DATA QUALITY MANAGEMENT


Data is created during the course of a single business process and it moves across an organization as it goes through
the different stages of one or more business processes. As data flows from one place to the next, it transforms and
presents itself in other forms. Unless it is managed and governed properly, it can lose its integrity.

Although each type of data needs a distinct plan and approach for management, there is a generic framework that can
be leveraged to effectively manage all types of data. As shown in Figure 3, the data quality management framework
consists of three components: control, monitor and improve.
[Figure: cycle of Control (validate before loading), Monitor (assess periodically) and Improve (fix when data quality drops).]
Figure 3: Data quality management framework.

Control

The best way to manage the quality of data in an information system is to ensure that only data that meets the desired standards is allowed to enter the system. This can be achieved by putting strong controls in place at the front end of each data entry system, or by putting validation rules in the integration layer responsible for moving data from one system to another. Unfortunately, this is not always feasible or economically viable when, for example, data is captured manually and then later entered into a system, or when modifications to applications are too expensive, particularly with commercial off-the-shelf (COTS) software.

In one particular case, a company decided against implementing changes to one of its main data capture COTS applications that would have enforced stricter data controls. They relied instead on training, monitoring and reporting on the use of the system to help them improve their business process, and as a result, experienced improved data quality. However, companies that have implemented strong quality controls at the entry gates for every system have realized very effective data quality management.

Monitor

It is natural to think that if a company has strong controls at each system's entry gate, then the data managed within the systems will always be high in quality. In reality, as processes mature, the people responsible for managing the data change, systems grow old and the quality controls are not always maintained to keep up with the desired data quality levels. This creates the need for periodic data quality monitoring: running validation rules against stored data to ensure the quality meets the desired standards.

In addition, as information is copied from one system to another, the company needs to monitor the data to ensure it is consistent across systems or against a "system of record." Data quality monitors enable organizations to proactively uncover issues before they impact the business decision-making process. As shown in Figure 4, an industry-standard five-dimension model can be leveraged to set up effective data quality monitors.
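As a concrete illustration of the control and monitor components, the sketch below defines a small set of validation rules, applies them at the entry gate to reject a bad record before it is loaded, and then reruns the same rules against stored data as a periodic monitor. The field names, thresholds and sample values are assumptions for illustration only.

```python
from typing import Callable

# Hypothetical noon-report record; field names and thresholds are illustrative.
Record = dict

RULES: list[tuple[str, Callable[[Record], bool]]] = [
    ("vessel_id_present", lambda r: bool(r.get("vessel_id"))),
    ("speed_in_plausible_range", lambda r: 0 < r.get("speed_knots", -1) <= 25),
    ("fuel_in_plausible_range", lambda r: 0 < r.get("fuel_tonnes_per_day", -1) <= 150),
]

def violations(record: Record) -> list[str]:
    """Names of all rules the record fails."""
    return [name for name, rule in RULES if not rule(record)]

# Control: reject bad records at the entry gate (or in the integration layer).
incoming = {"vessel_id": "A", "speed_knots": 13.4, "fuel_tonnes_per_day": 364.0}  # decimal slip
if violations(incoming):
    print("rejected at entry:", ", ".join(violations(incoming)))

# Monitor: periodically rerun the same rules against stored data.
stored = [
    {"vessel_id": "A", "speed_knots": 13.2, "fuel_tonnes_per_day": 36.4},
    {"vessel_id": None, "speed_knots": 12.8, "fuel_tonnes_per_day": 33.9},
    {"vessel_id": "C", "speed_knots": 14.1, "fuel_tonnes_per_day": 41.2},
]
clean = sum(not violations(r) for r in stored)
print(f"stored data pass rate: {100 * clean / len(stored):.0f}%")
```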



Correctness: Measure the degree of data accuracy.

Completeness: Measure the degree to which all required data is present.

Currency: Measure the degree to which data is refreshed or made available at the time it is needed.

Conformity: Measure the degree to which data adheres to standards and how well it is represented in an expected format.

Consistency: Measure the degree to which data is in sync or uniform across the various systems in the enterprise.

Figure 4: The five Cs of data quality.
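The five Cs translate directly into measurable checks that can be run against stored data. Below is a minimal sketch over a tiny in-memory data set; the field names, rules and the "system of record" set are illustrative assumptions, and a real implementation would run equivalent checks in the database or data quality tool on a schedule.

```python
# Each of the five Cs in Figure 4 turned into a simple, measurable check.
records = [
    {"vessel_id": "A", "report_date": "2015-03-01", "speed_knots": 13.2, "fuel_tonnes_per_day": 36.4},
    {"vessel_id": "A", "report_date": "2015/03/02", "speed_knots": 13.4, "fuel_tonnes_per_day": 364.0},
    {"vessel_id": None, "report_date": "2015-03-03", "speed_knots": 12.8, "fuel_tonnes_per_day": 33.9},
]
master_vessels = {"A", "C", "F"}  # hypothetical system of record for vessel IDs

def pct(hits: int, total: int) -> float:
    return 100.0 * hits / total

n = len(records)
completeness = pct(sum(all(v is not None for v in r.values()) for r in records), n)
correctness = pct(sum(0 < r["fuel_tonnes_per_day"] <= 150 for r in records), n)
conformity = pct(sum(str(r["report_date"]).count("-") == 2 for r in records), n)   # ISO-style dates
consistency = pct(sum(r["vessel_id"] in master_vessels for r in records), n)
latest_iso = max(r["report_date"] for r in records if "-" in r["report_date"])
currency = "current" if latest_iso >= "2015-03-01" else "stale"   # simplistic freshness check

print(f"Completeness {completeness:.0f}%  Correctness {correctness:.0f}%  "
      f"Conformity {conformity:.0f}%  Consistency {consistency:.0f}%  Currency: {currency}")
```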

An example of a monitoring dashboard is shown in Figure 5. It is built to provide early detection of data quality issues.
This enables organizations to perform root-cause analysis and to prioritize their investments in training, business
process alignment or redesign.

Figure 5: Sample data quality monitoring dashboard.
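Behind a dashboard like the one in Figure 5 usually sits nothing more exotic than a scheduled job that records each rule's pass rate over time and raises an alert when it drops. A minimal sketch with assumed history values and an assumed threshold:

```python
# Assumed weekly pass rates (%) per data quality rule, oldest to newest.
history = {
    "fuel_in_plausible_range": [97, 96, 98, 88],
    "report_date_present": [100, 100, 99, 100],
}
THRESHOLD = 95.0  # alert when a rule's latest pass rate drops below this

for rule, rates in history.items():
    latest, previous = rates[-1], rates[-2]
    if latest < THRESHOLD:
        # Early detection: surface the drop for root-cause analysis before
        # the affected data feeds a business decision.
        print(f"ALERT {rule}: pass rate {latest}% (was {previous}%)")
    else:
        print(f"ok    {rule}: pass rate {latest}%")
```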


Improve

When data quality monitors report a dip in quality, a number of remediation steps can be taken. As mentioned above, system enhancements, training and adjusting processes involve both technology and people. When a dip in quality occurs, it may be the right time to start a data quality improvement plan. Typically, an improvement plan includes data cleansing, which can be done either manually by business users or via automation. If the business can define rules to fix the data, then data cleansing programs can easily be developed to automate the data improvement process. This, followed by business validation, ensures that the data is back to its desired quality level. Often, organizations make the mistake of ending data quality improvement programs after a round of successful validation.

A critical step that is often missed is enhancing data quality controls to ensure the same issues don't happen again. This requires a thorough root-cause analysis of the issues and data quality controls that need to be added to the source systems to prevent the same issues from recurring. Implementing these steps is even more critical when a project includes reference or master data, such as client, product or market data. Also, organizations that are implementing an integration solution will benefit from taking on this additional effort, as it enables quality data to flow across the enterprise in a solution that can be scaled over time.

The most effective data quality management programs are centrally run by an enterprise-level function and are only successful if they are done in partnership with the business. Ultimately, it is the business that owns the data, while the IT teams are the enablers. But how can the business contribute to these seemingly technical programs?

DATA QUALITY IS AS MUCH ABOUT THE PEOPLE AS IT IS ABOUT TECHNOLOGY

In addition to the technical challenges faced by most data projects, there are often organizational hurdles that also must be overcome. This becomes particularly pronounced in organizations where data is vast, diverse and often owned by different departments with conflicting priorities. Therefore, a combination of data governance, stakeholder management and careful planning is needed, along with the right approach and solution. Key challenges that must be addressed for data quality initiatives include the following:

1. Stewardship—Like any corporate asset, data needs stewardship. A data steward is needed to provide direction and influence resources to control, monitor and improve data. The data steward should be someone with a strategic understanding of business goals and an interest in building organizational capabilities around data-driven decision making. Having a holistic understanding will help the data steward direct appropriate levels of rigor and priority to improve data quality.

2. Business Case—Organizations are unlikely to invest in data quality initiatives just for the sake of improving data quality. A definition of clean data and a justification for why it is important for analytics as well as operations needs to be documented. Some of the common themes in the business case include accurate and credible data for reporting, reduced rework at various levels and good quality decisions. The business case should present the data issues as opportunities that can unlock significant gains in the form of analytics and/or become the foundation of future growth.

3. Ownership—Often, personnel other than data stewards and data entry personnel (data custodians) use the data for decision making. In that context, it is imperative for custodians to understand the importance of good quality data. The drive and ownership for entering and maintaining good quality data needs to grow organically. As an example, the crew onboard a vessel is more likely to take ownership of entering good quality and timely data about port time or fuel consumption if they know that the decisions involving asset utilization and efficiency are driven from data reported by the crew.



4. Sustainable Governance—Making data quality issues visible and measuring the quality of data provides useful information, but by itself does not move the needle in terms of improving data quality. A sustainable governance structure
with close cohesion between data stewards, data custodians and a supporting model is required. It is nice to know that
the data supporting a certain business process is at 60 percent or 90 percent quality, but that in and of itself will not
automatically drive the right behaviors. A balanced approach of educating and training data custodians and enforcing
data quality standards is recommended. With a changing business landscape and personnel, reinforcing the correct
data entry process from time to time may improve quality. On the other hand, to ensure that overall data quality does not
drop over time, effective monitoring and controls are also equally important. Doing one without the other may work in
the short term, but may not be sustainable over time. For real change and improvement to happen, organizations need
to implement a robust and sustainable data governance model.

5. Communication—Any data quality initiative is likely to meet resistance from some groups of stakeholders and poor
communication can make matters worse. Therefore, a well-thought-out communication plan must be put in place
to inform and educate people about the initiative and quantify how it may impact them. Also, it is important to clarify
that the objective is not just to fix the existing bad data, but to also put tools and processes in place to improve and
maintain the quality at the source itself. This communication can be in the form of mailers, roadshows or lunch-
and-learn sessions. Further, the sponsors and stakeholders must be kept engaged throughout the lifecycle of the
program to maintain their support.

6. Remediation—Every attempt should be made to make the lives of data stewards easier. They should not view data
quality monitoring and remediation routines as excessive or a hindrance to their day-to-day job. If data collection can
be integrated and the concept of a single version of truth replicated across the value chain, it will ultimately improve
the quality of data. For example, if the operational data captured by a trading organization (such as cargo type,
shipment size or counterparty information) is integrated with pipeline or marine systems, it will ultimately enable
pipeline and shipping companies to focus on collecting and maintaining data that is intrinsic to their operation.
CONCLUSION

As organizations increasingly rely on their vast collections of data for analytics in search of a competitive advantage, they need to take a practical and fit-for-purpose approach to data quality management. This critical dependency for analytics is attainable by following these principles:

› Tackle analytics with an eye on data quality

› Use analytics use cases to prioritize data quality hot spots

› Decide on a strategy for outliers and use the 80/20 rule when pruning the data set

› Ensure decisions are trustworthy and make data quality stick by addressing root causes and implementing a monitoring effort

› More than any other program, make this one business-led for optimum results

THE AUTHORS

Niko Papadakos is a Director at Sapient Global Markets in Houston, focusing on data. He has more than 20 years of experience across financial services, energy and transportation. Niko joined Sapient Global Markets in 2004 and has led project engagements in key accounts involving data modeling, reference and market data strategy and implementation, information architecture, data governance and data quality.
npapadakos@sapient.com

Mohit Sharma is a Senior Manager and Enterprise Architect with eight years of experience in the design and implementation of solutions for oil and gas trading and supply management. During this time, Mohit was engaged in multiple large and complex enterprise transformation programs for oil and gas majors. Most recently, he developed a total cost of ownership (TCO) model for a major North American gas trading implementation.
mosharma@sapient.com

Mohit Arora is a Senior Manager at Sapient Global Markets and is based in Houston. He has over 11 years of experience leading large data management programs for energy trading and risk management clients as well as for major investment banks and asset management firms. Mohit is an expert in data management and has a strong track record of delivering many data programs that include reference data management, trade data centralization, data migration, analytics, data quality and data governance.
marora@sapient.com

Kunal Bahl is a Senior Manager in Sapient Global Markets' Midstream Practice based in San Francisco. He is focused on Marine Transportation and his recent assignments include leading a data integration and analytics program for an integrated oil company, process automation for another integrated oil company and power trading system integration for a regional transmission authority.
kbahl@sapient.com



ABOUT SAPIENT GLOBAL MARKETS
Sapient Global Markets, a part of Publicis.Sapient, is a leading provider of services to today’s evolving financial
and commodity markets. We provide a full range of capabilities to help our clients grow and enhance their
businesses, create robust and transparent infrastructure, manage operating costs, and foster innovation
throughout their organizations. We offer services across Advisory, Analytics, Technology, and Process, as well
as unique methodologies in program management, technology development, and process outsourcing. Sapient
Global Markets operates in key financial and commodity centers worldwide, including Boston, Calgary, Chicago,
Düsseldorf, Frankfurt, Houston, London, Los Angeles, Milan, New York, Singapore, Washington D.C. and Zürich, as
well as in large technology development and operations outsourcing centers in Bangalore, Delhi, and Noida, India.

For more information, visit sapientglobalmarkets.com.

© 2015 Sapient Corporation.


Trademark Information: Sapient and the Sapient logo are trademarks or registered trademarks of Sapient Corporation or its subsidiaries in the
U.S. and other countries. All other trade names are trademarks or registered trademarks of their respective holders.

Sapient is not regulated by any legal, compliance or financial regulatory authority or body. You remain solely responsible for obtaining independent
legal, compliance and financial advice in respect of the Services.
