
THE ROLE OF DATA QUALITY IN AI AND ML

A full (and free) guide

BROUGHT TO YOU BY NODEGRAPH

AI IS HERE, ARE YOU IN CONTROL?

Artificial Intelligence (AI) and Machine Learning (ML) are becoming a prominent part of the business landscape as we continue to advance through the digital age. But is it as simple as plugging your data into some code? Are businesses missing out on the true potential of these powerful approaches to data science due to poor data quality? In this guide, we aim to dissect the importance of high-quality data and the harm that can come from poor-quality data, as they are used in AI and ML.

THE CURRENT DATA LANDSCAPE

If a company wants to continually evolve and stay ahead of the competition, it must have a data- and technology-centric approach. Today, we are gathering data at an exponential pace, and data is becoming a key strategic asset for businesses. But as we accelerate into a data-driven future, it’s becoming more and more challenging for businesses to get a grip on their data. Understanding how to utilize data and unlock its true value is becoming increasingly complex. Data is not only growing in size but also in complexity. The number of data sources and the mix of data types add to this growing complexity, leading to confusion for many companies.

HOW MUCH DATA IS BEING GENERATED?

But just how much data is being generated and gathered in 2020? The numbers are truly astounding. At the end of 2017, an estimated 3.8 billion people around the world were using the internet. By the end of 2019, this figure stood at a whopping 4.6 billion. That’s an additional 800 million people online generating data every day. Every minute, 200 million emails are sent, 4.2 million Google searches are conducted, and 480 thousand Tweets are posted. In terms of the ‘Global Datasphere’ (the total amount of data generated), it’s estimated that by the end of 2019 it stood at 4.4 ZB, up from 2.7 ZB in 2017. To put this into perspective, one Zettabyte is equivalent to one billion Terabytes or one trillion Gigabytes[1].
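
As a quick sanity check on those unit conversions, here is a minimal Python sketch. It assumes the decimal (SI) definitions of the units and simply re-derives the equivalences and the growth implied by the figures cited above:

    # Decimal (SI) definitions of the storage units are assumed here.
    ZB = 10**21  # bytes in one zettabyte
    TB = 10**12  # bytes in one terabyte
    GB = 10**9   # bytes in one gigabyte

    print(ZB // TB)  # 1_000_000_000     -> one billion terabytes per zettabyte
    print(ZB // GB)  # 1_000_000_000_000 -> one trillion gigabytes per zettabyte

    # Growth of the Global Datasphere implied by the cited estimates (2.7 ZB -> 4.4 ZB).
    print(f"{(4.4 - 2.7) / 2.7:.0%}")  # roughly 63% growth between 2017 and 2019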

DATA STORAGE & COLLECTION

Our methods of data storage and collection have also advanced in recent years. Huge amounts of business data are generated every day and most of it is now being stored. In the past, storage capacity was low and storing vast amounts of data was expensive. There was little desire to store data unless you had a direct use for it in the immediate future. Today things are very different. With data regulations like GDPR, companies are much more conscious of how and where they store their data. Additionally, companies feel pressured to keep up with new technologies. In an effort to stay competitive, many companies feel compelled to try new technologies as soon as they become available. In the fast-paced digital world that we live in, a lack of innovation can see you left behind as your competition steamrolls ahead.

AI and ML are examples of this type of technological pressure companies are experiencing. The goal of these technologies is to use large data sets to uncover patterns and put new (and more efficient) ways of working into action. With good use of AI and ML, companies can become more cost-efficient, gain a better understanding of their data and make more informed and accurate decisions.

AI IS ONLY AS POWERFUL AS YOUR DATA

As with all systems that work with data, the result will never be better than the quality of the data you start with. And data quality is not only about the actual quality of the input but also about using the correct data from the right data sources. To ensure that we are in control of our AI and Machine Learning output, we need to combine the correct algorithms with correct data of the expected quality. This may seem like an obvious statement, but for the vast majority of businesses, ensuring data quality is a huge challenge.

More often than not, the output is based on a complex set of data sources: data warehouses, data sets, transformations, multiple BI systems and applications, expressions, graphical views, and so on. As we continue to use more systems and applications, this complexity continues to grow.

OUTSOURCING & DATA QUALITY

Additionally, many businesses today rely on outsourcing. These businesses are outsourcing their maintenance, development, and even control to external companies. This means that the company is not in full control of its data. Instead, it is trusting its external partners to control its data for it, and often contracts are signed before trust is earned. You might be thinking, “well, this is how business works”. That might be true, but for data management, it’s worth thinking carefully about who you trust with your data as well as how much data they have access to. Data has the potential to transform the future of a company.

IMPROVE DATA COLLABORATION WITH NODEGRAPH

One effective way of regaining control of your data is to employ data intelligence software, such as NodeGraph. NodeGraph’s intuitive platform makes data sharing across an organization much simpler, helping to ensure that the right people understand the right data to deliver accurate results. Our platform offers:

Fully-automated metadata management
End-to-end data lineage
Regulatory compliance
Data cataloging
Unit and regression testing
Impact analysis
Data migration
& much more

TAKE CONTROL OF YOUR DATA

Good quality data is essential for meaningful and actionable outputs from AI and ML algorithms, but
its use doesn’t stop there. In subsequent sections, we’re going to focus on exactly how poor quality
data leads to bad outcomes for AI and ML, but here we’re looking at the wider picture.

CYBERSECURITY

Having control over your data is essential for robust security. How will you know when your data has
been breached if you are not 100% clear on where your data is and how it is being utilized? No com-
pany is immune to cyberattacks, especially today. Data breaches and cyber-attacks are happening at
an alarming rate all over the world. It seems that as soon as we advance our detection and preven-
tion techniques, the bad actors also advance their methods of intrusion. It’s become a constant cat
and mouse battle for data. Why? Because data is extremely powerful in the modern age.

If you are a small company and all of your systems become locked (you have no control over your data), this can have devastating consequences. In fact, one study found that approximately 60% of small businesses close within 6 months of being hacked[2]. With small companies, it’s often a numbers game for the hackers. The more companies they target with phishing emails or malware attacks, the higher the chance that they will be successful in their malicious goals.

In large companies, security tends to be much tougher, making it more difficult for cybercriminals
to successfully hack the company. However, the rewards of being successful are much greater. Large
companies have much larger sets of data and potentially sensitive intellectual property.

A DATA-DRIVEN APPROACH

To build a successful company that lasts and outperforms its competition, you must have a data-driven and data-centric approach to all areas of your business. Whether it’s data security, transformation, business intelligence or anything else, data management is key. To do this successfully, you need to ensure that you are in control of your data. This means you are in control of how your data is used and of the quality of your data. Sounds like a challenge? One important part of getting on top of it is to ensure that you have a well-thought-out process for Data Governance and the right tools to support it.

With a modern, fully-automated data intelligence platform such as NodeGraph, you get full insight into your entire data landscape. You have full control and understanding of your data, all the way down to the individual fields. All dependencies, transformations, expressions, data sources used, BI tools and so on are right in front of you. You become empowered to make the right business decisions through the technology you choose to support your data journey.

HOW BAD DATA HARMS ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

The data governance framework acts as a guide that underpins a data governance program. The framework consists of the rules, policies, technologies, processes and organizational structures that form part of the program. It also details the mission statement and goals, as well as how the success of the program will be measured. Since the framework is intended to give clarity and direction to the business, it is usually shared internally so that all stakeholders can understand the scope, goals and plans for the program.

WHAT IS “BAD DATA”?

“Bad data” can mean several things. Sometimes it means that the data is labeled incorrectly, is full of errors, has missing values or is otherwise poor in quality. When this data is used, the results will not be a true reflection of reality. This means predictive models will fail. The model may predict that an event will occur less frequently than it would have with the full data set, and costly breakdowns of machinery might ensue.
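
As a rough illustration of what a first-pass screen for this kind of bad data can look like, here is a minimal Python (pandas) sketch. The file name, column name and value ranges are purely hypothetical and would need to be adapted to your own data:

    import pandas as pd

    # Hypothetical sensor readings; file and column names are illustrative only.
    df = pd.read_csv("machine_readings.csv")

    # Completeness: share of missing values per column, worst offenders first.
    print(df.isna().mean().sort_values(ascending=False))

    # Plausibility: values outside a physically sensible range often point to
    # labeling or data-entry errors rather than real measurements.
    suspect = df[(df["temperature_c"] < -50) | (df["temperature_c"] > 150)]
    print("suspect readings:", len(suspect))

A screen like this does not prove the data is good, but it quickly surfaces the missing values and obvious errors that would otherwise flow straight into a predictive model.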

Sometimes bad data might mean the data is biased or too complex for the algorithm to handle.
Humans have unconscious biases that we apply to the world around us. These biases have a way of
sneaking into how we design our data analytics systems. For example, in recent years it was widely
reported that facial recognition systems are much less accurate at identifying black or Asian faces
than white faces. The machine learning algorithm isn’t inherently biased towards white faces, but the
data it was fed was mostly white faces. This meant that it became much more accurate at identifying
these faces.
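
A simple way to catch this kind of skew before training is to inspect the class balance of the training data. The Python sketch below is hypothetical and deliberately tiny, but the same check scales to real label files:

    from collections import Counter

    # Hypothetical demographic labels attached to a face-recognition training set.
    labels = ["white", "white", "black", "white", "asian", "white", "white", "black"]

    counts = Counter(labels)
    total = sum(counts.values())
    for group, n in counts.most_common():
        print(f"{group}: {n / total:.0%} of training examples")

    # If one group dominates, accuracy will skew towards that group, so the data
    # set should be rebalanced or augmented before the model is trained.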

WATSON

Similarly, IBM designed a computer system called Watson[3], which was tasked with helping doctors
with cancer diagnoses. It’s a noble and ambitious aim, and there was a lot of excitement surrounding
this machine. While human doctors are limited by biases, knowledge and experience, Watson was
not. The idea was that Watson would be able to look at any combination of symptoms and come up
with a diagnosis and a confidence level and do all of this in a matter of seconds. However, Watson
proved much less impressive in practice.

There were several issues with Watson. Firstly, many doctors around the world complained that
Watson was biased towards American methods of diagnosis and treatment. Some of the suggestions
Watson came up with didn’t fit into the oncology landscape in these countries. Secondly, Watson had
great trouble understanding handwritten medical notes from doctors, meaning it was missing a good
chunk of new data in this form.

Lastly, and crucially, Watson could only think in statistics. On the surface, this might seem like a good
thing, and it is for 99% of scenarios. Watson could crunch thousands of medical journals and other
data sets to make confident assertions about what type of cancer a person might have and how to
treat it. It would look at common cancers in that area, key demographics, symptoms and so on. But it
couldn’t understand when something was significant in the same way a doctor or scientist can.

For example, in 2018, the FDA approved a new cancer drug that is effective against all tumors that have a certain genetic mutation. They did this based on a single study of only 55 patients, solely because the results were so promising. Watson ignored the relevance of this study due to how small it was; it didn’t deem it significant.

In business, the stakes might be less severe than cancer diagnoses, but these issues with complexity
and bias still exist.

HOW TO ENSURE “GOOD” DATA

NodeGraph makes it possible to ensure good data by tracking and tracing the origins, movements,
and touchpoints of data to provide insight into its quality and uncover its value.

HOW TO ADDRESS THE DATA QUALITY ISSUE

If poor-quality data is bad, then how do you combat it? How do you ensure your business only utilizes high-quality data? Evidence has shown that companies that take a holistic approach to data analytics and the data pipeline get better outcomes from their analytics than companies that don’t. To break down what this means, we’ll look at some recent findings from an IDC survey of business decision-makers as well as some industry best practices.

HOW MUCH DATA DO YOU CAPTURE?

Of course, capturing data is an essential part of data science. However, the survey found some key differences in how different companies perceive the volume of data they collect. Most respondents reported that they can capture 70-90% of relevant data. The keyword here is ‘relevant’. It’s thought that most respondents were referring to data they have generated themselves from their internal systems. However, the companies that admitted to struggling to find relevant data were also the companies considered to be in the top 20% of leaders. Why? Because these companies were taking a more holistic approach to data capture. They weren’t just focused on their own data, but also on the relevant data that could be gathered from their external partners or other parties.

WHAT IS HIGH-QUALITY DATA?

Additionally, almost 50% of respondents reported that they have issues with data quality. The key to addressing this issue is to continually tweak your solutions until they are fit for purpose. For data to be considered high quality, five criteria need to be met:

Accuracy — The data needs to be accurate to be useful.
Completeness — There can’t be missing values, because these will lead to inaccurate results.
Relevancy — The data must be relevant to the purpose it’s being used for.
Up-to-date — Old data can tell you a lot about the past, but the more up-to-date your data is, the better. The world is changing rapidly, and this will be reflected in your data. If you use old data to make new decisions, you’ll get out-of-date solutions to modern problems.
Consistency — The format of the data must be consistent to avoid erroneous results.

To succeed in all of these areas, you need to start taking control of your data by utilizing good data profiling tools. Automation is extremely useful here, since it can do most of the heavy lifting for you. The data pipeline must also be carefully designed to avoid duplicate data, which can lead to skewed results in AI and ML algorithms. The pipeline must follow a clear and logical design that works at the enterprise level. Data must be audited regularly to ensure data management is working correctly. Requirements for data accuracy must be clearly documented and expectations must be clearly outlined. Data governance teams and data project teams must be fully involved in any data projects.
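
To make the criteria above concrete, here is a minimal Python (pandas) sketch of the kind of automated checks a profiling tool runs. The file name, column names and thresholds are hypothetical placeholders you would adapt to your own requirements:

    import pandas as pd

    # Hypothetical customer table; all names and rules below are illustrative.
    df = pd.read_csv("customers.csv", parse_dates=["last_updated"])

    report = {
        # Completeness: share of missing values per column.
        "missing_share": df.isna().mean().round(3).to_dict(),
        # Consistency: rows whose email value fails a simple format check.
        "bad_email_format": int((~df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False)).sum()),
        # Up-to-date: records not touched for more than a year.
        "stale_records": int((pd.Timestamp.now() - df["last_updated"] > pd.Timedelta(days=365)).sum()),
        # Duplicates: repeated customer IDs that would skew AI and ML results downstream.
        "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    }

    for check, value in report.items():
        print(check, value)

Run regularly (and ideally automatically) as data flows through the pipeline, checks like these turn the criteria above from a wish list into something that can be measured and audited.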

THE ROI OF HIGH-QUALITY DATA

As with most business decisions, ROI must be considered. When you’re considering spending hefty sums of money on automated data solutions to transform your business into a data-driven enterprise, ROI naturally enters the discussion. Most companies understand the value of data science, but few companies understand the impact poor-quality data has on their business.

One study by Gartner, which looked at 140 companies, found that on average, these companies
estimated they were losing $8.2 million annually to poor data quality. This was the average across all
companies, but the figures become more alarming as you delve deeper. A huge 22% of companies
estimated they were losing more than $20 million a year[4].

This loss comes from both unrealized revenue and cost savings, as well as from the poor outcomes that result from erroneous data, inconsistencies, duplicated data and missing data.

ONLY EVER USE GOOD QUALITY DATA AGAIN WITH NODEGRAPH

NodeGraph is a leading data intelligence platform that reveals deep insights into an organization’s data, allowing businesses to make quicker, more trusted and better-controlled decisions from their data. With functionalities ranging from field-level, end-to-end data lineage to unit and regression testing, this powerful platform empowers businesses through data understanding.

NodeGraph provides you with easy access to, and transparency of, your company’s data and its sources. Knowing the what, where, how and why gives you confidence that everyone is using the right data to make actual, rather than assumed, decisions. Don’t waste time, resources, and money when the only fully automated metadata extraction platform on the market can do the work for you! Enable easy sharing of data throughout the organization, automate documentation and testing, and reduce errors.

NodeGraph’s automated data intelligence platform helps any company achieve data understanding by connecting ALL their data touchpoints. Here are just a few of the tools that we support: Power BI, Tableau and Qlik.

FIND OUT WHAT NODEGRAPH CAN DO FOR YOU

Now that you know more about data quality and its role in AI and ML, how can we help? Connect with one of our platform experts to learn how NodeGraph can help your organization establish and maintain a foolproof data quality framework. Our platform is compatible with leading BI and data platforms, including Qlik, Power BI, Microsoft SQL Server, Tableau and Snowflake.

Let’s Connect
