With the advent of new technologies, devices, and communication channels such as social networking
sites, the amount of data produced by mankind is growing rapidly every year. The amount of data
produced from the beginning of time till 2003 was 5 billion gigabytes; piled up in the form of
disks, it could fill an entire football field. The same amount was created every two days in 2011,
and every ten minutes in 2013, and this rate is still growing enormously. Though all this
information is meaningful and can be useful when processed, it is largely being neglected.
Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or tool; rather, it has become a complete subject that
involves various tools, techniques, and frameworks.
Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data − This is a component of helicopters, airplanes, jets, etc. It captures the
voices of the flight crew, recordings from microphones and earphones, and the performance
information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and
‘sell’ decisions made by customers on the shares of different companies.
Power Grid Data − Power grid data holds information about the power consumed by a particular
node with respect to a base station.
Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it
is of three types:
Structured data − Relational data.
Semi Structured data − XML data.
Unstructured data − Word, PDF, Text, Media Logs.
Examples of Big Data
The New York Stock Exchange generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media
site Facebook every day. This data is mainly generated through photo and video uploads,
message exchanges, posting of comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousand flights per day, data generation reaches many petabytes.
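As a rough sanity check on these figures, the short Python sketch below multiplies the per-flight volume out to a daily total. The flight count is an assumed, illustrative number, not a quoted statistic.

# Back-of-the-envelope arithmetic for the jet-engine figure above.
tb_per_30_min = 10              # terabytes generated per 30 minutes of flight
flights_per_day = 25_000        # assumed daily flight count, for illustration only
tb_per_day = tb_per_30_min * flights_per_day
pb_per_day = tb_per_day / 1024  # 1 petabyte = 1024 terabytes
print(f"{tb_per_day} TB/day is about {pb_per_day:.0f} PB/day")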
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data. Over time, computer science has achieved great success
in developing techniques for working with such data (where the format is well known in
advance) and deriving value out of it. However, nowadays we are foreseeing issues as the
size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10^21 bytes, i.e. one billion terabytes, form one zettabyte.
Looking at these figures one can easily understand why the name Big Data is given and imagine
the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
a 'structured' data.
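To make the idea concrete, here is a minimal Python sketch of structured data: rows with a fixed, known schema held in a relational table. The table and column names are invented for illustration.

import sqlite3

# Structured data: every row conforms to a schema that is known in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Prashant Rao", "Finance", 650000.0), ("Seema R.", "Admin", 500000.0)],
)
# Because the format is fixed, querying and deriving value is straightforward.
for row in conn.execute("SELECT name, dept FROM employees WHERE salary > 550000"):
    print(row)
conn.close()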
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive
value. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data
available to them, but unfortunately they don't know how to derive value from it, since this data
is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, e.g., a table definition in a relational
DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
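A short Python sketch of how such records might be parsed into structured rows (the records above are wrapped in a root element so the document is well-formed XML):

import xml.etree.ElementTree as ET

# Semi-structured data: the tags convey structure, but there is no fixed
# relational schema defined up front.
xml_data = """<recs>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
rows = [(r.findtext("name"), r.findtext("sex"), int(r.findtext("age")))
        for r in root.iter("rec")]
print(rows)  # [('Prashant Rao', 'Male', 35), ('Seema R.', 'Female', 41)]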
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, wherein data is stored in
relations (tables).
(i) Volume – The name Big Data itself is related to a size that is enormous. The size of data plays a
very crucial role in determining its value. Whether particular data can actually be considered
Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic that needs to be
considered while dealing with Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data considered
by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices,
PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data
poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is
generated and processed to meet demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data
is massive and continuous.
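As a toy illustration of velocity, the Python sketch below simulates events flowing in continuously and being processed one at a time as they arrive; the event shape and arrival rate are made up for illustration.

import random
import time

def sensor_stream(n_events):
    # Simulate a continuous source: events arrive one by one, not as a batch.
    for i in range(n_events):
        yield {"sensor_id": i % 3, "reading": round(random.uniform(20.0, 80.0), 1)}
        time.sleep(0.01)  # stand-in for the continuous arrival rate

total, count = 0.0, 0
for event in sensor_stream(100):
    # Each event is processed immediately to keep up with the incoming flow.
    total += event["reading"]
    count += 1
print(f"processed {count} events, mean reading {total / count:.1f}")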
(iv) Variability – This refers to the inconsistency the data can show at times, which hampers the
process of handling and managing the data effectively.
BIG DATA ANALYTICS
Big data analytics examines large amounts of data to uncover hidden patterns, correlations and
other insights. With today’s technology, it’s possible to analyze your data and get answers from it
almost immediately – an effort that’s slower and less efficient with more traditional business
intelligence solutions.
The concept of big data has been around for years; most organizations now understand that if
they capture all the data that streams into their businesses, they can apply analytics and get
significant value from it. But even in the 1950s, decades before anyone uttered the term “big
data,” businesses were using basic analytics (essentially numbers in a spreadsheet that were
manually examined) to uncover insights and trends.
The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and
unearthed information that could be used for future decisions, today that business can identify
insights for immediate decisions. The ability to work faster – and stay agile – gives organizations
a competitive edge they didn’t have before.
Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and
happier customers. In his report Big Data in Big Companies, IIA Director of Research Tom
Davenport interviewed more than 50 businesses to understand how they used big data. He found
they got value in the following ways:
1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data – plus they can
identify more efficient ways of doing business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined
with the ability to analyze new sources of data, businesses are able to analyze information
immediately – and make decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big
data analytics, more companies are creating new products to meet customers’ needs.
Who’s using it?
Think of a business that relies on quick, agile decisions to stay competitive, and most likely big
data analytics is involved in making that business tick. Here’s how different types of
organizations might use the technology:
Clinical research is a slow and expensive process, with trials failing for a variety of reasons.
Advanced analytics, artificial intelligence (AI) and the Internet of Medical Things (IoMT)
unlock the potential to improve speed and efficiency at every stage of clinical research by
delivering more intelligent, automated solutions.
Financial institutions gather and access analytical insight from large volumes of unstructured
data in order to make sound financial decisions. Big data analytics allows them to access the
information they need when they need it, by eliminating overlapping, redundant tools and
systems.
For manufacturers, solving problems is nothing new. They wrestle with difficult problems on a
daily basis – from complex supply chains, to motion applications, to labor constraints and
equipment breakdowns. That's why big data analytics is essential in the manufacturing industry:
it has allowed competitive organizations to discover new cost-saving and revenue opportunities.
Big data is a given in the health care industry. Patient records, health plans, insurance
information and other types of information can be difficult to manage – but are full of key
insights once analytics are applied. That’s why big data analytics technology is so important to
health care. By analyzing large amounts of information – both structured and unstructured –
quickly, health care providers can provide lifesaving diagnoses or treatment options almost
immediately.
Big Data Analytics for Government
Certain government agencies face a big challenge: tighten the budget without compromising
quality or productivity. This is particularly troublesome with law enforcement agencies, which
are struggling to keep crime rates down with relatively scarce resources. And that’s why many
agencies use big data analytics; the technology streamlines operations while giving the agency a
more holistic view of criminal activity.
Customer service has evolved in the past several years, as savvier shoppers expect retailers to
understand exactly what they need, when they need it. Big data analytics technology helps
retailers meet those demands. Armed with endless amounts of data from customer loyalty
programs, buying habits and other sources, retailers not only have an in-depth understanding of
their customers, they can also predict trends, recommend new products – and boost profitability.
2. Diagnostic Analytics
The obvious successor to descriptive analytics is diagnostic analytics. Diagnostic analytical tools
aid an analyst in digging deeper into the issue at hand so that they can arrive at the source of a problem.
In a structured business environment, tools for both descriptive and diagnostic analytics go hand-
in-hand!
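A minimal pandas sketch of that descriptive-to-diagnostic progression: a summary shows where a problem is, and a deeper drill-down suggests why. The regions, products, and revenue figures are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "B", "B"],
    "revenue": [120, 80, 40, 30, 25],
})

# Descriptive: overall revenue by region shows the South lagging.
print(sales.groupby("region")["revenue"].sum())

# Diagnostic: drill one level deeper to see which product drives the gap.
print(sales.groupby(["region", "product"])["revenue"].sum())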
3. Predictive Analytics
Any business pursuing success should have foresight. Predictive analytics helps businesses
forecast trends based on current events. Whether it’s predicting the probability of an event
happening in the future or estimating the time at which it will happen, all of this can be determined
with the help of predictive analytical models. Usually, many different but co-dependent variables are
analyzed to predict a trend in this type of analysis. For example, in the healthcare domain,
prospective health risks can be predicted based on an individual’s habits, diet, and genetic composition.
Therefore, these models are most important across various fields.
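In the spirit of the health-risk example, here is a minimal predictive-model sketch using scikit-learn's logistic regression. The features (smoker flag, weekly exercise hours, age) and the risk labels are fabricated purely for illustration.

from sklearn.linear_model import LogisticRegression

# Toy training data: [smoker, exercise_hours_per_week, age] -> elevated risk?
X = [[1, 0, 61], [0, 5, 34], [1, 1, 55], [0, 3, 42],
     [1, 0, 58], [0, 6, 29], [1, 2, 49], [0, 4, 38]]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = elevated health risk

model = LogisticRegression().fit(X, y)

# Predict the probability of elevated risk for a new individual.
print(model.predict_proba([[1, 1, 50]])[0][1])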
In today’s big data context, the previous approaches are either incomplete or suboptimal. For
example, the SEMMA methodology completely disregards the collection and preprocessing of data
from different sources. These stages normally constitute most of the work in a successful big data
project.
A big data analytics cycle can be described by the following stages −
MODEL BUILDING
Data modeling is a set of tools and techniques used to understand and analyze how an
organization should collect, update, and store data. It is a critical skill for the
business analyst who is involved with discovering, analyzing, and specifying changes to how
software systems create and maintain information.
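As a small illustration of data modeling, the Python sketch below defines two entities and the relationship between them; the entity and field names are invented for the example.

from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int  # links each order to exactly one customer
    placed_on: date
    total: float

# The model makes explicit how data is collected and stored: every Order
# must reference an existing Customer.
alice = Customer(1, "Alice")
print(Order(100, alice.customer_id, date(2024, 1, 15), 250.0))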
A successful predictive analytics project is executed step by step. As you immerse yourself in
the details of the project, watch for these major milestones:
MODEL VALIDATION
Model validation is defined within regulatory guidance as “the set of processes and activities
intended to verify that models are performing as expected, in line with their design objectives, and
business uses.” It also identifies “potential limitations and assumptions, and assesses their possible
impact.”
Generally, validation activities are performed by individuals independent of model development
or use; models, therefore, should not be validated by their owners. Because models can be highly
technical, some institutions may find it difficult to assemble a model risk team with sufficient
functional and technical expertise to carry out independent validation. When faced with this
obstacle, institutions often outsource the validation task to third parties.
1 Conceptual Design
The foundation of any model validation is its conceptual design, which requires a documented
coverage assessment supporting the model’s ability to meet business and regulatory needs and to
address the unique risks facing a bank.
The design and capabilities of a model can have a profound effect on the overall effectiveness of
a bank’s ability to identify and respond to risks. For example, a poorly designed risk assessment
model may result in a bank establishing relationships with clients that present a risk that is greater
than its risk appetite, thus exposing the bank to regulatory scrutiny and reputation damage.
A validation should independently challenge the underlying conceptual design and ensure that
documentation is appropriate to support the model’s logic and the model’s ability to achieve
desired regulatory and business outcomes for which it is designed.
2 System Validation
All technology and automated systems implemented to support models have limitations. An
effective validation includes: firstly, evaluating the processes used to integrate the model’s
conceptual design and functionality into the organisation’s business setting; and, secondly,
examining the processes implemented to execute the model’s overall design. Where gaps or
limitations are observed, controls should be evaluated to enable the model to function effectively.
3 Data Validation
To establish a robust framework for data validation, guidance indicates that the accuracy of source
data be assessed. This is a vital step because data can be derived from a variety of sources, some
of which might lack controls on data integrity, so the data might be incomplete or inaccurate.
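The kind of source-data checks described above might look like the following pandas sketch; the column names, rules, and thresholds are assumptions chosen for illustration.

import pandas as pd

source = pd.DataFrame({
    "account_id": [101, 102, None, 104],
    "balance":    [2500.0, -50.0, 300.0, 1.2e9],
})

issues = []
if source["account_id"].isna().any():
    issues.append("incomplete: missing account_id values")
if (source["balance"] < 0).any():
    issues.append("out of range: negative balances")
if (source["balance"] > 1e8).any():
    issues.append("suspect: implausibly large balances")

print(issues if issues else "source data passed all checks")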
4 Process Validation
To verify that a model is operating effectively, it is important to prove that the established
processes for the model’s ongoing administration, including governance policies and procedures,
support the model’s sustainability. A review of the processes also determines whether the models
are producing output that is accurate, managed effectively, and subject to the appropriate controls.
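One simple form such an ongoing check could take is comparing recent model output against a benchmark set at design time and escalating when it drifts; the rates and tolerance below are hypothetical.

# Ongoing process validation: flag output drift for independent review.
expected_approval_rate = 0.30   # rate assumed at model design time
tolerance = 0.05                # deviation that triggers escalation

recent_decisions = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 1 = approved
observed_rate = sum(recent_decisions) / len(recent_decisions)

if abs(observed_rate - expected_approval_rate) > tolerance:
    print(f"escalate: observed rate {observed_rate:.2f} deviates from design")
else:
    print(f"in control: observed rate {observed_rate:.2f}")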
If done effectively, model validation will enable your bank to have every confidence in the
accuracy of its various models, as well as align them with the bank’s business and regulatory
expectations. By failing to validate models, banks increase the risk of regulatory criticism,
fines, and penalties.
The complex and resource-intensive nature of validation makes it necessary to dedicate sufficient
resources to it. An independent validation team well versed in data management, technology, and
relevant financial products or services – for example, credit, capital management, insurance, or
financial crime compliance – is vital for success. Where shortfalls in the validation process are
identified, timely remedial actions should be taken to close the gaps.