
“90% of the world’s data was generated in the last few years.”


Due to the advent of new technologies, devices, and communication channels such as social networking
sites, the amount of data produced by mankind is growing rapidly every year. The amount of data
produced from the beginning of time until 2003 was about 5 billion gigabytes; piled up in the form
of disks, it could fill an entire football field. The same amount was created every two days in 2011,
and every ten minutes in 2013, and this rate is still growing enormously. Although all of this
information can be meaningful and useful when processed, most of it is being neglected.

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or tool; rather, it has become a complete subject that
involves various tools, techniques, and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
 Black Box Data − A component of helicopters, airplanes, and jets that captures the
voices of the flight crew, recordings from microphones and earphones, and the performance
information of the aircraft.
 Social Media Data − Social media sites such as Facebook and Twitter hold information and
views posted by millions of people across the globe.
 Stock Exchange Data − Holds information about the ‘buy’ and ‘sell’ decisions made by
customers on shares of different companies.
 Power Grid Data − Holds information about the power consumed by a particular node
with respect to a base station.
 Transport Data − Includes the model, capacity, distance, and availability of a vehicle.
 Search Engine Data − Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it
will be of three types.
 Structured data − Relational data.
 Semi-structured data − XML data.
 Unstructured data − Word, PDF, text, media logs.

Examples of Big Data

Following are some of the examples of Big Data:

The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media

Statistics show that 500+ terabytes of new data get ingested into the databases of the social media
site Facebook every day. This data is mainly generated from photo and video uploads, message
exchanges, comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousands of flights per day, data generation reaches many petabytes.
Types of Big Data

Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Over time, computer science has achieved great success in developing techniques for working with
such data (where the format is well known in advance) and in deriving value from it. However,
nowadays we foresee issues when the size of such data grows to a huge extent; typical sizes are in
the range of multiple zettabytes.

Do you know? 10²¹ bytes equal 1 zettabyte, i.e., one billion terabytes.

Looking at these figures one can easily understand why the name Big Data is given and imagine
the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
'structured' data.

Examples of Structured Data

An 'Employee' table in a database is an example of Structured Data

Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000


Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to its huge
size, unstructured data poses multiple challenges when it comes to processing it to derive value.
A typical example of unstructured data is a heterogeneous data source containing a combination of
simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to
them but, unfortunately, they don't know how to derive value from it since this data is in its raw,
unstructured form.

Examples of Unstructured Data

The output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both forms of data. Semi-structured data appears structured in
form, but it is not actually defined by, for example, a table definition in a relational DBMS.
An example of semi-structured data is data represented in an XML file.

Examples of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
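
To illustrate why such data is called semi-structured, here is a minimal parsing sketch in Python using the standard xml.etree.ElementTree module. The enclosing <records> root element and the inline string are assumptions added for the example; real data would usually be read from a file.

import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <rec> records above placed under a single root element.
xml_text = """<records>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</records>"""

root = ET.fromstring(xml_text)

# Each <rec> carries its own tags, so the structure travels with the data
# even though no table schema was declared in advance.
for rec in root.findall("rec"):
    name = rec.findtext("name")
    age = int(rec.findtext("age"))
    print(f"{name} is {age} years old")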
Data Growth over the years

Please note that web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, where data is stored in
relations (tables).

Characteristics of Big Data

(i) Volume – The name Big Data itself is related to size, which is enormous. The size of data plays
a very crucial role in determining the value that can be derived from it. Whether particular data
can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one
characteristic that needs to be considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
In earlier days, spreadsheets and databases were the only sources of data considered by most
applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. is also considered in analysis applications. This variety of unstructured data poses
certain issues for storing, mining, and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the
data is generated and processed to meet demand determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is
massive and continuous.

(iv) Variability – This refers to the inconsistency that data can show at times, which hampers the
process of handling and managing the data effectively.
BIG DATA ANALYTICS
Big data analytics examines large amounts of data to uncover hidden patterns, correlations and
other insights. With today’s technology, it’s possible to analyze your data and get answers from it
almost immediately – an effort that’s slower and less efficient with more traditional business
intelligence solutions.

History and evolution of big data analytics

The concept of big data has been around for years; most organizations now understand that if
they capture all the data that streams into their businesses, they can apply analytics and get
significant value from it. But even in the 1950s, decades before anyone uttered the term “big
data,” businesses were using basic analytics (essentially numbers in a spreadsheet that were
manually examined) to uncover insights and trends.

The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and
unearthed information that could be used for future decisions, today that business can identify
insights for immediate decisions. The ability to work faster – and stay agile – gives organizations
a competitive edge they didn’t have before.

Why is big data analytics important?

Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and
happier customers. In his report Big Data in Big Companies, IIA Director of Research Tom
Davenport interviewed more than 50 businesses to understand how they used big data. He found
they got value in the following ways:

1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data – plus they can
identify more efficient ways of doing business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined
with the ability to analyze new sources of data, businesses are able to analyze information
immediately – and make decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big
data analytics, more companies are creating new products to meet customers’ needs.
Who’s using it?

Think of a business that relies on quick, agile decisions to stay competitive, and most likely big
data analytics is involved in making that business tick. Here’s how different types of
organizations might use the technology:

Big Data Analytics for Life Sciences

Clinical research is a slow and expensive process, with trials failing for a variety of reasons.
Advanced analytics, artificial intelligence (AI) and the Internet of Medical Things (IoMT)
unlock the potential to improve speed and efficiency at every stage of clinical research by
delivering more intelligent, automated solutions.

Big Data Analytics for Banking

Financial institutions gather and access analytical insight from large volumes of unstructured
data in order to make sound financial decisions. Big data analytics allows them to access the
information they need when they need it, by eliminating overlapping, redundant tools and
systems.

Big Data Analytics for Manufacturing

For manufacturers, solving problems is nothing new. They wrestle with difficult problems on a
daily basis - from complex supply chains, to motion applications, to labor constraints and
equipment breakdowns. That's why big data analytics is essential in the manufacturing industry,
as it has allowed competitive organizations to discover new cost-saving and revenue
opportunities.

Big Data Analytics for Health Care

Big data is a given in the health care industry. Patient records, health plans, insurance
information and other types of information can be difficult to manage – but are full of key
insights once analytics are applied. That’s why big data analytics technology is so important to
health care. By analyzing large amounts of information – both structured and unstructured –
quickly, health care providers can provide lifesaving diagnoses or treatment options almost
immediately.
Big Data Analytics for Government

Certain government agencies face a big challenge: tighten the budget without compromising
quality or productivity. This is particularly troublesome with law enforcement agencies, which
are struggling to keep crime rates down with relatively scarce resources. And that’s why many
agencies use big data analytics; the technology streamlines operations while giving the agency a
more holistic view of criminal activity.

Big Data Analytics for Retail

Customer service has evolved in the past several years, as savvier shoppers expect retailers to
understand exactly what they need, when they need it. Big data analytics technology helps
retailers meet those demands. Armed with endless amounts of data from customer loyalty
programs, buying habits and other sources, retailers not only have an in-depth understanding of
their customers, they can also predict trends, recommend new products – and boost profitability.

Different Types Of Data Analytics


Let me take you through the main types of analytics and the scenarios under which they are
normally employed.
1. Descriptive Analytics
As the name implies, descriptive analytics (or statistics) summarizes raw data and converts it into
a form that can be easily understood by humans. It can describe in detail an event that has occurred
in the past. This type of analytics is helpful for deriving patterns, if any, from past events and
for drawing interpretations from them so that better strategies can be framed for the future.
This is the most frequently used type of analytics across organizations. It is crucial in revealing
the key metrics and measures within any business.
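
As a minimal sketch of descriptive analytics (assuming pandas is available; the sales figures and column names below are invented purely for illustration), summary statistics and simple aggregations of historical data are usually the starting point:

import pandas as pd

# Hypothetical historical sales records; in practice these would be loaded
# from a file or database, e.g. pd.read_csv("sales.csv").
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [120, 95, 130, 80],
    "revenue": [2400.0, 1900.0, 2600.0, 1600.0],
})

# Describe what happened: counts, means, and spread of the key metrics.
print(sales[["units", "revenue"]].describe())

# Summarize past performance per region.
print(sales.groupby("region")["revenue"].sum())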

2. Diagnostic Analytics
The obvious successor to descriptive analytics is diagnostic analytics. Diagnostic analytical tools
help an analyst dig deeper into the issue at hand so that they can arrive at the source of a problem.
In a structured business environment, tools for both descriptive and diagnostic analytics go hand
in hand!

3. Predictive Analytics
Any business that pursues success should have foresight. Predictive analytics helps businesses
forecast trends based on current events. Whether it is predicting the probability of an event
happening in the future or estimating the exact time at which it will happen, all of this can be
determined with the help of predictive analytical models. Usually, many different but co-dependent
variables are analyzed to predict a trend in this type of analysis. For example, in the healthcare
domain, prospective health risks can be predicted based on an individual's habits, diet, and genetic
composition. These models are therefore important across many fields.

Predictive analytics can be further categorized as follows (a minimal modelling sketch follows the list):

1. Predictive Modelling – What will happen next, if ...?
2. Root Cause Analysis – Why did this actually happen?
3. Data Mining – Identifying correlated data.
4. Forecasting – What if the existing trends continue?
5. Monte Carlo Simulation – What could happen?
6. Pattern Identification and Alerts – When should an action be invoked to correct a process?
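
As a minimal predictive-modelling sketch (assuming scikit-learn is installed; the health-risk features and figures below are made up for illustration), a model is fitted on historical records and then used to score a new case:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data: [age, body_mass_index, smoker (0/1)]
# and whether a health-risk event occurred (1) or not (0).
X = np.array([[45, 27.0, 1], [30, 22.5, 0], [60, 31.2, 1], [25, 24.0, 0],
              [50, 29.5, 0], [38, 26.1, 1], [65, 33.0, 1], [28, 21.0, 0]])
y = np.array([1, 0, 1, 0, 0, 1, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predict the probability of the event for a new individual.
new_person = np.array([[55, 30.0, 1]])
print(model.predict_proba(new_person)[0, 1])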
4. Prescriptive Analytics
This type of analytics prescribes the step-by-step actions to take in a situation. For instance,
prescriptive analysis is what comes into play when your Uber driver gets the easiest route from
Google Maps. The best route is chosen by considering the distance of every available route from
your pick-up point to the destination and the traffic constraints on each road. A data analyst would
need to apply one or more of the above analytics processes as part of his job.
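
As a minimal sketch of that route-selection idea (the road network, travel times, and node names below are invented for illustration), choosing the best route can be framed as a shortest-path problem over edge weights that already include traffic delays:

import heapq

# Hypothetical road network: travel times in minutes, including current traffic.
graph = {
    "pickup": {"A": 7, "B": 9},
    "A": {"B": 2, "destination": 15},
    "B": {"destination": 11},
    "destination": {},
}

def shortest_time(graph, start, goal):
    # Dijkstra's algorithm: always expand the currently cheapest node.
    queue = [(0, start)]
    best = {start: 0}
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        for neighbour, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")

print(shortest_time(graph, "pickup", "destination"))  # 20 minutes, e.g. pickup -> B -> destination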

Big Data Life Cycle

In today’s big data context, the previous approaches are either incomplete or suboptimal. For
example, the SEMMA methodology completely disregards data collection and the preprocessing of
different data sources. These stages normally constitute most of the work in a successful big data
project.
A big data analytics cycle can be described by the following stages −

 Business Problem Definition
 Research
 Human Resources Assessment
 Data Acquisition
 Data Munging
 Data Storage
 Exploratory Data Analysis
 Data Preparation for Modeling and Assessment
 Modeling
 Implementation
In this section, we will throw some light on each of these stages of the big data life cycle.
Business Problem Definition
This point is common to the traditional BI and big data analytics life cycles. It is normally a
non-trivial stage of a big data project to define the problem and evaluate correctly how much
potential gain it may have for the organization. It seems obvious to mention this, but the expected
gains and costs of the project have to be evaluated.
Research
Analyze what other companies have done in the same situation. This involves looking for
solutions that are reasonable for your company, even though it involves adapting other solutions
to the resources and requirements that your company has. In this stage, a methodology for the
future stages should be defined.
Human Resources Assessment
Once the problem is defined, it is reasonable to continue by analyzing whether the current staff is
able to complete the project successfully. Traditional BI teams might not be capable of delivering
an optimal solution for all the stages, so before starting the project it should be considered
whether there is a need to outsource a part of the project or hire more people.
Data Acquisition
This stage is key in a big data life cycle; it defines which types of profiles would be needed to
deliver the resulting data product. Data gathering is a non-trivial step of the process; it normally
involves gathering unstructured data from different sources. To give an example, it could involve
writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in
different languages, and normally requires a significant amount of time to complete.
Data Munging
Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use
format. To continue with the reviews example, let’s assume the data is retrieved from different
sites, each of which displays the data differently.
Suppose one data source gives reviews as a rating in stars; this can be read as a mapping for the
response variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using a two-arrow system,
one for up-voting and the other for down-voting. This implies a response variable of the form
y ∈ {positive, negative}.
In order to combine both data sources, a decision has to be made to make these two response
representations equivalent. This can involve converting the first data source’s representation to
the second form, considering one star as negative and five stars as positive. This process often
requires a large time allocation to be delivered with good quality.
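
A minimal sketch of that harmonization step (assuming pandas; the three-star cutoff, column names, and sample values are assumptions added for illustration, since the text above only fixes one star as negative and five stars as positive):

import pandas as pd

# Hypothetical reviews from two sources with different rating systems.
stars = pd.DataFrame({"review_id": [1, 2, 3], "stars": [5, 1, 4]})
arrows = pd.DataFrame({"review_id": [4, 5], "vote": ["up", "down"]})

# Map star ratings onto the {positive, negative} representation.
# Assumption: 4-5 stars -> positive, 1-2 stars -> negative, 3 stars dropped as neutral.
stars["label"] = stars["stars"].map(
    lambda s: "positive" if s >= 4 else "negative" if s <= 2 else None)
stars = stars.dropna(subset=["label"])

# The arrow system already matches the target representation.
arrows["label"] = arrows["vote"].map({"up": "positive", "down": "negative"})

# Combine both sources into a single response variable.
combined = pd.concat([stars[["review_id", "label"]],
                      arrows[["review_id", "label"]]], ignore_index=True)
print(combined)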
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies
offer plenty of alternatives on this point. The most common alternative is using the Hadoop
Distributed File System for storage together with a limited version of SQL known as the Hive Query
Language (HiveQL). From the user’s perspective, this allows most analytics tasks to be done in ways
similar to traditional BI data warehouses. Other storage options to be considered are MongoDB,
Redis, and Spark.
This stage of the cycle is related to the human resources’ knowledge in terms of their ability to
implement different architectures. Modified versions of traditional data warehouses are still being
used in large-scale applications. For example, Teradata and IBM offer SQL databases that can
handle terabytes of data; open source solutions such as PostgreSQL and MySQL are still being
used for large-scale applications.
Even though the different storage systems differ in how they work in the background, from the
client side most solutions provide a SQL API. Hence, having a good understanding of SQL is still
a key skill for big data analytics.
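
A minimal sketch of that client-side view (using Python's built-in sqlite3 module purely as a stand-in for whatever SQL-speaking back end is actually in place, such as Hive, Teradata, or PostgreSQL; the table and column names are made up):

import sqlite3

# Stand-in connection; with Hive or PostgreSQL only the driver and connection
# details would change, while the SQL itself stays essentially the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (source TEXT, label TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [("site_a", "positive"), ("site_a", "negative"),
                  ("site_b", "positive")])

# A typical analytics query: counts per source and label.
for row in conn.execute(
        "SELECT source, label, COUNT(*) FROM reviews GROUP BY source, label"):
    print(row)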
A priori, this stage might seem to be the most important topic; in practice, this is not true. It is
not even an essential stage. It is possible to implement a big data solution that works with
real-time data, in which case we only need to gather data to develop the model and then implement
it in real time, so there would be no need to formally store the data at all.
Exploratory Data Analysis
Once the data has been cleaned and stored in a way that allows insights to be retrieved from it, the
data exploration phase is mandatory. The objective of this stage is to understand the data; this is
normally done with statistical techniques and by plotting the data. This is also a good stage to
evaluate whether the problem definition makes sense or is feasible.
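
A minimal exploration sketch (assuming pandas and matplotlib are installed; the file name and the review_length column are placeholders for whatever the cleaned data actually contains):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder: load the cleaned data produced in the previous stages.
df = pd.read_csv("cleaned_reviews.csv")

# Statistical overview and pairwise correlations of the numeric columns.
print(df.describe())
print(df.corr(numeric_only=True))

# Plot the distribution of one numeric column to spot skew or outliers.
df["review_length"].hist(bins=30)
plt.xlabel("review_length")
plt.ylabel("count")
plt.show()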
Data Preparation for Modeling and Assessment
This stage involves reshaping the cleaned data retrieved previously and using statistical
preprocessing for missing-value imputation, outlier detection, normalization, feature extraction,
and feature selection.
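
A minimal sketch of two of those steps, missing-value imputation and normalization (assuming scikit-learn; the small feature matrix is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with one missing value (np.nan).
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [4.0, 220.0]])

# Fill missing values with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Normalize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_scaled)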
Modelling
The prior stage should have produced several datasets for training and testing, for example, for a
predictive model. This stage involves trying different models with a view to solving the business
problem at hand. In practice, it is normally desirable that the model gives some insight into the
business. Finally, the best model or combination of models is selected by evaluating its
performance on a left-out dataset.
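
A minimal sketch of that selection step (assuming scikit-learn; a synthetic dataset stands in for the prepared training and testing data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the datasets produced by the previous stage.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Try different models and compare them on the left-out (test) dataset.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, round(accuracy_score(y_test, model.predict(X_test)), 3))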
Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This
involves setting up a validation scheme while the data product is working, in order to track its
performance. For example, in the case of implementing a predictive model, this stage would involve
applying the model to new data and, once the response is available, evaluating the model.
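
A minimal monitoring sketch along those lines (the model, data, and accuracy threshold are placeholders; in production such a check would run on a schedule once the true responses become available):

from sklearn.metrics import accuracy_score

def monitor(model, X_new, y_new, threshold=0.8):
    """Score the deployed model on newly labelled data and flag it for
    retraining if its performance appears to have decayed."""
    score = accuracy_score(y_new, model.predict(X_new))
    if score < threshold:
        print(f"accuracy {score:.3f} is below {threshold}; consider retraining")
    return score

# Example usage, reusing names from the previous sketch:
# monitor(candidates["random_forest"], X_test, y_test)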

MODEL BUILDING
Data modeling is a set of tools and techniques used to understand and analyze how an
organization should collect, update, and store data. It is a critical skill for the
business analyst who is involved with discovering, analyzing, and specifying changes to how
software systems create and maintain information.

A successful predictive analytics project is executed step by step. As you immerse yourself in
the details of the project, watch for these major milestones:

1. Defining Business Objectives


The project starts with a well-defined business objective. The model is supposed to address a
business question. Clearly stating that objective will allow you to define the scope of your project
and will provide you with the exact test to measure its success.
2. Preparing Data
You’ll use historical data to train your model. The data is usually scattered across multiple sources
and may require cleansing and preparation. Data may contain duplicate records and outliers;
depending on the analysis and the business objective, you decide whether to keep or remove them.
Also, the data could have missing values, may need to undergo some transformation, and may be
used to generate derived attributes that have more predictive power for your objective. Overall,
the quality of the data indicates the quality of the model.
3. Sampling Your Data
You’ll need to split your data into two sets: a training dataset and a test dataset. You build the
model using the training dataset and use the test dataset to verify the accuracy of the model’s
output. Doing so is absolutely crucial; otherwise you run the risk of overfitting your model, that
is, training the model with a limited dataset to the point that it picks up all the characteristics
(both the signal and the noise) that are only true for that particular dataset. A model that is
overfitted to a specific dataset will perform miserably when you run it on other datasets. A test
dataset gives you a valid way to accurately measure your model’s performance.
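
A minimal sketch of that split, and of spotting overfitting by comparing training and test accuracy (assuming scikit-learn; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# An unconstrained tree can memorize the training set (signal and noise alike).
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A large gap between training and test accuracy is the classic sign of overfitting.
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))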
4. Building the Model
Sometimes the data or the business objectives lend themselves to a specific algorithm or model.
Other times the best approach is not so clear-cut. As you explore the data, run as many algorithms
as you can; compare their outputs. Base your choice of the final model on the overall results.
Sometimes you’re better off running an ensemble of models simultaneously on the data and
choosing a final model by comparing their outputs.
5. Deploying the Model
After building the model, you have to deploy it in order to reap its benefits. That process may
require co-ordination with other departments. Aim at building a deployable model. Also be sure
you know how to present your results to the business stakeholders in an understandable and
convincing way so they adopt your model. After the model is deployed, you’ll need to monitor its
performance and continue improving it. Most models decay after a certain period of time. Keep
your model up to date by refreshing it with newly available data.

MODEL VALIDATION

Model validation is defined within regulatory guidance as “the set of processes and activities
intended to verify that models are performing as expected, in line with their design objectives, and
business uses.” It also identifies “potential limitations and assumptions, and assesses their possible
impact.”
Generally, validation activities are performed by individuals independent of model development or
use; models, therefore, should not be validated by their owners. Validation can be highly technical,
and some institutions may find it difficult to assemble a model risk team that has sufficient
functional and technical expertise to carry out independent validation. When faced with this
obstacle, institutions often outsource the validation task to third parties.

The Four Elements


Model validation consists of four crucial elements which should be considered:

1 Conceptual Design
The foundation of any model validation is its conceptual design, which needs a documented coverage
assessment that supports the model’s ability to meet business and regulatory needs and to address
the unique risks facing a bank.

The design and capabilities of a model can have a profound effect on a bank’s ability to identify
and respond to risks. For example, a poorly designed risk assessment model may result in a bank
establishing relationships with clients that present a risk greater than its risk appetite, thus
exposing the bank to regulatory scrutiny and reputational damage.

A validation should independently challenge the underlying conceptual design and ensure that
documentation is appropriate to support the model’s logic and the model’s ability to achieve
desired regulatory and business outcomes for which it is designed.
2 System Validation
All technology and automated systems implemented to support models have limitations. An
effective validation includes: firstly, evaluating the processes used to integrate the model’s
conceptual design and functionality into the organisation’s business setting; and, secondly,
examining the processes implemented to execute the model’s overall design. Where gaps or
limitations are observed, controls should be evaluated to enable the model to function effectively.

3 Data Validation and Quality Assessment


Data errors or irregularities impair results and might lead to an organisation’s failure to identify
and respond to risks. Best practice indicates that institutions should apply risk-based data
validation, which enables the reviewer to consider risks unique to the organisation and the model.

To establish a robust framework for data validation, guidance indicates that the accuracy of source
data be assessed. This is a vital step because data can be derived from a variety of sources, some
of which might lack controls on data integrity, so the data might be incomplete or inaccurate.

4 Process Validation
To verify that a model is operating effectively, it is important to prove that the established
processes for the model’s ongoing administration, including governance policies and procedures,
support the model’s sustainability. A review of the processes also determines whether the models
are producing output that is accurate, managed effectively, and subject to the appropriate controls.

If done effectively, model validation will enable your bank to have every confidence in the accuracy
of its various models, as well as in their alignment with the bank’s business and regulatory
expectations. By failing to validate models, banks increase the risk of regulatory criticism, fines,
and penalties.

The complex and resource-intensive nature of validation makes it necessary to dedicate sufficient
resources to it. An independent validation team well versed in data management, technology, and
relevant financial products or services – for example, credit, capital management, insurance, or
financial crime compliance – is vital for success. Where shortfalls in the validation process are
identified, timely remedial actions should be taken to close the gaps.

LEARNING DATA ANALYTIC MODEL


As depicted in Figure 2, the four dimensions of the proposed reference model for LA are:
- What? What kind of data does the system gather, manage, and use for the analysis?
- Who? Who is targeted by the analysis?
- Why? Why does the system analyze the collected data?
- How? How does the system perform the analysis of the collected data?

Big Learning Analytics


The abundance of educational data, as pointed out in the “what?” dimension of the LA reference
model, and the recent attention on the potential of efficient infrastructures for capturing and
processing large amounts of data, known as big data, have resulted in a growing interest in big
learning analytics among LA researchers and practitioners (Dawson et al., 2014). Big learning
analytics refers to leveraging big data analytics methods to generate value in TEL environments.
Harnessing big data in the TEL domain has enormous potential. LA stakeholders have access to a
massive volume of data from learners’ activities across various learning environments which,
through the use of big data analytics methods, can be used to develop a greater understanding of the
learning experiences and processes in the new open, networked, and increasingly complex learning
environments.
A key challenge in big learning analytics is how to aggregate and integrate raw data from multiple,
heterogeneous sources, often available in different formats, to create a useful educational data set
that reflects the distributed activities of the learner, thus leading to more precise and solid LA
results. Furthermore, handling big data is a technical challenge because efficient analytics methods
and tools have to be implemented to deliver meaningful results without too much delay, so that
stakeholders have the opportunity to act on newly gained information in time. Strategies and best
practices for harnessing big data in TEL have to be found and shared by the LA research community.
