
What is Big Data?

Big Data is a term used to describe collections of data that are huge in size and growing
exponentially with time.

Big data is a combination of structured, semi-structured and unstructured data collected by
organizations that can be mined for information and used in machine learning projects,
predictive modeling and other advanced analytics applications.

Some examples are transaction processing systems, customer databases, documents, emails,
medical records, mobile apps and social networks.

Difference between traditional data and big data:

Small Data / Traditional Data      | Big Data
-----------------------------------|--------------------------------
Mostly structured                  | Mostly unstructured
Stored in MB, GB, TB               | Stored in PB, EB
Grows gradually                    | Grows exponentially
Locally present, centralized       | Globally present, distributed
SQL Server, Oracle                 | Hadoop, Spark
Single node                        | Multi-node cluster

HISTORY OF BIG DATA:

The term Big Data was coined by Roger Mougalas back in 2005. However, the application of big
data and the quest to understand the available data is something that has been in existence for
a long time.

John Graunt in 1663 recorded and analyzed information on the rate of mortality in London. He
did this in an effort to raise awareness on the effects of the bubonic plague that was ongoing at
the time.

The starting point of modern data processing came in 1889, when Herman Hollerith invented a
computing system to help organize census data.

After Herman Hollerith’s input, the next noteworthy data development leap happened in 1937
under Franklin D. Roosevelt’s presidential administration in the United States. After the United
States congress passed the Social Security Act, the government was required to keep track of
millions of Americans. The government contracted IBM to develop a punch card-reading
system that would be applied in this extensive data project.
However, the very first data-processing machine, named Colossus, was developed by the
British to decipher Nazi codes during World War II (1943). It worked by searching for
patterns that appeared regularly in the intercepted messages.

The first data centre was built by the United States government in 1965 for the purpose of
storing millions of tax returns and fingerprint sets. This project, however, did not persist due to
fear of sabotage or acquisition.

Tim Berners-Lee a British computer scientist invented the World Wide Web in 1989.
Berners-Lee’s intention was to enable the sharing of information through a hypertext system.

By 1995, supercomputers could perform, in a matter of seconds, work that would take a single
person thousands of years.

In the 21st century, the world was first introduced to the term Big Data by Roger Mougalas. In
the same year (2005), Yahoo created the now open-source Hadoop framework with the intention of
indexing the entire World Wide Web. Today, Hadoop is used by millions of businesses to sift
through colossal amounts of data.

 Introduction to Big Data platform

● A big data platform is an IT solution that combines the features and capabilities of
several big data applications and utilities within a single solution, which is then used
for managing as well as analyzing Big Data.
● It focuses on providing its users with efficient analytics tools for massive datasets.
● Users of such platforms can build custom applications for their own use case, for
example calculating customer loyalty (an e-commerce use case).
● Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.
● Example: Some of the most commonly used Big Data Platforms are:
▪ Hadoop Delta Lake Migration Platform
▪ Data Catalog Platform
▪ Data Ingestion Platform
▪ IoT Analytics Platform

Drivers for Big Data

Big Data has quickly risen to become one of the most talked-about topics in the industry. The
main business drivers behind this rising demand for Big Data analytics are:

1. The digitization of society
2. The drop in technology costs
3. Connectivity through cloud computing
4. Increased knowledge about data science
5. Social media applications
6. The rise of Internet-of-Things (IoT)
Example: A number of companies that have Big Data at the core of their strategy, such as Apple,
Amazon, Facebook and Netflix, became very successful at the beginning of the 21st century.

Big-Data Architecture:

Sources Layer: The Big Data sources are the ones that govern the Big Data architecture.
● The designing of the architecture depends heavily on the data sources.
● The data is arriving from numerous sources that too in different formats. These include
relational databases, company servers and sensors such as IoT devices, third-party
data providers, etc.
● These sources pile up a huge amount of data in no time. The Big Data architecture is
designed such that it is capable of handling this data.

Data storage: Data for batch processing operations is typically stored in a distributed file store
that can hold high volumes of large files in various formats.
● This kind of store is often called a data lake.
● Options for implementing this storage include Azure Blob Storage, Azure Data Lake
Store, SQL Database and Cosmos DB.

Batch Processing: Since the data is so huge in size, the architecture needs a batch processing
system to filter, aggregate, and process data for advanced analytics.
● These are long-running batch jobs. This involves reading the data from the storage layer,
processing it, and finally writing the outputs to the new files.
● Hadoop is the most commonly used solution for it.
● Options include U-SQL, Hive, Pig, Spark
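The filter-aggregate pattern of a batch job can be sketched in plain Python. This is a toy, single-machine stand-in for what Hive or Spark would run over a distributed file store; the "files", words and counts are illustrative, not any real dataset:

```python
from collections import Counter
from itertools import chain

# Toy batch input: in a real cluster these "files" would live in a
# distributed store such as HDFS and be processed in parallel.
files = [
    ["error timeout", "ok", "error disk"],
    ["ok", "ok", "error timeout"],
]

def map_phase(line):
    # Emit (key, 1) pairs, one per word -- the "map" step.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per key -- the "reduce" step.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

all_pairs = chain.from_iterable(
    map_phase(line) for line in chain.from_iterable(files)
)
result = reduce_phase(all_pairs)
print(result["error"])  # 3
print(result["ok"])     # 3
```

The same map and reduce steps are what Hadoop distributes across a multi-node cluster; the logic stays the same, only the scale changes.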

Real-time message ingestion: If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing.

● This might be a simple data store, where incoming messages are dropped into a folder
for processing.
● The primary goal of data ingestion is to move data smoothly into the further layers of
the data architecture. Generally, Kafka Streams or REST APIs are used for ingestion.
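The "messages dropped into a store for processing" pattern can be sketched with an in-memory queue standing in for the real broker. This is not Kafka's actual API; the function names and events are illustrative:

```python
import json
from queue import Queue

# In-memory stand-in for a message broker topic (e.g. a Kafka topic).
topic = Queue()

def ingest(event: dict) -> None:
    # Producers serialize incoming events and drop them onto the topic.
    topic.put(json.dumps(event))

def drain(n: int) -> list:
    # A downstream consumer pulls messages off for stream processing.
    return [json.loads(topic.get()) for _ in range(n)]

ingest({"sensor": "s1", "temp": 21.5})
ingest({"sensor": "s2", "temp": 19.0})
batch = drain(2)
print(batch[0]["sensor"])  # s1
```

A real broker adds durability, partitioning and replay on top of this basic produce/consume idea, but the separation between producers and consumers is the same.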

Stream Processing: Processing the data arriving in real-time is the hottest trend in the Big Data
world.

● The Big Data architecture, therefore, must include a system to capture and store
real-time data.
● This can be done by simply ingesting the real-time data into a data store for processing.
The architecture needs to have a robust system for dealing with real-time data.
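A core stream-processing operation is aggregating events over fixed time windows. The sketch below shows a tumbling-window sum in plain Python; the event stream and window size are made up for illustration:

```python
from collections import defaultdict

# Simulated real-time events: (timestamp_in_seconds, value).
events = [(0, 2), (1, 3), (4, 1), (6, 5), (7, 2), (11, 4)]

def tumbling_window_sums(stream, window_size=5):
    # Assign each event to a fixed, non-overlapping time window and
    # aggregate per window -- the basic stream-processing pattern.
    sums = defaultdict(int)
    for ts, value in stream:
        sums[ts // window_size] += value
    return dict(sums)

print(tumbling_window_sums(events))  # {0: 6, 1: 7, 2: 4}
```

Engines such as Spark Streaming or Kafka Streams perform this kind of windowed aggregation continuously and at scale, emitting results as each window closes.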

Analytical data store: Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. The analytical
data store used to serve these queries can be a relational data warehouse, as seen in most
traditional business intelligence (BI) solutions.

Analysis and reporting: The goal of most big data solutions is to provide insights into the data
through analysis and reporting.

● To empower users to analyze the data, the architecture may include a data modeling
layer and it might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel.
● Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts.
● For these scenarios, many Azure services support analytical notebooks, such as Jupyter,
enabling these users to leverage their existing skills with Python or R. For large-scale
data exploration, you can use Microsoft R Server, either standalone or with Spark.

Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources
and sinks, load the processed data into an analytical data store, or push the results straight to a
report or dashboard. To automate these workflows, you can use an orchestration technology
such as Azure Data Factory or Apache Oozie and Sqoop.
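At heart, an orchestrator runs the tasks of a workflow in dependency order, i.e. it executes a DAG. A minimal sketch with Python's standard library (the task names are illustrative, and real tools add scheduling, retries and monitoring on top):

```python
from graphlib import TopologicalSorter

# A toy workflow: each task maps to the set of tasks it depends on.
# Tools like Azure Data Factory or Apache Oozie express the same idea.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}

run_log = []

def run(task: str) -> None:
    run_log.append(task)  # stand-in for the real processing step

# static_order() yields tasks so that dependencies always run first.
for task in TopologicalSorter(workflow).static_order():
    run(task)

print(run_log)
# ['extract', 'transform', 'load_warehouse', 'refresh_dashboard']
```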

5 V's of Big Data:

There are five V's of Big Data that explain its characteristics:

● Volume
● Veracity
● Variety
● Value
● Velocity

Volume:

● The name Big Data itself refers to the enormous size of the data.


● Big Data refers to the vast volumes of data generated daily from many sources, such as
business processes, machines, social media platforms, networks, human interactions, and
many more.
● On Facebook alone, users generate approximately a billion messages, press the "Like"
button about 4.5 billion times, and upload more than 350 million new posts each day. Big
data technologies are built to handle such amounts of data.
Variety:

● Big Data can be structured, unstructured, or semi-structured, collected from many
different sources.
● In the past, data was collected mainly from databases and spreadsheets; today it arrives
in a wide array of forms: PDFs, emails, audio, photos, videos, etc.

Veracity:

● Veracity is how accurate or truthful a data set is, i.e., how reliable the data is.
● Because sources vary in quality, the data often has to be filtered, cleaned, or
translated before use.
● Being able to handle and manage data of uncertain quality efficiently is essential for
sound business decisions.
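Filtering unreliable records is one concrete way veracity is enforced. A minimal sketch, assuming hypothetical sensor readings with a required source field and a plausibility range (the field names and thresholds are made up):

```python
def is_trustworthy(record: dict) -> bool:
    # Illustrative veracity checks: required fields must be present
    # and the value must fall inside a plausible range.
    return (
        record.get("sensor") is not None
        and record.get("temp") is not None
        and -50.0 <= record["temp"] <= 60.0
    )

readings = [
    {"sensor": "s1", "temp": 21.5},
    {"sensor": "s2", "temp": 999.0},  # implausible value -> low veracity
    {"sensor": None, "temp": 20.0},   # unknown source -> discard
]

clean = [r for r in readings if is_trustworthy(r)]
print(len(clean))  # 1
```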

Velocity:

● Velocity refers to the speed at which data is created, collected, and processed, often in
real time. It plays an important role compared to the other V's, because data loses value
when it is processed too slowly.

Value:

● Value refers to the usefulness of the collected data: raw data has little worth on its
own until it is turned into insights that benefit the organization.

Components of a Big Data Ecosystem:

1. Ingestion:

● The ingestion layer is the very first step of pulling in raw data.

● It comes from internal sources, relational databases, non-relational databases,
social media, emails, phone calls, etc.

● There are two kinds of ingestion:

▪ Batch, in which large groups of data are gathered and delivered together.
▪ Streaming, which is a continuous flow of data. This is necessary for
real-time data analytics.

2. Storage :

● Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.

● The data lake/warehouse is the most essential component of a big data ecosystem.

● It needs to contain only thorough, relevant data to make insights as valuable as
possible.

● It must be efficient, with as little redundancy as possible, to allow for quicker
processing.

3. Analysis:
● In the analysis layer, data gets passed through several tools, shaping it into
actionable insights.
● There are four types of analytics on big data:
▪ Descriptive: Describes the current state of a business through historical
data.
▪ Diagnostic: Explains why a problem is happening.
▪ Predictive: Projects future results based on historical data.
▪ Prescriptive: Takes predictive analytics a step further by recommending the
best course of action.
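The difference between descriptive and predictive analytics can be shown with a tiny example. The sales figures and the naive projection method below are purely illustrative:

```python
# Monthly sales history (illustrative numbers).
history = [100, 110, 120, 130]

# Descriptive: summarize what has already happened.
average = sum(history) / len(history)

# Predictive: a naive projection -- extend the average month-over-month
# change one step into the future.
changes = [b - a for a, b in zip(history, history[1:])]
forecast = history[-1] + sum(changes) / len(changes)

print(average)   # 115.0
print(forecast)  # 140.0
```

Real predictive analytics uses statistical or machine-learning models rather than a one-line trend, but the distinction is the same: describing the past versus projecting the future from it.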

4. Consumption :

● The final big data component is presenting the information in a format digestible
to the end-user.

● This can be in the forms of tables, advanced visualizations and even single
numbers if requested.

● The most important thing in this layer is making sure the intent and meaning of
the output is understandable.

Importance of Big data:


The importance of Big Data doesn't revolve around the amount of data a company has; it lies
in how the company utilizes the gathered data.
1. Cost Savings:

Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when
they have to store large amounts of data. These tools help organizations in identifying more
effective ways of doing business.

2. Time-Saving

Real-time, in-memory analytics helps companies collect data from various sources. Tools like
Hadoop help them analyze data quickly, supporting fast decisions based on what they learn.

3. Understand the market conditions:

Big Data analysis helps businesses get a better understanding of market conditions. For
example, analyzing customer purchasing behavior helps companies identify the products that
sell best and produce them accordingly. This helps companies get ahead of their competitors.

4. Social media listening: 

Companies can perform sentiment analysis using Big Data tools. This gives them feedback about
their company: who is saying what about it.

5. Boost Customer Acquisition and Retention

Big data analytics helps businesses to identify customer related trends and patterns. Customer
behavior analysis leads to a profitable business.

6. Solve Advertisers Problem and Offer Marketing Insights

Big data analytics shapes all business operations: it enables companies to meet customer
expectations, informs changes to the company's product line, and supports powerful
marketing campaigns.

7. The driver of Innovations and Product Development: Big data enables companies to innovate
and redevelop their products.

Big Data features –security, compliance, auditing and protection

Big Data security:


● Big data security is the collective term for all the measures and tools used to guard both
the data and analytics processes from attacks, theft, or other malicious activities that
could harm or negatively affect them.

● For companies that operate on the cloud, big data security challenges are multi-faceted.
When customers give their personal information to companies, they trust them with
personal data which can be used against them if it falls into the wrong hands.

Big Data Compliance:

● Data compliance is the practice of ensuring that sensitive data is organized and managed
in such a way as to enable organizations to meet enterprise business rules along with
legal and governmental regulations.

● Organizations that don’t comply with these regulations can be fined tens of millions
of dollars and, in some cases, face penalties lasting up to 20 years.

Big Data Auditors:

● Auditors can use big data to expand the scope of their projects and draw comparisons
over larger populations of data.
● Big data also helps financial auditors to streamline the reporting process and detect
fraud.
● These professionals can identify business risks in time and conduct more relevant and
accurate audits.

Big Data protection:

● Data protection is important: organizations that fail to implement the required
regulations can be fined tens of millions of dollars and, in some cases, face penalties
lasting up to 20 years.
