In this guide, you'll learn more about what big data analytics is, why it's important,
and its benefits for many different industries today. You'll also learn about types of
analysis used in big data analytics, find a list of common tools used to perform it,
and find suggested courses that can help you get started on your own data
analytics professional journey.
What is big data analytics?
Big data analytics is the process of collecting, examining, and analyzing large
amounts of data to discover market trends, insights, and patterns that can help
companies make better business decisions. This information is available quickly and
efficiently so that companies can be agile in crafting plans to maintain their
competitive advantage.
Technologies such as business intelligence (BI) tools and systems help organizations
collect structured and unstructured data from multiple sources. Users (typically
employees) input queries into these tools to understand business operations and
performance. Big data analytics applies four data analysis methods to uncover
meaningful insights and derive solutions.
So, what makes data “big”?
Big data is characterized by the five V's: volume, velocity, variety, variability, and
value [1]. It’s complex, so making sense of all of the data in the business requires
both innovative technologies and analytical skills.
Big data analytics is the use of advanced analytic techniques against very large,
diverse data sets that include structured, semi-structured, and unstructured data
from different sources, ranging in size from terabytes to zettabytes.
Big data itself can be defined as data sets whose size or type is beyond the ability
of traditional relational databases to capture, manage, and process with low
latency. Characteristics of big data include high volume, high velocity, and high
variety. Sources of data are becoming more complex than those for traditional data
because they are being driven by artificial intelligence (AI), mobile devices, social
media and the Internet of Things (IoT). For example, the different types of data
originate from sensors, devices, video/audio, networks, log files, transactional
applications, web and social media — much of it generated in real time and at a
very large scale.
With big data analytics, you can ultimately fuel better and faster decision-making,
model and predict future outcomes, and enhance business intelligence.
As you build your big data solution, consider open source software such as Apache
Hadoop, Apache Spark and the entire Hadoop ecosystem as cost-effective, flexible
data processing and storage tools designed to handle the volume of data being
generated today.
VOLUME
Data storage has grown exponentially because data is no longer just text: videos,
music, and large images on our social media channels all contribute. It is now
common for enterprises to maintain terabytes, even petabytes, of storage. As the
data grows, the applications and architecture built to support it need to be
re-evaluated quite often. The same data is also frequently re-analyzed from multiple
angles, and even though the original data is unchanged, the newly found intelligence
creates a further explosion of data. This sheer volume is one defining
characteristic of Big Data.
VELOCITY
Data growth and the social media explosion have changed how we look at data. There
was a time when we considered yesterday's data recent; as a matter of fact,
newspapers still follow that logic. News channels and radio, however, changed how
fast we receive the news, and today people rely on social media to keep them updated
with the latest happenings. On social media, a message that is only a few seconds
old (a tweet, a status update, etc.) may already be of no interest to users, who
discard old messages and pay attention to recent updates. Data movement is now
almost real time, and the update window has shrunk to fractions of a second. This
high-velocity data is another defining characteristic of Big Data.
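That "update window of fractions of a second" is exactly what stream-processing systems have to manage. As a toy illustration (a hypothetical RateTracker class, not any particular product's API), a sliding one-second window over event timestamps can be kept in plain Python:

```python
from collections import deque

class RateTracker:
    """Count how many events arrived within the last `window` seconds."""
    def __init__(self, window=1.0):
        self.window = window
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Evict events that have fallen out of the sliding window.
        cutoff = timestamp - self.window
        while self.events and self.events[0] < cutoff:
            self.events.popleft()

    def rate(self):
        return len(self.events)

tracker = RateTracker(window=1.0)
for t in [0.0, 0.2, 0.5, 0.9, 1.3]:   # event timestamps in seconds
    tracker.record(t)
print(tracker.rate())  # 3 events (0.5, 0.9, 1.3) fall within the last second
```

Real stream processors apply the same windowing idea, but distributed across many machines and many millions of events per second.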
Velocity: Velocity refers to the speed at which data is created, increasingly in
real time. We have moved from simple desktop applications, such as payroll, to
real-time processing applications.
Volume: Volume can be measured in terabytes, petabytes, or even zettabytes.
The Gartner Glossary defines big data as high-volume, high-velocity and/or
high-variety information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight and decision making. For the
sake of easy comprehension, we will look at this definition in three parts. Refer
to Figure 1.4.
VARIETY
Data can be stored in multiple formats: a database, an Excel workbook, a CSV file,
an Access database, or, for that matter, a simple text file. Sometimes the data is
not even in a traditional format; it may come as video, SMS, PDF, or something we
might not have thought about. It is up to the organization to arrange this data and
make it meaningful. That would be easy if all the data were in the same format, but
most of the time it is not. The real world produces data in many different formats,
and that is the challenge we need to overcome with Big Data. This variety of data
is the third defining characteristic of Big Data.
Data generates information and from information we can draw
valuable insight. As depicted in Figure 1.1, digital data can be broadly classified
into structured, semi-structured, and unstructured data.
1. Unstructured data: This is data that does not conform to a data model and is not
in a form that can be used easily by a computer program. About 80% of an
organization's data is in this format; examples include memos, chat logs,
PowerPoint presentations, images, videos, letters, research papers, white papers,
and the body of an email.
2. Semi-structured data: Semi-structured data is also described as having a
self-describing structure. This is data that does not conform to a data model but
has some structure, though still not in a form that can be used easily by a
computer program. About 10% of an organization's data is in this format; examples
include HTML, XML, JSON, and email data.
Figure 1.1: Classification of digital data
3. Structured data: When data follows a pre-defined schema/structure, we say it is
structured data. This is data in an organized form (e.g., rows and columns) that
can be easily used by a computer program. Relationships exist between entities of
the data, such as classes and their objects. About 10% of an organization's data is
in this format. Data stored in databases is an example of structured data.
Variety: Data can be structured, semi-structured, or unstructured. Data stored in a
database is an example of structured data; HTML data, XML data, and email data are
examples of semi-structured data.
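The three classes of digital data can be made concrete with a small sketch. The snippet below is illustrative only; the file contents and field names are invented:

```python
import csv
import io
import json

# Structured: fixed schema of rows and columns (e.g., a database table or CSV).
csv_text = "id,name,amount\n1,Alice,120\n2,Bob,80\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but no rigid schema (e.g., JSON, XML).
json_text = '{"id": 3, "name": "Carol", "tags": ["new", "premium"]}'
record = json.loads(json_text)

# Unstructured: free text; any structure must be inferred by analysis.
email_body = "Hi team, the Q3 numbers look strong. Let's meet Friday."
word_count = len(email_body.split())

print(rows[0]["name"], record["tags"], word_count)  # Alice ['new', 'premium'] 10
```

Note how the CSV and JSON parse directly into usable fields, while the email body yields only what we choose to compute over it; that gap is why roughly 80% of organizational data takes the most effort to analyze.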
1.4 DEFINITION OF BIG DATA
• Big data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity
of, and too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure needed to
support storage, processing, and analysis.
• It is data that is big in volume, velocity, and variety. Refer to Figure 1.3.
Part I of the definition: "Big data is high-volume, high-velocity, and
high-variety information assets" talks about voluminous data (humongous data)
that may have great variety (a good mix of structured, semi-structured, and
unstructured data) and will require a good speed/pace for storage, preparation,
processing, and analysis.
Part II of the definition: "cost-effective, innovative forms of information
processing" talks about embracing new techniques and technologies to capture
(ingest), store, process, persist, integrate, and visualize the high-volume,
high-velocity, and high-variety data.
Part III of the definition: "enhanced insight and decision making" talks about
deriving deeper, richer and meaningful insights and then using these insights to
make faster and better decisions to gain business value and thus a competitive
edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced
business value
Example of big data analytics
For example, big data analytics is integral to the modern health care industry. As
you can imagine, thousands of patient records, insurance plans, prescriptions, and
vaccine information need to be managed. These records comprise huge amounts of
structured and unstructured data, which can offer important insights when analytics
are applied.
Big data analytics does this quickly and efficiently so that health care providers can
use the information to make informed, life-saving diagnoses.
Why is big data analytics important?
Big data analytics is important because it helps companies leverage their data to
identify opportunities for improvement and optimization. Across different business
segments, increasing efficiency leads to overall more intelligent operations, higher
profits, and satisfied customers. Big data analytics helps companies reduce costs and
develop better, customer-centric products and services.
Data analytics helps provide insights that improve the way our society functions. In
health care, big data analytics not only keeps track of and analyzes individual
records, but plays a critical role in measuring COVID-19 outcomes on a global
scale. It informs health ministries within each nation’s government on how to
proceed with vaccinations and devises solutions for mitigating pandemic outbreaks
in the future.
Use big data to stay competitive
Almost eight in 10 respondents (79 percent) believe that “companies that do not
embrace big data will lose their competitive position and may even face
extinction,” according to an Accenture report [2]. In its survey of Fortune 500
companies,
Accenture found that 95 percent of companies with revenues over $10 billion
reported being “highly satisfied” or “satisfied” with their big data-driven business
outcomes [2].
There are quite a few advantages to incorporating big data analytics into a business
or organization. These include:
Cost reduction: Big data can reduce costs in storing all the business data in one
place. Tracking analytics also helps companies find ways to work more efficiently to
cut costs wherever possible.
Product development: Developing and marketing new products, services, or brands
is much easier when based on data collected from customers’ needs and wants. Big
data analytics also helps businesses understand product viability and keep up with
trends.
Strategic business decisions: The ability to constantly analyze data helps
businesses make better and faster decisions, such as cost and supply chain
optimization.
Customer experience: Data-driven algorithms help marketing efforts (targeted ads,
as an example) and increase customer satisfaction by delivering an enhanced
customer experience.
Risk management: Businesses can identify risks by analyzing data patterns and
developing solutions for managing those risks.
Big data in the real world
Big data analytics helps companies and governments make sense of data and make
better, informed decisions.
There are four main types of big data analytics that support and inform different
business decisions.
1. Descriptive analytics
Descriptive analytics summarizes data into a form that can be easily read and
interpreted. It helps create reports and visualizations that detail company profits
and sales.
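In miniature, descriptive analytics is just summarizing what already happened. A toy sketch with invented monthly sales figures:

```python
from statistics import mean

# Invented monthly sales figures (in thousands).
sales = {"Jan": 120, "Feb": 95, "Mar": 140, "Apr": 110}

total = sum(sales.values())             # 465
average = mean(sales.values())          # 116.25
best_month = max(sales, key=sales.get)  # "Mar"

print(f"Total: {total}, Average: {average}, Best month: {best_month}")
```

A real descriptive pipeline computes the same kinds of aggregates, only across billions of rows and feeding dashboards rather than a print statement.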
2. Diagnostic analytics
Diagnostic analytics helps companies understand why a problem occurred. Big data
technologies and tools allow users to mine and recover data that helps dissect an
issue and prevent it from happening in the future.
3. Predictive analytics
Predictive analytics looks at past and present data to make predictions. With
artificial intelligence (AI), machine learning, and data mining, users can analyze
the data to predict market trends.
Example: Within the energy sector, utility companies, gas producers, and pipeline
owners identify factors that affect the price of oil and gas in order to hedge risks.
4. Prescriptive analytics
Prescriptive analytics builds on predictive results to recommend a course of
action: it suggests what a business should do next to reach a desired outcome.
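A minimal predictive-analytics sketch: fitting a least-squares trend line to invented quarterly demand figures and projecting the next quarter. Real predictive systems use far richer models (machine learning, many variables); this only illustrates the idea of extrapolating from past and present data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Invented quarterly demand figures.
quarters = [1, 2, 3, 4]
demand = [100, 110, 125, 135]

a, b = fit_line(quarters, demand)  # a = 87.5, b = 12.0
forecast = a + b * 5               # projected demand for quarter 5
print(forecast)                    # 147.5
```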
Tools used in big data analytics
Harnessing all of that data requires tools. Thankfully, technology has advanced so
that there are many intuitive software systems available for data analysts to use.
Hadoop: An open-source framework that stores and processes big data sets. Hadoop
is able to handle and analyze structured and unstructured data.
Spark: An open-source cluster computing framework used for processing and
analyzing data in real time.
Data integration software: Programs that allow big data to be streamlined across
different platforms, such as MongoDB, Apache Hadoop, and Amazon EMR.
Stream analytics tools: Systems that filter, aggregate, and analyze data that might
be stored in different platforms and formats, such as Kafka.
Distributed storage: Databases that can split data across multiple servers and have
the capability to identify lost or corrupt data, such as Cassandra.
Predictive analytics hardware and software: Systems that process large amounts
of complex data, using machine learning and algorithms to predict future outcomes,
such as fraud detection, marketing, and risk assessments.
Data mining tools: Programs that allow users to search within structured and
unstructured big data.
NoSQL databases: Non-relational data management systems ideal for dealing with
raw and unstructured data.
Data warehouses: Storage for large amounts of data collected from many different
sources, typically using predefined schemas.
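Several of the tools above, Hadoop especially, popularized the MapReduce processing model: a map phase that emits key-value pairs from each document, and a reduce phase that aggregates them by key. The sketch below imitates that model in plain Python; the function names and sample documents are invented, and this is not how Hadoop itself is invoked:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum each group's counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big insights", "data drives decisions"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
result = reduce_phase(pairs)
print(result)  # {'big': 2, 'data': 2, 'insights': 1, 'drives': 1, 'decisions': 1}
```

The appeal of the model is that the map calls are independent, so a framework can scatter them across a cluster and only the per-key aggregation requires coordination.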