
UNIT-1

INTRODUCTION:

INTRODUCTION TO BIG DATA:

What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.

What is Big Data?


Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, Big Data is still data, just at an enormous scale.

What is an Example of Big Data?

1. The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.

2. Social media: statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.

3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Types Of Big Data


Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. However, we are now facing issues as the size of such data grows to a huge extent, with typical sizes reaching the range of multiple zettabytes.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data
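As a hedged illustration of how fixed-format data behaves, the following Python sketch uses the built-in sqlite3 module; the Employee table, its columns and the sample rows are hypothetical, chosen only to mirror the example above.

import sqlite3

# In-memory database purely for illustration; the Employee schema below is made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept    TEXT,
        salary  REAL
    )
""")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 'Sales', 52000.0)")
conn.execute("INSERT INTO Employee VALUES (2, 'Ravi', 'IT', 61000.0)")

# Because the format is fixed and known in advance, querying is straightforward.
for row in conn.execute("SELECT name, dept FROM Employee WHERE salary > 55000"):
    print(row)  # -> ('Ravi', 'IT')

conn.close()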

Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, they often do not know how to derive value from it, since the data is in its raw, unstructured form.

Examples Of Un-structured Data

The output returned by ‘Google Search’

Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data may appear structured in form, but it is not defined by a fixed schema such as a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file, for example.
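A minimal sketch of what such a file might contain and how it can be read in Python with the standard xml.etree.ElementTree module; the tag names and values are hypothetical.

import xml.etree.ElementTree as ET

# Hypothetical personal data: the tags make the structure self-describing,
# but there is no fixed relational schema behind it.
xml_data = """
<persons>
    <person><name>Asha</name><age>29</age><city>Hyderabad</city></person>
    <person><name>Ravi</name><age>34</age><city>Chennai</city></person>
</persons>
"""

root = ET.fromstring(xml_data)
for person in root.findall("person"):
    print(person.findtext("name"), person.findtext("age"), person.findtext("city"))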

Characteristics Of Big Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability

(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value that can be derived from it. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.


Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by
most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of
unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data.

Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The
flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.

What is a big data platform?

The constant stream of information from various sources is becoming more intense[4], especially
with the advance in technology. And this is where big data platforms come in to store and
analyze the ever-increasing mass of information.

A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves all
the data needs of a business regardless of the volume and size of the data at hand. Due to their
efficiency in data management, enterprises are increasingly adopting big data platforms to gather
tons of data and convert them into structured, actionable business insights[5].

Currently, the marketplace is flooded with numerous Open source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.

Characteristics of a big data platform

Any good big data platform should have the following important features:

 Ability to accommodate new applications and tools as business needs evolve

 Support for several data formats

 Ability to accommodate large volumes of streaming or at-rest data

 A wide variety of conversion tools to transform data into different preferred formats

 Capacity to accommodate data arriving at any speed

 Tools for searching through massive data sets

 Support for linear scaling

 Ability to be deployed quickly

 Tools to meet data analysis and reporting requirements

Big data platform examples


Here are some big data platforms that can help manage petabytes of data and provide actionable insights:

Apache Hadoop

Cloudera

Amazon Web Services

Oracle

Snowflake

CHALLENGES OF CONVENTIONAL SYSTEMS:

1. ‘Analytics' has been used in the business intelligence world to provide tools and intelligence to gain insight into data.
2. Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data.
3. How can all the hidden information be unearthed from such vast amounts of data?

COMMON CHALLENGES:

1. Conventional systems cannot work on unstructured data efficiently

2. They are built on top of the relational data model
3. They are batch oriented, and we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained
4. Parallelism in a traditional analytics system is achieved through costly hardware such as MPP (Massively Parallel Processing) systems
5. Inadequate support for aggregated summaries of data

Data Challenges:
• Volume, Velocity, Variety & Veracity

• Data discovery and comprehensiveness

• Scalability

• Storage issues

Process Challenges:

• Capturing data

• Aligning data from different sources

• Transforming data into suitable form for data analysis

• Modeling data(mathematically, simulation)

• Understanding output, visualizing results and display issues on mobile devices

Management Challenges:

• Security

• Privacy

• Governance

• Ethical issues

Traditional / RDBMS:

• Designed to handle well-structured data

• Traditional storage vendor solutions are very expensive

• Shared block-level storage is too slow

• Data is read in 8 KB or 16 KB block sizes

• Schema-on-write requires data to be validated before it can be written to disk

• Software licenses are too expensive
• Getting data from disk and loading it into memory requires work by the application

Solution constraints:

• Inexpensive storage

• A data platform that could handle large volumes of data and be linearly scalable at cost and performance

• A highly parallel processing model that was highly distributed to access and compute the data very fast

• A data repository that could break down the silos and store structured, semi-structured, and unstructured data to make it easy to correlate and analyze the data together

The nature of data:

Data is the plural of datum, so it is traditionally treated as plural. We can find data in all the situations of the world around us, whether structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity.
According to the Oxford English Dictionary:

Data are known facts or things used as basis for inference or reckoning.
Categorical data are values or observations that can be sorted into groups or categories. There are
two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic
ordering to its categories. For example, housing is a categorical variable having two categories
(own and rent). An ordinal variable has an established ordering. For example, age as a variable
with three orderly categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of
numerical values, discrete and continuous. Discrete data are values or observations that can be
counted and are distinct and separate. For example, number of lines in a code. Continuous data
are values or observations that may take on any value within a finite or infinite interval. For
example, an economic time series such as historic gold prices.
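These distinctions can be made concrete with a short pandas sketch; the column names and values below are invented purely for illustration.

import pandas as pd

df = pd.DataFrame({
    "housing": ["own", "rent", "rent", "own"],            # nominal: no intrinsic order
    "age_group": ["young", "adult", "elder", "adult"],    # ordinal: ordered categories
    "lines_of_code": [120, 85, 300, 42],                  # discrete numerical
    "gold_price": [1812.5, 1820.1, 1798.9, 1805.3],       # continuous numerical
})

# Declare the categorical columns explicitly; only the ordinal one carries an ordering.
df["housing"] = pd.Categorical(df["housing"])
df["age_group"] = pd.Categorical(
    df["age_group"], categories=["young", "adult", "elder"], ordered=True)

print(df.dtypes)
print(df["age_group"].min())  # the ordering makes comparisons meaningful -> 'young'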

The kinds of datasets used in this book are as follows:

 E-mails (unstructured, discrete)

 Digital images (unstructured, discrete)

 Stock market logs (structured, continuous)

 Historic gold prices (structured, continuous)

 Credit approval records (structured, discrete)

 Social media friends and relationships (unstructured, discrete)

 Tweets and trending topics (unstructured, continuous)

 Sales records (structured, continuous)

INTELLIGENT DATA ANALYSIS (IDA):

Data:
Data is nothing but things known or anything that is assumed; facts from which conclusions can be
gathered.

Data Analysis

 Breaking up of any data into parts, i.e., the examination of these parts to know about their nature, proportion, function, interrelationship, etc.
 A process in which the analyst moves laterally and recursively between three modes: describing data (profiling, correlation, summarizing), assembling data (scrubbing, translating, synthesizing, filtering) and creating data (deriving, formulating, simulating).
 It is the process of making sense of data: finding and identifying the meaning of data (a small sketch of the three modes follows this list).
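A small sketch of the three modes using pandas; the dataset, column names and values are made up for illustration only.

import pandas as pd

# A tiny, invented sales dataset.
df = pd.DataFrame({
    "region": ["North", "South", "North", None, "South"],
    "units":  [10, 7, 12, 5, None],
    "price":  [2.5, 3.0, 2.5, 3.5, 3.0],
})

# Describing data: profiling and summarizing.
print(df.describe(include="all"))

# Assembling data: scrubbing (dropping incomplete rows) and filtering.
clean = df.dropna()
north = clean[clean["region"] == "North"]

# Creating data: deriving a new quantity from existing columns.
clean = clean.assign(revenue=clean["units"] * clean["price"])
print(clean.groupby("region")["revenue"].sum())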

Data Visualization

 It is a process of revealing already existing data and/or its features (origin, metadata, allocation), which includes everything from tables to charts and multidimensional animation (Min Yao, 2014).
 It helps to form a mental image of something that is not directly visible.
 Visual data analysis is another form of data analysis, in which some or all forms of data visualization may be used to give visual feedback to the analyst. Visual cues such as charts, interactive browsing, and workflow process cues help the analyst move through the modes of data analysis.
 The main advantage of visual representations is in discovering, making sense of, and communicating data. Data visualization is a central part of and an essential means to carry out data analysis; once the important meanings have been identified and understood, it is easy to communicate them to others (a small charting sketch follows this list).
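A minimal charting sketch using matplotlib; the monthly gold-price figures are invented purely to show the idea of visual comparison.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
prices = [1805, 1790, 1830, 1860, 1842, 1875]  # invented values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# A line chart makes the trend over time visible at a glance.
ax1.plot(months, prices, marker="o")
ax1.set_title("Price trend")

# A bar chart supports quick month-to-month comparison.
ax2.bar(months, prices)
ax2.set_title("Month-by-month comparison")

plt.tight_layout()
plt.show()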

Importance of IDA:

Intelligent Data Analysis (IDA) is one of the major issues in artificial intelligence and information. Intelligent data analysis discloses hidden facts that are not known previously and provides potentially important information or facts from large quantities of data (White, 2008). It also supports decision making. Based mainly on machine learning, artificial intelligence, pattern recognition, and recording and visualization technology, IDA helps to obtain useful information, necessary data and interesting models from the large amounts of data available online in order to make the right choices.

Intelligent data analysis helps to solve a problem that is already solved as a matter of routine. If the data is
collected for the past cases together with the result that was finally achieved, such data can be used to
revise and optimize the presently used strategy to arrive at a conclusion.

In certain cases, when some questions arise for the first time and only a little knowledge about them is available, data from related situations can help us solve the new problem, or unknown relationships can be discovered from the data to gain knowledge in an unfamiliar area.

Steps Involved In IDA:

IDA, in general, includes three stages: (1) Preparation of data; (2) data mining; (3) data validation and
explanation (Keim & Ward, 2007). The preparation of data involves opting for the required data from the
related data source and incorporating it into a data set that can be used for data mining.
The main goal of intelligent data analysis is to obtain knowledge. Data analysis is the process of a
combination of extracting data from data set, analyzing, classification of data, organizing, reasoning, and so
on. It is challenging to choose suitable methods to resolve the complexity of the process.

Regarding the term visualization, we have moved away from it in favour of the term charting. The term analysis is used for the process of incorporating, manipulating, filtering and scrubbing the data, which certainly includes, but is not limited to, interacting with the data through charts.

The Goal of Data Analysis:

Data analysis need not essentially involve arithmetic or statistics. While it is true that analysis often involves one or both, and that many analytical pursuits cannot be handled without them, much of the data analysis that people perform in the course of their work involves mathematics no more complicated than calculating the mean of a set of values. The essential activity of analysis is comparison (of values, patterns, etc.), which can often be done simply by using our eyes.

The aim of analysis is not merely to find appealing information in the data; that is only a vital part of the process (Berthold & Hand, 2003). The aim is to make sense of the data (i.e., to understand what it means) and then to make decisions based on the understanding that is achieved. Information in and of itself is not useful. Even understanding information in and of itself is not useful. The aim of data analysis is to make better decisions.

The process of data analysis starts with the collection of data that can add to the solution of any given
problem, and with the organization of that data in some regular form. It involves identifying and applying a
statistical or deterministic schema or model of the data that can be manipulated for explanatory or
predictive purposes. It then involves an interactive or automated solution that explores the structured
data in order to extract information – a solution to the business problem – from the data.

The Goal of Visualization 

The basic idea of visual data mining is to present the data in some visual form, allowing the user to gain
insight into the data, draw conclusions, and directly interact with the data. Visual data analysis techniques
have proven to be of high value in exploratory data analysis. Visual data mining is mainly helpful when only little is known about the data and the exploration goals are vague.

The main advantages of visual data exploration over automatic data analysis methods are:

 Visual data examination can easily deal with highly non-homogeneous and noisy data.
 Visual data exploration is intuitive and requires no knowledge of complex mathematical or statistical algorithms or parameters.
 Visualization can present a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis. Accordingly, visual data examination usually allows a quicker data investigation and often provides interesting results, especially in cases where automatic algorithms fail.
 Visual data examination techniques provide a much higher degree of confidence in the findings of the exploration.

ANALYTIC PROCESS AND TOOLS:


As technology advances, the demand to track data is increasing rapidly. Today, almost 2.5 quintillion bytes of data are generated globally every day, and that data is of little use until it is organized into a proper structure. It has become crucial for businesses to collect meaningful data from the market, and all it takes is the right data analytics tool and a professional data analyst to turn a huge amount of raw data into insight on which a company can base the right approach.

1. APACHE Hadoop

It is a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows data to be processed efficiently and in parallel. It can process both structured and unstructured data and scale from a single server to many machines. Hadoop also offers cross-platform support for its users. Today it is one of the most widely used big data analytics tools, adopted by many tech giants such as Amazon, Microsoft, IBM, etc. (a small word-count sketch in the Hadoop Streaming style follows the feature list below).
Features of Apache Hadoop:
 Free to use and offers an efficient storage solution for businesses.
 Offers quick access via HDFS (Hadoop Distributed File System).
 Highly flexible and can easily be used with data sources such as MySQL and JSON.
 Highly scalable, as it can split large amounts of data into small segments and distribute them.
 It works on inexpensive commodity hardware such as JBOD, i.e. just a bunch of disks.
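As a hedged sketch of Hadoop's processing style, the classic word count can be written as two small Python scripts and run through Hadoop Streaming; the file names, input/output paths and the streaming jar location below are assumptions that vary by installation.

# mapper.py -- reads raw text lines from standard input and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- Hadoop delivers the mapper output sorted by key, so counts can be
# accumulated one word at a time.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

A typical (installation-dependent) invocation would resemble: hadoop jar hadoop-streaming.jar -input /data/books -output /data/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py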

2. Cassandra

APACHE Cassandra is an open-source NoSQL distributed database that is used to manage large amounts of data. It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook in 2008 and later released as open source (a short Python-driver sketch follows the feature list below).
Features of APACHE Cassandra:
 Data Storage Flexibility: It supports all forms of data i.e. structured, unstructured, semi-structured, and
allows users to change as per their needs.
 Data Distribution System: Easy to distribute data with the help of replicating data on multiple data centers.
 Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast
storage and data processing.
 Fault tolerance: if any node fails, it is replaced without any delay.
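A short sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, table and column names are assumptions made for this example.

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id text,
        reading_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, reading_time)
    )
""")

# Writes and reads use CQL; rows are distributed across the cluster by the
# partition key (sensor_id), which is what gives Cassandra its scalability.
session.execute(
    "INSERT INTO sensor_readings (sensor_id, reading_time, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("engine-1", 87.4),
)
for row in session.execute(
        "SELECT * FROM sensor_readings WHERE sensor_id = %s", ("engine-1",)):
    print(row.sensor_id, row.reading_time, row.value)

cluster.shutdown()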

3. Qubole

It is an open-source big data tool that helps in extracting value from data along the data value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end services, reducing the time and effort required to move data pipelines. It is capable of running on multiple clouds such as AWS, Azure, and Google Cloud. Besides, it is also claimed to lower the cost of cloud computing by up to 50%.
Features of Qubole:
 Supports ETL process: It allows companies to migrate data from multiple sources in one place.
 Real-time Insight: It monitors user’s systems and allows them to view real-time insights
 Predictive Analysis: Qubole offers predictive analysis so that companies can take actions accordingly for
targeting more acquisitions.
 Advanced Security System: To protect users’ data in the cloud, Qubole uses an advanced security system and aims to protect against any future breaches. Besides, it also allows cloud data to be encrypted against any potential threat.

4. Xplenty

It is a data analytics tool for building data pipelines with minimal coding. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone and virtual meetings. Xplenty is a platform to process data for analytics over the cloud and brings all the data together.
Features of Xplenty:
 Rest API: A user can possibly do anything by implementing Rest API
 Flexibility: Data can be sent to, and pulled from, databases, warehouses, and Salesforce.
 Data Security: It offers SSL/TLS encryption and the platform is capable of verifying algorithms and certificates regularly.
 Deployment: It offers integration apps for both cloud & in-house and supports deployment to integrate apps
over the cloud.

5. Spark

APACHE Spark is another framework that is used to process data and perform numerous tasks on a large scale. It processes data across multiple computers with the help of distributed computing tools. It is widely used among data analysts because it offers easy-to-use APIs with simple data-pulling methods, and it is capable of handling multiple petabytes of data as well. Spark famously set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record (71 minutes). This is why big tech giants are now moving towards Spark, and it is highly suitable for ML and AI workloads today (a short PySpark sketch follows the feature list below).
Features of APACHE Spark:
 Ease of use: It allows users to work in their preferred language (Java, Python, etc.).
 Real-time Processing: Spark can handle real-time streaming via Spark Streaming.
 Flexible: It can run standalone, on Mesos, on Kubernetes, or in the cloud.
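A minimal PySpark sketch; the CSV file name and the column names are assumptions for illustration.

from pyspark.sql import SparkSession  # pip install pyspark

spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# The assumed sales.csv has at least 'region' and 'amount' columns.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The aggregation is split into tasks and executed in parallel across the
# cluster's executors; on a single laptop it simply runs on local threads.
sales.groupBy("region").sum("amount").show()

spark.stop()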

6. Mongo DB

MongoDB, which came into the limelight in 2010, is a free, open-source, document-oriented (NoSQL) database that is used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are the basic unit of MongoDB. It is popular among developers because of its drivers for multiple programming languages such as Python, JavaScript, and Ruby (a small pymongo sketch follows the feature list below).
Features of Mongo DB:
 Written in C++: It’s a schema-less DB and can hold varieties of documents inside.
 Simplifies Stack: With the help of mongo, a user can easily store files without any disturbance in the stack.
 Master-slave replication: data can be written to and read from the master node, and replicas can be used for backup.
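A small pymongo sketch; the connection string, database and collection names are assumptions for this example.

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017/")  # assumed local deployment
orders = client["shop"]["orders"]

# Documents are key-value structures; two documents in the same collection
# need not share the same fields (schema-less storage).
orders.insert_one({"customer": "Asha", "items": ["pen", "notebook"], "total": 120})
orders.insert_one({"customer": "Ravi", "total": 450, "coupon": "NEW50"})

for doc in orders.find({"total": {"$gt": 100}}):
    print(doc["customer"], doc["total"])

client.close()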

7. Apache Storm

Apache Storm is a robust, user-friendly tool used for data analytics, especially in smaller companies. The best part about Storm is that it has no programming-language barrier and can support any language. It was designed to handle pools of large data in a fault-tolerant and horizontally scalable manner. When it comes to real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, which is why many tech giants today use APACHE Storm in their systems. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
Features of Storm:
 Data Processing: Storm processes data even if a node gets disconnected
 Highly Scalable: It maintains performance even as the load increases
 Fast: APACHE Storm is extremely fast and can process up to 1 million messages of 100 bytes on a single node.

8. SAS

Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract or update data in different variants from different sources. The Statistical Analysis System (SAS) allows a user to access data in many formats (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to get a stronger grip on AI and ML, it has introduced new tools and products.

Features of SAS:
 Flexible Programming Language: It offers easy-to-learn syntax and also has vast libraries, which make it suitable for non-programmers.
 Vast Data Format: It provides support for many programming languages, including SQL, and carries the ability to read data from any format.
 Encryption: It provides end-to-end security with a feature called SAS/SECURE.

9. Data Pine

Datapine is an analytics tool used for business intelligence (BI) and was founded in 2012 in Berlin, Germany. In a short period of time it has gained popularity in a number of countries, and it is mainly used for data extraction and close monitoring by small and medium-sized companies. With the help of its enhanced UI design, anyone can view and check the data as per their requirements. It is offered in four different price brackets, starting from $249 per month, with dashboards organized by function, industry, and platform.
Features of Datapine:
 Automation: To cut down on manual work, datapine offers a wide array of AI assistant and BI tools.
 Predictive Tool: datapine provides forecasting/predictive analytics; using historical and current data, it derives future outcomes.
 Add on: It also offers intuitive widgets, visual analytics & discovery, ad hoc reporting, etc.

10. Rapid Miner

It is a fully automated visual workflow design tool used for data analytics. It is a no-code platform, so users are not required to write code to work with their data. Today it is heavily used in many industries such as ed-tech, training, and research. Though it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).
Features of Rapid Miner:
 Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL
 Storage: Users can access cloud storage facilities such as AWS and Dropbox
 Data validation: RapidMiner enables the visual display of multiple results in history for better evaluation.

ANALYSIS VS REPORTING:

The terms reporting and analytics are often used interchangeably. This is not surprising since both take in data as
“input” — which is then processed and presented in the form of charts, graphs, or dashboards.

Reports and analytics help businesses improve operational efficiency and productivity, but in different ways. While
reports explain what is happening, analytics helps identify why it is happening. Reporting summarizes and organizes
data in easily digestible ways while analytics enables questioning and exploring that data further. It provides
invaluable insights into trends and helps create strategies to help improve operations, customer satisfaction, growth,
and other business metrics.

Reporting and analysis are both important for an organization to make informed decisions by presenting data in a
format that is easy to understand. In reporting, data is brought together from different sources and presented in an
easy-to-consume format. Typically,  modern reporting apps today offer next-generation dashboards with high-level
data visualization capabilities. There are several types of reports being generated by companies including financial
reports, accounting reports, operational reports, market reports, and more. This helps understand how each function
is performing at a glance. But for further insights, it requires analytics.

Analytics enables business users to cull out insights from data, spot trends, and help make better decisions. Next-
generation analytics takes advantage of emerging technologies like AI, NLP, and machine learning to offer predictive
insights based on historical and real-time data.

To run analytics, reporting is not necessary.

For instance, let us take a look at a manufacturing company that uses Oracle ERP to manage various functions
including accounting, financial management, project management,  procurement, and supply chain. For business
users, it is critical to have a finger on the pulse of all key data. Additionally, specific teams need to periodically
generate reports and present data to senior management and other stakeholders. In addition to reporting, it is also
essential to analyze data from various sources and gather insights. The problem today is people are using reporting
and analytics interchangeably. When the time comes to replace an end-of-life operational reporting tool, they are
using solutions that are designed for analytics. This would be a waste of time and resources.

It is critical that operational reporting is done using a tool built for that purpose. Ideally, it’ll be a self-service tool so
business users don’t have to rely on IT to generate reports. It must have the ability to drill down into several layers of
data when needed. Additionally, if you’re using Oracle ERP you need an operational reporting tool like Orbit that
seamlessly integrates data from various business systems – both on-premise and cloud. The rest of this section looks at the nuances of both operational reporting and analytics and why it is critical to have the right tools for the right tasks.

Steps Involved in Building a Report and Preparing Data for Analytics
To build a report, the steps involved broadly include:

 Identifying the business need

 Collecting and gathering relevant data

 Translating the technical data

 Understanding the data context

 Creating reporting dashboards

 Enabling real-time reporting

 Offering the ability to drill down into reports


For data analytics, the steps involved include:

 Creating a data hypothesis

 Gathering and transforming data

 Building analytical models to ingest data, process it and offer insights

 Using tools for data visualization, trend analysis, deep dives, etc.
 Using data and insights to make decisions (a short sketch contrasting reporting and analytics follows this list)
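To make the contrast concrete, here is a hedged pandas sketch in which the first part produces a report-style summary of what happened, while the second part digs into why by relating two variables; the column names and figures are invented.

import pandas as pd

orders = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "region":   ["East", "West", "East", "West", "East", "West"],
    "revenue":  [120, 95, 110, 80, 150, 70],
    "ad_spend": [20, 10, 18, 8, 30, 6],
})

# Reporting: summarize and organize what is happening.
report = orders.pivot_table(index="month", columns="region",
                            values="revenue", aggfunc="sum")
print(report)

# Analytics: question the data further -- is revenue related to ad spend?
print(orders["revenue"].corr(orders["ad_spend"]))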
Five Key Differences Between Reporting and Analysis

One of the key differences between reporting and analytics is that, while a report involves organizing data into summaries, analysis involves inspecting, cleaning, transforming, and modeling that data to gain insights for a specific purpose. Knowing the difference between the two is essential to fully benefit from the potential of both without missing out on key features of either one. Some of the key differences include:

1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to gain an understanding of the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easy to understand. Analysis means being able to interpret data at a deeper level and provide recommendations on actions.

2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting, and summarizing. It requires clean raw data, and reports may be generated periodically, such as daily, weekly, monthly, quarterly, and yearly. Analytics includes asking questions, examining, comparing, interpreting, and confirming. Enriching the data with big data can help predict future trends as well.

3. The Final Output: In the case of reporting, outputs such as canned reports, dashboards, and alerts push information to users. Through analysis, analysts try to extract answers using business queries and present them in the form of ad hoc responses, insights, recommended actions, or forecasts. Understanding this key difference can help businesses leverage analytics better.

4. People: Reporting involves repetitive tasks that can be automated. It is often used by functional business heads who monitor specific business metrics. Analytics requires customization and therefore depends on data analysts and scientists. It is also used by business leaders to make data-driven decisions.

5. Value Proposition: This is like comparing apples to oranges. Reporting and analytics serve different purposes. By understanding the purpose of each and using them correctly, businesses can derive immense value from both.
