BDA Unit 1

CCS334 BIG DATA ANALYTICS
UNIT I UNDERSTANDING BIG DATA

Introduction to big data – convergence of key trends – unstructured data – industry examples of big
data – web analytics – big data applications– big data technologies – introduction to Hadoop – open
source technologies – cloud and big data – mobile business intelligence – Crowd sourcing analytics –
inter and trans firewall analytics.
Introduction to big data

What is Big Data?
Big Data can be defined as a volume of data so large that it cannot be stored or processed with standard
data storage and processing equipment. A massive amount of data is produced every day, and interpreting and
processing such complex, expansive datasets manually is next to impossible. Modern tools and expert skills
are required to interpret large volumes of data and turn them into valuable insights that help
businesses grow.
Importance of Big Data
The importance of Big Data lies not in how much data there is, but in how it can be used. Data can be taken
from various sources and analyzed to find answers that enable:
 Reduction in cost.
 Time reductions.
 New product development with optimized offers.
 Well-informed decision making.
Different Types of Big Data
Big Data is categorized by type according to how the data generated daily is organized. Primarily there
are 3 types of data in analytics. These types are explained below with examples:
1. Structured Data: Any data that can be processed, is easily accessible, and can be stored in a fixed format
is called structured data. In Big Data, structured data is the easiest to work with because its fields and
their formats are defined in advance. Examples of structured data are:
 Address
 Age
 Credit/debit card numbers
 Contact
 Expenses
 Billing
2. Unstructured Data: Unstructured data in Big Data consists of large numbers of files with no predefined
format (images, audio, logs, and video). This form of data is considered complex because it lacks a known
structure and is relatively huge in size. A typical example of unstructured data is the output returned by
‘Google Search’ or ‘Yahoo Search’.
3. Semi-structured Data: In Big Data, semi-structured data is a combination of structured and unstructured
data. It has some features of structured data but also contains information that does not adhere to any
formal data model or relational database schema. Examples of semi-structured data include XML and
JSON.
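The difference between structured and semi-structured data can be illustrated with a short Python sketch. The records and field names below are made up for illustration only: the structured rows all share the same fixed fields, while the JSON records share some keys and add or omit others.

```python
import json

# Structured data: every record has the same fixed fields (like a table row).
structured_rows = [
    {"name": "Asha", "age": 31, "city": "Chennai"},
    {"name": "Ravi", "age": 45, "city": "Madurai"},
]

# Semi-structured data: JSON records share some keys but may add or omit others.
semi_structured_text = """
[
  {"name": "Asha", "age": 31, "contacts": {"email": "asha@example.com"}},
  {"name": "Ravi", "tags": ["premium", "newsletter"]}
]
"""

records = json.loads(semi_structured_text)


def field_names(rows):
    """Return the sorted list of top-level keys used by each record."""
    return [sorted(row.keys()) for row in rows]


print(field_names(structured_rows))  # the same keys in every row
print(field_names(records))          # keys differ from record to record
```

A relational table would need a fixed column list for the first case; in the second case each record carries its own description of what it contains.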
Characteristics of Big Data
General characteristics of Big Data are commonly referred to as the five Vs: Volume, Velocity, Variety,
Veracity, and Value. A sixth, Variability, is often added. They are explained below:
 Volume: Volume is the size of a dataset processed and stored in the Big Data System and is known to
be its most important and prominent feature. The size of data usually ranges from petabytes to
exabytes and is processed with advanced processing technology.
 Velocity: Velocity is the rate at which data accumulates, which also helps analysts determine whether it
falls under the classification of regular data or Big Data. A high velocity means that more data becomes
available with each new set and that the data must be processed at a high rate. Data often needs
real-time evaluation, which requires well-integrated systems for handling the amount and pace of
generated data.
 Variety: Variety is the range of data types and formats and the way the data is organized and made ready
to be processed. Big Data may be structured, semi-structured, or unstructured.
 Veracity: Veracity is the quality and reliability of the data concerned. Unreliable data undermines the
trustworthiness of Big Data, especially when the data is updated in real time. Therefore, data authenticity
requires regular checks at every level of collection and processing.
 Value: Value is the usefulness of the data that is collected and processed. More than the amount
of data, the value of that data is what matters for acquiring insights.
 Variability: Variability is the inconsistency that the data can show at times, which has to be handled
before the data can be used for actionable purposes.
Benefits of Big Data
Collecting, processing, analyzing, and storing Big Data has several benefits that meet the needs of
modern businesses. Some of the benefits of Big Data are as follows:
1. Predictive analysis: This is one of the most significant benefits of Big Data because it directly supports
business growth through forecasting, better decision-making, maximum operational efficiency, and
risk mitigation.
2. Enhanced business growth: With data analysis tools, businesses across the globe have improved their
digital marketing strategies using data acquired from social media platforms.
3. Time and cost saving: Big Data collects and stores data from varied sources to produce actionable
insights. Companies can save money and time by using advanced analytics tools to filter out
unusable or irrelevant data.
4. Increase profit margin: With the help of different types of Big Data analytics, companies can increase
revenue through more sales leads. They can also determine how their products and services are faring on
the market and how customers are receiving them. This can help them make more informed decisions
about the areas that require investment of time and resources.
Challenges of Big Data
 Rapid Data Growth: Data grows at such a high rate that it is hard to extract insights from it.
There is no 100% efficient way to filter out the relevant data.
 Storage: The generation of such a massive amount of data needs space for storage, and organizations
face challenges to handle such extensive data without suitable tools and technologies.
 Unreliable Data: It cannot be guaranteed that the big data collected and analyzed are totally (100%)
accurate. Redundant data, contradicting data, or incomplete data are challenges that remain within it.
 Data Security: Firms and organizations storing such massive data (of users) can be a target of
cybercriminals, and there is a risk of data getting stolen. Hence, encrypting such colossal data is also a
challenge for firms and organizations.
Unstructured Data
What is Unstructured Data?

Unstructured data is data that does not conform to a data model and has no easily identifiable
structure, so it cannot be used easily by a computer program. Unstructured data is not organised in a
pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream
relational database.
Characteristics of Unstructured Data:
 Data neither conforms to a data model nor has any structure.
 Data cannot be stored in the form of rows and columns as in databases.
 Data does not follow any semantics or rules.
 Data lacks any particular format or sequence.
 Data has no easily identifiable structure.
 Due to the lack of an identifiable structure, it cannot be used by computer programs easily.
Sources of Unstructured Data:
 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Memos
 Reports
 Word documents and PowerPoint presentations
 Surveys
Advantages of Unstructured Data:
 It supports data that lacks a proper format or sequence.
 The data is not constrained by a fixed schema.
 It is very flexible due to the absence of a schema.
 The data is portable.
 It is very scalable.
 It can deal easily with the heterogeneity of sources.
 This type of data has a variety of business intelligence and analytics applications.
Disadvantages of Unstructured Data:
 It is difficult to store and manage unstructured data due to the lack of schema and structure.
 Indexing the data is difficult and error-prone due to the unclear structure and the absence of
pre-defined attributes. As a result, search results are not very accurate.
 Ensuring the security of the data is a difficult task.
Problems faced in storing unstructured data:
 It requires a lot of storage space to store unstructured data.
 It is difficult to store videos, images, audio files, etc.
 Due to the unclear structure, operations like update, delete, and search are very difficult.
 Storage cost is high compared to structured data.
 Indexing the unstructured data is difficult.
Possible solutions for storing Unstructured data:
 Unstructured data can be converted to easily manageable formats.
 A Content Addressable Storage (CAS) system can be used to store unstructured data. It stores data
based on its metadata and assigns a unique name to every object stored in it. An object is retrieved
based on its content, not its location.
 Unstructured data can be stored in XML format.
 Unstructured data can be stored in an RDBMS that supports BLOBs.
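The idea behind content addressable storage can be sketched in a few lines of Python. This is only a toy in-memory illustration, not a real CAS product: the class name and the choice of SHA-256 as the unique name of an object are assumptions made for the example.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store: the key of an object is the hash of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, not from a location.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data  # identical content is stored only once
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self):
        return len(self._objects)


store = ContentAddressableStore()
a1 = store.put(b"scanned memo, page 1")
a2 = store.put(b"scanned memo, page 1")  # same content, same address
a3 = store.put(b"holiday photo bytes")

print(a1 == a2, len(store))
```

Because the name is computed from the content, storing the same memo twice does not create a second copy, and any object can be fetched without knowing where it physically lives.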
Web Analytics
What is Web Analytics?
Web analytics is the gathering, synthesizing, and analysis of website data with the goal of improving the
website user experience. It’s a practice that’s useful for managing and optimizing websites, web
applications, or other web products. It’s highly data-driven and assists in making high-quality website
decisions. Web analytics is helpful for understanding which channels users come through to your website.
You can also identify popular site content by calculating the average length of stay on your web pages and
observing how users interact with them, including which pages prompt users to leave.
The process of web analytics involves:
1. Setting business goals: Defining the key metrics that will determine the success of your business and
website
2. Collecting data: Gathering information, statistics, and data on website visitors using analytics tools
3. Processing data: Converting the raw data you’ve gathered into meaningful ratios, KPIs, and other
information that tell a story
4. Reporting data: Displaying the processed data in an easy-to-read format
5. Developing an online strategy: Creating a plan to optimize the website experience to meet business goals
6. Experimenting: Doing A/B tests to determine the best way to optimize website performance
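The "processing data" step above can be sketched as follows. The page-view log, its field layout, and the chosen KPIs (bounce rate and average session length) are assumptions made for illustration; real analytics tools define and collect these in their own ways.

```python
from collections import defaultdict

# Hypothetical raw page-view log: (session_id, page, seconds_on_page)
page_views = [
    ("s1", "/home", 30),
    ("s1", "/pricing", 90),
    ("s2", "/home", 10),
    ("s3", "/blog", 45),
    ("s3", "/home", 15),
    ("s3", "/pricing", 60),
]


def summarize(views):
    """Turn raw page views into a few session-level KPIs."""
    pages_per_session = defaultdict(int)
    seconds_per_session = defaultdict(int)
    for session_id, _page, seconds in views:
        pages_per_session[session_id] += 1
        seconds_per_session[session_id] += seconds

    sessions = len(pages_per_session)
    # A "bounce" is taken here to mean a session with exactly one page view.
    bounces = sum(1 for n in pages_per_session.values() if n == 1)
    return {
        "sessions": sessions,
        "bounce_rate": bounces / sessions,
        "avg_session_seconds": sum(seconds_per_session.values()) / sessions,
    }


kpis = summarize(page_views)
print(kpis)
```

The sketch assumes at least one session is present; an empty log would need a guard against division by zero.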
Why Are Web Analytics Important?

Web analytics have a wide range of applications from marketing to product optimization. In all cases, web
analytics allow businesses to make decisions based on data as opposed to user research or gut feeling.
What Are the Risks of Web Analytics?
The main concerns around web analytics come from misuse of data. There has been mounting concern that
companies have been using data in ways that they’re not supposed to, which in turn negatively affects people.
Applications of Big Data
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (such as Amazon, Walmart,
Big Bazaar, etc.), the management team keeps data on customers’ spending habits (which products they spend
on, which brands they prefer, how frequently they buy), shopping behavior, and most-liked products (so that
those products can be kept in the store). The production or procurement rate of a product is then fixed
according to which products are searched for or sold the most.
The banking sector uses data on customers’ spending behavior to offer a particular customer a discount or
cashback on a product they like when they pay with the bank’s credit or debit card. In this way, banks can
send the right offer to the right person at the right time.
2. Recommendation: By tracking customers’ spending habits and shopping behavior, big retail stores provide
recommendations to customers. E-commerce sites like Amazon, Walmart, and Flipkart recommend products:
they track what a customer searches for and, based on that data, recommend that type of product to the
customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon now has data indicating that the
customer may be interested in buying a bed cover. The next time that customer opens a Google page,
advertisements for various bed covers will be shown. Thus, the right product can be advertised to the right
customer.
YouTube also recommends videos based on the types of video the user has previously liked and watched.
Relevant advertisements are shown while a video plays, based on the content the user is watching. For
example, if someone is watching a tutorial video on Big Data, an advertisement for another Big Data course
will be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras
placed beside the roads and at the entry and exit points of the city, and through GPS devices placed in
vehicles (Ola, Uber cabs, etc.). All such data is analyzed, and jam-free or less congested, less
time-consuming routes are recommended. In this way, a smart traffic system can be built in the city through
Big Data analysis. A further benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System: Sensors are present at various places on an aircraft (such as the propeller).
These sensors capture data such as the speed of the flight, moisture, temperature, and other environmental
conditions. Based on analysis of such data, the environmental parameters within the aircraft are set and
varied.
By analyzing the aircraft’s machine-generated data, it can be estimated how long a machine can operate
flawlessly and when it should be replaced or repaired.
5. Auto Driving Car: Big Data analysis helps drive a car without human intervention. Cameras and sensors
placed at various spots on the car gather data such as the size of surrounding cars, obstacles, the distance
from them, etc. This data is analyzed, and various calculations are carried out, such as how many degrees to
turn, what the speed should be, when to stop, etc. These calculations help the car take action automatically.
6. Virtual Personal Assistant Tool: Big Data analysis helps virtual personal assistant tools (like Siri on
Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various questions asked by
users. The tool tracks the user’s location, local time, season, and other data related to the question
asked. By analyzing all such data, it provides an answer.
As an example, suppose a user asks “Do I need to take an umbrella?”. The tool collects data such as the
user’s location and the season and weather conditions at that location, analyzes this data to conclude
whether there is a chance of rain, and then provides the answer.
7. IoT:
 Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such
data, it can be predicted how long a machine will work without any problem and when it will require repair,
so that the company can take action before the machine faces a lot of issues or goes down completely.
Thus, the cost of replacing the whole machine can be saved.
 In the healthcare field, Big Data is making a significant contribution. Using Big Data tools, data regarding
patient experience is collected and used by doctors to give better treatment. IoT devices can sense
symptoms of a probable oncoming disease in the human body and help prevent it through early treatment.
IoT sensors placed near a patient or a new-born baby constantly keep track of various health conditions
such as heart rate, blood pressure, etc. Whenever any parameter crosses the safe limit, an alarm is sent to
a doctor so that they can take steps remotely very soon.
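The alarm rule described for patient monitoring can be sketched as a simple threshold check. The parameter names and safe ranges below are hypothetical placeholders chosen for the example; real limits depend on the patient and would come from clinicians.

```python
# Hypothetical safe ranges, used only to illustrate the rule.
SAFE_LIMITS = {
    "heart_rate_bpm": (50, 120),
    "systolic_bp_mmhg": (90, 140),
}


def check_reading(reading):
    """Return a list of alert messages for values outside their safe range."""
    alerts = []
    for parameter, (low, high) in SAFE_LIMITS.items():
        value = reading.get(parameter)
        if value is None:
            continue  # the sensor did not report this parameter
        if value < low or value > high:
            alerts.append(f"{parameter}={value} outside safe range {low}-{high}")
    return alerts


normal = check_reading({"heart_rate_bpm": 72, "systolic_bp_mmhg": 118})
abnormal = check_reading({"heart_rate_bpm": 134, "systolic_bp_mmhg": 118})
print(normal)
print(abnormal)
```

In a real deployment the returned alerts would be forwarded to a doctor; here they are only collected in a list.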
8. Education Sector: Organizations that conduct online educational courses use Big Data to find candidates
interested in a course. If someone searches for a YouTube tutorial video on a subject, online or offline
course providers for that subject send that person online advertisements about their courses.
9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the readings to a
server, where the data is analyzed to estimate the times of day when the power load is lowest
throughout the city. Using this system, manufacturing units and householders are advised to run their
heavy machines at night, when the power load is lower, to benefit from a lower electricity bill.
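The smart-meter analysis can be sketched by summing the 15-minute readings per hour and picking the hour with the lowest total. The readings below are invented sample values, and a real system would aggregate over many meters and days.

```python
from collections import defaultdict

# Hypothetical readings: (hour_of_day, kWh consumed in a 15-minute interval)
readings = [
    (2, 0.8), (2, 0.7), (2, 0.9), (2, 0.6),
    (9, 2.1), (9, 2.4), (9, 2.0), (9, 2.5),
    (20, 3.2), (20, 3.0), (20, 3.4), (20, 3.1),
]


def load_by_hour(samples):
    """Sum the 15-minute readings into total kWh per hour of the day."""
    totals = defaultdict(float)
    for hour, kwh in samples:
        totals[hour] += kwh
    return dict(totals)


def lowest_load_hour(samples):
    """Return the hour with the smallest total consumption."""
    totals = load_by_hour(samples)
    return min(totals, key=totals.get)


off_peak = lowest_load_hour(readings)
print(off_peak)
```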
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix,
Amazon Prime, and Spotify analyze data collected from their users. Data such as what types of video and
music users watch and listen to most, how long users spend on the site, etc. is collected and analyzed to
set the next business strategy.
Big Data Technologies

What is Big Data Technology?
Big Data Technology can be defined as a software utility that is designed to analyse, process, and
extract information from extremely complex and large data sets which traditional data processing
software could never deal with.
Types of Big Data Technologies:

Big Data Technology is mainly classified into two types:
1. Operational Big Data Technologies

2. Analytical Big Data Technologies
First, Operational Big Data is all about the normal day-to-day data that we generate. This could be
online transactions, social media activity, or the data of a particular organisation. You can even
consider this a kind of raw data that is used to feed the Analytical Big Data Technologies.
A few examples of Operational Big Data Technologies are as follows:
 Online ticket bookings, which include rail tickets, flight tickets, movie tickets, etc.
 Online shopping on sites such as Amazon, Flipkart, Walmart, Snapdeal, and many more.
 Data from social media sites like Facebook, Instagram, WhatsApp, and a lot more.
 The employee details of any multinational company.
With this, let us move on to the Analytical Big Data Technologies.
Analytical Big Data is like the advanced version of Big Data Technologies. It is a little more complex than
Operational Big Data. A few examples of Analytical Big Data Technologies are as follows:
 Stock market analysis.
 Carrying out space missions, where every single bit of information is crucial.
 Weather forecast information.
 Medical fields, where a particular patient’s health status can be monitored.
Top Big Data Technologies
Top big data technologies are divided into 4 fields, classified as follows:
 Data Storage
 Data Mining
 Data Analytics
 Data Visualization
Data Storage
Hadoop
The Hadoop framework was designed to store and process data in a distributed data processing
environment on commodity hardware with a simple programming model. It can store and analyse
data spread across different machines at high speed and low cost.
 Developed by: Apache Software Foundation, 10 December 2011
 Written in: Java
 Current stable version: Hadoop 3.1.1
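The text above does not name Hadoop's programming model; it is MapReduce. The following single-process Python sketch shows only the shape of that model (map, shuffle, reduce) on a word count. It does not use Hadoop itself, where the same phases run in parallel across many machines, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data needs big storage", "data drives decisions"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over separate machines, which is what makes the model suitable for commodity clusters.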
MongoDB
NoSQL document databases like MongoDB offer a direct alternative to the rigid schema used
in relational databases. This allows MongoDB to offer flexibility while handling a wide variety
of data types at large volumes and across distributed architectures.
 Developed by: MongoDB Inc., 11 February 2009

 Written in: C++, Go, JavaScript, Python
 Current stable version: MongoDB 4.0.10
Hunk
Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the
Splunk Search Processing Language to analyse your data. With Hunk, you can report on and visualize large
amounts of data from your Hadoop and NoSQL data sources.
 Developed by: Splunk Inc. in the year 2013.

 Current stable version: Splunk Hunk 6.2
Data Mining
Presto
Presto is an open source distributed SQL query engine for running interactive analytic queries against
data sources of all sizes, ranging from gigabytes to petabytes. Presto allows querying data
in Hive, Cassandra, relational databases, and proprietary data stores.
 Developed by: Facebook, open-sourced in the year 2013.

 Current stable version: Presto 0.22
Rapid Miner
RapidMiner is a centralized solution that features a powerful and robust graphical user interface that
enables users to create, deliver, and maintain predictive analytics. It allows the creation of very advanced
workflows and offers scripting support in several languages.
 Developed by: RapidMiner in the year 2001

 Current stable version: RapidMiner 9.2
Elasticsearch
Elasticsearch is a search engine based on the Lucene library. It provides a distributed,
multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.
 Developed by: Elastic NV in the year 2012.

 Current stable version: ElasticSearch 7.1
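Full-text search engines built on Lucene rely on an inverted index, which maps each word to the documents that contain it. The sketch below illustrates that idea in plain Python over schema-free, JSON-like documents. It is not the Elasticsearch API; the documents, function names, and whitespace tokenization are assumptions for the example.

```python
from collections import defaultdict

# Schema-free documents: each one is just a dict, as a JSON document would be.
documents = {
    1: {"title": "Intro to big data", "body": "Volume velocity and variety"},
    2: {"title": "Cloud storage", "body": "Storing big data in the cloud"},
    3: {"author": "unknown", "body": "Weather station data"},
}


def build_index(docs):
    """Map every lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for value in doc.values():
            for word in str(value).lower().split():
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(documents)
print(search(index, "big data"))
```

Looking up a word is a dictionary access rather than a scan of every document, which is why this structure scales to large collections.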
Data Analytics
Kafka
Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities, which
are as follows:
 Publishing and subscribing to streams of records, similar to a message queue or an enterprise
messaging system
 Storing streams of records durably
 Processing streams of records as they occur
 Developed by: Apache Software Foundation in the year 2011
 Written in: Scala, Java
 Current stable version: Apache Kafka 2.2.0
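The publish and subscribe idea behind a streaming platform can be sketched with an in-memory append-only log in which every consumer keeps its own read position. This is a toy model and not the Kafka client API; the class name `TinyLog`, its methods, and the topic and consumer names are hypothetical.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only log with per-consumer offsets (not the Kafka API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> ordered list of records
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, record):
        self._topics[topic].append(record)

    def poll(self, consumer, topic):
        """Return records this consumer has not seen yet and advance its offset."""
        start = self._offsets[(consumer, topic)]
        records = self._topics[topic][start:]
        self._offsets[(consumer, topic)] = start + len(records)
        return records


log = TinyLog()
log.publish("clicks", {"user": "u1", "page": "/home"})
log.publish("clicks", {"user": "u2", "page": "/pricing"})

first = log.poll("billing", "clicks")    # both records published so far
log.publish("clicks", {"user": "u1", "page": "/checkout"})
second = log.poll("billing", "clicks")   # only the new record
replay = log.poll("audit", "clicks")     # a new consumer reads from the start
print(len(first), len(second), len(replay))
```

Because records stay in the log after being read, a consumer that joins later can still process the full history, which is the main difference from a queue that deletes messages on delivery.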
Splunk
Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can
generate graphs, reports, alerts, dashboards, and data visualizations. It is also used for application
management, security and compliance, as well as business and web analytics.
 Developed by: Splunk Inc., 6 May 2014

 Written in: AJAX, C++, Python, XML
 Current stable version: Splunk 7.3
KNIME
KNIME allows users to visually create data flows, selectively execute some or all analysis steps, and
inspect the results, models, and interactive views. KNIME is written in Java, is based on Eclipse, and
makes use of its extension mechanism to add plugins providing additional functionality.
 Developed by: KNIME in the year 2008

 Current stable version: KNIME 3.7.2
Spark
Spark provides in-memory computing capabilities to deliver speed, a generalized execution model to
support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
 Developed by: Apache Software Foundation

 Written in: Java, Scala, Python, R
 Current stable version: Apache Spark 2.4.3
R-Language
R is a programming language and free software environment for statistical computing and graphics.
The R language is widely used among statisticians and data miners for developing statistical software and,
above all, for data analysis.
 Developed by: R Foundation, 29 February 2000

 Written in: C, Fortran, R
 Current stable version: R-3.6.0
Data Visualization
Tableau
Tableau is a powerful and fast-growing data visualization tool used in the business
intelligence industry. Data analysis is very fast with Tableau, and the visualizations created are in the
form of dashboards and worksheets.
 Developed by: Tableau, 17 May 2013

 Written in: Java, C++, Python, C
 Current stable version: Tableau 8.2
Plotly
Plotly is mainly used to make creating graphs faster and more efficient. It offers API libraries for
Python, R, MATLAB, Node.js, Julia, and Arduino, as well as a REST API. Plotly can also be used to style
interactive graphs in Jupyter notebooks.
 Developed by: Plotly in the year 2012

 Written in: JavaScript
 Current stable version: Plotly 1.47.4
Big Data and Cloud Computing


1. Big Data :
Big data refers to data which is huge in size and also increasing rapidly with time. Big data
includes structured data, unstructured data as well as semi-structured data. Big data cannot be stored and
processed with traditional data management tools; it needs specialized big data management tools. It refers
to complex and large data sets characterized by the 5 V’s: volume, velocity, veracity, value, and variety.
Working with it involves data storage, data analysis, data mining, and data visualization.
Examples of sources where big data is generated include social media data, e-commerce data, weather
station data, IoT sensor data, etc.
Characteristics of Big Data :
 Variety of Big data – Structured, unstructured, and semi structured data
 Velocity of Big data – Speed of data generation
 Volume of Big data – Huge volumes of data that are being generated
 Value of Big data – Extracting useful information and making it valuable
 Variability of Big data – Inconsistency which can be shown by the data at times.
Advantages of Big Data :
 Cost Savings
 Better decision-making
 Better Sales insights
 Increased Productivity
 Improved customer service.
Disadvantages of Big Data :
 Incompatible tools
 Security and Privacy Concerns
 Need for cultural change
 Rapid change in technology
 Specific hardware needs.
2. Cloud Computing :
Cloud computing refers to the on-demand availability of computing resources over the internet. These
resources include servers, storage, databases, software, analytics, networking, and intelligence, and all of
them can be used as per the requirement of the customer. In cloud computing, customers pay as
per use. It is very flexible, and resources can be scaled easily depending upon the requirement. Instead
of buying IT resources physically, all resources can be obtained from cloud vendors as required.
Cloud computing has three service models, i.e. Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS).
Examples of vendors who provide cloud computing services are Amazon Web Services
(AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Characteristics of Cloud Computing :
 On-Demand availability
 Accessible through a network
 Elastic Scalability
 Pay as you go model
 Multi-tenancy and resource pooling.
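The pay-as-you-go characteristic can be illustrated with a small cost comparison. All prices and usage figures below are hypothetical numbers chosen only to show the arithmetic; real cloud pricing varies by vendor and service.

```python
def pay_as_you_go_cost(hours_used, rate_per_hour):
    """Cost when the customer is billed only for the hours actually used."""
    return hours_used * rate_per_hour


def owned_server_cost(purchase_price, monthly_upkeep, months):
    """Cost of buying hardware up front and maintaining it."""
    return purchase_price + monthly_upkeep * months


# Hypothetical figures for one year of light usage.
cloud = pay_as_you_go_cost(hours_used=200, rate_per_hour=0.5)
owned = owned_server_cost(purchase_price=3000, monthly_upkeep=50, months=12)
print(cloud, owned)
```

With heavy, constant usage the comparison can tip the other way, so the result depends entirely on the assumed numbers.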
Advantages of Cloud Computing :
 Back-up and restore data
 Improved collaboration
 Excellent accessibility
 Low maintenance cost
 On-Demand Self-service.
Disadvantages of Cloud Computing :
 Vendor lock-in
 Limited Control
 Security Concern
 Downtime due to various reasons
 Requires good Internet connectivity.
Difference between Big Data and Cloud Computing :
S.No. | BIG DATA | CLOUD COMPUTING
01. | Big data refers to data which is huge in size and also increasing rapidly with respect to time. | Cloud computing refers to the on-demand availability of computing resources over the internet.
02. | Big data includes structured data, unstructured data as well as semi-structured data. | Cloud computing services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
03. | Volume of data, Velocity of data, Variety of data, Veracity of data, and Value of data are considered the 5 most important characteristics of Big data. | On-demand availability of IT resources, broad network access, resource pooling, elasticity and measured service are considered the main characteristics of cloud computing.
04. | The purpose of big data is to organize the large volume of data, extract the useful information from it, and use that information for the improvement of business. | The purpose of cloud computing is to store and process data in the cloud or to avail remote IT services without physically installing any IT resources.
05. | Distributed computing is used for analyzing the data and extracting the useful information. | The internet is used to get cloud-based services from different cloud vendors.
06. | Big data management allows a centralized platform, provision for backup and recovery, and low maintenance cost. | Cloud computing services are cost effective, scalable and robust.
07. | Some of the challenges of big data are variety of data, data storage and integration, data processing and resource management. | Some of the challenges of cloud computing are availability, transformation, security concerns and the charging model.
08. | Big data refers to huge volumes of data, their management, and useful information extraction. | Cloud computing refers to remote IT resources and different internet service models.
09. | Big data is used to describe huge volumes of data and information. | Cloud computing is used to store data and information on remote servers and also to process the data using remote infrastructure.
10. | Some of the sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc. | Some of the vendors who provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Mobile Business Intelligence

What is Mobile Business Intelligence?
BI delivers relevant and trustworthy information to the right person at the right time. Mobile business
intelligence is the transfer of business intelligence from the desktop to mobile devices such as the BlackBerry,
iPad, and iPhone.
Mobile business intelligence refers to the ability to access analytics and data on mobile phones or tablets
rather than desktop computers. The business metric dashboard and key performance indicators (KPIs) are
more clearly displayed.
As the use of mobile devices has risen, so has the technology that we all use to make our daily lives,
including business, easier. Many businesses have benefited from mobile business intelligence.
The sections below outline the benefits and pitfalls of Mobile BI for business owners and others.
Need for mobile BI?

Mobile phones' data storage capacity has grown in tandem with their use. In today's fast-paced environment,
you are expected to make decisions and act quickly, and the number of businesses turning to mobile BI for
help in doing so is growing by the day.
Mobile BI can help you expand your business or boost its productivity, and it works for both
small and large businesses. Mobile BI can help you whether you are a salesperson or a CEO. There is a high
demand for mobile BI in order to reduce the time spent obtaining information and to use that time for quick
decision making.
As a result, timely decision-making can boost customer satisfaction and improve an enterprise's reputation
among its customers. It also aids in making quick decisions in the face of emerging risks.
Data analytics and visualisation techniques are essential skills for any team that wants to organise work,
develop new project proposals, or wow clients with impressive presentations.
Advantages of mobile BI
1. Simple access
Mobile BI is not restricted to a single mobile device or a certain place. You can view your data at any time
and from any location. Having real-time visibility into a firm improves production and the daily efficiency of
the business. Obtaining a company's perspective with a single click simplifies the process.
2. Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to stay ahead of the
competition. Easy access to real-time data improves company opportunities and raises sales and capital. This
also aids in making the necessary decisions as market conditions change.
3. Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from any location. Mobile
BI offers the information on demand, which helps users obtain what they require at that moment. As a
result, decisions are made quickly.
4. Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data when they need it.
Obtaining all of the corporate data with a single click frees up a significant amount of time to focus on the
smooth and efficient operation of the firm. Increased productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1. Stack of data
The primary function of mobile BI is to store data in a systematic manner and then present it to the user as
required. As a result, Mobile BI stores all of the information and ends up with heaps of earlier data. The
corporation needs only a small portion of the previous data, but it has to store all of it,
which adds to the stack.
2. Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for their expensive services,
but small businesses cannot. The cost of the mobile BI service is not the only expense: we must additionally
consider the pay of the IT workers needed for the smooth operation of BI, as well as the hardware costs
involved.
Moreover, larger corporations do not settle for just one Mobile BI provider for their organisations; they
require multiple. Mobile BI is costly even when used for basic commercial transactions.
3. Time consuming
Businesses choose mobile BI because it promises quick results, and companies are rarely patient enough to
wait for data before acting on it. In today's fast-paced environment, anything that can produce results quickly
is valuable. However, the system is built from the data in the warehouse, so implementing BI in an
enterprise can take more than 18 months.
4. Data breach
The biggest concern for users when providing data to mobile BI is data leakage. If you handle sensitive data
through mobile BI, a single error can destroy your data or make it public, which can be detrimental to
your business.
Many mobile BI providers are working to make it 100 percent secure to protect their potential users'
data. Security is not only something that mobile BI providers must consider; we, as users, must also
consider it when granting data access authorization.
5. Poor quality data
Because we work online in every aspect, a lot of data ends up stored in mobile BI, which can be a
significant problem: a large portion of the data analysed by mobile BI is irrelevant or
completely useless. This can slow down the entire process, so you need to select the data that
is important and may be required in the future.
Crowdsourcing
What is Crowdsourcing?

Crowdsourcing is a sourcing model in which an individual or an organization gets support from a large,
open-minded, and rapidly evolving group of people in the form of ideas, micro-tasks, finances, etc.
Crowdsourcing typically involves the use of the internet to attract a large group of people to divide tasks or to
achieve a target. The term was coined in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can help
different types of organizations get new ideas and solutions, deeper consumer engagement, optimization of
tasks, and several other things.
Let us understand this term deeply with the help of an example.
Where Can We Use Crowdsourcing?
Crowdsourcing is touching almost all sectors, from education to health. It is not only accelerating innovation
but also democratizing problem-solving methods. Some fields where crowdsourcing can be used are:
1. Enterprise
2. IT
3. Marketing
4. Education
5. Finance
6. Science and Health
How To Crowdsource?
1. For scientific problem solving, a broadcast search is used where an organization mobilizes a crowd to
come up with a solution to a problem.
2. For information management problems, knowledge discovery and management is used to find and
assemble information.
3. For processing large datasets, distributed human intelligence is used. The organization mobilizes a crowd
to process and analyze the information.
Examples Of Crowdsourcing
1. Doritos: It is one of the companies that has long taken advantage of crowdsourcing for an
advertising initiative. They have used consumer-created ads for one of their 30-second Super Bowl
spots (the Super Bowl is the championship game of American football).
2. Starbucks: Another big company that used crowdsourcing as a medium for idea generation. Their White
Cup Contest is a famous contest in which customers decorate their Starbucks cup with an original
design, take a photo, and submit it on social media.
3. Lay's: The "Do Us a Flavor" contest of Lay's used crowdsourcing as an idea-generating medium. They asked
customers to submit their ideas for the next chip flavor they want.
4. Airbnb: A very famous travel website that lets people rent out their houses or apartments by listing them
on the website. All the listings are crowdsourced by people.
Advantages Of Crowdsourcing
1. Evolving Innovation: Innovation is required everywhere, and in this advancing world it has a
big role to play. Crowdsourcing helps in getting innovative ideas from people belonging to different
fields and thus helps businesses grow in every field.
2. Save costs: There is no time wasted on meeting people and convincing them. The business idea
only has to be proposed on the internet, and you will be flooded with suggestions from the
crowd.
3. Increased Efficiency: Crowdsourcing has increased the efficiency of business models, because ideas from
many experts can be gathered, and some of them are also funded.
Disadvantages Of Crowdsourcing
1. Lack of confidentiality: Asking for suggestions from a large group of people brings the threat of ideas
being stolen by other organizations.
2. Repeated ideas: Contestants in crowdsourcing competitions often submit repeated or plagiarized ideas,
which wastes time, as reviewing the same ideas again is not worthwhile.
What Are the Main Types of Crowdsourcing?
Crowdsourcing involves obtaining information or resources from a wide swath of people. In general, we can
break this up into four main categories:
 Wisdom - Wisdom of crowds is the idea that large groups of people are collectively smarter than
individual experts when it comes to problem-solving or identifying values (like the weight of a cow
or number of jelly beans in a jar).
 Creation - Crowd creation is a collaborative effort to design or build something. Wikipedia and other
wikis are examples of this. Open-source software is another good example.
 Voting - Crowd voting uses the democratic principle to choose a particular policy or course of action
by "polling the audience."
 Funding - Crowdfunding involves raising money for various purposes by soliciting relatively small
amounts from a large number of funders.
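The wisdom-of-crowds idea from the first category can be illustrated with a small sketch. The guesses and the true count below are made-up numbers for a hypothetical jelly-bean jar, not data from any real study. The sketch only shows that an aggregate of many guesses (here the median) can land closer to the true value than a typical individual guess.

```python
from statistics import median

# Hypothetical guesses for the number of jelly beans in a jar (illustrative only).
guesses = [310, 420, 515, 480, 650, 390, 720, 455, 530, 500]
true_count = 500  # assumed true value for this illustration


def crowd_estimate(values):
    """Aggregate individual guesses into one crowd estimate using the median."""
    return median(values)


def mean_individual_error(values, truth):
    """Average absolute error of the individual guesses."""
    return sum(abs(v - truth) for v in values) / len(values)


estimate = crowd_estimate(guesses)
crowd_error = abs(estimate - true_count)
typical_error = mean_individual_error(guesses, true_count)

print("crowd estimate:", estimate)
print("crowd error:", crowd_error)
print("average individual error:", typical_error)
```

The median is used here instead of the mean because it is less sensitive to a few extreme guesses; either aggregate would serve for the illustration.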
Introduction to Hadoop
Hadoop is an Apache open source framework written in Java that allows distributed processing of large
datasets across clusters of computers using simple programming models. A Hadoop application
works in an environment that provides distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation
and storage.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System).
Figure: Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google for
efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes)
of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which
is an Apache open-source framework.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file systems are significant. It is
highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access
to application data and is suitable for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the following two
modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
How Does Hadoop Work?

Building bigger servers with heavy configurations to handle large-scale processing is quite expensive. As
an alternative, you can tie together many single-CPU commodity computers into a single functional
distributed system; the clustered machines can then read the dataset in parallel and provide much
higher throughput. Moreover, such a cluster is cheaper than one high-end server. So the first motivation
for using Hadoop is that it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that
Hadoop performs −
 Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB
or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
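The block-splitting and replication steps in the list above can be sketched in a few lines. This is a minimal illustration, not HDFS code: the node names are hypothetical, the replication factor of 3 is assumed (it is the HDFS default), and the round-robin placement is a simplification of the rack-aware placement that real HDFS uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, as in the text
REPLICATION = 3                 # assumed replication factor (the HDFS default)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes in round-robin order.

    Real HDFS placement is rack-aware; this rotation is only an illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical DataNode names
blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), nodes)

for block_id, replicas in placement.items():
    print(block_id, blocks[block_id] // (1024 * 1024), "MB ->", replicas)
```

Because every block lives on several nodes, the loss of one machine leaves at least two copies of each block available, which is the point of the replication step.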
Hadoop Ecosystem
The Hadoop Ecosystem is a platform or suite that provides various services to solve big data problems. It
includes Apache projects and various commercial tools and solutions. There are four major elements of
Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to
supplement or support these major elements. All these tools work collectively to provide services such as
ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
HDFS:
HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets
of structured or unstructured data across various nodes, and it maintains the metadata in the form of
log files.
HDFS consists of two core components i.e.
 Name node
 Data Node
The Name Node is the prime node. It holds the metadata (data about data) and requires comparatively fewer
resources than the Data Nodes, which store the actual data. These Data Nodes run on commodity hardware in the
distributed environment, which makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the
system.
YARN:
Yet Another Resource Negotiator (YARN), as the name implies, helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
 Resource Manager
 Node Manager
 Application Manager
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the
Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and
then report back to the Resource Manager. The Application Manager (called the ApplicationMaster in Apache's
YARN documentation) works as an interface between the Resource Manager and the Node Managers and
negotiates resources as the two require.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the
processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
Map() sorts and filters the data and thereby organizes it into groups. Map() generates
key-value pairs as its result, which are later processed by the Reduce() method.
Reduce(), as the name suggests, summarizes by aggregating the mapped data. Put simply, Reduce()
takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
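The division of work between Map() and Reduce() can be sketched with the classic word-count example. This is plain Python running on one machine, not the Hadoop API: the function names and the toy input are assumptions made for illustration, and the grouping step stands in for the sort that Hadoop performs between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group the values by key and sort the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): aggregate all values for one key into a single count."""
    return key, sum(values)


lines = ["big data needs big tools", "hadoop stores big data"]  # toy input

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))

print(counts)
```

In a real cluster each input split would be mapped on a different node and the grouped keys would be sent to different reducers; the three functions keep the same roles.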
PIG:
Pig was originally developed by Yahoo. It works with the Pig Latin language, a query-based language
similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig executes the commands, and in the background all the MapReduce activities are taken
care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java
runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop
Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets.
Its query language is called HQL (Hive Query Language).
It is highly scalable, as it supports both real-time and batch processing. HIVE also supports the SQL data
types, which makes query processing easier.
Like other query-processing frameworks, HIVE comes with two components: the JDBC drivers and the
HIVE command line.
The JDBC drivers, along with the ODBC drivers, establish the connection and the data storage permissions,
whereas the HIVE command line helps in the processing of queries.
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests,
helps the system to improve itself based on patterns, user or environmental interaction, or
algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering, and classification,
which are machine learning concepts. It allows us to invoke algorithms as needed with the
help of its own libraries.
Apache Spark:
It is a platform that handles processing-intensive tasks such as batch processing, interactive or iterative
real-time processing, graph conversions, and visualization.
It works with in-memory resources and is therefore faster than the earlier components in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing,
so many companies use both, each for the workloads it suits.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop
database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
When we need to search for or retrieve occurrences of something small in a huge database, the
request must be processed within a short span of time. At such times HBase comes in handy, as it gives us
a fault-tolerant way of storing limited (sparse) data.
Other Components: Apart from all of these, there are some other components that carry out important tasks in
order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are the two services that perform the task of searching and indexing with the help of
Java libraries. Lucene in particular is a Java library, and it also supports a spell-check mechanism.
Solr, in turn, is built on top of Lucene.
Zookeeper: Managing coordination and synchronization among the resources
or components of Hadoop used to be a huge issue and often resulted in inconsistency. Zookeeper overcame
these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a
single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow
jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that
are triggered when some data or external stimulus is given to them.
Advantages of Hadoop
 The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it
automatically distributes the data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the
Hadoop library itself has been designed to detect and handle failures at the application layer.
 Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate
without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is compatible with all
platforms since it is Java based.
Note:
For firewall analytics and open source technologies, refer to the Technical Publications book.
