
Dept. of CSE & ISE, FET

BIG DATA
TECHNOLOGY

SAHANA SHETTY,
DEPT. OF CSE, FET
CONTENTS
• Big Data Technology-I: The elephant in the
room: Hadoop's parallel world, old vs. new
approaches; Data discovery: work the way
people's minds work; Open-source technology
for big data analytics; The cloud and big data;
Predictive analytics moves into the limelight;
Software as a service BI; Mobile business
intelligence is going mainstream; Ease of
mobile application deployment; Crowdsourcing
analytics; Inter- and trans-firewall analytics.

• 2005: Doug Cutting and Michael J. Cafarella
developed Hadoop to support distribution for
the Nutch search engine project. The project
was funded by Yahoo.

• 2006: Yahoo gave the project to the Apache
Software Foundation.

[Timeline figure: 2003, 2004, 2006]

• 2008: Hadoop wins the Terabyte Sort Benchmark (sorting 1
terabyte of data in 209 seconds, compared to the previous record of
297 seconds).

• 2009: Avro and Chukwa become new members of the Hadoop
framework family.

• 2010: Hadoop's HBase, Hive, and Pig subprojects are completed,
adding more computational power to the Hadoop framework.

• 2011: ZooKeeper is completed.

• 2013: Hadoop 1.1.2 and Hadoop 2.0.3 alpha are released;
Ambari, Cassandra, and Mahout have been added.

• Hadoop:
• An open-source software framework that supports
data-intensive distributed applications, licensed under
the Apache v2 license; or
• An open-source platform for storage and processing
of diverse data types that enables data-driven
enterprises to rapidly derive the complete value from
all their data.

• Goals / Requirements:
• Abstract and facilitate the storage and processing of
large and/or rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault tolerance
The two critical components of Hadoop are:

1. The Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the
primary data storage system used by Hadoop
applications. It employs a NameNode and DataNode
architecture to implement a distributed file system
that provides high-performance access to data across
highly scalable Hadoop clusters.

2. MapReduce

• MapReduce is a processing technique and a programming model for
distributed computing based on Java. The MapReduce algorithm contains
two important tasks, namely Map and Reduce. Map takes a set of data and
converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). The Reduce task then takes the output
from a map as an input and combines those data tuples into a smaller set
of tuples. As the name MapReduce implies, the reduce task is always
performed after the map job.
• The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data
processing application into mappers and reducers is sometimes nontrivial.
But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
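The map, shuffle, and reduce steps described above can be sketched in plain Python. This is a single-process simulation for illustration only, not the Hadoop MapReduce API; the word-count task and all function names here are our own:

```python
from collections import defaultdict

# Map: turn each input record into (key, value) tuples.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

# Shuffle: group all values emitted for the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine each key's values into a smaller set of tuples.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the shuffle step is performed by the framework across machines, which is why scaling out to more nodes is, as the text says, merely a configuration change.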

HDFS follows the master-slave architecture and has
the following elements.

Namenode
The namenode is the commodity hardware that contains
the GNU/Linux operating system and the namenode
software. It is software that can be run on commodity
hardware. The system hosting the namenode acts as the
master server and does the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as
renaming, closing, and opening files and directories.
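As a rough illustration of those tasks, a namenode-style component keeps only metadata (the namespace), never the file contents themselves. This toy class, its paths, and its block IDs are hypothetical, not Hadoop code:

```python
# Toy namenode: tracks which blocks make up each file (metadata only).
class NameNode:
    def __init__(self):
        self.namespace = {}  # file path -> list of block ids

    def create(self, path, block_ids):
        self.namespace[path] = block_ids

    def rename(self, old_path, new_path):
        self.namespace[new_path] = self.namespace.pop(old_path)

    def delete(self, path):
        del self.namespace[path]

    def blocks_of(self, path):
        return self.namespace[path]

nn = NameNode()
nn.create("/logs/part-0", ["blk_1", "blk_2"])
nn.rename("/logs/part-0", "/logs/archived")
print(nn.blocks_of("/logs/archived"))  # ['blk_1', 'blk_2']
```

The actual bytes of blk_1 and blk_2 would live on datanodes; the namenode only answers "which blocks, on which nodes" questions.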


Datanode
The datanode is commodity hardware running the GNU/Linux
operating system and the datanode software. For every node
(commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as
per client request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.

Block
Generally, user data is stored in the files of HDFS. A file in the
file system is divided into one or more segments, which are stored
in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read
or write is called a block. The default block size is 64 MB, but it can
be increased as needed by changing the HDFS configuration.
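The block arithmetic above is easy to see in a short sketch (our own example; the 64 MB figure is the default mentioned in the text):

```python
import math

DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default

def block_count(file_size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """How many blocks HDFS would allocate for a file of this size."""
    return math.ceil(file_size_bytes / block_size)

# A 200 MB file needs four 64 MB blocks; the last block is only partly full.
print(block_count(200 * 1024 * 1024))  # 4
```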


Hadoop Framework Tools

Hadoop's Architecture: MapReduce Engine

Old vs. New Approaches

We spoke with data guru Abhishek Mehta to get his
perceptions of the differences between the "old" and
"new" types of big data analytics.
Mehta is a former Bank of America executive and MIT
Media Lab executive-in-residence.
He recently launched Tresata, a company that is
developing the first Hadoop-powered Big Data
analytics platform focused on financial industry data.
According to him, the old way is a data and analytics
technology stack with different layers "cross-
communicating data" and working on "scale-up"
expensive hardware. The new way is a data and
analytics platform that does all the data processing
and analytics in one "layer," without moving data back
and forth, on cheap but scalable ("scale-out")
commodity hardware.
1. The technology stack has changed. New proprietary
technologies and open-source inventions enable
different approaches that make it easier and more
affordable to store, manage, and analyze data.
2. Hardware and storage are affordable and continuing to
get cheaper, enabling massive parallel processing.
3. The variety of data is on the rise, and the ability to
handle unstructured data is on the rise.
Data Discovery: Work the Way People's Minds Work
Data discovery: the term used to describe the new
wave of business intelligence that enables users to
explore data, make discoveries, and uncover insights
in a dynamic and intuitive way, versus predefined
queries and preconfigured drill-down dashboards.
• This approach has resonated with many business
users who are looking for the freedom and flexibility
to view Big Data.
• Tableau Software and QlikTech International: these
companies' approach to the market is much different
from the traditional BI software vendor's. They grew
through a sales model that many refer to as "land and
expand."
Open-Source Technology for Big Data Analytics
Open-source software is computer software that is
available in source code form under an open-source
license that permits users to study, change, and
improve it, and at times also to distribute the software.
The open-source name came out of a 1998 meeting in
Palo Alto in reaction to Netscape's announcement of
a source code release for Navigator (as Mozilla).
• "One of the key attributes of the open-source analytics
stack is that it's not constrained by someone else's
predetermined ideas or vision," says David
Champagne, chief technology officer at Revolution
Analytics, a provider of advanced analytics.
The old model's end state was a monolithic stack of
proprietary tools and systems that could not be
swapped out, modified, or upgraded without the
original vendor's support.
• The status quo rested on several assumptions,
including:
1. The amounts of data generated would be
manageable.
2. Programming resources would remain scarce.
3. Faster data processing would require bigger, more
expensive hardware.
The Cloud and Big Data

"There will be Big Data platforms that companies will
build, especially for the core operational systems of
the world, where we continue to have an explosive
amount of data come in and because the data is so
proprietary that building out an infrastructure in-house
seems logical. I actually think it's going to go to the
cloud, it's just a matter of time! It's not value add
enough to collect, process and store data."
—Avinash Kaushik, Google's digital marketing evangelist


• Abhishek Mehta is one of those individuals who
believes that cloud models are inevitable for every
industry; it's just a matter of when an
industry will shift to the cloud model.
• Mehta calls it "the next industrial revolution, where
the raw material is data and data factories replace
manufacturing factories."
He pointed out a few guiding principles that his firm
stands by:
1. Stop saying “cloud.”
2. Acknowledge the business issues.
3. Fix some core technical gaps.
Predictive Analytics Moves into the Limelight

• To master analytics, enterprises will move from being
in reactive positions (business intelligence) to forward-
leaning positions (predictive analytics).
• Algorithmic trading and supply chain optimization are
just two typical examples where predictive analytics
has greatly reduced the friction in business.
Some leading trends that are making their way to the
forefront of businesses today:
■ Recommendation engines similar to those used in
Netflix and Amazon that use past purchases and
buying behavior to recommend new purchases.
■ Risk engines for a wide variety of business areas,
including market and credit risk, catastrophic risk, and
portfolio risk.
■ Innovation engines for new product innovation, drug
discovery, and consumer and fashion trends to predict
potential new product formulations
and discoveries.
■ Customer insight engines that integrate a wide
variety of customer-related information, including
sentiment, behavior, and even emotions.
■ Optimization engines that optimize complex,
interrelated operations and decisions that are too
overwhelming for people to systematically handle at
scale, such as when, where, and how to seek natural
resources to maximize output while reducing
operational cost.
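As a toy illustration of the first pattern above, a recommendation engine built on past purchases can be as simple as co-occurrence counting: recommend items bought by customers whose baskets overlap with yours. The customers and items below are invented for the example; production engines use far richer models:

```python
from collections import Counter

# Hypothetical purchase history: customer -> set of items bought.
purchases = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "carol": {"book", "desk"},
}

def recommend(customer, purchases):
    """Rank items bought by customers whose baskets overlap with ours."""
    own = purchases[customer]
    scores = Counter()
    for other, basket in purchases.items():
        if other == customer or not (own & basket):
            continue  # skip ourselves and customers with no overlap
        for item in basket - own:
            scores[item] += 1
    return [item for item, _ in scores.most_common()]

print(recommend("alice", purchases))  # ['desk']
```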
Software as a Service BI

• The software industry has seen some successful
companies excel in the software as a service
(SaaS) game, such as salesforce.com.
• The basic principle is to make it easy for companies to
gain access to solutions without the
headache of building and maintaining their own
onsite implementation.
• According to a recent article in TechTarget, "SaaS BI
can be a good choice when there's little or no budget
money available for buying BI software and related
hardware."
• In the world of web analytics, there was another
significant SaaS BI invention named Omniture (now
owned by Adobe).
• Omniture's success was fueled by its ability to
handle Big Data in the form of weblog data. We spoke
with Josh James, the creator of Omniture and now the
founder and CEO of Domo, a SaaS BI provider.
• James's answer for why his business was so successful
is as follows:
• Scaling the SaaS delivery model
• Killer sales organization
• A focus on customer success
• James explained the three market reasons why he
started Domo, knowing he had to fix three problems in
traditional BI:
1. Relieving the IT choke point
2. Transforming BI from a cost center to a revenue
generator
3. The user experience
Mobile Business Intelligence Is Going Mainstream:
• Dan Kerzner, SVP Mobile at MicroStrategy, a leading
provider of business intelligence software, has
been in the BI space for quite a while.
• According to him, the combination of multi-touch and
a software-oriented device is what has
unlocked the potential of these devices to really bring
mobile analytics and intelligence to a much wider
audience in a productive way.
Ease of Mobile Application Deployment:
• Three elements have impacted the viability of
mobile BI:
1. Location: the GPS component and location . . . knowing
where you are in time as well as the movement.
2. It's not just about pushing data; you can transact
with your smartphone based on information you get.
3. Multimedia functionality allows the visualization
pieces to really come into play.
Three challenges with mobile BI include:
1. Managing standards for rolling out these devices.
2. Managing security (always a big challenge).
3. Managing "bring your own device," where you have
devices both owned by the company and devices
owned by the individual, both contributing to
productivity.
Crowdsourcing Analytics:

• Crowdsourcing is a great way to capitalize on the
resources that can build algorithms and predictive
models.
• According to Anthony Goldbloom, Kaggle's founder
and CEO, "The idea is that someone comes to us with
a problem, we put it up on our website, and then
people from all over the world can compete to see who
can produce the best solution."
• Crowdsourcing is a disruptive business model whose
roots are in technology but is extending beyond
technology to other areas.
• There are various types of crowdsourcing, such as
crowd voting, crowd purchasing, wisdom of crowds,
crowd funding, and contests.
Inter- and Trans-Firewall Analytics:
