
Big Data Analytics Notes

Applications of big data analytics:


• eBay, Facebook, LinkedIn and Twitter are among the companies that have used HDFS to underpin big data analytics to address requirements similar to Yahoo's.
• Marketing. Targeted marketing campaigns depend on
marketers knowing a lot about their target audiences.
Marketers can get this information from several sources,
including CRM systems, direct mail responses, point-of-
sale systems, Facebook and Twitter. Because much of
this data is unstructured, an HDFS cluster is the most
cost-effective place to put data before analyzing it.
• Oil and gas providers. Oil and gas companies deal with a
variety of data formats with very large data sets,
including videos, 3D earth models and machine sensor
data. An HDFS cluster can provide a suitable platform for
the big data analytics that's needed.
• Research. Analyzing data is a key part of research, so,
here again, HDFS clusters provide a cost-effective way to
store, process and analyze large amounts of data.
Reading a file from a Hadoop URL
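One of the simplest ways to read a file from a Hadoop filesystem is to use a java.net.URL object to open a stream. For Java to recognize hdfs:// URLs, Hadoop's FsUrlStreamHandlerFactory must be registered with the URL class, and the JVM allows this registration only once per application. A minimal sketch, assuming the Hadoop client libraries are on the classpath and an HDFS instance is running:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Prints a file from a Hadoop filesystem to standard output, e.g.
// URLCat hdfs://localhost/user/tom/quangle.txt
public class URLCat {

    static {
        // Register Hadoop's handler for hdfs:// URLs; this call is allowed only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // 4 KB buffer; do not close System.out
        } finally {
            IOUtils.closeStream(in);
        }
    }
}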

READING DATA USING THE FILESYSTEM API:


As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS in this case). There are several static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
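As a minimal sketch of how these pieces fit together (assuming a reachable HDFS instance, with the file URI passed on the command line), the following program retrieves a FileSystem for the file's URI, opens the file with open() and copies it to standard output:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        // Retrieve a FileSystem instance for the scheme and authority of the URI (HDFS here).
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri)); // open() returns an input stream for the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}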
Anatomy of File Write in HDFS:
Step 1: The client creates the file by calling create() on Distributed File System (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and an error (an IOException) is thrown to the client.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and
forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".

Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
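The client-side view of this write path is small. A minimal sketch, assuming a reachable HDFS instance configured as the default filesystem and a hypothetical destination path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // client-side Distributed File System (Step 1)
        Path dst = new Path("/user/tom/output.txt"); // hypothetical destination path
        FSDataOutputStream out = fs.create(dst);     // RPC to the name node to create the file (Step 2)
        out.writeUTF("hello hdfs");                  // data is split into packets and pipelined (Steps 3-5)
        out.close();                                 // flush remaining packets, wait for acks, signal completion (Step 6)
    }
}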
Anatomy of File Read in HDFS
Step 1: The client opens the file it wishes to read by calling
open() on the File System object (which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System (DFS) calls the name node,
using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file. For each block, the
name node returns the addresses of the data nodes that have
a copy of that block. The DFS returns an FSDataInputStream
to the client for it to read data from. FSDataInputStream in
turn wraps a DFSInputStream, which manages the data node
and name node I/O.
Step 3: The client then calls read() on the stream.
DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the
client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
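Because open() returns an FSDataInputStream rather than a plain java.io.InputStream, the client can also seek within the file between reads. A minimal sketch of this read path, assuming a reachable HDFS instance, that prints the file twice:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));                    // Steps 1-2: open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // Steps 3-5: repeated read() calls stream the blocks
            in.seek(0);                                     // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false); // read the whole file a second time
        } finally {
            IOUtils.closeStream(in);                        // Step 6: close the stream
        }
    }
}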

Limitations of Hadoop distributed file system:


1. Issue with Small Files: Hadoop is not suited to small files. The Hadoop distributed file system (HDFS) lacks the ability to efficiently support random reads of small files because of its high-capacity design.
2. Slow Processing Speed: In Hadoop, MapReduce processes large data sets with a parallel, distributed algorithm. There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of time to perform these tasks, thereby increasing latency. Data is distributed and processed over the cluster in MapReduce, which increases the time and reduces processing speed.
3. Support for Batch Processing Only: Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum.
4. No delta iteration: Hadoop is not very efficient for iterative processing, as it does not support cyclic data flow (i.e., a chain of stages in which the output of each stage is the input to the next).
5. Not easy to use: In Hadoop, MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with. MapReduce has no interactive mode, but tools such as Hive and Pig make working with MapReduce a little easier for adopters.
