Big data analytics can be applied in various domains like e-commerce, marketing, oil and gas, and research. HDFS provides a cost-effective platform for storing and analyzing large, unstructured datasets from different sources. HDFS allows files to be read and written in a distributed manner across clusters. When writing a file, data is split and stored across data nodes with replication for fault tolerance. When reading, the file locations are determined and data is streamed from nodes. However, HDFS has limitations like being unsuitable for small files and only supporting batch processing.
• eBay, Facebook, LinkedIn and Twitter are among the companies that used HDFS to underpin big data analytics to address requirements similar to Yahoo's.
• Marketing. Targeted marketing campaigns depend on marketers knowing a lot about their target audiences. Marketers can get this information from several sources, including CRM systems, direct mail responses, point-of-sale systems, Facebook and Twitter. Because much of this data is unstructured, an HDFS cluster is the most cost-effective place to put it before analyzing it.
• Oil and gas providers. Oil and gas companies deal with a variety of data formats and very large data sets, including videos, 3D earth models and machine sensor data. An HDFS cluster can provide a suitable platform for the big data analytics that are needed.
• Research. Analyzing data is a key part of research, so, here again, HDFS clusters provide a cost-effective way to store, process and analyze large amounts of data.
READING A FILE FROM A HADOOP URL:
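One way to read a file from a Hadoop filesystem is with the standard java.net.URL class, after registering Hadoop's FsUrlStreamHandlerFactory so that Java recognizes hdfs:// URLs. Java allows this factory to be set at most once per JVM, so the approach fails if some other part of the application has already set one. A minimal sketch follows (the class name URLCat and the 4 KB buffer size are illustrative choices, not fixed by any API):

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Prints a file from a Hadoop filesystem (e.g. an hdfs:// URL given as the
// first command-line argument) on standard output via java.net.URL.
public class URLCat {

    static {
        // Teach Java to handle hdfs:// URLs. setURLStreamHandlerFactory can
        // be called only once per JVM, which is why this approach is
        // sometimes impossible (see the next section).
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // 4 KB buffer
        } finally {
            IOUtils.closeStream(in);
        }
    }
}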
READING DATA USING THE FILESYSTEM API:
As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.

A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use, HDFS in this case. There are several static factory methods for getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
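Putting these pieces together, a minimal sketch that uses FileSystem.get() and open() to print a file on standard output (the class name FileSystemCat and the 4 KB buffer size are illustrative):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Prints a file from a Hadoop filesystem on standard output, using the
// FileSystem API directly (no URLStreamHandlerFactory required).
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        // Retrieve a FileSystem for the scheme in the URI (HDFS here).
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri)); // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}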
Anatomy of File Write in HDFS:
Step 1: The client creates the file by calling create() on Distributed File System (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the filesystem's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and an error is thrown to the client, i.e. an IOException.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
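The write path can be exercised with a short sketch using FileSystem.create(), which returns the FSDataOutputStream whose internals the steps above describe (the class name, the target path, and the file content here are assumptions for illustration):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes a small file to HDFS. create() triggers the RPC to the name node
// (Step 2); writes are buffered into packets and pipelined to the data
// nodes (Steps 3-5); close() flushes and finalizes the file (Step 6).
public class FileSystemWrite {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/tom/output.txt"; // assumed path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.create(new Path(uri));
        try {
            out.writeUTF("Hello, HDFS!"); // data enters the data queue
        } finally {
            out.close(); // Step 6: flush remaining packets, signal completion
        }
    }
}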
Anatomy of File Read in HDFS:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node, then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
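Because FSDataInputStream (unlike a plain java.io.InputStream) supports random access via seek(), a client can re-read any part of a file. A minimal sketch, reusing the quangle.txt path from earlier as an assumed example:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file twice: once from the start, then again after seeking back
// to the beginning. seek() works because FSDataInputStream wraps a
// DFSInputStream that can reconnect to the data node holding any block.
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/tom/quangle.txt"; // assumed path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));    // Steps 1-2: open, locate blocks
            IOUtils.copyBytes(in, System.out, 4096, false); // Steps 3-5: read
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false); // read it again
        } finally {
            IOUtils.closeStream(in);        // Step 6: close the stream
        }
    }
}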
Limitations of the Hadoop Distributed File System:
1. Issue with Small Files: Hadoop is not suited to small files. The Hadoop Distributed File System (HDFS) lacks the ability to efficiently support random reads of small files because of its high-capacity design.
2. Slow Processing Speed: In Hadoop, MapReduce processes large data sets with a parallel, distributed algorithm. There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of time to perform them, thereby increasing latency. Data is distributed and processed over the cluster in MapReduce, which increases the time and reduces processing speed.
3. Support for Batch Processing Only: Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum.
4. No Delta Iteration: Hadoop is not efficient for iterative processing, as it does not support cyclic data flow (i.e., a chain of stages in which the output of each stage is the input to the next).
5. Not Easy to Use: MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with (the word-count sketch after this list illustrates the boilerplate involved). MapReduce has no interactive mode, but adding a layer such as Hive or Pig makes working with MapReduce a little easier for adopters.
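To make the last point concrete, here is a sketch of the classic word-count job written against the Java MapReduce API (the class names follow the widely used Apache example; input and output paths are taken from the command line). Even this trivial counting operation needs a Mapper class, a Reducer class, and driver boilerplate, which is exactly the verbosity that Hive and Pig hide:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the batch job, then wait for completion.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}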