
HDFS is a distributed file system that handles large data sets running on
commodity hardware.

It is used to scale a single Apache Hadoop cluster to hundreds of nodes. HDFS is
one of the components of Apache Hadoop.

HDFS is well suited to distributed storage and processing.
Hadoop provides a command interface to interact with HDFS; users can
easily check the status of a cluster, and HDFS provides file permissions.

HDFS is implemented in Java; any machine that supports
Java can run the NameNode or DataNode software.

Hadoop is an open source software framework for storing data and
running applications on clusters of commodity hardware.

Applications :

*Finance sector

*Security and law enforcement

*Customer requirements

*Real time analysis of customer data

*Advertisement platforms

*Financial trading

*Health care sectors

At present, many organisations use Apache Hadoop as a robust data
analytics platform; Google Trends shows Hadoop's popularity peaking since 2014.



Solution :

1)

The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications. HDFS employs a NameNode and DataNode
architecture to implement a distributed file system that provides high-performance
access to data across highly scalable Hadoop clusters.

Hadoop itself is an open source distributed processing framework that manages data
processing and storage for big data applications. HDFS is a key part of the many
technologies that make up the Hadoop ecosystem. It provides a reliable means for managing pools of
big data and supporting related big data analytics applications.

Implementation :

HDFS enables the rapid transfer of data between compute nodes. At its outset, it
was closely coupled with MapReduce, a framework for data processing that filters
and divides up work among the nodes in a cluster, and it organizes and condenses
the results into a cohesive answer to a query. Similarly, when HDFS takes in data, it
breaks the information down into separate blocks and distributes them to different
nodes in a cluster.
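This ingest step can be sketched in a few lines. The sketch below is illustrative only: the block size is HDFS's 128 MB default, the node names are made up, and real HDFS placement is far more sophisticated than round-robin.

```python
# Illustrative sketch (not Hadoop's implementation): split a file into
# fixed-size blocks, as HDFS does on ingest, and hand each block to a
# DataNode in round-robin order. Node names here are hypothetical.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) descriptors covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Map each block index to a DataNode, cycling through the node list."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}


# A 300 MB file splits into three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```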

With HDFS, data is written on the server once, and read and reused numerous times
after that. HDFS has a primary NameNode, which keeps track of where file data is
kept in the cluster.

HDFS also has multiple DataNodes on a commodity hardware cluster -- typically one
per node in a cluster. The DataNodes are generally organized within the same rack
in the data center. Data is broken down into separate blocks and distributed among
the various DataNodes for storage. Blocks are also replicated across nodes,
enabling highly efficient parallel processing.

The NameNode knows which DataNode contains which blocks and where the
DataNodes reside within the machine cluster. The NameNode also manages access
to the files, including reads, writes, creates, deletes and the data block replication
across the DataNodes.

The NameNode operates in conjunction with the DataNodes. As a result, the cluster
can dynamically adapt to server capacity demand in real time by adding or
subtracting nodes as necessary.

The DataNodes are in constant communication with the NameNode to determine if
the DataNodes need to complete specific tasks. Consequently, the NameNode is
always aware of the status of each DataNode. If the NameNode realizes that one
DataNode isn't working properly, it can immediately reassign that DataNode's task to
a different node containing the same data block. DataNodes also communicate with
each other, which enables them to cooperate during normal file operations.
Moreover, HDFS is designed to be highly fault-tolerant. The file system replicates
-- or copies -- each piece of data multiple times and distributes the copies to
individual nodes, placing at least one copy on a different server rack than the other
copies.
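This placement policy can be sketched as a small function. HDFS's documented default puts the first replica on the writer's rack and the remaining two replicas together on a single different rack; the topology and node names below are hypothetical, and the real placement logic also weighs load and node health.

```python
# Sketch of rack-aware replica placement, assuming HDFS's documented
# default of 3 replicas: one on the local rack, two on one remote rack.
# Rack and node names are made up for illustration.

def place_replicas(local_rack, nodes_by_rack):
    """Pick 3 DataNodes: one local, two on a single different rack."""
    remote_racks = [r for r in nodes_by_rack if r != local_rack]
    remote = remote_racks[0]
    first = nodes_by_rack[local_rack][0]
    second, third = nodes_by_rack[remote][:2]
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("rack1", topology)
```

Losing all of `rack1` still leaves two copies of the block on `rack2`, which is the fault-tolerance property the paragraph above describes.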

[Figure: HDFS schema. Source: Apache Software Foundation]

HDFS architecture centers on commanding NameNodes that hold metadata and
DataNodes that store information in blocks. Working at the heart of Hadoop, HDFS
can replicate data at great scale.

HDFS architecture, NameNodes and DataNodes :

HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the
primary server that manages the file system namespace and controls client access
to files. As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and provides clients
with the right access permissions. The system's DataNodes manage the storage
that's attached to the nodes they run on.

HDFS exposes a file system namespace and enables user data to be stored in files.
A file is split into one or more of the blocks that are stored in a set of DataNodes.
The NameNode performs file system namespace operations, including opening,
closing and renaming files and directories. The NameNode also governs the
mapping of blocks to the DataNodes. The DataNodes serve read and write requests
from the clients of the file system. In addition, they perform block creation, deletion
and replication when the NameNode instructs them to do so.
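The NameNode's two core responsibilities described above, the namespace (path to block IDs) and the block map (block ID to DataNodes), can be modeled with a toy class. This is purely illustrative and far simpler than the real NameNode; every name in it is made up.

```python
# Toy in-memory model of the NameNode's bookkeeping: the file system
# namespace and the block-to-DataNode map. Illustrative only.

class TinyNameNode:
    def __init__(self):
        self.namespace = {}   # file path -> ordered list of block IDs
        self.block_map = {}   # block ID -> set of DataNode names

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def rename(self, old, new):
        # A namespace operation: only metadata moves, no data is copied.
        self.namespace[new] = self.namespace.pop(old)

    def add_replica(self, block_id, datanode):
        self.block_map.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """DataNodes a client must contact, one list per block."""
        return [sorted(self.block_map.get(b, set()))
                for b in self.namespace[path]]


nn = TinyNameNode()
nn.create("/logs/a.txt", ["blk_1", "blk_2"])
nn.add_replica("blk_1", "dn1")
nn.add_replica("blk_2", "dn2")
nn.rename("/logs/a.txt", "/logs/b.txt")
```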

HDFS supports a traditional hierarchical file organization. An application or user can
create directories and then store files inside these directories. The file system
namespace hierarchy is like most other file systems -- a user can create, remove,
rename or move files from one directory to another.
The NameNode records any change to the file system namespace or its properties.
An application can stipulate the number of replicas of a file that the HDFS should
maintain. The NameNode stores the number of copies of a file, called the replication
factor of that file.

Features of HDFS :

There are several features that make HDFS particularly useful, including:

Data replication. This is used to ensure that the data is always available and
prevents data loss. For example, when a node crashes or there is a hardware failure,
replicated data can be pulled from elsewhere within a cluster, so processing
continues while data is recovered.

Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them
across nodes in a large cluster ensures fault tolerance and reliability.

High availability. As mentioned earlier, because of replication across nodes, data is
available even if the NameNode or a DataNode fails.

Scalability. Because HDFS stores data on various nodes in the cluster, as
requirements increase, a cluster can scale to hundreds of nodes.

High throughput. Because HDFS stores data in a distributed manner, the data can
be processed in parallel on a cluster of nodes. This, plus data locality (see next
bullet), cuts processing time and enables high throughput.

Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. By
minimizing the distance between the data and the computing process, this approach
decreases network congestion and boosts a system's overall throughput.
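A minimal sketch of this locality-aware scheduling idea, assuming the scheduler already has the NameNode's block-to-nodes map. This is a toy, not YARN's actual scheduler, and all names are hypothetical.

```python
# Sketch: for each input block, run the task on a node that already holds
# the block, skipping busy nodes; fall back to any replica holder (which
# would mean a remote read in practice). Illustrative only.

def schedule(block_locations, busy=()):
    """Map each block to a node that stores it, preferring free nodes."""
    plan = {}
    for block, nodes in block_locations.items():
        free = [n for n in nodes if n not in busy]
        plan[block] = free[0] if free else nodes[0]
    return plan


locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2", "dn3"]}
plan = schedule(locations, busy={"dn1"})
```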

Benefits of using HDFS :

There are five main advantages to using HDFS, including:

Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-
shelf hardware, which cuts storage costs. Also, because HDFS is open source,
there's no licensing fee.

Large data set storage. HDFS stores a variety of data of any size -- from megabytes
to petabytes -- and in any format, including structured and unstructured data.

Fast recovery from hardware failure. HDFS is designed to detect faults and
automatically recover on its own.

Portability. HDFS is portable across all hardware platforms, and it is compatible with
several operating systems, including Windows, Linux and Mac OS X.

Streaming data access. HDFS is built for high data throughput, which is best for
access to streaming data.

Cons :

1. Problem with Small files

Hadoop performs best with a small number of large files. Hadoop
stores files as blocks, which are 128 MB in size by default (often configured up to
256 MB). Hadoop struggles when it must manage a large number of small files:
the metadata for so many files overloads the NameNode and makes the cluster difficult to work with.
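A back-of-the-envelope calculation shows the overhead. It assumes the commonly cited rough figure of about 150 bytes of NameNode heap per file or block object; that figure is an estimate for illustration, not an exact number.

```python
# Rough estimate of NameNode metadata load: every file and every block is
# an in-memory object. The 150-byte per-object cost is a commonly cited
# ballpark, not an exact figure.

import math

BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB default block size
BYTES_PER_OBJECT = 150              # rough NameNode heap cost per object


def namenode_objects(num_files, file_size):
    """File objects plus block objects the NameNode must keep in memory."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)


# 10 GB stored as one file vs. as 10,240 one-megabyte files:
one_big = namenode_objects(1, 10 * 1024**3)       # 1 file + 80 blocks
many_small = namenode_objects(10_240, 1024**2)    # 2 objects per tiny file
```

The same 10 GB of data costs the NameNode roughly 250x more metadata objects when stored as small files, which is the strain the paragraph above describes.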

2. Vulnerability

Hadoop is a framework written in Java. Because Java is one of the most commonly
used programming languages, its vulnerabilities are widely known, which makes
Hadoop easier for cyber-criminals to exploit.

3. Low Performance on Small Data Sets

Hadoop is mainly designed for dealing with large datasets, so it can be efficiently
utilized by organizations that generate a massive volume of data. Its
efficiency decreases when it operates on small data sets.

4. Lack of Security

Data is everything for an organization, yet Hadoop's security features are disabled
by default. The administrator needs to be aware of this and
take appropriate action. Hadoop uses Kerberos for security,
which is not easy to manage. Kerberos also lacks storage and network
encryption, which is a further concern.

5. Processing Overhead

Read/write operations in Hadoop are expensive, since they deal with data at
terabyte or petabyte scale. In Hadoop, data is read from and written to disk, which
makes in-memory computation difficult and leads to significant processing
overhead.

6. Supports Only Batch Processing

A batch process runs in the background without any user interaction. The engines
inside the Hadoop core are optimized for this mode alone and are not especially
efficient otherwise; producing output with low latency is not possible with them.

Examples :

HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even
thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the
others being MapReduce and YARN. HDFS should not be confused with or replaced
by Apache HBase, which is a column-oriented non-relational database management
system that sits on top of HDFS and can better support real-time data needs with its
in-memory processing engine.

2)

NTFS (NT File System)

NTFS, which stands for 'NT file system' and the 'New Technology File System,' is the
file system that the Windows NT operating system (OS) uses for storing and
retrieving files on hard disk drives (HDDs) and solid-state drives (SSDs).

NTFS is the
Windows NT equivalent of the Windows 95 file allocation table (FAT) and the OS/2
High Performance File System (HPFS). However, NTFS offers several
improvements over FAT and HPFS in terms of performance, extendibility and
security.

A computer's OS creates and maintains the file system on a storage drive or device.
The file system essentially organizes the data into files. It controls how data files are
named, stored, retrieved and updated and what other information can be associated
with the files -- for example, data on file ownership and user permissions.

NTFS is one type of file system. File systems are generally differentiated by the OS
and the type of drive they are being used with. Today, there is also the distributed file
system (DFS), where files are stored across multiple servers but are accessed and
handled as if they were stored locally. A DFS enables multiple users to easily share
data and files on a network and provides redundancy.

How is NTFS used?

Microsoft Windows and some removable storage devices use NTFS to organize,
name and store files. NTFS is an option for formatting SSDs -- where its speed is
particularly useful -- HDDs, USBs and micro SD cards that are used with Windows.

Depending on the storage capacity of the device, the OS used and the type of drive,
a different file system may be preferable, such as FAT32 or Extended FAT (exFAT).
Each file system has benefits and drawbacks. For example, security and
permissions are more advanced with NTFS than exFAT and FAT32. On the other
hand, FAT32 and exFAT work better with non-Windows OSes, such as Mac and
Linux.

All Microsoft OSes from Windows XP on use NTFS version 3.1 as their main file
system. NTFS is also used on external drives because it has the capacity those
drives need, supporting large files and partition sizes. NTFS can support up to 8
petabyte volumes and files on Windows Server 2019 and Windows 10, according to
Microsoft. The theoretical limit for the individual file size NTFS can support is 16
exbibytes minus 1 kilobyte (KB).

How NTFS works

When installing an OS, the user chooses a file system. When formatting an SSD or
an HDD, users choose the file system they'll use. The process of formatting each
type of drive is slightly different, but both are compatible with NTFS.

When an HDD is formatted or initialized, it is divided into partitions. Partitions are the
major divisions of the hard drive's physical space. Within each partition, the OS
keeps track of all the files it stores. Each file is stored on the HDD in one or more
clusters or disk spaces of a predefined uniform size.

Using NTFS, the sizes of the clusters range from 512 bytes to 64 KB. Windows NT
provides a recommended default cluster size for each drive size. For example, a 4
gigabyte (GB) drive has a default cluster size of 4 KB. The clusters are indivisible, so
even the smallest file takes up one cluster, and a 4.1 KB file takes up two clusters, or
8 KB, on a 4 KB cluster system.
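The cluster arithmetic above can be written as a small helper. This is a sketch; the 4 KB default matches the example in the text, not every volume.

```python
# A file always occupies a whole number of clusters, so its on-disk
# footprint is its size rounded up to the next cluster boundary.

import math


def clusters_used(file_size_bytes, cluster_size_bytes=4096):
    """Whole clusters a file occupies (at least one if it has any data)."""
    return max(1, math.ceil(file_size_bytes / cluster_size_bytes))


# A 4.1 KB file on a 4 KB-cluster volume spans two clusters (8 KB on disk).
two = clusters_used(int(4.1 * 1024))
```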

Cluster sizes are determined based on balancing a tradeoff between maximizing use
of disk space and minimizing the number of disk accesses required to get a file. With
NTFS, generally, the larger the drive, the larger the default cluster size, because it's
assumed that a system user will prefer to have fewer disk accesses and better
performance at the expense of less efficient use of space.

When a file is created using NTFS, a record about the file is created in the Master
File Table (MFT). The record is used to locate a file's possibly scattered clusters.
NTFS looks for a storage space that will hold all the clusters of the file, but it isn't
always able to find one space all together.
Along with its data content, each file contains its metadata, which is a description of
its attributes.
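The MFT idea can be modeled with a toy record type that holds a file's metadata and its possibly scattered cluster runs. The field names below are illustrative only and do not reflect the actual on-disk NTFS record layout.

```python
# Toy model of an MFT record: per-file metadata plus a list of
# (start_cluster, length) extents. A fragmented file simply has more
# than one extent. Not the real NTFS on-disk format.

from dataclasses import dataclass, field


@dataclass
class MftRecord:
    name: str
    size: int
    runs: list = field(default_factory=list)  # (start_cluster, length)

    def clusters(self):
        """Expand the run list into the file's full cluster sequence."""
        out = []
        for start, length in self.runs:
            out.extend(range(start, start + length))
        return out


# A fragmented file stored in two separate extents on the volume:
rec = MftRecord("report.txt", size=12_288, runs=[(100, 2), (507, 1)])
```

Reading the file means following the record's runs in order, which is how the file system reassembles scattered clusters into one logical stream.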

NTFS features

One distinguishing characteristic of NTFS, compared with FAT, is that it allows for
file permissions and encryption. Notable features of NTFS include the following:

Organizational efficiency. NTFS uses a b-tree directory scheme to keep track of file
clusters. This is significant because it allows for efficient sorting and organization of
files.

Accessible data. It stores data about a file's clusters and other data in the MFT, not
just in an overall governing table as with FAT.

File size. NTFS supports very large files.

User permissions. It has an access control list that lets a server administrator control
who can access specific files.

Compression. Integrated file compression shrinks file sizes and provides more
storage space.

Unicode file naming. Because it supports file names based on Unicode, NTFS has a
more natural file-naming convention and allows for longer file names with a wider
array of characters. Non-Unicode naming conventions sometimes require translation.

Secure. NTFS provides security for data on removable and nonremovable disks.

Requires less storage. It has support for sparse files that replaces empty information
-- long strings of zeros -- with metadata that takes up a smaller volume of storage
space.

Easy volume access. NTFS uses mounted volumes, meaning disk volumes can be
accessed as normal folders in the file system.

Advantages and disadvantages of NTFS

There are several advantages and disadvantages to using NTFS, which are included
below.

Advantages

Control. One of the primary features of NTFS is the use of disk quotas, which gives
organizations more control over storage space. Administrators can use disk quotas
to limit the amount of storage space a given user can access.

Performance. NTFS uses file compression, which shrinks file sizes, increasing file
transfer speeds and giving businesses more storage space to work with. It also
supports very large files.

Security. The access control features of NTFS let administrators place permissions
on sensitive data, restricting access to certain users. It also supports encryption.

Easy logging. The MFT logs and audits files on the drive, so administrators can track
files that have been deleted, added or changed in any way. NTFS is a journaling file
system, meaning it logs transactions in a file system journal.

Reliability. Data and files can be quickly restored in the event of a system failure or
error, because NTFS maintains the consistency of the file system. It is a fault tolerant
system and has an MFT mirror file that the system can reference if the first MFT gets
corrupted.

Disadvantages

Limited OS compatibility. The main disadvantage of NTFS is limited OS
compatibility; it is read-only with non-Windows OSes.

Limited device support. Many removable devices don't support NTFS, including
Android smartphones, DVD players and digital cameras. Some other devices don't
support it either, such as media players, smart TVs and printers.

Mac OS X support. OS X devices have limited compatibility with NTFS drives; they
can read them but not write to them.

How NTFS, FAT32 and exFAT differ

Microsoft developed FAT32 before NTFS, making it the oldest of the three file
systems. It is generally considered less efficient than NTFS. It has a smaller
maximum file size of 4 GB and a maximum volume size of 32 GB in Windows.

FAT32 is easier to format than NTFS and simpler in other ways. Its file allocation
table is a less complex way to organize files than the MFT in NTFS. Because it's
simpler to use, FAT32 is more compatible with non-Windows OSes and is used
where NTFS generally isn't, such as smart TVs, digital cameras and other digital
devices. FAT32 works with every version of Mac, Linux and Windows. As mentioned
earlier, NTFS is read-only with Mac and Linux.

ExFAT was designed as an evolution of FAT32 and is the newest of the three file
systems. It retains the positive characteristics of FAT32 -- a lightweight, more flexible
file allocation system -- while overcoming some of its limitations. For example,
FAT32 can only store files of up to 4 GB, while exFAT can handle file sizes of 16
exabytes.

ExFAT does require additional software to work with Mac and Linux systems, but it is
more compatible with them than NTFS. It is ideal when users need larger files
than FAT32 allows and more cross-platform compatibility than NTFS offers.

The journaling file system in NTFS makes it possible to use the journal to repair data
corruption, something FAT cannot do. The MFT in NTFS holds more information
about the files being held than FAT's file allocation tables, making for better file
indexing and cluster organization.

The file system takeaway

NTFS, FAT32 and exFAT each have strengths and weaknesses. However, they are
also each used in a variety of computing contexts, from personal computing to the
enterprise. NTFS is prominent among the three because of its connection to
Windows.

HDFS vs. NTFS :

HDFS is the better choice for big data workloads, for the following reasons:

HDFS supports multiple replicas of files, which avoids the common bottleneck of
many clients accessing a single file. Its read performance scales better than
NTFS's because the multiple replicas sit on different physical disks.
