The Hadoop Distributed File System (HDFS) was developed using a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault-tolerant even though it is designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines. These files are stored redundantly, so the system can recover from data loss when a component fails. HDFS also makes the data available to applications for parallel processing.
HDFS Architecture
HDFS follows a master-slave architecture and has the following elements.
Namenode
The namenode is software that runs on a commodity-hardware machine, typically under a GNU/Linux operating system. The system hosting the namenode acts as the master server, and it does the following tasks −
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and directories.
Datanode
A datanode is a commodity-hardware machine with a GNU/Linux operating system and the datanode software. Every node (commodity hardware/system) in a cluster runs a datanode, and these nodes manage the data storage of their system.
Datanodes perform read-write operations on the file system, as per client requests. They also perform operations such as block creation, deletion, and replication according to instructions from the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual datanodes. These file segments are called blocks; in other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be increased as needed by changing the HDFS configuration.
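As a hedged sketch of how this looks in practice, the snippet below uses the standard Hadoop FileSystem Java API to set a block size. The property key dfs.blocksize is the standard Hadoop 2.x name; the file path and size values are illustrative assumptions, not values from this document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set a default block size of 128 MB (normally done in
        // hdfs-site.xml rather than in code).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Alternatively, override the block size for a single file:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/example.dat"), // hypothetical path
                true,                         // overwrite if it exists
                4096,                         // I/O buffer size in bytes
                (short) 3,                    // replication factor
                256L * 1024 * 1024);          // 256 MB blocks for this file
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}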
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity-hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick, automatic fault detection and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster in order to manage applications that have huge datasets.
Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
What is HBase?
Because HBase is a NoSQL database, developers can easily read and write data as and when required, which makes it faster than working through the HDFS architecture directly. The HBase data model consists of the following components:
1. RowKey: A RowKey is assigned to every set of data that is recorded. This makes it easy to search for specific data in HBase tables.
2. Columns: Columns are the different attributes of a dataset. Each RowKey can have an unlimited number of columns.
3. Column Qualifiers: Column qualifiers are like column titles or attribute names in a normal table.
4. Timestamp: Whenever data is stored in the HBase data model, it is stored with a timestamp.
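To make these components concrete, the following is a minimal sketch using the standard HBase Java client: one Put carries a RowKey, a column family, a column qualifier, an explicit timestamp, and a value. The table name, family, and values are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // The RowKey identifies this set of data.
            Put put = new Put(Bytes.toBytes("user#1001"));

            // Column family, column qualifier, timestamp, and cell value.
            put.addColumn(Bytes.toBytes("info"),      // column family
                          Bytes.toBytes("name"),      // column qualifier
                          System.currentTimeMillis(), // timestamp
                          Bytes.toBytes("Alice"));    // cell value
            table.put(put);
        }
    }
}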
The HBase architecture comprises three major components: HMaster, Region Server, and ZooKeeper.
1. HMaster
HMaster operates as its name suggests: it is the master that assigns regions to the Region Servers (slaves). The HBase architecture uses an auto-sharding process to maintain data: whenever an HBase table becomes too large, it is split up and distributed by the system with the help of HMaster. Typical responsibilities of HMaster include assigning regions to Region Servers, balancing the load across them, and handling failover.
2. Region Server
Region Servers are the end nodes that handle all user requests. Several regions are
combined within a single Region Server. These regions contain all the rows between
specified keys. Handling user requests is a complex task to execute, and hence Region
Servers are further divided into four different components to make managing requests
seamless.
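Because each region holds a contiguous range of row keys, a client scan over a key range is served by the Region Servers whose regions cover that range. Below is a hedged sketch using the standard HBase client Scan API; the table name and key bounds are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Scan all rows whose keys fall in [user#1000, user#2000).
            // HBase routes the request to the Region Server(s) whose
            // regions cover this key range.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user#1000"))
                    .withStopRow(Bytes.toBytes("user#2000"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}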
3. ZooKeeper
ZooKeeper acts as the bridge across the communication of the HBase architecture. It
is responsible for keeping track of all the Region Servers and the regions that are
within them. Monitoring which Region Servers and HMaster are active and which
have failed is also a part of ZooKeeper’s duties. When it finds that a Server Region
has failed, it triggers the HMaster to take necessary actions. On the other hand, if the
HMaster itself fails, it triggers the inactive HMaster that becomes active after the alert.
Every user and even the HMaster need to go through ZooKeeper to access Region
Servers and the data within. ZooKeeper stores a.Meta file, which contains a list of all
the Region Servers. ZooKeeper’s responsibilities include:
Components of Hive Architecture
Compiler –
The compiler parses queries and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata obtained from the metastore.
Metastore –
The metastore stores all the structured information about the different tables and partitions in the warehouse, including attributes and attribute-level information, as well as the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. Hive selects corresponding database servers to store the schema or metadata of databases, tables, attributes in a table, data types of databases, and the HDFS mapping.
Execution Engine –
The execution engine carries out the execution plan created by the compiler. The plan is a DAG (directed acyclic graph) of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.
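As a hedged sketch of how a client can inspect the compiler's output, the snippet below connects to HiveServer2 through the standard Hive JDBC driver and runs EXPLAIN on a query, which prints the generated plan as a DAG of stages. The host, database, and table names are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExplainExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver and URL format.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default"); // hypothetical host
             Statement stmt = conn.createStatement();
             // EXPLAIN prints the plan the compiler produced:
             // a DAG of stages with their dependencies.
             ResultSet rs = stmt.executeQuery(
                     "EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}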
What are the approaches to integrating the human into the data exploration process in order to realize different kinds of visual data mining?
Answer
The approaches to integrating the human into the data exploration process for visual data mining are:
1. Two-dimensional data:
a. A classic example of two-dimensional data is geographical data, where the two distinct dimensions are longitude and latitude.
b. The standard method for visualizing two-dimensional data is the x-y plot; maps are a special type of x-y plot for presenting two-dimensional geographical data.
c. An example of two-dimensional data is a geographical map.
2. Multi-dimensional data:
a. Many data sets consist of more than three dimensions and therefore do not allow simple visualization as two- or three-dimensional plots.
b. Examples of multi-dimensional (or multivariate) data are tables from relational databases, which often have tens to hundreds of columns.
a. Dynamic projection:
1. Dynamic projection is an automated navigation operation.
2. The basic idea is to dynamically change the projections in order to explore a multi-dimensional data set.
3. A well-known example is the GrandTour system, which tries to show all interesting two-dimensional projections of a multi-dimensional data set as a series of scatter plots.
4. The sequence of projections shown can be random, manual, precomputed, or data-driven.
5. Examples of dynamic projection techniques include XGobi, XLispStat, and ExplorN; a minimal sketch of one random two-dimensional projection follows this list.
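To illustrate the core idea (this is not the actual GrandTour algorithm, which interpolates smoothly between projection planes), here is a hedged sketch that projects a multi-dimensional data set onto one randomly chosen orthonormal two-dimensional plane; each such projection would be rendered as one scatter plot in the tour. The toy data is an invented example.

import java.util.Random;

public class RandomProjection {
    // Project n-dimensional points onto a random orthonormal 2-D plane.
    static double[][] projectRandom2D(double[][] points, Random rng) {
        int dim = points[0].length;
        double[] u = randomUnit(dim, rng);
        // Gram-Schmidt: make v orthogonal to u, then renormalize it.
        double[] v = randomUnit(dim, rng);
        double dot = 0;
        for (int i = 0; i < dim; i++) dot += v[i] * u[i];
        for (int i = 0; i < dim; i++) v[i] -= dot * u[i];
        normalize(v);

        double[][] out = new double[points.length][2];
        for (int p = 0; p < points.length; p++) {
            for (int i = 0; i < dim; i++) {
                out[p][0] += points[p][i] * u[i]; // x coordinate of the scatter plot
                out[p][1] += points[p][i] * v[i]; // y coordinate of the scatter plot
            }
        }
        return out;
    }

    static double[] randomUnit(int dim, Random rng) {
        double[] x = new double[dim];
        for (int i = 0; i < dim; i++) x[i] = rng.nextGaussian();
        normalize(x);
        return x;
    }

    static void normalize(double[] x) {
        double norm = 0;
        for (double xi : x) norm += xi * xi;
        norm = Math.sqrt(norm);
        for (int i = 0; i < x.length; i++) x[i] /= norm;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 2, 3, 4}, {4, 3, 2, 1}, {0, 1, 0, 1} }; // toy 4-D data
        double[][] xy = projectRandom2D(data, new Random(42));
        for (double[] p : xy) System.out.printf("%.3f %.3f%n", p[0], p[1]);
    }
}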
b. Interactive filtering:
1. Interactive filtering is a combination of selection and view enhancement.
2. In exploring large data sets, it is important to interactively partition the data set into segments and focus on interesting subsets.
3. This can be done by direct selection of the desired subset (browsing) or by specification of properties of the desired subset (querying).
4. An example of a tool that can be used for interactive filtering is the Magic Lens.
5. The basic idea of the Magic Lens is to use a tool similar to a magnifying glass to filter the data directly in the visualization: the data under the magnifying glass is processed by the filter and displayed in a different way than the remaining data set.
6. Magic Lenses show a modified view of the selected region, while the rest of the visualization remains unaffected.
7. Examples of interactive filtering techniques include InfoCrystal, Dynamic Queries, and Polaris; a small sketch of property-based filtering (querying) follows this list.
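As a hedged illustration of the querying style of filtering, the sketch below selects the subset of records whose properties match a specification. In a real tool such as Dynamic Queries the bound would be driven by interactive sliders; the City record type and its values are invented for illustration.

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class InteractiveFilterExample {
    // Hypothetical record type standing in for one data point.
    record City(String name, int population, double area) {}

    // "Querying": select the subset whose properties match the specification.
    static List<City> filter(List<City> data, Predicate<City> spec) {
        return data.stream().filter(spec).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<City> data = List.of(
                new City("A", 1_200_000, 310.5),
                new City("B", 85_000, 42.0),
                new City("C", 540_000, 120.7));

        // A slider in a Dynamic-Queries-style UI would update this bound;
        // re-running the filter re-renders only the matching subset.
        int minPopulation = 100_000;
        List<City> focused = filter(data, c -> c.population() >= minPopulation);
        focused.forEach(c -> System.out.println(c.name()));
    }
}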
c. Zooming:
1. Zooming is a well-known view modification technique that is widely used in a number of applications.
2. When dealing with large amounts of data, it is important to present the data in a highly compressed form to provide an overview, while at the same time allowing the data to be displayed at different resolutions.
3. Zooming does not only mean displaying the data objects larger; the data representation may also change automatically to present more detail at higher zoom levels.
4. The objects may, for example, be represented as single pixels at a low zoom level, as icons at an intermediate zoom level, and as labeled objects at a high resolution.
5. An interesting example applying the zooming idea to large tabular data sets is the TableLens approach.
6. The basic idea of TableLens is to represent each numerical value by a small bar.
7. All bars have a one-pixel height, and their lengths are determined by the attribute values; a minimal sketch of this encoding follows this list.
8. Examples of zooming techniques include PAD++, IVEE/Spotfire, and DataSpace.
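To make the TableLens encoding concrete, here is a hedged text-mode sketch (a real TableLens draws one-pixel-high graphical bars): each numeric value is mapped to a bar whose length is proportional to the value, so a whole column compresses into a compact overview. The column data is invented for illustration.

public class TableLensSketch {
    // Render each value as a bar whose length is proportional to the value,
    // scaled so the maximum value fills `width` characters (stand-ins for pixels).
    static void renderColumn(String label, double[] values, int width) {
        double max = 0;
        for (double v : values) max = Math.max(max, v);
        if (max == 0) max = 1; // avoid division by zero for an all-zero column
        System.out.println(label);
        for (double v : values) {
            int len = (int) Math.round(width * v / max);
            System.out.println("|" + "#".repeat(len));
        }
    }

    public static void main(String[] args) {
        double[] salaries = { 42_000, 58_500, 91_000, 37_250, 66_800 }; // toy data
        renderColumn("salary", salaries, 40);
    }
}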
d. Distortion:
1. Distortion is a view modification technique that supports the data exploration process by preserving an overview of the data during drill-down operations.
2. The basic idea is to show portions of the data with a high level of detail while others are shown with a lower level of detail.
3. Popular distortion techniques are hyperbolic and spherical distortions.
4. These are often used on hierarchies or graphs but may also be applied to any other visualization technique.
5. Examples of distortion techniques include Bifocal Displays, Perspective Wall, Graphical Fisheye Views, Hyperbolic Visualization, and Hyperbox; a simple fisheye-style sketch follows this list.
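As a hedged sketch of the idea behind fisheye-style distortion, the snippet below applies the classic Sarkar-Brown magnification function g(x) = (d + 1)x / (dx + 1) to normalized distances from a focus point: positions near the focus are spread apart (high detail) while distant positions are compressed (overview). The one-dimensional setting and the parameter values are simplifying assumptions.

public class FisheyeSketch {
    // Sarkar-Brown fisheye function: maps a normalized distance x in [0,1]
    // from the focus to a magnified distance. Larger d = stronger distortion.
    static double fisheye(double x, double d) {
        return (d + 1) * x / (d * x + 1);
    }

    public static void main(String[] args) {
        double focus = 0.5; // focus point on a normalized 1-D axis
        double d = 4.0;     // distortion factor (assumed value)
        for (int i = 0; i <= 10; i++) {
            double x = i / 10.0;
            // Distort the offset from the focus, preserving its sign.
            double offset = x - focus;
            double maxOffset = Math.max(focus, 1 - focus);
            double norm = Math.abs(offset) / maxOffset;
            double distorted = focus + Math.signum(offset) * fisheye(norm, d) * maxOffset;
            System.out.printf("%.2f -> %.3f%n", x, distorted);
        }
    }
}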