
What is Big Data?

Big data refers to data that is very large in size and still growing exponentially with time, together with the process of storing and analysing that data so that it makes sense for the organization.

Why do we need Big Data?

For any application that holds a limited amount of data, we normally use an SQL database such as PostgreSQL, Oracle or MySQL. But what about large applications like Facebook, Google or YouTube? Their data is so large and complex that none of the traditional data management systems is able to store and process it.

Facebook generates 500+ TB of data per day as people upload images, videos, posts and so on. Similarly, sending text and multimedia messages, updating Facebook or WhatsApp statuses, posting comments and so on generates huge amounts of data. Using traditional data processing applications (SQL/Oracle/MySQL) to handle it leads to a loss of efficiency. So, in order to handle the exponential growth of data, data analysis becomes a required task. To overcome this problem, we use big data technologies. Big data includes both structured and unstructured data.

Traditional data management systems and existing tools struggle to process such big data. R is one of the main computing tools used in statistical education and research, and it is also widely used for data analysis and numerical computing in scientific research.

Where does Big Data come from?

1. Social data: data coming from social media services, such as Facebook likes, photo and video uploads, comments, tweets and YouTube views.

2. Share market: stock exchanges generate a huge amount of data through their daily transactions.

3. E-commerce sites: e-commerce sites like Flipkart, Amazon and Snapdeal generate huge amounts of data.

4. Airplanes: a single airplane can generate 10+ TB of data in 30 minutes of flight time.

What is the need for storing such huge amount of data?

The main reason for storing data is analysis. Data analysis is the process of cleaning, transforming and remodelling data in order to reach a conclusion about a given situation. More accurate analysis leads to better decision making, and better decision making leads to greater efficiency and lower risk.

What is Hadoop?

Hadoop is an open-source framework used to store and process big data across clusters of commodity hardware. Hadoop follows a master-slave architecture, meaning there is one master machine and multiple slave machines. The data that you give to Hadoop is stored across these machines in the cluster.

Two important components of hadoop are:

1. HDFS (data storage)

2. MapReduce (processing and analysis)

1. HDFS:

The Hadoop Distributed File System (HDFS) is a distributed file system used to store very large amounts of data. HDFS follows a master-slave architecture, meaning there is one master machine (the Name Node) and multiple slave machines (the Data Nodes). The data that you give to Hadoop is stored across these machines in the cluster.

Various components of HDFS are:

a) Blocks:

HDFS is a block-structured file system in which each file is split into blocks of equal size and stored across one or more machines in a cluster. HDFS blocks are 64 MB by default in Apache Hadoop and 128 MB by default in Cloudera Hadoop, but the block size can be increased as needed. If a file in HDFS is smaller than the block size, it does not occupy a full block. Example: if the file size is 10 MB and the HDFS block size is 128 MB, the file takes only 10 MB of space.
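The following is a minimal sketch, assuming a working Hadoop client configuration on the classpath and a hypothetical example file at /data/example.txt, that uses the HDFS Java API to print a file's block size and the Data Nodes that hold each of its blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath to locate the Name Node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt");   // hypothetical example path
        FileStatus status = fs.getFileStatus(file);

        // Block size configured for this file and its actual length.
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        System.out.println("File length: " + status.getLen() + " bytes");

        // Ask the Name Node which Data Nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset() + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
    }
}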

b) Name Node:

The Name Node is the controller/master of the system. It distributes data to the Data Nodes and stores the metadata of all the files in HDFS. This metadata includes the file name, the location of each block, the block size and the file permissions.

c) Data Node:

The Data Nodes are where the actual data is stored. The data distributed by the Name Node is stored on the Data Nodes. They store and retrieve blocks when requested by the client or the Name Node, and they perform operations such as block creation, deletion and replication as instructed by the Name Node.

d) Secondary Name Node:

Many people think that the Secondary Name Node is just a backup of the primary Name Node in Hadoop, but it is not a backup node. The Name Node is the primary node, and it periodically stores all of the metadata in the fsimage and editlog files. When the Name Node goes down, the Secondary Name Node comes online, but it only has read access to the fsimage and editlog files and does not have write access to them. All Secondary Name Node operations are stored in a temporary folder. When the Name Node comes back online, this temporary folder is copied to the Name Node, and the Name Node updates the fsimage and editlog files.

2. MapReduce:

MapReduce is a framework and processing technique with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce programs are written in Java by default, but other languages and tools, such as Apache Pig, can also be used.

The MapReduce algorithm contains two important tasks.

a) Map:

In the map stage, the mapper takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs).

b) Reduce:

In the reduce phase, the reducer takes the output of the map as its input and combines those data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
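As a small illustration of both phases, here is the classic word-count example written against the Hadoop MapReduce Java API (a sketch; class and file names are arbitrary). The mapper emits a (word, 1) pair for every word in its input, and the reducer sums the counts for each word:

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break each input line into words and emit a (word, 1) pair for each one.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All (word, 1) pairs with the same word arrive together; add them up.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}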

Components of MapReduce:
a) JobTracker:

The JobTracker is a daemon that runs on the name node. There is only one JobTracker per cluster, but there are many TaskTrackers, one per data node. The JobTracker assigns tasks to the different TaskTrackers. It is a single point of failure: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which it decides whether an assigned task has been completed or not. A minimal job-submission sketch is given after the process list below.

JobTracker process:

1. JobTracker receives the requests from the client.

2. JobTracker talks to the NameNode to determine the location of the data.

3. JobTracker finds the best TaskTracker nodes to execute tasks.

4. The JobTracker submits the work to the chosen TaskTracker nodes.

5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals then work is scheduled
on a different TaskTracker.

6. When the work is completed, the JobTracker updates its status and reports the overall status of the job back to the client.

7. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running
jobs are halted.
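To make this flow concrete, here is a minimal driver sketch (assuming the WordCountMapper and WordCountReducer classes shown earlier) that submits a job and waits for it to finish. The same client-side code applies whether the scheduler is the classic JobTracker described here or YARN in newer Hadoop versions:

// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion submits the job and polls its status until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}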

b) TaskTracker:

The TaskTracker is also a daemon, and it runs on the data nodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes the job and divides the work amongst the different TaskTrackers to perform the MapReduce tasks. While performing these tasks, each TaskTracker simultaneously communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.

TaskTracker process:

1. The JobTracker submits the work to the TaskTracker nodes.

2. TaskTrackers run the tasks and report the status of each task to the JobTracker.

3. Each TaskTracker follows the orders of the JobTracker and periodically updates the JobTracker with its progress.

4. TaskTracker will be in constant communication with the JobTracker.

5. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker
will assign the task to another node.

4V's of Big Data:

1. Volume:

The amount of data we deal with is very large, on the order of petabytes.

2. Variety:

Data comes in all types of formats (text, audio, image, video).

3. Velocity:

Data is generated at a very fast rate. Velocity is the measure of how fast data is coming in. For time-critical applications, faster processing is very important. Examples: share trading, video streaming.

4. Veracity:

Veracity refers to the trustworthiness and quality of the data. Data collected from many sources can be incomplete, inconsistent or noisy, so it must be cleaned before reliable conclusions can be drawn from it.

Setting up Couchbase Server

Open the Couchbase Web Console.

Click Setup to begin the initial configuration.

Step 1 of 5: Configure Server


Set up the disk storage and Data service.

The Configure Disk Storage option specifies the location of the persistent storage used by the Couchbase Server Data service. The setting affects only this node and sets the directory where all the data is stored on disk. It also sets the location where the indexes created by views are stored. If you are not indexing data with views, you can accept the default setting. Otherwise, for the best performance, we recommend specifying separate physical storage for the data and index paths.

Provide a node IP or hostname under Configure Server Hostname.

Choose which services to include on the first node. At a minimum, the Data service is required. In
production, only one service per node should be deployed.

The Configure Server Memory section sets the amount of physical RAM that will be allocated by the
Couchbase Server for storage, both for the Data and the Index service. The same amount of memory is
allocated to each node in the cluster that runs the particular service. Since the same setting applies to
the whole cluster, specify a value that all nodes of each service type can support.

Alternatively, if joining an existing cluster, select the radio button Join a cluster now and then provide
the IP address or hostname of an existing node and administrative credentials for that existing cluster.
Select the services to install on this node.

Click Next.
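This step can also be scripted against the Couchbase REST API on the administration port (8091 by default). The sketch below is only an illustration of that idea, assuming the documented node-initialization endpoints (setupServices, /pools/default, /settings/web); the host, quotas and credentials are placeholder values, so verify the endpoints and parameters against the REST documentation for your Couchbase Server version:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ClusterInitSketch {

    // POSTs a form-encoded body to the Couchbase REST API on the admin port.
    static void post(String path, String body) throws Exception {
        URL url = new URL("http://127.0.0.1:8091" + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(path + " -> HTTP " + conn.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        // Services to run on this first node (kv = Data, n1ql = Query, index = Index).
        post("/node/controller/setupServices", "services=kv%2Cn1ql%2Cindex");
        // Memory quotas (in MB) for the Data and Index services; applies to every node running that service.
        post("/pools/default", "memoryQuota=1024&indexMemoryQuota=512");
        // Administrator credentials; port=SAME keeps the admin port unchanged. Set these last.
        post("/settings/web", "username=Administrator&password=password&port=SAME");
    }
}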

Step 2 of 5: Sample Buckets


The Sample Buckets panel appears where you can select the sample data buckets you want to load.

Click the names of the sample buckets you want to load into Couchbase Server. These data sets demonstrate Couchbase Server’s features and help you understand and develop views and N1QL queries.
If you decide to install sample data, the installer creates one Couchbase bucket for each set of sample
data you choose.
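As a quick illustration, a N1QL query can be issued against a sample bucket from the Couchbase Java SDK. The sketch below assumes the travel-sample bucket has been loaded, the Query and Index services are running, and the client runs on the same machine as the node configured above (SDK 2.x style shown):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;
import com.couchbase.client.java.query.N1qlQueryRow;

public class SampleBucketQuery {
    public static void main(String[] args) {
        // 127.0.0.1 assumes the SDK runs on the same machine as the Couchbase node.
        CouchbaseCluster cluster = CouchbaseCluster.create("127.0.0.1");
        Bucket bucket = cluster.openBucket("travel-sample");

        // A simple N1QL query against the sample data set.
        N1qlQueryResult result = bucket.query(
                N1qlQuery.simple("SELECT name FROM `travel-sample` WHERE type = 'airline' LIMIT 5"));
        for (N1qlQueryRow row : result) {
            System.out.println(row.value());
        }

        cluster.disconnect();
    }
}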

After you create sample data buckets, the Create Default Bucket panel appears where you can create a
new default data bucket.

Step 3 of 5: Create Default Bucket (optional)


At this step you can set up a default bucket for Couchbase Server. You can change most bucket settings later; a major exception is the bucket name, which in this case is fixed to default. See Bucket setup for more information.

If you wish to set up the default bucket then accept all defaults in this panel.

Couchbase Server will create a new data bucket named default; you can use this test bucket to learn
more about Couchbase Server and in test environments. It is worth noting that the default bucket is
unauthenticated, which is not recommended for production purposes due to possible security risks. You
can instead choose to skip the creation of the default bucket and create your own buckets at a later
stage.

To create the default bucket as part of the setup, click Next; otherwise, click Skip to skip this step.

Step 4 of 5: Notifications


In the Notifications screen, select Enable software update notifications.


Couchbase Web Console communicates with Couchbase Server nodes and confirms the version
numbers of each node.

As long as you have Internet access, this information is sent anonymously to Couchbase corporate, which uses it only to provide you with updates and information to help improve Couchbase Server and related products. When you provide an email address, it is added to the Couchbase community mailing list for news and update information about Couchbase and related products. You can unsubscribe from the mailing list at any time using the Unsubscribe link provided in each newsletter.

Couchbase Web Console communicates the following information:

The current version. When a new version of Couchbase Server exists, you get information about where
you can download the new version.

Information about the size and configuration of your Couchbase cluster to Couchbase corporate. This
information helps prioritize the development efforts.

Read the terms and conditions and then select I agree to the terms and conditions associated with this
product and click Next.

Step 5 of 5: Configure Server


The Configure this Server screen is the last configuration step. Enter a cluster administrator’s username and password. The username can have up to 24 characters, and the password must have 6 to 24 characters. Use these credentials each time you add a new server to the cluster. These are the same credentials you use for the Couchbase Server REST API.

After you finish this setup, you see the Couchbase Web Console with the Cluster Overview page.
Couchbase Server is now running and ready to use.
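To confirm the node is usable from an application, here is a minimal smoke-test sketch with the Couchbase Java SDK (2.x style), assuming the unauthenticated default bucket created in Step 3 and a client running on the same machine as the node:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class DefaultBucketSmokeTest {
    public static void main(String[] args) {
        // 127.0.0.1 assumes the SDK runs on the same machine as the node configured above.
        CouchbaseCluster cluster = CouchbaseCluster.create("127.0.0.1");
        // The default bucket created by the wizard is unauthenticated, so no password is needed.
        Bucket bucket = cluster.openBucket("default");

        // Store and read back a small JSON document to confirm the cluster is ready to use.
        JsonObject content = JsonObject.create().put("message", "hello couchbase");
        bucket.upsert(JsonDocument.create("smoke-test::1", content));
        System.out.println(bucket.get("smoke-test::1").content());

        cluster.disconnect();
    }
}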
