
GMR Institute of Technology, Rajam

Department of IT

An Approach to Understand the Working of Sqoop and Hive in Hadoop

Project Supervisor: Ms. U. Archana, Asst. Professor, Dept. of Information Technology

Y. Santosh (19341A12C8)
P. V. Manideep (19341A1286)
M. Niranjani (20345A1208)
S. Jithendra Varma (19341A12A8)

ABSTRACT

Analysis of consistent and structured data has seen huge success in the past decades, whereas the analysis of unstructured data, such as multimedia content, remains a challenging task. The buzzword "big data" refers to large-scale distributed data-processing applications that operate on large amounts of data. Most big data solutions are built on top of the Hadoop ecosystem or use its Hadoop Distributed File System (HDFS). Hadoop manages the distribution of data and its placement across the cluster based on analysis of the data itself. Sqoop (SQL-to-Hadoop) is one of the Hadoop components; it is designed to efficiently import enormous volumes of data from an RDBMS into HDFS and to export data from HDFS back to the RDBMS. Instagram is one of the most popular and widely used social media tools, and in this work the data generated from Instagram is mined and analysed using Sqoop and Hive. Hive is open-source data warehouse software for reading, writing and managing large data sets stored in the Hadoop Distributed File System (HDFS). This paper details how Sqoop and Hive work in Hadoop.

Keywords: Big data, Hadoop, Sqoop, Hive, HDFS.


PROBLEM STATEMENT

Loading a dataset that contains duplicated data may give unwanted results while working on the data set. Due to the flexibility of the system, much of the network and processing overhead associated with loading raw data can be avoided. Since data is always changing, this flexibility also makes it much easier to integrate any changes. Hadoop allows you to process massive amounts of data very quickly using its components. Here we use the Sqoop tool to transfer the data from the RDBMS into HDFS.
EXISTING SYSTEM

• The current system simply uses a dataset downloaded directly from the Kaggle website.

• The existing system uses the Pig scripting language, which converts scripts into map and reduce jobs; this makes it difficult to process the data from the dataset, and the errors produced are not helpful.

• Although many scripting languages are available today, the existing system relies on the Pig scripting language alone, which makes low-level abstraction of the data difficult; in addition, commands cannot be executed until the dataset is loaded through a .pig script.

• The code is difficult to understand, and retrieving even a single line of data requires a large amount of code to process.
PROPOSED SYSTEM
• The proposed system aims to improve the accuracy of the results when performing queries on the dataset using Hive commands.

• These Hive commands are simple to use, the queries run fast, and they take less time to write.

• HiveQL is a declarative, SQL-like language that lets users perform simple join operations and imposes structure on a variety of data formats (a minimal join sketch follows this list).

• Multiple users can query the data with HiveQL. Here the dataset is created using SQL queries and imported into HDFS using the Sqoop tool, which lets users modify the dataset according to their requirements.

• The proposed system is designed in such a way that it can analyse the data set and give accurate outputs for the queries processed.
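
As a brief illustration of the join point above, here is a minimal HiveQL sketch; the insta table mirrors the dataset used later, while the insta_category lookup table and its category column are hypothetical and introduced only for this example.

hive> select i.profile_name, i.followers, c.category from insta i join insta_category c on (i.profile_name = c.profile_name) limit 5;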
Literature survey
1) Sidhu, R., Chea, D., Dhakal, R., Hur, A., & Zhang, M. (2015).
Implementation of Hadoop and Sqoop for Big Data Organization.

• The authors describe the widespread use of the Hadoop MapReduce framework by businesses and others needing a medium to precisely organize sets of data.

• MapReduce utilizes two main functions: map and reduce.

• MapReduce processes data in parallel across a variety of servers, which enables it to work efficiently with very large datasets.

• Sqoop is a significant tool for the management of Big Data.

• Sqoop can easily move data from an RDBMS to the Hadoop file system.
Literature survey
2) Geng, D., Zhang, C., Xia, C., Xia, X., Liu, Q., & Fu, X. (2019). Big data-based improved data acquisition and storage system for designing industrial data platform. IEEE Access, 7, 44574-44582.

• A big data-based acquisition and storage system (ASS) plays an important role in the design of an industrial data platform.

• Generally, a big data platform includes three modules: a data acquisition module, a data storage module and a data computation module.

• Sqoop and Flume are the main frameworks for the data acquisition module.

• In the paper, a high-availability data storage system based on HDFS and HBase is built.

• The computing framework uses Spark SQL for batch processing, Spark Streaming for micro-batch processing, and Structured Streaming and Flink for stream processing.
Literature survey
3) Lydia, E. L., & Swarup, M. B. (2015). Big data analysis using hadoop components like flume, mapreduce, pig and hive. International Journal of Science, Engineering and Computer Technology, 5(11), 390.

• Big data mainly refers to collections of data sets so large and complex that they are very difficult to handle using on-hand database management tools.

• The main challenges with big databases include creation, curation, storage, sharing, search, analysis and visualization.

• Flume can be used to acquire data from social media such as Twitter.

• This data can then be organized using distributed file systems such as the Google File System or the Hadoop File System.

• A Flume agent is a JVM process with three components: Flume Source, Flume Channel and Flume Sink.

• Pig uses Pig Latin, a dataflow language in which you program by connecting things together.
Literature survey
4) Pol, U. R. (2016). Big data analysis: comparison of hadoop mapreduce,
pig and hive. International Journal of Innovative Research in Science,
Engineering and Technology, 5(6), 9687-93.

• Hadoop is an open-source software framework.

• Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

• A MapReduce job usually splits the input dataset into independent chunks that are processed by the map tasks in a completely parallel manner.

• The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and an MRAppMaster per application.

• Hadoop is typically used for storing (HDFS) and summarizing (MapReduce) large batches of data. MapReduce permits you to filter and aggregate data from HDFS so that you can gain insights from the big data.
Literature survey
5) Qayyum, R. (2020). A Roadmap Towards Big Data Opportunities, Emerging Issues and Hadoop as a Solution. International Journal of Education and Management Engineering (IJEME), 10(4), 8-17.

• With the enhancement of existing technologies, data is being generated on a large scale; consider, for example, how much data is generated through smartphones.

• The main purpose of the study is to present: (a) a comprehensive survey of Big Data characteristics; (b) a discussion of the types of Big Data, to understand the concepts of this domain; (c) the new opportunities coming with Big Data; and (d) the issues associated with Big Data.

• The data is not only huge but is also present in various formats: structured, unstructured, multi-structured and semi-structured.

• It is essential to ensure that an interconnected system is in place to store this variety of data generated from various sources.
Literature survey
6)Rodrigues, A. P., & Chiplunkar, N. N. (2018). Real-time Twitter data analysis
using Hadoop ecosystem. Cogent Engineering, 5(1), 1534519.

• In today's highly developed world, every minute people around the globe express themselves via various platforms on the Web, generating a huge amount of unstructured data. Such data is termed big data.

• Twitter, one of the largest social media sites, receives millions of tweets every day on a variety of important issues.

• Pig Latin is a procedural programming language and fits very naturally in the pipeline paradigm.

• Pig Latin allows developers to insert their own code almost anywhere in the data pipeline, which is useful for pipeline development. It can be used for dumping Twitter data into Hadoop HDFS.
Literature survey
7) Ramsingh, J., & Bhuvaneswari, V. (2015). An insight on big data analytics using pig script. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), 4(6), 2278-6856.

• This paper discusses one of the Hadoop components, Apache Pig, for analysing Big Data in a minimal amount of time.

• YARN is a cluster management technology and one of the key features of second-generation Hadoop.

• YARN provides resource management and a central platform to deliver consistent operations, security and data governance tools across Hadoop clusters.

• The Pig execution environment has two modes of execution (a shell sketch follows this list):
  1. Local mode: Pig scripts run on a single machine; Hadoop MapReduce and HDFS are not required.
  2. Hadoop or MapReduce mode: Pig scripts run on a Hadoop cluster.
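
A minimal shell sketch of invoking Pig in each mode; the script name wordcount.pig is a hypothetical placeholder:

# Local mode: runs against the local file system, no Hadoop cluster needed
pig -x local wordcount.pig

# MapReduce mode (the default): compiles the script into MapReduce jobs on the cluster
pig -x mapreduce wordcount.pig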
Literature survey
8) Mutasher, W. G., & Aljuboori, A. F. (2022). New and Existing Approaches
Reviewing of Big Data Analysis with Hadoop Tools. Baghdad Science Journal,
0887-0887.

• Hadoop is a free software platform that enables parallel processing of big datasets using programming models on commodity server clusters.

• The Hadoop Distributed File System uses a master-slave architecture of a NameNode and DataNodes.

• A Flume agent is a Java Virtual Machine (JVM) process hosting the components through which events stream from an external source to the next destination.

• Spark is an effective open-source processing engine, designed around top speed, ease of use and an advanced analytics engine.

• Sorting 10-100 times faster than MapReduce, Spark provides quicker time to insight, resulting in faster information, better management decisions and better outcomes.
Literature survey
9) Menon, A. (2012, September). Big data@ facebook. In Proceedings of the
2012 workshop on Management of big data systems (pp. 31-32).

• This work focuses on building an analytical model to analyse data collected from Facebook pages.

• Big Data is the ability to manage a large volume of different types of data. The characteristics of big data are listed below:
  1. Volume: the quantity of data
  2. Velocity: how rapidly the data is processed
  3. Diversity: the different types of data
  4. Value: how valuable the data is

• The benefits of using big data are:
  - Correct definition of what to measure.
  - Use of analysis for process enhancement.
  - Commitment to improve and measure again.
  - Use of a notification system for any failed process.
Literature survey
10) Ashwini, T., Sahana, L. M., Mahalakshmi, E., & Shweta, S. P. (2021). YouTube data analysis using Hadoop framework. IJEAST, 5(11), ISSN No. 2455-2143.

• YouTube is one of the most used and popular social media tools.

• The main aim of the paper is to analyse the data generated from YouTube that can be mined and utilized.

• The MapReduce algorithm can be summarized as follows: it takes a set of input key/value pairs and generates a set of output key/value pairs.

• A key-value pair (KVP) is a set of two interconnected data items.

• The class VideoRating extends the Mapper class, which takes arguments.

• A variable 'rating' is initialized that stores all the ratings in the input Youtubedata.txt dataset. This is done to convert the unstructured dataset into a structured one.

• Lastly, the key and value are written, where the key is 'rating' and the value is 'one'. This is the output of the map method.
Methodology

• The complexity of Big Data lends itself to the development of technologies like Hadoop, Hive, Pig and Sqoop in order to flexibly analyse datasets and improve the performance of the basic Hadoop framework.

• Sqoop, in particular, takes advantage of simple data structures to import/export data between an RDBMS and Hadoop.

• By acting as a go-between, Sqoop can easily move data from the RDBMS to the Hadoop file system.

• Other goals behind the establishment of Sqoop included allowing the transfer of data using simple commands that can be executed from a CLI (a minimal sketch appears at the end of this section), allowing client-side installation, and the use of Hive and HDFS for data processing.

• Sqoop has the ability to migrate data; however, it has weaknesses. For example, it becomes inefficient when the user must manually enter every migration command in the terminal.

• In order to gain a deeper understanding of the Hadoop framework and Sqoop, we installed Hadoop on a virtual machine running Ubuntu, which allowed us to conduct experiments using the tool.
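
As referenced above, a minimal sketch of the kind of simple CLI command Sqoop accepts; the connection string, username and password are placeholders assuming the local MySQL instance used later in the implementation:

# Lists the tables in the given MySQL database, confirming that Sqoop can reach the RDBMS
sqoop list-tables \
  --connect jdbc:mysql://127.0.0.1:3306/instagram \
  --username root \
  --password cloudera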
Implementation
Importing data from MySQL to HDFS
• To store data in HDFS, we make use of Apache Hive, which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). We perform the following steps:

Step-1: Log in to MySQL
Step-2: Create a database and a table, and insert data

create database instagram;
create table instagram.insta(profile_name varchar(65), followers int, following int, posts int);
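
The inserted rows appear only as a screenshot in the original slides; the statements below are hypothetical sample values in the same schema, not the actual dataset:

-- hypothetical sample rows, not the actual Instagram dataset
insert into instagram.insta values ('user_a', 125000, 310, 542);
insert into instagram.insta values ('user_b', 48210, 87, 120);
insert into instagram.insta values ('user_c', 2300000, 150, 1280);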
Step-3: View the table

select * from instagram.insta;
Step-4: Create a database and table in Hive into which the data will be imported
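
This step is shown only as a screenshot in the original slides; a minimal HiveQL sketch, assuming the Hive database and table are named to mirror the MySQL source (Sqoop's --hive-import can also create the table automatically if it does not exist):

hive> create database instagram_hive;
hive> create table instagram_hive.insta(profile_name string, followers int, following int, posts int);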
Step-5: Run the import command on Hadoop

sqoop import \
  --connect jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
  --username root \
  --password cloudera \
  --table table_name_in_mysql \
  --hive-import \
  --hive-table database_name_in_hive.table_name_in_hive \
  --m 1

• 127.0.0.1 is the localhost IP address.
• 3306 is the default MySQL port number.
• --m sets the number of mappers (here, a single mapper).
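
The abstract also mentions exporting data from HDFS back to an RDBMS; a hedged sketch of the reverse transfer, assuming the target MySQL table already exists and the files under the export directory are comma-delimited (paths and names are placeholders):

sqoop export \
  --connect jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
  --username root \
  --password cloudera \
  --table table_name_in_mysql \
  --export-dir /user/hive/warehouse/database_name_in_hive.db/table_name_in_hive \
  --input-fields-terminated-by ',' \
  --m 1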
Step-6: Check in Hive whether the data was imported successfully.
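
The check itself is shown only as a screenshot; a minimal sketch of commands that would verify the import, using the placeholder names from Step-5:

hive> use database_name_in_hive;
hive> show tables;
hive> select * from table_name_in_hive limit 5;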
NOW WORKING ON THE DATASET WITH HIVE COMMANDS

hive> select profile_name, followers from Instagram sort by followers DESC LIMIT 5;

hive> select profile_name, followers from Instagram sort by followers DESC LIMIT 1;

hive> select profile_name, followers, posts from Instagram sort by posts ASC LIMIT 5;

hive> select profile_name from Instagram WHERE posts > 100;

hive> select profile_name from Instagram WHERE followers < 200000;

hive> select profile_name from Instagram WHERE following < 100;

hive> select profile_name, posts from Instagram WHERE followers = 59557;

hive> select profile_name from Instagram WHERE following > 100;

hive> select * from Instagram WHERE profile_name = 'rahul';

REFERENCES

1) Sidhu, R., Chea, D., Dhakal, R., Hur, A., & Zhang, M. (2015). Implementation of Hadoop and
Sqoop for Big Data Organization.
2) Geng, D., Zhang, C., Xia, C., Xia, X., Liu, Q., & Fu, X. (2019). Big data-based improved data
acquisition and storage system for designing industrial data platform. IEEE Access, 7, 44574-44582.
3) Lydia, E. L., & Swarup, M. B. (2015). Big data analysis using hadoop components like flume, mapreduce, pig and hive. International Journal of Science, Engineering and Computer Technology, 5(11), 390.
4) Pol, U. R. (2016). Big data analysis: comparison of hadoop mapreduce, pig and hive. International
Journal of Innovative Research in Science, Engineering and Technology, 5(6), 9687-93.
5) Qayyum, R. (2020). A roadmap towards big data opportunities, emerging issues and hadoop as a solution. International Journal of Education and Management Engineering (IJEME), 10(4), 8-17.

6) Rodrigues, A. P., & Chiplunkar, N. N. (2018). Real-time Twitter data analysis using Hadoop ecosystem. Cogent Engineering, 5(1), 1534519.
7) Ramsingh, J., & Bhuvaneswari, V. (2015). An insight on big data analytics using pig script. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), 4(6), 2278-6856.
8) Mutasher, W. G., & Aljuboori, A. F. (2022). New and existing approaches reviewing of big data analysis with Hadoop tools. Baghdad Science Journal, 0887-0887.
9) Menon, A. (2012, September). Big data @ Facebook. In Proceedings of the 2012 workshop on Management of big data systems (pp. 31-32).
10) Ashwini, T., Sahana, L. M., Mahalakshmi, E., & Shweta, S. P. (2021). YouTube data analysis using Hadoop framework. IJEAST, 5(11), ISSN No. 2455-2143.
Any Queries?