BACHELOR OF TECHNOLOGY
By:
AAKASH JUNEJA (1319213001)
Furthermore, I would like to acknowledge with much appreciation the crucial role of the
staff of the Department of Information Technology, who gave permission to use all the
required equipment and the necessary material to complete the task. A special thanks
goes to my project guide, Mr. Akhilesh Singh, who invested his full effort in guiding me
towards the goal. I also appreciate the guidance given by the other supervisors and the
panels, especially during our project presentation; their comments and advice improved
our presentation skills.
Aakash Juneja
Abstract
This project uses the Annual Health Survey (Combined Household Information), which
contains data of all three rounds of the AHS. The survey is conducted in all EAG states,
namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand
and Uttar Pradesh, and in Assam. Despite being restricted to nine states, the AHS is the
largest demographic survey in the world and covers two and a half times the population of
the Sample Registration System. This project is based on the analysis of that data and on
report generation from it.
Table of Contents
1.0 Introduction
2.0 System Requirement
2.1 Use of Hadoop
2.2 Use of Pig
2.3 Use of Hive
2.4 Use of R
2.5 Use of RStudio
3.0 Procedure and Result
4.0 References
5.0 Bibliography
1.0 Introduction
The Annual Health Survey (AHS) contains data of all three rounds, i.e. the Baseline, First
Updating Round and Second Updating Round. The survey is conducted in all EAG states,
namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand
and Uttar Pradesh, and in Assam. During the Baseline Survey in 2010-11, a total population
of 20.1 million and 4.14 million households were covered, and during the first updating
survey in 2011-12, a population of 20.61 million and 4.28 million households were actually
covered. The second updating survey (third and final round) covered a total population of
20.94 million and 4.32 million households in 2012-13. Despite being restricted to nine
states, the AHS is the largest demographic survey in the world and covers two and a half
times the population of the Sample Registration System.
Data includes various Indicators like Whether usual Resident, Date/Month/Year of Birth,
Age, Religion, Social Group, Marital Status, Date/Month/Year of first marriage, Attending
school, not-attending school, Highest educational qualification attained, Occupation /
Activity Status during last 365 days, Whether having any form of disability, Type of
treatment for injury, Type of illness, Source of Treatment, Symptoms Pertaining to illness
persisting for more than one month, Sought medical care, Various Diagnosis, Source of
Diagnosis, Getting Regular Treatment, Person Chews/Smoke/Consume Alcohol, Status of
House, Type of Structure of the House, Ownership status of the house, Source of Drinking
water, Does the household treat the water in any way to make it safer to drink, Toilet facility,
Household with electricity, Main source of lighting, Main source of fuel used for cooking,
Number of dwelling rooms, Availability of Kitchen, Possession of
Radio/Transistor/Television/Computer/Laptop/Telephone/Mobile Phone/Washing
Machine/Refrigerator/Sewing Machine/Bicycle/Motor/Scooter/Moped/Car/Jeep/Van/
Tractor/Water Pump/Tube Well/Cart, Land Possessed, Residential Status, Covered by any
health scheme or health insurance, Status of Household.
Analysis and visualization need data extraction, cleaning and mining of the data, with the
results finally presented as a report using a reporting tool such as R or Tableau.
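As a rough sketch of the extraction step, assuming a comma-separated file and positional column picks like those used later in the Pig script, the same selection could be done in plain Python. The sample data and the WANTED indices here are illustrative only, not the actual AHS layout:

```python
import csv
import io

# Hypothetical subset of the 0-based column positions extracted in the report
# (the Pig script later pulls $7, $5, $58, ... from the household CSV).
WANTED = [7, 5]

def extract_columns(reader, wanted):
    """Keep only the wanted column positions from each CSV row."""
    for row in reader:
        yield [row[i] for i in wanted]

# Toy data standing in for the AHS household CSV.
raw = "a0,a1,a2,a3,a4,a5,a6,a7,a8\nb0,b1,b2,b3,b4,b5,b6,b7,b8\n"
rows = list(extract_columns(csv.reader(io.StringIO(raw)), WANTED))
```

The cleaned, narrower file produced this way is what a reporting tool would then consume.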
To contribute towards Digital India, I would like to analyse the AHS data from different
aspects:
2.0 System Requirement
2.1 Use of Hadoop
Apache Hadoop is an open-source software framework for distributed storage and distributed
processing of very large data sets on computer clusters built from commodity hardware. All
the modules in Hadoop are designed with a fundamental assumption that hardware failures
are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File
System (HDFS), and a processing part called MapReduce. Hadoop splits files into large
blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers
packaged code for nodes to process in parallel based on the data that needs to be processed.
This approach takes advantage of data locality— nodes manipulating the data they have
access to— to allow the dataset to be processed faster and more efficiently than it would be in
a more conventional supercomputer architecture that relies on a parallel file system where
computation and data are distributed via high-speed networking.
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing
computing resources in clusters and using them for scheduling of users' applications;
and
Hadoop MapReduce – an implementation of the MapReduce programming model for
large scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the
ecosystem, or collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix,
Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache
Oozie, Apache Storm.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command line utilities written as shell scripts. Though MapReduce Java
code is common, any programming language can be used with "Hadoop Streaming" to
implement the "map" and "reduce" parts of the user's program. Other projects in the Hadoop
ecosystem expose richer user interfaces.
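As an illustration of the Hadoop Streaming model described above (not code from this project), a word-count job pairs a mapper and a reducer that communicate through tab-separated key/value lines; the streaming jar wires them to stdin and stdout:

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum the counts for each key; Hadoop delivers pairs sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv.split("\t")[0]):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield f"{key}\t{total}"

if __name__ == "__main__":
    # In a real job each function would be its own script, submitted as e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    # Here the shuffle/sort phase is simulated with sorted().
    for out in reducer(sorted(mapper(["hadoop streaming", "hadoop"]))):
        print(out)
```

The `sorted()` call stands in for Hadoop's shuffle phase, which groups equal keys before they reach the reducer.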
2.2 Use of Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce,
Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar
to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions
(UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call
directly from the language.
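As a sketch of a Python (Jython) UDF that Pig could register, the `age_group` function and its schema below are hypothetical, not part of this project. When the file is registered with `REGISTER 'udfs.py' USING jython AS myfuncs;`, Pig's runtime injects the `outputSchema` decorator; a fallback keeps the file runnable on its own:

```python
try:
    outputSchema  # provided by Pig's Jython runtime when registered as a UDF
except NameError:
    def outputSchema(schema):
        """Standalone fallback so the module also runs outside Pig."""
        def wrap(func):
            return func
        return wrap

@outputSchema("age_group:chararray")
def age_group(age):
    """Bucket an age value the way a report column might need."""
    if age is None:
        return "unknown"
    return "child" if int(age) < 18 else "adult"
```

Inside a Pig script the function would then be called like any built-in, e.g. `myfuncs.age_group($5)`.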
2.3 Use of Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. While developed by Facebook, Apache Hive is now used
and developed by other companies such as Netflix and the Financial Industry Regulatory
Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in
Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema on read and transparently converts queries to MapReduce, Apache Tez and Spark
jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it
provides indexes, including bitmap indexes. Other features of Hive include:
Indexing to provide acceleration; index types include compaction and, as of 0.10,
bitmap indexes, with more index types planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic
checks during query execution.
Operating on compressed data stored into the Hadoop ecosystem using algorithms
including DEFLATE, BWT, snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-
mining tools. Hive supports extending the UDF set to handle use-cases not supported
by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez,
or Spark jobs.
By default, Hive stores metadata in an embedded Apache Derby database, and other
client/server databases like MySQL can optionally be used.
Four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC and
RCFILE. Apache Parquet can be read via a plugin in versions later than 0.10 and natively
starting at 0.13. Additional Hive plugins support querying of the Bitcoin blockchain.
2.4 Use of R
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but
much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full
control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display.
The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the
case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
2.5 Use of RStudio
RStudio is available in two editions: RStudio Desktop, where the program is run locally as a
regular desktop application; and RStudio Server, which allows accessing RStudio using a
web browser while it is running on a remote Linux server. Prepackaged distributions of
RStudio Desktop are available for Windows, OS X, and Linux.
RStudio is available in open source and commercial editions and runs on the desktop
(Windows, OS X, and Linux) or in a browser connected to RStudio Server or RStudio Server
Pro (Debian, Ubuntu, Red Hat Linux, CentOS).
RStudio is written in the C++ programming language and uses the Qt framework for its
graphical user interface.
Work on RStudio started around December 2010, and the first public beta version (v0.92)
was officially announced in February 2011.
3.0 Procedure and Result
x = LOAD 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradesh-firozabad.csv'
    USING PigStorage(',');
y = FOREACH x GENERATE
    $7, $5, $58, $60, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72;
3. After extracting the required columns into another relation, store it into a new
file.
$hive
6. Create a table into which to load the extracted file for analysis. Command:-
8. Now execute a query to perform the required analysis and store the result into a file in
Hive. Command:-
R Command
x = LOAD 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradesh-firozabad.csv'
    USING PigStorage(',');
y = FOREACH x GENERATE
    $7, $5, $58, $60, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72;
3. After extracting the required columns into another relation, store it into a new
file.
$hive
6. Create a table into which to load the extracted file for analysis.
8. Now execute a query to perform the required analysis and store the result into a file in
Hive:
hive>select * from
(SELECT rural,household_have_electricity HH,COUNT(psu_id),'ELECTRICITY'
Flag FROM project.allcities group by rural,household_have_electricity
UNION ALL
SELECT rural,is_radio HH,COUNT(psu_id),'RADIO' Flag FROM project.allcities
group by rural,is_radio
UNION ALL
SELECT rural,is_television HH,COUNT(psu_id),'TELEVISION' Flag FROM
project.allcities group by rural,is_television
UNION ALL
SELECT rural,
CASE WHEN is_computer = 'With Internet connection' THEN 'Yes'
WHEN is_computer = 'Without Internet connection' THEN 'Yes'
ELSE is_computer END AS HH,COUNT(psu_id),'COMPUTER' Flag FROM
project.allcities group by rural,is_computer
UNION ALL
SELECT rural,
CASE WHEN is_telephone = 'Both' THEN 'Yes' WHEN is_telephone = 'Mobile
Phone only' THEN 'Yes' WHEN is_telephone = 'Telephone only' THEN 'Yes'
ELSE is_telephone END AS HH,COUNT(psu_id),'TELEPHONE' Flag FROM
project.allcities group by rural,is_telephone
UNION ALL
SELECT rural,is_washing_machine HH,COUNT(psu_id),'WASHING_M' Flag
FROM project.allcities group by rural,is_washing_machine
UNION ALL
SELECT rural,is_refrigerator HH,COUNT(psu_id),'REFRIGERATOR' Flag FROM
project.allcities group by rural,is_refrigerator
UNION ALL
SELECT rural,is_sewing_machine HH,COUNT(psu_id),'SEWING_M' Flag FROM
project.allcities group by rural,is_sewing_machine
UNION ALL
SELECT rural,is_bicycle HH,COUNT(psu_id),'BICYCLE' Flag FROM
project.allcities group by rural,is_bicycle
UNION ALL
SELECT rural,is_scooter HH,COUNT(psu_id),'SCOOTER' Flag FROM
project.allcities group by rural,is_scooter
UNION ALL
SELECT rural,is_car HH,COUNT(psu_id),'CAR' Flag FROM project.allcities group
by rural,is_car
UNION ALL
SELECT rural,is_tractor HH,COUNT(psu_id),'TRACTOR' Flag FROM
project.allcities group by rural,is_tractor)tmp;
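The UNION ALL of GROUP BY queries above counts households per urban/rural split for each asset flag. The same pattern can be sketched on toy data with SQLite; the table and column names mimic the report's `project.allcities` schema, but the rows are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for project.allcities with just two asset columns.
conn.execute("CREATE TABLE allcities (psu_id INTEGER, rural TEXT, "
             "household_have_electricity TEXT, is_radio TEXT)")
conn.executemany("INSERT INTO allcities VALUES (?,?,?,?)", [
    (1, "Rural", "Yes", "No"),
    (2, "Rural", "Yes", "Yes"),
    (3, "Urban", "Yes", "Yes"),
])
# One GROUP BY per asset, stacked with UNION ALL and tagged with a flag,
# exactly as in the Hive query above.
rows = conn.execute("""
    SELECT rural, household_have_electricity AS hh, COUNT(psu_id), 'ELECTRICITY' AS flag
      FROM allcities GROUP BY rural, household_have_electricity
    UNION ALL
    SELECT rural, is_radio AS hh, COUNT(psu_id), 'RADIO' AS flag
      FROM allcities GROUP BY rural, is_radio
""").fetchall()
```

Each output row is one (area, answer, count, asset) cell, which is what the rural/urban comparison table below is built from.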
                Urban      Rural
TV              100.00%    50.00%
CAR             70.00%     10.00%
BIKE            40.00%     90.00%
Electricity     100.00%    100.00%
Cooking Fuel    100.00%    90.00%
Final Result of Analysis 2
Plotting Bar
# legend() is a separate call, made after barplot() has drawn the plot
barplot(m, names.arg = temp$TYPE, beside = TRUE, col = c('red', 'blue'))
legend("topright", c("Rural", "Urban"), cex = 0.75, fill = c('red', 'blue'))
C. Steps to perform the analysis of symptoms of illness by source of
drinking water
x = LOAD 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradesh-firozabad.csv'
    USING PigStorage(',');
y = FOREACH x GENERATE
    $7, $5, $58, $60, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72;
3. After extracting the required columns into another relation, store it into a new
file.
6. Create a table into which to load the extracted file for analysis.
8. Now execute a query to perform the required analysis and store the result into a file in
Hive.
hive>select maxcount.source,A.symptoms_pertaining_illness
from maxcount
LEFT OUTER JOIN a3_grouped A on (maxcount.maxx = A.total and
maxcount.source = A.drinking_water_source);
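The join above pairs each source of drinking water with the symptom whose count equals that source's maximum, i.e. the most common symptom per source. In plain Python the same aggregation might look like this; the field names and sample records are assumptions based on the query, not the real AHS data:

```python
from collections import Counter

# Toy (drinking_water_source, symptom) records standing in for a3_grouped.
records = [
    ("Hand pump", "Fever"), ("Hand pump", "Fever"), ("Hand pump", "Cough"),
    ("Piped", "Cough"),
]

def most_common_symptom(rows):
    """For each water source, pick the symptom with the highest count --
    the result the maxcount LEFT OUTER JOIN computes in Hive."""
    counts = Counter(rows)                      # (source, symptom) -> n
    best = {}
    for (source, symptom), n in counts.items():
        if source not in best or n > best[source][1]:
            best[source] = (symptom, n)
    return {source: symptom for source, (symptom, n) in best.items()}
```

Ties are broken by first occurrence here, whereas the Hive join would return every symptom that reaches the maximum count.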
R Command
analysis2 <- read.csv("/home/aakash/Desktop/final1.csv")
5.0 Bibliography
http://www.data.gov.in/