
A Smart Citizen Information System using Hadoop: a Case Study

Parkavi. A.1, Dr. N. Vetrivelan2

1 Assistant Professor, CSE Department, M.S. Ramaiah Institute of Technology, Bangalore-560054, India (parkavi.a@msrit.edu)
2 Professor, Department of Computer Applications, Periyar Maniammai University, Tanjore, Tamil Nadu, India (nvetri@yahoo.com)

Abstract – In this paper, a proposal is made for maintaining citizens' information in geo-distributed data centers located in different regions of the country, so that analytics can be performed over the citizen details to obtain statistics of citizens with respect to specific criteria. When big data analytics has to be performed over citizen data stored across the country, the execution can be optimized using the data transformation graph technique. As the citizen information contains sensitive data, the personal information can be hidden using anonymization techniques.

Keywords – Map Reduce, Datacenters, Anonymization, Citizen system

I. INTRODUCTION

In India, the information about citizens is known to and recorded in the files of the administrators of villages, towns or zones of cities. That information can be collected and stored in data centers maintained in the zonal offices of the administrators [1]. The details of the head of the family and of the family members, along with their name, age, qualification, job, salary income, medical and health details, additional earnings and participation in events (which bring popularity and a good name to the country, with proofs of the information), have to be maintained in the files of the data centers. The citizen information has to be updated at least every year after verifying in person, individually, with proofs about the families.

A. Expected benefits to the Government using geo-distributed data centers and the Hadoop framework

Nowadays governments issue many benefit schemes for the people based on the statistics collected from regional authorities about the people's needs. Even so, some essential needs of the people may escape the notice of the government.

For example, consider a family with small children whose father met with an accident and died, or is no longer in a situation to work after recovering from the accident. If the father was working in a private job, then getting a benefit from a government scheme is not possible. In such cases, analysis over such criteria has to be done by the government so that the needs of such families can be met by providing educational facilities, training and jobs to the wife or to the children who have finished their education. This kind of analysis can really help remove poverty from our nation, and all such families can be taken care of by the government. This was tedious and practically impossible before, but by using the latest technologies it is now feasible.

By using a Hadoop-based system, big data analysis over the citizens' data [2] can be performed to collect the required statistics, and based on them the government can form committees to benefit the people by issuing proper schemes [1][4].

II. MAP REDUCE FRAMEWORK ACROSS REGION-WISE DATA CENTERS IN A GEO-DISTRIBUTED WAY

Big data analytics needs applications that can create and manage bulk information in order to improve performance, monitoring and verification. The cloud resources divide the citizen files into chunks, so that they can be processed in a parallelized way using the Hadoop framework [1].

In the Hadoop framework, map-reduce operations usually use key-value pairs as intermediate data. The Hadoop framework also provides the facility of storing data nearer to the sources of analysis. This ensures that, even though the data is collected in different regions of the country, it can still be analyzed globally for statistical information by the central government [4].
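As a small illustration of this key-value style of processing, the following Hadoop Streaming-style mapper and reducer count citizens per district and qualification. This is only a sketch: the comma-separated record layout, the field positions and the district and qualification fields are assumptions made for illustration, not part of the proposed system.

#!/usr/bin/env python3
# mapper.py - Hadoop Streaming-style mapper (illustrative sketch).
# Assumes each input line is a comma-separated citizen record:
#   citizen_id,name,age,district,qualification,income
# Emits an intermediate key-value pair "district|qualification <tab> 1".
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 6:
        continue  # skip malformed records
    district, qualification = fields[3], fields[4]
    print(f"{district}|{qualification}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts for each "district|qualification" key.
# Hadoop Streaming delivers the mapper output sorted by key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")

Scripts of this kind could be run with Hadoop Streaming over the citizen files held in each regional data center, with each region producing its own partial counts.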
When necessary, the data can be replicated to increase availability. For example, the details of the revenue department staff can be replicated, both for backup and for specific analyses. One such analysis may determine how many employees have two more years of service before retirement; a sketch of such a query is given below. Based on that result, the government can offer a scheme such as voluntary retirement, so that young recruits can be brought into the revenue departments. This is one example scenario where replication speeds up the analysis and also helps to overcome failures.
and father met with accident and died or may not be in
situation to work after the recovery from the accident. If The analyzed results of each region can be collected
the father works for private job then getting benefit from to produce a single data set. So individual data centers can
government scheme is not possible. In such cases the be maintained at the regional centers to maintain the
analysis over such criteria’s has to be done by citizen files.
government to meet the needs of such families by
providing educational facilities, training and job to wife or
to children finished their education, etc., This kind of




A. Optimizing the path of mapping and reducing jobs of big data analytics

The mappers and reducers of the analysis jobs can be installed in a geo-distributed manner over the data centers of the different regions, selectively. Hence there may be many possible paths for executing those analysis jobs in a map-reduce manner. The sequence of jobs can be arranged in a hierarchical tree that represents their mapping and reducing. The data transformation graph can then be used to find the optimized path for executing the map-reduce jobs. The input files, which are divided into chunks called splits, are assigned to the mappers of the analysis job [1].

The job tracker performs the mapping, i.e., it divides the big data analytics job into tasks and assigns them to task trackers. The task trackers assign the tasks to the worker nodes of the data centers placed in the regional offices [6].

B. Geo-distributed citizen info system

Fig. 1. Clustering the data centres for data analytics. (Figure: data centre 1 through data centre k, each holding citizen data with an anonymization layer feeding the data analytics component.)

The geo-distributed citizen info system will have n regions of states in the country, and for each region one data center can be maintained. According to the analytics needed, a cluster of k data centers can be logically formed based on their locality of interest. There are various possible execution paths for a map-reduce job [1][4].

Case 1: For example, to identify the number of cancer patients in a particular state, the analytics task can be executed near or within the corresponding data center, and the local result, which is a subset of the global result, can be sent to the job tracker running in the global mapper. In this case the distributed data is copied to the center where the mapper is running [1].

Fig. 2. Big data analytics over citizen info in data centers. (Figure: a data centre holding citizen info in files, with anonymization applied before data analytics.)

Case 2: For analyzing information about national-level players, the data stored in the regional data centers can be replicated, and the analysis can then be done in the data center where the mappers are running. In this case, data sets are moved between the data centers of the mappers and reducers [1][5].

Case 3: For analyzing the incomes of families across the country, the analysis tasks can be carried out over the family details stored in all the data centers, and the final local results are sent to the mapper, which gathers them. In this case the results of the reducers are aggregated [1][4].
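A minimal sketch of the aggregation in Case 3 is shown below. The per-region result files and their tab-separated layout are hypothetical; the point is only to show how reducer outputs produced in several regional data centers could be merged into one national data set.

#!/usr/bin/env python3
# aggregate_regions.py - merges per-region reducer outputs (illustrative sketch).
# Each regional data center is assumed to produce a tab-separated file of
# "income_bracket <tab> family_count" lines; the file names below are hypothetical.
from collections import Counter

REGION_RESULT_FILES = [
    "region1_income_counts.tsv",
    "region2_income_counts.tsv",
    "region3_income_counts.tsv",
]

def aggregate(paths):
    """Sum the family counts for each income bracket across all regions."""
    totals = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                bracket, count = line.rstrip("\n").split("\t")
                totals[bracket] += int(count)
    return totals

if __name__ == "__main__":
    for bracket, count in sorted(aggregate(REGION_RESULT_FILES).items()):
        print(f"{bracket}\t{count}")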
For managing the jobs, a group manager can be used in the system. It is used to start the analysis jobs over the data centers, and the use of the data transformation graph (DTG) algorithm for finding the optimal execution path is also carried out by the group manager. The DTG can be evaluated based on the execution time of a path as well as the cost of the path. Internally, the optimal path in the DTG is found using Dijkstra's shortest path algorithm. The optimal path should always preserve optimality in the movement of data sets among the data centers: the derivatives or partitions of a data center should not be copied back to the same center, which avoids needless replication of data within the same regional data center of citizen info [1][5].
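The sketch below shows how a cheapest execution path could be computed with Dijkstra's algorithm over a small, invented data transformation graph whose nodes stand for execution states (for example, "mapped at data center 1") and whose edge weights stand for combined estimates of transfer cost and execution time. The graph, node names and weights are illustrative assumptions and are not taken from [1].

#!/usr/bin/env python3
# dtg_shortest_path.py - Dijkstra over a toy data transformation graph (sketch).
# Nodes represent execution states; edge weights are assumed combined
# estimates of transfer cost and execution time.
import heapq

DTG = {  # adjacency list: node -> list of (neighbour, weight); all values hypothetical
    "input":      [("map@dc1", 4), ("map@dc2", 6)],
    "map@dc1":    [("reduce@dc1", 3), ("reduce@dc2", 7)],
    "map@dc2":    [("reduce@dc2", 2)],
    "reduce@dc1": [("final", 1)],
    "reduce@dc2": [("final", 2)],
    "final":      [],
}

def dijkstra(graph, source, target):
    """Return (total_weight, path) of the cheapest execution path."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        weight, node, path = heapq.heappop(queue)
        if node == target:
            return weight, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, edge_weight in graph[node]:
            if neighbour not in visited:
                heapq.heappush(queue, (weight + edge_weight, neighbour, path + [neighbour]))
    return float("inf"), []

if __name__ == "__main__":
    cost, path = dijkstra(DTG, "input", "final")
    print(f"optimal execution path: {' -> '.join(path)} (estimated cost {cost})")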

The group manager of the analysis jobs can check the job managers using a heartbeat mechanism, so that the liveness of the job managers can be verified. In the same way, the task managers can be monitored by the job managers.
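A rough illustration of such heartbeat bookkeeping is given below. It is a hypothetical sketch, not Hadoop's own monitoring protocol: the group manager records the time of the last heartbeat reported by each job manager and flags managers that stay silent beyond a timeout.

#!/usr/bin/env python3
# heartbeat_monitor.py - toy heartbeat bookkeeping for job managers (sketch).
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat of each job manager and reports stale ones."""

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_seen = {}  # job manager id -> timestamp of last heartbeat

    def record_heartbeat(self, manager_id):
        self.last_seen[manager_id] = time.monotonic()

    def failed_managers(self):
        """Return the managers whose heartbeat is older than the timeout."""
        now = time.monotonic()
        return [mid for mid, seen in self.last_seen.items()
                if now - seen > self.timeout]

if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5)
    monitor.record_heartbeat("job-manager-region-1")
    monitor.record_heartbeat("job-manager-region-2")
    time.sleep(6)                       # no heartbeats arrive for a while
    monitor.record_heartbeat("job-manager-region-1")
    print("suspected failures:", monitor.failed_managers())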
III. GENERALIZATION OF DATA TO PRESERVE THE PRIVACY OF CITIZENS USING THE MAP REDUCE FRAMEWORK

Privacy preservation is a very important criterion in the cloud, because applications such as e-health records, e-finance records and e-transaction records carry highly sensitive data.

So whenever big data analytics is done over such data sets, data anonymization has to be performed first, and only then should the analysis be allowed over the data sets. Nowadays data privacy is a vital issue for individuals: nobody wants their identity to be leaked over the cloud of information. It is for this purpose that data anonymization techniques have come into the picture for preserving people's privacy. Using this mechanism, the identity of the people as well as their sensitive information is hidden [7].

For Hadoop-like frameworks, top-down anonymization techniques already exist. In this process the original data set is first divided into partitions by a mapper; the intermediate data sets are then generated by the reducers, and finally the integrated data set is produced, without the private details, by a mapper. Spatial indexing can be used to improve the efficiency of retrieving the data sets [7].
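The sketch below shows, at the level of a single record, the kind of generalization such anonymization performs: the direct identifier (name) is suppressed, and the quasi-identifiers (age and city) are generalized. The field names, the ten-year age brackets and the city-to-state mapping are assumptions for illustration and do not reproduce the two-phase top-down specialization of [7].

#!/usr/bin/env python3
# anonymize_records.py - toy generalization of citizen records (illustrative sketch).
# Direct identifiers (name) are suppressed; quasi-identifiers (age, city) are
# generalized so that statistical analysis remains possible without exposing identity.

# Hypothetical mapping from city to a coarser region used for generalization.
CITY_TO_STATE = {
    "Bangalore": "Karnataka",
    "Tanjore": "Tamil Nadu",
}

def generalize_age(age):
    """Replace an exact age with a ten-year bracket, e.g. 53 -> '50-59'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

def anonymize(record):
    """Return a copy of the record that is safe to release for analytics."""
    return {
        "age_bracket": generalize_age(record["age"]),
        "state": CITY_TO_STATE.get(record["city"], "Other"),
        "diagnosis": record["diagnosis"],   # the attribute under study is kept
    }

if __name__ == "__main__":
    raw = {"name": "A. Citizen", "age": 53, "city": "Tanjore", "diagnosis": "cancer"}
    print(anonymize(raw))   # {'age_bracket': '50-59', 'state': 'Tamil Nadu', 'diagnosis': 'cancer'}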
For example, consider an analysis about cancer in which many people from the same family have a medical history of cancer and early death in their fifties. If the government wants to perform statistical analysis over such data to improve research into medicines for these cancers, the private information about the citizens should not be revealed. In such cases the citizen's name and city information have to be anonymized. Using this anonymization of the data, the data sets can be partitioned in such a way that the sensitive personal data of the citizens is hidden, and only the data required for analytics is sent, after the anonymization process, to the analytics machines [7].

DISCUSSION AND RESULTS

The data analysis that has to be done over the citizen information can make use of the DTG mechanism. Chamikara Jayalath et al. evaluated the performance of a geo-distributed map-reduce framework in which the data transformation graph (DTG) was used to optimize the execution path. The cost and time of the optimized execution path predicted using the DTG were close to those of the real optimized execution path used by the big data analytics tasks [1][3][4][5].

The anonymization technique can be used for hiding the sensitive data of citizens before performing any statistical analysis. Xuyun Zhang et al. evaluated partitioning the sensitive data out of the citizens' data sets, after which the non-sensitive data can be supplied for data analysis using the map-reduce framework across the data centers [7][6].

CONCLUSION

Our country presently has no framework to maintain citizen information for performing statistical analysis. The authors have therefore proposed a system using the Hadoop map-reduce framework to maintain the citizen information. The system can thus provide the data sets for statistical analysis, which can be used by the government.

REFERENCES

[1] C. Jayalath, J. Stephen, and P. Eugster, "From the Cloud to the Atmosphere: Running MapReduce across Datacenters," IEEE Transactions on Computers, 27 May 2013.
[2] Hewlett-Packard Development Company, L.P., "Master Big Data to Optimize the Oil and Gas Lifecycle," Oct. 2012.
[3] H. Yu and D. Wang, "Research and Implementation of Massive Health Care Data Management and Analysis Based on Hadoop," in Proc. Fourth International Conference on Computational and Information Sciences (ICCIS), pp. 514-517, 17-19 Aug. 2012.
[4] A. B. Patel, M. Birla, and U. Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce," in Proc. Nirma University International Conference on Engineering (NUiCONE), pp. 1-5, 6-8 Dec. 2012.
[5] P. Shang, Q. Xiao, and J. Wang, "DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality," in APMRC 2012 Digest, pp. 1-8, 31 Oct.-2 Nov. 2012.
[6] I. Vilajosana, J. Llosa, B. Martinez, M. Domingo-Prieto, A. Angles, and X. Vilajosana, "Bootstrapping Smart Cities through a Self-Sustainable Model Based on Big Data Flows," IEEE Communications Magazine, vol. 51, no. 6, pp. 128-134, June 2013.
[7] X. Zhang, L. T. Yang, C. Liu, and J. Chen, "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," IEEE Transactions on Parallel and Distributed Systems, 29 May 2013.
