
A Big Data implementation based on Grid Computing

Virginia Sandulescu Ionela Halcu Giorgian Neculoiu Oana Grigoriu Mariana Marinescu Viorel Marinescu
Faculty of Hydrotechnics, Technical University of Civil Engineering of Bucharest, Bucharest, Romania

management and storage of output data from measurements, observations and/or knowledge in different fields. A few of the results are mentioned below:

1) The knowledge database attached to the Information System for the Environment, BD-SIM [1]. The database may contain structured data, but also unstructured data of mixed types (photographs, descriptions and so on).
2) The database for the Information System of UCCH (Urmarirea Comportarii Constructiilor Hidrotehnice: Monitoring the Behaviour of Hydrotechnic Buildings) [2].
3) A Computer Aided Learning (CAL) software [3]. In the education field we also had to deal with large amounts of data: a CAL system covers all the disciplines of a study program, and each discipline contains a textbook, a handbook, different exercises and supplementary files.
4) Other fields can be mentioned, such as the medical field [4] and chemistry [5].

In all these projects, our team had to deal with the problem of organizing and storing data.

II. WHAT IS BIG DATA ACTUALLY?

Nowadays, we live in an increasingly interconnected world that generates a great volume of information every day, from the log files of the users of social networks, search engines and e-mail clients to machine-generated data such as that from the real-time monitoring of sensor networks for dams or bridges and of various vehicles such as airplanes, cars or ships. According to an infographic made by Intel, 90% of the data existing today was created in the last two years, and the growth continues. It is estimated that all the global data generated from the beginning of time until 2003 represented about 5 exabytes (1 exabyte equals 1 billion gigabytes), that the amount of data generated until 2012 is 2.7 zettabytes (1 zettabyte equals 1000 exabytes), and that this amount is expected to grow three times larger by 2015 [6].
For example, the number of RFID tags sold globally is projected to rise from 12 million in 2012 to 209 billion in 2021 [6]. All this represents a great amount of data that raises challenges when it comes to acquiring,

Dan Garlasu
Core Technology Oracle Romania Bucharest, Romania

Abstract: Big Data is a term defining data that has three main characteristics. First, it involves a great volume of data. Second, the data cannot be structured into regular database tables. Third, the data is produced with great velocity and must be captured and processed rapidly. Oracle adds a fourth characteristic for this kind of data, low value density, meaning that sometimes a very big volume of data must be processed before valuable information is found. Big Data is a relatively new term that came from the need of big companies like Yahoo, Google and Facebook to analyze big amounts of unstructured data, but this need can be identified in a number of other big enterprises, as well as in the research and development field. The framework for processing Big Data consists of a number of software tools that will be presented in the paper and are briefly listed here. There is Hadoop, an open source platform that consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce and several related instruments. Two of the main problems that occur when studying Big Data are storage capacity and processing power. That is the area where Grid technologies can help. Grid computing refers to a special kind of distributed computing. A Grid computing system must contain a Computing Element (CE) and a number of Storage Elements (SE) and Worker Nodes (WN). The CE provides the connection with other Grid networks and uses a Workload Management System to dispatch jobs to the Worker Nodes. The Storage Element is in charge of storing the input and output data needed for job execution. The main purpose of this article is to present a way of processing Big Data using Grid technologies. For that, the framework for managing Big Data will be presented, along with a way to implement it around a grid architecture.

Keywords: Big Data; Grid Technology; Storage Element; Hadoop; HDFS



During the last few years, our team at the Technical University of Civil Engineering of Bucharest (TUCEB), Department of Hydrotechnics Engineering, Collective of Automation and Applied Informatics, developed research projects concerning the

organizing and analyzing it. Big Data is an umbrella term describing all the types of information mentioned above. As the name suggests, Big Data refers to a great volume of data, but this is not enough to describe the meaning of the concept. The data presents a great variety and is usually unsuitable for treatment in typical relational databases, being raw, semi-structured or unstructured. Also, the data will be processed in different ways, depending on the analysis that needs to be done or on the information that must be found in the initial data. Usually, this big amount of data is produced with great velocity and must be captured and processed quickly (as in the case of real-time monitoring). Often, the meaningful and useful information represents a small percentage of the initial big volume of data; this means that the data has a low value density.

III.

factor are configurable per file. The replica scheme is managed by the NameNode and executed by the DataNodes upon instructions from the NameNode. Unfortunately, Hadoop doesn't support automatic recovery in case of a NameNode failure, but a SecondaryNameNode can be configured (preferably on a separate machine). It doesn't take the place of the NameNode in case of a failure, meaning the DataNodes cannot connect to it, but it performs periodic checkpoints: it downloads the current NameNode image and edit log files and creates a new image that can be uploaded back to the primary NameNode. In order to prevent synchronization bugs that can lead to differences between the information on the NameNode and what really is on the DataNodes, HDFS requests block reports from the DataNodes. Also, the integrity of the data on the DataNodes is verified using a checksum on every block. If a block fails the checksum, the failing copy is deleted and replaced under the control of the NameNode. HDFS is based on the principle that "Moving Computation is Cheaper than Moving Data", meaning that it is easier to move the computation to where the data to be processed is than to move the data to where the computation is running. This is especially true when the I/O files are large [7].

IV. GRID COMPUTING

Grid computing is a special type of distributed computing. The concept was born in the early 1990s along with the SETI (Search for Extraterrestrial Intelligence) initiative [15], but it is best known through the Worldwide LHC (Large Hadron Collider) Computing Grid (WLCG), created in order to save, distribute and analyze the data generated in the LHC experiments [16]. The WLCG comprises more than 170 Grid centers in 36 countries and is used by more than 8000 researchers from all over the world. Romania participates in the WLCG project through the RO-LCG Tier-2 Federation and contributes 4,400 cores and 1.8 PB of storage capacity.
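Returning briefly to the HDFS integrity check described in Section III (a checksum on every block, with failing copies replaced), the idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Hadoop's implementation: CRC32 from Python's `zlib` stands in for the real checksum, the DataNode names and function names are hypothetical, and a corrupt copy is simply overwritten from a healthy replica instead of being re-replicated under NameNode control.

```python
import zlib
from typing import Dict, List


def checksum(block: bytes) -> int:
    """CRC32 is an assumed stand-in for the per-block checksum."""
    return zlib.crc32(block)


def find_corrupt_replicas(replicas: Dict[str, bytes], expected: int) -> List[str]:
    """Return the (hypothetical) DataNode names whose copy fails the checksum."""
    return sorted(node for node, data in replicas.items()
                  if checksum(data) != expected)


def repair(replicas: Dict[str, bytes], expected: int) -> Dict[str, bytes]:
    """Replace every failing copy with the content of a healthy replica."""
    bad = find_corrupt_replicas(replicas, expected)
    healthy = [data for node, data in replicas.items() if node not in bad]
    if not healthy:
        raise RuntimeError("no healthy replica left; block is lost")
    return {node: (healthy[0] if node in bad else data)
            for node, data in replicas.items()}


if __name__ == "__main__":
    block = b"sensor readings"
    # Three replicas, matching the default replication factor in the text.
    replicas = {"dn1": block, "dn2": b"sensor reXdings", "dn3": block}
    print(find_corrupt_replicas(replicas, checksum(block)))
```

With three replicas, a single corrupted copy can always be rebuilt from one of the other two, which is the practical benefit of the default replication factor.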
Our team made experiments using Grid technology [9] on the Grid platform HUTCB, installed in one of the laboratories of the Faculty of Hydrotechnics, TUCEB. A grid center consists of a number of servers interconnected by a high-speed network, each of which plays one or more roles. The two main benefits of computing in a Grid center are the high storage capacity and the processing power. A grid center is composed of computing elements (CE), storage elements (SE) and worker nodes (WN). Basically, the CE manages the resources of the Grid node and the jobs launched on it. The SE offers the storage and data transfer services, and the WNs are the servers that offer the processing power. There is also a User Interface (UI). At a higher level, in order to access the Grid resources, a potential user must have a certificate issued by a Virtual Organization (VO), a set of institutions or people that define the rules for accessing and using the resources in a grid center. The administrative part of a Virtual Organization comprises a Workload Management System (WMS) that keeps track of the CEs available for the users' jobs, a Virtual Organization Membership System (VOMS) and the Logical File Catalog (LFC). These parts can be shared by several Grid centers.
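The roles listed above can be made concrete with a toy model. The sketch below is only an illustration under assumptions: the class and field names are hypothetical, VO membership is reduced to a set lookup standing in for VOMS, and the WMS picks the first CE that still has a free worker node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ComputingElement:
    name: str
    free_worker_nodes: List[str] = field(default_factory=list)


@dataclass
class WorkloadManager:
    vo_members: Set[str]                       # stand-in for a VOMS lookup
    computing_elements: List[ComputingElement]

    def submit(self, user: str, job: str) -> Optional[Dict[str, str]]:
        """Return where the job runs, or None if it cannot be accepted."""
        if user not in self.vo_members:
            return None  # user is not a member of the VO, job rejected
        for ce in self.computing_elements:
            if ce.free_worker_nodes:
                wn = ce.free_worker_nodes.pop(0)  # CE hands the job to a WN
                return {"job": job, "ce": ce.name, "wn": wn}
        return None  # no CE has a free worker node right now


if __name__ == "__main__":
    wms = WorkloadManager(
        vo_members={"alice"},
        computing_elements=[ComputingElement("ce1"),
                            ComputingElement("ce2", ["wn7"])],
    )
    print(wms.submit("alice", "job-1"))
```

Keeping membership and CE bookkeeping in one shared object mirrors the remark that these administrative parts can serve several Grid centers at once.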

This type of data is impossible to handle using traditional relational database management systems. New technologies were needed, and Google found a solution by using a processing model called MapReduce. There are several solutions for handling Big Data, but the most widely used one is Hadoop, an open source project based on Google's MapReduce and Google File System. Hadoop is developed under the Apache Software Foundation. The main contributors to the project are Yahoo, Facebook, Citrix, Google, Microsoft, IBM, HP, Cloudera and many others. Hadoop is a distributed batch processing infrastructure which consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce and several related projects.

A. HDFS

At the foundation of Hadoop lies HDFS. The need for it comes from the fact that Big Data is, of course, stored on many machines. HDFS is a block-structured distributed file system designed to hold big amounts of data in a reliable, scalable and easy-to-operate way. The blocks are called chunks and their default size is 64 MB (except for the last block of each file), much bigger than the usual 4 or 8 kB block size of most block-structured file systems. HDFS has a client-server architecture composed of a NameNode and many DataNodes. The NameNode stores the metadata for the DataNodes. The metadata comprises the file names, the permissions, the replication factor and the location of the chunks of each file on the DataNodes. The NameNode keeps all metadata in memory, so it offers good speed in terms of operations per second, but the amount of metadata is therefore limited by the machine's RAM. The NameNode is also responsible for file system operations like opening, closing, moving and renaming files and folders. In addition, the NameNode keeps track of the state of the DataNodes by receiving signals called heartbeats from them.

B. The main advantages offered by Hadoop

HDFS offers great portability, being written in Java and designed to run on commodity hardware, usually on machines that run a GNU/Linux operating system. HDFS offers great reliability through the fact that all files are replicated on two or more DataNodes. The default number of replicas is three. The block size and replication

A user accesses the UI via SSH (Secure Shell) and receives a Proxy Certificate (PyC). The user then sends the job, written in Job Description Language (JDL), and the PyC to the WMS. The WMS checks the PyC and whether the resources needed for the job are available. It then sends the job and the PyC to the CE. The CE checks the authenticity of the user again and then sends the job to be processed by a WN. The WN computes the job and then sends the results to the WMS and the state of the job to the CE. The user gathers the results using the UI and can store them on the SE. Following the scientific grid described above, Oracle Corporation brought to the market a commercial implementation targeting enterprise businesses, with the following benefits: scalability, the possibility to allocate computing capacity within an organization according to the needs of various departments, and business continuity/disaster recovery (DR) features. Although in the early 2000s this concept looked quite peculiar, the Oracle Grid infrastructure for Enterprise, comprising products such as Real Application Clusters, Automatic Storage Management and Active Data Guard, eventually became a de facto option for any company aiming at business continuity and DR strategies [17]. The concept is also supported by Oracle with a commercial distribution of Linux for enterprise-class applications [18].

V. IMPLEMENT A BIG DATA PLATFORM IN A GRID CENTER

Hadoop is written mainly for data transfer within the same datacenter, whilst grid computing is mainly developed for distributing the data and the computational power between different sites, possibly in different geographical areas [10]. Working on the principle mentioned above, that it is easier to move the computation to where the data is than to move the data to where the processing is being done, it seems natural to store some of the data on the WNs in the Grid center.
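The data-locality principle invoked above can be illustrated with a small scheduling sketch. It is a simplified greedy placement written for this illustration, not Hadoop's scheduler: the function name, the node and block names, and the one-block-per-job assumption are all hypothetical.

```python
from typing import Dict, Set, Tuple


def assign_jobs(
    job_blocks: Dict[str, str],
    block_locations: Dict[str, Set[str]],
    free_slots: Dict[str, int],
) -> Tuple[Dict[str, str], float]:
    """Greedily place each job on a node that already holds its input block.

    Returns the job -> node placement and the fraction of placed jobs
    that read their input from a local disk.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict intact
    placement: Dict[str, str] = {}
    local = 0
    for job, block in job_blocks.items():
        holders = sorted(n for n in block_locations.get(block, set())
                         if slots.get(n, 0) > 0)
        if holders:
            node = holders[0]          # computation moves to the data
            local += 1
        else:
            candidates = sorted(n for n, s in slots.items() if s > 0)
            if not candidates:
                continue               # no capacity left, job stays unscheduled
            node = candidates[0]       # remote read: data moves to the node
        slots[node] -= 1
        placement[job] = node
    fraction = local / len(placement) if placement else 0.0
    return placement, fraction


if __name__ == "__main__":
    placement, fraction = assign_jobs(
        {"j1": "b1", "j2": "b2", "j3": "b1"},
        {"b1": {"wn1"}, "b2": {"wn2"}},
        {"wn1": 1, "wn2": 1, "wn3": 1},
    )
    print(placement, fraction)
```

In this toy run two of the three jobs find their block on the node they run on; replicating each block on several nodes gives the scheduler more local choices, which is one reason replication helps throughput as well as reliability.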
Hadoop's scheduler tries to collocate the jobs and the data, and it is able to schedule 70% or more of its jobs so that the input data is found on local disks [10]. How can we use Hadoop in a Grid center? The first implementations are described in [10] and [11] and took place in two Grid centers that are members of the WLCG. The tested and proven efficient architecture presents an SE that implements the Hadoop framework. It is composed of dedicated DataNodes, which run the Hadoop client, but also of DataNodes that use the storage capabilities of the WNs of the Grid center. Of course, another server is needed to do the job of the NameNode and keep track of the location of the data. HDFS doesn't implement all the POSIX interfaces, and this is supplemented by FUSE. FUSE (Filesystem in Userspace) is a kernel module for Unix-like operating systems that allows filesystems to be mounted in userspace. A FUSE mount allows the whole file system of the SE to be seen by the WN as local and also offers a POSIX-like interface. In early 2012, Oracle partnered with Cloudera to bring Apache Hadoop to its Oracle Big Data Appliance. The appliance comes with the Cloudera distribution including Apache Hadoop (CDH), along with the Cloudera Manager software [13]. The rack also comes with a copy of the Oracle NoSQL Database, built upon the proven Oracle Berkeley DB Java Edition high-availability storage engine [19]. The initial

architecture of the Oracle Big Data Appliance supported the big data requirements with 18 Sun X4270 M2 servers per rack, totaling 216 Intel cores (later increased to 288, due to the adoption of 8-core Intel processors), 864 GB of memory and 648 TB of storage. The 40 Gb/s InfiniBand fabric provides inter-node and inter-rack connectivity. Data center connectivity is covered by 10 Gb/s Ethernet. Oracle provides two other members of its Engineered Systems family to complete its Big Data solution: Exadata, aimed at organizing information in a data warehouse and running in-database analytics, and Exalytics, for high-performance analytics applications [14].

VI. CONCLUSIONS

The main advantages offered by Grid computing are the storage capabilities and the processing power. The main advantages of using Hadoop, especially HDFS, are reliability (offered by replicating all data on multiple DataNodes and by other mechanisms that protect against failure) and the scheduler's ability to collocate the jobs and the data, which offers high data throughput for the jobs processed on the grid. Adding the ease of use, ease of maintenance and scalability, combining these two technologies seems like a good choice. By implementing a Hadoop-based SE, we take advantage of the WNs' storage capabilities and of the Hadoop scheduler's ability to send jobs to where the needed data is located (when possible). The Oracle/Cloudera approach is an interesting and effective combination of Cloudera's enterprise-ready software tools and the Oracle engineered systems designed to provide high-performance and scalable data processing for Big Data.

REFERENCES
[1] M. Marinescu, V. Marinescu, C. Barbu, BD-SIM Informational System for Environmental Monitoring and Protection, ICI, Vienna, 1998
[2] V. Marinescu, B. Ciocanel, M. Marinescu, F. Ionescu, D. Stematiu, A. Popovici, Banca de modele atasata sistemului informational al activitatii de urmarire a comportarii constructiilor hidrotehnice, Buletinul Stiintific al Universitatii Tehnice de Constructii din Bucuresti, 1996
[3] M. Ciubancan, M. Marinescu, O. Grigoriu, G. Neculoiu, V. Sandulescu, I. Halcu, Computer aided learning with GRID technologies, 10th RoEduNet IEEE International Conference, STEF, pp. 222-225, 2011
[4] C. Alecu, G. Martin, R. Deac, V. Marinescu, C. Suciu, F. Ortan, M. Marinescu, Computer assisted pre- and post-surgery activities in a cardiovascular surgery clinic, 9th Mediterranean Conference on Medical and Biological Engineering and Computing, Pula, University of Zagreb Publishing, 2001
[5] M. Marinescu, I. Fagarasanu, A. Constatinescu, S. Iliescu, V. Marinescu, Use of databases technology for self-monitoring and monitoring of the activities for monitoring the state of the anthropic environment having impact on natural environment, CEEX Conference, Ed. Tehnica, Brasov, 2008
[6] Big Data Infographic: Solve your Big Data Problems?
[7] Apache Hadoop Documentation
[8] Yahoo! Hadoop Tutorial
[9] Mihai Ciubancan, HUTCB Grid Site, unpublished
[10] Garhan Attebury, Andrew Baranovski, Ken Bloom, Brian Bockelman, Doria Kcira, James Letts, Tanya Levshina, Carl Lundestedt, Terrence Martin, Will Maier, Haifeng Pi, Abhishek Rana, Igor Sfiligoi, Alexander Sim, Michael Thomas, Frank Wuerthwein, Hadoop Distributed File System for the Grid, IEEE Nuclear Science Symposium Conference Record, Orlando, FL, 2009

[11] Brian Bockelman, Using Hadoop as a Grid Storage Element, IOP Publishing, CSE Conference and Workshop Papers, 2009
[12] Yaodong Cheng, HEP Grid and Hadoop, available at, 2009
[13]
[14]
[15]
[16]

[17]
[18]
[19]
[20] Dan Garlasu, Big Data - Noua paradigma pentru inovare, competitie si productivitate, Papers presented at the Workshop no. 3 of the POSDRu/86/1.2/S/63806