University “Politehnica” of Bucharest Automatic Control and Computers Faculty, Computer Science and Engineering Department

National University of Singapore School of Computing

MASTER THESIS Distributed Code Analysis over Computer Clusters

Scientific Advisers: Prof. Khoo Siau Cheng (NUS), Prof. Nicolae Țăpuș (UPB)

Author: Călin-Andrei Burloiu

Bucharest, 2012


I would like to thank my parents and my brother for their care and support. I would also like to thank Professor Nicolae Țăpuș for offering me the opportunity to do an internship at the National University of Singapore, where I had the chance to contribute to this interesting and promising project. Many thanks to Professor Khoo Siau Cheng for his involvement in the project and for guiding my work in the right direction.

This master thesis marks the first steps towards building an Internet-scale source code search engine, forked from the Sourcerer infrastructure [4]. The first part of the work is an in-depth analysis of the appropriateness of using a Hadoop stack for scaling up Sourcerer. The second part describes the design and implementation of the storage layer for the code analysis engine of the system, using HBase, a distributed database for Hadoop. The third part is an implementation over Hadoop MapReduce of an algorithm named Generalized CodeRank, which scores code entities by their popularity as an extended application of Google’s PageRank. As far as we know, this approach is unique because it considers all entities during the calculation, not only subsets of particular types. The results show that Generalized CodeRank gives relevant results even though all entity types are used for the computation.


Contents

Acknowledgements
Abstract
1 Introduction
1.1 Overview
1.2 Sourcerer: A Code Search and Analysis Infrastructure
2 The Choice for Cluster Computing Technologies
2.1 Motivation
2.2 MapReduce and Hadoop
2.2.1 MapReduce Algorithm
2.2.2 Storage
2.3 HDFS
2.3.1 Fault Tolerance
2.3.2 Data Sharding
2.3.3 High Throughput for Sequential Data Access
2.4 NoSQL
2.4.1 Context
2.4.2 Reasons
2.5 HBase
2.5.1 Data Structuring
2.5.2 Node Roles and Data Distribution
2.5.3 Data Mutations
2.5.4 Data Retrieval
2.5.5 House Keeping
2.6 The Reasons for the Chosen Technologies
2.6.1 Why Hadoop?
2.6.2 ACID
2.6.3 CAP Theorem and PACELC
2.6.4 SQL vs. NoSQL
2.7 Summary
3 Database Schema Design and Querying
3.1 Database Purpose
3.2 Design Principles
3.3 Projects Data
3.3.1 Former SQL Database
3.3.2 Functional Requirements
3.3.3 Schema Design
3.3.4 Querying
3.4 Files Data
3.4.1 Former SQL Database
3.4.2 Functional Requirements
3.4.3 Schema Design
3.4.4 Querying
3.5 Entities Data
3.5.1 Former SQL Database
3.5.2 Functional Requirements
3.5.3 Schema Design
3.5.4 Querying
3.6 Relations Data
3.6.1 Former SQL Database
3.6.2 Functional Requirements
3.6.3 Schema Design
3.6.4 Querying
3.7 Dangling Entities Cache
3.8 Summary
4 Generalized CodeRank
4.1 Reputation, PageRank and CodeRank
4.1.1 PageRank
4.1.2 The Random Web Surfer Behavior
4.1.3 CodeRank
4.1.4 The Random Code Surfer Behavior
4.2 Mathematical Model
4.2.1 CodeRank Basic Formula
4.2.2 CodeRank Matrix Representation
4.3 Computing Generalized CodeRank with MapReduce
4.3.1 Storing Data in HBase
4.3.2 Hadoop Jobs
4.4 Experiments
4.4.1 Setup
4.4.2 Convergence
4.4.3 Probability Distribution
4.4.4 Entities CodeRank Top
4.4.5 Performance Results
5 Implementation
5.1 Database Implementation
5.1.1 Data Modeling
5.1.2 Database Retrieval Queries API
5.1.3 Database Insertion Queries API
5.1.4 Indexing Data from Database
5.2 CodeRank Implementation
5.2.1 CodeRank and Metrics Calculation Jobs
5.2.2 Utility Jobs
5.3 Database Querying Tools
5.4 Database Utility Tools
5.5 Database Indexing Tools
5.6 CodeRank Tools
6 Conclusions
6.1 Summary
6.2 Future Work
6.3 Related Work
A Model Types
B Top 100 Entities CodeRank

List of Figures

1.1 A Java code graph example
1.2 Sourcerer system architecture (as it appears in [4])
4.1 CodeRank Example
4.2 Variation of Euclidean distance with the growth of iteration illustrating the convergence of Generalized CodeRank algorithm
4.3 Probability distribution represented by CodeRanks vector
4.4 log-log plot for CodeRanks distribution and a power law distribution
4.5 Left: Top 10 Entities CodeRank chart; Right: Distribution of Top 10 Entities CodeRanks within the whole set of entities

List of Tables

3.1 Columns for projects MySQL table
3.2 projects HBase Table
3.3 Columns for files MySQL table
3.4 files HBase Table
3.5 Columns for entities MySQL table
3.6 entities_hash HBase Table
3.7 entities HBase Table
3.8 Columns for relations MySQL table
3.9 relations_hash HBase Table
3.10 relations_direct HBase Table
3.11 relations_inverse HBase Table
4.1 Top 10 Entities CodeRank
4.2 Experiments and jobs running time
5.1 Common CLI arguments for Hadoop tools (CodeRank and Database indexing tools)
5.2 Common CLI arguments for CodeRankCalculator tool
5.3 Common CLI arguments for CodeRankUtil tool
A.1 Project Types
A.2 File Types
A.3 Entity Types
A.4 Relation Types
A.5 Relation Classes
B.1 Top 100 Entities CodeRank (No. 1-33)
B.2 Top 100 Entities CodeRank (No. 34-67)
B.3 Top 100 Entities CodeRank (No. 68-100)


Chapter 1

Introduction

Overview

This work marks the first steps of the project “Semantic-based Code Search”, initiated at the National University of Singapore, School of Computing, by Professor Khoo Siau Cheng. Its main objective is to build an Internet-scale search engine for source code. The implementation is a fork of Sourcerer [11], a code search platform developed at the University of California, Irvine. My work concentrates on laying the foundation of a code search and analysis platform by scaling Sourcerer up to Internet scale.

My contribution can be divided into three parts. The first part is an in-depth analysis of the appropriateness of using a Hadoop stack for scaling up Sourcerer. The second describes the design and implementation of the storage layer for the code analysis engine, using HBase, a distributed database for Hadoop. The third part is an implementation over Hadoop MapReduce of an algorithm for ranking code entities by their popularity, as an extended application of Google’s PageRank.

In recent years we have witnessed a growth in the popularity of open-source and free software. The Web may have had a big impact on this by allowing everyone to share content with others. To make this more accessible to people, web hosting services offered access to free technologies like the LAMP (Linux, Apache, MySQL, PHP) stack. Having a lot of users, the open-source communities became motivated to develop better and better software. The expansion of open-source software also had an impact on the IT business. Many companies, such as Google, Yahoo!, Cloudera, DataStax and MapR, finance open-source projects in exchange for paid support and premium services. On the other side of the business field, more and more companies are starting to adopt open-source projects, technologies and libraries into their products, not only because they are free, but also because of the big communities around them, which are able to offer good support.

The open-source movement has also changed the way software architects and developers work. They need to reserve a lot of time to find libraries or technologies capable of accomplishing a particular task, or to figure out how to use them. In this context a search engine provides a good starting point for finding pieces of code, documentation and code examples, or for downloading the source code. During development with open-source technologies, programmers often encounter issues, and searching the Web for similar problems in order to find code snippets, code examples and solutions is often part of the development phase.

Currently Google dominates the market of mainstream search engines [53]. Developers often use this kind of search engine to download code, to find code snippets and examples and to solve their issues, although such engines are not adapted for this purpose. For better results a dedicated source code search engine would be more appropriate. But searching for code that is capable of accomplishing a particular task is not easy. Others have tried to implement code search engines, like the commercial solutions Koders [48], Krugle [49], Codase [14] and Google Code Search. The fact that none of them is very popular and that users prefer mainstream search engines suggests that state-of-the-art code search does not satisfy users’ needs. Google Code Search was shut down in 2012, most likely because of the unsolved challenges in this field.



When we started working on the “Semantic-based Code Search” project we wanted to incorporate basic state-of-the-art code search techniques without reinventing the wheel. So we searched for an open-source code search platform that we could extend by building our own algorithms on top of it. After analyzing multiple alternatives we settled on Sourcerer [4]. One of the first reasons for choosing it was the fact that, like our system, it aimed at an Internet-scale search engine. Secondly, it had the basic information retrieval techniques already implemented. Most importantly, it included a MySQL [61] database with information extracted from code, which can be used as a basis for many code analysis tools and algorithms. Sourcerer only handles Java code.

We named our Sourcerer fork [11] Distributed Sourcerer because it runs on a cluster of computers, a set of tightly coupled commodity hardware computers which work together on a common task. After getting deep into Sourcerer by studying its architecture and source code, as well as by testing it, we soon realized that it would not scale to Internet size as its authors aimed, as will be discussed in the next section. The basic problem was that each of its components could only run on a single machine. So I started to redesign it to run on a computer cluster.

My area of investigation is the code analysis part of the Sourcerer infrastructure, which provides algorithms vital to a code search engine, such as code indexing techniques, ranking of code entities, code clone detection and code clustering. The code analysis field investigates the structure and the semantics of program code, and it is part of the program analysis field, which focuses on behavior. Code analysis algorithms can be classified as compile-based and non-compile-based, depending on whether they need to compile the source code before performing the analysis. Non-compile-based algorithms are generally faster because they perform static code analysis, and they can cope with code that contains errors and therefore does not compile. The main disadvantage of these algorithms is that they cannot analyze dynamically loaded code entities, like Java classes loaded at runtime. Compile-based code analysis algorithms do not have this disadvantage, but are generally slower.

Figure 1.1: A Java code graph example

Code analysis usually deals with code graphs, like the one presented in Figure 1.1. Their nodes are code entities such as classes, methods, primitives and variables, and their edges are relations such as “returns” in “method returns primitive” (see Figure 1.1). When a code graph contains only method nodes and “calls” relations, it is called a call graph. In a similar way, graphs that capture only the class inheritance hierarchy or class connections can be constructed. More details about entities and relations can be found in Section 3.1. A full list of all entity types and relation types, as well as an explanation of them, can be found in Appendix A.

In large-scale systems, data plays an important role because of its size and the difficulty of accessing it. A recent solution for large-scale data processing is the Hadoop platform [24], an open-source implementation of Google’s MapReduce programming paradigm [15]. Chapter 2 presents our investigation into using this platform, as well as the reasons we decided to port the MySQL [61] database to a Hadoop-compliant database named HBase [25]. The schema design of the new database, the reasons behind it and the techniques used to implement queries on it are explained in Chapter 3.

After porting the storage layer of the system to HBase, I implemented a ranking algorithm for the search engine (see Chapter 4), named Generalized CodeRank, which is a PageRank [63] adaptation for code analysis used for calculating the popularity of entities. Other state-of-the-art works have implemented CodeRank before [64][51][55], but our approach differs from theirs in that we applied the algorithm to all entities and relations from the database. Portfolio [55] only applies it to C functions and their call relations. Puppin et al. [64] apply it only to classes. As far as we know, an older version of Sourcerer [51] applied it only to several types of entities, but not simultaneously.
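As a rough illustration of the code graph and call graph definitions above, the following Python sketch stores a code graph as a list of typed edges and derives a call graph by keeping only method nodes and “calls” relations. The entity names, the relation names other than “returns” and “calls”, and the tuple layout are hypothetical and are not taken from Sourcerer.

```python
# Minimal sketch of a code graph; all entity names are made up for illustration.
entities = {
    "Account": "class",
    "Account.getBalance": "method",
    "Account.audit": "method",
    "Account.balance": "variable",
    "int": "primitive",
}

# Each relation is (source entity, relation type, target entity).
# "inside" and "reads" are assumed relation names used only in this sketch.
relations = [
    ("Account.getBalance", "inside", "Account"),
    ("Account.audit", "inside", "Account"),
    ("Account.getBalance", "returns", "int"),
    ("Account.audit", "calls", "Account.getBalance"),
    ("Account.getBalance", "reads", "Account.balance"),
]


def call_graph(entities, relations):
    """Keep only 'calls' edges whose endpoints are both methods."""
    return [
        (src, dst)
        for src, rel, dst in relations
        if rel == "calls"
        and entities[src] == "method"
        and entities[dst] == "method"
    ]


print(call_graph(entities, relations))
```

The same filtering pattern, with a different relation type and entity type, would yield an inheritance graph or a class connection graph.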


Sourcerer: A Code Search and Analysis Infrastructure

Developed at the University of California, Irvine, mostly by S. Bajracharya, J. Ossher and C. Lopes, Sourcerer is a code search infrastructure which aims to grow to Internet scale.

Figure 1.2: Sourcerer system architecture (as it appears in [4])

A crawler downloads source code found on the Internet in various repositories and stores the data in three different forms:

1. In the Managed Repository (referred to from now on as the repository), which keeps the original code and additional libraries.
2. In the Code Database, named SourcererDB [62], which stores data as a metamodel obtained by parsing the code from the repository (details in Chapter 3).
3. In the Code Index, which is an inverted index of keywords extracted from the code.

The system architecture is illustrated in Figure 1.2. At its core, Sourcerer applies basic information retrieval techniques to index tokenized source code into the Code Index, implemented with Apache Lucene [26], an open-source search engine library. To hide the complexity of Lucene, a higher-level technology is used: Apache Solr [29], a search server. In 2010, the Sourcerer team published a paper [5] which proposed an innovative way to efficiently retrieve code examples. Their technique associates keywords with source code entities that have similar API usage. This similarity is obtained from the Code Database, and the associated keywords are stored in the Code Index.

The Code Database is a relational MySQL [61] database which stores metamodels of projects, files, entities and relations in tables. More about this can be found in Chapter 3.

The data obtained by the crawler from the Web is first stored in the Managed Repository. The Code Database and the Code Index are populated by the extractor, which is implemented as a headless Eclipse [37] plugin. This component parses the source code and obtains data about code entities and code relations. An older version of Sourcerer [51] implemented CodeRank, the PageRank-like algorithm for ranking code entities, but the current version no longer implements it.

Chapter 2 discusses the limitations of Sourcerer with respect to scalability and proposes replacing the database with a distributed one called SourcererDDB (Sourcerer Distributed Database). Chapter 3 describes in detail the schema design of the new database, and Chapter 4 presents the Generalized CodeRank algorithm, which runs on SourcererDDB’s data.
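To make the notion of an inverted index of keywords more concrete, here is a minimal Python sketch. It only illustrates the general idea and is not how Lucene or Sourcerer store their index; the tokenizer, the function names and the file names are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(source):
    """Split source text into lowercase identifier-like tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)]


def build_index(files):
    """Map each token to the sorted list of files that contain it."""
    index = defaultdict(set)
    for name, source in files.items():
        for token in tokenize(source):
            index[token].add(name)
    return {token: sorted(names) for token, names in index.items()}


# Two tiny hypothetical source files.
files = {
    "Stack.java": "class Stack { void push(int x) {} }",
    "Queue.java": "class Queue { void enqueue(int x) {} }",
}
index = build_index(files)
print(index["push"])   # files containing the keyword "push"
print(index["class"])  # files containing the keyword "class"
```

A keyword lookup is then a single dictionary access, which is what makes an inverted index suitable for interactive search.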

Chapter 2

The Choice for Cluster Computing Technologies
This chapter presents the cluster computing technologies used for scaling up Sourcerer and the reasons they were chosen over the alternatives.



Motivation

Section 1.2 described Sourcerer, the open-source project from which our implementation started. Although its goal of building a large-scale system matches ours, we soon realized its limitations regarding scalability:

1. SourcererDB (the Sourcerer database), which uses MySQL, showed poor performance for repositories of hundreds of gigabytes and for the difficult queries required by some applications.
2. The extractor, implemented as an Eclipse [37] headless plugin, can only run on a single machine, and parsing hundreds of gigabytes takes days.
3. The real-time code search application uses Apache Solr [29], a search server based on the Apache Lucene [26] search library. Solr runs in single-machine mode, so it is not capable of scaling. There is a multi-machine version, called Distributed Solr, but it currently lacks some of the Solr features. Other Lucene-based search servers like ElasticSearch [18] should be investigated in the future.

Our basic idea is to port Sourcerer to technologies capable of running in a distributed manner on a computer cluster. This master thesis deals only with the first point above, by investigating solutions to scale the database and by designing and implementing a new distributed database called SourcererDDB (Sourcerer Distributed Database), capable of scaling to thousands of machines and of dealing with petabytes of data. The database schema design for storing code data is presented in Chapter 3, and an algorithm implemented on top of it, Generalized CodeRank, is presented in Chapter 4.

The next sections present the technologies we chose for running our data on a computer cluster. We are using HBase [25], a large-scale distributed database, and Hadoop [24], a platform for running distributed computations based on the MapReduce programming model [15]. These technologies are capable of scaling linearly and horizontally on commodity hardware.





MapReduce and Hadoop

Hadoop [24] is an open-source framework for running distributed applications, developed by the Apache Software Foundation [28]. Its implementation is based on two Google papers, one about MapReduce [15] and the other about the Google File System [39]. For the latter, the corresponding Hadoop implementation is named HDFS and is described in Section 2.3, while the former is detailed in this section. At the time of writing this thesis, Hadoop is very popular and widely used in industry by important companies such as Facebook, Yahoo!, Twitter, IBM and Amazon [35]. For example, a news article states that Facebook has a 100 GiB Hadoop cluster [41].

MapReduce [15] is a programming model and framework which can be used to solve embarrassingly parallel problems involving very large amounts of data distributed across a cluster of computers.


MapReduce Algorithm

The MapReduce problem-solving model involves two main functions, map and reduce, inspired by functional programming. The execution of map functions is supervised by Map tasks and the execution of reduce functions by Reduce tasks. From a distributed system point of view, there is a master node which coordinates jobs and multiple slave nodes which execute tasks. The master is responsible for job scheduling: it assigns tasks to slave nodes, coordinates the slaves’ activity and monitors them. The domain and range of the map and reduce functions are values structured as (key, value) pairs [15][72].

In Hadoop, the input data is passed to an InputFormat class implementation which splits the data into multiple parts. The master assigns a split to each slave, which uses a RecordReader to parse the input split into input (key, value) pairs. During the map stage, each map function receives a pair and, by processing it, outputs a set of intermediate (key, value) pairs. The map stage is finished after all Map tasks across the whole cluster have completed. During the sort and shuffle stage, the master schedules the allocation of intermediate (key, value) pairs to Reduce tasks. During the reduce stage, each reduce function receives as input a set of intermediate (key, value) pairs having the same key and, through processing, outputs a set of output (key, value) pairs [15][72].
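The three stages described above can be sketched in a few lines of single-process Python. This is only a toy model of the data flow, not Hadoop's API: the `run_job` driver, the word-count `map_fn` and `reduce_fn`, and the sample records are all assumptions introduced for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit (word, 1) for every word in one input line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: sum all counts that share the same key."""
    yield key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Toy, single-process stand-in for a MapReduce job."""
    # Map stage: every input pair yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and sort stage: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce stage: one reduce call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Input (key, value) pairs: here (line number, line text).
records = [(0, "map reduce map"), (1, "reduce map")]
result = run_job(records, map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data over the network, but the contract between the stages is the same.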



Storage

Hadoop stores input and output data in a distributed file system or in a distributed database [72]. It comes with its own distributed file system implementation, HDFS, but other implementations can be used. The most common distributed database used with Hadoop is Apache HBase [25], but other solutions like Apache Cassandra [22] can be used as well. The resources shared by the cluster nodes are stored in the distributed file system. Hadoop achieves data locality by trying to allocate tasks to the same nodes where the data is located or, if that is not possible, to nodes in the same rack.



HDFS

The Hadoop framework comes with HDFS, a distributed file system which offers the following features:

1. Fault tolerance in case of node failures
2. Data sharding across the cluster
3. High throughput for streaming data access

HDFS is based on the Google File System paper [39] and exposes to the client an API that provides access to the file system with UNIX-like semantics.


Fault Tolerance

Fault tolerance guarantees that in case of a node failure all files continue to be available and the system continues to work. This is provided through block replication. Each file is made of a set of blocks, and each block is replicated by default on two other nodes, one in the same rack and the other in another rack. This ensures that in case of a node failure the in-rack replica can be used without losing data locality. In case of a rack failure the replica from the other rack is used to serve data. Data replication is done automatically and is transparent to the client [34].
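The replication scheme described above can be illustrated with a small Python sketch. It follows the description in this section only (one extra replica in the writer's rack, one in a different rack); the actual HDFS placement policy depends on the Hadoop version and chooses nodes differently. The function names, node names and rack names are hypothetical.

```python
def place_replicas(writer, topology):
    """Return the nodes holding a block written by `writer`.

    `topology` maps rack name -> list of node names. One extra replica goes
    to another node in the writer's rack and one to a node in another rack.
    The choice of nodes is deterministic here only to keep the sketch simple.
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    same_rack_node = next(n for n in topology[local_rack] if n != writer)
    remote_rack = next(r for r in sorted(topology) if r != local_rack)
    other_rack_node = topology[remote_rack][0]
    return [writer, same_rack_node, other_rack_node]


def readable_replicas(replicas, failed_nodes):
    """Replicas that can still serve the block after some nodes failed."""
    return [n for n in replicas if n not in failed_nodes]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)

# A single node failure leaves an in-rack replica available.
after_node_failure = readable_replicas(replicas, {"n1"})
# A failure of the whole first rack leaves the replica in the other rack.
after_rack_failure = readable_replicas(replicas, set(topology["rack1"]))
print(replicas, after_node_failure, after_rack_failure)
```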


Data Sharding

Files are automatically sharded across cluster nodes without the user’s intervention. DataNodes store the blocks, and a master node, named NameNode, keeps track of where each block is located. The client only talks to the master to find out which DataNodes store the desired blocks. The NameNode is not susceptible to overloading, because transferring blocks between clients and DataNodes does not involve the master and block locations are cached by the client. New blocks are written in a daisy chain, i.e., while a DataNode receives data (from a client or another DataNode) and writes it to disk, it also sends the data in a pipeline to the next replica [34][39][72].
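The division of labor between the NameNode and the DataNodes can be sketched as follows. This is a toy model under stated assumptions: the `NameNode` class, its methods, the round-robin assignment of blocks to DataNodes and the 64 MiB block size are choices made for this example and do not reproduce the HDFS implementation.

```python
import math

BLOCK_SIZE = 64 * 2**20  # assumed block size of 64 MiB


class NameNode:
    """Keeps only metadata: which DataNode stores each block of each file."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}  # path -> list of (block index, datanode)
        self._next = 0       # round-robin cursor (an assumption of this sketch)

    def add_file(self, path, size_bytes):
        """Split a file into blocks and record where each block is stored."""
        n_blocks = max(1, math.ceil(size_bytes / BLOCK_SIZE))
        blocks = []
        for i in range(n_blocks):
            node = self.datanodes[self._next % len(self.datanodes)]
            self._next += 1
            blocks.append((i, node))
        self.block_map[path] = blocks

    def locate(self, path):
        """Answer a client's metadata query; block data never passes through here."""
        return list(self.block_map[path])


namenode = NameNode(["dn1", "dn2", "dn3"])
namenode.add_file("/repo/a.jar", 200 * 2**20)  # a 200 MiB file needs 4 blocks
locations = namenode.locate("/repo/a.jar")
print(locations)
```

After the lookup, a client would read each block directly from the listed DataNode, which is why the master handles only small metadata requests.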


High Throughput for Sequential Data Access

When HDFS was designed, besides the need to create a distributed system that offers a consistent view of the files to all cluster nodes, the goal was to transfer data from commodity hard disks at a higher speed than traditional Linux file systems do. HDFS provides high throughput for sequential data access. It is known that the biggest bottleneck of a hard disk is the disk seek, not the transfer rate. Using larger data blocks reduces the number of disk seeks and improves throughput, but increases access latency. For an average user this is not acceptable, because small files are frequently accessed and a high latency on each file affects the user experience. But in the case of MapReduce, which aims at processing large amounts of data, high throughput is a must, and high latencies are not a concern if only big files are used. HDFS data blocks typically have 64 or 128 MiB. For the best performance, files should be larger than the block size. By using large block sizes, data is read from disk at the transfer rate, not at the seek rate [34][39].
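A back-of-the-envelope calculation shows why large blocks let the disk work close to its transfer rate. The seek time (10 ms) and transfer rate (100 MB/s) below are assumed round figures for a commodity hard disk, not measurements from this work, and the model charges exactly one seek per block.

```python
SEEK_TIME_S = 0.010        # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6  # assumed sustained transfer rate: 100 MB/s


def effective_throughput(block_size_bytes):
    """Bytes per second when each block costs one seek plus its transfer."""
    transfer_time = block_size_bytes / TRANSFER_RATE_BPS
    return block_size_bytes / (SEEK_TIME_S + transfer_time)


for label, size in [("4 KiB", 4 * 1024),
                    ("64 MiB", 64 * 2**20),
                    ("128 MiB", 128 * 2**20)]:
    mbps = effective_throughput(size) / 1e6
    print(f"{label:>8}: {mbps:6.1f} MB/s")
```

Under these assumptions a 4 KiB block spends almost all of its time seeking, while a 64 MiB block spends well under 2% of its time on the seek, so reads proceed at nearly the full transfer rate.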



NoSQL

NoSQL (No SQL or Not Only SQL) is an emerging database category which proposes different approaches to storing data than the ubiquitous relational database management systems (RDBMS) based on SQL. It has been stated that there are so many differences between NoSQL databases that they were grouped together based on what they don’t have in common [54]. Usually the main differences from RDBMS are the following [68][6]:

1. They do not have a relational model, so there is no SQL support.
2. They are usually column-based or offer a different way of structuring data.
3. They sacrifice some guarantees such as ACID (Atomicity, Consistency, Isolation, Durability). For details about what ACID means see Section 2.6.



NoSQL databases appeared in the context of the Web expansion, which required much more data to be stored and many more users to access it. Web applications that need to store large data sets, also known as big data [73], started to appear, and the scalability limits of SQL-based databases started to show. Big Web sites like Facebook, Twitter, Reddit and Digg started to experience problems with their SQL databases and as a consequence they began to look for alternatives. Databases like MongoDB [1], CouchDB [23], HBase and Cassandra [22] come with different approaches to structuring data and with different guarantees.[68][6]



There are three reasons why one would choose a NoSQL solution instead of SQL. The most common reason is the need for more scalability, which is typically obtained through distributed computing. RDBMS offer a lot of guarantees, like strong consistency and durability (see Section 2.6), which have a big overhead in a distributed environment or simply make scaling difficult and expensive. Running computation in a distributed system also creates availability problems. Web applications have availability requirements, because if a site goes down its owner may lose a lot of money and clients. Ensuring scalability, high availability and low latency comes at the expense of consistency, which is usually weakened. Users don’t typically care if they don’t see the latest version of a post as long as they can still access the site to some extent. For applications where consistency is important, like banking, SQL databases are still the best choice.[68][6]

The second reason for choosing NoSQL over SQL is the availability requirements. This reason is strongly linked with the first one. At large scale the database runs on a cluster of computers. For commodity hardware, failures are the norm, not the exception, and as Section 2.6 will show, SQL databases cannot meet these availability guarantees at large scale.

The third reason why one would choose NoSQL is the situation in which the RDBMS way of structuring data and the relational model do not fit the needs [68][6]. NoSQL databases are usually column-based and scale horizontally, as opposed to SQL, which scales vertically. In SQL the schema is fixed, but databases like HBase, Cassandra and MongoDB support an arbitrary number of columns with arbitrary names on a row. Others, like Amazon Dynamo [16], Amazon S3 [52] and memcached [56], are based on key-values and are optimized for fast retrieval by key. Neo4j [57] is well suited for graph-structured data like maps.



Storing data in files is not always advantageous, and there are many scenarios in which a database, which offers more structured data access, is more suitable. With this in mind HBase was developed, a NoSQL database based on Google’s BigTable paper [13]. Nicknamed “The Hadoop Database”, it integrates well with MapReduce and is able to scale to billions of rows and millions of columns. It is used successfully in the industry by many important companies such as Facebook, Twitter, Yahoo! and Adobe [33][9][3]. Facebook uses it to store all user private messages and chats [9][3] and has a 100-petabyte Hadoop cluster [41].




Data Structuring

As in RDBMS, data is organized in tables, rows and columns. The difference is that any number of columns with any names can be stored on each row. The columns of a row are independent of the columns of another row; for instance, if row a has columns x, y and z, row b may have columns m, n and o (without having any column of a). The column names are called column qualifiers. A table can have one or more column families, which group columns together. In order to identify a column, both the column family and the column qualifier need to be given; this pair is called the column key.

HBase stores data internally as key-value pairs having three-dimensional keys with the coordinates row key, column key and timestamp. The latter coordinate is a way to store multiple versions of a table value, based on the number of milliseconds since the Epoch. By default, when a new value is inserted the current time is used as the timestamp, but a custom value can be used as well. The key-value pairs are sorted first by row key, then by column key and then in decreasing order by timestamp, such that the newest versions are retrieved first.[38][25]

When a new table is created only the column families and the configuration parameters for each one need to be given, because new rows and columns can be freely created when performing inserts.
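The sort order described above can be sketched in a few lines. This is a simplified in-memory model and not HBase code; the KeyValue tuple, the sort_key and latest functions and the sample cells are hypothetical names and data introduced only for illustration.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    row: bytes        # row key
    column: bytes     # column key, written here as b"family:qualifier"
    timestamp: int    # milliseconds since the Epoch
    value: bytes


def sort_key(kv):
    # Ascending by row key and column key, descending by timestamp,
    # so that the newest version of a cell comes first.
    return (kv.row, kv.column, -kv.timestamp)


cells = [
    KeyValue(b"row-b", b"default:m", 1000, b"m1"),
    KeyValue(b"row-a", b"default:x", 1000, b"old"),
    KeyValue(b"row-a", b"default:x", 2000, b"new"),
    KeyValue(b"row-a", b"default:y", 1500, b"y1"),
]
ordered = sorted(cells, key=sort_key)


def latest(ordered_cells, row, column):
    """Return the newest value of a cell: the first match in sorted order."""
    for kv in ordered_cells:
        if kv.row == row and kv.column == column:
            return kv.value
    return None
```

Because versions are stored newest first, reading the latest value of a cell only needs the first matching key-value.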


Node Roles and Data Distribution

The data range of each table, consisting of key-value pairs, is split into regions which can be stored on different nodes in the cluster, known as region servers. Clients talk directly to these servers, which are responsible for serving data reads and writes for the regions they are assigned to. If a region grows beyond a threshold because of new data, the region is split and a new region is assigned to a different region server, so HBase scales automatically.[38][25]

HBase relies on HDFS for data persistence, replication and availability. Because region servers serve data reads and writes, data locality is achieved: HDFS first writes data locally and then updates the replicas on other nodes in a daisy chain. In case of a region server failure, its regions are assigned to other region servers and data locality is temporarily lost. However, compactions, described in Subsection 2.5.5, reestablish data locality after a while.

A master node is responsible for region assignment. To do this, it uses ZooKeeper [44], a distributed, highly available, reliable and persistent coordination and configuration service. ZooKeeper is also necessary for client bootstrap, because it stores the contact information needed to reach the catalog regions, which are able to tell on which region server a row key is stored. Clients cache the mappings from regions to region servers for efficient future requests.[38][25]
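The way a client can map a row key to a region server, and how a split changes this mapping, can be illustrated with a sorted list of region start keys. This is a minimal sketch and not the HBase client implementation; the RegionMap class, its methods and the server names are hypothetical.

```python
import bisect


class RegionMap:
    """Hypothetical client-side cache mapping region start keys to servers."""

    def __init__(self):
        self.start_keys = [b""]        # the first region starts at the empty key
        self.servers = ["server-1"]

    def locate(self, row_key):
        # The responsible region is the last one whose start key <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.servers[index]

    def split(self, split_key, new_server):
        # The upper half of the split region becomes a new region that is
        # assigned to another server.
        index = bisect.bisect_right(self.start_keys, split_key)
        self.start_keys.insert(index, split_key)
        self.servers.insert(index, new_server)


regions = RegionMap()
regions.split(b"m", "server-2")
print(regions.locate(b"apple"), regions.locate(b"zebra"))
```

Since regions cover contiguous, sorted key ranges, a binary search over the cached start keys is enough to find the server for any row key.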


Data Mutations

Data mutations are insertions or updates to the database. HBase keeps some key-value pairs in memory in the MemStore. In order to guarantee durability, mutations received from the client are first persisted to a log in HDFS, named the write-ahead log (WAL), and only then is the information also updated in the MemStore. In this way no data loss occurs in case of a failure like a power outage, when the data in memory is lost. When the MemStore data grows beyond a threshold, it is persisted to HDFS in an HFile. These files offer an efficient way to store data on disk because they contain an index which is used to quickly locate key-values within the file. When the HFile is completely written, the WAL can be discarded.[38][25] Mutations in HBase are atomic on a row basis.[38]
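The write path described above (log first, memory second, flush when a threshold is exceeded) can be sketched as follows. This is a toy model under simplifying assumptions: the WAL and the HFiles are plain Python lists instead of HDFS files, and the flush threshold counts entries instead of bytes. The RegionStore class and its method names are hypothetical.

```python
class RegionStore:
    """Toy model of the write path: WAL first, then MemStore, then flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []         # stand-in for the write-ahead log kept in HDFS
        self.memstore = {}    # in-memory key-values
        self.hfiles = []      # each flushed file is a sorted list of (key, value)
        self.flush_threshold = flush_threshold   # assumption: counted in entries

    def put(self, key, value):
        self.wal.append((key, value))    # 1. persist the mutation to the log
        self.memstore[key] = value       # 2. only then update the memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the sorted MemStore content as a new immutable file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()                 # the logged entries are now in the file

    def recover(self):
        """Rebuild the MemStore by replaying the log after a simulated crash."""
        self.memstore = dict(self.wal)


store = RegionStore(flush_threshold=3)
store.put(b"a", b"1")
store.put(b"b", b"2")
store.memstore = {}      # simulate a crash that loses the memory content
store.recover()
```

After the simulated crash the unflushed mutations are restored from the log, which is the property the WAL is meant to provide.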




Data Retrieval

When a region server needs to retrieve a key-value for a client, it searches both the MemStore and the HFiles stored in HDFS. By requesting only particular timestamps, searching some of the HFiles can be avoided, which brings some performance improvements. The MemStore is also used as a cache for recently retrieved key-values.[38][25]

Searching in memory is very efficient because keys are looked up in B+ Trees with O(log(n)) complexity. Searching in HFiles is accomplished, as stated before, by using the index from the file, which avoids the need to load the entire file into memory, which is usually impossible.[38][25]

Because each column is stored internally as a key-value, carrying all the information that identifies it along with the value, it does not matter from the point of view of space requirements where the actual data is stored. This allows users to move some data from the value to the row key or to the column qualifier if the column needs to be indexed by that information. However, keys (row keys and column keys) should be kept as small as possible in order to keep the HFile index small.
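The read path can be sketched with a simplified model in which every file stores its timestamp range, so that files which cannot contain a requested version are skipped. This is an illustration under assumptions and not the HBase implementation: each file holds a single version per key, and the HFile class and the read function are hypothetical names.

```python
import bisect


class HFile:
    """Toy immutable sorted file with a key index and a timestamp range."""

    def __init__(self, cells):
        # cells maps key -> (timestamp, value); assumed to be non-empty
        self.keys = sorted(cells)                      # plays the role of the index
        self.entries = [cells[k] for k in self.keys]
        stamps = [ts for ts, _ in self.entries]
        self.min_ts, self.max_ts = min(stamps), max(stamps)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)         # binary search in the index
        if i < len(self.keys) and self.keys[i] == key:
            return self.entries[i]
        return None


def read(key, memstore, hfiles, min_ts=0, max_ts=float("inf")):
    """Return (newest value in the time range, number of files searched)."""
    best = None
    cell = memstore.get(key)
    if cell is not None and min_ts <= cell[0] <= max_ts:
        best = cell
    searched = 0
    for hfile in hfiles:
        if hfile.max_ts < min_ts or hfile.min_ts > max_ts:
            continue                                   # time range cannot match
        searched += 1
        cell = hfile.get(key)
        if cell is not None and min_ts <= cell[0] <= max_ts:
            if best is None or cell[0] > best[0]:
                best = cell
    return (best[1] if best else None), searched


old = HFile({b"k": (100, b"v1"), b"j": (150, b"x")})
new = HFile({b"k": (300, b"v2")})
memstore = {b"k": (500, b"v3")}
```

Restricting the time range in a call such as read(b"k", memstore, [old, new], max_ts=200) lets the function skip the newer file entirely, which is the optimization mentioned above.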


House Keeping

After many mutations, multiple flushes from memory to disk occur, so a lot of HFiles are going to be created. The retrieval performance decreases when the number of files grows. To eliminate this issue HBase executes compactions regularly, in order to reduce the number of HFiles.
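A compaction can be viewed as a merge of several sorted files into a single one. The sketch below is a simplification: it keeps only the newest version of each key, whereas the number of versions kept by HBase is configurable, and it represents files as in-memory lists. The compact function and the sample files are hypothetical.

```python
import heapq


def compact(hfiles):
    """Merge sorted files into one, keeping the newest version of each key.

    Each file is a list of (key, timestamp, value) tuples sorted by key and
    then by decreasing timestamp. Keeping a single version per key is an
    assumption made for brevity.
    """
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result = []
    for key, timestamp, value in merged:
        if result and result[-1][0] == key:
            continue                  # an older version of a key already kept
        result.append((key, timestamp, value))
    return result


file_1 = [(b"a", 1, b"a1"), (b"c", 1, b"c1")]
file_2 = [(b"a", 2, b"a2"), (b"b", 2, b"b2")]
compacted = compact([file_1, file_2])
```

Because the inputs are already sorted, the merge reads each file sequentially, which fits the sequential access pattern that HDFS favours.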


The Reasons for the Chosen Technologies

NoSQL technologies have become very popular, but many startups seem to choose them just because they are a trend. Before making a decision we analyzed our data needs in depth. This section presents our reasons for using Hadoop and for choosing HBase as our database.


Why Hadoop?

We require a solution capable of scaling linearly by just adding commodity hardware, without additional overhead. This is exactly the reason why Hadoop was created. Scaling does not require changing the code or restructuring the data.

Code search and analysis require processing of both unstructured and structured data. We aim at building a distributed extractor which will parse source code and extract facts about it. The input source code constitutes unstructured plain text data whose processing is embarrassingly parallel with Hadoop, by assigning groups of files to each Map task. The facts extracted from code are usually structured data which may have a graph structure. Hadoop is not usually recommended for this kind of data unless the data is structured and optimized for the particular usage scenario.

The nature of our project requires a fixed set of algorithms that are going to be run on a long-term basis. Querying data in unexpected ways is not required in our case. Our information retrieval processing requires a fixed set of steps which rarely change: crawling, filtering, indexing, ranking, retrieval etc. Only crawling and indexing require massive updates of the database. In our ranking algorithms reads prevail (see Chapter 4). The input data is written once during the indexing phase, but is read multiple times iteratively during ranking, with only small updates of some fields at the end of each iteration. If the data written during indexing is structured properly, Hadoop performs well when it reads it repeatedly during the ranking phase.

The last reason for choosing Hadoop is its good support and well-written documentation. Being used by major actors in IT boosts its support and stability, making it a reliable solution.



RDBMSs generally offer ACID guarantees, an abbreviation for atomicity, consistency, isolation and durability. This subsection explains these concepts and analyzes whether they are required for our system. If not, we can drop some of them in order to gain other advantages. All these guarantees are usually linked to the concept of a transaction, which is a unit of work in a database that may involve several steps.[6]

Atomicity guarantees that a transaction either succeeds completely or fails completely [6]. If some step fails in the middle of the transaction, the system must return to the state it was in before the transaction started and report a failure. HBase guarantees atomic row mutations [32][38], which meets our requirements for the Generalized CodeRank algorithm. We do have updates that span several rows at a time, and even several tables during the indexing phase, but a failure that leaves one data field inconsistent with another will have a statistically insignificant impact on our system. Besides, such indexing errors are easy to recover from without data loss.

Consistency in the ACID sense differs from the concept of the same name found in distributed systems, which is discussed in the next subsection. Here consistency guarantees that a transaction brings the system from one valid state to another [6]. This means that if some constraints, triggers or cascades are violated after the transaction, the transaction must be rolled back and the system must return to its original state. So consistency deals with the logical conditions of the system, as opposed to atomicity, which deals with failures and errors. HBase consistency guarantees are linked to its atomic row mutations feature. Retrieving a row returns a complete image of that row as it existed at some point in history [32]. Additionally, “time travel” or updates from the past are not possible.
HBase does not come with any other consistency guarantees in the ACID sense, but developers are free to implement this logic in their application, either on the client side or on the server side through an HBase feature called coprocessors. This could come with some performance penalties, especially when it is implemented on the client side. However, for our application ACID consistency is not required. Logical constraints can be invalidated only through programming errors, and there is no reason to sacrifice performance for constraint checking if violations are not very likely to occur.

Isolation ensures that a transaction will not influence other concurrent transactions, i.e., transactions are independent of each other [6]. HBase offers atomic row mutations [38] and as a consequence isolation is guaranteed at the same granularity [32]. It is not very likely that a higher isolation guarantee is required for our applications. We are not planning to run concurrent algorithms that require atomic operations on multiple rows or tables. We plan to run read queries or distributed, non-concurrent MapReduce algorithms in batch jobs. Most of our usage patterns will consist of reads. Isolation violations can only occur due to human error or programming errors, which are expected anyway in a system.

Durability guarantees that when a transaction is reported as successful its data mutations have already been persisted, such that in case of a system failure (like a power outage) there are no data losses [6]. HBase aims at offering complete durability through the WAL by ensuring that no mutation is reported as successful until writing to the log has finished. However, there are still issues with this feature and at the time of this writing the only guarantee is that the data has been flushed to the operating system buffer [42]. If an outage occurs before the buffer is flushed to disk the data is lost. This is not an HBase issue but an HDFS one, which has recently been solved [36], but its integration into HBase is pending [21]. It is very likely that the next HBase version will support full durability. However, small data losses from an unflushed OS buffer are not critical for our applications. Our data is usually obtained from crawling or derived from other data through processing, thus it can be easily recovered.


CAP Theorem and PACELC

Eric Brewer conceived the CAP principles in 2000, CAP being an abbreviation for consistency, availability and partition-tolerance. In 2002, Seth Gilbert and Nancy Lynch formalized them into a theorem [40] and since then it has become a fundamental model for describing distributed databases.

CAP Theorem: It is impossible for a distributed system to have all three qualities of consistency, availability and partition-tolerance at the same time.[20][40]

As stated in the previous section, consistency has a different meaning in the context of distributed systems than in ACID. Actually, this is the meaning which is usually intended when the term is used. Consistency in the distributed systems sense subsumes the atomic and consistent meanings from the ACID concepts [40], so it may be defined as atomic consistency.

Consistency in a distributed system (or atomic consistency) guarantees that any observer will always see the latest version of the data no matter which replica is read.[6][40][20]

Availability ensures that the system will continue to work as a whole even when a node fails, i.e., a response is always received for a request.[40][20]

Partition-tolerance requires a distributed system to continue to operate even if arbitrarily many messages between two nodes are lost, i.e., when a partition appears in the network.

A model that describes the CAP Theorem better, named PACELC, was proposed by Daniel Abadi [2]. Each letter of this abbreviation is written as a capital letter in the following scheme:

if Partition: trade between Availability and Consistency
Else: trade between low Latency and Consistency

The PACELC model expresses the fact that in case of a network partition (the P from PACELC) a system needs to trade between availability (the A) and consistency (the C). Else (the E), the system must decide whether providing a low latency (the L) or a stronger consistency (the C) is more important.
As stated, atomic consistency covers both the atomicity and the consistency terms from ACID. Thus, HBase guarantees the consistency condition in distributed systems terms. Availability is weakened in the sense that when a region server fails it takes some time until the master reassigns its regions and the newly assigned region server replays the log (WAL) of the failed server. By default it takes up to three minutes for ZooKeeper to figure out that a region server has failed. This can also have implications for latency and data locality if the data from one of the remaining replicas is not located on the same machine as the newly allocated region server, but this problem is solved when compactions are performed. The tradeoff made by HBase at the expense of availability is not a big concern for Distributed Sourcerer because the database is not designed to be used by critical realtime applications like code search. The database is usually used by algorithms, which can be programmed to cope with this kind of situation by waiting for the region to be recovered.



In PACELC semantics, HBase can be characterized as a PC/EC system, because in case of a network partition it prefers keeping consistency and weakening availability to some extent, and during normal operation writes have a larger latency because of the consistency requirements of the underlying HDFS implementation. Latency also suffers in the case of reassigned regions, but this is only temporary. However, the consistency is weaker than in RDBMS, such that latency is kept within a controllable range.



By applying the CAP Theorem [40] it is now obvious why SQL and RDBMS do not scale for big data. Ensuring the ACID constraints guarantees atomic consistency, which is a strong consistency requirement and in PACELC semantics translates to a PC/EC distributed system. This sacrifices availability and latency for the benefit of consistency. As the system grows, the latency also gets larger and parts of the data become unavailable due to failures, which are normal in a commodity hardware cluster.

As described in the previous section, HBase is also a PC/EC system, but with more relaxed consistency requirements. Only row mutations are atomic; there are no transactions and no constraints between columns. By giving up joins, data denormalization and duplication are encouraged, such that only one big table is queried, reducing the overhead. However, this brings limitations in scenarios where a relational model is more appropriate.

Another problem with SQL databases is the algorithms they use. Most of them use B+ Trees to store indexes [38], which offer good performance, with O(log(n)) complexity, for reads, updates and inserts. But as the database size grows, more updates are performed and the B+ Trees get imbalanced. Rebalancing is a very expensive operation which can significantly slow down the database. On the other hand, HBase uses a more appropriate design for big data by storing the B+ Trees in the MemStore for recently accessed key-values and by using an index for HFiles, which are stored on disk in HDFS [38]. An overhead occurs during compactions, but those are performed in two different stages, which lowers the impact on performance.

SQL databases are able to run on a cluster in a distributed way, but scaling them involves a big operational overhead [38]. HBase scales automatically without human intervention. When a region grows beyond a limit it is automatically split into two regions, as described in Section 2.5.



We saw in this chapter that HBase, by offering atomic row mutations, guarantees enough consistency for our usage requirements. Read latency is kept low as the system grows, ensuring good performance for MapReduce. The availability at scale is far better than what SQL can offer, and the partial outages are controllable and predictable (they are not longer than 3 minutes by default). No data losses can occur in case of hardware failures, because mutations are always persisted to the log first. Since all these HBase advantages fit our needs and none of the disadvantages is a concern for our applications, we decided to use HBase to reimplement SourcererDB into what we call SourcererDDB.

Chapter 3

Database Schema Design and Querying
This chapter presents the motivations behind the schema design decisions for the HBase database used in Distributed Sourcerer. The former database, based on MySQL, is also presented for comparison, highlighting the differences.


Database Purpose

As described in Section 1.2, Sourcerer uses a database to store information about code entities and the relations between them, as well as information about projects and files. The extractor parses Java source files, JARs and class files from the repository in order to extract this information, which is described by using the following models [62]:

• Projects: The biggest division in a repository is a project, which consists of a set of files that comprise the same system and are typically developed by the same team, in the same company or organization. For each project there is a database entry which stores metadata fields like project name, description, type, version and path within the repository.

• Files: The repository stores Java source files (with .java extension), JAR (Java archive) files (with .jar extension) and Java class files (with .class extension), which are compiled byte code files contained within the JAR files. Class files not packed into JARs are ignored by Sourcerer. For each file, metadata fields like path, file type, file hash and the ID of the project that contains it are stored in the database.

• Entities: The smallest metamodel divisions extracted from code are represented by entities such as methods, variables, classes, interfaces, packages etc.

• Relations: The relationships between entities are modeled by relations, such as a calling relationship between two methods, an inheritance relationship between two classes or a containment relationship between a class and a method.

Various algorithms can be built on top of the infrastructure to use as input the file structure of the projects and the relations between code entities. Chapter 4 describes such an algorithm for computing CodeRank, a metric used to rank code entities by their popularity, in a similar way to how PageRank from Google is used to rank the popularity of web pages. The relations between code entities are used as input to compute the CodeRank of each entity in the database.




The database API can be used to search for projects, files, entities and relations matching several criteria. For example, an application may need to retrieve all methods called from a particular class instance in a particular project. This chapter describes how the HBase database was designed in order to facilitate searching by several matching criteria and how querying is performed on this schema design.


Design Principles

All schema design decisions were made such that the processing time for database operations is minimized. The most important factor that was considered was the reading time, because the algorithms that work with the database perform faster when they can read their input with low latency and high throughput. Usually only a small number of small fields are updated in the database. The most complex writing process takes place at the beginning, when the database is populated, but after this stage most operations are reads accompanied by some updates.

Some algorithms need to read large amounts of data repeatedly. For instance, CodeRank runs iteratively until convergence is reached, so it must repeatedly read all relations from the database. In these kinds of situations, loading large batches of relations into memory with high throughput and low latency is vital. On the other hand, in the case of CodeRank a smaller amount of information needs to be written into the database. After each iteration the current CodeRank (a double-precision floating point value) must be written for each entity. The number of entities is much smaller than the number of relations.

There is no best design that performs well in all situations, so compromises need to be made to optimize performance for some particular scenarios. These scenarios were chosen by studying all the MySQL SELECT queries used in Sourcerer, as well as the data requirements for computing CodeRank for the entities.

As described in Chapter 2, NoSQL schema design principles for databases such as HBase differ substantially from their relational counterparts. Because join operations are not natively supported and an arbitrary number of columns with arbitrary names can be used for a row, normalization is not required. On the contrary, according to the DDI (Denormalization, Duplication, Intelligent keys) principle [19], denormalization should be used instead.
By using this principle fewer reads are needed to retrieve the data, because all the required columns can be stored on the same row, not on different rows from different tables as in normalized relational data. Denormalization is often used together with duplication if the required data must be retrieved by different matching criteria. In this way no secondary indexes need to be created, as in SQL databases. In HBase data is sorted by keys, so an intelligent key design must be chosen such that the most common search criteria are optimized. Additionally, as discussed in Chapter 2, because data is stored in HFiles as KeyValues, it makes no difference for the storage requirements whether data is stored in the key part or in the value part.


Projects Data

Project metadata is stored in HBase in a similar way to MySQL. The main difference, detailed in the following sections, lies in the way the row key was designed. The project types defined in Sourcerer are described in Table A.1 [60][62][59].


Former SQL Database

The original MySQL database used in Sourcerer has the columns described in Table 3.1 [60][62][59]. Most of the columns can have a null value, thus are optional and important columns like project_id and project_type are indexed for fast retrieval in O(log n).

Table 3.1: Columns for projects MySQL table

Column        Is Indexed  Null  Description
project_id    yes         no    Numerical unique ID of the project.
project_type  yes         no    Type of the project.
name          yes         no    Name of the project from the original repository.
description   no          no    An optional human readable project description.
version       no          yes   Version number for MAVEN projects.
groop         yes         yes   Group for MAVEN projects.
path          no          yes   Project path within the repository.
hash          yes         yes   Project MD5 hash for JAR and MAVEN projects.
has_source    yes         no    Whether the project has or does not have source files.

An additional SQL table named project_metrics stores metrics for projects, like the number of lines of code and the number of non-whitespace lines of code. Each row contains the project ID, the metric type and the metric value. Thus, a join by project_id is required in order to obtain the metric values for a project.


Functional Requirements

The distributed database should be able to quickly retrieve a project by its ID. As can be seen in the next sections, files, entities and relations are attached to a project by referring to its ID. If more information about a project is required, the project can be looked up by its ID.

    SELECT project_id, project_type, path, hash
    FROM projects WHERE project_type = ?

Listing 3.1: SQL Query used to retrieve projects by their type

There are a few methods implemented in Sourcerer that retrieve information about projects by their type by using SQL queries like the one from Listing 3.1. The new database needs to provide an efficient way to retrieve project entries by their type.


Schema Design

Each project must be uniquely identified by an ID. An MD5 hash can be used to generate such an ID. Some of the metadata fields used to describe a project can be hashed to generate the unique MD5. For JAVA_LIBRARY and CRAWLED projects the path from the repository is used as the hash seed, since every project has a unique path. Other types of projects do not have this field, so different fields are used to generate a unique ID. For JAR and MAVEN projects the hash field is used. For the two SYSTEM projects the ID is a 16 byte array containing the ASCII string primitives or unknowns respectively, right padded with null bytes.

Table 3.2: projects HBase Table

Row Key                 <projectType><projectID>
Default Column Family   name, description, version, groop, path, hasSource
Metrics Column Family   linesOfCode, nonWhitespaceLinesOfCode
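The ID generation rules described in this subsection can be sketched as follows. This is an illustration and not the Sourcerer code: the function name project_id and its parameters are hypothetical, and it is an assumption that the hash field of JAR and MAVEN projects is available as a hexadecimal MD5 string.

```python
import hashlib

ID_LENGTH = 16   # an MD5 digest has 16 bytes


def project_id(project_type, path=None, md5_hex=None, system_name=None):
    """Build the 16 byte project ID according to the project type."""
    if project_type in ("JAVA_LIBRARY", "CRAWLED"):
        # The repository path is unique, so it is used as the hash seed.
        return hashlib.md5(path.encode("utf-8")).digest()
    if project_type in ("JAR", "MAVEN"):
        # Assumption: the existing hash field is a hexadecimal MD5 string.
        return bytes.fromhex(md5_hex)
    if project_type == "SYSTEM":
        # ASCII name right padded with null bytes up to 16 bytes.
        return system_name.encode("ascii").ljust(ID_LENGTH, b"\x00")
    raise ValueError(f"unknown project type: {project_type}")


crawled_id = project_id("CRAWLED", path="crawled/example-project")
system_id = project_id("SYSTEM", system_name="primitives")
```

All the IDs have the same length regardless of the project type, which keeps the row key format fixed.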



Project metadata is stored in the projects HBase table described in Table 3.2. Each project entry is assigned to one row, creating a tall-narrow table with a lot of rows and just a few columns, which have the same meaning as the SQL columns in Table 3.1. Some of these fields can be stored in the key part to achieve efficient retrieval. All the other columns, which are homologous to the ones from the SQL schema, are grouped together in the default column family. Because an arbitrary number of columns can be stored on each row there is no need to store null values, so only the metadata fields that are available are set as HBase columns. Another column family, named metrics, is used to store any metric defined for the project on that row. Currently Sourcerer only uses two metrics, but more metrics can be added in the future at no cost.

The main question that arises is how to design the row key for efficient retrieval by both project ID and project type. If the project ID is used as the row key, any project can be efficiently retrieved by its ID with a get operation. Using a hash function for all project IDs, except for the two SYSTEM projects, causes project rows to be randomly distributed across regions, no matter what type they have. So row scans cannot be used to efficiently retrieve projects by type if the project ID is used as the row key. Filtering only the project rows that have a particular type is very inefficient because it requires scanning the whole projects table.

The project type can be encoded as a single byte and placed in the row key before the 16 byte project ID hash, as described in Table 3.2. In this way data locality is achieved and all project entries with a particular type can be retrieved by using row scans. No project type seems to appear much more often than all the other types in the dataset, so region hotspotting [38] shouldn’t be a problem.
The issue with this approach is that project entries can no longer be retrieved by their ID without knowing the type in advance. If the type is not known, a get operation can be tried for each project type with the particular ID. All these requests can be served in parallel, and for a big dataset they will be served by different regions, exploiting the distributed nature of HBase. Additionally, the number of types to be tried is very small. There are very few projects of JAVA_LIBRARY type and only two projects of type SYSTEM, so the number of projects of these types can be neglected. Most of the projects have type CRAWLED or JAR and some of them have type MAVEN. So basically there are only three project types to be tried, making this approach very efficient.



As described in Subsection 3.3.3, project entries can be retrieved efficiently when the project type is known. All projects of a particular type can be retrieved with a row scan. The start row is set to the 1-byte project type and the stop row is the same byte incremented by 1. A particular project entry can be retrieved with a get operation on a row key that consists of the project type as the first byte followed by the bytes of the project ID. If the project type is not known, the previously described technique of trying all types can be applied. As explained, this does not incur a serious performance penalty. Querying by any other criterion, such as path, is not efficient with this schema design. It is possible by using value filters, but it requires scanning the whole table, which can take a long time.
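The row key layout and the two access paths described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numeric type codes in `PROJECT_TYPES`, the UTF-8 encoding of the ID before hashing, and all function names are hypothetical, the special handling of the two SYSTEM projects is ignored, and a sorted in-memory list stands in for the HBase table.

```python
import bisect
import hashlib

# Hypothetical one-byte type codes; the actual values are not given in the text.
PROJECT_TYPES = {"SYSTEM": 0, "JAVA_LIBRARY": 1, "CRAWLED": 2, "JAR": 3, "MAVEN": 4}


def project_row_key(project_type, project_id):
    """Build <projectType (1 byte)><MD5(projectID) (16 bytes)>."""
    type_byte = bytes([PROJECT_TYPES[project_type]])
    return type_byte + hashlib.md5(project_id.encode("utf-8")).digest()


def scan_bounds(project_type):
    """Start row (inclusive) and stop row (exclusive) covering one project type."""
    code = PROJECT_TYPES[project_type]
    return bytes([code]), bytes([code + 1])


def scan(sorted_keys, start, stop):
    """Emulate a row scan over lexicographically sorted row keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def candidate_keys(project_id, types=("CRAWLED", "JAR", "MAVEN")):
    """Row keys to try with parallel gets when the project type is unknown."""
    return [project_row_key(t, project_id) for t in types]


# A tiny in-memory stand-in for the projects table.
table = sorted([
    project_row_key("CRAWLED", "project-a"),
    project_row_key("JAR", "project-b"),
    project_row_key("JAR", "project-c"),
    project_row_key("MAVEN", "project-d"),
])

jar_rows = scan(table, *scan_bounds("JAR"))
found = [k for k in candidate_keys("project-d") if k in table]
```

Here `jar_rows` holds the two JAR rows, and `found` holds the single key that exists among the three candidates, which is the MAVEN one.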


Files Data

For files, as for projects, the database stores only metadata. That is why the HBase schema design in this case is also similar to the SQL one. The file types defined in Sourcerer are described in Table A.2 [60][62][59].




Former SQL Database

There are two MySQL tables with file information. One of them, the files table, has its columns described in Table 3.3 and stores metadata [60][62][59]. The other one, named file_metrics, stores metrics related to files in a manner similar to project_metrics. Currently the same two metrics are used: the number of lines of code and the number of lines of code with no whitespace lines.

Table 3.3: Columns for files MySQL table

| Column | Is Indexed | Null | Description |
|---|---|---|---|
| file_id | yes | no | Numerical unique ID of the file. |
| file_type | yes | no | Type of the file. |
| name | yes | no | Name of the file. |
| path | no | yes | File path within the repository. |
| hash | yes | yes | File MD5 hash for JAR files. |
| project_id | yes | no | ID of the project that contains this file. |


Functional Requirements

As the next sections show, entities and relations can refer to the ID of the file they belong to. It should be possible to retrieve file entries, which contain metadata about files, by their unique ID as well as by their type or by the ID of the project they belong to. Different combinations of these three criteria should be considered.


Schema Design

Each file from the repository must be uniquely identified by an ID, which is obtained by using an MD5 hash. For JAR files the name field is hashed and for other file types the path field is hashed, resulting in a unique ID for each file entry. For more information about file metadata fields see the SQL columns of the former database in Table 3.3. As in the case of projects, it is necessary to store both metadata and metrics in the database. A similar HBase schema can be used by storing one file entry on each row of the files HBase table, described in Table 3.4. A default column family contains the same information as the SQL columns described in Table 3.3, except for some metadata fields which are moved to the key part for efficient retrieval. The metrics column family stores file metrics in the same manner as in the projects HBase table.

Table 3.4: files HBase Table

Row Key: <projectID><fileType><fileID>
Default Column Family: name, path, hash
Metrics Column Family: linesOfCode, nonWhitespaceLinesOfCode
Entities Column Family: <entityType><fqn>
Relations Column Family: <relationKind><targetEntityID><sourceEntityID>

After defining the column families for file data and the column keys used, the remaining challenge was to design the row key for efficient retrieval by file ID, file type and project ID. All three fields are placed in the row key and encoded as 33 bytes. The first 16 bytes represent the project ID, the next byte encodes the file type and the last 16 bytes represent the file ID, as illustrated in Table 3.4. For efficient retrieval of a file entry, both the file ID and the ID of the project it belongs to need to be known in advance. Knowing the file type is less important, since all three types can be tried without sacrificing much performance. A similar approach was described in Subsection 3.3.3 for trying all project types to retrieve a project entry. In the case of files there are even fewer types to try: only three.
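As an illustration of the 33-byte layout, the following Python sketch derives a file ID and assembles a row key. The type names and one-byte codes in `FILE_TYPES`, the UTF-8 encoding before hashing, and the helper names are assumptions made for this example only; the text specifies just the field order, the field sizes, and which field is hashed.

```python
import hashlib

# Hypothetical file type names and one-byte codes (the real ones are in Table A.2).
FILE_TYPES = {"SOURCE": 0, "CLASS": 1, "JAR": 2}


def md5_id(text):
    """16-byte MD5 digest used as an ID (input encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).digest()


def file_id(file_type, name, path):
    """Hash the name for JAR files and the path for every other file type."""
    return md5_id(name if file_type == "JAR" else path)


def file_row_key(project_id, file_type, fid):
    """Build <projectID (16 bytes)><fileType (1 byte)><fileID (16 bytes)>."""
    if len(project_id) != 16 or len(fid) != 16:
        raise ValueError("project ID and file ID must be 16-byte hashes")
    return project_id + bytes([FILE_TYPES[file_type]]) + fid


project = md5_id("example-project")
fid = file_id("SOURCE", "Foo.java", "src/org/example/Foo.java")
key = file_row_key(project, "SOURCE", fid)
```

When the file type is unknown, the same function can be called once per type code to produce the three candidate keys for parallel get operations.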



As discussed in the previous section, efficient retrieval of a file entry is achieved when querying HBase by both file ID and project ID. Knowing the file type as well would not bring a substantial performance improvement. Knowing the project ID is not a problem for the current design of the database because, as shown in the next sections, whenever a file ID is stored for an entity or relation the project ID is kept as well. However, if the project ID is not known, it is still possible to retrieve a file entry by using a filter. A custom row filter has been implemented which passes all rows whose row key suffix (the last 16 bytes) equals the file ID. This retrieval approach is not optimal since it requires scanning the whole table, but at least it makes the scenario possible.

Retrieving all files from a project is possible with a row scan over all rows that begin with the 16 bytes of the project ID. If an additional byte representing the file type is appended, only files of that type are retrieved from the project. If all files of a particular type must be retrieved from the whole repository, the whole table must be scanned, and a custom row filter can be used which passes only rows whose 17th byte equals the value of the file type. Querying by other fields can be achieved inefficiently by using column value filters and scanning the whole table, which can take some time for big datasets.
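The difference between the efficient prefix scans and the full-table filters can be sketched as follows. A sorted Python list stands in for the files table, the 16-byte IDs are arbitrary placeholders, the type codes are hypothetical, and the two filter functions only mimic what the custom row filters are described as doing.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Seek to the first key >= prefix and read while keys share the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    rows = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        rows.append(key)
    return rows


def full_scan_by_file_id(keys, fid):
    """Full scan: pass rows whose last 16 bytes equal the file ID."""
    return [k for k in keys if k[-16:] == fid]


def full_scan_by_file_type(keys, type_code):
    """Full scan: pass rows whose 17th byte (index 16) equals the type code."""
    return [k for k in keys if k[16] == type_code]


# Placeholder IDs and hypothetical type codes.
P1, P2 = b"\x01" * 16, b"\x02" * 16
F1, F2, F3 = b"\xa1" * 16, b"\xa2" * 16, b"\xa3" * 16
SOURCE, JAR = 0, 2

table = sorted([
    P1 + bytes([SOURCE]) + F1,
    P1 + bytes([JAR]) + F2,
    P2 + bytes([SOURCE]) + F3,
])

files_of_p1 = prefix_scan(table, P1)                 # touches only P1's rows
jars_of_p1 = prefix_scan(table, P1 + bytes([JAR]))   # narrower prefix
by_file_id = full_scan_by_file_id(table, F3)         # visits every row
all_sources = full_scan_by_file_type(table, SOURCE)  # visits every row
```

The prefix scans stop as soon as the prefix no longer matches, whereas both filter functions inspect every key, which mirrors the cost difference discussed above.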


Entities Data

Code entities are described by many information fields. In order to achieve efficient retrieval by several of these fields, the duplication design principle [19], described in Section 3.2, is applied. Thus, entities data is stored redundantly in multiple HBase tables. The entity types available in Sourcerer are described in Table A.3 [60][62][59].


Former SQL Database

As in the case of projects and files, two MySQL tables are used to store entities information. General information used to describe them is placed in the entities table [60][62][59]. Metric information is stored in entity_metrics in the same way as for projects and files. Table 3.5 describes the columns used in the entities SQL table.


Functional Requirements

Schema design for entities HBase tables should provide efficient retrieval by the following data fields:
• FQN (Fully-Qualified Name)
• entity type
• project ID
• file ID

Table 3.5: Columns for entities MySQL table

| Column | Is Indexed | Null | Description |
|---|---|---|---|
| entity_id | yes | no | Numerical unique ID of the entity. |
| entity_type | yes | no | Type of the entity. |
| fqn | yes | yes | FQN (Fully-Qualified Name) of the entity. |
| modifiers | no | yes | Java modifiers for entity types that are allowed to have them. |
| multi | no | yes | Multipurpose column for additional information. |
| project_id | yes | no | ID of the project that contains the entity. |
| file_id | yes | yes | ID of the file that contains the entity. |
| offset | no | yes | Byte offset of the entity in the source file. |
| length | no | yes | Byte length of the entity in the source file. |

These requirements come from the Sourcerer API to the SQL database, where many SQL queries select rows by these criteria. For example, Listing 3.2 shows three queries extracted from Sourcerer's code. All of them filter results by entity type, which marks this field as very important. One of the queries searches for entity entries that have a particular FQN prefix, so partial FQNs should be a search criterion, not only exact FQNs. The other two queries search by project ID and file ID respectively.
    -- Retrieval by FQN prefix and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE fqn LIKE '${PREFIX}%'
      AND entity_type NOT IN ('PARAMETER', 'LOCAL_VARIABLE')

    -- Retrieval by project ID and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE project_id = ?
      AND entity_type IN ('ARRAY', 'WILD_CARD', 'TYPE_VARIABLE', 'PARAMETRIZED_TYPE', 'DUPLICATE')

    -- Retrieval by file ID and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE file_id = ?
      AND entity_type IN ('CLASS', 'INTERFACE', 'ANNOTATION', 'ENUM')

Listing 3.2: SQL Queries used to retrieve entities data

The former SQL database uses secondary indexes for all these four fields, confirming their importance (see Table 3.5).


Schema Design

Entities data is stored redundantly in three HBase tables by applying the duplication design principle [19], thus ensuring efficient retrieval by several criteria. Each entity is uniquely identified by an MD5 hash ID, calculated from the following fields described in Table 3.5: entity type, FQN, modifiers, multi, project ID, file ID, offset and length. The entities_hash HBase table, described in Table 3.6, stores entity data by entity ID. It does this by storing the unique ID as the row key and the other fields as columns in the default column family. Entity IDs are used in relations data, so this table is useful when more information about an entity must be retrieved.

Table 3.6: entities_hash HBase Table

Row Key: <entityID>
Default Column Family: entityType, fqn, modifiers, multi, projectID, fileID, fileType, offset, length
Metrics Column Family: linesOfCode, nonWhitespaceLinesOfCode
Relations Column Family: sourceEntityType, codeRank, targetEntitiesCount, targetEntities, relationIDs

To achieve efficient retrieval by the four fields mentioned in the previous section, i.e. FQN, entity type, file ID and project ID, they need to be stored in the key part of two of the tables which store entities data, whether this key part is the row key or the column qualifier. The remaining fields, which are not stored in the key part, are serialized in the value part. For scenarios where searching by project ID or file ID is required, entities data is stored in the files HBase table, previously described in Table 3.4 and Subsection 3.4.3. When searching entities by FQN or FQN prefix a special table is used, named the entities table (see Table 3.7).

Table 3.7: entities HBase Table

Row Key: <fqn>0x00<projectID><fileID>
Default Column Family: <entityType>

By using the row key design of the files table, entities can be efficiently searched by project ID, file type and file ID. The entities column family is used to separate entities data from files metadata, which is stored in the default and metrics column families as described in Table 3.4. The entity type and FQN fields are placed, in this order, in the column qualifiers of the entities column family.

The one-byte entity type and the 16-byte MD5 hashes for file ID and project ID require exact matching when performing a search. For FQN, however, it must be possible to search for all entities that have a particular FQN prefix. The most efficient way to do this is to put this field at the beginning of the row keys of the entities table and to perform a scan by the required FQN prefix. After the FQN field the row key includes a null byte, which is useful for exact FQN matches. For example, assume we need to find an entity which has the exact FQN java.lang. If we perform a scan only by this string, other entities with different FQNs but the same prefix will also be returned, such as java.lang.Object, java.lang.String etc. But by appending the null byte to the scan start row, i.e. "java.lang\0", only the exact FQN is matched. The next fields in the row key are the MD5 hashes of project ID and file ID (see Table 3.7), which can help narrow results by these two other criteria.

All three queries from Listing 3.2 narrow the results by including or excluding entities of a particular type. Placing the entity type byte in column qualifiers makes this filtering possible when searching by FQN or FQN prefix. All these columns associated with entity types are placed in the default column family.
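The effect of the null byte separator can be reproduced with a small Python sketch. The key layout follows Table 3.7, while the placeholder project and file IDs, the UTF-8 encoding of the FQN, and the in-memory sorted list that plays the role of the entities table are assumptions made for illustration.

```python
import bisect


def entity_row_key(fqn, project_id, file_id):
    """Build <fqn>0x00<projectID (16 bytes)><fileID (16 bytes)>."""
    return fqn.encode("utf-8") + b"\x00" + project_id + file_id


def prefix_scan(sorted_keys, prefix):
    """Seek to the first key >= prefix and read while keys share the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    rows = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        rows.append(key)
    return rows


def fqn_of(row_key):
    """Recover the FQN, which ends at the first null byte of the row key."""
    return row_key.split(b"\x00", 1)[0].decode("utf-8")


PROJECT, FILE = b"\x11" * 16, b"\x22" * 16  # placeholder 16-byte IDs
table = sorted(
    entity_row_key(fqn, PROJECT, FILE)
    for fqn in ["java.lang", "java.lang.Object", "java.lang.String", "java.util.List"]
)

by_prefix = [fqn_of(k) for k in prefix_scan(table, b"java.lang")]
exact = [fqn_of(k) for k in prefix_scan(table, b"java.lang\x00")]
```

Scanning by `b"java.lang"` returns `java.lang`, `java.lang.Object` and `java.lang.String`, while scanning by `b"java.lang\x00"` returns only `java.lang`, because an FQN cannot itself contain a null byte.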





As mentioned in the previous section, an entity entry is retrieved by its ID from the entities_hash table. If entities need to be searched by several criteria, the other two tables can be used, i.e. the files table and the entities table. When an FQN or an FQN prefix is known, the entities table should be used. The following scenarios cover the use cases for this table:
• FQN or an FQN prefix is known
• both the exact FQN and project ID are known
• exact FQN, project ID and file ID are known

Since these three fields are placed in the row key, the scenarios are implemented by doing row scans or get operations. In the first case the FQN is used as the start row, and in the second the null byte and the project ID are appended. In the third case a get operation can be performed, because by appending the file ID the whole row key is known. Column qualifiers hold entity types, so by requesting only some columns the results are narrowed to the corresponding entity types.

When neither an FQN nor an FQN prefix is known, operations on the entities column family of the files table should be performed. Usually, this table covers the use cases in which the project ID or file ID is known. Similar scenarios were presented in Subsection 3.4.4 for searching files. Here the same matching requirements apply, but for entities instead of files, so the entities column family is used. Narrowing results to a specific entity type can be implemented with a column qualifier prefix filter which passes only those columns that have the required value as the first byte.


Relations Data

Storing relations data in HBase requires the same duplication design principle in order to achieve efficient retrieval for the desired scenarios. Data is stored redundantly in multiple HBase tables, each one being used for a particular scenario. The relation types defined in Sourcerer are described in Table A.4 [60][62][59]. Besides its type, a relation also has a class which defines the location of the target entity, as described in Table A.5 [60][62][59].


Former SQL Database

Relations data is stored in a single MySQL table, named relations. Table 3.8 describes its columns [60][62][59].


Functional Requirements

Retrieval of relation entries should be optimized for the following fields, explained in Table 3.8:
• source entity ID
• target entity ID
• relation type and relation class
• project ID
• file ID

Table 3.8: Columns for relations MySQL table

| Column | Is Indexed | Null | Description |
|---|---|---|---|
| relation_id | yes | no | Numerical unique ID of the relation. |
| relation_type | yes | no | Type of the relation. |
| relation_class | no | no | Class of the relation. |
| lhs_eid | yes | no | ID for the source entity of the relation. |
| rhs_eid | yes | no | ID for the target entity of the relation. |
| project_id | yes | no | ID of the project that contains the relation. |
| file_id | yes | yes | ID of the file that contains the relation. |
| offset | no | yes | Byte offset of the relation in the source file. |
| length | no | yes | Byte length of the relation in the source file. |

All these requirements can be found in SQL queries from the Sourcerer source code, except for source entity ID. Two of those SQL queries are shown in Listing 3.3. Optimizations for the source entity ID field were considered for practical reasons and because its column in the relations MySQL table is indexed.

    -- Retrieve relations by target entity ID and type:
    SELECT project_id
    FROM relations
    WHERE rhs_eid = ? AND relation_type IN ?

    -- Retrieve relations by project and type:
    SELECT *
    FROM relations
    WHERE project_id = ? AND relation_type IN ?

Listing 3.3: SQL Queries used to retrieve relations data


Schema Design

The duplication design principle [19] has been applied in order to achieve efficient retrieval by several criteria. Thus, relations data is stored redundantly in multiple HBase tables. Depending on the application some tables may not be implemented. For example, there is a table named relations_hash, described in Table 3.9, which stores relations data by their ID, similar to entities_hash. Currently, there is no feature or algorithm that uses it, so in the future it may be dropped if it is not required.

Table 3.9: relations_hash HBase Table

Row Key: <relationID>
Default Column Family: relationKind, sourceEntityID, sourceEntityType, targetEntityID, targetEntityType, projectID, fileID, fileType, offset, length

The fields by which the retrieval should be optimized are stored in the key part of the tables relations_direct (see Table 3.10), relations_inverse (see Table 3.11) and files (see Table 3.4). The other remaining fields, i.e. offset and length are serialized in the value part of those tables.



Table 3.10: relations_direct HBase Table

Row Key: <sourceEntityID><relationKind><targetEntityID>
Default Column Family: <projectID><fileID>

Table 3.11: relations_inverse HBase Table

Row Key: <targetEntityID><relationKind><sourceEntityID>
Default Column Family: <projectID><fileID>

Relation type and relation class are combined in a single byte in HBase tables, resulting in a field named relation kind. The three most significant bits are used for the relation class and the remaining five bits for the relation type. Relation IDs are calculated by applying an MD5 hash to the following fields: relation kind, source entity ID, target entity ID, project ID, file ID, offset and length.

Efficient retrieval by source entity ID is achieved by querying the relations_direct HBase table, described in Table 3.10, whose row key contains the following fields in this order: source entity ID (16 bytes), relation kind (1 byte) and target entity ID (16 bytes). All relations with a particular source entity ID, and optionally with a particular relation kind, can be retrieved through row scanning. The same principles apply to the relations_inverse table, described in Table 3.11, which is optimized for retrieval by target entity ID and has the following fields in its row key in this order: target entity ID, relation kind, source entity ID. Here scanning can be performed by target entity ID or by both target entity ID and relation kind.

Both the relations_direct and relations_inverse tables have the same column key design. They have a default column family, and column qualifiers contain the project ID and the file ID in this order. By selecting only specific columns or by using column qualifier filters, the results can be narrowed to the relations that are part of a particular project or source file.

Relations data is also stored in the relations column family of the files HBase table, described in Table 3.4, similar to entities data in the entities column family or to files data in the default column family. The row key design of this table allows efficient retrieval by project ID and file ID. The other important relation fields are stored in column qualifiers in the following order: relation kind, target entity ID and source entity ID. Narrowing results by matching these fields is achieved by selecting particular columns or by using column qualifier filters.

The Generalized CodeRank algorithm described in Chapter 4 gets relations information from the relations column family of the entities_hash table (see Table 3.6). The row key represents the source entity ID. The target entity IDs of the relations that have this source entity ID, as well as their relation kinds, are serialized in the targetEntities column. The current CodeRank of the source entity identified by the row key is stored in the codeRank column.
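The packing of relation class and relation type into one relation kind byte, and the two mirrored row key layouts, can be sketched in Python as follows. The numeric class and type codes are hypothetical (the real ones are listed in Tables A.4 and A.5) and the function names are chosen for this example.

```python
def encode_relation_kind(relation_class, relation_type):
    """Pack the class into the 3 most significant bits and the type into the low 5 bits."""
    if not 0 <= relation_class < 8:
        raise ValueError("relation class must fit in 3 bits")
    if not 0 <= relation_type < 32:
        raise ValueError("relation type must fit in 5 bits")
    return (relation_class << 5) | relation_type


def decode_relation_kind(kind):
    """Return (relation_class, relation_type) from a relation kind byte."""
    return kind >> 5, kind & 0x1F


def direct_row_key(source_id, kind, target_id):
    """relations_direct: <sourceEntityID><relationKind><targetEntityID>."""
    return source_id + bytes([kind]) + target_id


def inverse_row_key(source_id, kind, target_id):
    """relations_inverse: <targetEntityID><relationKind><sourceEntityID>."""
    return target_id + bytes([kind]) + source_id


# Hypothetical codes: class 2, type 5.
kind = encode_relation_kind(2, 5)
source, target = b"\x0a" * 16, b"\x0b" * 16  # placeholder entity IDs
direct = direct_row_key(source, kind, target)
inverse = inverse_row_key(source, kind, target)
```

With this layout a scan prefix of `source` alone, or of `source + bytes([kind])`, selects the relations of one source entity, optionally restricted to one relation kind; the inverse table supports the same prefixes starting from the target entity.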



A relation entry can be retrieved by its ID by using the relations_hash table, which stores the ID in the row key. If efficient retrieval by source entity ID and relation kind, or just by source entity ID, is desired, the relations_direct table should be used. If the target entity ID needs to be matched instead of the source entity ID, relations_inverse should be used. These two tables have the same column qualifier design. Requesting specific columns retrieves only those relations that are included in the files and projects identified by the columns. If only a particular project is required and all files from the project need to be included, a column qualifier prefix filter can be used which matches only the first 16 bytes, i.e. the project ID MD5 hash. A custom filter has been implemented which can match the last 16 bytes of the qualifier if filtering by file ID is required.

Relations can be efficiently retrieved by project ID, file type and file ID from the relations column family of the files table. If all or some of those fields need to be matched, the same approach as the one used to retrieve files or entities from this table must be applied. The results can be narrowed further by selecting specific columns or by using column qualifier filters. For instance, the first byte can be matched to select a specific relation kind. The bytes from the 2nd to the 17th can be matched to narrow results to a particular target entity ID, and finally the last 16 bytes can be matched for a specific source entity ID (see Table 3.4).

The CodeRank algorithm described in Chapter 4 queries relations data from the relations column family of the entities_hash table. The current CodeRank of an entity is also stored there. This table provides efficient retrieval by source entity ID.
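The byte positions used for narrowing results in the relations column family of the files table can be illustrated with a short Python sketch. The qualifier layout follows Table 3.4; the function names, the placeholder IDs and the relation kind values are assumptions, and the `select` function only imitates what a column qualifier filter would do on the server side.

```python
def relation_qualifier(kind, target_id, source_id):
    """Build <relationKind (1 byte)><targetEntityID (16)><sourceEntityID (16)>."""
    return bytes([kind]) + target_id + source_id


def parse_relation_qualifier(qualifier):
    """Split a 33-byte qualifier into (kind, target_id, source_id)."""
    if len(qualifier) != 33:
        raise ValueError("expected a 33-byte qualifier")
    return qualifier[0], qualifier[1:17], qualifier[17:33]


def select(qualifiers, kind=None, target_id=None, source_id=None):
    """Keep qualifiers that match every criterion that is not None."""
    selected = []
    for qualifier in qualifiers:
        k, t, s = parse_relation_qualifier(qualifier)
        if kind is not None and k != kind:
            continue
        if target_id is not None and t != target_id:
            continue
        if source_id is not None and s != source_id:
            continue
        selected.append(qualifier)
    return selected


A, B, C = b"\x0a" * 16, b"\x0b" * 16, b"\x0c" * 16  # placeholder entity IDs
CALLS, USES = 0x45, 0x46                            # hypothetical relation kinds
columns = [
    relation_qualifier(CALLS, B, A),
    relation_qualifier(CALLS, C, A),
    relation_qualifier(USES, C, B),
]

calls_only = select(columns, kind=CALLS)
into_c = select(columns, target_id=C)
from_b_to_c = select(columns, target_id=C, source_id=B)
```

Matching on the first byte corresponds to a qualifier prefix filter, while matching on the last 16 bytes corresponds to the kind of suffix matching that the text attributes to a custom filter.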


Dangling Entities Cache

The Generalized CodeRank algorithm described in Chapter 4 needs a fast way to retrieve all dangling entities without scanning the whole entities_hash table and checking whether each entity is dangling. Dangling entities and their CodeRanks can be obtained by querying the dangling entities cache. The cache needs to be rewritten each time a dangling entity's rank is updated. The dangling entities cache is implemented by redundantly storing all dangling entities at the end of the entities_hash table. This is accomplished through row key design. Each cache entry is a row whose key consists of 16 bytes with the maximum value (0xFF) followed by a dangling entity ID. Because keys are ordered bytewise, the 16-byte prefix ensures that no regular entity row is placed after the cache rows. A row scan whose start row is the 16 maximum-value bytes retrieves all dangling entities.
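The ordering argument behind the cache can be checked with a small Python sketch. A sorted list of byte strings plays the role of the entities_hash row keys; the entity names, the way entity IDs are derived here, and the helper names are illustrative assumptions.

```python
import bisect
import hashlib

CACHE_PREFIX = b"\xff" * 16  # 16 bytes with the maximum value


def entity_id(name):
    """Placeholder 16-byte entity ID (the real ID hashes several entity fields)."""
    return hashlib.md5(name.encode("utf-8")).digest()


def cache_row_key(dangling_entity_id):
    """Cache rows sort last: <0xFF x 16><danglingEntityID>."""
    return CACHE_PREFIX + dangling_entity_id


def scan_from(sorted_keys, start_row):
    """Emulate a row scan from start_row to the end of the table."""
    return sorted_keys[bisect.bisect_left(sorted_keys, start_row):]


entities = [entity_id(name) for name in ("A", "B", "C", "D")]
dangling = [entities[1], entities[3]]

# Regular rows plus the redundant cache rows, in bytewise key order.
table = sorted(entities + [cache_row_key(e) for e in dangling])

cached = [key[16:] for key in scan_from(table, CACHE_PREFIX)]
```

The scan that starts at `CACHE_PREFIX` returns only the cache rows, so `cached` contains exactly the IDs of the dangling entities without touching the regular entity rows.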



As this chapter shows, a schema design needs to be engineered for particular querying requirements or data access patterns. A one-size-fits-all design which works for every problem is not possible. SQL is more flexible from this point of view, but unfortunately it does not scale to our needs, as shown in Chapter 2. This chapter presented a demonstrative schema design which tries to match the original Sourcerer data access patterns as closely as possible. Many changes can be made to this schema in future development if needed.

Chapter 4

Generalized CodeRank
This chapter presents the design of a code analysis algorithm, named Generalized CodeRank, which ranks code entities by their popularity within the repository. Inspired by Google PageRank [63], instead of considering the links between web pages, it uses relations between code entities to compute ranks.


Reputation, PageRank and CodeRank

This section introduces the concept of PageRank, illustrates how it can be applied to code by introducing the concept of CodeRank and describes how the general concept of PageRank models the reputation of nodes within a graph.



The PageRank algorithm was first published in the article “The anatomy of a large-scale hypertextual Web search engine” by the Google founders Brin and Page [10]. Since then it has sparked a lot of research, and many variants, improvements and alternative uses have appeared besides its original use for the web. PageRank uses the graph structure of the web created by the links between web pages to rank the popularity of those pages. The algorithm is inspired by academic citation literature [10], where an article is considered important if many other articles cite it. Taking the idea further, PageRank ranks higher the pages that have more inbound links and, more importantly, the pages that are linked from other important pages. The algorithm's philosophy is that when a page links to another page, it trusts its content and vouches for its quality. The same thing happens in academia: when a paper cites another one, it relies on its content. PageRank is thus a way to measure a page's reputation within the web. If a popular and thus important page, such as one from CNN or Yahoo!, links to a web page, that page might be ranked higher than another page which has many inbound links from unpopular pages.


The Random Web Surfer Behavior

PageRank models the behavior of a random surfer [10] who starts from a random web page and keeps following random links without using the browser's back button. After a while he gets bored and jumps to a random page from the web. If he reaches a dangling page, also known as a sink page, which has no outbound links, he goes to a random page.

The set of all PageRanks can be viewed as a probability distribution vector. The PageRank of a page has a value between 0 and 1 and represents the probability that a user reaches that page by following the random surfer behavior. Because all PageRanks make up a probability distribution, their sum should be 1.

Web surfing can be modeled with a Markov chain where each page is a state and the links between pages carry the probabilities of passing to other states. In the case of a random surfer there is an equal probability of moving from a page to each of the pages it links to. More complex PageRank models can assign different probabilities in order to better model real user behavior, where pages are chosen by different criteria, such as their topic, the language they are written in or the location of the link on the page.



The concept of PageRank can be generalized so that the algorithm can be applied in fields other than the web. To do this, the Web link structure is viewed as a graph where web pages are nodes and links are edges. Thus, PageRank can be applied to any directed graph. This has already been done in several other fields. One example is a proposal to replace the ISI IF (Institute for Scientific Information, Impact Factor) ranking of science and social science journals, which only counts the number of citations over two years, with a PageRank-like prestige rank [8]. Another example is ecosystem modeling, in order to determine which species are important for the health of the environment [12].

Following this idea, recent research in code search and code analysis has proposed applications of PageRank to measure the importance of methods, classes, packages and other code entities [64][51][55]. The name CodeRank for this approach was proposed in an older implementation of Sourcerer [51]. The concepts of code entities and relations are explained in Chapter 3. Calculating CodeRank follows the principles of the PageRank algorithm, but treats code entities as the graph nodes and the relations between them as the graph edges. The result is a ranking of the most popular entities from the source code used as input. For instance, the classes java.lang.Object or java.lang.String should be very popular for Java source code.

This master thesis proposes a PageRank-like approach to rank all code entities from a repository by following all relations between them, not just some kinds of code entities such as methods, classes or packages. From this point on, we will refer to the algorithm that follows this approach as Generalized CodeRank. As far as we know, calculating PageRank on code entities by taking into account all entities and all relations from a repository has not been done before.
There are many useful applications for CodeRank:
• Improving results ranking in source code search engines.
• Creating a ranking of the most important projects from a repository.
• Listing the most important packages, classes and methods from a project to help developers get started with a new project. If they must read the code in order to understand how it works, they might start with the important packages and read the code of the most important methods of the most important classes.
• In the context of a Web-scale source code repository, a ranking of the most important libraries for a specific purpose can be computed. This might be useful for project managers who need to choose the best library or technology that accomplishes a desired task in their project.



As shown in previous work, CodeRank has been successfully used to improve the ranking of source code search engine results, providing better results to the user [51][55].


The Random Code Surfer Behavior

The original PageRank algorithm was used in the Web context and models the web surfer behavior of following links from page to page. Generalized CodeRank can be imagined as modeling a programmer's behavior of surfing source code. For better understanding we can assume the following scenario. A new developer is hired by a company to work on a software system implemented in Java which already has a big code base of multiple tightly coupled projects. Additionally, third-party libraries are used, such as JUnit and Apache Commons. Before the new employee starts coding, he needs to understand how the already written code works and how it is organized. So he starts from a main method and surfs the code to understand it. While doing this, he reads code entities like methods, fields, local variables, classes and packages and follows the relations between them. For example, while he reads a method he may follow the call relation to another method and so on. From time to time, or when he reaches a dangling entity (sink entity), which has no outbound relations, he may jump to a random point in the source code, i.e. a random entity. By following this model we can interpret CodeRank as the probability that a programmer will encounter an entity while surfing source code. Entities encountered more often are more popular, and thus the chance that someone might search for them in a source code search engine is bigger.


Mathematical Model

This section describes the mathematics behind the PageRank concept as it can be applied to any directed graph, no matter if the nodes are web pages, code entities or any other concept and no matter if the edges are links or code entity relations. However, throughout this section we will use the term CodeRank instead of PageRank, for consistency with the topic of this chapter. The terms CodeRank of an entity and rank of an entity are used interchangeably.


CodeRank Basic Formula

The CodeRank of an entity is a probability, so it has a value between 0 and 1. When an entity has outbound relations to other entities it transfers its rank to each of them as illustrated in Figure 4.1. According to the simplest form of the CodeRank algorithm, the rank of an entity sums up the rank amounts propagated through all its inbound relations, as illustrated by the following formula [63]:

$$R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v} \tag{4.1}$$


$R(u)$ is the CodeRank of an entity $u$, $B_u$ is the set of entities that have outbound relations to entity $u$ and $N_v$ is the number of outbound relations of entity $v$. The formula shows that by dividing the rank of an entity by its number of outbound relations, an equal share of its rank is transferred to the target entity of each outbound relation, as happens in Figure 4.1.
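One application of the basic formula can be traced on a tiny graph. The following Python sketch uses a hypothetical four-relation graph (not the one from Figure 4.1) and pushes each entity's rank along its outbound relations, which produces the same sums as collecting over the inbound set $B_u$; the function and variable names are chosen for this example.

```python
def propagate(ranks, out_relations):
    """Apply the basic formula once: each entity v sends R(v) / N_v to each target."""
    new_ranks = {entity: 0.0 for entity in ranks}
    for v, targets in out_relations.items():
        if not targets:
            continue  # dangling entities are not handled by the basic formula
        share = ranks[v] / len(targets)
        for u in targets:
            new_ranks[u] += share
    return new_ranks


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
out_relations = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

new_ranks = propagate(ranks, out_relations)
```

Entity a splits its rank in two equal shares of 1/6, so after one step b holds 1/6, c holds 1/6 + 1/3 = 1/2 and a holds the full 1/3 received from c.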



Figure 4.1: CodeRank Example

According to the code surfer model, a programmer might get bored of following relations between entities and suddenly jump to a random entity, an action known in the literature as teleportation [55]. A damping factor d, which represents the probability of following relations without teleporting, is introduced into the previous formula to model this behavior:

$$R(u) = d \sum_{v \in B_u} \frac{R(v)}{N_v} + (1 - d)\,\frac{1}{n} \tag{4.2}$$


In this formula the value n represents the total number of the entities.
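A short check shows why the damped formula keeps the ranks a probability distribution. The derivation below assumes that there are no dangling entities and that the current ranks sum to 1; the notation $R'(u)$ for the updated rank is introduced only for this example. Each entity $v$ appears in the inbound sets of exactly $N_v$ relation targets, which justifies the second step.

```latex
\begin{align*}
\sum_{u} R'(u)
  &= d \sum_{u} \sum_{v \in B_u} \frac{R(v)}{N_v} + \sum_{u} \frac{1 - d}{n} \\
  &= d \sum_{v} N_v \cdot \frac{R(v)}{N_v} + (1 - d) \\
  &= d \sum_{v} R(v) + (1 - d) = d + (1 - d) = 1
\end{align*}
```

When dangling entities exist, the first term loses the rank they hold, which is why they need separate treatment.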


CodeRank Matrix Representation

The CodeRanks of all entities can be grouped together in a vector r. If M is a transition matrix that models the code surfer's behavior of moving from one entity to another, the following formula holds:

\[ r = M \cdot r \tag{4.3} \]

The CodeRanks vector r is the dominant eigenvector of the transition matrix M. Solving this equation for r directly is not practical because of the size of matrix M, but r can be approximated with the formula r = M^j · r_0, where r_0 is an initial CodeRanks vector. The values of the initial ranks are not important, because for a large enough j an approximate value of r is obtained. r_0 is typically set to a uniform distribution, where each rank is 1/n, n being the size of the vector. The exact r would be obtained as j tends to infinity:

\[ r = \lim_{j \to \infty} M^j \cdot r_0 \tag{4.4} \]


The transition matrix can be decomposed as follows:

\[ M = dP + (1 - d)Q = d(A + D) + (1 - d)Q \tag{4.5} \]

d is the damping factor, A is the adjacency matrix which models the relations graph, D models transitions from dangling entities and Q models teleportation, i.e. a random transition to any entity. An element a_{i,j} of matrix A is 0 if there is no relation from entity j to entity i. Otherwise, a_{i,j} represents the probability that the random surfer will go from entity j to entity i. Each column of A that corresponds to a non-dangling entity sums to 1. The columns of dangling entities contain only zeros and are completed by D, so that P = A + D is a stochastic matrix.

From a dangling entity there are no outbound relations, so in order to model the random code surfer's behavior, we state that there is an equal probability of a transition to any other entity. This behavior is modeled by transition matrix D. If an entity j is dangling (sink), then all elements of column j of matrix D are 1/n. Otherwise (j is not dangling), all elements of the column are 0. The first equation below describes a way to decompose D, where e is a vector having all its elements 1 and s^T is the transposed vector of sink entities, i.e. element j of s is 1 if entity j is dangling and 0 otherwise (see (4.9)) [58].

\[ D = \frac{e \cdot s^T}{n} \iff D \cdot r = \frac{e \cdot (s^T \cdot r)}{n} \tag{4.6} \]

Computing D · r for the CodeRank equation (4.3) is essentially reduced to calculating the inner product s^T · r, as can be seen in the equations above. Calculating this inner product, referred to from now on as the dangling entities inner product (DEIP), is equivalent to summing the CodeRanks of all dangling entities. By replacing D from (4.6) in (4.5) and M from (4.5) in (4.3), the basic CodeRank formula (4.2) can be rewritten:

\[
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
= d
\begin{pmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1}
\end{pmatrix}
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
+ d \, \frac{1}{n}
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
\begin{pmatrix} s_0 & s_1 & \cdots & s_{n-1} \end{pmatrix}
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
+ (1 - d)
\begin{pmatrix} \frac{1}{n} \\ \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix}
\tag{4.7}
\]

\[
a_{i,j} =
\begin{cases}
0 & \text{if there is no relation from entity } j \text{ to entity } i \\
\frac{1}{N_j} & \text{if there is a relation from entity } j \text{ to entity } i
\end{cases}
\tag{4.8}
\]

\[
s_j =
\begin{cases}
0 & \text{if } j \text{ is not a dangling entity} \\
1 & \text{if } j \text{ is a dangling entity}
\end{cases}
\tag{4.9}
\]
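The matrix form above can be evaluated without ever materializing M: one pass over the outbound relations yields both the contributions A · r and the DEIP scalar s^T · r. The following is a minimal in-memory sketch of that idea, not the thesis implementation. The class and method names are hypothetical, the graph is a plain array of target lists, and the damping factor 0.85 used in main is an assumed, commonly used value.

```java
import java.util.Arrays;

/** In-memory sketch of the CodeRank power iteration; all names are illustrative. */
final class CodeRankPowerIteration {

    /**
     * Performs one iteration r' = d * (A*r + e*(s^T*r)/n) + (1 - d) * e/n.
     * targets[v] lists the entities that entity v has outbound relations to;
     * an empty array marks v as a dangling entity.
     */
    static double[] iterate(int[][] targets, double[] rank, double d) {
        int n = rank.length;
        double[] contributions = new double[n];
        double deip = 0.0; // dangling entities inner product, s^T * r
        for (int v = 0; v < n; v++) {
            if (targets[v].length == 0) {
                deip += rank[v];
                continue;
            }
            double share = rank[v] / targets[v].length; // R(v) / N_v
            for (int u : targets[v]) {
                contributions[u] += share;
            }
        }
        double[] next = new double[n];
        for (int u = 0; u < n; u++) {
            next[u] = d * (contributions[u] + deip / n) + (1.0 - d) / n;
        }
        return next;
    }

    /** Euclidean distance between two rank vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Starts from the uniform vector and stops at the tolerance or the iteration limit. */
    static double[] compute(int[][] targets, double d, double epsilon, int maxIterations) {
        int n = targets.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int j = 0; j < maxIterations; j++) {
            double[] next = iterate(targets, rank, d);
            double delta = distance(next, rank);
            rank = next;
            if (delta < epsilon) {
                break;
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Entity 0 -> 1 and 2; entity 1 -> 2; entity 2 is dangling.
        int[][] targets = { {1, 2}, {2}, {} };
        System.out.println(Arrays.toString(compute(targets, 0.85, 1e-10, 100)));
    }
}
```

Because the dangling mass is redistributed uniformly, the ranks keep summing to 1 after every iteration, which is the same sanity check the text later lists among the metrics.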



Computing Generalized CodeRank with MapReduce

As shown in the previous section, computing CodeRanks vector is performed by repeatedly multiplying the transition matrix with the current CodeRanks vector. This section will show how to accomplish this by using Hadoop MapReduce.


Storing Data in HBase

Subsection 3.5.3 of Chapter 3 described how the entities_hash HBase table stores the data required by the Generalized CodeRank algorithm in the relations column family. The row key is an entity ID and the codeRank column stores the current rank of that entity. Relations whose source entity has this ID can be retrieved from the same row. Besides the current CodeRank of an entity, the algorithm also needs the number of outbound relations of the entity, stored in the targetEntitiesCount column, and the target entities of those relations, stored in the targetEntities column along with the kind of each relation.


Hadoop Jobs

There are two mandatory MapReduce jobs that need to be performed for one iteration of the algorithm:
• DEIP Job: calculates the DEIP scalar value
• CodeRank Job: calculates the CodeRank of each entity; takes the DEIP scalar value as an input parameter

These jobs are repeated until a maximum number of iterations is reached or an error tolerance ε is achieved.

The Map tasks of CodeRank Job take as input all rows from the relations column family of the entities_hash table. The input key is the table row key, which represents the source entity ID of a relation, and the input value consists of the columns codeRank, targetEntitiesCount and targetEntities. Each Map task outputs one key for each target entity ID received in the input value [58]. Every output value is the CodeRank divided by the number of outbound relations, corresponding to the sum terms R(v)/N_v from (4.2). In other words, each Map task computes the contribution of a source entity to the rank of each target entity of its outbound relations.

Each Reduce task sums up the contributions that an entity receives from the source entities of its inbound relations. The input key is the entity ID and each input value is a contribution received. After calculating the sum of contributions, referred to here as a, the Reduce task calculates the CodeRank with the formula below and outputs it as the value, along with the entity ID as the key [58].

\[ R(u) = d \left( a + \frac{b}{n} \right) + (1 - d) \frac{1}{n} \tag{4.10} \]

The above formula is obtained from (4.7) by replacing the contributions sum mentioned above (which corresponds to the product between the adjacency matrix and the CodeRanks vector) with a and the DEIP scalar with b. Each Reduce task writes the CodeRank calculated for the entity received as input into the codeRank column of the entities_hash table.

Calculating DEIP is equivalent to summing the CodeRanks of all dangling entities. To do this with MapReduce, each Map task reads a dangling entity from the relations column family of the entities_hash table and outputs the input key with its CodeRank as the value. Each Reduce task sums the CodeRanks received as input values for the input key. The output key is the same as the input key and the output value is the calculated sum.

There are two different Hadoop jobs capable of calculating DEIP. One of them, named Metrics Job, scans the whole entities_hash table; each Map task has to check whether the entity received as input is dangling and outputs its rank only if so. This is highly inefficient for DEIP alone, because the Map tasks need to read all entities and, according to our statistics, only about 1% of the entities are dangling. DEIP Job has an efficient Map implementation which only reads dangling entities. To make this possible, dangling entities are stored redundantly, along with their CodeRanks, at the end of the entities_hash table, as described in Section 3.7. This table range is called the Dangling Entities Cache. Prefixing dangling entity IDs with 16 bytes having the maximum value FF (hexadecimal) ensures that they are placed at the end of the table. By scanning all rows that start with 16 FF-valued bytes, all dangling entities are retrieved.

Metrics Job can be used to calculate useful metrics:
• Euclidean distance between the current CodeRank vector and the one from the previous iteration
• sum of all CodeRanks (should be approximately 1 if the computation was correct)
• DEIP

The Generalized CodeRank algorithm continues to run until a maximum number of iterations is reached. An optional stopping condition can be set so that the algorithm stops when a tolerance is reached, i.e. when the Euclidean distance metric falls below a threshold ε. When this condition is set, DEIP Job is replaced by Metrics Job, which calculates both the DEIP scalar and the Euclidean distance metric. If desired, the sum can also be calculated. Metrics Job is inefficient for DEIP calculation, but if metrics are desired this compromise must be made, because the Euclidean distance and the sum require the CodeRanks of all entities.
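The data flow of one iteration can be sketched in plain Java, with ordinary collections standing in for the HBase table and the Hadoop shuffle. This is a simulation under stated assumptions, not the thesis code: the Row class, the method names and the in-memory shuffle map are hypothetical. Emitting a zero contribution for each source entity, so that entities without inbound relations still reach a reducer, is also an assumption of this sketch, since the text does not say how that case is handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of one CodeRank iteration; no Hadoop or HBase involved. */
final class CodeRankJobSketch {

    /** Hypothetical stand-in for one row of the relations column family. */
    static final class Row {
        final String entityId;
        final double codeRank;
        final List<String> targetEntities;

        Row(String entityId, double codeRank, List<String> targetEntities) {
            this.entityId = entityId;
            this.codeRank = codeRank;
            this.targetEntities = targetEntities;
        }
    }

    /** Map phase: emit (targetId, R(v)/N_v) for every outbound relation of the row. */
    static void map(Row row, Map<String, List<Double>> shuffle) {
        // Assumption of this sketch: a zero contribution keeps the entity itself
        // visible to the reduce phase even if nothing points to it.
        shuffle.computeIfAbsent(row.entityId, k -> new ArrayList<>()).add(0.0);
        int count = row.targetEntities.size();
        for (String target : row.targetEntities) {
            shuffle.computeIfAbsent(target, k -> new ArrayList<>()).add(row.codeRank / count);
        }
    }

    /** Reduce phase: R(u) = d * (a + b/n) + (1 - d)/n, where a is the contributions sum. */
    static double reduce(List<Double> contributions, double deip, double d, long n) {
        double a = 0.0;
        for (double c : contributions) {
            a += c;
        }
        return d * (a + deip / n) + (1.0 - d) / n;
    }

    /** DEIP: the sum of the CodeRanks of all dangling entities. */
    static double deip(List<Row> rows) {
        double b = 0.0;
        for (Row row : rows) {
            if (row.targetEntities.isEmpty()) {
                b += row.codeRank;
            }
        }
        return b;
    }

    /** Runs the DEIP step followed by the CodeRank map and reduce phases. */
    static Map<String, Double> iterate(List<Row> rows, double d) {
        double b = deip(rows);
        Map<String, List<Double>> shuffle = new HashMap<>();
        for (Row row : rows) {
            map(row, shuffle);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, List<Double>> entry : shuffle.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue(), b, d, rows.size()));
        }
        return result;
    }
}
```

In this sketch the DEIP step is a full scan, which mirrors the Metrics Job approach. The Dangling Entities Cache described above exists precisely to avoid that scan in the real system.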



This section presents our experiments with the Generalized CodeRank algorithm by describing our data input, the infrastructural setup and the results accompanied by statistical remarks.



A sample repository of about 333 MiB was used to populate the database. We did not test with a bigger repository because the extractor has not yet been ported to run on the cluster. Populating our HBase database involves a lot of overhead because at the moment we import data from the old MySQL database, which takes a long time. After running the extractor, the HBase database contained 111 projects, 29,236 files, 611,262 entities and 1,936,757 relations.

Hadoop and HBase were deployed on a 9-node cluster reserved from the Tembusu Cluster of the National University of Singapore, School of Computing. Each node is a Dell PC with 2 x quad-core Xeon E5620 2.4 GHz CPUs and 24 GiB of RAM. It runs the CentOS GNU/Linux operating system, installed on a 500 GiB hard disk. Two additional hard disks of 1 TiB each, connected in a RAID array, are used to store HDFS data. On one node we run the HBase master, the Hadoop JobTracker and the HDFS NameNode. On each of the others we placed an HBase RegionServer, a Hadoop TaskTracker and an HDFS DataNode. Three of these nodes also hosted a ZooKeeper cluster.



In a first experiment, 29 iterations of the Generalized CodeRank algorithm were run, and for each one the Euclidean distance between the current CodeRank vector and the one from the previous iteration was calculated. Figure 4.2 shows how this Euclidean distance varies with each iteration. The desired tolerance of 10^{-5} is reached after 15 iterations, which is enough for good CodeRank results.



Figure 4.2: Variation of Euclidean distance with the growth of iteration illustrating the convergence of Generalized CodeRank algorithm

When metrics like the Euclidean distance are calculated, Metrics Job is executed instead of DEIP Job. In order to test with DEIP Job as well, another experiment was run with 15 iterations. The results shown in the next subsections are obtained from this second experiment.


Probability Distribution

The mathematical model of CodeRank states that CodeRanks vector is a probability distribution. This distribution is plotted in Figure 4.3 for the first 40 largest CodeRank values. The ranks are ordered in decreasing order and indexed from 0 to 38. x axis has index values and y axis has rank values.

Figure 4.3: Probability distribution represented by CodeRanks vector

It is known from previous studies that PageRank follows a power law distribution with an exponent of approximately 1 [71]. Figure 4.4 plots the CodeRanks distribution obtained from the second experiment together with a power law distribution f(x) modeled by the following equation:

\[ f(x) = \frac{\max(r)}{x + 1} \tag{4.11} \]



Figure 4.4: log-log plot for CodeRanks distribution and a power law distribution

r is the CodeRanks vector and max(r) is the biggest CodeRank, i.e. the one that has index x = 0 in Figure 4.3. Both axes of Figure 4.4 are on a logarithmic scale, so CodeRank points have coordinates (log x, log R(x)) and the power law's points have coordinates (log x, log f(x)).

\[ \log f(x) = \log \frac{\max r}{x + 1} \iff \log f(x) = \log \max r - \log (x + 1) \iff g(\log (x + 1)) = k - \log (x + 1) \iff g(x) = k - x \]

In the above equation log f(x) = g(log (x + 1)) and k = log max r is a constant. Equation g(x) = k − x, which describes the power law from Figure 4.4, is linear, so its graph is a straight line. The CodeRank points in the figure are very close to the power law line, indicating that CodeRanks follow a power law distribution like PageRanks on the Web.


Entities CodeRank Top

Table 4.1 lists the 10 entities with the largest CodeRank, expressed as percentages. An extended version of this table, which shows 100 entities instead of 10, can be found in Appendix B. Their importance in the repository is very high: these 10 entities, out of a total of 611,262, account for 21.42% of the total amount of CodeRank, as illustrated on the right side of Figure 4.5.

Table 4.1: Top 10 Entities CodeRank

FQN java.lang void java.lang.Object int java.lang.String boolean java.lang.CharSequence java.lang.Comparable<java.lang.String>

# 1 2 3 4 5 6 7 8 9 10

Entity Type package primitive class primitive class primitive package interface interface parametrized type

CodeRank 4.72% 4.24% 4.06% 2.29% 2.20% 1.12% 1.03% 1.00% 0.39% 0.37%



Figure 4.5: Left: Top 10 Entities CodeRank chart; Right: Distribution of Top 10 Entities CodeRanks within the whole set of entities

It can be observed that the 10 most popular entities either come from the Java Standard Library or are primitives that are ubiquitous in Java applications. The reason is that virtually any Java program needs them in order to work. java.lang, which takes first place, is imported implicitly by every Java source file. Any program should have at least a main method, which has return type void, the primitive that occupies second place. The base of all classes is java.lang.Object, in third place: a Java program must have at least one class, which by default inherits from Object. We can conclude that the correctness of the results is supported both by the relevance of the CodeRank top and by the shared statistical model (i.e. CodeRank and PageRank follow the same power law distribution).


Performance Results

Table 4.2 contains the time required by Generalized CodeRank to run the experiments, as well as the time for each job involved in the process.

Table 4.2: Experiments and jobs running time
Experiment / Job | Time
Experiment 1 (with metrics), 29 iterations | 4485 s (74 min 45 s)
Experiment 2 (without metrics), 15 iterations | 1882 s (31 min 22 s)
Experiment 3 (with metrics), 15 iterations | 2255 s (37 min 35 s)
DEIP Job (for Experiment 2) | 41 s
Metrics Job (for Experiment 1 and 3) | 69 s (1 min 9 s)
CodeRank Job | 84 s (1 min 24 s)

As explained in this chapter, DEIP can be calculated with either a DEIP Job or a Metrics Job. The results in the table show that, by using the Dangling Entities Cache, DEIP Job achieves a performance boost of 68.29% over Metrics Job (41 s versus 69 s). The computation of Generalized CodeRank as a whole benefits from a performance boost of 19.82% (Experiment 2 versus Experiment 3, both with 15 iterations).
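As a quick check, the two percentages can be reproduced from Table 4.2 if the boost is read as the extra time of the slower variant relative to the faster one. This reading is an interpretation that happens to match the reported figures, not a definition given in the text.

```latex
\begin{align*}
\frac{t_{\text{Metrics Job}} - t_{\text{DEIP Job}}}{t_{\text{DEIP Job}}}
  &= \frac{69 - 41}{41} = \frac{28}{41} \approx 68.29\% \\
\frac{t_{\text{Experiment 3}} - t_{\text{Experiment 2}}}{t_{\text{Experiment 2}}}
  &= \frac{2255 - 1882}{1882} = \frac{373}{1882} \approx 19.82\%
\end{align*}
```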

Chapter 5: Implementation

The original Sourcerer code search infrastructure was developed in Java at the University of California, Irvine. This master thesis describes the work on a fork of Sourcerer, named Distributed Sourcerer, which aims at scaling Sourcerer up to Internet scale. The original database implementation, which relied on MySQL, has been rewritten from scratch in order to work with HBase. The implementation details and the new API to the database are described in Section 5.1. A higher level interface to the new database, in the form of a set of command-line interface (CLI) tools, is described in Section 5.3. The Generalized CodeRank algorithm, described in Chapter 4, has been implemented over Hadoop MapReduce as described in Section 5.2. A CLI user interface which facilitates CodeRank calculation is described in Section 5.6.


Database Implementation

The new distributed database implementation is called SourcererDDB and is located in the distributed-database Eclipse project, at path “infrastructure/tools/java/distributed-database” in the Distributed Sourcerer repository [11]. Its implementation can be divided into four parts, described in the next subsections:
• Subsection 5.1.1: classes used to model data, HBase tables and model types
• Subsection 5.1.2: classes that provide the programming interface to retrieve data from HBase tables
• Subsection 5.1.3: classes that provide the programming interface to insert data into HBase tables
• Subsection 5.1.4: Hadoop MapReduce jobs which duplicate data into additional tables for efficient retrieval


Data Modeling

The data modeling part of the implementation consists of three Java packages, described as follows:
1. edu.nus.soc.sourcerer.ddb.tables
• Eclipse project: distributed-database
• Path: infrastructure/tools/java/distributed-database/src/edu/nus/soc/sourcerer/ddb/tables/
• Description: classes from this package contain information about HBase tables and provide access to them.
2. edu.uci.ics.sourcerer.model
• Eclipse project: model
• Path: infrastructure/tools/java/model/src/edu/uci/ics/sourcerer/model/
• Description: this package contains classes that abstract the model types described in Appendix A for projects, files, entities and relations.
3. edu.nus.soc.sourcerer.model.ddb
• Eclipse project: distributed-database
• Path: infrastructure/tools/java/distributed-database/src/edu/nus/soc/sourcerer/model/ddb/
• Description: classes used to abstract data exchanged with HBase in an object-oriented way.

Each HBase table has an associated class in package edu.nus.soc.sourcerer.ddb.tables, named <name>HBTable, where <name> is the camel case form of the table name. For example, the relations_inverse HBase table is associated with class RelationsInverseHBTable. All classes associated with tables follow the singleton design pattern and extend the HBTable abstract class. The unique instance of a class is obtained with the getInstance() method. The only abstract method that needs to be overridden is getName(), which returns the associated table's name. The base class HBTable provides the implementation of the getHTable() method, which returns an HTable instance for the associated table; this instance is used to access the table as described in the HBase documentation [38]. When the instance is created for the first time, the setupHTable() method is called, where special configuration code for an HTable instance can be added. Besides the table name, classes associated with tables also contain static final fields which store column family names and column qualifier names. An HTableDescriptor (see HBase documentation [38]) can be obtained by calling the static method getTableDescriptor(). The obtained instance can be used to create new tables, modify table schemas, delete tables etc. These administrative operations are implemented in class DatabaseInitializer. Updating table schemas is not currently implemented and a NotImplementedException is thrown.

Package edu.uci.ics.sourcerer.model, which contains enums that abstract model types, was included in the original Sourcerer implementation. Project types, file types, entity types, relation types and relation classes are abstracted in classes Project, File, Entity, Relation and RelationClass, respectively. I modified those classes so that each type encodes a byte value, which can be obtained with the getValue() method.

Classes from package edu.nus.soc.sourcerer.model.ddb are used to model data exchanged with HBase. All of them implement the Model interface and have names ending in Model. Some of them implement the interface indirectly by extending class ModelWithID. Method computeId from this class returns an MD5 hash of the class fields passed as parameters. The data from those fields is obtained through Java reflection and the returned value can be used to set an id field for a model. For most models that extend ModelWithID this is done in the constructor.
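The conventions described above (camel case class names, singleton table classes, MD5-based IDs) can be sketched without any HBase dependency. The classes below are hypothetical stand-ins, not the thesis classes. In particular, the ID helper hashes the string form of the values it is given, which is a simplification of the reflection-based computeId, and the zero-byte separator is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of the table-class conventions; it does not depend on HBase. */
abstract class TableSketch {
    /** The only abstract method: returns the associated table's name. */
    abstract String getName();

    /** Derives the class name for a table, e.g. relations_inverse becomes RelationsInverseHBTable. */
    static String classNameFor(String tableName) {
        StringBuilder sb = new StringBuilder();
        for (String part : tableName.split("_")) {
            if (part.isEmpty()) {
                continue;
            }
            sb.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
        }
        return sb.append("HBTable").toString();
    }
}

/** Singleton table class in the style described above. */
final class RelationsInverseTableSketch extends TableSketch {
    static final String NAME = "relations_inverse";
    private static RelationsInverseTableSketch instance;

    private RelationsInverseTableSketch() { }

    /** Lazily creates and returns the unique instance. */
    static synchronized RelationsInverseTableSketch getInstance() {
        if (instance == null) {
            instance = new RelationsInverseTableSketch();
        }
        return instance;
    }

    @Override
    String getName() {
        return NAME;
    }
}

/** Hypothetical ID helper: MD5 over the string forms of the given field values. */
final class ModelIdSketch {
    static byte[] computeId(Object... fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (Object field : fields) {
                md5.update(String.valueOf(field).getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator, so ("ab", "c") differs from ("a", "bc")
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

An MD5 digest is always 16 bytes long, which is consistent with the 16-byte prefix used for the Dangling Entities Cache row keys in Chapter 4.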




Database Retrieval Queries API

Package edu.nus.soc.sourcerer.ddb.queries from the Eclipse project distributed-database provides specialized classes to retrieve and add data about projects, files, entities and relations. Listing 5.1 presents an example of retrieving relations by several criteria.
    try {
        /* Instantiate the object used to retrieve relations from HBase. */
        RelationsRetriever rr = new RelationsRetriever();

        /* Results are going to be printed as they are retrieved from HBase. */
        ModelAppender<Model> appender = new PrintModelAppender<Model>();

        /* Retrieve relations call. */
        rr.retrieveRelations(appender, sourceID, kind, targetID,
                projectID, fileID, fileType);
    } catch (HBaseConnectionException e) {
        LOG.fatal("Could not connect to HBase database: " + e.getMessage());
    } catch (HBaseException e) {
        LOG.fatal("An HBase error occurred: " + e.getMessage());
    }

Listing 5.1: Retrieving relations example

In order to retrieve logical entries from HBase, referred to from now on as models, the following API steps must be followed. Each step is exemplified in Listing 5.1. The models are implemented in package edu.nus.soc.sourcerer.model.ddb.

1. A retrieval class is used to search HBase tables.
• In the example from Listing 5.1, the RelationsRetriever class is used.

2. A retrieval method of that class is called. The first parameter is always a ModelAppender object. Subsequent parameters represent the search criteria. If one of these parameters is null, the search is not restricted by that criterion. Depending on which search criteria are not null, the method determines which HBase table to look up in order to optimize the query.
• In the example, method retrieveRelations is called. After the appender, the following search criteria are passed, in this order: source entity ID, relation kind (a byte representing relation type and relation class), target entity ID, project ID, file ID and file type. Depending on which of these criteria are null, the method looks in the relations_direct, relations_inverse or files table.

3. Inside the retrieval method, the HBase client API is used to retrieve table rows as results. In some tables each row is mapped to exactly one model, but in others a row is mapped to multiple models. A result-to-model(s) method is used to convert a table row result to a model or a set of models.
• In the example, the HBase client API retrieves table rows as results from the relations_direct, relations_inverse or files table, depending on which criteria are not null. In the relations_hash table each row is mapped to exactly one relation entry, but in the relations_direct or relations_inverse tables a row may contain several relation entries. The relationsInverseResultToRelationsGroupedModels method converts a relations_inverse table row result to a set of RelationsGroupedModels.

4. For each model retrieved, method add of a ModelAppender object is called. ModelAppender is an interface which facilitates a visitor design pattern. Depending on the processing desired for each model retrieved, a special implementation of this interface can be written.
• In the provided example, PrintModelAppender prints the model passed to the add method. ListModelAppender, another ModelAppender implementation, creates a list of the models passed to add, which can be returned afterwards by calling getList().

Retrieval classes, retrieval methods, result-to-model(s) methods and ModelAppender implementations follow some naming conventions:
• Retrieval classes: <Entries>Retriever, where <Entries> can be Projects, Files, Entities or Relations. So, the following retrieval classes exist: ProjectsRetriever, FilesRetriever, EntitiesRetriever and RelationsRetriever.
• Retrieval methods: retrieve<Models>[From<Table>][WithGet]. The parts in square brackets are optional. <Models> is the model class name which is passed to the add method of the ModelAppender object. <Table> is the camel case name of the table (for example: RelationsDirect for the relations_direct table). If [From<Table>] is included in the name of the method, the retrieval is performed from that particular table. If [WithGet] is included in the name, the retrieval is performed by using an HBase get operation instead of a scan operation. Examples: retrieveProjects, retrieveFilesWithGet, retrieveRelationsFromFilesTableWithGet.
• Result-to-model(s) methods: <table>ResultTo<Model>[s]. <table> is the camel case name of the table with the first letter in lower case. <Model> is the model class name. The optional [s] represents a plural for the model. If it appears, a List of model types is retrieved, rather than a single model type. Examples: resultToFileModel, entitiesHashResultToEntityModel, filesResultToEntitiesGroupedModels.
• ModelAppender: <Name>ModelAppender.
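The appender mechanism from the steps above can be illustrated with a self-contained sketch. The interfaces and classes below are simplified, hypothetical stand-ins with "Sketch" in their names. Only the ideas they demonstrate come from the text: an add method called once per retrieved model, a printing implementation, a collecting implementation with getList(), and null meaning that a criterion is not applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the appender interface. */
interface AppenderSketch<T> {
    /** Called once for every model produced by a retrieval method. */
    void add(T model);
}

/** Prints each model as soon as it is retrieved. */
final class PrintAppenderSketch<T> implements AppenderSketch<T> {
    @Override
    public void add(T model) {
        System.out.println(model);
    }
}

/** Collects models so that they can be returned afterwards with getList(). */
final class ListAppenderSketch<T> implements AppenderSketch<T> {
    private final List<T> models = new ArrayList<>();

    @Override
    public void add(T model) {
        models.add(model);
    }

    List<T> getList() {
        return models;
    }
}

/** Hypothetical retriever that pushes every matching row into the supplied appender. */
final class RetrieverSketch {
    static void retrieve(List<String> rows, String requiredPrefix, AppenderSketch<String> appender) {
        for (String row : rows) {
            // A null criterion means that the search is not restricted by it.
            if (requiredPrefix == null || row.startsWith(requiredPrefix)) {
                appender.add(row);
            }
        }
    }
}
```

Passing the appender in, rather than returning a collection, lets the caller decide whether results are streamed (printed as they arrive) or accumulated in memory.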


Database Insertion Queries API

The same package edu.nus.soc.sourcerer.ddb.queries contains the classes for inserting code data into HBase tables. Insertion classes, which implement ModelInserter interface, are used to add data contained in a collection of models into the database, as shown in Listing 5.2. Interface ModelInserter is parametrized by the model class. Currently, there are four implementations of this interface, each one for inserting projects, files, entities and relations, respectively. The example from Listing 5.2 shows how two relations, having their data stored in models relationModelA and relationModelB, are inserted into the database by using RelationModelInserter class, which implements ModelInserter<RelationModel> interface.
    try {
        /* Create a list of relation models. */
        Collection<RelationModel> relationModels =
                new Vector<RelationModel>(2);
        relationModels.add(relationModelA);
        relationModels.add(relationModelB);

        /* Insert the models from the list into HBase tables. */
        ModelInserter<RelationModel> modelInserter =
                new RelationModelInserter(2);
        modelInserter.insertModels(relationModels);
    } catch (HBaseException e) {
        LOG.fatal("An HBase error occurred: " + e.getMessage());
    }

Listing 5.2: Inserting relations example

All ModelInserter implementations should have names of the form <ModelClass>Inserter. ProjectModelInserter adds data to the projects table, FileModelInserter to the files table, EntityModelInserter to the entities_hash table and RelationModelInserter to the relations_hash table. Currently, class MySQLImporter uses all of these insertion classes to import code data from the old MySQL database (SourcererDB) into the new database (SourcererDDB), based on HBase. The next section shows how the newly imported data is indexed into more tables for efficient retrieval.


Indexing Data from Database

The insertion classes described in the previous subsection only populate one table for each of the metamodels: projects, files, entities and relations. These tables are essentially hash tables for efficient retrieval by MD5 hash ID. In order to optimize retrieval by other search criteria, further tables need to be populated redundantly, as explained in Chapter 3. To achieve this, some Hadoop MapReduce jobs need to be run. By running those jobs, the duplication and denormalization principles are satisfied for the data [38]. For projects and files, two HBase tables are enough. For entities and relations, however, storing data redundantly in additional tables is required.

All Hadoop MapReduce classes are currently organized in the distributed-database Eclipse project. Classes which implement MapReduce jobs are placed in one package, Map classes in a second one, and Reduce classes in edu.nus.soc.sourcerer.ddb.mapreduce.reduce.

In order to index entities, one MapReduce job needs to be run. Its class name, together with the corresponding Map and Reduce class implementations, is as follows:
• EntitiesIndexerJob: indexes entities by duplicating their data from the entities_hash table to the entities table and to the entities column family of the files table.
– Map task class: EntitiesMapper
– Reduce task class: EntitiesReducer

In order to index relations, two MapReduce jobs need to be run. Their class names, together with the corresponding Map and Reduce class implementations, are as follows:
• RelationsIndexerJob: indexes relations by duplicating their data from the relations_hash table to the relations_direct table, the relations_inverse table and the relations column family of the files table.
– Map task class: RelationsMapper
– Reduce task class: RelationsReducer
• CRRelationsIndexerJob: indexes relations for efficient retrieval during CodeRank calculation. Data from the relations_hash table is redundantly stored in the relations column family of the entities_hash table, as explained in Chapter 4.
– Map task class: RelationsSourceMapper
– Reduce task class: RelationsSourceReducer



CodeRank Implementation

Chapter 4 explained in detail which MapReduce jobs are required to calculate CodeRank and how these jobs need to be combined and repeated in order to obtain a final result. Consequently, most of Subsection 5.2.1 specifies which classes are used for each job and which Map and Reduce task classes they use. Subsection 5.2.2 describes some additional utility jobs that have been implemented. The Generalized CodeRank source code is located in the distributed-database Eclipse project. The same package structure for MapReduce jobs, Map tasks and Reduce tasks is used as specified in Subsection 5.1.4.


CodeRank and Metrics Calculation Jobs

If the database has just been populated and indexed, the values from the relations column family of the entities_hash table need to be initialized. An initialization job performs the following operations:
• The initial CodeRank of all entities is set to 1/n, where n is the total number of entities.
• The Dangling Entities Cache is created with its initial CodeRank values (1/n, as stated above).
• For entities with no outbound relations, the target entities count column must store a value of 0, so that they can be identified as dangling entities in a MetricsJob.

The CodeRank and metrics calculation job class names, together with their corresponding Map and Reduce class implementations, are as follows:
• CRInitJob: the initialization job described above.
– Map task class: CRInitMapper
– Reduce task class: not available
• CRJob: the CodeRank Job described in Subsection 4.3.2.
– Map task class: CRMapper
– Combine task class: CRCombiner
– Reduce task class: CRReducer
• DEIPJob: the DEIP Job described in Subsection 4.3.2.
– Map task class: DEIPMapper
– Combine task class: DEIPReducer
– Reduce task class: DEIPReducer
• CRMetricsJob: the Metrics Job described in Subsection 4.3.2.
– Map task class: CRMetricsMapper
– Combine task class: CRMetricsCombiner
– Reduce task class: CRMetricsReducer



Utility Jobs

Currently there is only one utility job, which writes the top of entities by CodeRank to an HDFS text file. Entities are sorted in descending order and each one is stored on its own line. There are four tab-separated columns:
• CodeRank (as a fraction, not a percentage)
• entity ID (hex representation of the MD5 hash)
• entity type
• FQN (Fully-Qualified Name)

The job class name, together with the corresponding Map and Reduce class implementations, is as follows:
• CRTop
– Map task class: CRTopMapper
– Reduce task class: not available
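The output format described above can be sketched as follows. This is only an illustration of the line layout (four tab-separated columns, highest CodeRank first) done in memory. The class and field names are hypothetical, and the real job writes to HDFS through MapReduce rather than building a list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the top-entities output format; names are illustrative only. */
final class CodeRankTopSketch {

    /** One entity to be written to the output. */
    static final class Entry {
        final double codeRank;    // fraction in [0, 1], not a percentage
        final String entityIdHex; // hex representation of the MD5 hash
        final String entityType;
        final String fqn;

        Entry(double codeRank, String entityIdHex, String entityType, String fqn) {
            this.codeRank = codeRank;
            this.entityIdHex = entityIdHex;
            this.entityType = entityType;
            this.fqn = fqn;
        }
    }

    /** Returns one tab-separated line per entity, highest CodeRank first. */
    static List<String> format(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> e.codeRank).reversed());
        List<String> lines = new ArrayList<>();
        for (Entry e : sorted) {
            lines.add(e.codeRank + "\t" + e.entityIdHex + "\t" + e.entityType + "\t" + e.fqn);
        }
        return lines;
    }
}
```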


Database Querying Tools

Database querying tools have their source code in the Eclipse project distributed-database and share a single main class. When running the tools at the command line, their names and their arguments are prefixed by two hyphens (--). The following list presents the tools. Parameter --help prints usage information for any tool, including an explanation of its arguments. All tools that work with the HBase database support --hbase-table-prefix, which adds the specified prefix to all table names that are going to be accessed. Argument --properties-file can be used to pass a Java properties file where predefined arguments are stored as key-value pairs, separated by the equals sign (=).
• --retrieve-projects: search projects from the database
– --pt: project type as upper case string
– --pid: project ID as a hex of the MD5 hash
• --retrieve-files: search files from the database
– --pid: project ID as a hex of the MD5 hash
– --ft: file type as upper case string
– --fid: file ID as a hex of the MD5 hash
• --retrieve-entities: search entities from the database
– --eid: entity ID as a hex of the MD5 hash
– --et: entity type as upper case string
– --fqn: fully-qualified name
– --fqn-prefix: fully-qualified name prefix
– --pid: project ID as a hex of the MD5 hash
– --fid: file ID as a hex of the MD5 hash
– --ft: file type as upper case string
• --retrieve-relations: search relations from the database
– --rid: relation ID as a hex of the MD5 hash
– --seid: source entity ID as a hex of the MD5 hash
– --teid: target entity ID as a hex of the MD5 hash
– --rk: relation kind as upper case string, composed of relation type and relation class separated by a double colon
– --fqn: fully-qualified name
– --fqn-prefix: fully-qualified name prefix
– --pid: project ID as a hex of the MD5 hash
– --fid: file ID as a hex of the MD5 hash
– --ft: file type as upper case string
• --retrieve-relations-by-source: retrieve relations by source entity ID from the relations column family of the entities_hash table
– --eid: entity ID as a hex of the MD5 hash
• --retrieve-code-rank: prints the CodeRank of an entity by its ID
– --eid: entity ID as a hex of the MD5 hash


Database Utility Tools

Utility tools share the same main class as the querying tools. The following list presents the tools:

• --initialize-db: initializes the HBase database by creating the tables
  – --empty-existing: if a table already exists, it is emptied (by deleting it and creating it again)
  – --update-existing: if a table already exists, its configuration and column family definitions are updated if necessary. This feature is not currently implemented
• --import-mysql: imports data from an old SourcererDB database, based on MySQL
  – --database-url
  – --database-user
  – --database-password


Database Indexing Tools

Database indexing tools are used to duplicate data in multiple tables for efficient retrieval, as discussed in Section 5.1.4, and have their main classes located in the distributed-database project, in package edu.nus.soc.sourcerer.ddb.mapreduce. Tools that use Hadoop have a different library for parsing command-line arguments. Other database tools rely on the Sourcerer library, but the database indexing and CodeRank tools rely on the Apache Commons library. Each Hadoop tool has its own main class, and each argument has both a short one-letter form prefixed by one hyphen (-) and a long form prefixed by two hyphens (--). The set of arguments common to all these tools is described in Table 5.1.

Table 5.1: Common CLI arguments for Hadoop tools (CodeRank and database indexing tools)

Long arg.             Short arg.  Description
--hbase-table-prefix  -p          Prefix prepended to HBase table names
--debug               -d          Turn on debug

In order to index entities, the tool with the main class EntitiesIndexer must be used. To index relations, the tool with the main class RelationsIndexer must be used.


CodeRank Tools

CodeRank tools are used for CodeRank calculation and have their main classes located in the distributed-database project, in package edu.nus.soc.sourcerer.ddb.mapreduce. Being Hadoop applications, they also use the Apache Commons CLI argument parsing library, as explained in the previous section. The most important tool is the one that calculates CodeRanks for all entities. Its main class is CodeRankCalculator, and its command-line arguments are described in Table 5.2.

Arguments --num-iter and --entities-count are mandatory. If the initialization job needs to be run, as explained in Subsection 5.2.1, the --init argument must be set. To iterate the algorithm until a tolerance is reached, the --tolerance argument must be given a small floating-point value; this also requires setting --metric-euclidian-distance. Setting any argument that starts with --metric- causes the Metric Job to be used instead of the DEIP Job, as discussed in Subsection 4.3.2. This affects performance, but it is the only option when metric calculation or iteration until a tolerance is reached is required.

Another tool, with the main class CodeRankUtil, is used exclusively to calculate metrics or to write to HDFS a text file with the top entities by CodeRank. Its CLI arguments are described in Table 5.3.
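The argument rules stated above can be summarised in a short Python sketch. It is only an illustration of the stated constraints, not the CodeRankCalculator implementation: the function name plan_run, the dictionary representation of parsed arguments and the returned job labels are hypothetical.

```python
from typing import Dict, List


def plan_run(args: Dict[str, object]) -> Dict[str, object]:
    """Check the documented constraints and pick the job type.

    `args` maps long argument names (without the leading "--") to values;
    this representation is an assumption made for the sketch.
    """
    errors: List[str] = []

    # --num-iter and --entities-count are mandatory.
    for required in ("num-iter", "entities-count"):
        if args.get(required) is None:
            errors.append(f"--{required} is mandatory")

    # --tolerance only makes sense when the Euclidean distance is computed.
    if args.get("tolerance") is not None and not args.get("metric-euclidian-distance"):
        errors.append("--tolerance requires --metric-euclidian-distance")

    if errors:
        raise ValueError("; ".join(errors))

    # Any argument starting with --metric- switches from the DEIP Job
    # to the (slower) Metric Job.
    wants_metric = any(k.startswith("metric-") and v for k, v in args.items())
    return {
        "job": "metric" if wants_metric else "deip",
        "run_init": bool(args.get("init")),
    }


if __name__ == "__main__":
    print(plan_run({"num-iter": 10, "entities-count": 1000, "init": True}))
```

Note that --metrics-output does not start with --metric- and therefore does not select the Metric Job on its own in this sketch.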



Table 5.2: Common CLI arguments for CodeRankCalculator tool

Long arg.                     Short arg.  Description
--num-iter                    -n          The number of CodeRank iterations to run
--init                        -i          Initialize database before CodeRank calculation. Required if tables have just been populated
--entities-count              -c          The number of entities
--teleportation-probab        -r          Probability of jumping from one entity to another random one. Defaults to 0.15
--metric-euclidian-distance   -e          Calculate Euclidean distance between the current CodeRank vector and the previous one. Use the --tolerance / -t argument to set a distance at which computation should stop
--tolerance                   -t          Euclidean distance between the current iteration and the previous one which stops computation if it is reached. Requires setting the --metric-euclidian-dist / -e argument
--metric-coderanks-sum        -s          Calculate the sum of CodeRanks for all entities. Should be close to 1 if computation was correct
--metrics-output              -o          Output directory in HDFS where metrics should be saved (one file for each iteration). This argument is ignored if no metric calculation is requested. One of the arguments --metric-euclidian-dist / -e or --metric-coderanks-sum / -s should be set. Output file(s) will contain by default an additional metric, “deip” (dangling entities inner product)

Table 5.3: Common CLI arguments for CodeRankUtil tool

Long arg.                Short arg.  Description
--coderank-top           -T          Generate a file with the top of all entities by CodeRank
--metric-euclidian-dist  -e          Calculate Euclidean distance between the current CodeRank vector and the previous one
--metric-coderanks-sum   -s          Calculate the sum of CodeRanks for all entities. Should be close to 1 if computation was correct
--metric-deip            -D          Calculate DEIP (Dangling Entities Inner Product)
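To make the three metrics in the tables above concrete, the following single-machine Python sketch runs a textbook PageRank power iteration and reports them after every step. It is not the MapReduce implementation of this work: the function name, the graph representation, and the reading of DEIP as the total rank held by dangling entities (the inner product of the dangling indicator vector with the rank vector) are assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple


def code_rank(
    links: Dict[str, List[str]],
    num_iter: int,
    teleport: float = 0.15,
    tolerance: Optional[float] = None,
) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Power iteration over an entity graph given as source -> list of targets."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    history: List[Dict[str, float]] = []

    for _ in range(num_iter):
        # Rank mass of entities without outgoing relations (assumed DEIP meaning).
        deip = sum(rank[v] for v in nodes if not links.get(v))
        # Every entity receives the teleportation share plus its share of
        # the dangling mass, which is spread uniformly.
        base = teleport / n + (1.0 - teleport) * deip / n
        new_rank = {v: base for v in nodes}
        for src, targets in links.items():
            if targets:
                share = (1.0 - teleport) * rank[src] / len(targets)
                for target in targets:
                    new_rank[target] += share

        distance = math.sqrt(sum((new_rank[v] - rank[v]) ** 2 for v in nodes))
        history.append(
            {
                "deip": deip,
                "euclidian_distance": distance,
                "coderanks_sum": sum(new_rank.values()),
            }
        )
        rank = new_rank
        if tolerance is not None and distance <= tolerance:
            break  # stop early, as with --tolerance

    return rank, history


if __name__ == "__main__":
    # Hypothetical three-entity graph: A and B both reference Object.
    graph = {"A": ["Object"], "B": ["Object", "A"], "Object": []}
    ranks, metrics = code_rank(graph, num_iter=200, tolerance=1e-6)
    print(ranks)
    print(len(metrics), metrics[-1])
```

In this formulation the sum of all ranks stays at 1 after every iteration, which is why a sum far from 1 signals an incorrect computation.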

Chapter 6

This chapter summarizes the contributions of this work, outlines future research for the “Semantic-based Code Search” project, and looks back by comparing this work with state-of-the-art contributions.



I have chosen a cluster computing technology stack, based on Hadoop and HBase, as the basis for an Internet-scale code search and code analysis platform. Through a rigorous analysis I showed that an SQL database would not scale for our needs because of the high latencies involved. We showed that the trade-offs made by moving to HBase, such as giving up some consistency guarantees, affect our applications only negligibly. The system can now scale linearly simply by adding new commodity hardware machines, and it benefits from the popular Hadoop MapReduce platform, which is widely used in industry and has a large community around it, both of volunteers and of companies with commercial interest.

I have engineered an HBase database schema for the storage layer of the system. It allows basic code queries to be performed and stores the data needed to calculate Generalized CodeRank. I have shown that no single schema meets every application need and illustrated why the chosen schema would not be appropriate for other data access patterns.

I implemented [11] a PageRank variant for ranking code entities which, as far as we know, is unique in considering all entities during calculation and not only subsets of particular types. I supported the validity of the results with both statistical evidence and intuitive facts. Those results show that CodeRank gives relevant results even when all entity types are considered during computation. The algorithm was implemented over Hadoop MapReduce and terminates in reasonable time: about 30 minutes for a Java repository of about 300 MiB. As the state of the art shows [51][55], ranked entities improve code search.


Future Work

The next step in building our code search engine is to parallelize the extractor so that it can run on a cluster. Our idea is to use Hadoop for this purpose by running an extractor instance in each Map task. The files need to be accessible to each Map task. Putting the files directly in HDFS is not a good idea, because this file system performs well for sequential access to big files, whereas source and jar files are small. To address this issue we could write source files to HBase, because they are small enough to fit as values. Additionally, random access to particular files is possible with good performance. Jar files can embed many files, hence they can grow larger, and storing them in HBase can create problems [31]. These files can be stored in HDFS as SequenceFiles, by concatenating multiple jars into one big HDFS file. Thus, sequential access is achieved for optimum MapReduce performance. The only drawback is a bigger overhead for random access to a particular jar file.

The second task that we want to accomplish in the future is scaling up the search server. As discussed in Chapter 2, Sourcerer uses Solr as a search server. Its distributed version, named Distributed Solr [30], is currently limited in comparison with single-machine Solr. We are considering using ElasticSearch [18] instead of Solr; it also uses Lucene [26] and performs better than Solr for realtime access to a large-scale index [67].

Our third plan is linked to a contribution we want to make to the code search field. We are currently investigating a way to improve the results by using code clone detection techniques and clustering.
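The idea of concatenating many small jar files into one large file can be sketched in plain Python. The record layout below (two big-endian 32-bit lengths followed by the name and the payload) is a simplification invented for this illustration; it is not the Hadoop SequenceFile format, and the function names are hypothetical. The sketch shows why a full scan is cheap while access to a single jar needs an offset index or a scan.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple


def pack(files: Dict[str, bytes], out: BinaryIO) -> Dict[str, int]:
    """Append length-prefixed (name, payload) records; return name -> offset."""
    index: Dict[str, int] = {}
    for name in sorted(files):
        key = name.encode("utf-8")
        index[name] = out.tell()
        out.write(struct.pack(">II", len(key), len(files[name])))
        out.write(key)
        out.write(files[name])
    return index


def scan(inp: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Sequential pass over all records from the current position."""
    while True:
        header = inp.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        name = inp.read(key_len).decode("utf-8")
        yield name, inp.read(value_len)


def read_one(inp: BinaryIO, index: Dict[str, int], name: str) -> bytes:
    """Random access to one record; requires the offset index built by pack()."""
    inp.seek(index[name])
    return next(scan(inp))[1]


if __name__ == "__main__":
    # Hypothetical jar contents, used only for this illustration.
    jars = {"a.jar": b"\x00\x01", "b.jar": b"payload of b"}
    container = io.BytesIO()
    offsets = pack(jars, container)
    container.seek(0)
    print([name for name, _ in scan(container)])
    print(read_one(container, offsets, "b.jar"))
```

Without the offset index, reading a particular jar means scanning the container from the start, which corresponds to the random-access overhead mentioned above.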


Related Work

Besides Sourcerer [4], from which our system has been forked, there are several other infrastructures for code search or code analysis. Portfolio [55] is a code search system for the C programming language which focuses on retrieval and visualization of relevant functions and their usages. Similar to my work, it implements PageRank for code, but it only uses functions and their call relation. Besides this, it also proposes a technique called SAN (Spreading Activation Network) to improve ranking. For the purpose of indexing, it uses Lucene, like Sourcerer.

An older version of Sourcerer used to implement CodeRank [51], but this component is currently no longer available. Another search tool that relies on Sourcerer is CodeGenie [50], which uses test cases to search and reuse source code.

Another code search engine which uses test cases is the prototype by Reiss presented in [66]. Besides test cases and standard keyword-based retrieval techniques, it also uses contracts and security constraints. The distinctive characteristic of this engine is its ability to apply program transformations to adapt to user requirements.

Keivanloo et al. proposed another Internet-scale code search infrastructure called SE-CodeSearch [45], based on the semantic web. Instead of relying on a relational model like Sourcerer, this infrastructure uses an ontology to represent facts about source code and inference to acquire new knowledge about missing code entities.

For the purpose of querying source code based on its entities and relations, other mathematical models have been proposed besides relational algebra (used in relational databases) and description logics (used in semantic-web ontologies). It is important to note that the storage solution proposed in this work uses a relational model, although HBase is not a relational database. Query languages using relational algebra have been implemented, like SemmleCode [70] and JGraLab [17]. Similar to Codd’s relational algebra is Tarski’s Binary Relational Calculus [69]. Grok [43], Rscript [46] and JRelCal [65] use this formalism. Other approaches, like CrocoPat [7] and JTransformer [47], use predicate logic.

Appendix A

Model Types
Table A.1: Project Types

• SYSTEM: Project type used for only two core projects. One of them groups primitive types provided by the Java language and the other one unknown entities with unresolved references.
• Projects associated with Java Standard Library JARs, like rt.jar.
• Projects downloaded by the crawler from online repositories.
• All unique JARs aggregated from the CRAWLED projects are also considered a project on their own.
• Used for MAVEN projects [27].




Table A.2: File Types

• Files containing Java source code from any project except SYSTEM.
• Files containing Java byte code from any project except SYSTEM and CRAWLED. Class files are extracted from jar files. The extractor ignores crawled class files which are not packed into a jar.
• Jar files from CRAWLED projects.





Table A.3: Entity Types

• used for an undefined type
• package declaration
• class declaration
• interface declaration
• enum declaration
• annotation declaration
• for instance or static initializer declaration
• field declaration
• enum constant declaration
• constructor declaration
• method declaration
• annotation type element declaration
• formal parameter declaration
• local variable declaration
• only used in primitives SYSTEM project
• array declaration
• type variable declaration
• wildcard declaration
• parametrized type declaration
• an entity created when it is unclear exactly which type was referenced by a relation






Table A.4: Relation Types

• used for an undefined type.
• physical containment. Example: METHOD INSIDE CLASS.
• class inheritance. Example: CLASS EXTENDS CLASS.
• interface implementation or inheritance. Example: CLASS IMPLEMENTS INTERFACE or INTERFACE IMPLEMENTS INTERFACE.
• defines the type of a field. Example: FIELD HOLDS CLASS.
• defines the return type of a method. Example: METHOD RETURNS CLASS.
• a field being read. Example: METHOD READS FIELD.
• a field being written. Example: METHOD WRITES FIELD.
• method invocation. Example: METHOD CALLS METHOD.
• type reference. Example: METHOD USES CLASS.
• constructor invocation for object instantiation. Example: METHOD INSTANTIATES CONSTRUCTOR.
• defines a throws clause. Example: METHOD THROWS CLASS.
• defines a cast expression. Example: METHOD CASTS CLASS.
• defines an instanceof expression. Example: METHOD CHECKS CLASS.
• an entity is annotated. Example: METHOD ANNOTATED_BY CLASS.
• defines the item type of an array. Example: ARRAY HAS_ELEMENTS_OF CLASS.
• defines the type parameters for an entity. Example: METHOD PARAMETRIZED_BY TYPE_VARIABLE.
• defines the base type of a parametrized type. Example: PARAMETRIZED_TYPE HAS_BASE_TYPE CLASS.
• defines the binding of a type parameter to a specific type. Example: PARAMETRIZED_TYPE HAS_TYPE_ARGUMENT CLASS.
• defines the upper bound of a wildcard. Example: WILDCARD HAS_UPPER_BOUND CLASS.
• defines the lower bound of a wildcard. Example: WILDCARD HAS_LOWER_BOUND CLASS.
• defines when a method overrides a parent class/interface method. Example: METHOD OVERRIDES METHOD.
• defines when a DUPLICATE type matches a number of types. Example: DUPLICATE MATCHES CLASS.


Table A.5: Relation Classes

• It is unknown where the target entity is.
• The target entity is located in the JAVA_LIBRARY project.
• The target entity is in the same project.
• The target entity is in an external project.
• It makes no sense to classify the target entity as internal or external.
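A relation kind, as accepted by the querying tools, combines a relation type and a relation class separated by a double colon. The Python sketch below shows one way to compose and split such strings. The set of relation types is taken from the examples in Table A.4; the relation class names used in the usage example (INTERNAL, EXTERNAL) are assumptions, since only their descriptions are available here, and the helper names are hypothetical.

```python
from typing import Tuple

# Relation type names as they appear in the examples of Table A.4.
RELATION_TYPES = {
    "INSIDE", "EXTENDS", "IMPLEMENTS", "HOLDS", "RETURNS", "READS", "WRITES",
    "CALLS", "USES", "INSTANTIATES", "THROWS", "CASTS", "CHECKS",
    "ANNOTATED_BY", "HAS_ELEMENTS_OF", "PARAMETRIZED_BY", "HAS_BASE_TYPE",
    "HAS_TYPE_ARGUMENT", "HAS_UPPER_BOUND", "HAS_LOWER_BOUND", "OVERRIDES",
    "MATCHES",
}


def make_relation_kind(relation_type: str, relation_class: str) -> str:
    """Compose an upper-case "TYPE::CLASS" string."""
    rtype = relation_type.strip().upper()
    rclass = relation_class.strip().upper()
    if rtype not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation_type!r}")
    if not rclass:
        raise ValueError("relation class must not be empty")
    return f"{rtype}::{rclass}"


def parse_relation_kind(kind: str) -> Tuple[str, str]:
    """Split "TYPE::CLASS" into its two parts."""
    parts = kind.split("::")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"malformed relation kind: {kind!r}")
    return parts[0], parts[1]


if __name__ == "__main__":
    # INTERNAL is an assumed relation class name, used only for illustration.
    kind = make_relation_kind("calls", "internal")
    print(kind, parse_relation_kind(kind))
```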

Appendix B

Top 100 Entities CodeRank

Table B.1: Top 100 Entities CodeRank (No. 1-33)

#: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

Entity Type: package, primitive, class, primitive, class, primitive, package, interface, interface, parameterized type, package, primitive, interface, package, interface, primitive, package, type variable, primitive, class, package, package, unknown, package, primitive, type variable, primitive, package, class, package, package, class, class

FQN: java.lang, void, java.lang.Object, int, java.lang.String, boolean, java.lang.CharSequence, java.lang.Comparable<java.lang.String>, java.util, long, java.lang.Comparable, java.awt, java.lang.Cloneable, byte, javax.swing, <T+java.lang.Object>, short, java.lang.Exception, java.sql, sun.awt.X11, java.lang.Object, org.w3c.dom, float, <E+java.lang.Object>, double, javax.accessibility, java.lang.Throwable, org.omg.CORBA, com.lowagie.text.pdf, java.util.Vector, java.util.ListResourceBundle

CodeRank: 4.715620% 4.242388% 4.059280% 2.293895% 2.199717% 1.116965% 1.034111% 1.000685% 0.388675% 0.374038% 0.304947% 0.251881% 0.232919% 0.222480% 0.221790% 0.200660% 0.195239% 0.163133% 0.155148% 0.139547% 0.138024% 0.136395% 0.136028% 0.130180% 0.108922% 0.099308% 0.096172% 0.094041% 0.089935% 0.089068% 0.088002% 0.087758% 0.087504%




Table B.2: Top 100 Entities CodeRank (No. 34-67)

#: 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67

Entity Type: primitive, interface, interface, class, package, package, interface, interface, package, unknown, class, package, package, class, type variable, package, package, array, package, package, package, class, class, package, class, package, package, package, package, package, interface, class, class, interface

FQN: char, javax.accessibility.Accessible, org.w3c.dom.Node, javax.swing.JComponent, org.xml.sax, java.util.List, java.util.EventListener, org.biomage.Interface, java.lang.String, java.lang.Class, net.sf.pizzacompiler.compiler, java.lang.reflect, <E>, java.awt.event, byte[], org.hsqldb, org.hsqldb, java.awt.Color, java.awt.Component, javax.swing.text, javax.swing.plaf.basic, org.w3c.dom.Element, java.util.Hashtable, java.util.ResourceBundle, java.lang.Runnable

CodeRank: 0.083407% 0.080270% 0.079507% 0.075495% 0.074926% 0.073773% 0.072944% 0.071246% 0.070408% 0.067242% 0.066851% 0.065243% 0.063781% 0.062156% 0.060435% 0.059796% 0.059524% 0.058651% 0.058159% 0.057687% 0.057557% 0.057352% 0.056617% 0.056141% 0.055589% 0.054086% 0.052590% 0.051423% 0.050802% 0.050419% 0.049696% 0.049590% 0.049359% 0.048984%



Table B.3: Top 100 Entities CodeRank (No. 68-100)

#: 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

Entity Type: interface, constructor, package, package, interface, class, interface, package, interface, package, package, package, class, package, package, class, class, class, class, class, class, class, package, package, interface, array, type variable, package, interface, interface, class, class, package

FQN: java.util.Collection, java.lang.Object.<init>(), java.awt.image, antlr, java.util.Map, java.sql.SQLException, java.sql.Wrapper, javax.swing.plaf, xjavadoc, javax.swing.JPanel, org.w3c.dom.svg, ca.gcf.util, java.lang.RuntimeException, java.lang.Integer, java.awt.Container, com.lowagie.text, java.nio, java.util.Iterator, java.lang.String[], <V+java.lang.Object>, java.sql.ResultSet, sun.awt.X11.XKeySymConstants, java.util.ArrayList

CodeRank: 0.048234% 0.047192% 0.046850% 0.046490% 0.045774% 0.045667% 0.045381% 0.045180% 0.045055% 0.044337% 0.043616% 0.043496% 0.042380% 0.042105% 0.041985% 0.041813% 0.041065% 0.040916% 0.040798% 0.039915% 0.038651% 0.038380% 0.038072% 0.037661% 0.037292% 0.037119% 0.036814% 0.036752% 0.036514% 0.036440% 0.036350% 0.035903% 0.035815%

[1] Inc. 10gen. MongoDB., August 2012.
[2] Daniel Abadi. Problems with CAP, and Yahoo’s little known NoSQL system. http://, April 2010.
[3] Amitanand S. Aiyer, Mikhail Bautin, Guoqiang Jerry Chen, Pritam Damania, Prakash Khemani, Kannan Muthukkaruppan, Karthik Ranganathan, Nicolas Spiegelberg, Liyin Tang, and Madhuwanti Vaidya. Storage Infrastructure Behind Facebook Messages: Using HBase at Scale. IEEE Data Eng. Bull., 35(2):4–13, 2012.
[4] Sushil Bajracharya, Joel Ossher, and Cristina Lopes. Sourcerer: An internet-scale software repository. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, SUITE ’09, pages 1–4, Washington, DC, USA, 2009. IEEE Computer Society.
[5] Sushil Krishna Bajracharya, Joel Ossher, and Cristina Videira Lopes. Leveraging usage similarity for effective retrieval of examples in code repositories. In Gruia-Catalin Roman and Kevin J. Sullivan, editors, SIGSOFT FSE, pages 157–166. ACM, 2010.
[6] Daniel Bartholomew. SQL vs. NoSQL. Linux Journal, 2010(195), July 2010.
[7] D. Beyer, A. Noack, and C. Lewerentz. Efficient relational calculation for software analysis. 31:137–149, 2005.
[8] Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3):669–687, December 2006.
[9] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind Menon, Samuel Rash, Rodrigo Schmidt, and Amitanand Aiyer. Apache hadoop goes realtime at facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, SIGMOD ’11, pages 1071–1080, New York, NY, USA, 2011. ACM.
[10] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April 1998.
[11] Călin-Andrei Burloiu. Distributed sourcerer code on github. calinburloiu/Sourcerer, September 2012.

[12] Judith Burns. Google trick tracks extinctions. 8238462.stm, September 2009.
[13] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.
[14] Codase. Codase., September 2012.




[15] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
[16] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, October 2007.
[17] Jürgen Ebert, Daniel Bildhauer, Hannes Schwarz, and Volker Riediger. Using Difference Information to Reuse Software Cases. Softwaretechnik-Trends, 27(2), 2007.
[18] Elasticsearch. Elasticsearch., September 2012.
[19] D. Salmen et al. Cloud data structure diagramming techniques and design patterns. 22-white-papers/68-cloud-data-structure-diagramming, November 2009.
[20] Dietrich Featherston. Cassandra: Principles and application. cassandra-cs591-su10-fthrstn2.pdf.
[21] Apache Software Foundation. Allow proper fsync support for HBase. https://issues.apache.org/jira/browse/HBASE-5954, August 2012.
[22] Apache Software Foundation. Apache cassandra., September 2012.
[23] Apache Software Foundation. Apache CouchDB., August 2012.
[24] Apache Software Foundation. Apache hadoop., September 2012.
[25] Apache Software Foundation. Apache HBase., September 2012.
[26] Apache Software Foundation. Apache lucene., September 2012.
[27] Apache Software Foundation. Apache Maven Project., August 2012.
[28] Apache Software Foundation. Apache software foundation., September 2012.
[29] Apache Software Foundation. Apache solr., September 2012.
[30] Apache Software Foundation. Distributed solr. DistributedSearch, September 2012.
[31] Apache Software Foundation. HBase – FAQ Design. FAQ_Design#A3, September 2012.
[32] Apache Software Foundation. HBase ACID Properties. acid-semantics.html, September 2012.
[33] Apache Software Foundation. HBase/PoweredBy - Hadoop Wiki. hadoop/Hbase/PoweredBy, September 2012.
[34] Apache Software Foundation. HDFS architecture guide. r1.0.3/hdfs_design.html, August 2012.
[35] Apache Software Foundation. Powered by – hadoop wiki. PoweredBy, September 2012.
[36] Apache Software Foundation. Support hsync in HDFS. browse/HDFS-744, August 2012.

[37] Eclipse Foundation. Eclipse., September 2012.
[38] Lars George. HBase: The definitive guide. O’Reilly, September 2011.


[39] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, October 2003.
[40] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
[41] Derrick Harris. How Facebook keeps 100 petabytes of Hadoop data online. http://gigaom.com/cloud/how-facebook-keeps-100-petabytes-of-hadoop-data-online/, September 2012.
[42] Lars Hofhansl. HBase, HDFS and durable sync. 05/hbase-hdfs-and-durable-sync.html, May 2012.
[43] Richard C. Holt. Structural manipulations of software architecture using tarski relational algebra. In Proceedings of the Working Conference on Reverse Engineering (WCRE’98), WCRE ’98, pages 210–, Washington, DC, USA, 1998. IEEE Computer Society.
[44] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX conference on USENIX annual technical conference, USENIXATC’10, pages 11–11, Berkeley, CA, USA, 2010. USENIX Association.
[45] Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, and Juergen Rilling. SE-CodeSearch: A scalable Semantic Web-based source code search infrastructure. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM ’10, pages 1–5, Washington, DC, USA, 2010. IEEE Computer Society.
[46] Paul Klint. How understanding and restructuring differ from compiling – a rewriting perspective. In Proceedings of the 11th IEEE International Workshop on Program Comprehension, IWPC ’03, pages 2–, Washington, DC, USA, 2003. IEEE Computer Society.
[47] Gunter Kniesel and Uwe Bardey. An analysis of the correctness and completeness of aspect weaving. In Proceedings of the 13th Working Conference on Reverse Engineering, WCRE ’06, pages 324–333, Washington, DC, USA, 2006. IEEE Computer Society.
[48] Koders. Koders., September 2012.
[49] Krugle. Krugle., September 2012.
[50] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel Ossher. Codegenie: a tool for test-driven source code search. In Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr., editors, OOPSLA Companion, pages 917–918. ACM, 2007.
[51] Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, and Pierre Baldi. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 18:300–336, 2009.
[52] Amazon Web Services LLC. Amazon s3., August 2012.
[53] Karma Snack LLC. Search engine market share. search-engine-market-share/, September 2012.
[54] M. Loukides. What is data science? html, August 2012.
[55] Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11, pages 111–120, New York, NY, USA, 2011. ACM.
[56] memcached. memcached., August 2012.
[57] Neo4j graph database., August 2012.
[58] Michael Nielsen. Using MapReduce to compute PageRank. using-mapreduce-to-compute-pagerank/, January 2009.
[59] University California of Irvine. Sourcerer code on github. Sourcerer, September 2012.
[60] University California of Irvine. SourcererDB web page. sourcerer-db.html, September 2012.
[61] Oracle. MySQL., September 2012.
[62] Joel Ossher, Sushil Bajracharya, Erik Linstead, Pierre Baldi, and Cristina Lopes. SourcererDB: An aggregated repository of statically analyzed and cross-linked open source java projects. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR ’09, pages 183–186, Washington, DC, USA, 2009. IEEE Computer Society.
[63] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[64] Diego Puppin and Fabrizio Silvestri. The social network of java classes. In Proceedings of the 2006 ACM symposium on Applied computing, SAC ’06, pages 1409–1413, New York, NY, USA, 2006. ACM.
[65] Peter Rademaker. Binary relational querying for structural source code analysis. 2008.
[66] Steven P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, ICSE ’09, pages 243–253, Washington, DC, USA, 2009. IEEE Computer Society.
[67] Ryan Sonnek. Realtime Search: Solr vs Elasticsearch. realtime-search-solr-vs-elasticsearch/, May 2011.

[68] Michael Stonebraker. SQL databases v. NoSQL databases. Commun. ACM, 53(4):10–11, April 2010.
[69] A. Tarski. On the calculus of relations. Journal of Symbolic Logic, 6(3):73–89, September 1941.
[70] Mathieu Verbaere, Elnar Hajiyev, and Oege de Moor. Improve software quality with SemmleCode: an eclipse plugin for semantic code search. In Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr., editors, OOPSLA Companion, pages 880–881. ACM, 2007.
[71] Yana Volkovich, Nelly Litvak, and Debora Donato. Determining factors behind the PageRank log-log plot. In Proceedings of the 5th international conference on Algorithms and models for the web-graph, WAW’07, pages 108–123, Berlin, Heidelberg, 2007. Springer-Verlag.
[72] Tom White. Hadoop: The definitive guide (third edition). O’Reilly, Yahoo! Press, January 2012.
[73] Wikipedia. Big data., August 2012.
