FLORIDA STATE UNIVERSITY

PYQUERY:

By

SHIVA KRISHNA IMMINNI

A Thesis submitted to the
Department of Computer Science
in partial fulfillment of the
requirements for the degree of
Master of Science

2015

Copyright © 2015 Shiva Krishna Imminni. All Rights Reserved.
Shiva Krishna Imminni defended this thesis on November 13, 2015.
The members of the supervisory committee were:

Piyush Kumar
Professor Directing Thesis

Sonia Haiduc
Committee Member

Margareta Ackerman
Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.
I dedicate this thesis to my family. I am grateful to my loving parents, Nageswara Rao and Subba Laxmi, who made me the person I am today. I am thankful to my affectionate sister, Ramya Krishna, who is very special to me and always stood by my side.
ACKNOWLEDGMENTS
I owe thanks to many people. Firstly, I would like to express my gratitude to Dr. Piyush Kumar for directing my thesis. Without his support, continuous guidance, patience and immense knowledge,
PyQuery wouldn’t be so successful. He truly made a difference in my life by introducing me to Python
programming language and helped me learn how to contribute to Python community. He trusted me and
remained patient during the difficult times. I would also like to thank Dr. Sonia Haiduc and Dr. Margareta
Ackerman for participating on the committee, monitoring my progress and providing insightful
comments. They helped me learn multiple perspectives that widened my research. I would like to thank
my team members Mir Anamul Hasan, Michael Duckett, Puneet Sachdeva and Sudipta Karmakar for their
time, support, commitment and contributions to PyQuery.
TABLE OF CONTENTS
List of Tables.............................................................................................................................................vii
List of Figures..........................................................................................................................................viii
Abstract......................................................................................................................................................ix
1 Introduction 1
1.1 Objective.......................................................................................................................................2
1.2 Approach......................................................................................................................................2
2 Related Work 4
3 Data Collection 7
3.1 Package Level Search....................................................................................................................7
3.1.1 Metadata - Packages.........................................................................................................7
3.1.2 Code Quality....................................................................................................................8
3.2 Module Level Search..................................................................................................................10
3.2.1 Mirror Python Packages..................................................................................................10
3.2.2 Metadata - Modules........................................................................................................11
5.4.1 Preprocessing.................................................................................................................26
6 System Level Flow Diagram 31
7 Results 36
8 Conclusions and Future Work 42
8.1 Thesis Summary.........................................................................................................................42
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
Python Package Index (PyPI) is a repository that hosts all the packages ever developed for the Python community. Hosting thousands of packages from different developers, it is the primary source for downloading and installing packages. It also provides a simple web interface to search for these packages. A direct search on PyPI returns hundreds of packages that are not intuitively ordered, thus making it harder to find the right package. Developers consequently resort to mature search engines like Google, Bing or Yahoo, which redirect them to the appropriate package homepage at PyPI. Hence, the first task of this thesis is to improve search results for Python packages.
Secondly, this thesis attempts to develop a new search engine that allows Python developers to perform a code search targeting Python modules. Currently, the existing search engines classify programming languages such that a developer must select a programming language from a list. As a result, every time a developer performs a search operation, he or she has to choose Python out of a plethora of programming languages. This thesis seeks to offer a more reliable and dedicated search engine that caters specifically to the Python community and ensures a more efficient way to search for Python packages and modules.
CHAPTER 1
INTRODUCTION
Python is a high-level programming language based on simplicity and efficiency. It emphasizes code simplicity and can perform more functions in fewer lines of code. In order to streamline code and speed up development, many programmers use application packages, which reduce the need to copy definitions
into each program. These packages consist of written application software in the form of different
modules, which contain the actual code. Python’s main power exists within these packages and the wide
range of functionality that they bring to the software development field. Providing ways and means to
deliver information about these reusable components is of utmost importance.
PyPI, a software repository for Python packages, offers a search feature to look for available packages meeting the user's needs. It implements a trivial ranking algorithm to detect matching packages for a user's
keyword, resulting in a poorly sorted, huge list of closely and similarly scored packages. From this immense list of results, it is hard to find, in a reasonable amount of time, a package that meets the user's needs. Due to the lack of an efficient native search engine for the Python community, developers often rely on mature, multipurpose search engines like Google, Yahoo and Bing. In order to express his or her interest in Python packages, a developer taking this route has to articulate the query carefully and, on top of that, provide additional input. A dedicated search engine for the Python community would bypass the need to specify one's interest in Python. One may argue that such a search engine wouldn't alter the experience of a developer who is searching for Python packages. However, considering that a developer on average performs five search sessions with 12 total queries for packages, code and related information each workday [1], a search engine that saves this time and effort is desirable.
An additional method of software development that saves time is the practice of code reuse [2]. There are many factors that influence the practice of code reuse [3] [4]. One such factor is the availability of the right tools to find reusable components. Searching for code is a serious challenge
for developers. Code search engines such as Krugle1, Searchcode2 and BlackDuck3 have attempted to ameliorate the hardship of searching for code by targeting a wide range of languages. Currently, Python
developers who conduct code searches have to learn how to configure these search engines so that they display Python-specific results. As a result, although these search engines exemplify the ideal that one product may solve all kinds of problems, such an ideal fails to overcome the problems faced by Python developers. Python developers would rather rely on a code search engine that is designed exclusively for searching Python packages.
1.1 Objective
This thesis seeks to contribute to the Python community by developing a dedicated Python search engine (PyQuery)4 that enables Python developers to search for Python packages and modules (code) and encourages them to take advantage of an important software development practice: code reuse. With
PyQuery we want to facilitate the search process, improve the query results, and collect packages and
modules into one user-friendly interface that provides the functionality of a myriad of code search tools.
PyQuery will also be able to synchronize with the Python Package Index to provide users with code
documentation and downloads, thereby providing all steps in the code search process.
1.2 Approach
PyQuery is organized into three separate components: Data Collection, Data Indexing and Data Presentation. The Data Collection component is responsible for collecting all the data required to
facilitate the search operation. This data includes source code for packages, metadata about packages
from PyPI, preprocessed metadata about modules, code quality analysis for packages and other relevant
information that helps us deliver meaningful search results. In order to provide the most recent updates to
packages at PyPI, we ensure that the data we collect and maintain is always in sync with changes made at
PyPI. For this reason, we use the Bandersnatch5 mirror client of PyPI which keeps track of changes
utilizing state files.
1 http://www.krugle.com/
2 https://searchcode.com/
3 https://code.openhub.net/
4 https://pypi.compgeom.com/
5 https://pypi.python.org/pypi/bandersnatch
The Data Indexing component stores all the data we have collected and processed in the Data Collection module in a structured schema that facilitates faster search queries for matching packages and modules. We used Elasticsearch (ES)6, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data. We rely on FileSystem River (FSRiver)7, an ES plugin, to index documents from the local file system using SSH. In ES, we used separate indexes for files related to module level search from those related to package level search. By using this approach, we can map each query to its specific type and related files.
The Data Presentation component delivers matched search results to the user in a fashion that is both appealing and easy to follow. We have used Flask8 for server side scripting. When a user queries for matching packages, we send a query to the index responsible for packages and retrieve the details that allow the user to see the most significant packages, their scores, statistics and other relevant information. We implemented a ranking algorithm that fine-tunes ES results by sorting them based on various metrics. Additionally, when a user queries for matching modules, a request is sent to the ES index for modules, which contains metadata (e.g., class name, method name, etc.), to get a list of matches alongside their line numbers and paths to modules on the server. For every match, a code snippet containing the matching line is rendered using Pygments9. To reduce the time for processing matched results, all the modules are preprocessed with Pygments and each line number is mapped to its starting byte address in the file, so that the server can quickly open the Pygments-rendered file, seek to the calculated byte location, and pull the required piece of HTML code snippet.
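The byte-offset trick in the last sentence can be sketched as follows. This is a minimal illustration using a plain temporary file in place of a Pygments-rendered one; the function names are ours, not PyQuery's:

```python
import os
import tempfile

def build_line_offsets(path):
    """Map each 1-based line number to the byte offset where it starts."""
    offsets = {}
    pos = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            offsets[lineno] = pos
            pos += len(line)
    return offsets

def read_line_at(path, offsets, lineno):
    """Seek straight to a line's byte offset instead of scanning the file."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline().decode("utf-8")

# Demo on a small temporary file standing in for a preprocessed module.
tmp = tempfile.NamedTemporaryFile("w", suffix=".html", delete=False)
tmp.write("def foo():\n    return 1\n\nclass Bar:\n    pass\n")
tmp.close()

offsets = build_line_offsets(tmp.name)
snippet = read_line_at(tmp.name, offsets, 4)  # jump straight to line 4
os.unlink(tmp.name)
```

The offset table is computed once at preprocessing time, so serving a snippet costs one seek and one read rather than a scan of the whole file.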
6 https://www.elastic.co/products/elasticsearch
7 https://github.com/dadoonet/fsriver
8 http://flask.pocoo.org/
9 http://pygments.org/
CHAPTER 2
RELATED WORK
Search engines employ metrics for scoring and ranking, but these metrics are often limited and may not differentiate the significant packages. Additionally, these metrics do not exhibit all the qualities that may be relevant to what a user wants out of a specific module or package.
The PyPI [5] website signifies the exemplar for this project. When one searches for packages, PyPI follows a very simple search algorithm which gives a score for each package based on the query. Certain
fields such as name, summary, and keywords are matched against the query and a binary score for each
field is computed (basically a “yes; it matched” or “no; it didn’t”). A weight is given for each field, and
the composite scores from each field are added to create a total score for each package. Packages are first
sorted by score and then in lexicographical order of package names. We found this information at
stackoverflow1 and followed the steps given to confirm the working of the PyPI searching algorithm.
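The weighted binary scheme described above can be sketched in a few lines. The field weights here are illustrative guesses, not PyPI's actual values:

```python
def pypi_style_score(query, package, weights=None):
    """Binary per-field match: a field contributes its full weight if the
    query string occurs in it, and 0 otherwise; field scores are summed."""
    weights = weights or {"name": 5, "summary": 3, "keywords": 2}
    q = query.lower()
    return sum(w for field, w in weights.items()
               if q in package.get(field, "").lower())

packages = [
    {"name": "flask", "summary": "A microframework", "keywords": "web wsgi"},
    {"name": "flask-login", "summary": "User session management for Flask",
     "keywords": "flask auth"},
    {"name": "requests", "summary": "HTTP for Humans", "keywords": "http"},
]
# Sort by descending score, breaking ties lexicographically by name,
# as PyPI does.
ranked = sorted(packages,
                key=lambda p: (-pypi_style_score("flask", p), p["name"]))
```

Note how "flask-login" outscores "flask" itself simply by repeating the query term in more fields, which illustrates both why this scheme separates packages poorly and why a package owner can easily game it.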
The above method employed by PyPI works, but it doesn’t distinguish the packages very well.
For example, searching for "flask" will yield 1750 results with a top score of 9 given to 162 packages (fortunately, the Flask web framework, which should be at the top when searched, is listed 4th due to
sorting based on alphabetical order). This also makes it very easy to influence the outcome of popular
queries if you are the package developer. An algorithm which resists the influence of a package owner
would be a better fit for reliable package searches.
PyPI Ranking [6], another website created by Taichino, is similar to PyPI and PyQuery in that it searches only Python packages and no other languages. It has a search function that takes in a user's
search query and finds relevant packages. It also syncs with PyPI so that the user can access the
information contained on PyPI such as documentation and downloads. The main difference, however, is
that PyPI Ranking ranks packages based only on the number of downloads, so packages with more
downloads will emerge higher up on the list. This means that packages get more value based on their
popularity, which is a valuable metric, but not the only valuable
1 http://stackoverflow.com/questions/28685680/what-does-weight-on-search-results-in-pypi-help-in-choosing-a-package
metric. Furthermore, the website only allows a package level search function, whereas PyQuery contains both package level and module level search functions, providing more resources to the user.
Additionally, the website makes use of Django to facilitate the web development whereas PyQuery uses
the microframework Flask.
There are multiple code search engines that allow users to look for written code that relates to their search query. These code search engines include websites such as Krugle2, BlackDuck Open Hub Code Search3 and SearchCode4. These websites allow users to enter a search query, and then they list sample lines of code based on the results from their query. These websites are limited, however, because they can only search code at GitHub5, Sourceforge6 and other open source repositories. They are
not contained within one context in which a user might want to find a specific package or module.
Additionally, the websites do a search based purely on term occurrence, by identifying the user’s search
term within the lines of code and returning the code samples with numerous hits. The results a user
receives on their search key may not address what they want, but rather just contain the term itself.
Consequently, the results are not scored due to the lack of relevant information to incorporate as metrics.
PyQuery accesses data directly from PyPI, preprocesses the data to extract useful information from code, indexes the data within itself, searches the data, and reorders it based on a ranking function. PyQuery is also
constructed within the Python community so that Python packages and modules are only ranked against
other Python packages and modules. These results are more valuable due to the metrics they are based on
and the nature of the searching algorithm.
In the past, people have attempted to do code search in languages like Java7 based on semantics that are test driven [7]; such approaches required users to specify what they are searching for as precisely as possible. This means that they need to provide both the syntax and the semantics of their target. Furthermore, users pass a set of test cases that include sample input and expected output to filter the potential set of matches. This is a great technique to search for code that can be reused; however, it has its limitations. This tool requires the kind of detail regarding the input that the user will not know in the first place. This tool is more helpful for testing a reusable coding entity whose path from the
2 http://www.krugle.com/
3 https://code.openhub.net/
4 https://searchcode.com/
5 https://github.com/
6 http://sourceforge.net/
7 http://www.oracle.com/technetwork/java/index.html
package root (ex: str.capitalize()) is known to the user precisely. If a user is looking for code that capitalizes the first character of a string, he may guess the function name to be capitalize() but may not precisely know it can be found in the str package with the signature str.capitalize(). If a user knows this information, he
or she may directly look inside usage documents to see if it meets his or her requirements (though he or
she may have to execute test cases on their own).
Nullege is a dedicated search engine for Python source code [8]. As a keyword, it requires a UNIX-path-like structure ("/" replaced by ".") used in Python import statements, which always starts at the level of the package root folder. Some of the sample queries for Nullege include "flask.app.Environment", "requests.adapters.BaseAdapter" and "scipy.add". Results from the search operation on Nullege point to the source code where the programming entity is imported. This is a useful tool for users who are familiar with the folder structure of the package and are generally curious to explore its source code or to learn which packages import it. A user can't directly pass a generic keyword that infers the purpose of the programming entity he or she is interested in. For users who want to learn whether there exists a reusable component for a specific task at hand and are not aware of the precise location to look at, Nullege is not the right tool. Because of the limitations imposed on the input and the type of results returned, Nullege can be classified as an exploration tool for the source code tree of Python packages rather than a search engine for source code. PyQuery allows users to perform a generic keyword search without limitations in input like those of Nullege. PyQuery results are usually code snippets that point to definitions of programming entities rather than import statements.
We have used the Abstract Syntax Tree (AST) to collect various programming entity names and their line numbers in modules for code search. Many research topics that analyze software code use the AST. Some of the applications of AST include Semantics-Based Code Search [7], understanding source code evolution using abstract syntax tree matching [9] and employing source code information to improve question-answering in Stack Overflow [10]. These implementations construct an AST for the code at consideration and extract needed information by walking through the tree or directly visiting the required node. For this purpose, we have used the ast module [11] in Python. Chapter 3 elaborates on how we extract metadata about modules for code search.
CHAPTER 3
DATA COLLECTION
For any search engine to work, it requires data to perform search operations. Data could be anything. It could be of any form and any type. For the problem we plan to solve, we have to address the question
“What kind of data are we interested in?”. We are engrossed in data related to Python packages that can
help us return meaningful results for a user query. We intend to provide two flavors to the search engine:
Package Level Search and Module Level Search. Let us examine tools and configurations that help us
collect required data to achieve this goal.
3.1 Package Level Search
A package is a collection of modules, which are meant to solve the problem(s) of some type. For example, the "requests" package is developed to handle http capabilities. According to its homepage1, it has
various features, including International Domains and URLs, Keep-Alive and Connection Pooling,
Sessions with Cookie Persistence, Browser-style SSL Verification, etc. A user interested in these features
would like to use this library to solve his or her problem. A developer may produce a library and assign a
name to it that may or may not directly have any relation with the purpose of the library. A user would get
to know whether the library helps solve his problem not just by looking at its name alone but the
description, sample usage and other useful metadata about the package mentioned at its homepage.
Sometimes when a user has to pick between multiple packages that are trying to solve the same problem,
criteria like popularity of author, number of downloads, frequency of releases, code quality and efficiency
starts to factor. A search engine that returns Python package as matches to a user query would require
similar information.
3.1.1 Metadata - Packages
a given library solves his or her problem. PyQuery needs this information to search for matching packages for the user's query and, if multiple packages qualify, to prioritize one over the other. One direct way to get the description of a package is to crawl its homepage at PyPI2. Though this sounds pretty straightforward and easy, gathering URL information for the latest stable release of each package and maintaining this information could be tricky, and searching for required information in crawled data could be time consuming.
We found an elegant and much simpler way to gather metadata of a package. PyPI allows users to access metadata information about a package via an http request3. This would return a JSON file with keys such as description, author, package url, downloads.last month, downloads.last week, downloads.last day, releases.x.x.x, etc. For example, one can query the PyPI website for metadata about the "requests" package through a URL4. Refer to Figure 3.1 for a sample response from PyPI.
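In code, fetching this metadata is a single HTTP round trip. The sketch below builds the URL pattern from footnote 3; `fetch_metadata` needs network access, so the demonstration parses a canned response shaped like Figure 3.1 instead:

```python
import json
from urllib.request import urlopen

def metadata_url(package_name):
    # URL pattern described in the text: /pypi/<package_name>/json
    return "http://pypi.python.org/pypi/{}/json".format(package_name)

def fetch_metadata(package_name):
    """Fetch and decode a package's JSON metadata from PyPI (needs network)."""
    with urlopen(metadata_url(package_name)) as resp:
        return json.load(resp)

# Offline demonstration with a trimmed response shaped like Figure 3.1.
sample = json.loads('{"info": {"author": "Kenneth Reitz",'
                    ' "downloads": {"last_month": 4002673}}}')
author = sample["info"]["author"]
monthly = sample["info"]["downloads"]["last_month"]
```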
3.1.2 Code Quality
PEP 0008 – Style Guide for Python Code5 describes a set of semantic rules and guidelines that Python developers should incorporate into their code. These standards are highly encouraged by the Python community. Standard libraries that are shipped with the installation are written using these conventions. One main reason to emphasize a standardized style guide is to increase code readability. The Python code base is pretty huge, and it is important to maintain consistency across it. Conventions set in the Python style guide make the Python language beautiful and easy to follow as you read.
Code Quality of a package can be measured in multiple ways. First, we can check if the package at consideration follows the style guide for Python code. The Python community has tools to check a package's compliance with the style guide. pep86 is a simple Python module that uses only standard libraries and validates any Python code against the PEP 8 style guide. Pylint7 is another such tool that checks for line length, variable names, unused imports, duplicate code and other coding standards against PEP 8.
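As a flavor of what such tools check, here is a toy line-length check in the spirit of PEP 8's 79-character limit. This is a stand-in sketch, not the pep8 or Pylint implementation:

```python
MAX_LINE_LENGTH = 79  # PEP 8's recommended maximum line length

def check_line_lengths(source):
    """Return (line_number, length) for each line over the limit --
    one of the many style checks pep8 and Pylint perform."""
    return [(lineno, len(line))
            for lineno, line in enumerate(source.splitlines(), start=1)
            if len(line) > MAX_LINE_LENGTH]

# A two-line snippet whose second line is 96 characters long.
code = "x = 1\ny = '" + "a" * 90 + "'\n"
violations = check_line_lengths(code)
```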
2 https://pypi.python.org/pypi
3 http://pypi.python.org/pypi/<package_name>/json
4 http://pypi.python.org/pypi/requests/json
5 https://www.python.org/dev/peps/pep-0008/
6 https://pypi.python.org/pypi/pep8
7 http://www.pylint.org/
"info": {
    ...
    "package_url": "http://pypi.python.org/pypi/requests",
    "author": "Kenneth Reitz",
    "author_email": "me@kennethreitz.com",
    "description": "Requests: HTTP for Humans ...",
    ...
    "release_url": "http://pypi.python.org/pypi/requests/2.7.0",
    "downloads": {
        "last_month": 4002673,
        "last_week": 1307529,
        "last_day": 198964
    },
    ...
    "releases": {
        "1.0.4": [
            {
                "has_sig": false,
                "upload_time": "2012-12-23T07:45:10",
                "comment_text": "",
                "python_version": "source",
                "url": "https://pypi.python.org/packages/source/r/requests/requests-1.0.4.tar.gz",
                "md5_digest": "0ba7448f9e1a077a7218720575003a1b6",
                "downloads": 111768,
                "filename": "requests-1.0.4.tar.gz",
                "packagetype": "sdist",
                "size": 336280
            }
        ],
        ...
    }
}
pep8 and Pylint together are great tools that we can use to check for code quality, but we have decided to use Prospector8, which brings both the functionality of pep8 and Pylint. It also adds the functionality of the code complexity analysis tool called McCabe9. When a package is processed with Prospector, it will give a count of errors, warnings and messages, along with their detailed descriptions. This information gives an inference of Code Quality.
There is another set of information we can use for analyzing code quality. Developers care about how well the code is commented. The ratios of the number of comment lines to the total number of lines, the number of code lines to the total number of lines, and the number of warnings to the number of lines offer some metrics to do code quality analysis. CLOC10 helps us acquire this information. As CLOC stands for Count Lines of Code, when we run CLOC on the Python package at consideration, it returns the total number of files, the number of comment lines, the number of blank lines and the number of code lines. We collect this information to check for Code Quality.
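Given CLOC-style counts, the ratios mentioned above are straightforward to compute. The counts below are made up for illustration:

```python
def quality_ratios(comment_lines, code_lines, blank_lines):
    """Comment-to-total and code-to-total ratios from CLOC-style counts."""
    total = comment_lines + code_lines + blank_lines
    return {
        "comment_ratio": comment_lines / float(total),
        "code_ratio": code_lines / float(total),
    }

# Hypothetical counts for a small package: 500 lines in total.
ratios = quality_ratios(comment_lines=120, code_lines=300, blank_lines=80)
```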
3.2 Module Level Search
A module is a Python file with extension ".py". It is a collection of multiple programming units such as classes, methods and variables. Some developers are interested in searching for these programming
for these programming
entities in a module, so we wanted to build a search engine for them. There are various steps involved in
achieving this goal.
3.2.1 Mirror Python Packages
Some organizations wouldn't like their developers to hit the World Wide Web to download the software packages they need for development. Instead, they maintain a local mirror of the PyPI repository from which developers can
download necessary packages without connecting to the Internet.
Currently, PyPI is home to 50,000+ packages. It would be a single point of failure if it goes down. In order to avoid such a disaster, PyPI has come up with PEP 38111, a mirroring infrastructure that can clone an entire PyPI repository onto a desired machine. People started making public and private repositories using this infrastructure. For our purposes, we use Bandersnatch12, a client side implementation of PEP 381, to sync Python packages. When bandersnatch is executed for the first time it will mirror the entire PyPI, i.e., download all the Python packages. It will also maintain state files that record the current state of the repository, which are later used to sync with PyPI and pull any updates made to the packages. A recurring cron job that executes the command "bandersnatch mirror" will keep the local repository always updated.
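The sync step can be wrapped as below. This is a sketch: the hourly schedule in the comment is an assumption, and the dry run avoids actually invoking bandersnatch:

```python
import subprocess

def bandersnatch_sync_command():
    # The command named in the text; assumes bandersnatch is on PATH.
    return ["bandersnatch", "mirror"]

def sync_mirror(runner=subprocess.call):
    """Run one sync. A crontab entry such as
        0 * * * * bandersnatch mirror
    (hourly -- an assumed schedule) keeps the local mirror current."""
    return runner(bandersnatch_sync_command())

# Dry run: record the command instead of executing it.
invoked = []
status = sync_mirror(runner=lambda cmd: invoked.append(cmd) or 0)
```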
3.2.2 Metadata - Modules
We have previously discussed that developers show interest in doing a code search for programming entities. We mirrored the entire PyPI repository into our servers using bandersnatch. In order to enable
to enable
code search, we have to find useful information from modules of each package, i.e., get a list of
programming entities for each module. There are many programming entities in a Python module, but we
are mainly interested in classes, functions under classes, variables under classes, global functions,
recurring inner functions inside global functions, variables inside global functions and global variables.
We maintain each of them in a separate key so that we can give more weight to certain entities than others.
To collect the required information, we iterate through all packages; within each package we iterate through all modules; and for each module, we construct an Abstract Syntax Tree using the ast13 module from Python and perform a walk (visit all) operation on this tree. As the walk operation visits each programming entity, it invokes various function calls inside ast.NodeVisitor such as visit_Name, visit_FunctionDef, visit_ClassDef and so on, as per the current element. We override the ast.NodeVisitor class and the functions inside it, and perform the visit all operation on top of it, so that we have control over the operations performed inside them. For example, during visit all, if a class is being visited, a
11 https://www.python.org/dev/peps/pep-0381/
12 https://pypi.python.org/pypi/bandersnatch
13 https://docs.python.org/2/library/ast.html
# Sample Code for collecting metadata
class PyPINodeVisitor(ast.NodeVisitor):
    def visit_Name(self, node):
        # collect variable name and line number
    def visit_FunctionDef(self, node):
        # collect function name and line number
    def visit_ClassDef(self, node):
        # collect class name and line number
    def visit_all(self, node):
        # call super class visit function
function call to visit_ClassDef is invoked. Since we have overridden this function, we are in control of the information passed to it and decide what to do with it. We can collect information that is of interest to us, such as the names of the various classes and the line numbers at which they occur. Figure 3.2 is the pseudocode for using ast to generate the required metadata of a module. This way, we can collect all the metadata for a module and save it in a JSON format, making it available for Module Level Search. Figure 3.3 is an example of one such JSON file we have generated using this process. Each identifier is concatenated with its line number and additional underscores to make a minimum length of 18. The reason behind this format is discussed in Chapter 4. As part of data collection, it is important that the information being collected is stored in an agreed format that enables better indexing and searching techniques.
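A runnable version of the visitor in Figure 3.2, including the underscore padding just described, might look like this. It is a minimal sketch: PyQuery's real visitor distinguishes many more entity kinds, and the JSON key names below are illustrative:

```python
import ast
import json

MIN_LEN = 18  # identifiers are padded with underscores to at least 18 chars

def tag(name, lineno):
    """Concatenate an identifier with its line number, underscore-padded."""
    return "{}_{}".format(name, lineno).ljust(MIN_LEN, "_")

class PyPINodeVisitor(ast.NodeVisitor):
    def __init__(self):
        self.classes = []
        self.functions = []

    def visit_ClassDef(self, node):
        self.classes.append(tag(node.name, node.lineno))
        self.generic_visit(node)  # keep walking into the class body

    def visit_FunctionDef(self, node):
        self.functions.append(tag(node.name, node.lineno))
        self.generic_visit(node)

source = "class Greeter:\n    def hello(self):\n        return 'hi'\n"
visitor = PyPINodeVisitor()
visitor.visit(ast.parse(source))
metadata = json.dumps({"class": visitor.classes,
                       "function": visitor.functions})
```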
3.2.3 Code Quality
Similar to the method we applied to collect code quality for packages using Prospector and CLOC, we have also collected this information at the module level. We couldn't use Prospector to process a single module like we did for a package, so we used Pylint instead of Prospector. CLOC helped towards obtaining the number of comment lines, the number of blank lines and the number of code lines at the module level.
[Figure 3.3: excerpt of the generated JSON metadata for a module. Keys such as "class" list identifiers like SessionMixin, NullSession, TaggedJSONSerializer, SessionInterface and SecureCookieSessionInterface, and function keys list identifiers like open_session and save_session, each concatenated with its line number and padded with underscores to a minimum length of 18.]
CHAPTER 4
We used Elasticsearch (ES)1, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data and query the indexed data for both package level search and module level search. FileSystem River (FSRiver)2, an ES plugin, is used to index documents from a local file system or a remote file system (using SSH).
4.1 Data Indexing
4.1.1 Package Level Search
Extracted data for each Python package is indexed in ES using FSRiver, where the data for each Python package is considered a document. Although all fields in a document are indexed in the ES server, only the following fields are analyzed with the ES Snowball analyzer3 before they are indexed: name, summary, keywords and description (refer to Figure 4.1 for the ES mapping). The Snowball analyzer generates tokens using the standard tokenizer, removes English stop words and applies the standard filter, lowercase filter and snowball filter. The other fields are not analyzed before indexing, either because they are numbers (e.g., info.downloads.last month) or because they are of no interest with respect to the search query. Figure 4.2 depicts the river definition, which indexes the package level data (in .json format) located on the server, looks for updates every 12 hours and reindexes data if there is any update.
4.1.2 Module Level Search
Extracted data for each module in a Python package is indexed in ES, where the data for each module is considered a document. All fields in a document except module_path are analyzed using a custom analyzer (Refer to Figure 4.3 for the definition of the custom analyzer) before they are indexed. The custom analyzer generates tokens using a pattern tokenizer and applies a pattern-capture filter, a lowercase filter and a snowball filter.
1 https://www.elastic.co/products/elasticsearch
2 https://github.com/dadoonet/fsriver
3 https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
PUT packagedata/packageindex/_mapping
{
  "packageindex": {
    "properties": {
      "info": {
        "properties": {
          "name": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "summary": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "keywords": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "description": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          }
        }
      }
    }
  }
}
PUT _river/packageindex_river/_meta
{
  "type": "fs",
  "fs": {
    "url": "/server/package/data/directory/path",
    "update_rate": "12h",
    "json_support": true
  },
  "index": {
    "index": "packagedata",
    "type": "packageindex"
  }
}
We define our search queries according to the ES Query DSL to look in the index for user queries. ES uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model, but adds modern features like a coordination factor, field length normalization, and term or query clause boosting6.
4.2.1 Package Level Search
Figure 4.5 depicts the query used for package level search. ES looks for matches for the user search query in the following fields: name, author, summary, description and keywords. Based on the matches it ranks the results and returns the name, author, summary, description, version, keywords and number of downloads in the last month for each of the top n ranked Python packages, where n is the number of matching packages requested. The summary and description of a matched package are returned with the matching terms highlighted.
6 https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
PUT moduledata
{
  "settings": {
    "analysis": {
      "filter": {
        "code1": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "(\\d+|\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"
          ]
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "pattern",
          "filter": [ "code1", "lowercase", "snowball" ]
        }
      }
    }
  }
}
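The effect of this analyzer on identifiers can be approximated in plain Python. The regex below is a rough ASCII stand-in for the Unicode-aware ES pattern above, written only to illustrate how identifiers split into tokens before the lowercase and snowball filters run; it is not the analyzer itself.

```python
import re

# ASCII approximation of the ES pattern (\d+|\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}+)
TOKEN_PATTERN = re.compile(r"\d+|[a-z]+|[A-Z][a-z]+|[A-Z]+")

def code_tokens(identifier):
    """Split an identifier roughly the way the custom "code" analyzer would."""
    return TOKEN_PATTERN.findall(identifier)

print(code_tokens("SecureCookieSession"))  # ['Secure', 'Cookie', 'Session']
print(code_tokens("open_session2"))        # ['open', 'session', '2']
```

Splitting camel case and underscores this way is what lets a query like "cookie session" match the class name SecureCookieSession.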
Figure 4.6 depicts the query used for module level search. ES detects matches to the user search query in the following fields of a module document: class, class_function, class_variable, function, function_function, function_var and variable. Different weights are assigned to matches in different fields based on their importance. For example, a match in the class field weighs more than a match in the function field of a module. Weights are assigned using a caret (^) sign followed by a number. Based on the matches, ES ranks the results and returns the path to the module (module_path) where the match occurred. Using this information, we retrieve the matching code snippets.
PUT moduledata/moduleindex/_mapping
{
  "pypimtype": {
    "properties": {
      "module": {
        "type": "string",
        "store": "yes",
        "analyzer": "code"
      },
      "module_path": {
        "type": "string",
        "index": "not_analyzed"
      },
      "class": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_variable": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      ...
      ...
    }
  }
}
GET packagedata/packageindex/_search
{
  "query": {
    "multi_match": {
      "query": "search_query",
      "operator": "or",
      "fields": [ "info.name^30", "info.author", "info.summary", "info.version", "info.keywords" ]
    }
  },
  "fields": [
    "info.name", "info.author", "info.summary", "info.version", "info.keywords", "info.downloads.last_month"
  ],
  "highlight": {
    "fields": {
      "summary": {},
      "description": {}
    }
  }
}
GET moduledata/moduleindex/_search
{
  "query": {
    "multi_match": {
      "query": "user_query",
      "fields": [ "class^5", "class_function^4", "class_variable", "function", "function_function", "function_var", "variable^3" ]
    }
  },
  "fields": [
    "module_path"
  ],
  "_source": false,
  "highlight": {
    "order": "score",
    "require_field_match": true,
    "fields": {
      "class": { "number_of_fragments": 5, "fragment_size": 18 },
      "class_function": { "number_of_fragments": 5, "fragment_size": 18 },
      "class_variable": { "number_of_fragments": 5, "fragment_size": 18 },
      "function": { "number_of_fragments": 5, "fragment_size": 18 },
      ...
      ...,
      "variable": { "number_of_fragments": 5, "fragment_size": 18 }
    }
  }
}
CHAPTER 5
DATA PRESENTATION
Once the data is indexed, it needs to be presented in a fast and presentable manner. We chose to create a search engine type of interface. Our goal for package level search was to provide a ranked list of relevant packages and their details for any given query. For module level search, we wanted to provide actual source code snippets related to the query. In order to display some of the source code to the user, preprocessing was necessary.
5.1 Server Setup
Our server has a simple stack setup. We use Nginx1 to handle requests, Flask2/Python3 to process and serve our data, and uWSGI4 as an interface between Nginx and Flask. Elasticsearch (ES)5 is used to hold our data for package and module level search. To make sure we are using the latest packages from PyPI, we use sqlite databases to track the modified times of each package. Much of the rendering and manipulation of the browser interface is done using JavaScript6. Our JavaScript library of choice is jQuery7.
5.2 Browser Interface
The interface features a home page with a simple text box for queries and a choice of either package or module level search. The results page displays the query information at the top and the results themselves below. The user can modify his or her query or change the search type from package level search to module level search and vice versa. In Chapter 6 we have added some sample screenshots of the browser interface.
1 https://www.nginx.com/resources/wiki/
2 http://flask.pocoo.org/
3 https://www.python.org/
4 https://uwsgi-docs.readthedocs.org/en/latest/
5 https://www.elastic.co/products/elasticsearch
6 https://developer.mozilla.org/en-US/docs/Web/JavaScript
7 https://jqueryui.com/
Figure 5.1: Package modal.
5.3 Package Level Search
When a user sends a query to the server for package level search, the query is processed and a ranked list of packages is sent back. Each package is depicted on the browser as a tile (Refer to Figure 6.3). The tile provides minimal information about the package, including the package name, the author of the package, a brief description, the number of downloads and the score assigned by the ranking algorithm. The user has the option to click the tile to view more information about the package and to visit PyPI and other sites related to the package. When the user clicks on the tile, a modal is opened that contains the detailed description, version, source code homepage, PyPI homepage and score from the ranking algorithm (Refer to Figure 5.1), statistics of the package as a bar graph, pie chart and numbers (Refer to Figure 5.2), and other packages from the author (Refer to Figure 5.3). This process relies on the ranking of packages.
Figure 5.2: Package statistics.
5.3.1 Ranking of Packages
One of the most important aspects that distinguishes this search engine from others is the use of many types of data for ranking the packages. In Chapter 3 and Chapter 4 we have discussed how all the preprocessed information about packages and modules, including basic details from PyPI, is stored in our ES server. We felt that the ES relevance algorithm was not thorough enough to return meaningful results. So we also use a few other metrics, namely Bing search results8, number of downloads, the ratio of warnings to the total number of lines, the ratio of comments to the total number of lines and the ratio of code lines to the total number of lines (gathered by Prospector and CLOC, and also visible in the package modal referenced in Figure 5.2). All of these metrics are passed as columns to the ranking algorithm; together, these columns form a matrix. After the ranking algorithm is executed, it returns a list of packages sorted in descending order of their newly generated scores and a dictionary mapping package names to their scores. Note that
8 http://datamarket.azure.com/dataset/bing/search
Figure 5.3: Other packages from author.
the ES column and the Bing column are primary columns, while the others are secondary derived columns. Tuning these column weights can be tricky. One way to fine tune this algorithm is to try different combinations of weights and learn which one works best. We can also fine tune the algorithm by adding more primary columns or secondary derived columns. For example, at the time of writing this thesis, we were experimenting with PageRank as one of the primary columns. We sought to calculate PageRank based on the import statements in each module. Just as a web page "A" links to some other page "B", in Python a module "C" imports a module "D". This can be considered a vote cast by "C" for module "D", and this information can be used to generate PageRank for modules and packages.
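The vote-counting idea can be sketched with a minimal power-iteration PageRank over an import graph. The graph below is invented for illustration; it is not PyQuery's actual data.

```python
# Hypothetical import graph: module -> list of modules it imports.
# Importing a module counts as a vote for it, just as a link votes for a page.
imports = {
    "app": ["flask", "requests"],
    "flask": ["werkzeug"],
    "requests": ["urllib3"],
    "werkzeug": [],
    "urllib3": [],
}

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in graph.items():
            if targets:
                # each import spreads the importer's rank among its targets
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # dangling module: redistribute its rank evenly
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

ranks = pagerank(imports)
```

Here werkzeug and urllib3 receive votes while app receives none, so the imported modules end up ranked above the importer.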
Ranking Algorithm.
4. Pass the list of all columns generated in steps 1, 2 and 3 to the rerank function, along with weights for each column.
5. Inside the rerank function
(a) Find the maximum length of the columns, i.e. the number of rows in the matrix.
(b) Construct a results dictionary from the set union of all cells in the matrix, with the name of the package as key and the value (score) initialized to 0.
(c) For each row in the matrix
i. For each cell in the row
A. Find the package score for its position using the formula (maxlen − row.rownumber) × weightvector(cell.columnnumber). Here "maxlen" is the total number of rows in the matrix, "row.rownumber" is the position of the current row in the matrix, and "weightvector(cell.columnnumber)" gives the weight assigned to the particular column the cell belongs to.
B. Add this score to the existing score of that package in the dictionary created in step 5(b).
At the end of this step, each package in the results dictionary will have a cumulative score for its standing in the different columns.
(d) From the results dictionary, generate a reverse sorted list with score as key. This gives the results list in reranked order.
Note that there could be duplicates between the top 20 results of the ES server and the top 20 results of Bing, so the total number of rows in the matrix passed to the ranking function is not always 40. Figure 5.4 is the sample pseudocode for the ranking algorithm. Table 5.1 is the sample matrix formed for the keyword "music". In this table, by looking at the cells highlighted with a yellow background, you can notice the duplicate name "vls-framework" between the primary columns Elasticsearch and Bing. This only means that both primary columns agree that this is a relevant result for the given keyword. By looking at the cells highlighted in red, you can notice duplicates within the same primary column, Bing. This scenario occurs because multiple versions of the same package become famous. From the table we can also notice that these duplicates are carried forward to the secondary columns, thus influencing the ranks. Whether to allow duplicates in the primary columns and to carry them forward to the secondary columns are decisions yet to be made. For now, this practice in the ranking algorithm has looked promising, with positive effects. As we investigate more use cases, if it turns out that duplicates have a negative influence, we can always eliminate them without disturbing
the nature of the ranking function. For the keyword "music", Table 5.2 shows the matching packages and their scores calculated by the ranking function, ordered in descending order.
5.4 Module Level Search
Refer to Figure 6.4 to view the template used for Module Level Search. Similar to package level search, the user sends a query to the server. However, this time around the server returns a list of code excerpts. One of our concerns was to keep the wait time of queries low, so a preprocessing step was added to retrieve the code faster.
5.4.1 Preprocessing
To reduce the wait time of searches, the source code needs to be adapted for the browser. As previously mentioned, Bandersnatch is used to create a local mirror of the PyPI repository on our server; however, only compressed packages are maintained by Bandersnatch. An uncompression step is required to examine each Python module in plain text.
Our aim was to show nicely formatted, stylized lines of code. Initially, we were going to send about twenty lines from the source code to the user and render the code on the user's side using a third party JavaScript library such as SyntaxHighlighter9. This worked well except for multi-line constructs such as doc strings. Since lines of a doc string may be missing during client-side rendering, the renderer has no way of knowing how to stylize the doc string. Instead, we fixed this by rendering the code snippets before sending them to the client (server-side rendering). For this we used Pygments10, a Python library for creating stylized code in Python and other languages in numerous formats such as HTML. This inevitably increases the amount of data sent from the server, but it ensures that the code is correctly stylized.
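The server-side rendering step can be sketched with the standard Pygments API; the snippet below is illustrative, not the exact PyQuery code.

```python
# Convert a code snippet to styled HTML on the server, so multi-line
# constructs such as doc strings are stylized correctly before anything
# reaches the browser. The snippet text is invented for illustration.
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

snippet = 'def greet(name):\n    """Say hello."""\n    return "Hello, " + name\n'
html = highlight(snippet, PythonLexer(), HtmlFormatter())
# html now contains <span> elements carrying the CSS style classes
```

The client then only has to inject the returned HTML and a stylesheet, with no knowledge of Python syntax.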
Lines of Code (LOC) is another trick used to expedite the code display. It creates a "mapping" file for each module. The mapping file stores, for each line in a module, the exact byte at which the line starts. When it is time to grab a snippet of code, the server can open the file and immediately seek to the correct spot rather than linearly pass through the preceding lines. There could be a thousand or even more lines of code that we avoid reading by using this technique. This cuts down on processing time and only requires another step in the preprocessing.
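The mapping-and-seek idea can be sketched in a few lines of Python; the file contents below are a throwaway example, and the function names are ours, not PyQuery's.

```python
import os
import tempfile

def build_line_offsets(path):
    """Preprocessing step: record the byte offset where each line starts."""
    offsets = [0]
    with open(path, "rb") as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return offsets[:-1]  # offsets[i] is where line i (0-based) begins

def read_snippet(path, offsets, start_line, num_lines):
    """Serving step: seek straight to the snippet instead of scanning."""
    with open(path, "rb") as f:
        f.seek(offsets[start_line])
        return [f.readline().decode("utf-8") for _ in range(num_lines)]

# demo on a throwaway three-line module
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("a = 1\nb = 2\nc = 3\n")
offsets = build_line_offsets(tmp.name)
snippet = read_snippet(tmp.name, offsets, 1, 2)  # lines 2 and 3
os.unlink(tmp.name)
```

The offsets would be computed once during preprocessing and stored in the per-module mapping file, so serving a match costs one seek instead of a linear scan.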
9 http://alexgorbatchev.com/SyntaxHighlighter/
10 http://pygments.org/
generateTop20ESResults():
{
    # query ES server and get top 20 results
}
generateTop20BingResults():
{
    # query bing and get top 20
    # filter out real packages from pages
}
generateOtherMetrics(listOfPackagesFromESandBing):
{
    for eachpackage in listOfPackagesFromESandBing:
        column_downloads = eachpackage.downloads
        column_warnings = eachpackage.warnings / eachpackage.lines
        column_comments = eachpackage.comments / eachpackage.lines
        column_code = eachpackage.code / eachpackage.lines
    column_downloads.sortReverse()   # more downloads, better the package
    column_warnings.sort()           # fewer warnings, better the package
    column_comments.sortReverse()    # more comments, better the package
    column_code.sortReverse()        # more code, better the package
    return column_downloads, column_warnings, column_comments, column_code
}
rerank(weightvector, matrix):
{
    maxlen = max(len(column) for column in matrix)
    # create a dict with keys from the union of cells in the matrix, values 0.0
    resultsDictionary = getDictionaryFromCells(matrix)
    # go through the matrix one row at a time
    for row in matrix:
        for cell in row:
            if cell.packagename:  # avoiding empty cells
                # the higher the score, the better the package
                resultsDictionary[cell.packagename] +=
                    (maxlen - row.rownumber) * weightvector[cell.column_number]
    resultsList = sortReverse(resultsDictionary, key=score)
    # resultsList gives the order of packages
    # resultsDictionary gives the score of each package
    return resultsList, resultsDictionary
}
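The rerank step of the pseudocode can be turned into a runnable sketch. Here each column is simply an ordered list of package names (best first) and empty strings mark empty cells; the sample columns and weights are invented for illustration.

```python
def rerank(weightvector, matrix):
    """Position-weighted reranking: earlier rows in a column score higher,
    and each column's contribution is scaled by its weight."""
    maxlen = max(len(column) for column in matrix)
    # one cumulative score per distinct package across all columns
    scores = {name: 0.0 for column in matrix for name in column if name}
    for col_number, column in enumerate(matrix):
        for row_number, name in enumerate(column):
            if name:  # skip empty cells
                scores[name] += (maxlen - row_number) * weightvector[col_number]
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered, scores

# hypothetical primary columns for the keyword "music"
es_column = ["pyspotify", "gmusicapi", "mopidy-gmusic"]
bing_column = ["gmusicapi", "pyspotify", ""]
ordered, scores = rerank([2.0, 1.0], [es_column, bing_column])
```

With the ES column weighted 2.0 and Bing 1.0, pyspotify scores 3×2 + 2×1 = 8 and tops the reranked list.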
Table 5.1: Ranking matrix for keyword music.
Elasticsearch Bing Downloads Warnings/Lines Comments/Lines Code/Lines
Table 5.2: Matching packages and their scores for keyword music.
Package Name Score Package Name Score
vis-framework 33        kurzfile 12.5
pyspotify 22.8          spilleliste 12.1
mopidy-gmusic 20.8      gmusic-rating-sync 10.8
gmusicapi 19            vkmusic 10.5
We limited the number of code snippets that are returned to 20. Sending more than 20 matches for module level search would generate a huge response and increase the response time. Before sending the top 20 results, we apply the ranking algorithm discussed for package level search, with changes to the input of the ranking function. As of now, there is only one primary column, i.e. the results from the ES server, and a list of secondary columns similar to the package level ones, but each of them points to module level statistics rather than package level statistics (except for the downloads column). For example, consider the column "Warnings/Lines": for package level search it is the ratio of the total number of warnings for the package to the total number of lines of code for the package, and for module level search it is the ratio of the total number of warnings for the module to the total number of lines of code for the module.
Figure 5.5: Module modal.
We have given the user the option to click on a code snippet; this opens a modal that displays the entire code of the module, giving users more visibility into modules. Figure 5.5 represents this modal.
CHAPTER 6
Figure 6.1 is a System Level Flow Diagram of PyQuery. A set of Python scripts is run in batch overnight to generate all the required details mentioned in the Data Collection chapter. This preprocessed information is in JSON file format. Before executing these batch scripts, the bandersnatch1 mirror client is executed so that the packages are in sync with PyPI2 and we deliver the most up to date information. All the files generated are part of either package search or module search.
We maintain separate Elasticsearch (ES)3 indexes for package search and module search. These indexes are configured to update at regular intervals if there are any changes to the files they point to. They form the core of PyQuery.
A web interface customized for an easy flow of information to users is developed in Flask4 and deployed using an NGINX5 server. Figure 6.2 is PyQuery's homepage, whose design is mainly inspired by Google's homepage. It serves separate edge nodes for package search and module search. When a user hits the package level search page, an AJAX call is made to the edge node responsible for retrieving matching packages from the ES index. Based upon the matching packages retrieved from ES, a set of metrics is formulated and passed to the ranking algorithm as discussed in Chapter 5. This returns to the requesting front end page a list of packages reverse sorted by the score calculated by the ranking algorithm, so the highest score is positioned on top. Figure 6.3 shows the result of Package Level Search on PyQuery for the keyword "flask".
When a user hits the module level search page, an AJAX6 call is made to the edge node responsible for retrieving matching modules or lines of code based on the metadata index in ES. After collecting the matching modules and the line numbers at which the matches happened, the Lines of Code (LOC) technique discussed in Chapter 5 is executed to quickly capture code snippets from the matching
1 https://pypi.python.org/pypi/bandersnatch
2 https://pypi.python.org/pypi
3 https://www.elastic.co/products/elasticsearch
4 http://flask.pocoo.org/
5 https://www.nginx.com/resources/wiki/
6 http://api.jquery.com/jquery.ajax/
Figure 6.1: System Level Flow Diagram of PyQuery.
modules and display them on the requesting front end page. Figure 6.4 shows the result of Module Level Search on PyQuery for the keyword "quicksort".
Figure 6.2: PyQuery homepage.
Figure 6.3: PyQuery package level search template.
Figure 6.4: PyQuery module level search template.
CHAPTER 7
RESULTS
Based on the above grounds of comparison, it is clear that we have met our goals to improve PyPI, offer a meaningful search and avoid closely and similarly scored packages.
Table 7.3: Results comparison for keyword - pygments.
Keyword: pygments
# of results from PyPI: 250
# of results from PyQuery: 29
Top 5 results from PyPI: Pygments, django mce pygments, pygments-asl, pygments-gchangelog, pygments-rspec
Top 5 results from PyQuery: Pygments, pygments-style-github, Xslfo-Formatter, Bibtex-Pygments-Lexer, Mistune
First 5 scores - PyPI: 11, 9, 9, 9, 9
First 5 scores - PyQuery: 88.10, 31.50, 28.50, 24.90, 19.30
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (11)
Rank (score) of expected match - PyQuery: 1 (88.10)
Table 7.4: Results comparison for keyword - Django.
Keyword: Django
# of results from PyPI: 11292
# of results from PyQuery: 38
Top 5 results from PyPI: Django, django-hstore, django-modelsatts, django-notifications-hq, django-notifications-hq
Top 5 results from PyQuery: Django, Django-Appconf, Django-Celery, Django-Nose, Django-Inplaceedit
First 5 scores - PyPI: 10, 10, 10, 10, 10
First 5 scores - PyQuery: 75.30, 25.50, 25.20, 25.20, 23.50
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (10)
Rank (score) of expected match - PyQuery: 1 (73.30)
Table 7.5: Results comparison for keyword - pylint.
Keyword: pylint
# of results from PyPI: 361
# of results from PyQuery: 25
Top 5 results from PyPI: gt-pylint-commit-hook, plpylint, pylint, pylint-patcher, pylint-web2py
Top 5 results from PyQuery: Pylint, Pylint2tusar, Django-Jenkins, Pylama pylint, Logilab-Astng
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 91.90, 21.80, 20.60, 17.30, 16.00
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 3 (9)
Rank (score) of expected match - PyQuery: 1 (91.90)
Table 7.6: Results comparison for keyword - biological computing.
Keyword: biological computing
# of results from PyPI: 66
# of results from PyQuery: 23
Top 5 results from PyPI: blacktie, appdynamics, appdynamics, appdynamics, inspyred
Top 5 results from PyQuery: BiologicalProcessNetworks, Blacktie, PyDSTool, PySCeS, Csb
First 5 scores - PyPI: 3, 2, 2, 2, 2
First 5 scores - PyQuery: 15.10, 14.10, 12.90, 12.50, 11.80
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.7: Results comparison for keyword - 3D printing.
Keyword: 3D printing
# of results from PyPI: 26
# of results from PyQuery: 31
Top 5 results from PyPI: fabby, tangible, blockmodel, citygml2stl, demakein
Top 5 results from PyQuery: Pymeshio, Demakein, C3d, Bqclient, Pyautocad
First 5 scores - PyPI: 7, 7, 6, 5, 4
First 5 scores - PyQuery: 44.00, 21.50, 18.90, 18.90, 18.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.8: Results comparison for keyword - web development framework.
Keyword: web development framework
# of results from PyPI: 801
# of results from PyQuery: 32
Top 5 results from PyPI: HalWeb, WebPages, robotframework-extendedselenium2library, robotframework-extendedselenium2library, robotframework-extendedselenium2library
Top 5 results from PyQuery: Django, Pyramid, Pylons, Moya, Circuits
First 5 scores - PyPI: 16, 16, 15, 15, 15
First 5 scores - PyQuery: 65.80, 48.20, 37.60, 32.60, 23.80
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.9: Results comparison for keyword - material science.
Keyword: material science
# of results from PyPI: 52
# of results from PyQuery: 29
Top 5 results from PyPI: py bonemat abaqus, MatMethods, MatMiner, pymatgen, pymatgen (Note: 4 and 5 are duplicate results.)
Top 5 results from PyQuery: FiPy, Pymatgen, Pymatgen-Db, Custodian, Mpmath
First 5 scores - PyPI: 7, 6, 6, 6, 6
First 5 scores - PyQuery: 57.50, 55.20, 20.30, 19.70, 17.50
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.10: Results comparison for keyword - google maps.
Keyword: google maps
# of results from PyPI: 290
# of results from PyQuery: 31
Top 5 results from PyPI: Product.ATGoogleMaps, trytond google maps, django-google-maps, djangocms-gmaps, Flask-GoogleMaps
Top 5 results from PyQuery: Googlemaps, Django-Google-Maps, Flask-GoogleMaps, Gmaps, Geolocation-Python
First 5 scores - PyPI: 18, 18, 14, 14, 14
First 5 scores - PyQuery: 50.20, 39.40, 39.20, 39.00, 38.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (Note: Results are relevant to the query, but they are missing general purpose packages like Googlemaps among the top 5.)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4
CHAPTER 8
We believe we have succeeded in developing a dedicated search engine for Python packages and modules, and we expect the Python community to adopt PyQuery widely. PyQuery would allow Python developers to explore well written, widely adopted, famous and highly apt Python packages and modules for their programming needs. It will offer itself as an encouraging tool for the Python community to follow the software engineering practice of code reuse.
8.1 Thesis Summary
In this thesis we have proposed some concrete ideas on how to develop a dedicated search engine for Python packages and modules. We have sought to build an improved version of the state of the art Python search engine, PyPI. Although PyPI is the first and only tool to address this problem, its results are found to be of little use with respect to user needs and requirements. We have discussed various
tools and techniques which are brought together as one single tool called PyQuery, for facilitating better
search, better rank and better package visibility. With PyQuery we want to bridge the gap between the
high demand of means and ways to deliver reusable components in Python for code reuse and the lack of
efficient tools at users disposal to achieve it. In Chapter 1 we discussed the relevance of this problem, our
objective and approach towards solving the problem. In Chapter 2 we highlighted the related work in this
area. For package level search, PyPI being the only search engine that does Python package search, we have elaborated on how the PyPI search algorithm works and offered reasons as to why we think it needs improvement. For module level search, there isn't any dedicated code search engine for Python, so we
have explored code search engines that work across multiple languages and reasoned the need for a
dedicated search engine for Python. PyQuery is divided into three different components: Data Collection,
Data Indexing and Data Presentation. Since we intend to provide two modes of search operations i.e.
Package Level Search and Module Level Search, at each component we employ a list of tools and
techniques to achieve specific goals related to these modes. In Chapter 3 we discussed the Data Collection module and the use of the Bandersnatch1 PEP 381 mirror client to clone Python packages locally and later process these packages using code analysis tools like Prospector2 and CLOC3. We explored how to make use of Abstract Syntax Trees (ast) to filter out useful information or metadata from Python modules. We also addressed the JSON file format for saving all this information, with an example for each type of data.
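The ast-based extraction can be sketched with the standard library alone; the sample module text and the dictionary layout below are illustrative, not PyQuery's exact output format.

```python
import ast

# throwaway module text for illustration
source = '''
class SessionMixin:
    def open_session(self):
        pass

def save_session():
    pass
'''

def extract_metadata(code):
    """Walk the syntax tree and collect class/function names with line numbers."""
    tree = ast.parse(code)
    meta = {"class": [], "function": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            meta["class"].append((node.name, node.lineno))
        elif isinstance(node, ast.FunctionDef):
            meta["function"].append((node.name, node.lineno))
    return meta

meta = extract_metadata(source)
```

Because ast records a line number for every definition, the same pass that finds identifiers also yields the positions needed later for snippet retrieval.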
In Chapter 4 we demonstrated how to feed structured data to Elasticsearch (ES)4 and make use of the FSRiver5 and Analyzer6 plugins to digest the data. ES is built on top of Apache Lucene7 and offers a wide variety of methods to configure data indexing and data retrieval. We explained the purpose behind agreeing to a specific format for the JSON data file, so that we can make use of the configurations ES offers. One such configuration is the minimum fragment size. By setting the minimum fragment size to 18, and collecting each identifier together with its line number as one word separated by underscores and right filled with underscores to a minimum length of 18, we were able to get a matching identifier and its line number as one single match. This reduced the size of the JSON file indexed in ES drastically and also saves the time of fetching the line number from another key. In this chapter, we also outlined some sample queries to index and retrieve meaningful information out of the indexed structured data.
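The fragment encoding described above can be sketched in a couple of lines; the function name and sample identifier are ours, chosen only to show the underscore-joined, underscore-padded form.

```python
MIN_FRAGMENT = 18  # matches the fragment_size used in the highlight settings

def fragment(identifier, lineno):
    """Join identifier and line number into one word, right-padded with
    underscores so a highlight fragment always covers the whole word."""
    word = f"{identifier}_{lineno}"
    return word.ljust(MIN_FRAGMENT, "_")

print(fragment("save_session", 301))  # 'save_session_301__'
```

A single highlight fragment of size 18 then returns both the identifier and its line number in one match, with no second lookup.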
In Chapter 5, we covered the data presentation concepts like browser interface and server setup.
We discussed our implementation of a server side ranking algorithm for package and module level search, the columns involved in the ranking metrics and an example view of these columns for a sample query. Also, we presented our preprocessing implementation for faster code search, which involves generating the starting byte address of each line in a module and stylizing code with Pygments. In Chapter 6 we gave an
overview of how all three components of PyQuery will work together with a system level flow diagram.
Finally, in Chapter 7 we compared results of PyQuery with that of PyPI to prove that we have achieved
our goal to improve PyPI, offer a meaningful search and avoid closely and similarly scored packages.
1 https://pypi.python.org/pypi/bandersnatch
2 https://github.com/landscapeio/prospector
3 http://cloc.sourceforge.net/
4 https://www.elastic.co/products/elasticsearch
5 https://github.com/dadoonet/fsriver
6 https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
7 https://lucene.apache.org/core/
8.2 Recommendation for Future Work
Although PyQuery accomplished the initially established goals, there is definitely scope for improvement. In this section, we would like to list ways to improve it further.
We want to perform a large scale comparison of PyQuery. Currently, we have tested PyQuery with a set of keywords for which we know the matching packages. We observed that PyQuery is doing better than the
state of the art, PyPI. Python is an extensive language and people from many different fields use the
Python programming language to solve problems in their respective disciplines. In this process, there is
always a continuous production of packages that are useful. There are thousands of packages that are
pretty famous for various reasons. Knowing in advance all the possible keywords that map to these
packages is nearly impossible. A tool gains popularity and importance only when it is widely accepted by its user base. By reaching out to developers in the Python community from various disciplines, we can gauge how well PyQuery is mapping keywords to the right packages. We want to plan for large scale user surveys by
asking professional developers to search for packages that they use by direct package name and by
keywords that infer package purpose. We want to collect their feedback and learn if PyQuery is meeting
their requirements and if it is doing a better job than PyPI. We would like to list out use cases where
PyQuery needs to do better.
We can extend PyQuery to a recommendation system. We can apply the collaborative filtering technique, i.e., capture user actions to learn their likes and dislikes of the Python packages we suggest, and later use this data to predict a list of packages a user would find interesting. This will allow
further improvements to PyQuery. If a user trusts a specific author and tends to explore that author's packages more often, we could make search results more appealing by promoting packages from this author among the set of initially matched packages. If a user tends to explore packages specific to a field or category, it is likely that he or she works in that field; if a user management component is added to PyQuery, then every time a user logs in to the website we can suggest popular packages from his or her field on the dashboard, or suggest the latest news about updates to packages related to that field. These are a few of the many possibilities the collaborative filtering technique opens up to facilitate better search operations. It will allow developers to receive the latest information on packages and help them make the best of Python packages and modules. Many successful giants in the field of entertainment, like Netflix and Comcast, use the collaborative filtering technique to keep their users engaged with their websites. Since PyQuery seeks to help developers explore Python packages, it could find great purpose for collaborative filtering.
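As an illustration of the idea, here is a minimal user-based collaborative filtering sketch. The interaction data, scores, and function names are hypothetical, invented for this example; they are not part of PyQuery:

```python
from math import sqrt

# Hypothetical user -> {package: interest score} interactions; illustrative only.
ratings = {
    "alice": {"numpy": 5, "scipy": 4, "flask": 1},
    "bob":   {"numpy": 4, "scipy": 5, "pandas": 4},
    "carol": {"flask": 5, "django": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[p] * v[p] for p in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, k=2):
    """Suggest up to k packages the user has not seen, weighted by similar users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for pkg, score in theirs.items():
            if pkg not in ratings[user]:
                scores[pkg] = scores.get(pkg, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In this toy data, a user with tastes similar to another's receives that user's unseen packages first; a production system would compute similarities offline over much larger interaction logs.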
BIOGRAPHICAL SKETCH
My name is Shiva Krishna Imminni, and I was born in the metropolitan city of Hyderabad, India. My father Mr. Nageswara Rao is a government employee and my mother Mrs. Subba Laxmi is a homemaker. They are
my biggest inspiration and support. I am the elder of my parents' two children. My sister Ramya Krishna Imminni is very close to my heart and very special to me. My family is the guiding force behind the success I have had in my career.
I received my Bachelor's degree from Jawaharlal Nehru Technology University in May 2011 and joined FactSet Research Systems as a QA Automation Analyst. At FactSet, I wrote QA automation scripts in various languages like Java, Ruby and JScript. I worked on various automation frameworks like
TestComplete and Selenium. I was one of the first three employees hired for the QA Automation process, so I had many opportunities to try various job roles and experiment with new technologies. Of all the job roles I held, I liked training new hires the most. I was promoted to QA Automation Analyst 2 within a year and was awarded Star Performer for the year 2013. It was at FactSet that I developed Testlogger, a Ruby library to generate XML log files, custom-built for QA terminology like <testcase> and <teststep>. I worked at FactSet for two years, from November 2011 to December 2013, and
gained diverse experience performing various roles. I joined the Department of Computer Science at Florida State University as a Master of Science student in Spring 2013. At FSU I continued to gain professional experience, working part-time as a Software Developer and Graduate Research Assistant at iDigInfo, the Institute of Digital Information and Scientific Communication. At iDigInfo, I worked on various projects related to research in specimen digitization. Some of these projects include Morphbank,
a continuously growing database of images that scientists use for international collaboration, research and education, and iDigInfo-OCR, optical character recognition software for digitizing label information of specimen collections. I also worked as a Graduate Teaching Assistant for a Bachelor-level Software
Engineering course. As part of my coursework, I took a Python course under Dr. Piyush Kumar that led to my interest in working on PyQuery. The experience I gained while working on PyQuery helped me get an internship opportunity: during the Summer of 2015, I interned with Bank of America. As an intern, I worked on various Big Data technologies, including Hadoop HDFS, Hive and Impala.