sEARCH eNGINE

Florida State University Libraries
2015
Pyquery: A Search Engine for Python

Packages and Modules
Shiva Krishna Imminni
Follow this and additional works at the FSU Digital Library. For more information, please contact lib-ir@fsu.edu
FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
PYQUERY:
A SEARCH ENGINE FOR PYTHON PACKAGES AND MODULES
By
SHIVA KRISHNA IMMINNI
A Thesis
Department of submitted
Computer to the in
Science
partial fulfillment of the
requirements for the degree of
Master of Science
2015
Copyright Ⓧc 2015 Shiva Krishna Imminni. All Rights Reserved.
Shiva Krishna
members of theImminni defended
supervisory this thesis
committee were:on November 13, 2015. The
Piyush Kumar
Professor Directing Thesis
Sonia Haiduc
Committee Member
Margareta
CommitteeAckerman
Member
The Graduate
the thesis School
has been has verified
approved and approved
in accordance with the above-named
university committee members, and certifies that
requirements.
ii
Iwho
dedicate
madethis
me thesis to myI am
the person family. I am
today. grateful
I am to my
thankful loving
to my parents, sister,
affectionate Nageswara Rao
Ramya and Subba
Krishna, who Laxmi,
is very
special to me and always stood by my side.
i
ACKNOWLEDGMENTS
Idirecting
owe thanks
my to manyWithout
thesis. people. his
Firstly, I would support,
continuous like to express myguidance
patience, gratitude and
to Dr. Piyush knowledge,
immense Kumar for
PyQuery wouldn’t be so successful. He truly made a difference in my life by introducing me to Python
programming language and helped me learn how to contribute to Python community. He trusted me and
remained patient during the difficult times. I would also like to thank Dr. Sonia Haiduc and Dr. Margareta
Ackerman for participating on the committee, monitoring my progress and providing insightful
comments. They helped me learn multiple perspectives that widened my research. I would like to thank
my team members Mir Anamul Hasan, Michael Duckett, Puneet Sachdeva and Sudipta Karmakar for their
time, support, commitment and contributions to PyQuery.
i
TABLE OF CONTENTS
List of Tables.............................................................................................................................................vii
List of Figures..........................................................................................................................................viii
Abstract......................................................................................................................................................ix
1 Introduction 1
1.1 Objective.......................................................................................................................................2
1.2 Approach......................................................................................................................................2
23 Related Work
Data Collection 4
7
3.1 Package Level Search....................................................................................................................7
3.1.1 Metadata - Packages.........................................................................................................7
3.1.2 Code Quality....................................................................................................................8
3.2 Module Level Search..................................................................................................................10
3.2.1 Mirror Python Packages..................................................................................................10
3.2.2 Metadata - Modules........................................................................................................11
3.2.3 Code Quality..................................................................................................................12

4 Data Indexing and Searching 14
4.1 Data Indexing..............................................................................................................................14
4.1.1 Package Level Search.....................................................................................................14
4.1.2 Module Level Search......................................................................................................14
4.2 Data Searching............................................................................................................................16
4.2.1 Package Level Search.....................................................................................................16
4.2.2 Module Level Search......................................................................................................17

5 Data Presentation 21
5.1 Server Setup................................................................................................................................21
5.2 Browser Interface........................................................................................................................21
5.3 Package Level Search..................................................................................................................22
5.3.1 Ranking of Packages.......................................................................................................23
5.4 Module Level Search..................................................................................................................26
5.4.1 Preprocessing.................................................................................................................26
6 System Level Flow Diagram 31
7 Results 36
8 Conclusions and Future Work 42
8.1 Thesis Summary.........................................................................................................................42
8.2 Recommendation for Future Work..............................................................................................44

Bibliography..............................................................................................................................................46
v
Biographical Sketch...................................................................................................................................47
v
LIST OF TABLES
5.1 Ranking matrix for keyword music...............................................................................................28
5.2 Matching packages and their scores for keyword music................................................................29
7.1 Results comparison for keyword - requests...................................................................................37
7.2 Results comparison for keyword - flask........................................................................................37
7.3 Results comparison for keyword - pygments.................................................................................38
7.4 Results comparison for keyword - Django....................................................................................38
7.5 Results comparison for keyword - pylint.......................................................................................39
7.6 Results comparison for keyword - biological computing..............................................................39
7.7 Results comparison for keyword - 3D printing..............................................................................40
7.8 Results comparison for keyword - web development framework..................................................40
7.9 Results comparison for keyword - material science......................................................................41
7.10 Results comparison for keyword - google maps............................................................................41
v
LIST OF FIGURES
3.1 Metadata from PyPI for package requests.......................................................................................9
3.2 Pseudocode for collecting metadata of a module...........................................................................12
3.3 Metadata for a module in Flask package........................................................................................13
4.1 Package level data index mapping.................................................................................................15
4.2 River definition.............................................................................................................................16
4.3 Custom Analyzer with custom pattern filter..................................................................................17
4.4 Module level data indexing mapping............................................................................................18
4.5 Package level search query............................................................................................................19
4.6 Module level search query............................................................................................................20
5.1 Package modal...............................................................................................................................22
5.2 Package statistics...........................................................................................................................23
5.3 Other packages from author...........................................................................................................24
5.4 Pseudocode for ranking algorithm.................................................................................................27
5.5 Module modal...............................................................................................................................30
6.1 System Level Flow Diagram of PyQuery......................................................................................32
6.2 PyQuery homepage.......................................................................................................................33
6.3 PyQuery package level search template.........................................................................................34
6.4 PyQuery module level search template.........................................................................................35
v
ABSTRACT
Python Package
community. Index
It hosts (PyPI) is
thousands of apackages
repository thatdifferent
from hosts alldevelopers
the packages
and ever developed
for the for the Python
Python community, it is
the primary source for downloading and installing packages. It also provides a simple web interface to
search for these packages. A direct search on PyPI returns hundreds of packages that are not intuitively
ordered, thus making it harder to find the right package. Developers consequently resort to mature search
engines like Google, Bing or Yahoo which redirect them to the appropriate package homepage at PyPI.
Hence, the first task of this thesis is to improve search results for python packages.
Secondly,
perform a codethis search
thesis also attempts
targeting to develop
python a new
modules. search engine
Currently, that allows
the existing Python
search developers
engines to
classify
programming languages such that a developer must select a programming language from a list. As a
result every time a developer performs a search operation, he or she has to choose Python out of a
plethora of programming languages. This thesis seeks to offer a more reliable and dedicated search engine
that caters specifically to the Python community and ensures a more efficient way to search for Python
packages and modules.
i
CHAPTER 1
INTRODUCTION
Python is and
simplicity a high-level programming
can perform language
more functions basedlines
in fewer on of
simplicity
code. Inand efficiency.
order It emphasizes
to streamline code
code and speed
up development many programmers use application packages, which reduce the need to copy definitions
into each program. These packages consist of written application software in the form of different
modules, which contain the actual code. Python’s main power exists within these packages and the wide
range of functionality that they bring to the software development field. Providing ways and means to
deliver information about these reusable components is of utmost importance.
PyPI,the
meeting a software repository
user needs. for Python
It implements packages,
a trivial rankingoffers a search
algorithm featurematching
to detect to look for available
packages forpackages
a user’s
keyword, resulting in a poorly sorted, huge list of closely and similarly scored packages. From this
immense list of results, it’s hard to find a package efficiently in a reasonable amount of time that meets
user needs. Due to lack of an efficient native search engine for the Python community, often developers
rely on mature and multipurpose search engines like Google, Yahoo and Bing. In order to express his or
her interests in python packages, a developer taking this route has to articulate his query and on top of that
provide additional input. A dedicated search engine for the Python community would bypass the need to
specify once interest in Python. One may argue that such a search engine wouldn’t alter the experience of
developers who is searching for python packages. However, considering the number of times a developer
search for packages, code and related information on an average is five search sessions with 12 total
queries each workday [1], it is desirable to propose a search engine that saves a lot of time and effort.
An additional method of software development that saves time is the practice of Code reuse
[2]. There of
availability arethe
many
rightfactors
tools tothat
findinfluence practice of code
reusable components. reuse for
Searching [3]code
[4]. is
One of thechallenge
a serious such factors is
1
for developers.
ameliorate Code search
the hardship engines
to search suchbyastargeting
for code Krugle 1,a Searchcode
wide range ofand
2
BlackDuck
languages.
3
attempted
Currently, to
Python
developers who conduct code searches have to learn how to configure these search engines so that the
search engines display python specific results. As a result all though these search engines exemplify the
ideal that one product may solve all kinds of problems, such an ideal fails to overcome the problems faced
by Python developers. Python developers would rather rely on a code search engine that is designed for
searching python packages exclusively.
1.1 Objective
This(PyQuery)
engine thesis seeks
4 to enables
that contribute to thedevelopers
Python Python community by Python
to search for developing a dedicated
Packages Python
and Mod- ules search
(code)
and encourage them to take benefit of an important software development practice, code reuse. With
PyQuery we want to facilitate the search process, improve the query results, and collect packages and
modules into one user-friendly interface that provides the functionality of a myriad of code search tools.
PyQuery will also be able to synchronize with the Python Package Index to provide users with code
documentation and downloads, thereby providing all steps in the code search process.
1.2 Approach
PyQuery is
Presentation. Theorganized into threecomponent
Data Collection separate components: Data
is responsible for Collection,
collecting Data
all theIndexing and Data
data required to
facilitate the search operation. This data includes source code for packages, metadata about packages
from PyPI, preprocessed metadata about modules, code quality analysis for packages and other relevant
information that helps us deliver meaningful search results. In order to provide the most recent updates to
packages at PyPI, we ensure that the data we collect and maintain is always in sync with changes made at
PyPI. For this reason, we use the Bandersnatch5 mirror client of PyPI which keeps track of changes
utilizing state files.
1
http://www.krugle.com/ 2https://searchcode.com/
3
https://code.openhub.net/ 4https://pypi.compgeom.com/
5
https://pypi.python.org/pypi/bandersnatch
2
TheModule
Collection Data Indexing component
in a structured stores
schema thatallfacilitates
the data we have
faster collected
search andfor
queries processed
matchinginpackages
the Data and
modules. We used Elasticsearch (ES)6, a flexible and powerful, open source, distributed, real- time search
and analytics engine, built on top of Apache Lucene to index our data. We rely on FileSystem River
(FSRiver)7, an ES plugin, to index documents from the local file system using SSH. In ES, we used
separate indexes for files related to module level search to those of package level search. By using this
approach, we can map each query to its specific type and related files.
The Data
appealing andPresentation component
easy to follow. We havedelivers
used matched
Flask8 forsearch
serverresults
side to the user When
scripting. in a fashion
a userthat is both
query for
matching packages, we send a query to the index responsible for packages and retrieve required details
that allow the user to see the most significant packages, their scores, statistics and other relevant
information. We implemented a ranking algorithm that works on fine tuning ES results by sorting them
based on various metric. Additionally, when a user query for matching modules, a request is sent to the
ES index for modules that contain metadata (Ex: class name, method name, etc.) to get a list of matches
along side their line number and path to module on the server. For every match, a code snippet containing
matching line is rendered using Pygments9. To reduce time for processing matched results, all the
modules are preprocessed with pygments and each line number is matched to their starting byte address in
the file, so that the server can quickly open the pygment file, seek to the calculated byte location, and pull
the required piece of HTML code snippet.
6
https://www.elastic.co/products/elasticsearch
7
https://github.com/dadoonet/fsriver 8http://flask.pocoo.org/
9
http://pygments.org/
3
CHAPTER 2
RELATED WORK
Search engines
differentiate the employ metrics
significant for scoring
packages. and ranking,
Additionally, but these
these metrics dometrics are often
not exhibit all thelimited andthat
qualities domay
not
be relevant to what a user wants out of a specific module or package.
The aPyPI
follows very[5] website
simple signifies
search the exemplar
algorithm which givesfor athis project.
score When
for each one searches
package based onfor thepackages, PyPI
query. Certain
fields such as name, summary, and keywords are matched against the query and a binary score for each
field is computed (basically a “yes; it matched” or “no; it didn’t”). A weight is given for each field, and
the composite scores from each field are added to create a total score for each package. Packages are first
sorted by score and then in lexicographical order of package names. We found this information at
stackoverflow1 and followed the steps given to confirm the working of the PyPI searching algorithm.
The above method employed by PyPI works, but it doesn’t distinguish the packages very well.
For example,the
(fortunately, searching for framework,
Flask web “flask” will which
yield 1750
shouldresults
be at with a top
the top score
when of 9 given
searched, to 162
is listed at 4packages
th
due to
sorting based on alphabetical order). This also makes it very easy to influence the outcome of popular
queries if you are the package developer. An algorithm which resist the influence of a package owner
would be a better fit for reliable package searches.
PyPI only
searches Ranking [6] Packages,
Python is another and
website created
no other by Taichino
languages. that
It has is similar
a search to PyPI
function thatand PyQuery
takes as it
in a user’s
search query and finds relevant packages. It also syncs with PyPI so that the user can access the
information contained on PyPI such as documentation and downloads. The main difference, however, is
that PyPI Ranking ranks packages based only on the number of downloads, so packages with more
downloads will emerge higher up on the list. This means that packages get more value based on their
popularity, which is a valuable metric, but not the only valuable
1
http://stackoverflow.com/questions/28685680/what-does-weight-on-search-results-in-pypi-help-
in-choosing-a-package
4
metric.
packageFurthermore, the website
level search function and only allows
module levela search
packagefunction,
level search, whereas
providing morePyQuery
resourcescontains both
to the user.
Additionally, the website makes use of Django to facilitate the web development whereas PyQuery uses
the microframework Flask.
There are multiple code search engines that allow users to look for written code that relates to
their search3,query.
BlackDuck These code4.search
and SearchCode Theseengines
websites include
allowwebsites
users tosuch
enterasaKrugle
search, query,
2
Open Huband Code Search
then they list-
sample lines of code based on the results from their search6 query. These websites are limited; however,
because they can only search code at GitHub , Sourceforge and other open source repositories. They are
5
not contained within one context in which a user might want to find a specific package or module.
Additionally, the websites do a search based purely on term occurrence, by identifying the user’s search
term within the lines of code and returning the code samples with numerous hits. The results a user
receives on their search key may not address what they want, but rather just contain the term itself.
Consequently, the results are not scored due to the lack of relevant information to incorporate as metrics.
PyQuery accesses data directly from PyPI, preprocess the data to extract useful information from code,
indexes the data within itself, searches the data, and reorders it based on ranking function. PyQuery is also
constructed within the Python community so that Python packages and modules are only ranked against
other Python packages and modules. These results are more valuable due to the metrics they are based on
and the nature of the searching algorithm.
In the past, people have attempted to do code search in languages like Java7 based on semantics
that
Thisare test that
means driven [7]need
they and to
required
provideusers
both to
thespecify
syntaxwhat theysemantics
and the are searching fortarget.
of their as precisely as possible.
Furthermore pass a
set of test cases that include sample input and expected output to filter potential set of matches. This is a
great technique to search for code that can be reused; however, it has its limitations. This tool requires the
kind of detail regarding the input that the user will not know in the first place. This tool is more helpful
for testing the reusable coding entity whose path from the
2
http://www.krugle.com/
3
https://code.openhub.net/
4
https://searchcode.com/
5
https://github.com/
6
http://sourceforge.net/
7
http://www.oracle.com/technetwork/java/index.html
5
package root ex:
first character of str.capitalize()
string, he mayisguess
knownthetofunction
the username
precisely.
to beIfcapitalize()
a user is looking fornot
but may code that capitalizes
precisely know it
can be found in str package with the signature str.capitalize(). If a user usually knows this information, he
or she may directly look inside usage documents to see if it meets his or her requirements (though he or
she may have to execute test cases on their own).
pathNullege is a dedicated
like structure searchbyengine
(“/” replaced for Python
“.”) used source
in python codestatements
import [8]. As athat
keyword,
al- waysit start
requires
at the alevel
UNIXof
the package root folder. Some of the sample queries for Nullege include “flask.app.Environment”,
“requests.adapters.BaseAdapter” and “scipy.add”. Results from the search operation on Nullege point to
the source code where the programming entity is im- ported. This is a useful tool for users who are
familiar with folder structure of the package and are generally curious in exploring its source code or to
learn packages that import them. A user can’t directly pass a generic keyword(s) that infers the purpose of
programming entity he or she is interested in. For users who want to learn if there exist a reusable
component for a specific task at hand and are not aware of precise location to look at, Nullege is not the
right tool. Because of limitations imposed on the input and type of results returned, Nullege can be
classified as an exploration tool for source code tree of Python packages rather than a search engine for
source code. PyQuery allows users to perform a generic keyword search without limitations in input like
those of Nullege. PyQuery results are usually code snippets that point to definitions of programming
entities rather than import statements.
We have used Abstract Syntax Tree (AST) to collect various programming entity names and
their
AST.line
Somenumbers in modules for
of the applications code include
of AST search. Many research topics
Semantics-Based Codethat analyze
Search [7],software code often
Under- standing use
source
code evolution using abstract syntax tree matching [9] and Employing Source Code Information to
Improve Question-Answering in Stack Overflow [10]. These implementations construct an ast for code
at consideration and extract needed information by walking through the tree or directly visiting the
required node. For this purpose, we have used ast module [11] in Python. Chapter 3 elaborates on how we
extract metadata about modules for code search.
6
CHAPTER 3
DATA COLLECTION
For anybesearch
could of anyengine to work,
form and it requires
any type. data
For the to perform
problem search
we plan operations.
to solve, Data
we have to could be the
address anything. It
question
“What kind of data are we interested in?”. We are engrossed in data related to Python packages that can
help us return meaningful results for a user query. We intend to provide two flavors to the search engine:
Package Level Search and Module Level Search. Let us examine tools and configurations that help us
collect required data to achieve this goal.
3.1 Package Level Search
A package
example: is a collection
“requests” package isofdeveloped
modules, towhich arehttp
handle meant to solve According
capabilities. the problem(s)
to itsof some type.
homepage 1 For
, it has
various features, including International Domains and URLs, Keep-Alive and Connection Pooling,
Sessions with Cookie Persistence, Browser-style SSL Verification, etc. A user interested in these features
would like to use this library to solve his or her problem. A developer may produce a library and assign a
name to it that may or may not directly have any relation with the purpose of the library. A user would get
to know whether the library helps solve his problem not just by looking at its name alone but the
description, sample usage and other useful metadata about the package mentioned at its homepage.
Sometimes when a user has to pick between multiple packages that are trying to solve the same problem,
criteria like popularity of author, number of downloads, frequency of releases, code quality and efficiency
starts to factor. A search engine that returns Python package as matches to a user query would require
similar information.
3.1.1 Metadata - Packages
codeWe have discussed

quality, popularityhow a developer’s
of author description
and number of a package
of downloads helps aonuser
its to
homepage, frequency of release,
decide whether
1
http://docs.python-requests.org/en/latest/
7
aforgiven libraryquery
the user’s solves hisiformultiple
and her problem. PyQuery
packages qualify,needs this information
to prioritize one over to
thesearch
other.for
Onematching packages
direct way to get
description of a package is to crawl its homepage at PyPI2. Though this sounds pretty straight forward and
easy, gathering URL information for the latest stable release for each package and maintaining this
information could be tricky and searching for required information in crawled data could be time
consuming.
Wemetadata
access found aninformation
elegant andabout
mucha simpler
package way
via atohttp
gather metadata
request 3
. This of a package.
would return aPyPI
JSONallows users
file with to
keys
such as description, author, package url, downloads.last month, downloads.last week, downloads.last day,
releases.x.x.x,4 etc. For example: one can query the PyPI website for metadata about “requests” package
through URL . Refer to Figure 3.1 for a sample response from PyPI.
3.1.2 Code Quality
PEPdevelopers
Python 0008 – Style Guide
should for Python
incorporate Code
into their, describes
5
a setstandards
code. These of semantic rules and
are highly guidelinesby that
encouraged the
Python community. Standard libraries that are shipped with installation are written using these
conventions. One main reason to emphasize the standardizing style guide is to increase code readability.
The Python code base is pretty huge, and it is important to maintain consistency across it. Conventions set
in Python style guide makes Python language so beautiful and easy to follow as you read.
Code Quality
consideration of a package
is following can guide
the style be measured in multiple
for Python ways.
code. The First,
Python we can check
community if thetopackage
has tools at
check the
package compliance with the style guide. PEP86 is a simple Python module7 that uses only standard
libraries and validates any Python code against the PEP 8 style guide. Pylint is another such tool that
checks for line length, variable names, unused imports, duplicate code and other coding standards against
PEP 8.
2
https://pypi.python.org/pypi
3
http://pypi.python.org/pypi/<package_name>/json
4
http://pypi.python.org/pypi/requests/json
5
https://www.python.org/dev/peps/pep-0008/
6
https://pypi.python.org/pypi/pep8 7http://www.pylint.org/
8
‘‘info’’:{
...
‘‘ ‘‘ package ’ url
: ’‘ ’‘’ ’Kenneth
: ‘ ‘ http : / / pypi
’ ’ , . python. com
. org / pypi / r e q u e s t s ’ ’ ,
‘ ‘ author
author ’email : Reitz
‘ ‘ me@ kennethreitz ’’,
‘ ‘ d e s c r i p t i o n ’ ’ : ‘ ‘ Requests : HTTP f o r Humans . . . ’ ’
...
...
‘ ‘ r e l e a s e u r l ’ ’ : ‘ ‘ http : / / pypi . python . org / pypi / r e q u e s t s
/2.7.0’’,
‘ ‘ downloads ’ ’ : {
‘ ‘ last month ’ ’ : 4002673 ,
‘ ‘ l ast week ’ ’ : 1307529 ,
‘ ‘ l a s t d a y ’ ’ : 198964
},
...
...
‘‘releases’’:{
‘‘1.0.4’’: [
{
‘‘has sig ’’: false ,
‘ ‘ upload t ime ’ ’ : ‘ ‘2012 −12 −23T07 : 4 5 : 1 0 ’ ’ ,
‘ ‘ comment text ’ ’ : ‘‘’’,
‘ ‘ python ver sio ’ ’ : ‘ ‘ source ’ ’ ,
‘ ‘ url ’ ’ : ‘ ‘ https : / / pypi . python . org / packages / s our c e / r / r e q u e s t s / r
‘ ‘ mde q5udigest
e s t s −’ ’1:. 0 .‘4‘ .0t ba 7448
r . gz ’f ’9,e 1 a 077 a 7218720575003 a 1 b 6 ’ ’ ,
‘ ‘ downloads ’ ’ : 111768 ,
‘ ‘ f i l ename ’ ’ : ‘ ‘ r eq u e s t s − 1 . 0 . 4 . t a r . gz ’ ’ ,
‘ ‘ packagetype ’ ’ : ‘ ‘ s d i s t ’ ’
}, ‘ ‘ s i z e ’ ’ : 336280
],
...
...
}
}
Figure 3.1: Metadata from PyPI for package requests.
9
PEP8 and
Prospector 8 Pylint
that aretogether
brings great tools
boththat
the we can use to of
functionality check
Pep8for code
and quality,
Pylint. butadds
It also we have decided to use
the functionality of
the code complexity analysis tool called McCabe 9. When a package is processed with Prospector, it will
give a count of errors, warnings, messages and their detail description. This information gives an
inference of Code Quality.
wellThere is another
the code set of information
is commented. The ratioweofcan
theuse for analyzing
number codelines
of comment quality. Developers
to the care about
total number how
of lines,
number of code lines to the total number of lines 10 and number of warnings to the number of lines, offers
some metrics to do code quality analysis. CLOC helps us acquire this information. As CLOC stands for
Count Lines of Code, when we run CLOC on Python package at consideration, it returns the total number
of files, number of comments, number of blank lines and number of code lines. We collect this
information to check for Code Quality.
3.2 Module Level Search
A module
classes, is a Python
methods file withSome
and variables. extension “.py”. Itareis ainterested
developers collectioninofsearching
multiple programing units such as
for these programming
entities in a module, so we wanted to build a search engine for them. There are various steps involved in
achieving this goal.
3.2.1 Mirror Python Packages
and In order to allow

variables, users
we need to to perform
extract thismodule level search,
information i.e., allow
from modules andusers to search
packages thatfor classes,
hold them.methods
We are
interested in the source code of all Python packages available. If there is a new release for any package,
we already have, we want to update the information we have on this package. All of these operations
could be complex or cumbersome if we were to do it by automating the process of downloading source
code from their respective homepage (assuming we somehow managed to collect source code download
URL for all packages). We found a better alternative. We came across a practice followed by software
development organizations. Some of these organizations
8
https://github.com/landscapeio/prospector
9
https://pypi.python.org/pypi/mccabe 10http://cloc.sourceforge.net/
1
wouldn’t like their
development. developers
Instead, to hit the
they maintain world
a local wide of
mirror web
thetoPyPI
download software
repository frompackages they need can
which developers for
download necessary packages without connecting to the Internet.
Currently,
order to avoid PyPI
such isdisaster,
a homePyPI
to 50,000+
has comepackages. It PEP
up with would381be11a, asingle point infrastructure
mirroring of failure if it that
goescan
down. In
clone
an entire PyPI repository in a desired machine. People started making public and private repositories
using this infrastructure. For our purposes, we use Bandersnatch , a client side implementation of PEP
12
381 to sync Python packages. When bandersnatch is executed for the first time it will mirror the entire
PyPI i.e., download all the Python packages. It will also maintain state files that help maintain the current
state of the repository, which is later used to sync with PyPI to get any updates made to the packages. A
recurring cron job to execute command “bandersnatch mirror” will keep the local repository always
updated.
3.2.2 Metadata - Modules
We have
entities. previouslythe
We mirrored discussed that repository
entire PyPI developers into
showour
interest in using
servers doing bandersnatch.
a code search for program-
In order ming
to enable
code search, we have to find useful information from modules of each package, i.e., get a list of
programming entities for each module. There are many programming entities in a Python module, but we
are mainly interested in classes, functions under classes, variables under classes, global functions,
recurring inner functions inside global functions, variables inside global functions and global variables.
We maintain each of them in a separate key so that we can give more weight to certain entities than others.
To collect required information, we iterate through all packages; with in each package we iterate
through
Python and all perform
modules;a for each
walk module,
(visit we construct
all) operation on thisantree.
Abstract Syntax
As walk Tree visits
operation using each
ast13 programming
module from
entity, it invokes various function calls inside ast.NodeVisitor such as visit Name, visit FunctionDef, visit
ClassDef and so on as per the current element. We override ast.NodeVisitor class and functions inside it
and perform visit all operation on top of it so that we have control over the operation performed inside
them. For example, during visit all, if a class is being visited, a
11
https://www.python.org/dev/peps/pep-0381/
12
https://pypi.python.org/pypi/bandersnatch
13
https://docs.python.org/2/library/ast.html
1
# Sample Code for collecting metadata class
PyPINodeVisitor(ast.NodeVisitor):
def visit_Name(self, node):
# collect variable name and line number def
visit_FunctionDef(self, node):
# collect function name and line number def
visit_ClassDef(self, node):
# collect class name and line number def
visit_all(self, node):
# call super class visit function
if name == " main ": for

modules in packages:
for module in modules: tree =
ast.parse(module)
JSONfile = PyPINodeVisitor().visit_all(tree)
Figure 3.2: Pseudocode for collecting metadata of a module.
function callpassed
information to visittoClassDef is invoked.
it and decide what to Since weit.have
do with overridden
We can this function,that
collect information weisare in control
of interest of
to us
such as names of the various classes and line number at which they occur. Figure
3.2 is the
collect all pseudocode
the metadatafor forusing ast toand
a module generate
save itthe
in arequired metadata
JSON format for of a module.
making This way,
it available we can
for Module
Level Search. Figure 3.3 is an example of one such JSON file we have generated using this process. Each
identifier is concatenated with its line number and additional underscores to make a length of minimum
18. The reason behind this format is discussed in Chapter 4. As part of data collection, it is important that
information being collected should be stored in an agreed format that enables better indexing and
searching techniques.
3.2.3 Code Quality
Similar
CLOC; we to thealso
have method we have
collected this applied to collect
information at thecode quality
module for We
level. packages using
couldn’t useProspector
Prospectorand
to
process a single module like we did for package, so we used Pylint instead of Prospector. CLOC helped
towards obtaining the number of comment lines, number of blank lines and number of code lines at the
module level.
1
{
‘ ‘ c el cua s rs e’ C ’ :oo‘ ‘k iSeeSses si so inonM i10 x i9n N2 u8 l Tagged
l S e s s i JSONSe
o n 1 1 9 ri aliz er 55 S
2 7 2o ’p’e, n‘ ‘ scel sa ss iso fnu 3n 0c 1t isoanv ’e ’ s: e s s‘ i‘ ogne t3s1i 5g ’ne’ i,Sn egsssei roi na Il ni zt ee rr f2a9c0e
S e s s i o n I n t e r f a c e 1 3 4 S e c u r e C o o k i
‘‘class variable ’’: ‘‘ salt 278 d i ges t me t h o d 28

0key derivation 283 serializer 287 session class 2
8s e8 ls fe3l 0f 21 9 0 aa pp pp 23 90 01 sr ie gq nu ee rs tk 3w0a 1r g s 2 9 3
val305 max age 308 self 315
app315 session 315 response 315
domain 316 path 317 httponly 323
secure 324 expires 325 val326
d a t a 3 1 0 ’ ’ ,
‘ ‘ f unct i on ’ ’ : ‘ ‘ t o t a l s e c o n d s 2 4 ’ ’ ,
‘‘function function ’’: ‘‘’’,
‘‘function var ’’: ‘‘’’,
‘ ‘ module ’ ’ : ‘ ‘ s e s s i o n s ’ ’ ,
‘’ ‘, ‘module
‘ v a r i apath b l e’ ’’ ’: : ‘‘ ‘‘ Flask
se ss− i o0n. j10 s o. 1n /s felrai sakl /i sz ee sr s1 i0o6n ’s’. py ’
}
Figure 3.3: Metadata for a module in Flask package.
1
CHAPTER 4
DATA INDEXING AND SEARCHING
We used engine,
analytics Elasticsearch (ES)
built on top, of
1
a Apache
flexible Lucene
and powerful,
to indexopen source,
our data anddistributed, real-timedata
query the indexed search and
for both
package level search and module level search. FileSystem River (FSRiver) 2, an ES plugin is used to index
documents from a local file system or remote file system (using SSH).
4.1 Data Indexing
4.1.1 Package Level Search
Extracted
package data for each
is considered Python package
a document. Althoughis all
indexed
fieldsin
inES using FSRiver,
a document whereindata
are indexed the for
ES each Python
server, only
the following fields: name, summary, keywords and description are analyzed (Refer to Figure 4.1 for ES
mapping) before they are indexed using ES Snowball analyzer3. Snowball analyzer generates tokens using
the standard tokenizer, removes English stop words and uses standard filter, lowercase filter and snowball
filter. The other fields are not analyzed before indexing either because they are numbers (eg:
info.downloads.last month) or are of no interest with respect to search query. Figure 4.2 depicts the river
definition which actually indexes the package level data (in .json format) located in the server, looks for
updates every 12 hour and reindex data if there is any
update.
4.1.2 Module Level Search
Extracted
Python data
package is for each module
considered in a Python All
as a document. package
fieldsisexcept
indexed in ES,path
module where
in data for eachare
a document module in a
analyzed
using a custom analyzer (Refer to Figure 4.3 for the definition of the custom analyzer) before they are
indexed. The custom analyzer generates tokens using a custom
1
https://www.elastic.co/products/elasticsearch 2https://github.com/dadoonet/fsriver
3
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
1
PUT packagedata / packageindex / mapping
{ ‘ ‘ packageindex ’ ’ : {
‘‘properties’’:{
‘‘info’’:{
‘ ‘ name ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ index ’ ’ : ‘ ‘ analyzed ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ snowball ’ ’
},
‘ ‘ summary ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ index ’ ’ : ‘ ‘ analyzed ’ ’ ,
},
‘ ‘ keywords ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ index ’ ’ : ‘ ‘ analyzed ’ ’ ,
},
‘‘description’’:{
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ index ’ ’ : ‘ ‘ analyzed ’ ’ ,
}
}
}
}
}
}
Figure 4.1: Package level data index mapping.

tokenizer (splittingfilter
uses ES snowball by underscore,
4 camel
. The module case
path andisnumbers,
field at theassame
not analyzed preserves
we will the original
not look token)
for matches in and
this
field for a search query. We used fast vector highlighter5 (by setting term vector to with positions offsets
in the mapping) instead of the plain highlighter in module level search.
4
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-snowball-tokenfilter.html
5
https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-highlighting.html
1
PUT r i v e r / p a c k a g e i n d e x r i v e r / meta
{
‘, ‘‘ ‘type
f s ’ ’’ :’ :{ ‘ ‘ f s ’ ’
‘‘ ‘‘ url ’ ’ : rate
update ‘ ‘ / ’s ’e:r v e‘r‘/12
package
h ’ ’ , / data / d i r e c t o r y / path ’ ’ ,
} , ‘ ‘ j son supp or t ’ ’ : t rue
‘ ‘ index ’ ’ : {
‘ ‘ index ’ ’ : ‘ ‘ packagedata ’ ’ ,
‘ ‘ type ’ ’ : ‘ ‘ packageindex ’ ’
}
}
Figure 4.2: River definition.

The
Figure 4.4 depicts the mapping defined forfragment
fast vector highlighter lets you define size and
module level datanumber of matching fragments to be returned.
indexing.
4.2 Data Searching
We define
queries. ourthe
ES uses search querymodel
Boolean according to ES
to find Query documents,
matching DSL to lookand
for amatches
formulaincalled
the index for user
the practical
scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse
document frequency and the vector space model but adds modern features like a coordination factor, field
length normalization, and term or query clause boosting6.
4.2.1 Package Level Search
Figure
query in the4.5 depicts the
following query
fields: used
name, for package
author, level
summary, search. ESand
description looks for matches
keywords. Basedforonthe
theuser search
matches it
ranks the results and returns name, author, summary, description, version, keywords, number of
downloads in the last month for each top n ranked Python package, where n is the number of matching
packages requested. Summary and description of a matched package are
6
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
1
PUT moduledata
{ ‘ ‘ s e t t i n g s ’ ’: {
‘‘analysis’’ :{
‘‘filter ’’ : {
‘ ‘ code1 ’ ’ : {
‘, ‘‘ ‘type ’s’ e r: v e‘ ‘oprai tgt ier
p r eerns n c ap t u r e ’ ’
‘ ‘ patt ’’ : [ nal ’’ : 1,
‘‘ ‘‘ (( \\
\\ p{
d+)Ll ’ ’}+|\\p{Lu}\\p{ Ll }+|\\p{Lu}+) ’ ’ ,
} ]
},
‘ ‘ analyzer ’ ’ : {
‘ ‘ code ’ ’ : {
‘ ‘ t o k e n i z e r ’ ’ : ‘ ‘ pattern ’ ’ ,
‘ ‘ f i l t e r ’ ’ : [ ‘ ‘ code1 ’ ’ , ‘ ‘ l owe r ca s e ’ ’ , ‘ ‘ snowball ’ ’
]
}
}
}
}
}
Figure 4.3: Custom Analyzer with custom pattern filter.

returned in the section where matches
user interface while displaying are highlighted using the <em> tag, which we will utilize in the
the results.
4.2.2 Module Level Search
Figure
in the 4.6 depicts
following fields:the query
class, usedfunction,
class for module level
class search.function,
variable, ES detects matches
function to the user
function, searchvar
function query
and
variable of a module document. Different weights are assigned for matches in a different field based on
their importance. For example, a match in the class field will weigh more than a match in the function
field in a module. Weights are assigned using a caret (ˆ) sign followed by a number. Based on the
matches, it ranks the results and returns the path to the module (module path) where match occurred.
Using this information, we will retrieve the
1
PUT moduledata / moduleindex / mapping
{ ‘ ‘ pypimtype ’ ’ : {
‘ ‘ module ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ s t o r e ’ ’ : ‘ ‘ yes ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ code ’ ’
},
‘ ‘ module path ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ index ’ ’ : ‘ ‘ not ana lyz ed ’ ’
},
‘‘class’’:{
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ s t o r e ’ ’ : ‘ ‘ yes ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ code ’ ’ ,
‘ ‘ term vector ’ ’ : ‘‘withpositionsoffsets ’’
},
‘‘class function’’:{
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ s t o r e ’ ’ : ‘ ‘ yes ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ code ’ ’ ,
},
‘‘classvariable’’:{
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ s t o r e ’ ’ : ‘ ‘ yes ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ code ’ ’ ,
},
‘ ‘ function ’ ’ : {
‘ ‘ type ’ ’ : ‘ ‘ s t r i n g ’ ’ ,
‘ ‘ s t o r e ’ ’ : ‘ ‘ yes ’ ’ ,
‘ ‘ a nal yzer ’ ’ : ‘ ‘ code ’ ’ ,
},
...
...
}
}
}
Figure 4.4: Module level data indexing mapping.
1
GET packagedata / packageindex / s e a r c h
{ ‘ ‘ query ’ ’ : {
‘ ‘ multi match ’ ’ : {
‘ ‘ query ’ ’ : ‘ ‘ sear ch query ’ ’ ,
‘ ‘ operator ’ ’ : ‘ ‘ or ’ ’ ,
‘ ‘ f i e l d s ’ ’ : [ ‘ ‘ i n f o . name ˆ 3 0 ’ ’ , ‘ ‘ i n f o . author ’ ’ , ‘ ‘ i n f o .
summary ’ ’ , ‘ ‘ i n f o . version ’ ’ , ‘ ‘ i n f o . keywords ’ ’ ]
}
},
‘‘fields ’’: [
‘ ‘ i n f o . name ’ ’ , ‘ ‘ i n f o . author ’ ’ , ‘ ‘ i n f o . summary ’ ’ , ‘ ‘ i n f o . version
’ ’ , ‘ ‘ i n f o . keywords ’ ’ , ‘ ‘ i n f o . downloads . last month ’ ’
],
‘‘highlight ’’: {
‘‘fields ’’: {
‘ ‘ summary ’ ’ : { } ,
‘‘description’’:{}
}
}
}
Figure 4.5: Package level search query.

module .py file and extract the
Highlighter (FVH) for module level relevant code
search, segment
which fortheeach
returns top matching module.
n fragments We field
from each used (class,
Fast Vector
class
function, class variable, function, function function, function var and variable) in a module document. The
fragment size has been strategically defined to be 18, which is the minimum fragment size ES allows.
While forming the module level search documents we appended the line number to each variable where it
appears so that we can use the module path information and line number information to retrieve the
relevant code segment. If a variable appended with the line number is less than 18 characters long, we
appended underscores to make it 18. This trick is useful because ES FVH creates fragment based on
fragment size and word boundary. This way, ES creates fragments for each variable in a field, looks for
matches in each field to the search query and returns n top fragments, where n is the number of fragments
set in the query.
1
GET moduledata / moduleindex / s e a r c h
{ ‘ ‘ query ’ ’ : {
‘ ‘ multi match ’ ’ : {
‘ ‘ query ’ ’ : ‘ ‘ user query ’ ’ , ‘ ‘
fields’’:[
‘ ‘ c ,l ‘a‘sf suˆn5c’ t’i,o‘ n‘ cˆ 4l a’ s’ s, ‘ f‘ uf un nc tc it oi on n’ ’f,u‘ n‘ cc lt ai os ns ’v’a, r‘ i‘ af bu lnec ’t ’i
} on var’’,‘‘variableˆ3’’]
}‘ ‘, f i e l d s ’ ’ : [
‘ ‘ module path ’ ’
],
‘ ‘ source ’ ’ : f a l s e ,
‘‘highlight’’: {
‘ ‘ order ’ ’ : ‘ ‘ score ’ ’ ,
‘, ‘‘ ‘r fe iqeul idrse ’ f’ i :e{l d m a t c h ’ ’ : true
‘‘class’’: {
‘ ‘ number of fragments ’ ’ : 5 , ‘
} , ‘ f r a g m e n t s i z e ’ ’ : 18
‘ ‘ c l a s s f u n c t i o n ’ ’: {
} , ‘ f r a g m e n t s i z e ’ ’ : 18
‘‘class variable ’’ : {
} , ‘ f r a g m e n t s i z e ’ ’ : 18
‘ ‘ fu n ction ’ ’ : {
} , ‘ f r a g m e n t s i z e ’ ’ : 18
...
...,
‘‘variable’’ : {
‘‘ ‘f rnumber
a g m e noft fragments
s i z e ’ ’ : 18’ ’ : 5 , ‘
}
}
}
}
Figure 4.6: Module level search query.
2
CHAPTER 5
DATA PRESENTATION
Once
searchthe data type
engine is indexed, it needs
of interface. to goal
Our be presented in alevel
for package fast search
and presentable manner.
was to provide We chose
a ranked to relevant
list of create a
packages and their details to any given query. For module level search, we wanted to provide actual source
code snippets related to the query. In order to display some of the source code to the user, preprocessing
was necessary.
5.1 Server Setup
Our
serve ourserver
data, has
andauWSGI
simple4stack
as ansetup. We between
interface use Nginxthe
1
toNginx
handleand
requests, Flask2/Python(ES)
Flask. Elasticsearch
3
to process
5
is usedand
to
hold our data for package and module level search. To make sure we are using the latest packages from
PyPI, we use sqlite databases to track the modified times of each6 package. Much of the rendering and
manipulation of the browser interface is done using JavaScript . Our JavaScript library of choice is
jQuery7.
5.2 Browser Interface
The interface
or module level features a home
search. The pagepage
results with displays
a simple the
text query
box forinformation
queries andata choice
the top,forand
either
thepackage
results
themselves below. The user can modify his or her query or change the search type from package level
search to module level search and vice versa. In Chapter 6 we have added some sample screen shots of
the browser interface.
1
https://www.nginx.com/resources/wiki/
2
http://flask.pocoo.org/ 3https://www.python.org/
4
https://uwsgi-docs.readthedocs.org/en/latest/
5
https://www.elastic.co/products/elasticsearch
6
https://developer.mozilla.org/en-US/docs/Web/JavaScript
7
https://jqueryui.com/
2
Figure 5.1: Package modal.
5.3 Package Level Search
When
list of a userissends
packages a query
sent back. to the
Each serverisfor
package packageonlevel
depicted search, the
the browser as aquery is processed,
tile (Refer and6.3).
to Figure a ranked
Tile
provides minimal information about the package that includes package name, author of the package, brief
description, number of downloads and score assigned by the ranking algorithm. The user has the option to
click the tile to view more information about the package and visit PyPI and other sites related to the
package. When the user clicks on the tile, a modal is opened that contains detail description, version,
source code homepage, PyPI homepage, score from ranking algorithm (Refer to Figure 5.1), statistics of
the package as bar graph, pie chart and numbers (Refer to Figure 5.2), other packages from author (Refer
to Figure at 5.3). This process relies on ranking of packages.
2
Figure 5.2: Package statistics.
5.3.1 Ranking of Packages
manyOne of the
types mostfor
of data important
ranking aspects which In
the packages. distinguishes thisChapter
Chapter 3 and search engine fromdiscussed
4 we have others ishow
the use of
all the
preprocessed information about packages and modules, including basic details from PyPI are stored in
our ES server. We felt that ES relevance algorithm was not thorough8 enough to return meaningful results.
So we are using a few other metrics, namely, Bing search results , number of downloads, the ratio of
warnings to the total number of lines, the ratio of comments to the total number of lines and the ratio of
code lines to the total number of lines (gathered by prospector and CLOC, also visible in package modal
referenced in Figure 5.2). All of these metrics are passed as columns to the ranking algorithm. These
columns represent nothing but a matrix. After the ranking algorithm is executed, it returns a sorted list of
packages in reverse order based on newly generated scores and a dictionary mapping of package names to
their scores. Note that
8
http://datamarket.azure.com/dataset/bing/search
2
Figure 5.3: Other packages from author.
ES column
these columnand Bing Column
weights could beare primary
tricky. columns
One way while
to fine tuneothers are secondary
this algorithm derived
is to try columns.
different Tuning
combinations
of weights and learn which one works better. We can also fine tune this algorithm by adding more
primary columns or secondary derived columns. For example, at the time of writing this thesis, we were
experimenting with PageRank as one of the primary columns. We sought to calculate PageRank based on
import statements in each module. Just like in web pages where a page “A” links to some other page “B”,
in Python modules we have import statements where a module “C” imports module “D”. This can be
considered as a vote cast by “C” to module “D” and this information can be used to generate PageRank
for modules and packages.
Ranking Algorithm.
1. Request ES server for top 20 results for user query.
2. Request Bing for top 20 results for user query.
3. Generate other relevant columns or metrics.
2
4. Pass list of all columns generated in step 1, 2 and 3 to rerank function along with weights for each
column.
5. Inside rerank function
(a) Find max length of columns, i.e. number of rows in the matrix.
(b) Construct a results
the package dictionary
and value for set union
(score) initiated to 0.of all cells in the matrix with key as the name of
(c) For each row in matrix
i. For each cell in row
A. Find the package score
row.rownumber) for its position using the formula, (maxlen −
× weightvector(cell.columnnumber).
Here
position of the current row number
“maxlen” is the total of rows
in the matrix. in the matrix. “row.rownumber” isgives
“weightvector(cell.columnnumber)” the
the weight assigned to the particular column the cell belongs to.
B. Add this score to existing score of that package in the dictionary created in step 5(b).
At the end
standing of this step,
in different each package in results dictionary will have cumulative score for
columns.
(d) From
resultsthe
listresults dictionary,
with reranked generate a reverse sorted list with key as score. This will give
order.
totalNote that of
number there
rowscould bematrix
in the duplicates between
passed top 20
to ranking resultsisofnot
function ESalways
server40.
andFigure
top 20 results of Bing, so
5.4 is the sample pseudocode for the ranking algorithm. Table 5.1 is the
keyword “music”. In this table by looking at the cells highlighted with yellow background, sample matrix
youformed for
can notice
duplicate name “vls-framework” between primary columns Elasticsearch and Bing. This only means that
both primary columns agree to the fact that this is a relevant result for given keyword. By looking at the
cells highlighted in red, you can notice duplicates within the same primary column Bing. This scenario
occurs because multiple versions of the same package emerge famous. From the table we can also notice
that these duplicates are carry forwarded to secondary columns, thus influencing the ranks. Allowing
duplicates to present in primary columns and allowing them to carry forward to secondary columns are
decisions yet to be made. For now, this practice in ranking algorithm has looked promising, with positive
effects. As we investigate with more use cases, if it turns out duplicates are influencing negatively, we can
always just eliminate them by not disturbing
2
the nature
scores of ranking
calculated function.
by the rankingFor keyword
function “music”,
ordered Table 5.2 shows
in descending order. the matching packages and their
5.4 Module Level Search
Refer
search, thetouser
Figure 6.4a to
sends viewto the
query the template used for this
server. However, Module
time Level
aroundSearch. Similar
the server to package
returns a list of level
code
excerpts. One of our concerns was to keep the wait time of queries low. So a preprocessing step was
added to retrieve the code faster.
5.4.1 Preprocessing
To reduce
previously the waitBandernsatch
mentioned, time of searches,
is usedthe sourcea code
to create needs of
local mirror to the
be PyPI
adapted for theonbrowser.
repository As
our server,
however, only compressed packages are maintained with Bandersnatch. An uncompression step is
required to examine each Python module in plain text.
Ourlines
twenty aim from
was tothe
show nicely
source formatted,
code to the user stylized lines of
and render thecode.
codeInitially, we were
on the user’s sidegoing
usingtoa send
third about
party
JavaScript library such as SyntaxHighlighter 9. This worked well except for multi-line constructs such as
doc strings. Since there is a possibility of missing lines of a doc string during client-side rendering, the
renderer has no way of knowing how to stylize the doc string. Instead, we fixed this by rendering the code
snippets before sending it to the client (server-side rendering). For this, we used pygments10, a Python
library, for creating stylized code in Python and other languages for numerous formats such as HTML.
This inevitably increases the amount of data sent from the server, but this ensures that the code is
correctly stylized.
Linesmodule.
for each of CodeThe
(LOC) is another
mapping file intrick
thisused in expediting
instance stores thethe code
exact display.
byte whereItacreates a “mapping”
new line file
starts for each
line in a module. When it is time to grab a snippet of code, the server can open the file and immediately
seek to the correct spot rather than search or linearly pass through the rest of lines. There could be
thousand or even more lines of code that we are avoiding by using this technique. This cuts down on
processing time and only requires another step in the preprocessing.
9
http://alexgorbatchev.com/SyntaxHighlighter/ 10http://pygments.org/
2
generateTop20ESResults():
{
#query ES server and get top 20 results
}
generateTop20BingResults():
{
#query bing and get top 20
#filter out real packages from pages
}
generateOtherMetrics(listOfPackagesFromESandBing):
{
for eachpackage in listOfPackagesFromESandBing:
column_downloads = eachpackage.downloads
column_warnings = eachpackage.warnings/eachpackage.lines
column_comments = eachpackage.comments/eachpackage.lines
column_code = eachpackage.code/eachpackage.lines
column_downloads.sortReverse() # more downloads, better the package
column_warning.sort() # less warnings, better the package
column_comments.sortReverse() # more comments, better the package
column_code.sortReverse() # more code, better the package
return column_downloads,column_warning,column_comments,column_code
}
rerank(weightvector, matrix):
{
maxlen = max(len(column) for column in matrix)
# Create a dict as keys:0.0 from union of cells in matrix resultsDictionary =
getcDictionaryFromCells(matrix)
# go through the matrix one row at a time for row in
matrix:
for cell in row:
if packagename: # avoiding empty cells resultsDictionary[cell.packaganame] +=
(maxlen - row.rownumber) * weightvector[cell.column_number] # higher the
score, better the package
resultsList = sortReverse(resultsDictionary, key=score) # resultsList
gives you the order of packages
# resultsDictionary gives score to each package return
resultsList,resultsDictionary
}
# execution starts here columns =

[]
columns.append(generateTop20ESResults) # primary column, more weight
columns.append(generateTop20BingResults) # primary column, more weight
columns.extend(generateOtherMetrics(ES+Bing)) # secondary columns wegithvector = [0.4, 0.2,
0.1, 0.1, 0.1, 0.1] # can vary rerank(weightvector,columns)
Figure 5.4: Pseudocode for ranking algorithm.
2
Table 5.1: Ranking matrix for keyword music.
Elasticsearch Bing Downloads Warnings/Lines Comments/Lines Code/Lines
vk-music musicplayer pylast vk-music mps-youtube mopidy-musicbox-webclient
jmbo-music music mps-youtube jmbo-music mps-youtube musicplayer
panya-music mopidy-gmusic mps-youtube panya-music vis-framework musicplayer
tweet-music pyspotify jmbo-music tweet-music pyacoustid mopidy-gmusic
google-music-playlist- musicplayer hachoir-metadata google-music-playlist- cherrymusic hachoir-metadata

importer importer
music-score-creator mps-youtube pyspotify music-score-creator music21 bdmusic
spilleliste music21 mp3play spilleliste music21 music21
raspberry jam mps-youtube bdmusic raspberry jam pyspotify music21
kurzfile gmusicapi pyacoustid kurzfile mp3play pyspotify
vis-framework vis-framework gmusicapi vis-framework gmusicapi mp3play
gmusic-rating-sync music22 vis-framework gmusic-rating-sync bdmusic pyacoustid
vkmusic pygame-music-grid vis-framework vkmusic hachoir-metadata cherrymusic
melody-dl bdmusic music21 melody-dl musicplayer vis-framework
leftasrain mopidy-musicbox-webclient music21 leftasrain musicplayer gmusicapi
youtubegen netease-musicbox raspberry jam youtubegen mopidy-gmusic mps-youtube
marlib hachoir-metadata mopidy-musicbox-webclient marlib vk-music mps-youtube
chordgenerator mp3play musicplayer chordgenerator jmbo-music vk-music
tempi music21 musicplayer tempi panya-music jmbo-music
raindrop.py pyacoustid panya-music raindrop.py tweet-music panya-music
pylast cherrymusic cherrymusic pylast google-music-playlist- tweet-music

importer
- - kurzfile musicplayer music-score-creator google-music-playlist-
importer
- - mopidy-gmusic musicplayer spilleliste music-score-creator
- - music-score-creator music21 raspberry jam spilleliste
- - chordgenerator music21 kurzfile raspberry jam
- - tweet-music mps-youtube vis-framework kurzfile
- - vkmusic mps-youtube gmusic-rating-sync vis-framework
- - leftasrain pyacoustid vkmusic gmusic-rating-sync
- - gmusic-rating-sync vis-framework melody-dl vkmusic
- - vk-music pyspotify leftasrain melody-dl
- - tempi mopidy-musicbox-webclient youtubegen leftasrain
- - google-music-playlist- mp3play marlib youtubegen

importer
- - marlib hachoir-metadata chordgenerator marlib
- - melody-dl bdmusic tempi chordgenerator
- - raindrop.py cherrymusic raindrop.py tempi
- - spilleliste mopidy-gmusic pylast raindrop.py
- - youtubegen gmusicapi mopidy-musicbox-webclient pylast
2
Table 5.2: Matching packages and their scores for keyword music.
Package Name Score Package Name Score
musicplayer 45.8 music-score-creator 13.8
mps-youtube 44.6 raspberry jam 13.6
music21 39 google-music-playlist-importer 13.5
We limited
module the number
level search of code
vis-framework
will generate snippets
a huge that33 are
response returned
increasetothe20.
andkurzfile Sending
12.5 more than 20 matches for
response time. Before sending the top
20 results, we apply the ranking algorithm discussed for package level search with changes in input for
the ranking function. As of now, there is only one primary column, i.e. results from the ES server and a
pyspotify 22.8 spilleliste 12.1
list of secondary columns similar to package level, but each one of them points to module level statistics
rather than package level statistics
mopidy-gmusic (except for downloads column). For example, consider column
“Warnings/Lines”, for package level search, it is20.8
the ratio of the number of10.8
gmusic-rating-sync
total warnings for package to
the number of total lines of code for the package and for the module level search it is the ratio of the total
number of warnings for module to the total number
gmusicapi 19 of vkmusic
lines of code for the module.
10.5
bdmusic 18.6 music22 10.4
After rendering matchinghachoir-metadata 17.8

code snippets on the pygame-music-grid
front 10
end, for user convenience and interest, we
jmbo-music 17.7 leftasrain 9.4
mp3play 17.1 melody-dl 9.3
pyacoustid 16.9 pylast 9
vk-music 15.7 netease-musicbox 8.8
panya-music 15.7 chordgenerator 8.2
mopidy-musicbox-webclient 15.7 youtubegen 8
tweet-music 14.6 marlib 7.9
cherrymusic 14.5 tempi 7.1
music 14 raindrop.py 6.2
2
Figure 5.5: Module modal.
have given an option for user to click on avisibility
from the module. This will give users more code snippet that enables
for modules. Figurea 5.5
modal that displays
represents entire code
this modal.
3
CHAPTER 6
SYSTEM LEVEL FLOW DIAGRAM
Figure
overnight6.1toisgenerate
a System Level
all the Flow details
required Diagram of PyQuery.
mentioned in the A set Collection
Data of Pythonchapter.
scripts This
are run in batch
preprocessed
information is in JSON file format. Before executing these batch scripts, the bandersnatch 1
mirror client is
executed so that packages are in sync with PyPI2 and we deliver the most up to date information. All the
files generated are either part of package search or module search.
We maintain separate Elasticsearch (ES)3 indexes for package search and module search. These
indexes
form theare configured
core to update at regular intervals if there are any changes for the files they point. They
of the PyQuery.
A web interface customized
deployed using NGINX5 server. Figure for easy
6.2 isflow of information
PyQuery’s homepagetowhoseusersdesign
is developed
is mainlyininspired
Flask 4from
and
Google’s homepage. It serves separate edge nodes for package search and module search. When a user
hits the package level search page, an AJAX call is made to the edge node responsible for retrieving
matching packages from ES index. Based up on matching packages retrieved from ES, a set of metrics are
formulated and passed to the ranking algorithm as discussed in chapter 5. This returns a list of packages to
requested front end page that is reverse sorted with key as the score calculated by ranking algorithm. The
highest score is positioned on top of others. Figure 6.3 shows result of Package Level Search on PyQuery
for keyword “flask”.
When a user hits the module level search page, an AJAX6 call is made to edge node responsible
for retrieving
matching matching
modules modules
and line or lines
numbers of code
at which based happened,
the match on the metadata index
the Lines of in ES.(LOC)
Code After collecting
technique
discussed in Chapter 5 is executed to quickly capture code snippets from matching
1
https://pypi.python.org/pypi/bandersnatch 2https://pypi.python.org/pypi
3
https://www.elastic.co/products/elasticsearch 4http://flask.pocoo.org/
5
https://www.nginx.com/resources/wiki/ 6http://api.jquery.com/jquery.ajax/
3
Figure 6.1: System Level Flow Diagram of PyQuery.
modules
Search onand displayfor
PyQuery them
the on the requested
keyword front end page. Figure 6.4 shows result of Module Level
“quicksort”.
3
Figure 6.2: PyQuery homepage.
3
Figure 6.3: PyQuery package level search template.
3
Figure 6.4: PyQuery module level search template.
3
CHAPTER 7
RESULTS
Our goal was

meaningful. Weprimarily
wanted totoa avoid
build huge
a better PyPI
list of search
closely and engine.
similarlyWe wanted
scored to make
packages. WiththePyPI
search more
being the
state of the art, we will be comparing results of PyQuery with PyPI . For comparison, we will be
searching 5 keywords that directly match with a package name at PyPI (from Table 7.1 to Table 7.5) and 5
generic keywords that would infer the purpose of packages (from Table 7.6 to Table 7.10).
1. Total results returned
approximately 1 by PyPI
of total numberwasofhighest for on
packages keyword
PyPI. “Django”
On the otherand hand,
that amounts to 11,292.
for PyQuery with Ittwo
is
primary columns in ranking function is always set to a maximum of 40. As a fact, on Google 81%
of users view only one results page [12], which is close to 10 results.
4
2. Maximum number
is for package of packages,
“flask”. PyQuery that PyPI
always assigned
assigned the highest
highest score score forone
for only a keyword
package.was 162 and that
3. For the first matching
of keyword 5 keywords,
thewhere
packagewename,
are expecting a package
PyPI showed to be in first
this behavior onlyposition
2 out ofdue to theOn
5 times. nature
the
other hand, PyQuery exhibited this behavior 5 out of 5 times.
4. For PyPI, itdistribution
occasions has assignedofsame
scores is very
score, on 3 less spread.
out of Amongitthe
10 occasions has top 5 scores,
assigned onof4 same
4 to be out of 10
score
and on 2 out of 10 it has assigned 3 to be of the same score. PyQuery scores are more spread.
5. For the last we
developer, 5 keywords where
observe that they don’t
PyQuery directly
results pointappealing,
are more to any package namescored
diversely but infer
andthe need of
unique. Fora
example, for keyword ’Web Development Framework’, PyQuery returned all unique results with
packages like Django and Pyramid (widely used web development frameworks) in top 5. On the
other hand, among top 5 for PyPI it had only 3 unique results with some of them related to web
testing framework.
6. Among the versions
to multiple top 5 results,
of theonsame
4 outpackage.
of 10 occasions, PyPI returned
PyQuery always returns duplicate package
only one result pernames
package referring
that is
referring to the latest version.
3
Based onsearch
meaningful aboveand
grounds
avoid of comparison,
closely it is clear
and similarly thatpackages.
scored we have met our goals to improve PyPI, offer a
Keyword Table 7.1: Results comparison

requestsfor keyword - requests.
# of results from PyPI 5117
# of results from PyQuery 28
cache requests, curl to requests, drequests,
Top 5 results from PyPI helga-pull-requests, jsonrpc-requests
Requests, Requests-OAuthlib, Requests-Mock,
Table 7.2: Results comparison for keyword - flask.
Top 5 results from PyQuery Requests-Futures, Requests-Oauth
First 5 scores - PyPI 9, 9, 9, 9, 9
First 5 scores - PyQuery 131.60, 68.50, 51.00, 45.60, 39.90
# of packages with highest score -
Keyword flask
PyPI
## of results from PyPI 21
1750
of packages with highest score -
PyQuery airbrake
1 flask, draftin a flask, fireflask, Flask,
Rank (score) of expected match -
Top 5 results from PyPI Flask-AlchemyDumps
PyPI Flask,
7 (9) Flask-Admin, Flask-JSONRPC, Flask-
Top 5 results from PyQuery Restless, Flask Debug-toolbar
PyQuery
First 5 scores - PyPI 19,(131.60)
9, 9, 9, 9
PyPI 162
PyQuery 1
PyPI 4 (9)
PyQuery 1 (46.70)
3
Keyword Table 7.3: Results comparison for keyword - pygments.
pygments
Pygments, django mce pygments, pygments-asl,
Top 5 results from PyPI pygments-gchangelog, pygments-rspec
Pygments, pygments-style-github,
Xslfo-Formatter, Pygments-
Bibtex-Pygments-Lexer, Mis-
Table 7.4: Results comparison for keyword - Django.
tune
Top 5 results from PyQuery
Keyword Django
#PyPI
of results from PyPI 11292
1
# of packages with
results from highest score -
PyQuery 38
Django, django-hstore, django-modelsatts,
PyQuery 1
Top
Rank5(score)
results of
from PyPI match -
expected django-notifications-hq,django-notifications-hq
Django, Django-Appconf, Django-Celery,
PyPI 1 (11)
Top
Rank5(score)
results of
from PyQuery
expected match - Django-Nose, Django-Inplaceedit
First 5 scores - PyPI
PyQuery 110,(88.10)
10, 10, 10, 10
PyPI 6
PyQuery 1
PyPI 1 (10)
PyQuery 1 (73.30)
3
Keyword Table 7.5: Results comparison
pylint for keyword - pylint.
gt-pylint-commit-hook, plpylint, pylint, pylint-
Top 5 results from PyPI
Table 7.6: Results comparisonpatcher,
Pylint,
pylint-web2py
for keyword - biological computing.
Pylint2tusar, Django-Jenkins, Py-
Top 5 results from PyQuery lama pylint, Logilab-Astng
Keyword
# of packages with highest score - biological computing
PyPI
# of results
packagesfrom PyQuery
with highest score - 23
blacktie, appdynamics, appdynamics, appdy-
PyQuery 1
Top
Rank5(score)
results of
from PyPI match -
expected namics, inspyred
BiologicalProcessNetworks, Blacktie, PyD-
PyPI 3 (9)
Top
Rank5(score)
results of
from PyQuery
expected match - STool, PySCeS, Csb
1 (91.90)
PyQuery
PyPI 1
PyQuery 1
Relevant packages among top 5 -
PyPI 1
PyQuery 1, 2, 3, 4, 5
3
Keyword Table 7.7: Results comparison for keyword - 3D printing.
3D printing
fabby, tangible, blockmodel, citygml2stl, de-
Top 5 results
Table from PyPI comparison formakein
7.8: Results keyword - web development framework.
Top 5 results from PyQuery Pymeshio, Demakein, C3d, Bqclient, Pyautocad
Keyword web development framework
PyPI 2
## of results from PyPI
of packages with highest score - 801
PyQuery 1HalWeb, WebPages, robotframework-
Relevant packages among top 5 - extendedselenium2library, robotframework-
extendedselenium2library, robotframework-
PyPI extendedselenium2library
1, 2, 3, 4, 5
Top 5 results
Relevant from PyPI
packages among top 5 -
PyQuery 1, 2, 3, 4, 5
Top 5 results from PyQuery Django, Pyramid, Pylons, Moya, Circuits
PyPI 2
PyQuery 1
PyPI 1, 2
PyQuery 1, 2, 3, 4, 5
4
Keyword Table 7.9: Results comparison for keyword
material science - material science.
py bonemat abaqus, MatMethods, MatMiner,
Top 5 results from PyPI pymatgen, pymatgen
FiPy, Pymatgen, Pymatgen-Db, Custodian,
Table 7.10: Results comparison for keyword - google maps.
Top 5 results from PyQuery Mpmath
Keyword google maps
PyPI 1
## of
of results
packagesfrom PyPI
with highest score - 290
PyQuery 1Product.ATGoogleMaps, trytond google maps,
Relevant packages among top 5 - django-google-maps,
1, 2, 3, 4, 5 djangocms-gmaps, Flask-
GoogleMaps
Top
PyPI5 results from PyPI Note: 4 and 5 are duplicate results.
Relevant packages among top 5 - Googlemaps, Django-Google-Maps, Flask-
PyQuery
Top 5 results from PyQuery 1, 2, 3, 4
GoogleMaps, Gmaps, Geolocation-Python
PyPI 2
PyQuery 1
Relevant packages among top 5 - PyPI 1, 2, 3, 4, 5
Note: Resultspurpose
missing general are relevant to query
packages like but they are
Googlemaps among top 5.

PyQuery 1, 2, 3, 4, 5
4
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
We believe
modules. Wewe have the
expect succeeded in developing
Python community a dedicated
to widely adopt search engine
PyQuery. for Python
PyQuery packages
would allow and
Python
developers to explore well written, widely adopted, famous and highly apt Python packages and modules
for their programming needs. It will offer itself as an encouraging tool in Python community to follow
software engineering practice code reuse.
8.1 Thesis Summary
In this
for thesis
Python we have
packages andproposed
modules.some concrete
We have ideas
sought on how
to build to develop
an improved a dedicated
version search
of the state of engine
the art
Python search engine PyPI. Although PyPI is the first and only tool to address this problem, results
from PyPI are found to serve very little use to user needs and requirements. We have discussed various
tools and techniques which are brought together as one single tool called PyQuery, for facilitating better
search, better rank and better package visibility. With PyQuery we want to bridge the gap between the
high demand of means and ways to deliver reusable components in Python for code reuse and the lack of
efficient tools at users disposal to achieve it. In Chapter 1 we discussed the relevance of this problem, our
objective and approach towards solving the problem. In Chapter 2 we highlighted the related work in this
area. For package level search, PyPI being the only search engine that does Python module search, we
have elaborated on how PyPI search algorithm works, offered reasons as to why we think it needs
improvement. For module level search, there isn’t any dedicated code search engine for Python so we
have explored code search engines that work across multiple languages and reasoned the need for a
dedicated search engine for Python. PyQuery is divided into three different components: Data Collection,
Data Indexing and Data Presentation. Since we intend to provide two modes of search operations i.e.
Package Level Search and Module Level Search, at each component we employ a list of tools and
techniques to achieve specific goals related to these modes. In Chapter 3 we discussed Data Collection
Module, use of
4
Bandersnatch 1
PEP 381
using code analysis toolsmirror client to 2clone
like Prospector Python3. packages
and CLOC locally
We explored howand later process
to make these packages
use of Abstract Syntax
Trees (ast) to filter out useful information or metadata from Python modules. We also addressed JSON
file format for saving all this information with an example for each type of data.
RiverIn5 Chapter 4 we 6demonstrated
and Analyzer how the
plugins to digest to feed
feedstructured data
data. ES is to Elasticsearch
built (ES)4Lucene
on top of Apache and make
7
anduse of FSa
offers
wide variety of methods to configure data indexing and data retrieval. We explained the purpose behind
agreeing to a specific format for the JSON file to collect the data so that we can make use of
configurations ES offers. One such configuration is minimum fragment size. By minimum fragment size
set to 18 and an identifier collected along with the line number as one word separated by underscores and
right filled with underscores until minimum length of 18, we were able to get a matching identifier and its
line number as one single match. This reduced the size of JSON file indexed in ES drastically and also
helps save time to fetch line number from another key. In this chapter, we have also outlined some sample
queries to index and retrieved meaningful information out of the indexed structured data.
In Chapter 5, we covered the data presentation concepts like browser interface and server setup.
We discussed
columns our implementation
involved in ranking metricsof aand
server side ranking
an example viewalgorithm for package
of these columns for aand module
sample levelAlso,
query. search,
we
presented our preprocessing implementation of faster code search that involves generating the starting
address byte of each line in a module and transforming code into pygments. In Chapter 6 we gave an
overview of how all three components of PyQuery will work together with a system level flow diagram.
Finally, in Chapter 7 we compared results of PyQuery with that of PyPI to prove that we have achieved
our goal to improve PyPI, offer a meaningful search and avoid closely and similarly scored packages.
1
https://pypi.python.org/pypi/bandersnatch] 2https://github.com/landscapeio/prospector
3
http://cloc.sourceforge.net/ 4https://www.elastic.co/products/elasticsearch
5
https://github.com/dadoonet/fsriver
6
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
7
https://lucene.apache.org/core/
4
8.2 Recommendation for Future Work
Although In
improvement. PyQuery accomplished
this section, we would likethe toinitially
list waysestablished
to improve goals,
further.there is definitely scope for
set of keywords for which we know matching packages. We observed thatwePyQuery
We want to perform a large scale comparison on PyQuery. Currently, have tested PyQuery
is doing betterwith
thana
state of the art, PyPI. Python is an extensive language and people from many different fields use the
Python programming language to solve problems in their respective disciplines. In this process, there is
always a continuous production of packages that are useful. There are thousands of packages that are
pretty famous for various reasons. Knowing in advance all the possible keywords that map to these
packages is nearly impossible. A tool gains popularity and importance only when it is widely accepted by
user base. By reaching out to developers of Python community from various disciplines, we can gauge
how well PyQuery is mapping keywords to right packages. We want to plan for large scale user surveys by
asking professional developers to search for packages that they use by direct package name and by
keywords that infer package purpose. We want to collect their feedback and learn if PyQuery is meeting
their requirements and if it is doing a better job than PyPI. We would like to list out use cases where
PyQuery needs to do better.
We can extend PyQuery to a recommendation system. We can apply collaborative filtering
technique
later from i.e.,
this capture
data makeuserpredictions
actions to of
know
a listtheir likes anda dislikes
of packages of Python
user would packages we
find interesting. Thissuggest and
will allow
further improvements on PyQuery. If a user trusts a specific author and tends to explore packages
developed by him/her more often, we could make his or her search results more appealing to him by
filtering packages from this author among the set of initially matched packages. If a user tends to explore
packages specific to a field or category, it is most likely that he/she is working in that field and in future if
there is a user management component added to PyQuery and every time a user login to the website, we
can suggest famous packages from his or her field on dashboard or we can suggest latest news related to
updates made on packages related to his or her field. These are the few of the many possibilities we can
do with Collaborative filtering technique, to facilitate better search operations. This will allow developers
to receive the latest information on packages and help them to make the best of Python packages and
modules. Many successful giants in the field of entertainment like Netflix and Comcast make use of
Collaborative filtering technique to
4
always
Python keep theiritusers
packages couldengaged with
find great their website.
purpose Since PyQuery
for collaborative seeks to help developers explore
filtering.
4
BIBLIOGRAPHY
[1] Caitlin
study. InSadowski, Kathryn
Proceedings of the T2015
Stolee,
10thand Sebastian
Joint MeetingElbaum. How developers
on Foundations search
of Software for code: apages
Engineering, case
191–201. ACM, 2015.
[2] Asimina Zaimi, Noni
giannis, Panagiotis Triantafyllidou,
Sfetsos, and Ioannis Androklis Mavridis,
Stamelos. An Theodore
empirical study onChaikalis,
the reuse Ignatios Deli-
of third- party
libraries in open-source software development. In Proceedings of the 7th Balkan Confer- ence on
Informatics Conference, page 4. ACM, 2015.
[3] Andy
SoftwareLynex and Paul5(1):105–124,
Engineering, J Layzell. Organisational
1998. considerations for software reuse. Annals of
[4] David C Rine
investment and Robert
success factors.MJournal
Sonnemann. Investments
of systems in reusable
and software, software.1998.
41(1):17–32, a study of software reuse
[5] Python Software Foundation. Pypi. https://pypi.python.org/pypi.
[6] Taichino. Pypi ranking. http://pypi-ranking.info/alltime, 2012.

[7] Steven P Reiss.
International Semantics-based
Conference code
on, pages search.IEEE,
243–253. In Software
2009. Engineering, 2009. ICSE 2009. IEEE 31st
[8] Nullege a search engine for python source code. http://nullege.com/.

[9] Iulian
abstractNeamtiu, Jeffrey
syntax tree S Foster,
matching. ACMand MichaelSoftware
SIGSOFT Hicks. Understanding source
Engineering Notes, code evolution
30(4):1–5, 2005. using
[10] Themistoklis Diamantopoulos
improve question-answering in and
stackAndreas L Symeonidis. Employing source code information to
overflow.
[11] Ast abstract syntax trees. https://docs.python.org/2/library/ast.html.

[12] Bernard J Jansen
nine search engineand Amandalogs.
transaction Spink. How are we
Information searching
Processing & the world wide42(1):248–263,
Management, web? a comparison
2006. of
4
BIOGRAPHICAL SKETCH
My name isRao
Nageswara Shiva
is Krishna Imminni
a government and I was
employee andborn in metropolitan
my mother city Laxmi
Mrs. Subba Hyderabad, India. My father
is a homemaker. They Mr.
are
my biggest inspiration and support. I am the elder of two children of my parents. My sister Ramya
Krishna Imminni is very close to my heart and is very special to me. My family is the guiding force
behind the success I have in my career.
I have received Bachelor’s degree from Jawaharlal Nehru Technology University in May 2011
and joined
scripts FactSet
in various Researchlike
languages Systems as QAand
Java, Ruby Automation Analyst.onAt
Jscript. I worked FactSet,
various I wrote QA
automation Au- tomation
frameworks like
TestComplete and Selenium. I was one among the first three employees hired for QA Automation
process, so I had a lot of opportunities to try various job roles and experiment with new technologies.
Out of all the job roles I had done, I liked training new hires the most. I was promoted to QAAutomation
Analyst 2 in a short span of 1 year and awarded Star Performer for the year 2013. It is at FactSet I
developed Testlogger, a ruby library to generate xml log files; custom built for QA terminology like
<testcase> and <teststep>. I worked at FactSet for 2 years from November 2011 to December 2013 and
gained a diverse experience performing various roles. I have joined Department of Computer Science at
Florida State University as a Master of Sci- ence student in Spring 2013. At FSU I have continued to
gain professional experience working part time as Software Developer, Graduate Research Assistant at
iDigInfo, Institute of Digital Information and Scientific Communication. At iDigInfo, I worked on
various projects related to research in specimen digitization. Some of these projects include Morphbank,
a continuously growing database of images that scientists use for international collaboration, research and
education and iDigInfo-OCR, an optical character recognition software for digitizing label information of
specimen collections. I also worked as a Graduate Teaching Assistant for Bachelor level Software
Engineering course. As a part of my course work, I have taken a Python course under Dr. Piyush Kumar
that led to my interest in working on PyQuery. Experience I have gained while working on PyQuery
helped me get an intern opportunity. During the Summer of 2015, I interned with Bank of America.
As an intern, I worked on various technologies related to BigData including Hadoop HDFS, Hive
and Impala.

sEARCH eNGINE

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

sEARCH eNGINE

Uploaded by

Copyright:

Available Formats

Florida State University Libraries

Pyquery: A Search Engine for Python

COLLEGE OF ARTS AND SCIENCES

A SEARCH ENGINE FOR PYTHON PACKAGES AND MODULES

SHIVA KRISHNA IMMINNI

Professor Directing Thesis

3.2.3 Code Quality..................................................................................................................12

4.2.2 Module Level Search......................................................................................................17

8.2 Recommendation for Future Work..............................................................................................44

5.1 Ranking matrix for keyword music...............................................................................................28

5.2 Matching packages and their scores for keyword music................................................................29

7.1 Results comparison for keyword - requests...................................................................................37

7.2 Results comparison for keyword - flask........................................................................................37

7.3 Results comparison for keyword - pygments.................................................................................38

7.4 Results comparison for keyword - Django....................................................................................38

7.5 Results comparison for keyword - pylint.......................................................................................39

7.6 Results comparison for keyword - biological computing..............................................................39

7.7 Results comparison for keyword - 3D printing..............................................................................40

7.8 Results comparison for keyword - web development framework..................................................40

7.9 Results comparison for keyword - material science......................................................................41

7.10 Results comparison for keyword - google maps............................................................................41

3.1 Metadata from PyPI for package requests.......................................................................................9

3.2 Pseudocode for collecting metadata of a module...........................................................................12

3.3 Metadata for a module in Flask package........................................................................................13

4.1 Package level data index mapping.................................................................................................15

4.2 River definition.............................................................................................................................16

4.3 Custom Analyzer with custom pattern filter..................................................................................17

4.4 Module level data indexing mapping............................................................................................18

4.5 Package level search query............................................................................................................19

4.6 Module level search query............................................................................................................20

5.1 Package modal...............................................................................................................................22

5.2 Package statistics...........................................................................................................................23

5.3 Other packages from author...........................................................................................................24

5.4 Pseudocode for ranking algorithm.................................................................................................27

5.5 Module modal...............................................................................................................................30

6.1 System Level Flow Diagram of PyQuery......................................................................................32

6.2 PyQuery homepage.......................................................................................................................33

6.3 PyQuery package level search template.........................................................................................34

6.4 PyQuery module level search template.........................................................................................35

codeWe have discussed

Figure 3.1: Metadata from PyPI for package requests.

and In order to allow

if name == " main ": for

Figure 3.2: Pseudocode for collecting metadata of a module.

‘‘class variable ’’: ‘‘ salt 278 d i ges t me t h o d 28

Figure 3.3: Metadata for a module in Flask package.

DATA INDEXING AND SEARCHING

Figure 4.1: Package level data index mapping.

Figure 4.2: River definition.

Figure 4.3: Custom Analyzer with custom pattern filter.

Figure 4.4: Module level data indexing mapping.

Figure 4.5: Package level search query.

Figure 4.6: Module level search query.

1. Request ES server for top 20 results for user query.

2. Request Bing for top 20 results for user query.

3. Generate other relevant columns or metrics.

# execution starts here columns =

Figure 5.4: Pseudocode for ranking algorithm.

vk-music musicplayer pylast vk-music mps-youtube mopidy-musicbox-webclient

jmbo-music music mps-youtube jmbo-music mps-youtube musicplayer

panya-music mopidy-gmusic mps-youtube panya-music vis-framework musicplayer

tweet-music pyspotify jmbo-music tweet-music pyacoustid mopidy-gmusic

google-music-playlist- musicplayer hachoir-metadata google-music-playlist- cherrymusic hachoir-metadata

spilleliste music21 mp3play spilleliste music21 music21