
Stack Search

StackOverflow Search Clustering


Sonali Sharma, Shubham Goel, Priya Iyer

Project Report

Report presented towards the completion of the class project for INFO 256 Applied Natural Language Processing

Date: 12/16/2013


Marti Hearst


Aditi Muralidharan



Many search engines are available today. When users type a query to search for something on the internet, they are often overwhelmed by the large number of results, and it becomes difficult to browse the list and pick out the most relevant items. We implemented an algorithm that uses NLP techniques to group search results into categories, making it easy for the end user to navigate to a category of interest and look for results within it. We chose the Findex algorithm for this project. Findex is a text categorization algorithm that provides an overview of search results as categories, where the categories are the most frequent words and phrases in the resulting document set. The algorithm is based on the assumption that the most frequently used words and phrases in a set of documents capture its major topics well. We used StackOverflow data to implement the algorithm.

2. Project Goals

2.1 Original intent

We originally intended to cluster similar sentences using a monothetic clustering algorithm such as DisCover [1]. Monothetic clustering is a technique in which each cluster is formed using only one feature, and that feature is present in every sample of the cluster; in our case the samples are documents. We also intended to run the algorithm on the WordSeer [2] project and its associated humanities corpora. However, upon deeper analysis we identified some hurdles to this approach, as discussed below.

2.2 Algorithm

The DisCover algorithm aims for full coverage, whereas search results do not necessarily all have to fall under a cluster. Moreover, understanding and implementing the algorithm appeared complex and beyond the scope of a class project. Also, a good measure and method for evaluating the DisCover algorithm had not been established, so we decided to use another algorithm, Findex [3], to categorize search results. The Findex algorithm is described in detail in Section 4.

2.3 Data

The developer of WordSeer, Aditi Muralidharan, had set up two text corpora to work with WordSeer. However, the text seemed better suited to testing than to real-world use.

2.4 WordSeer

Also, WordSeer had already implemented a version of the Findex algorithm for clustering search results. That left little for us to do, and we wanted to learn more about Findex and its implementation.

Thus, we decided to implement a search result clustering algorithm (Findex) over StackOverflow data using the StackOverflow API. We built a search interface where StackOverflow users can type in their questions and see the results organized into categories.

2.5 Accomplishments


Data processing was a much harder task than we had originally expected. Because we had planned to use the ready-made database that Aditi had uploaded to the WordSeer platform, we had not expected the data cleaning and loading tasks to be so arduous. The StackOverflow API provided us with data in huge XML and HTML files. Much of our time was spent extracting the data, cleaning it, parsing it and loading it into the database.


We chose Findex as our algorithm for categorizing the search results. Details of the implementation are discussed in Section 4. Our original intent was to create only one layer of categories for the search result topics, but we ended up creating a second layer of sub-categories by recursively applying the Findex algorithm to each of the parent categories. From an NLP standpoint, we implemented many concepts, from word tokenization, lemmatization and n-gram creation to phrase frequency distribution.

Search User Interface

Initially, our vision was only to create categories from the search results and display them on the console. However, we decided to go a step further and create a web interface as well. We realized that visualizing the categories and sub-categories along with the content of the StackOverflow questions and responses would be a good way to demonstrate the value of search result categorization.

Future Work

We intend to plug the system we developed into the StackOverflow interface so that users can quickly browse to the questions and responses that interest them through the categorized topics. Currently, the StackOverflow user interface offers only tags as a means of classification. However, tags are user generated and are sometimes not relevant to the question.

2.6 Results

The following figures show the results of our project. Each figure shows a different query term and the categories generated for it. Notice that some query terms have sub-categories while others do not. This is because Findex only displays categories or sub-categories when a substantial number of questions fall under them. Detailed descriptions of these result pages are provided in the next section.

Fig. 1 Query word: “python”

Fig.2 Query word: “databases in python”


Fig.3 Query word: “Java Interview Questions”



Fig. 4 Part-of-speech tagging for query word: “memory management”

3. Data

We downloaded the Stack Exchange Creative Commons Data Dump, which contains all the public data from websites such as Stack Overflow, Server Fault and Stack Apps up to September 2011. The data files were in XML format, with each question and answer stored as an entry with the <row> tag. Because the relevant Stack Overflow data was over 4GB in size, we first built our database using a smaller data set, which was about 21MB in size. This allowed us to make progress in parallel on the database front and the algorithm front. To save time we initially set everything up in a sqlite database. We processed the XML file using Python's built-in xml.dom library and stored the answers and questions in separate tables. We cleaned the answer text with BeautifulSoup by removing all HTML tags and dropping the content of certain tags such as <CODE>, because we did not want to pass code snippets to our Findex engine. After cleaning the answer text we merged the answer and question tables and discarded the extra data, such as unanswered questions and irrelevant answers. Using this temporary database we started building our Findex engine and the user interface.

As the next step, we built our database with the Stack Overflow data and switched from sqlite to MySQL. The xml.dom library did not work well for parsing a large data set because it reads the entire data file into memory before processing it. We switched to xml.etree.cElementTree, a C-based library, and had to change some of the SQL statements to import the Stack Overflow data into the MySQL database. This gave us over 1 million questions with cleaned answer text. For our final demonstration and user interface we used a subset of ten thousand questions in a sqlite database to improve the response time of the system.
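As a rough illustration of the streaming approach described above, the sketch below reads one <row> at a time and strips HTML while dropping the content of code tags. It is not the project's code. It uses the standard library's xml.etree.ElementTree.iterparse (current Python versions pick the C implementation automatically, and xml.etree.cElementTree no longer exists as a separate module) and html.parser as a dependency-free stand-in for BeautifulSoup. The attribute names Id, PostTypeId and Body, and the sample data, are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text while skipping everything inside <code> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.code_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth:
            self.code_depth -= 1

    def handle_data(self, data):
        if self.code_depth == 0:
            self.parts.append(data)


def clean_html(body):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(body)
    parser.close()
    return " ".join("".join(parser.parts).split())


def iter_posts(source):
    """Yield (id, post_type, cleaned_text) for each <row>, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield elem.get("Id"), elem.get("PostTypeId"), clean_html(elem.get("Body", ""))
            elem.clear()  # drop the row's attributes once it has been handled


# Hypothetical two-row sample standing in for the real multi-gigabyte file.
SAMPLE = b"""<posts>
<row Id="1" PostTypeId="1" Body="&lt;p&gt;How do I free memory?&lt;/p&gt;" />
<row Id="2" PostTypeId="2" Body="&lt;p&gt;Call &lt;code&gt;free(p)&lt;/code&gt; once.&lt;/p&gt;" />
</posts>"""

posts = list(iter_posts(io.BytesIO(SAMPLE)))
```

In a real run the generator would be consumed row by row and each tuple written to the database, rather than collected into a list as done here for the small sample.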

4. Algorithms

We implemented a modified version of the Findex algorithm. The major steps of our version are described below.

4.1 Text Mining

This was one of the most important steps. Before calculating frequencies, the data had to be transformed into a particular format and stored in the sqlite database: the clean data was tokenized, stripped of stop words, lemmatized and converted to n-grams.

Tokenization - The answer text was tokenized using the NLTK tokenizer. Tokenization splits a string into a list of its constituent words.

Stop word removal - To get more relevant categories from the frequency distribution of tokenized words, it was important to exclude stop words. Otherwise, stop words would appear in the list of most frequent words and would result in meaningless categories. We decided to use the stop word list from the linguistic tools resources of the Information Retrieval department of the University of Glasgow. This was an exhaustive list of stop words and worked well on our dataset. The list of stopwords can be viewed here.

Lemmatization - After removing stop words, the next step was to lemmatize the words. Without lemmatization, simple inflections of a word, such as debug and debugging, list and listing, or car and cars, would appear as separate categories. To prevent this we used NLTK's WordNetLemmatizer to lemmatize the tokens.

ngrams - The next step was to create unigrams, bigrams and trigrams. We decided to go up to trigrams because phrases of up to 3 words made more relevant categories than phrases of more than 3 words. For every answer, we created a list of unigrams, bigrams and trigrams and stored them in an ngrams table that we created. We stored the original unigrams, bigrams and trigrams along with their lemmatized versions, as shown below.

Fig. 5 Table storing ngrams for tokenized answers
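The text mining steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the regular-expression tokenizer stands in for the NLTK tokenizer, the toy_lemma function is a crude stand-in for WordNetLemmatizer, and the stop list is a tiny hypothetical sample rather than the University of Glasgow list. Only the n-gram construction and the pairing of original and lemmatized phrases are meant to mirror the description.

```python
import re

# Tiny illustrative stop list (assumption); the report used a much longer one.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "and"}


def tokenize(text):
    """Lowercase word tokenizer (stand-in for the NLTK tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_lemma(token):
    """Crude plural stripping (stand-in for WordNetLemmatizer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrases_for_answer(text, max_n=3):
    """Return (original phrase, lemmatized phrase) pairs for n = 1..max_n."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    lemmas = [toy_lemma(t) for t in tokens]
    rows = []
    for n in range(1, max_n + 1):
        rows.extend(zip(ngrams(tokens, n), ngrams(lemmas, n)))
    return rows


rows = phrases_for_answer("The pointers in shared memory")
```

Each pair in rows corresponds to one row that could be written to an ngrams table alongside the answer's id. Note that in this sketch the n-grams are built after stop words are removed, following the order in which the steps are listed.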

4.2 User Query

The previous step was a one-time activity to load the dataset into the database. To run a specific search, the user enters a query in the search interface.

Fig. 6 User Interface to input query term

Our program reads the query phrase entered by the user and executes the steps below:














Lemmatize terms - Tokens are then lemmatized: ["pointer", "in", "memory"]

Stop words - On removing stop words we are left with ["pointer", "memory"]

Finding relevant questions - For each term in the user query, we look up the ngrams table to fetch the phrases containing that term. We then calculate the frequency of these phrases and arrange them in descending order.

4.3 Frequency distribution

After fetching all phrases containing the query terms, we calculate the frequency distribution of the phrases and keep only the top 20. The figure below shows the phrases along with their frequencies for the query “pointers in memory”. Each of these phrases is treated as a separate category.

Fig. 7 Most frequent phrases

After obtaining the list of top 20 phrases (by frequency), we also fetch the corresponding question and answer ids. These are used later to display the search results.
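The lookup and ranking described above might look something like the following sketch. The table layout (an ngrams table with question_id and lemma_phrase columns), the sample rows and the function name top_phrases are all hypothetical; the report does not give the actual schema. Whole-word matching is done here by padding the phrase with spaces before applying LIKE, which is one possible choice among several.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (question, lemmatized phrase) occurrence.
conn.execute("CREATE TABLE ngrams (question_id INTEGER, lemma_phrase TEXT)")
conn.executemany(
    "INSERT INTO ngrams VALUES (?, ?)",
    [
        (1, "memory leak"), (2, "memory leak"), (3, "memory leak"),
        (1, "shared memory"), (4, "shared memory"),
        (2, "pointer"), (5, "null pointer"),
        (6, "thread pool"),
    ],
)


def top_phrases(conn, query_terms, limit=20):
    """Return [(phrase, frequency, question_ids)] for phrases containing any term."""
    if not query_terms:
        return []
    # Pad with spaces so that "pointer" matches as a whole word only.
    where = " OR ".join("' ' || lemma_phrase || ' ' LIKE ?" for _ in query_terms)
    params = [f"% {term} %" for term in query_terms]
    ranked = conn.execute(
        f"SELECT lemma_phrase, COUNT(*) AS freq FROM ngrams WHERE {where} "
        "GROUP BY lemma_phrase ORDER BY freq DESC, lemma_phrase LIMIT ?",
        params + [limit],
    ).fetchall()
    # Collect the question ids per phrase so results can be displayed later.
    ids = defaultdict(list)
    for phrase, qid in conn.execute(
        "SELECT lemma_phrase, question_id FROM ngrams ORDER BY question_id"
    ):
        ids[phrase].append(qid)
    return [(phrase, freq, ids[phrase]) for phrase, freq in ranked]


result = top_phrases(conn, ["pointer", "memory"])
```

Each tuple in result is one candidate category: the phrase, how often it occurs, and the questions to show when the user selects it. Ties in frequency are broken alphabetically here, which is an arbitrary choice for the sketch.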

4.4 Hierarchy

In this project we went beyond the Findex algorithm and introduced a hierarchy among the categories. The unigrams in the resulting phrases were treated as top-level categories. Bigrams containing a unigram formed the second level of the hierarchy and, similarly, trigrams containing a bigram formed the third level. When displaying the search results we showed all three levels (where applicable).

Below is a screenshot of the categories returned, with their parent-child relationships. Parents represent Level 1 categories (unigrams), children represent Level 2 categories (bigrams) and grandchildren represent Level 3 categories (trigrams).

Fig. 8 Creating hierarchy of phrases
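One way to build such a three-level hierarchy is sketched below. It assumes that "containing" means the shorter phrase appears as a contiguous run of words inside the longer one; the report does not spell out the exact rule, and the function name build_hierarchy and the sample phrases are made up for the example.

```python
def build_hierarchy(phrases):
    """Nest bigrams under their unigrams and trigrams under their bigrams."""
    by_len = {1: [], 2: [], 3: []}
    for phrase in phrases:
        words = phrase.split()
        if len(words) in by_len:
            by_len[len(words)].append(words)

    def contains(longer, shorter):
        """True if `shorter` occurs as a contiguous run of words in `longer`."""
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    tree = {}
    for uni in by_len[1]:
        children = {}
        for bi in by_len[2]:
            if contains(bi, uni):
                # Level 3: trigrams that contain this bigram.
                children[" ".join(bi)] = [
                    " ".join(tri) for tri in by_len[3] if contains(tri, bi)
                ]
        tree[uni[0]] = children
    return tree


tree = build_hierarchy(
    ["memory", "pointer", "memory leak", "shared memory", "memory leak detection"]
)
```

With this input, "memory" becomes a parent with the children "memory leak" and "shared memory", and "memory leak detection" is a grandchild under "memory leak". A bigram that contains two top-level unigrams would appear under both parents in this sketch.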

4.5 Displaying results

Front end

The front end for displaying results was built using the Flask framework. We connected to a sqlite3 database to fetch results, and the web page was built using Flask's Jinja2 templates. The search interface was simple, and the user could enter any type of search query. As described in the sections above, the search query was parsed first, and the results were calculated on the fly and returned in hierarchical order. As shown in the figure below, the results were arranged by phrase frequency. The hierarchy of categories also makes searching easier for the end user.

Fig. 9 Search results for query “pointers in memory”

The user can now select a particular category of interest. On selecting a category, the list of associated questions is displayed in the right pane. The answers can be viewed by clicking on a question. This interface makes it easy for users to browse the questions corresponding to the selected category. In the example below, the user is only interested in the categories memory stream, pointers, memory layout and memory leak. There are 10 questions corresponding to this selection.

Fig. 10 Selecting categories to view relevant search results

5. Further analysis

We analyzed the categories further using part-of-speech (POS) tagging. This exercise produced some interesting results that could be used to classify the categories further. In the POS-tagged categories we found patterns such as Adjective-Noun, Verb-Noun and Noun-Noun.

Adjective-Noun and Noun-Noun combinations depicted types of the category. We can see that transactional, variable, virtual, actual and table are all types of memory.

[('transactional', 'JJ'), ('memory', 'NN')]
[('variable', 'JJ'), ('memory', 'NN')]
[('virtual', 'JJ'), ('memory', 'NN')]
[('actual', 'JJ'), ('memory', 'NN')]
[('table', 'JJ'), ('memory', 'NN')]

Verb-Noun combinations depict various actions on, or usages of, the noun in the category. The actions that can be performed on memory include writing, freeing, allocating, loading, sharing, etc. This pattern was clearly visible once the verbs in the categories were identified.

[('writing', 'VBG'), ('memory', 'NN')]
[('freeing', 'VBG'), ('memory', 'NN')]
[('allocated', 'VBD'), ('memory', 'NN')]
[('related', 'VBD'), ('memory', 'NN')]
[('cached', 'VBD'), ('memory', 'NN')]
[('loaded', 'VBD'), ('memory', 'NN')]
[('shared', 'VBD'), ('memory', 'NN')]
[('written', 'VBN'), ('memory', 'NN'), ('tested', 'VBN')]
[('written', 'VBN'), ('memory', 'NN')]
[('string', 'VBG'), ('memory', 'NN')]

From the above analysis we could clearly see a pattern in the categories returned. Just by looking at the parts of speech, it was easy to categorise the list further.
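A minimal sketch of how tagged categories could be grouped by pattern is shown below. It takes already-tagged phrases as input (four of the tagged categories listed in this section) and maps Penn Treebank tag prefixes to coarse labels. The function name pattern_of and the coarse labels are illustrative choices, and the tagging step itself is not shown.

```python
from collections import defaultdict


def pattern_of(tagged):
    """Map Penn Treebank tags to a coarse pattern such as 'Adjective-Noun'."""
    coarse = []
    for _word, tag in tagged:
        if tag.startswith("JJ"):
            coarse.append("Adjective")
        elif tag.startswith("VB"):
            coarse.append("Verb")
        elif tag.startswith("NN"):
            coarse.append("Noun")
        else:
            coarse.append("Other")
    return "-".join(coarse)


# Tagged categories copied from the listings in this section.
tagged_categories = [
    [("virtual", "JJ"), ("memory", "NN")],
    [("transactional", "JJ"), ("memory", "NN")],
    [("freeing", "VBG"), ("memory", "NN")],
    [("shared", "VBD"), ("memory", "NN")],
]

# Group the category phrases by their coarse POS pattern.
groups = defaultdict(list)
for tagged in tagged_categories:
    groups[pattern_of(tagged)].append(" ".join(word for word, _tag in tagged))
```

Here groups maps each pattern to the phrases that follow it, so the Adjective-Noun group collects the "types of memory" and the Verb-Noun group collects the actions. The grouping is only as reliable as the tags it is given.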

6. Contributions of Each Team Member

Initial parsing of Stack Overflow data

Initial loading of questions and answers into the database

Tokenization, stop word removal and loading of ngrams for the entire dataset

Frequency calculation

POS tagging

Front end

Code cleanup

7. Code

We wrote the code for the Findex algorithm and the front end from scratch, along with the code that extends Findex to do hierarchical categorization and part-of-speech tagging. You can find our code repository here:

8. Bibliography

[1] Kummamuru, Krishna, et al. "A hierarchical monothetic document clustering algorithm for summarization and browsing search results." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.

[2] Muralidharan, Aditi, and Marti Hearst. "Wordseer: Exploring language use in literary text." Fifth Workshop on Human-Computer Interaction and Information Retrieval. 2011.

[3] Käki, Mika, and Anne Aula. "Findex: improving search result use through automatic filtering categories." Interacting with Computers 17.2 (2005): 187-206.