
Finding Highly Associated Tags on Stack Overflow

Ankur Kothari - 431153361


April 2, 2018

Abstract
Final term project proposal for the Social Media and Data Mining course.

1 Problem:
Discovering closely associated tags on Stack Overflow is a difficult task. If you go to Stack Overflow
and click on any tag, for example operating system, the related tags it displays include programming
languages such as C++, C#, and Python alongside tags such as macos and kernel. Even though a few of
these tags make sense, we rarely want to filter this topic by a language such as Python. I would
instead like to see tags such as multithreading and fault tolerance, along with other high-level tags
closely associated with 'operating system'. Similarly, for data structures, C#, Java, and Python come
up in a lot of questions, but I would rather see graphs, trees, stacks, queues, and linked lists among
the related tags. Even though some of these tags do show up, I would like to filter out the unrelated ones.
A famous story about association rule mining is the "beer and diaper" story. A purported
survey of the behavior of supermarket shoppers discovered that customers (presumably young
men) who buy diapers tend also to buy beer. This anecdote became popular as an example of how
unexpected association rules might be found from everyday data. There are varying opinions as
to how much of the story is true.
This is why frequent co-occurrence alone cannot be trusted: if c++ occurs a lot with data-structures,
we cannot infer any specific relationship between the two concepts. It may simply mean that c++ and
data-structures come from two very different areas, which is in fact the case. A similar phenomenon
appeared when I looked at the pattern for operating systems, where the top tags were programming
languages that carried no meaning in this context.

2 Approach
2.1 Data:
Google Cloud hosts a public dataset of the Stack Overflow questions and answers since 2008 and provides
a service called BigQuery, which I will use together with the Python programming language to find the
most closely related tags on Stack Overflow.

2.2 Input:
The user will enter a keyword in the form of a tag used on the Stack Overflow website, for example,
algorithm.

2.3 Querying:
We will use SQL (Structured Query Language) to query the Stack Overflow dataset hosted on Google Cloud
through BigQuery, a service provided by Google Cloud for querying the public datasets available on its
platform.
The tables provided include questions, answers, and users, which I will use to find the tags and
questions needed to identify strongly associated tags.
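As a rough illustration of the kind of query this step involves, the sketch below builds a SQL string that counts the tags co-occurring with a query tag. It rests on assumptions not stated in this proposal: that the questions table is `bigquery-public-data.stackoverflow.posts_questions`, that its `tags` column stores tags separated by `|`, and that the `google-cloud-bigquery` client library and credentials are available. The function names are hypothetical.

```python
def build_cotag_query(top_n=20):
    """Build a SQL string counting the tags that co-occur with @tag.

    Assumptions (not from the proposal): the table name below and a
    '|'-separated `tags` column on each question row.
    """
    return f"""
        SELECT co_tag, COUNT(*) AS n_questions
        FROM `bigquery-public-data.stackoverflow.posts_questions`,
             UNNEST(SPLIT(tags, '|')) AS co_tag
        WHERE @tag IN UNNEST(SPLIT(tags, '|'))
          AND co_tag != @tag
        GROUP BY co_tag
        ORDER BY n_questions DESC
        LIMIT {int(top_n)}
    """


def run_cotag_query(tag, top_n=20):
    """Run the query on BigQuery; needs google-cloud-bigquery and credentials."""
    # Imported lazily so the query builder works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("tag", "STRING", tag)]
    )
    rows = client.query(build_cotag_query(top_n), job_config=job_config).result()
    return [(row["co_tag"], row["n_questions"]) for row in rows]


if __name__ == "__main__":
    # Printing the SQL needs no network access or credentials.
    print(build_cotag_query(top_n=20))
```

Passing the tag as a query parameter rather than formatting it into the string keeps user input out of the SQL text; the row limit is coerced to an integer before it is embedded.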

2.4 Approach:
We consider Stack Overflow question tags as technologies for computer programming. Given a set
of Stack Overflow questions, we use association rule mining to mine technology associations from
tag co-occurrences in questions.
In this work, a Stack Overflow question is considered as a transaction and the question tags as
items in the transaction.
First, we collect the tags of every question in the questions table in which the query tag occurs.
From this result we build a bigram model and find the 20 most common pairs of tags. We then iterate
over the tags in these pairs, find the 20 most common co-occurring tags for each of them, and check
whether the query tag occurs in that list. If it does, we assume that the two tags are somewhat
related. In this way we obtain all the tags that might be closely related to the query tag.
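The mutual top-k check described above can be sketched on in-memory data as follows. This is a minimal illustration, not the project's implementation: the function names and the toy questions are hypothetical, and k is set to 3 instead of 20 so that the small example stays readable.

```python
from collections import Counter


def top_cotags(questions, tag, k):
    """Return the k tags that most often co-occur with `tag`."""
    counts = Counter()
    for tags in questions:
        if tag in tags:
            counts.update(t for t in tags if t != tag)
    return [t for t, _ in counts.most_common(k)]


def mutually_related(questions, query, k=20):
    """Keep a candidate only if `query` is also among the candidate's top k."""
    related = []
    for candidate in top_cotags(questions, query, k):
        if query in top_cotags(questions, candidate, k):
            related.append(candidate)
    return related


# Hypothetical toy data: each tuple holds the tags of one question.
questions = [
    ("data-structures", "tree", "python"),
    ("data-structures", "tree", "java"),
    ("data-structures", "graph", "python"),
    ("data-structures", "tree", "graph"),
    ("data-structures", "tree", "python"),
    ("python", "pandas", "numpy"),
    ("python", "pandas", "django"),
    ("python", "pandas", "numpy"),
    ("python", "django", "numpy"),
    ("python", "pandas", "django"),
    ("python", "django", "numpy"),
]

if __name__ == "__main__":
    print(mutually_related(questions, "data-structures", k=3))
```

In this toy data, python is among the top co-occurring tags of data-structures, but data-structures is not among the top tags of python, so python is filtered out while tree and graph are kept.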
We would also like to apply clustering and related methods, such as topic modeling (LDA) and
community detection (the Louvain method), as well as association rule mining algorithms such as
Apriori and especially FP-growth. Apriori is very expensive, whereas FP-growth takes just 2 passes
over the whole dataset to generate the frequent tags. We will see whether these methods can solve
this problem efficiently.
We need to find frequent pairs of technologies, i.e., frequent itemsets that consist of two tags.
A pair of tags is frequent if the percentage of all questions that are tagged with this pair is
above the minimal support threshold t_sup. Given a frequent pair of tags t1, t2, association rule
mining generates an association rule t1 ⇒ t2 if the confidence of the rule is above the minimal
confidence threshold t_conf. The confidence of the rule t1 ⇒ t2 is computed as the percentage of
the questions tagged with the antecedent tag t1 that are also tagged with t2.
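To make the support and confidence definitions concrete, here is a small sketch that computes both for tag pairs by brute-force counting. It is an illustration only, not Apriori or FP-growth; the function name, the toy questions, and the threshold values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(questions, t_sup, t_conf):
    """Return rules (antecedent, consequent, support, confidence) for tag pairs.

    support(t1, t2)    = questions tagged with both / all questions
    confidence(t1=>t2) = questions tagged with both / questions tagged with t1
    """
    n = len(questions)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in questions:
        unique = sorted(set(tags))  # count each tag once per question
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support <= t_sup:  # the pair must be above the support threshold
            continue
        # Confidence is asymmetric, so test both rule directions.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / tag_counts[antecedent]
            if confidence > t_conf:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical toy data: each tuple holds the tags of one question.
questions = [
    ("java", "spring"),
    ("java", "spring"),
    ("java", "android"),
    ("java",),
    ("python", "django"),
]

if __name__ == "__main__":
    for rule in mine_pair_rules(questions, t_sup=0.3, t_conf=0.6):
        print(rule)
```

In the toy data, the pair (java, spring) has support 2/5. The rule spring ⇒ java has confidence 2/2, while java ⇒ spring has confidence only 2/4 and is dropped, which shows why the direction of a rule matters when a very common tag such as a programming language is involved.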

2.5 Estimated Time and Effort:


Initially, the time will go into understanding the various relationships among the tags and checking
whether frequent association rule mining is able to find closely related tags. If this method works,
we will move on to displaying the rules as graphs and other visualizations, so that they capture and
show the combinations of tags that make the most sense.
After this, we will concentrate on finding patterns among tags, users, and user reputation, and check
whether users with lower reputation ask more questions on some tags than on others. This will help us
recommend new questions to users to help them build their understanding of core concepts.

2.6 Output:
After the user enters a keyword, we will show a graph or a tree of the tags that are closely
associated with the entered keyword, and also display the categories that represent those tags in
a hierarchical way.

3 References:
"http://ccywch.github.io/chenchunyang.github.io/publication/technology_landscape.pdf/"
"https://en.wikipedia.org/wiki/Association_rule_learning#FP-growth_algorithm/"