
Clustering Models for Topic Analysis in Graduate Discussion Forums
Gokarn Mallika Nitin
Swapna Gottipati
Venky Shankararaman

27th International Conference on Computers in Education
ICCE 2019
Agenda

• Background
• Motivation for Discussion Forum Analysis
• Literature Survey
• Research Problem Statement
• Data Set details
• Solution Model and Architecture Design
• Experiments Findings
• Limitations and Future Work
• Demo
Discussion Forums in Teaching & Learning Process

• Flexibility
• Support the learning process
• Provide an equitable space
• Encourage interactions

As instructors, are we able to fully exploit the discussions for effective teaching and learning?
Survey on Class Discussions

50 students participated in a quantitative and qualitative survey

Top 3 suggested learning aids:
1. Summary of each topic discussion (80%)
2. Report of the entire class discussion for a session (74%)
3. List of questions with answers (64%)

Top 3 analytics on a visual dashboard:
1. Questions and answers discussed (76%)
2. Topics discussed (72%)
3. Flow of discussions (66%)
Motivation of Discussion Forums Analysis

Effective intervention requires an understanding of the content and analysis of the forums.

Analytics on discussion forums: collecting, analysing, and displaying the “traces” that learners leave behind, with a purpose to improve learning.
Literature Review
Conceptual Framework for Classroom Discussion Analysis

• Online discussions: learning management systems
• In-class discussions: mobile-app-based speech-to-text conversion APIs
• Discussion data pre-processing: NLP and information retrieval techniques

Classroom Discussion Analysis Model (NLP, text mining, data mining, machine learning, statistics)
• Individual Behaviour Analysis: participation, faculty response analysis, comparison
• Discussions Data Content Analysis: topic discovery and evolution, summarization, content type and extraction
• Interaction Behaviour Analysis: peer interactions, faculty-student interactions, argumentative analysis, discourse analysis, sentiment analysis

Visualizations: interactive reporting, web engines
Classroom Discussion Analysis Model

• Individual Behaviour Analysis: participation, faculty response analysis, comparison
• Discussions Data Content Analysis: topic discovery and evolution, summarization, content type and extraction
• Interaction Behaviour Analysis: individual interactions, faculty-student interactions, argumentative analysis, discourse analysis, student profiling

Legend: This Paper, Ongoing Work

Mallika Nitin, Swapna Gottipati, Venky Shankararaman. “Cognitive and Social Interaction Analysis in Graduate Discussion Forums.” In Proceedings of the 49th Annual Frontiers in Education Conference (FIE 2019).
Research Problem

Discussions Data Content Analysis: Topic Discovery and Evolution

Identify the sub-topics and the evolution of topics within the discussion forum.
• Indicates students' in-class and out-of-class learning
• Aids faculty to intervene in the less discussed topics (content expansion)
• Aids in gaining insights into forums and supports collaborative learning with further features

Research Questions:
RQ1: How well does the clustering technique perform in discovering sub-topics?
RQ2: Which visualizations are suitable for sub-topic and topic evolution representations?
Data

Graduate course: Text Analytics and Applications
Offered by School of Information Systems

• Course duration: 14
• Lesson content: 8
• Enrolled students: 55
• Industry experience: >50%
• Forum participants: 37
• Responses: 200
Data - Discussion Forum Design

Graduate students are reluctant to participate if the forum is redundant or merely repeats the course content.

Week  Discussion Forum Thread

0  General discussions
   General discussions including concepts, labs, class etc.
1  Text Mining Introduction
   What are applications of text mining in the education domain?
2  Text pre-processing and NLP
   How do search engines (Bing or Google) use NLP?
   What are examples of applications of chatbots in different industries?
3  Document Similarity
   Explain the differences between the bag of words and the vector space model.
4  Text Classification
   What are examples of text classification in industry (government, healthcare, banks, etc.)?
   What are the various evaluation measures for text classification?
5  Text Clustering
   What are visuals for displaying cluster results? (Free draw and upload)
   Explain one clustering evaluation measure with an example.
6  Information Extraction
   What are applications of HMM models (or any other sequence model)?
   What are examples of information extraction in industry (e.g. finance, retail, travel, healthcare, media, education)?
9  Sentiment analysis
   Discuss the technique to handle negation in opinions.
   Discuss the technique to handle sarcasm in opinions.
Data – Sample Post

Name: Student 4
Week: 2
Topic: Text Classification
Post: “Hi all, There are many industries to use text classification. For example, the service industry (restaurant, hotel, Booking.com) will use text classification to do the Sentiment Analysis. With sentiment analysis, usually you may have a comment, tweet or review from a user or customer and you want to programmatically detect the sentiment (if they are talking positively or negatively about something): In the case of hotels, you may classify opinions to know if they are talking about the service, location, price, etc. Thanks”

Main Topic Sub-topics


Solution Model

Data processing
1. Data cleaning
2. Names anonymization
3. Stopword removal
4. Lemmatization/Stemming
5. TF-IDF matrix representation

Topic Discovery
1. Document similarity metrics
2. Number of clusters (elbow method)
3. K-means/Agglomerative
4. Top words for each cluster (sub-topics)

Visualization
1. Exploratory visuals
2. Sub-topic network visuals
3. Topic evolution visuals

Evaluations
1. Qualitative analysis of clustering algorithms
2. Quantitative analysis
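The data processing steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the posts, the name list, and the `clean` helper are hypothetical, scikit-learn's built-in English stop-word list stands in for whichever list was actually used, and lemmatization/stemming is left out to keep the sketch free of corpus downloads.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample posts; the real forum data is not reproduced here.
posts = [
    "Hi all, chatbots answer customer service questions in banks. Thanks, Alex",
    "Sentiment analysis must handle negation and sarcasm in opinions.",
    "Doctors use text analytics on patient records in healthcare.",
    "Banks also use chatbots to handle customer complaints.",
]

def clean(text, names=()):
    """Lowercase, mask known student names, and strip non-letter characters."""
    text = text.lower()
    for name in names:
        # Names anonymization: replace each known name with a neutral token.
        text = re.sub(r"\b" + re.escape(name.lower()) + r"\b", "student", text)
    return re.sub(r"[^a-z\s]", " ", text)

cleaned = [clean(p, names=("Alex",)) for p in posts]

# Stop-word removal and TF-IDF weighting in one step (one row per post).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(cleaned)
print(tfidf.shape)
```

The resulting sparse matrix is the input that a clustering step would consume.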
Application Architecture

Dashboard components: html, css, renderer, views, models

Libraries: scipy, nltk, networkx, re, os, plotly, pandas, numpy, csv, sklearn, matplotlib

Data source: discussion forum data


Findings and Demo
Participation Analysis
Individual Behaviour Analysis: Participation
• Overall statistics
• Average contribution (count/airtime based)
• Comparative position of level of contribution
Overall Participation in Main Threads (Class Topic)

Identify topics with most/less participation.

Instructor can intervene and create more content for less participated class topics
Relative Participation Scoring
$$ z\text{-score} = \frac{x - \mu}{\sigma} $$

Identify students with least or most participation.

Instructor can intervene and aid the students who need help
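As a small worked example of the z-score above, assume x is a student's post count, μ the class mean, and σ the class standard deviation (the slide does not spell these out, so this reading is an assumption). The counts below are hypothetical.

```python
import numpy as np

# Hypothetical post counts per student (not the study's data).
post_counts = {
    "Student 1": 2, "Student 2": 4, "Student 3": 4, "Student 4": 4,
    "Student 5": 5, "Student 6": 5, "Student 7": 7, "Student 8": 9,
}

counts = np.array(list(post_counts.values()), dtype=float)
mu = counts.mean()     # class mean
sigma = counts.std()   # population standard deviation (ddof=0)

# z-score per student: how many standard deviations from the class mean.
z_scores = {s: (c - mu) / sigma for s, c in post_counts.items()}

least_active = min(z_scores, key=z_scores.get)
most_active = max(z_scores, key=z_scores.get)
print(least_active, most_active)
```

Students with strongly negative scores are the candidates for instructor follow-up.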
Topic Discovery and Evolution Analysis
Content Analysis: Topic Discovery and Evolution
• K-Means, Agglomerative Clustering
• Human Gold Truth
• Coherency Evaluation

K-means Clustering – Best Number of Clusters

To choose the number of clusters
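One common way to choose the number of clusters is the elbow method: fit K-means for a range of k and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below assumes scikit-learn and uses hypothetical posts; the range of k is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical posts standing in for the forum data.
posts = [
    "Chatbots answer customer service questions for banks.",
    "Banks deploy chatbots to handle customer queries.",
    "Negation flips the polarity of sentiment in opinions.",
    "Sarcasm makes sentiment analysis of opinions difficult.",
    "Doctors mine patient records for flu symptoms.",
    "Healthcare analytics helps doctors track patient symptoms.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(posts)

# Inertia for each candidate k; the "elbow" is where the curve flattens.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

Plotting k against inertia (for example with matplotlib) gives the usual elbow chart.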


Topic Analysis – Labeling Sub-topics
These are sub-topics discussed by students for given main topics.

Agglomerative cluster (top words in each cluster) -> Human label
['discuss', 'technique/s', 'opinions', 'handle', 'cluster'] -> General techniques
['sentiment', 'negation', 'sarcasm', 'analysis', 'expressions'] -> Sentiment and opinion
['text', 'mining', 'classification', 'government', 'precision'] -> Classification
['evaluation', 'clusters', 'measures', 'clustering', 'various'] -> Clustering models
['chatbots', 'customer', 'service', 'questions', 'banks'] -> Chatbots
['words', 'training', 'dictionary', 'data', 'document'] -> Model training
['doctors', 'symptoms', 'flu', 'doctor', 'analytics'] -> HealthCare
['sequence', 'models', 'applications', 'state', 'model'] -> Application models

Identify topics that students learned through out-of-class research.

Instructor can intervene and provide more info to fill gaps


Qualitative Evaluations: K-means Vs Agglomerative Clustering

We asked two human judges to label the coherence of the top words in each cluster (C = coherent, R = repetitive).

Agglomerative: top words in each cluster -> label
['machine', 'tweet', 'comments', 'replying', 'google'] -> C
['discuss', 'technique/s', 'opinions', 'handle', 'cluster'] -> C
['sentiment', 'negation', 'sarcasm', 'analysis', 'expressions'] -> C
['text', 'mining', 'classification', 'government', 'precision'] -> C
['in-class', 'question', 'thanks', 'students', 'compilation'] -> C
['chatbots', 'customer', 'service', 'questions', 'banks'] -> C
['data', 'plagiarism', 'clinical', 'manually', 'website'] -> C
['doctors', 'symptoms', 'flu', 'doctor', 'analytics'] -> C
['sequence', 'models', 'applications', 'state', 'model'] -> C

K-means: top words in each cluster -> label
['machine', 'training', 'model', 'dataset', 'data'] -> C
['doubt', 'clarified', 'arun', 'comment', 'mislead'] -> R
['customer', 'chatbots', 'service', 'customers', 'chatbot'] -> C
['doubt', 'clarified', 'arun', 'comment', 'mislead'] -> R
['sentiment', 'negation', 'analysis', 'scope', 'polarity'] -> C
['government', 'sites', 'classification', 'banks', 'examples'] -> R
['patient', 'text', 'healthcare', 'doctors', 'analytics'] -> C
['words', 'slide', 'word', 'document', 'general'] -> C
['doctor', 'medical', 'records', 'hospital', 'doctors'] -> C

Agglomerative shows good performance. What about quantitative?
Quantitative Evaluations - Agglomerative

We asked two human judges to rate how coherent each assigned topic is with its comments.

87% coherent (2 & 3)


13% non-coherent

Coherence scoring evaluation - Quantitative


Visualizations - Topics & Sub-topics Analysis

We observe that sub-topics can be part of more than one main topic. For example, topic 6 (“medical and healthcare”) appears under “text classification”, “text mining introduction”, and “clustering”.

Instructor can identify the missing sub-topics and submit posts under the main topic to lead the students in the learning process.
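A topic/sub-topic network of this kind can be represented as a graph whose edges link each main topic to the sub-topics found under it. The sketch below uses networkx; apart from the healthcare example taken from the slide, the edges are hypothetical.

```python
import networkx as nx

# (main topic, sub-topic) pairs. The "medical and healthcare" rows mirror the
# slide's example; the remaining pairs are hypothetical.
edges = [
    ("Text Classification", "medical and healthcare"),
    ("Text Mining Introduction", "medical and healthcare"),
    ("Text Clustering", "medical and healthcare"),
    ("Text pre-processing and NLP", "chatbots"),
    ("Sentiment analysis", "sentiment and opinion"),
]

G = nx.Graph()
for main, sub in edges:
    G.add_node(main, kind="main")
    G.add_node(sub, kind="sub")
    G.add_edge(main, sub)

# Sub-topics connected to more than one main topic.
shared = [n for n, d in G.nodes(data=True)
          if d["kind"] == "sub" and G.degree(n) > 1]
print(shared)
```

The same graph object can be handed to a plotting library to draw the network view.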
Visualizations - Topic Evolution Over Time (Sub-topics)

Identify short-lived vs repeated topics.

Instructor can intervene to provide feedback to students or fill the gaps of missed connected topics.
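One simple way to separate short-lived from repeated sub-topics is to count the number of weeks in which each sub-topic appears. The sketch below uses pandas with hypothetical (week, sub-topic) assignments; treating "appears in exactly one week" as short-lived is an assumption, not a definition from the slides.

```python
import pandas as pd

# Hypothetical (week, sub_topic) assignments, one row per clustered post.
records = [
    (1, "healthcare"), (1, "chatbots"), (2, "chatbots"),
    (4, "healthcare"), (5, "healthcare"),
    (9, "sentiment"), (9, "sentiment"),
]
df = pd.DataFrame(records, columns=["week", "sub_topic"])

# Rows: weeks, columns: sub-topics, cells: number of posts.
evolution = pd.crosstab(df["week"], df["sub_topic"])

# Number of distinct weeks in which each sub-topic was discussed.
weeks_active = (evolution > 0).sum(axis=0)
short_lived = weeks_active[weeks_active == 1].index.tolist()
repeated = weeks_active[weeks_active > 1].index.tolist()
print(short_lived, repeated)
```

The `evolution` table is also the natural input for a line or heat-map chart of topics over time.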
Demo
Findings of Research

RQ1 - For the final solution implementation, we provide both clustering models. Agglomerative performs better than K-means.

RQ2 - Network and interactive graphs are suitable for user-friendly insights into topics and their evolution.

Exploratory analytics with visuals, such as participation analysis, give a full overview and can aid quick intervention to improve the learning process.
Limitations of our Work

1. Clustering methods are unsupervised, and selecting the best number of clusters is not easy. Hard clustering is another limitation.
• We proposed the elbow method, which can also be ineffective for some types of datasets.
• Topic models enable soft clustering. We propose to add topic modelling as another choice for users.
2. We tested our work on a curated discussion forum with a controlled set of questions. Some discussion forums tend to be informal, and our solution should be tested on such data.
• In our ongoing research, we are testing on informal discussions using social media.
The work is also limited without other analytics such as content, interaction and behaviour analysis.
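To illustrate what soft clustering means here, the sketch below fits a small LDA topic model with scikit-learn, so each post receives a weight for every topic instead of a single cluster label. LDA is one possible topic model; the slides do not name a specific one, and the posts and topic count are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical posts standing in for the forum data.
posts = [
    "Chatbots answer customer service questions for banks.",
    "Banks deploy chatbots to handle customer queries.",
    "Negation flips the polarity of sentiment in opinions.",
    "Sarcasm makes sentiment analysis of opinions difficult.",
    "Doctors mine patient records for flu symptoms.",
    "Healthcare analytics helps doctors track patient symptoms.",
]

# LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(posts)
lda = LatentDirichletAllocation(n_components=3, random_state=0)

# Each row is a post's distribution over topics (rows sum to 1).
doc_topic = lda.fit_transform(counts)
for post, weights in zip(posts, doc_topic):
    print(np.round(weights, 2), post)
```

A post that touches two sub-topics can then carry weight on both, which hard clustering cannot express.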
Future Work

Content Analysis: Content Type Analysis
• Naïve Bayes Model
• Human Gold Truth
• Accuracy Evaluations

Content Analysis: Content Summarization
• Clustering Model for Topics
• Summarization models
• Evaluations
Ongoing Research

• Topic-based Summarizations
• Content Type Analytics


Thank you!

swapnag@smu.edu.sg

http://www.mysmu.edu/faculty/swapnag/publications.html

Impact of number of postings on the grades
Quantitative Evaluations

Cluster coherence

Metric                   kMeans  Agglo
Silhouette score         0.03    0.02
Calinski-Harabasz score  2.06    1.92

Coherence scoring evaluation - Quantitative
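Both metrics in the table are available in scikit-learn. The sketch below shows how such a comparison could be computed for the two models; the posts and the cluster count are hypothetical, so the numbers it prints are unrelated to the values reported above.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Hypothetical posts standing in for the forum data.
posts = [
    "Chatbots answer customer service questions for banks.",
    "Banks deploy chatbots to handle customer queries.",
    "Negation flips the polarity of sentiment in opinions.",
    "Sarcasm makes sentiment analysis of opinions difficult.",
    "Doctors mine patient records for flu symptoms.",
    "Healthcare analytics helps doctors track patient symptoms.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(posts).toarray()

models = {
    "kMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglo": AgglomerativeClustering(n_clusters=3),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = {
        # Silhouette lies in [-1, 1]; higher means better-separated clusters.
        "silhouette": silhouette_score(X, labels),
        # Calinski-Harabasz: ratio of between- to within-cluster dispersion.
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }

for name, result in scores.items():
    print(name, {k: round(float(v), 3) for k, v in result.items()})
```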
