
One Web Forum

Powered Outer Probabilistic Clustering

Peter Taraba
July 10th, 2017
One Web Forum
Humans tend to make a mess once in a while, and every now and then they realize
it and try to clean up. It is no different with the World Wide Web, which contains a
lot of incorrect information, fraudulent websites, etc. The project does not
concentrate only on pointing out these negatives, but should also support
the positive parts of the World Wide Web. I will speak about two experiments:
Facebook Web Forum and One Web Forum.
One Web Forum
(clean up of www)

Sweden has run out of garbage
and the Scandinavian country
has been forced to import
rubbish from other countries to
keep its state-of-the-art
recycling plants going.

Sweden's recycling system is
so sophisticated that less
than 1 per cent of its household
waste was sent to landfill
last year.

One Web Forum
● Negatives:
● Fraudulent web sites

● Fake websites with incorrect information

● “Too many” accounts

● Newspaper logins for comments

● “Too heated” discussions

● Fake news

● Positives:
● Discuss news (“less heated” way)

● Give feedback on companies' products

● Support organizations by spreading info about them on social networks

● Provide more information on websites with insufficient information
Facebook Web Forum (July 2009)
● Sidebar in web browser
● Every website has its own
“facebook wall”
● One login only (FB)
● Extensions for:
● Microsoft's IE (2009)

● Firefox (2009)

● Google's Chrome (2009)

● Visitors of the same website
can talk about the same topic
● Facebook connects you with
your friends
● FWF connects you with
people who visit the same
websites ~ interests
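The idea that every website has its own wall can be sketched as a mapping from a page URL to a per-site key. This is only an illustration: the function name `wall_key` and the choice to merge `www.` hosts with the bare domain are assumptions, not details taken from the extension itself.

```python
from urllib.parse import urlsplit


def wall_key(url: str) -> str:
    """Return a key shared by all pages of the same website (hypothetical helper)."""
    # urlsplit(...).hostname is lower-cased and excludes the port;
    # it is None when the URL has no network location.
    host = urlsplit(url).hostname or ""
    # Assumption: "www.example.com" and "example.com" share one wall.
    if host.startswith("www."):
        host = host[4:]
    return host


# Two visitors on different pages of the same site land on the same wall.
same_wall = wall_key("https://www.example.com/news/1") == wall_key("http://example.com/about")
```

Keying on the host rather than the full URL keeps one conversation per website; keying on the full path would instead give every page its own wall.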
Facebook Web Forum (July 2009)
Install the add-on for your browser

● July – August 2009
● Spreading of installs Comments
● $1,000 budget for Facebook ads
● In two months ~ tens of
thousands of installs for Firefox
(linear growth)
● Microsoft was not providing stats
● Google did not have a store yet
Facebook Web Forum (July 2009)

Facebook runs its own Comments Plugin for websites
● Not every website has the Facebook plugin

● Once again “too many accounts / technologies” for different websites

● Website owners can control which comments stay on their website (comment moderation)

● “Communistic” approach (in democratic times)
One Web Forum (September 2015)
● Lessons:
– First I tried an anonymous approach
● Installs, but the majority of comments were spam abusing the
application's APIs
– So I used Google's “I'm not a robot” captcha
● Small number of installs, not spreading enough
● Small number of comments / users
– I could again use Facebook to start spreading the
● Low quality comments (are we really cleaning up www?)
● LinkedIn could provide higher quality comments
Microsoft to acquire LinkedIn
June 13, 2016, Microsoft News Center
Who is in the best position to help clean up the www?

OS        Web      Social    Minuses                  Pluses
Windows   IE       LinkedIn  % IE                     LinkedIn
Android   Chrome   Google+   “Ghost town”             % Chrome
Mac OS    Safari   iTunes    % Safari                 ?
N/A       N/A      Facebook  Dependence + fake news   2 billion users
Linux?    Firefox  N/A       % Firefox                ?
One Web Forum

Powered Outer Probabilistic Clustering

There are many clustering algorithms in unsupervised machine learning. The most popular
one, k-means, cannot detect the correct number of clusters, so other methods are used
to estimate it. We introduce a novel algorithm that not only finds the correct number of
clusters but also finds higher-quality clusters than k-means on datasets with binary
features. We explain the problem on an email dataset and show a solution: the software
Small Bang Email Client, which improves the day-to-day life of information workers.

[Figure: binary feature matrix, samples k1 ... kN (rows) by features i1 ... iF (columns); a 1 marks a feature present in a sample; rows are grouped into Cluster #1 and Cluster #2]
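The sample-by-feature layout above can be built in a few lines. This is a minimal sketch: the slides do not say which email attributes serve as features, so the addresses and subject words used here, and the helper name `binary_matrix`, are assumptions for illustration only.

```python
import numpy as np


def binary_matrix(samples):
    """Turn a list of feature sets into a 0/1 matrix (rows = samples, columns = features)."""
    # Fix a column order so the matrix is reproducible.
    features = sorted({f for sample in samples for f in sample})
    column = {f: j for j, f in enumerate(features)}
    X = np.zeros((len(samples), len(features)), dtype=int)
    for i, sample in enumerate(samples):
        for f in sample:
            X[i, column[f]] = 1
    return X, features


# Hypothetical emails, each reduced to the set of features it contains.
emails = [
    {"alice@example.com", "budget"},
    {"alice@example.com", "budget", "q3"},
    {"bob@example.com", "hiking"},
]
X, features = binary_matrix(emails)
```

Each row of `X` corresponds to one sample k and each column to one feature i, matching the figure.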
[Figure: binary feature matrix, samples (emails) k1 ... kN by features i1 ... iF, split into Cluster #1 and Cluster #2]

K-MEANS
● If a cluster has only one member,
there is no need to move it

POPC
● Might destroy a cluster
if moving a single member
improves the optimization
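The k-means remark can be checked on a toy matrix. The sketch below runs one iteration of the standard Lloyd procedure, written by hand rather than taken from the talk. A cluster with a single member has its centroid exactly on that member, so the member's distance to its own centroid is zero and it stays put (unless another centroid happens to coincide with it). The data and the helper name `lloyd_step` are assumptions.

```python
import numpy as np


def lloyd_step(X, labels, k):
    """One Lloyd iteration: recompute centroids, then assign each row to the nearest one.

    Assumes every cluster 0..k-1 has at least one member.
    """
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Squared Euclidean distance from every row to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


X = np.array(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 1, 1, 1]],
    dtype=float,
)
labels = np.array([0, 0, 0, 1])  # the last row is alone in cluster 1
new_labels = lloyd_step(X, labels, k=2)
singleton_stayed = new_labels[3] == labels[3]
```

Other rows may still join the singleton's cluster, but k-means never dissolves that cluster by moving its only member away, which is the behaviour the slide contrasts with POPC.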

1 feature

7 “clean” clusters

Easy to see “break point” -
correct number of clusters


If we start with enough clusters,
the algorithm converges to the correct
number of clusters
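The behaviour of starting with many clusters and letting some of them empty out can be illustrated with a greedy single-member move procedure. The slides do not state POPC's objective or update rule, so everything below is an assumption made for illustration: the objective (smoothed per-feature cluster shares raised to a power and summed), the default `power=10`, the tolerance, and the names `objective` and `greedy_moves`. It should not be read as the author's algorithm.

```python
import numpy as np


def objective(X, labels, k, power):
    """Assumed score: sum over clusters and features of (smoothed share) ** power."""
    counts = np.zeros((k, X.shape[1]))
    np.add.at(counts, labels, X)  # counts[c, i] = samples in cluster c having feature i
    share = (counts + 1.0) / (X.sum(axis=0) + k)
    return (share ** power).sum()


def greedy_moves(X, init_labels, k, power=10, max_sweeps=100, tol=1e-12):
    """Move one sample at a time to another cluster whenever the score strictly improves."""
    labels = np.array(init_labels)
    best = objective(X, labels, k, power)
    for _ in range(max_sweeps):
        moved = False
        for n in range(len(X)):
            for c in range(k):
                if c == labels[n]:
                    continue
                old = labels[n]
                labels[n] = c
                score = objective(X, labels, k, power)
                if score > best + tol:
                    best, moved = score, True  # keep the move
                else:
                    labels[n] = old  # undo the move
        if not moved:
            break
    return labels


# Toy data: rows 0-2 share features 0-1, rows 3-5 share features 2-3.
X = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)
start = np.arange(6)  # deliberately too many clusters: one per sample
final = greedy_moves(X, start, k=6)
n_used = len(set(final.tolist()))
```

Because a move is accepted even when it empties a cluster, the number of non-empty clusters can shrink during the run. On this toy input the rows that share features end up grouped together, although a greedy search of this kind gives no guarantee in general.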
7 not so “clean” clusters
Completely confused even if
we tell it to find 7 clusters

7 clusters with a majority
of noisy features

Able to find the correct number
of clusters and the expected clusters,
ignoring noisy features
Real example (emails):
- k-means - # clusters?
- POPC – if we start with enough clusters,
the algorithm settles on 27 clusters
● Cluster sorting
– Unread emails in cluster?
– Sent emails in cluster?
– Recent conversation in cluster?
● Observations:
– Clusters change over time
● Possible extensions:
– Use data from calendar, SharePoint, etc.
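The cluster-sorting questions listed above (unread, sent, recent) can be combined into a single sort key. This is a sketch under assumptions: the slides do not give the priority between the three criteria, and the `Email` fields and the function `sort_clusters` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Email:
    subject: str
    received: datetime
    unread: bool = False
    sent_by_me: bool = False


def sort_clusters(clusters):
    """Order non-empty clusters: most unread first, then clusters I wrote in, then most recent."""
    def key(cluster):
        return (
            sum(e.unread for e in cluster),      # number of unread emails
            any(e.sent_by_me for e in cluster),  # contains an email I sent
            max(e.received for e in cluster),    # latest activity
        )
    return sorted(clusters, key=key, reverse=True)


old_read = [Email("minutes", datetime(2017, 6, 1))]
recent_sent = [Email("re: demo", datetime(2017, 7, 9), sent_by_me=True)]
has_unread = [Email("invoice", datetime(2017, 5, 1), unread=True)]
ordered = sort_clusters([old_read, recent_sent, has_unread])
```

Since clusters change over time, a key like this would need to be recomputed whenever the clustering is refreshed.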
POPC – Small Bang
Demo & More Questions?