
Wikipedia Vandalism Detection
using Hadoop MapReduce

Group F:
Manoj Harpalani
Kalpit Sarda

Agenda:
 Hardware Setup
 Remote Status & Health Monitoring
 Energy Consumption
 Performance Stats
 Wikipedia Vandalism Detection
Buckled up to get started with the complete installation of
the cloud infrastructure.
 Enabled IPMI on the cloud servers so they can be
powered up or down from a remote client.
 Created scripts to start up and shut down the
servers and fetch their health stats through the
Intelligent Platform Management Interface
(IPMI).
 Installed an IPMI client to access and control the
servers remotely over the LAN.
 Scripts:
 Readme.txt
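The startup/shutdown scripts above could be sketched as a thin wrapper around the standard `ipmitool` CLI. This is a minimal sketch, not the project's actual scripts; the host and credential values are placeholders you would substitute with your own BMC settings.

```python
import subprocess

def build_ipmi_cmd(host, user, password, *args):
    """Assemble an ipmitool invocation targeting a server's BMC over LAN."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *args]

def ipmi(host, user, password, *args):
    """Run the command and return its output (requires ipmitool installed)."""
    cmd = build_ipmi_cmd(host, user, password, *args)
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Typical operations a startup/shutdown/health script would issue:
#   ipmi(host, user, pw, "chassis", "power", "on")    # power up
#   ipmi(host, user, pw, "chassis", "power", "soft")  # graceful shutdown
#   ipmi(host, user, pw, "sdr", "list")               # temps, fans, voltages
```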
 Tapping each element of the cloud rack, we
took the measurements below:

 Detailed Measurements:
 Once everybody is done with their project
demo, we will shut down the servers and
run "killer" jobs to load-test the cloud servers.
 Then we will measure the power
consumption of each server again, this
time using a digital ammeter.
Power Usage Effectiveness:

PUE = Total Facility Power / IT Equipment Power

Data Center Infrastructure Efficiency:

DCiE = IT Equipment Power / Total Facility Power

 Total facility power for room 1312 = 6.5 kW (54.42 A x 120 V / 1000)
 Total power consumed by the cloud servers only = 1.71 kW
 Considering the other machines in the server room, roughly 3 kW of
facility power is attributable to the cloud servers:
PUE = 3 / 1.71 ≈ 1.75
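The arithmetic above can be worked through directly from the slide's figures. The 3 kW facility-power attribution is the slide's own estimate; everything else follows from the PUE and DCiE definitions.

```python
# Figures from the slide: 54.42 A draw at 120 V for room 1312,
# 1.71 kW consumed by the cloud servers, and roughly 3 kW of
# facility power attributed to them after excluding other machines.
amps, volts = 54.42, 120
total_room_kw = amps * volts / 1000           # ≈ 6.53 kW for the whole room
it_kw = 1.71                                  # cloud servers only
attributed_facility_kw = 3.0                  # slide's estimate

pue = attributed_facility_kw / it_kw          # PUE = facility / IT ≈ 1.75
dcie = it_kw / attributed_facility_kw         # DCiE = IT / facility ≈ 0.57
```

Note that DCiE is simply the reciprocal of PUE, so the two metrics carry the same information expressed as a ratio versus a percentage.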
Name       RAM     HDD
m1.large   512MB   10GB
c1.large   1GB     20GB
c1.xlarge  2GB     20GB


[Charts: "Instance Loading" for m1.large, c1.large, and c1.xlarge, comparing a
single instance against multiple concurrent instances; only the axis ticks and
labels survived extraction.]

Vandalism:
Defined as a deliberate attempt
to compromise the integrity of
articles in Wikipedia.

Goal:
Separate ill-intentioned edits
from well-intentioned ones.

Plagiarism, Authorship, and Social
Software Misuse (PAN) Workshop
2010
Training Phase:
Given a list of 15,000 edits along with their old-revision and new-revision
articles, marked as regular or vandalism by human annotators.

1. Find the file corresponding to a revision in the data dump.
2. Process and clean the given wiki-text article into plain text.
3. Extract category and outbound links from the article.
4. Compute the diff between the old and the new revision and segregate the
changes into deletes and inserts.
5. Extract all related articles from the Wikipedia dump, which is a single large
XML file of 26 GB.
6. Generate indexes for the articles in the dump file for future use.
7. Use algorithms such as Naïve Bayes, LDA, and sentiment analysis to train on the
extracted data, then classify each edit as vandalism or regular.

Testing Phase:
Given 100,000 articles, use the trained algorithm and data to classify edits as
vandalism or regular.
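The classification step (train on labeled edits, then label unseen ones) could look like the following minimal word-count Naïve Bayes with add-one smoothing. It is a simplified stand-in for the trained models the project actually used; the class labels 'van' and 'reg' match the result listings later in the deck.

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial Naive Bayes over bag-of-words features."""

    def __init__(self):
        self.word_counts = {"van": Counter(), "reg": Counter()}
        self.doc_counts = {"van": 0, "reg": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def classify(self, words):
        total_docs = sum(self.doc_counts.values())
        vocab = set(self.word_counts["van"]) | set(self.word_counts["reg"])
        best, best_score = None, float("-inf")
        for label in ("van", "reg"):
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1)
                                  / (n + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```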
[Diagram: each map job (Map 1 .. Map n) processes one record from the list of
edits (editId, OldRev, NewRev). Each mapper searches for the revision in the
storage dump folder via the vandalism cache, gets the wiki text for both old
and new revisions, runs Wiki2PlainText and writes plain-text files to storage,
computes the revision diff, writes deletes and insert changes to separate
files, and extracts category and outbound links into the storage DB. A reduce
step collects the logs.]
Map 1

Map 1
Wiki
Dump Map 1
26 GB Stream Map 1
Xml
Hadoop
Streaming Record Map 5
Reader

Mapper

WikiDumpReader

Map n
Index Entry (Article Title, Position)
XML File Reader Storage
DB
XML extract

Start Check
Related
Extract Category & Article,
WikiDumpReader Outbound Links Skip
unrelated
ones

Storage Wiki2PlainText
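Step 6's index entries of (article title, position) allow a later pass to seek straight to an article without rescanning the whole 26 GB file. A line-oriented sketch of that one-pass index build, under the assumption that `<page>` and `<title>` tags start on their own lines as they do in Wikipedia dumps, could look like this (function names are illustrative):

```python
def build_index(dump_path, index_path):
    """Scan the dump once, recording the byte offset of each <page>
    element alongside its <title>, as tab-separated index entries."""
    with open(dump_path, "rb") as dump, \
         open(index_path, "w", encoding="utf-8") as out:
        offset = dump.tell()
        page_start = None
        for line in iter(dump.readline, b""):
            text = line.decode("utf-8", errors="replace")
            if "<page>" in text:
                page_start = offset          # remember where the page began
            elif "<title>" in text and page_start is not None:
                title = text.split("<title>")[1].split("</title>")[0]
                out.write(f"{title}\t{page_start}\n")
            offset = dump.tell()
```

Looking up an article is then a dictionary hit on the index followed by a `seek()` to the recorded byte offset.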
Total instance list count: 15000
Training:test split = 70:30

Training:
Accuracy: 0.95
Precision for class 'van': 0.66
F1 for class 'van': 0.47

Test:
Accuracy: 0.89
F1 for class 'van': 0.05
Precision for class 'van': 0.06
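For reference, the precision and F1 figures above follow the standard definitions computed against the positive class 'van'; a sketch of that computation (not the project's evaluation script) is:

```python
def precision_recall_f1(predictions, gold, positive="van"):
    """Precision, recall, and F1 for one positive class from paired labels."""
    tp = sum(1 for p, g in zip(predictions, gold) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(predictions, gold) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(predictions, gold) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A test-set F1 (0.05) close to its precision (0.06) implies recall of the same order, i.e. the classifier finds very few of the actual vandalism edits on unseen data despite the high training scores, which points to overfitting.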
1. reg/327726655.txt classified as reg (*)
2. reg/326826476.txt classified as van (*)
3. reg/329264380.txt classified as reg
4. van/328213101.txt classified as reg (*)
5. reg/327231041.txt classified as reg
6. reg/327850416.txt classified as reg
7. reg/327185531.txt classified as reg (*)
8. reg/327378367.txt classified as reg
9. reg/328843261.txt classified as reg
10. reg/328410606.txt classified as reg (*)
 Following are some great learnings from the
course:
1. Cloud Buzzzz!
2. What it takes to make a cloud from scratch
3. Security issues & practical concerns when
moving to a third-party cloud
4. How Eucalyptus works
5. Google MapReduce
6. Apache Hadoop
7. Amazon EC2
8. Teamwork & coordination
Thank You

Special thanks to Brian (Systems Staff) for helping us with
measuring power.
