
Wikipedia Vandalism Detection
using Hadoop MapReduce

Group F:
Manoj Harpalani
Kalpit Sarda

Agenda:
 Hardware Setup
 Remote Status & Health Monitoring
 Energy Consumption
 Performance Stats
 Wikipedia Vandalism Detection
Buckled up to get started with the complete installation of
the cloud infrastructure.
 Enabled IPMI on the cloud servers so they can be
powered up or down from a remote client.
 Created scripts to start up and shut down the
servers and fetch their health stats through the
Intelligent Platform Management Interface
(IPMI).
 Installed an IPMI client to access and control the
servers remotely over the LAN.
 Scripts:
 Readme.txt
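The startup/shutdown scripts above could be sketched as a thin wrapper around the standard `ipmitool` CLI. This is a minimal sketch, not the project's actual scripts; the host and credential values are placeholders you would substitute with your own BMC settings.

```python
import subprocess

def build_ipmi_cmd(host, user, password, *args):
    """Assemble an ipmitool invocation targeting a server's BMC over LAN."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *args]

def ipmi(host, user, password, *args):
    """Run the command and return its output (requires ipmitool installed)."""
    cmd = build_ipmi_cmd(host, user, password, *args)
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Typical operations a startup/shutdown/health script would issue:
#   ipmi(host, user, pw, "chassis", "power", "on")    # power up
#   ipmi(host, user, pw, "chassis", "power", "soft")  # graceful shutdown
#   ipmi(host, user, pw, "sdr", "list")               # temps, fans, voltages
```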
 Tapping each element of the cloud rack, we
took the measurements below:

 Detailed Measurements:
 Once everybody is done with their project
demo, we will shut down the servers and
run "killer" jobs to load-test the cloud servers.
 Then we will measure the power
consumption of each server again, this
time using a digital ammeter.
Power Usage Effectiveness:

PUE = Total Facility Power / IT Equipment Power

Data Center Infrastructure Efficiency:

DCiE = IT Equipment Power / Total Facility Power

 Total facility power for room 1312 = 6.5 kW (54.42 A x 120 V / 1000)
 Total power consumed by the cloud servers only = 1.71 kW
 Considering the other machines in the server room, roughly 3 kW of
facility power is attributable to the cloud servers:
PUE = 3 / 1.71 ≈ 1.75
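The arithmetic above can be worked through directly from the slide's figures. The 3 kW facility-power attribution is the slide's own estimate; everything else follows from the PUE and DCiE definitions.

```python
# Figures from the slide: 54.42 A draw at 120 V for room 1312,
# 1.71 kW consumed by the cloud servers, and roughly 3 kW of
# facility power attributed to them after excluding other machines.
amps, volts = 54.42, 120
total_room_kw = amps * volts / 1000           # ≈ 6.53 kW for the whole room
it_kw = 1.71                                  # cloud servers only
attributed_facility_kw = 3.0                  # slide's estimate

pue = attributed_facility_kw / it_kw          # PUE = facility / IT ≈ 1.75
dcie = it_kw / attributed_facility_kw         # DCiE = IT / facility ≈ 0.57
```

Note that DCiE is simply the reciprocal of PUE, so the two metrics carry the same information expressed as a ratio versus a percentage.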
Name       RAM     HDD
m1.large   512MB   10GB
c1.large   1GB     20GB
c1.xlarge  2GB     20GB


[Charts: "Instance Loading" for m1.large, c1.large, and c1.xlarge, comparing a
single instance against multiple concurrent instances; only the axis ticks and
labels survived extraction.]

Vandalism:
Defined as a deliberate attempt
to compromise the integrity of
articles in Wikipedia.

Goal:
Separate ill-intentioned edits
from well-intentioned ones.

Plagiarism, Authorship, and Social
Software Misuse (PAN) Workshop
2010
Training Phase:
Given a list of 15,000 edits along with their old-revision and new-revision
articles, marked as regular or vandalism by human annotators.

1. Find the file corresponding to a revision in the data dump.
2. Process and clean the given wiki-text article into plain text.
3. Extract category and outbound links from the article.
4. Compute the diff between the old and the new revision and segregate the
changes into deletes and inserts.
5. Extract all related articles from the Wikipedia dump, which is a single large
XML file of 26 GB.
6. Generate indexes for the articles in the dump file for future use.
7. Use algorithms such as Naïve Bayes, LDA, and sentiment analysis to train on the
extracted data, then classify each edit as vandalism or regular.

Testing Phase:
Given 100,000 articles, use the trained algorithm and data to classify edits as
vandalism or regular.
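The classification step (train on labeled edits, then label unseen ones) could look like the following minimal word-count Naïve Bayes with add-one smoothing. It is a simplified stand-in for the trained models the project actually used; the class labels 'van' and 'reg' match the result listings later in the deck.

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial Naive Bayes over bag-of-words features."""

    def __init__(self):
        self.word_counts = {"van": Counter(), "reg": Counter()}
        self.doc_counts = {"van": 0, "reg": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def classify(self, words):
        total_docs = sum(self.doc_counts.values())
        vocab = set(self.word_counts["van"]) | set(self.word_counts["reg"])
        best, best_score = None, float("-inf")
        for label in ("van", "reg"):
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1)
                                  / (n + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```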
[Diagram: each map job (Map 1 .. Map n) processes one record from the list of
edits (editId, OldRev, NewRev). Each mapper searches for the revision in the
storage dump folder via the vandalism cache, gets the wiki text for both old
and new revisions, runs Wiki2PlainText and writes plain-text files to storage,
computes the revision diff, writes deletes and insert changes to separate
files, and extracts category and outbound links into the storage DB. A reduce
step collects the logs.]
Map 1

Map 1
Wiki
Dump Map 1
26 GB Stream Map 1
Xml
Hadoop
Streaming Record Map 5
Reader

Mapper

WikiDumpReader

Map n
Index Entry (Article Title, Position)
XML File Reader Storage
DB
XML extract

Start Check
Related
Extract Category & Article,
WikiDumpReader Outbound Links Skip
unrelated
ones

Storage Wiki2PlainText
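Step 6's index entries of (article title, position) allow a later pass to seek straight to an article without rescanning the whole 26 GB file. A line-oriented sketch of that one-pass index build, under the assumption that `<page>` and `<title>` tags start on their own lines as they do in Wikipedia dumps, could look like this (function names are illustrative):

```python
def build_index(dump_path, index_path):
    """Scan the dump once, recording the byte offset of each <page>
    element alongside its <title>, as tab-separated index entries."""
    with open(dump_path, "rb") as dump, \
         open(index_path, "w", encoding="utf-8") as out:
        offset = dump.tell()
        page_start = None
        for line in iter(dump.readline, b""):
            text = line.decode("utf-8", errors="replace")
            if "<page>" in text:
                page_start = offset          # remember where the page began
            elif "<title>" in text and page_start is not None:
                title = text.split("<title>")[1].split("</title>")[0]
                out.write(f"{title}\t{page_start}\n")
            offset = dump.tell()
```

Looking up an article is then a dictionary hit on the index followed by a `seek()` to the recorded byte offset.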
Total instance list count: 15000
Training:test split = 70:30

Training:
Accuracy: 0.95
Precision for class 'van': 0.66
F1 for class 'van': 0.47

Test:
Accuracy: 0.89
F1 for class 'van': 0.05
Precision for class 'van': 0.06
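For reference, the precision and F1 figures above follow the standard definitions computed against the positive class 'van'; a sketch of that computation (not the project's evaluation script) is:

```python
def precision_recall_f1(predictions, gold, positive="van"):
    """Precision, recall, and F1 for one positive class from paired labels."""
    tp = sum(1 for p, g in zip(predictions, gold) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(predictions, gold) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(predictions, gold) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A test-set F1 (0.05) close to its precision (0.06) implies recall of the same order, i.e. the classifier finds very few of the actual vandalism edits on unseen data despite the high training scores, which points to overfitting.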
1. reg/327726655.txt classified as reg (*)
2. reg/326826476.txt classified as van (*)
3. reg/329264380.txt classified as reg
4. van/328213101.txt classified as reg (*)
5. reg/327231041.txt classified as reg
6. reg/327850416.txt classified as reg
7. reg/327185531.txt classified as reg (*)
8. reg/327378367.txt classified as reg
9. reg/328843261.txt classified as reg
10. reg/328410606.txt classified as reg (*)
 Following are some great learnings from the
course:
1. Cloud Buzzzz!
2. What it takes to make a cloud from scratch
3. Security issues & practical concerns when
moving to a third-party cloud
4. How Eucalyptus works
5. Google MapReduce
6. Apache Hadoop
7. Amazon EC2
8. Teamwork & coordination
Thank You

Special thanks to Brian (Systems Staff) for helping us with
measuring power.
