Manoj Harpalani
Kalpit Sarda
Hardware Setup
Energy Consumption
Performance Stats
Readme.txt
Tapping each element of the cloud rack, we took the measurements below:
Detailed Measurements:
Once everybody is done with their project demo, we will shut down the servers and run killer jobs to load-test the cloud servers. We will then measure the power consumption of each server again, this time using a digital ammeter.
Power Usage Effectiveness:
Total facility power for Room 1312 = 6.5 kW (54.42 A x 120 V / 1000)
Total power consumed by the cloud servers only = 1.71 kW
Attributing roughly 3 kW of the facility power to the cloud rack (the rest goes to the other machines in the server room):
PUE = 3 / 1.71 = 1.75 approx.
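The arithmetic above can be reproduced in a short script. The variable names are ours; the 3 kW numerator is the estimated share of facility power attributed to the cloud rack, as stated above:

```python
# PUE (Power Usage Effectiveness) = facility power / IT equipment power.
# Figures come from the measurements above: the room total is computed
# from the panel reading, and 3 kW is the estimated facility share
# attributable to the cloud rack alone.
def pue(facility_kw: float, it_kw: float) -> float:
    return facility_kw / it_kw

room_total_kw = 54.42 * 120 / 1000   # 54.42 A at 120 V -> ~6.53 kW
cloud_it_kw = 1.71                   # measured draw of the cloud servers
cloud_share_kw = 3.0                 # facility power attributed to the rack

print(f"Room 1312 total: {room_total_kw:.2f} kW")
print(f"PUE (cloud rack): {pue(cloud_share_kw, cloud_it_kw):.2f}")
```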
[Table: instance specifications (Name, RAM, HDD) for the 4 instances]
Goal:
Separate ill-intentioned edits from well-intentioned ones.
Plagiarism, Authorship, and Social Software Misuse (PAN) Workshop 2010
Training Phase:
Given a list of 15,000 edits, each with its old-revision and new-revision article, marked as regular or vandalism by human annotators.
1. Find the file corresponding to a revision in the data dump.
2. Process and clean the given wiki-text article into plain text.
3. Extract category & outbound links from the article.
4. Compute the diff between the old and the new revision and segregate it into deletes, inserts, and changes.
5. Extract all related articles from the Wikipedia dump, which is a single large XML file of 26 GB.
6. Generate indexes for articles in the dump file for future use.
7. Use algorithms like Naïve Bayes, LDA, and sentiment analysis to train on the extracted data, then classify each edit as vandalism or regular.
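Step 4, segregating the revision diff into deletes and inserts, could be sketched with Python's difflib. The helper name and the word-level granularity are our assumptions, not the project's actual code:

```python
import difflib

def segregate_diff(old_text: str, new_text: str):
    """Split a revision diff into deleted and inserted word runs."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    deletes, inserts = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode contributes to both sides of the split.
        if op in ("delete", "replace"):
            deletes.append(" ".join(old_words[i1:i2]))
        if op in ("insert", "replace"):
            inserts.append(" ".join(new_words[j1:j2]))
    return deletes, inserts

old = "The cat sat on the mat"
new = "The cat sat on the mat and LOL LOL LOL"
dels, ins = segregate_diff(old, new)
print(dels, ins)   # -> [] ['and LOL LOL LOL']
```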
Testing Phase:
Given 100,000 articles, use the trained algorithm and data to classify edits as
vandalism or regular.
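The train-then-classify loop of step 7 could be illustrated with a tiny bag-of-words Naïve Bayes with add-one smoothing. This is a minimal sketch, not the project's classifier; the two training strings are invented for illustration:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal bag-of-words Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(doc.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.counts[c][w] + 1) / total)
                for w in doc.lower().split())
        return max(self.classes, key=log_posterior)

nb = NaiveBayes()
nb.fit(["nice detailed edit with citations", "LOL LOL stupid spam spam"],
       ["reg", "van"])
print(nb.predict("spam LOL"))   # prints "van"
```

In the real pipeline the documents would be the inserted/deleted text from each diff rather than whole strings like these.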
[Diagram: each Map job processes one record from the list of edits (editId, old revision & new revision); Map 1 ... Map n run in parallel, Reduce logs the results, and plaintext files are written to storage.]
[Diagram: Wiki2PlainText flow — search for the revision in the storage/dump folder and the vandalism cache, then get the wiki text for both the old and the new revision.]
[Diagram: WikiDumpReader — the 26 GB Wiki dump (a single XML file) is streamed via Hadoop Streaming; a record reader splits it into article records fed to mappers Map 1 ... Map n, with output written to storage.]
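Under Hadoop Streaming, a mapper is just a program that reads records from stdin and emits tab-separated key/value pairs on stdout. A minimal sketch, where the record layout (editId, oldRev, newRev separated by tabs) is an assumption:

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming mapper: one record per input line, emits
# "key\tvalue" pairs on stdout. The three-field record layout
# (editId \t oldRev \t newRev) is assumed for illustration.
import sys

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue                      # skip malformed records
        edit_id, old_rev, new_rev = parts
        stdout.write(f"{edit_id}\t{old_rev},{new_rev}\n")

if __name__ == "__main__":
    main()
```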
[Diagram: indexing and filtering — the XML file reader emits index entries (article title, position) into a storage DB; extracted articles are checked for relatedness via category & outbound links, unrelated ones are skipped, and the rest go through Wiki2PlainText to storage.]
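The (article title, position) index entries suggest a simple random-access scheme over the 26 GB dump: record each article's offset once, then seek directly instead of rescanning the file. A toy sketch over an ordinary line-oriented text file, not the real XML schema:

```python
import io

def build_index(f):
    """Map the first word of each line (a stand-in for an article
    title) to the offset where that line starts."""
    index, pos = {}, 0
    for line in f:
        words = line.split()
        if words:
            index[words[0]] = pos
        pos += len(line)   # character offsets; use byte offsets for real files
    return index

def fetch(f, index, title):
    """Jump straight to a title instead of rescanning the whole dump."""
    f.seek(index[title])
    return f.readline()

dump = io.StringIO("Alpha first article\nBeta second article\n")
idx = build_index(dump)
print(fetch(dump, idx, "Beta").strip())   # -> Beta second article
```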
Total instance list count: 15,000
Training:Test split: 70:30
Training:
Accuracy: 0.95
Precision for class 'van': 0.66
F1 for class 'van': 0.47
Test:
Accuracy: 0.89
Precision for class 'van': 0.06
F1 for class 'van': 0.05
1. reg/327726655.txt classified as reg (*)
2. reg/326826476.txt classified as van (*)
3. reg/329264380.txt classified as reg
4. van/328213101.txt classified as reg (*)
5. reg/327231041.txt classified as reg
6. reg/327850416.txt classified as reg
7. reg/327185531.txt classified as reg (*)
8. reg/327378367.txt classified as reg
9. reg/328843261.txt classified as reg
10. reg/328410606.txt classified as reg (*)
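Result lines in the format above carry the true class (the path prefix) and the predicted class (after "classified as"), which is enough to recompute the headline metrics. A sketch that ignores the '(*)' markers, whose meaning isn't stated; the sample lines are taken from the list above:

```python
def score(lines, positive="van"):
    """Compute accuracy and positive-class precision from lines of the
    form 'N. true/<id>.txt classified as pred [(*)]'."""
    tp = fp = correct = 0
    for line in lines:
        parts = line.split()
        true = parts[1].split("/")[0]   # class folder in the path
        pred = parts[4]                 # word after "classified as"
        correct += true == pred
        if pred == positive:
            tp += true == positive
            fp += true != positive
    accuracy = correct / len(lines)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, precision

results = [
    "1. reg/327726655.txt classified as reg (*)",
    "2. reg/326826476.txt classified as van (*)",
    "3. van/328213101.txt classified as reg (*)",
    "4. reg/327231041.txt classified as reg",
]
print(score(results))   # -> (0.5, 0.0)
```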
Following are some great learnings from the course:
1. Cloud buzz!
2. What it takes to make a cloud from scratch
3. Security issues & practical concerns while moving to a third-party cloud
4. How Eucalyptus works
5. Google MapReduce
6. Apache Hadoop
7. Amazon EC2
8. Teamwork & coordination
Thank You