Dr G Sudha Sadasivam Professor CSE, PSGCT
"Data is a new class of economic asset, like currency and gold." — World Economic Forum, 2012
Data is the new raw material
Agenda
• Big Data
• NoSQL databases
• Hadoop
• MR examples
• Case Study
  • Genome analysis
  • Prediction of stock prices
1. SCIENCE PARADIGM SHIFT
1) Observation, description and experimentation
2) Theory development
3) Simulation and models
4) Volume, variety and real-time (RT) data
Trend: New "Big Data" becoming commonplace — massive volumes of data:
• 10 Terabytes per day
• Transactions: 46 Terabytes per year
• Genomes: Petabytes per year
• 7 Terabytes per day
• 20 Petabytes per day
• Call Records: 5 Terabytes per day
• New Video Uploads: 4.5 Terabytes per day
• 100 Terabytes per year
• LHC: 40 Terabytes per second
Challenge: Big Data's characteristics are challenging conventional information management architectures:
• Massive and growing amounts of information residing internal and external to the organization
• Unconventional semi-structured or unstructured (diverse) data, including web pages, sensor data from active and passive systems, instant messages, click-streams, social media, log files, emails and text messages
• Changing information
Applications: sentiment analytics, multi-channel analytics, transaction analytics, claim fraud analytics, warranty claim analytics, surveillance analytics, CDR analytics
What is big data? "A massive volume of both structured and unstructured data that is so large that it's difficult to store, share, process, analyse, visualise and manage with traditional database and software techniques." The term was introduced by Roger Magoulas of O'Reilly in 2005.
History
1) 1970 – atmospheric and oceanic environmental sciences
2) Since 2008, more references
3) Different disciplines, including earth, health, engineering, arts and humanities, and environmental sciences
4) Computer science
What Makes it Big Data? (V4)
• Volume: Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18), Zettabyte (10^21)
• Variety: structured, semi-structured, unstructured; text, image, audio, video, record
• Velocity: dynamic, sometimes time-varying
Where Do We See Big Data?
Big Data vis-à-vis Existing Communities
Machine Learning NLP
Complex Event Processing
Big data use cases
Today's Challenge → New Data → What's Possible
• Healthcare: expensive office visits → remote patient monitoring → preventive care, reduced hospitalization
• Manufacturing: in-person support → product sensors → automated diagnosis and support
• Location-Based Services: based on home zip code → real-time location data → geo-advertising, traffic, local search
• Public Sector: standardized services → tailored services, cost reductions
• Retail: one-size-fits-all marketing → sentiment analysis, segmentation
Why Is Big Data Important?
• US health care: increase industry value by $300 B per year
• Manufacturing: decrease development/assembly costs by up to 50%
• Global personal location data: increase service provider revenue by $100 B per year
• Europe public sector administration: increase industry value by €250 B per year
• US retail: increase net margin by 60+%
2. RDBMS
Problems: rigid schema, ACID properties, performance, scalability
Data Trends
• Trend 1: Size
• Trend 2: Connectedness
• Trend 3: Semi-structure
• Trend 4: Architecture
Web / new-generation data:
1. Connected (many-to-many relationships / tree-like characteristics)
2. Requiring frequent schema changes
3. Having tables with lots of columns, each of which is only used by a few rows
4. Having attribute tables
Next Generation Databases (Web scale) – 2009: NoSQL – non-relational
• The term NoSQL ("Not only SQL") was first used in the 1990s by Carlo Strozzi, for a flat-file database
• Problems in relational databases: scalability, rigid schema, consistency, performance
• NoSQL characteristics – with or without ACID properties:
  – distributed, open-source
  – horizontally scalable: new columns / nodes can be added
  – vertically scalable: more info can be added
  – data partitioning and replication support
  – schema-free: no predefined schema
  – easy API, no joins
  – performance
NoSQL properties (CAP)
• CONSISTENCY: all database clients see the same data, even with concurrent updates (cf. the ACID properties in SQL)
• AVAILABILITY: all database clients are able to access some version of the data
• PARTITION TOLERANCE: the database can be split over multiple servers
NoSQL Data Models
Advantages of NoSQL
• Cheap, easy to implement
• Data are replicated and can be partitioned
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Quickly process large amounts of data
• Relax the data consistency requirement (CAP)
Disadvantages
• New and sometimes buggy
• Data is generally duplicated; potential for inconsistency
• No standardized schema
• No standard format for queries; no standard language
• Most NoSQL systems avoid in-memory storage
• No guarantee of support
3. Hadoop: myth vs reality
Hadoop is not a direct replacement for enterprise data stores that are used to manage structured or transactional data. It augments enterprise data architectures by providing an efficient way of storing, processing, managing and analyzing large volumes of semi-structured or unstructured data. Hadoop is useful across virtually every vertical industry.
Hadoop – Some Use Cases
• Digital marketing automation
• Log analysis and event correlation
• Fraud detection and prevention
• Predictive modeling for new drugs
• Social network and relationship analysis
• ETL (Extract, Transform, Load) on unstructured data
• Image correlation and analysis
• Collaborative filtering
Hadoop – What do we expect from it? If we analyze the mentioned use cases, we realize the needs are:
• Heterogeneous data from various sources
• Real-time processing of incoming data
• A connector to the existing RDBMS
• A distributed file system
• A data warehouse
• A scalable database
• A GUI to operate and develop applications for Hadoop
• A framework for parallel compute
• Machine learning and data mining support
Hadoop – Components
• HDFS – distributed file system
• MapReduce – distributed processing of large data sets
• HBase – scalable distributed DB; supports structured data
• Hive – data warehousing framework
• Pig – framework for parallel computation
• ZooKeeper – coordination service for distributed apps
• Oozie – workflow service to manage data processing jobs
• Chukwa – to monitor large distributed systems
• Hue – GUI to operate and develop Hadoop applications
• Flume – to move large data post-processing efficiently
• Mahout – machine learning library
• Avro – data serialization system
• Sqoop – connector to structured databases
• …and many more
Hadoop – Who's Using It?
• Uses Hadoop and HBase for: social services, structured data storage, processing for internal use
• Uses Hadoop for: Amazon's product search indices; they process millions of sessions daily for analytics
• Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups
• Uses Hadoop for: search optimization and research
• Uses Hadoop for: internal log reporting/parsing systems designed to scale to infinity and beyond; a web-wide analytics platform
• Uses Hadoop as: a source for reporting/analytics and machine learning
• And many more…
Hadoop – The Various Forms Today
• Apache Hadoop – native Hadoop distribution from the Apache Foundation
• Yahoo! Hadoop – Hadoop distribution of Yahoo
• CDH – Hadoop distribution from Cloudera
• GreenPlum Hadoop – Hadoop distribution from EMC
• HDP – Hadoop platform from Hortonworks
• M3 / M5 / M7 – Hadoop distributions from MapR
• Project Serengeti – VMware's implementation of Hadoop on vCenter
• And more…
Hadoop and MR Programming
A framework for running applications on large clusters of commodity hardware (~1000 nodes) that produce huge data (petabytes – zettabytes), and for processing it. Open-source Apache Software Foundation project. Hadoop includes HDFS, a distributed filesystem to distribute data, and Map/Reduce, which implements the data-parallel programming model over HDFS.
CONCEPT: Moving computation is more efficient than moving large data.
Each cluster node runs both DFS and MR.
Hadoop Cluster Architecture
HDFS Architecture
• NameNode: maps <filename, offset> → <blockid, datanode>
• DataNode: maps block → local disk
• Secondary NameNode: periodically merges edit logs
A block is also called a chunk.
Dataflow in Hadoop (figure sequence):
1. Submit job – the master schedules map and reduce tasks
2. Mappers read input file blocks (Block 1, Block 2, …) from HDFS
3. Finished maps write output to the local FS and report completion + location
4. Reducers fetch map output over HTTP GET
5. Reducers write the final answer to HDFS
4. MR Examples
CALCULATING PI
• The area of the square, denoted As, is (2r)^2 = 4r^2; the area of the circle, denoted Ac, is π r^2, so π = 4 * Ac / As
• MAP: generate random points in the square and count the number that also fall inside the circle
• REDUCE: π ≈ 4 * (number of points in the circle) / (number of points in the square)
• Restricted parallel programming model meant for large clusters – the user implements Map() and Reduce()
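The estimator above can be sketched as a tiny single-process map/reduce job in Python (a hedged illustration; function names and the number of map splits are my own, not from the slides):

```python
import random

# Each "mapper" generates random points in the unit square and counts how many
# fall inside the quarter circle; the "reducer" sums the counts and applies
# pi ~= 4 * (points in circle) / (total points).

def map_points(n, seed):
    """Mapper: emit (points_in_circle, points_generated) for one split."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, n

def reduce_counts(partials):
    """Reducer: sum partial counts and compute the pi estimate."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total

partials = [map_points(100_000, seed) for seed in range(4)]  # 4 map tasks
pi_est = reduce_counts(partials)
print(round(pi_est, 2))  # close to 3.14
```

On a real cluster each `map_points` call would run as a separate map task; the seed keeps the splits independent.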
Ex: WORD COUNT EXAMPLE
• Divide the document and analyse one line in one mapper
• Each mapper counts words in its line
• Reducer sums up all counts
• File:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• Map: for the given sample input, the first map emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  The second map emits:
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The output of the first combine: <Bye, 1> <Hello, 1> <World, 2>
The output of the second combine: <Goodbye, 1> <Hadoop, 2> <Hello, 1>
Thus the output of the job (reduce) is: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
• Map() – input <filename, file text>; parses the file and emits <word, 1> pairs, e.g. <"Hello", 1>
• Reduce() – sums all values for the same key and emits <word, TotalCount>, e.g. <"Hello", (1 1)> => <"Hello", 2>
Parallelisation – Hadoop MapReduce
File: Hello World Bye World / Hello Hadoop GoodBye Hadoop
Map outputs: <Hello, 1> <World, 1> <Bye, 1> <World, 1> | <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
Reduce output: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
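The word-count job above can be sketched as a minimal single-process Python program (an illustration of the map/combine/reduce stages, not Hadoop's actual API):

```python
from collections import defaultdict
from itertools import groupby

# One map call per input line, an in-mapper combine step, and a reduce that
# totals the counts per word after a simulated shuffle (sort by key).

def map_line(line):
    """Map: emit <word, 1> for every word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-sum the counts emitted by a single mapper."""
    sums = defaultdict(int)
    for word, count in pairs:
        sums[word] += count
    return sorted(sums.items())

def reduce_all(combined_outputs):
    """Reduce: merge combiner outputs and total the counts per word."""
    shuffled = sorted(pair for out in combined_outputs for pair in out)
    return {word: sum(c for _, c in group)
            for word, group in groupby(shuffled, key=lambda kv: kv[0])}

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
result = reduce_all([combine(map_line(line)) for line in lines])
print(result)  # {'Bye': 1, 'Goodbye': 1, 'Hadoop': 2, 'Hello': 2, 'World': 2}
```

The combiner is an optimization: it shrinks the data shuffled to the reducers, exactly as in the slide's worked example.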
I work in a company where the Engineering Department produces an amazing amount of CAD (Computer Assisted Design) files. Over the years we ended up having hundreds of thousands, if not millions, of files hosted on different filer systems. The files are not very big (a couple of dozen MB), but there are so many of them. Quite often the engineers need to access those files to modify/evolve/consult the information inside. The problem is that even though the engineers know precisely the name of the file they want, it takes quite a while (sometimes more than an hour) for the filer system to actually find it and send it back to the engineer's PC, because no indexing system exists on the filer hosting system (the system tests every single inode until the correct one is found). Is Hadoop suitable???
1. Genome clustering
• A genome is a DNA sequence (A, T, G, C)
• Features characterize a genome – motifs are extracted (MR1)
• Input: AAAGGGTTTCCCAAAG
• Mapper emits: <AAA,1> <AAG,1> <AGG,1> <GGG,1> <GGT,1> <GTT,1> <TTT,1> <TTC,1> <TCC,1> <CCC,1> <CCA,1> <CAA,1> <AAA,1> <AAG,1>
• Reducer emits: <AAA,2> <AAG,2> <AGG,1> <GGG,1> <GGT,1> <GTT,1> <TTT,1> <TTC,1> <TCC,1> <CCC,1> <CCA,1> <CAA,1>
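The MR1 step can be sketched in Python: the mapper slides a length-3 window over the sequence and emits <motif, 1>, and the reducer totals the count per motif (function names are my own; the slides only describe the emitted pairs):

```python
from collections import Counter

def map_kmers(sequence, k=3):
    """Mapper: emit one <k-mer, 1> pair per window position."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]

def reduce_kmers(pairs):
    """Reducer: total the count for each motif."""
    counts = Counter()
    for kmer, one in pairs:
        counts[kmer] += one
    return dict(counts)

counts = reduce_kmers(map_kmers("AAAGGGTTTCCCAAAG"))
print(counts["AAA"], counts["AAG"], counts["GGG"])  # 2 2 1
```

With a 4-letter alphabet and k = 3 there are 4^3 = 64 possible motifs, which is where the slide's [1x64] feature descriptor vector comes from.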
Pipeline: read the input sequence → obtain a [1x64] feature descriptor vector (MR1) → compare to the existing species by clustering (MR2) → identify the species
K-Means
• Fix up cluster centres
• Find the distance between samples and cluster centres
• Find which cluster centre is closest to a given sample
• Add the sample to the cluster
• Re-evaluate cluster centres
MapReduce k-means
• Map – key: clusterid, value: motif count; calculates the distance of each point from the centers; key-out: clusterid, value-out: distance to each cluster center
• Reduce – finds the closest cluster center and recalculates the new centers for the next iteration; output: new centers
MR K-Means
• Let k be the number of features in a sample; V1(k) is the value of the kth feature in the sample, and V2(k) is the value of the kth feature of the cluster centre
• Mapper: the distance to the ith cluster centre is Si = sqrt( Σ_k (V1(k) − V2(k))^2 )
• Reducer: min[Si] is evaluated for a species, considering all cluster centres i
• The centroid is re-evaluated over all features (k) of the samples V1 assigned to cluster centre V2; this is done to update the centroid
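One k-means iteration in map/reduce style can be sketched as follows (the data points and initial centres are illustrative, not from the slides):

```python
import math

def distance(v1, v2):
    """Euclidean distance over all k features: sqrt(sum_k (V1(k)-V2(k))^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def map_assign(sample, centres):
    """Mapper: emit <closest cluster id, sample> for one data point."""
    cid = min(range(len(centres)), key=lambda i: distance(sample, centres[i]))
    return cid, sample

def reduce_centres(assignments, centres):
    """Reducer: recompute each centre as the mean of its assigned samples."""
    groups = {i: [] for i in range(len(centres))}
    for cid, sample in assignments:
        groups[cid].append(sample)
    new_centres = []
    for i, old in enumerate(centres):
        pts = groups[i]
        if pts:  # average feature-wise; keep the old centre if no samples
            new_centres.append([sum(col) / len(pts) for col in zip(*pts)])
        else:
            new_centres.append(old)
    return new_centres

samples = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]]  # illustrative data
centres = [[0.0, 0.0], [10.0, 10.0]]                        # initial centres
assignments = [map_assign(s, centres) for s in samples]      # map phase
centres = reduce_centres(assignments, centres)               # reduce phase
print(centres)  # [[1.25, 1.5], [8.5, 8.5]]
```

In the full algorithm the map/reduce pair is repeated until the centres stop moving.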
Accuracy comparison (figure: accuracy in percentage vs. length of input sequence)
Time efficiency (figure: time vs. length of input sequence)
HADOOP ON AWS (table: cluster configurations by number of master, core and task instances; exact counts lost alignment in extraction)
PERFORMANCE (table: file size vs. uploading, computation and overall time — sizes 15.9 KB, 1 MB, 54.4 MB and 118.9 MB; uploading times from 1 sec up to a few minutes, computation times 4–9 min, overall times 6–13 min; exact row alignment lost in extraction)
(Figure: MR vs. total time for cluster configurations sss, ssm, ssl, mms.)
(Table: computation and overall time by number and type of instances — master/core/task counts with small (s), medium (m) and large (l) instance types; computation times 5–8 min, overall times 9–13 min; exact alignment lost in extraction.)
Local vs cloud cluster (figure: local Hadoop vs. AWS times for 1 KB, 15 KB, 15 MB and 119 MB inputs)
6. Case Study – Prediction of stock prices
• A Support Vector Machine (SVM) is a supervised learning method that analyzes data used for classification and regression analysis
• Learning in an SVM is done by finding a hyperplane that separates the training set in a feature space, using a kernel function which is an inner product of input-space features
• Binary classification can be viewed as the task of separating classes in feature space:
  f(x) = sign(w^T x + b)
  w^T x + b = 0 (the hyperplane); w^T x + b > 0; w^T x + b < 0
  where w is the normal vector to the hyperplane, x the training patterns, b the threshold, α the coefficients and s the support vectors, with w = Σ_i αi si
Steps
1. Data collection – download data sets for a month/year/day
2. Input feature selection – MA, RSI and volume average
3. Finding support vectors – S1, S2, α1, α2
4. Construction of the hyperplane y = w^T x + b
5. Classification of test data into +ive / −ive points using the hyperplane and the X values
a) Input features
• Moving Average: MA = (Σ_{i=1}^{n} CPi) / n, where CPi is the closing price on the ith day
• Relative Strength Index: RSI = 100 * RS / (1 + RS), where RS = AU / AD; AU = total upward price changes during the past n days, AD = total downward price changes during the past n days
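The two features can be computed directly from the formulas above. A small Python sketch, on made-up closing prices (the function names and the sample series are my own, not the slide's data):

```python
def moving_average(closes):
    """MA = (sum of closing prices CPi) / n."""
    return sum(closes) / len(closes)

def rsi(closes):
    """RSI = 100 * RS / (1 + RS), with RS = AU / AD over the window."""
    changes = [b - a for a, b in zip(closes, closes[1:])]
    au = sum(c for c in changes if c > 0)   # total upward price changes
    ad = -sum(c for c in changes if c < 0)  # total downward price changes
    if ad == 0:                             # no down days: RSI saturates
        return 100.0
    rs = au / ad
    return 100 * rs / (1 + rs)

closes = [7.87, 8.53, 8.32, 8.15, 8.70]  # illustrative closing prices
print(round(moving_average(closes), 2), round(rsi(closes), 2))
```

For this series AU = 0.66 + 0.55 = 1.21 and AD = 0.21 + 0.17 = 0.38, so RS ≈ 3.18 and RSI ≈ 76.1.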
b) Input data & feature selection example
(Table: daily OPEN/HIGH/LOW/CLOSE/VOLUME rows for dates in August 2011; row values lost alignment in extraction.)
From the table: X = MA = 8.54; RS = 3.94, so Y = RSI = 3.94 / (1 + 3.94) * 100 = 79.8 → a + point (an increase over the total data set average)
c) Repeat for other months
(Table: for each month — August, June, May, April, March, November, July, February, January, December — the MA, the RSI and a +/− label are computed; e.g. August: MA 8.54, RSI 79.8, +. The remaining row values lost alignment in extraction.)
3. Support Vectors
• Support vectors are calculated by using the Euclidean distance formula
• In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), then the distance is given by D(X, Y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 ), where X and Y are positive and negative points of (Moving Average, Relative Strength Index)
a) The distance between every +ive point and every −ive point is computed, and the pair with the minimum distance gives the support vectors S1 (a positive point) and S2 (a negative point). (The worked numeric table lost its alignment in extraction.)
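The support vector selection can be sketched as a minimum over all positive/negative pairs. The (MA, RSI) points below are illustrative stand-ins, since the slide's table is garbled:

```python
import math

def euclidean(x, y):
    """D(X, Y) = sqrt((x1-y1)^2 + (x2-y2)^2) in the (MA, RSI) plane."""
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

def support_vectors(positive, negative):
    """Return the (positive, negative) pair at minimum Euclidean distance."""
    return min(((p, n) for p in positive for n in negative),
               key=lambda pair: euclidean(*pair))

pos = [(8.54, 79.8), (12.44, 83.5), (10.3, 74.8)]   # illustrative + points
neg = [(10.05, 55.0), (12.03, 42.2)]                # illustrative - points
s1, s2 = support_vectors(pos, neg)
print(s1, s2)  # the closest +/- pair becomes S1, S2
```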
b) Finding α1 and α2
• α1 (S1 · S1) + α2 (S1 · S2) = +1
• α1 (S1 · S2) + α2 (S2 · S2) = −1
Solving this 2×2 linear system with the dot products of the support vectors (values such as 7300.11 and 7444.95 in the slide) yields α1 and α2. (The slide's worked values are garbled in extraction.)
4. Construction of the Hyperplane
• The hyperplane is constructed using the equation y = w^T x + b, where w = Σ αi si
• w is the normal vector to the hyperplane, x the training patterns, b the threshold, α the coefficients and s the support vectors
• The α values are calculated using the equations from step 3
37 .25 x + 0.• • • • y = wT x + b wT = ∑ αi Si = α1 s1 + α2 s2 b = 0.02 hyperplane 0.02 y= 1.
5. Testing set
• MA = x1 = 12.44, RSI = x2 = 50, to predict the volume average
• Using the hyperplane y = 1.25 x + 0.02: Y = 34.05 + 0.02 = 34.07 (predicted volume average)
• Overall average = 18; increased % = (34.07 − 18) / 18 ≈ 89%
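Steps 4–5 amount to evaluating y = w^T x + b at the test point and comparing against the overall average. In the sketch below the 2-d weight vector is an illustrative assumption (the slide's actual w is not recoverable), chosen so the output matches the slide's worked value y = 34.07:

```python
def hyperplane(w, x, b):
    """Evaluate y = w^T x + b for 2-d w and x."""
    return w[0] * x[0] + w[1] * x[1] + b

w = (0.5, 0.5566)   # assumed weights for illustration, not the slide's w
x = (12.44, 50.0)   # slide's test point: MA = 12.44, RSI = 50
y = hyperplane(w, x, b=0.02)
increase = (y - 18) / 18 * 100  # % change over the overall average of 18
print(round(y, 2), round(increase, 1))  # 34.07 89.3
```

The ~89% figure matches the slide's conclusion for this test point.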
Multi-SVM
• The support vector machine is a powerful tool for binary classification, capable of generating very fast classifier functions following a training period
• To extend the binary-class scenario to the multi-class scenario, decompose an M-class problem into a series of two-class problems
in which there is one binary SVM for each pair of classes to separate members of one class from members of the other • Binary decision tree classification .Approaches for Multi class SVM • Multiclass ranking SVMs. in which there is one binary SVM for each class to separate members of that class from members of other classes • Pairwise classification. in which one SVM decision function attempts to classify all classes • One-against-all classification.
Classes: −100 to −50, −50 to 0, 0 to 50, 50 to 100 (used for both one-against-all and pairwise classification)
BDT: the root (−100 to 100) splits into (−100 to 0) and (0 to 100), which split into the leaves −100 to −50, −50 to 0, 0 to 50 and 50 to 100
7. Conclusion – Big Data Is About…
• Tapping into diverse data sets
• Finding and monetizing unknown relationships
• Data-driven business decisions
Big Data in Action (ACQUIRE → ORGANIZE → ANALYZE → DECIDE)
• Acquire all available data
• Organize and distill big data using massive parallelism
• Analyze all your data at once
• Decide based on real-time big data
• Make better decisions using big data
Summary
• Big data
• NoSQL
• Hadoop
• Case Study