Big Data

Eufris 2012

Why should I care?
•$250 billion in annual savings in the EU alone from enhancing the public sector
•$600 billion in annual consumer surplus from using personal location data globally

•Annual growth of data is remarkable
•Data is the most valuable thing most companies have
•Data is massively underutilized


Forecast
There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

What is Big Data?
"Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis" IDC
"Big Data is a technology that helps extract value from the digital universe." IDC
"Techniques and technologies that make handling data at extreme scale economical." Forrester

ABC of Big Data
Analytics
•making sense of your data, in real time, in an easy way
Bandwidth
•ingesting, processing and delivering large amounts of data
Content
•storing, managing and retaining large amounts of data
www.netapp.com

3 V's of Big Data
Variety
•Big Data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more
Velocity
•Big Data is often time-sensitive. It must be used as it is streaming into the enterprise in order to maximize its value to the business
Volume
•Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information

A few core concepts

Hadoop
•The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
•Three subprojects
  •Hadoop Common
  •Hadoop Distributed Filesystem (HDFS)
  •Hadoop MapReduce

MapReduce
•Introduced by Google in 2004
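
The MapReduce programming model can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop or Google code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. A real framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or block) would be mapped on a
    # different node; here everything runs sequentially in one process.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data is valuable"])
print(counts)
```

The programmer only writes the map and reduce functions; distribution, grouping and fault tolerance are left to the framework.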

MapReduce on App Engine
•MapReduce is an experimental, innovative, and rapidly changing new feature for App Engine

NoSQL
•Definition 1
"Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent, a huge data amount, and more. The movement began early 2009 and is growing rapidly." nosql-database.org

NoSQL
•Definition 2
"In computing, NoSQL (sometimes expanded to 'not only SQL') is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally." Wikipedia

From ACID to BASE
•ACID: Atomicity, Consistency, Isolation, Durability
•BASE: Basically available, Soft state, Eventually consistent
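
What "eventually consistent" means in practice can be shown with a toy store. This is a minimal sketch under simplifying assumptions: the classes Replica and EventuallyConsistentStore are hypothetical, replication is simulated by an explicit sync() call, and no real NoSQL product is modeled.

```python
class Replica:
    """One copy of the data, as held by a single node."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on replica 0 and reach the others later."""

    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # Basically available: the write is accepted immediately,
        # without waiting for the other replicas.
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica=0):
        # Soft state: the answer depends on which replica is asked.
        return self.replicas[replica].data.get(key)

    def sync(self):
        # Eventually consistent: once queued updates are applied,
        # all replicas agree again.
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=1)  # replica 1 has not seen the write yet
store.sync()
fresh = store.read("cart", replica=1)  # now replica 1 has caught up
print(stale, fresh)
```

An ACID database would not return from the write until every reader was guaranteed to see it; BASE trades that guarantee for availability and horizontal scale.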

Big Data and cloud

Big Data on AWS

MapReduce on AWS
•Not yet Hadoop 1.0

MapReduce on AWS
•EC2
•S3 + DynamoDB

Google BigQuery
Features
•Speed - Analyze billions of rows(!) in seconds
•Simplicity - SQL-like query language, a browser-based graphical interface
•Scale - Terabytes of data, trillions of records
•Sharing - Powerful group- and user-based permissions using Google accounts
•Security - Secure SSL access
•Multiple access methods - Accessible through a REST API, a command-line tool, and Google Apps Script

BigQuery example

Big Data outside of cloud

Oracle Big Data Appliance
About $500,000
18 Oracle Sun Servers
•864 GB main memory
•216 CPU cores
•648 TB of raw disk storage
•40 Gb/s InfiniBand connectivity between nodes and engineered systems
•10 Gb/s Ethernet connectivity
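
The totals above are easier to compare with commodity hardware when broken down per server. A short sketch of that arithmetic, assuming the resources are spread evenly over the 18 servers and taking the approximate $500,000 price at face value:

```python
SERVERS = 18
APPROX_PRICE_USD = 500_000  # "about" figure from the slide, not a list price

totals = {"memory_gb": 864, "cpu_cores": 216, "raw_disk_tb": 648}

# Assumption: every server has the same configuration.
per_server = {name: total / SERVERS for name, total in totals.items()}

# Rough price per terabyte of raw (unreplicated) disk.
cost_per_raw_tb = APPROX_PRICE_USD / totals["raw_disk_tb"]

print(per_server)
print(round(cost_per_raw_tb))
```

Each server therefore carries 48 GB of memory, 12 cores and 36 TB of raw disk, at roughly $770 per raw terabyte for the whole appliance.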

Autonomy IDOL 10
"For far too long, organizations have confined structured data to relational databases and unstructured data to simplistic keyword matching technologies."
"IDOL 10 brings these worlds together, allowing organizations to automatically process, understand, and act on 100 percent of their data. as businesses can develop entirely new applications that explore the richness and color of Human Information that live in unstructured, semi-structured, and structured forms, in real-time. The results will be dramatic."
Price?

