
Implementing Big Data Analysis

Jump Start
Graeme Malcolm | Data Technology Specialist,
Content Master
Pete Harris | Learning Product Planner, Microsoft

Graeme Malcolm | @graeme_malcolm


Microsoft Data Platform Specialist
Consultant, trainer, and author since SQL Server 4.2
One of the world's first MCSEs in SQL Server 2012 BI
(Fairly) regular blogger at www.contentmaster.com

Longstanding partner with Microsoft


Lead author for Microsoft Official Curriculum SQL Server
2014 and SQL Server 2012 BI courses
Contributing author of Patterns and Practices Guide to Big
Data
Author of numerous training courses and Microsoft Press
titles since SQL Server 7.0

Pete Harris | @SQLPete


Learning Product Planner
Various roles at Microsoft since 1995

Course Topics
Implementing Big Data Analysis
01 | Introduction to Big Data
02 | Getting Started with HDInsight
03 | Windows Azure PowerShell
04 | Processing Big Data with Pig
05 | Processing Big Data with Hive
06 | Automating Big Data Processing
07 | Analyzing Big Data with Excel

Setting Expectations
Target Audience
BI professionals and data analysts

Suggested Prerequisites/Supporting Material


Experience using Microsoft Excel and Power BI
Knowledge of enterprise BI technologies

Join the MVA Community!


Microsoft Virtual Academy
Free online learning tailored for IT Pros and Developers
Over 1M registered users
Up-to-date, relevant training on a variety of Microsoft products

Earn while you learn!


Get 50 MVA Points for this event!
Visit http://aka.ms/MVA-Voucher
Enter this code: PowerJump1 (expires 8/15/2013)


01 | Introduction to Big Data



Module Overview
What is Big Data?
Big Data Technologies
Map/Reduce
Microsoft Tools for Big Data

What is Big Data?


Data that is too large or complex for analysis in
traditional relational databases
Typified by the 3 Vs:
Volume: Huge amounts of data to process
Variety: A mixture of structured and unstructured data
Velocity: New data generated extremely frequently
Web server log reporting

Social media sentiment analysis

Sensor anomaly detection

Big Data Technologies


Hadoop
Open source distributed data processing cluster
Data processed in Hadoop Distributed File System (HDFS)

Related projects
Hive
Pig
HCatalog
Oozie
Sqoop
Others

Hadoop Cluster
Name Node

Data Nodes

HDFS

Map/Reduce

Source data:
Lorem ipsum sit amet magma sit elit
Fusce magma sed sit amet magma

MAP output (one key/value pair per word, each with the value 1):
Line 1: Lorem, ipsum, sit, amet, magma, sit, elit
Line 2: Fusce, magma, sed, sit, amet, magma

REDUCE output:
Key     Value
Lorem   1
ipsum   1
sit     3
amet    2
magma   3
elit    1
Fusce   1
sed     1

1. Source data is divided among data nodes
2. Map phase generates key/value pairs
3. Reduce phase aggregates values for each key
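
The three steps above can be sketched in plain Java without a cluster. This is a minimal in-memory illustration, not Hadoop API code: the class name MapReduceSketch and its map and reduce helpers are hypothetical, each input line stands in for the share of data held by one data node, and the sample word is spelled "magma" throughout so the totals agree with the slide's result table.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the map and reduce phases.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the values emitted for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Step 1: the source data is divided; one line per (simulated) data node.
        List<String> splits = Arrays.asList(
                "Lorem ipsum sit amet magma sit elit",
                "Fusce magma sed sit amet magma");

        // Step 2: each split is mapped independently of the others.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }

        // Step 3: the emitted values are aggregated per key and printed.
        reduce(emitted).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real cluster the framework also sorts and groups the emitted pairs by key between the two phases; here the TreeMap plays that role.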

Map/Reduce Code in Hadoop

Usually written in Java and compiled as a JAR
Streaming enables other languages

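import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordCount is an assumed name for the enclosing class; the slide shows only the nested classes.
public class WordCount {
    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}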
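
The Streaming bullet above refers to running map and reduce steps as ordinary programs that exchange text over standard input and output: the mapper writes one key, a tab, and a value per line, and the reducer receives those lines sorted by key. The sketch below shows that contract using only the Java standard library. It stays in Java for consistency with the slide, although streaming is normally chosen in order to use some other language. The class name StreamingSketch and the "reduce" command-line argument are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-alone program following the stdin/stdout streaming contract.
class StreamingSketch {

    // Mapper side: for each input line, write "word<TAB>1" once per word.
    static void mapStream(Reader in, Writer out) {
        PrintWriter writer = new PrintWriter(out);
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        writer.println(word + "\t1");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer.flush();
    }

    // Reducer side: input lines arrive sorted by key, so equal keys are adjacent.
    static void reduceStream(Reader in, Writer out) {
        PrintWriter writer = new PrintWriter(out);
        try (BufferedReader reader = new BufferedReader(in)) {
            String currentKey = null;
            int sum = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that are not key<TAB>value
                }
                String key = line.substring(0, tab);
                int value = Integer.parseInt(line.substring(tab + 1).trim());
                if (!key.equals(currentKey)) {
                    if (currentKey != null) {
                        writer.println(currentKey + "\t" + sum);
                    }
                    currentKey = key;
                    sum = 0;
                }
                sum += value;
            }
            if (currentKey != null) {
                writer.println(currentKey + "\t" + sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer.flush();
    }

    public static void main(String[] args) {
        Reader stdin = new InputStreamReader(System.in);
        Writer stdout = new PrintWriter(System.out);
        if (args.length > 0 && args[0].equals("reduce")) {
            reduceStream(stdin, stdout);
        } else {
            mapStream(stdin, stdout);
        }
    }
}
```

Because the reducer only compares each key with the previous one, it depends on the sorted input that the framework provides between the two phases.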

Microsoft Tools for Big Data


SQL Server Parallel Data Warehouse
Enterprise data warehouse appliance
Massively Parallel Processing (MPP), shared-nothing
architecture

Windows Azure HDInsight


Cloud-based implementation of Hadoop
Available as a Windows Azure service

PolyBase
Integration technology for SQL Server Parallel Data
Warehouse and HDInsight

Module Summary
Big Data is characterized by
Volume
Variety
Velocity

Hadoop is an open source platform for Big Data processing
Map/Reduce is a distributed data processing technique
Microsoft is investing in solutions for Big Data

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered
trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of
Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT
MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
