Textbook Big Data Analytics With Java 1St Edition Rajat Mehta Ebook All Chapter PDF
Rajat Mehta
Big Data Analytics with Java
Table of Contents
Big Data Analytics with Java
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data Analytics with Java
Why data analytics on big data?
Big data for analytics
Big data – a bigger pay package for Java developers
Basics of Hadoop – a Java sub-project
Distributed computing on Hadoop
HDFS concepts
Design and architecture of HDFS
Main components of HDFS
HDFS simple commands
Apache Spark
Concepts
Transformations
Actions
Spark Java API
Spark samples using Java 8
Loading data
Data operations – cleansing and munging
Analyzing data – count, projection, grouping, aggregation, and max/min
Actions on RDDs
Paired RDDs
Transformations on paired RDDs
Saving data
Collecting and printing results
Executing Spark programs on Hadoop
Apache Spark sub-projects
Spark machine learning modules
MLlib Java API
Other machine learning libraries
Mahout – a popular Java ML library
Deeplearning4j – a deep learning library
Compressing data
Avro and Parquet
Summary
2. First Steps in Data Analysis
Datasets
Data cleaning and munging
Basic analysis of data with Spark SQL
Building SparkConf and context
Dataframe and datasets
Load and parse data
Analyzing data – the Spark-SQL way
Spark SQL for data exploration and analytics
Market basket analysis – Apriori algorithm
Full Apriori algorithm
Implementation of the Apriori algorithm in Apache Spark
Efficient market basket analysis using FP-Growth algorithm
Running FP-Growth on Apache Spark
Summary
3. Data Visualization
Data visualization with Java JFreeChart
Using charts in big data analytics
Time Series chart
All India seasonal and annual average temperature series dataset
Simple single Time Series chart
Multiple Time Series on a single chart window
Bar charts
Histograms
When would you use a histogram?
How to make histograms using JFreeChart?
Line charts
Scatter plots
Box plots
Advanced visualization technique
Prefuse
IVTK Graph toolkit
Other libraries
Summary
4. Basics of Machine Learning
What is machine learning?
Real-life examples of machine learning
Type of machine learning
A small sample case study of supervised and unsupervised learning
Steps for machine learning problems
Choosing the machine learning model
What are the feature types that can be extracted from the datasets?
How do you select the best features to train your models?
How do you run machine learning analytics on big data?
Getting and preparing data in Hadoop
Preparing the data
Formatting the data
Storing the data
Training and storing models on big data
Apache Spark machine learning API
The new Spark ML API
Summary
5. Regression on Big Data
Linear regression
What is simple linear regression?
Where is linear regression used?
Predicting house prices using linear regression
Dataset
Data cleaning and munging
Exploring the dataset
Running and testing the linear regression model
Logistic regression
Which mathematical functions does logistic regression use?
Where is logistic regression used?
Predicting heart disease using logistic regression
Dataset
Data cleaning and munging
Data exploration
Running and testing the logistic regression model
Summary
6. Naive Bayes and Sentiment Analysis
Conditional probability
Bayes theorem
Naive Bayes algorithm
Advantages of Naive Bayes
Disadvantages of Naive Bayes
Sentimental analysis
Concepts for sentimental analysis
Tokenization
Stop words removal
Stemming
N-grams
Term presence and Term Frequency
TF-IDF
Bag of words
Dataset
Data exploration of text data
Sentimental analysis on this dataset
SVM or Support Vector Machine
Summary
7. Decision Trees
What is a decision tree?
Building a decision tree
Choosing the best features for splitting the datasets
Advantages of using decision trees
Disadvantages of using decision trees
Dataset
Data exploration
Cleaning and munging the data
Training and testing the model
Summary
8. Ensembling on Big Data
Ensembling
Types of ensembling
Bagging
Boosting
Advantages and disadvantages of ensembling
Random forests
Gradient boosted trees (GBTs)
Classification problem and dataset used
Data exploration
Training and testing our random forest model
Training and testing our gradient boosted tree model
Summary
9. Recommendation Systems
Recommendation systems and their types
Content-based recommendation systems
Dataset
Content-based recommender on MovieLens dataset
Collaborative recommendation systems
Advantages
Disadvantages
Alternating least square – collaborative filtering
Summary
10. Clustering and Customer Segmentation on Big Data
Clustering
Types of clustering
Hierarchical clustering
K-means clustering
Bisecting k-means clustering
Customer segmentation
Dataset
Data exploration
Clustering for customer segmentation
Changing the clustering algorithm
Summary
11. Massive Graphs on Big Data
Refresher on graphs
Representing graphs
Common terminology on graphs
Common algorithms on graphs
Plotting graphs
Massive graphs on big data
Graph analytics
GraphFrames
Building a graph using GraphFrames
Graph analytics on airports and their flights
Datasets
Graph analytics on flights data
Summary
12. Real-Time Analytics on Big Data
Real-time analytics
Big data stack for real-time analytics
Real-time SQL queries on big data
Real-time data ingestion and storage
Real-time data processing
Real-time SQL queries using Impala
Flight delay analysis using Impala
Apache Kafka
Spark Streaming
Typical uses of Spark Streaming
Base project setup
Trending videos
Sentiment analysis in real time
Summary
13. Deep Learning Using Big Data
Introduction to neural networks
Perceptron
Problems with perceptrons
Sigmoid neuron
Multi-layer perceptrons
Accuracy of multi-layer perceptrons
Deep learning
Advantages and use cases of deep learning
Flower species classification using multi-Layer perceptrons
Deeplearning4j
Handwritten digit recognition using CNN
Diving into the code:
More information on deep learning
Summary
Index
Big Data Analytics with Java
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Livery Place
35 Livery Street
ISBN 978-1-78728-898-0
www.packtpub.com
Credits
Author
Rajat Mehta
Reviewers
Dave Wentzel
Roberto Casati
Commissioning Editor
Veena Pagare
Acquisition Editor
Chandan Kumar
Deepti Thore
Technical Editors
Jovita Alva
Sneha Hanchate
Copy Editors
Safis Editing
Laxmi Subramanian
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Pratik Shirodkar
Graphics
Tania Dutta
Production Coordinator
Shantanu N. Zagade
Cover Work
Shantanu N. Zagade
About the Author
Rajat Mehta is a VP (technical architect) in technology at JP Morgan Chase in
New York. He is a Sun certified Java developer and has worked on Java-related
technologies for more than 16 years. For the past few years, his role has
relied heavily on a big data stack and on running analytics on it. He also
contributes to various open source projects, available on his GitHub
repository, and writes frequently for dev magazines.
About the Reviewers
Dave Wentzel is the CTO of Capax Global, a data consultancy specializing in
SQL Server, cloud, IoT, data science, and Hadoop technologies. Dave helps
customers with data modernization projects. For years, Dave worked at big
independent software vendors, dealing with the scalability limitations of
traditional relational databases. With the advent of Hadoop and big data
technologies everything changed. Things that were impossible to do with data
were suddenly within reach.
Before joining Capax, Dave worked at Microsoft, assisting customers with big
data solutions on Azure. Success for Dave is solving challenging problems at
companies he respects, with talented people whom he admires.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to
all Packt books and video courses, as well as industry-leading tools to help you
plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our
editorial process. To help us improve, please leave us an honest review on this
book’s Amazon page at https://www.amazon.com/dp/1787288986.
If you’d like to join our team of regular reviewers, you can e-mail us at
customerreviews@packtpub.com. We award our regular reviewers with free
eBooks and videos in exchange for their valuable feedback. Help us be relentless
in improving our products!
Chapter 2, First Steps in Data Analysis, takes the first steps towards the field of
analytics on big data. We start with a simple example covering basic statistical
analytic steps, followed by two popular algorithms for building association rules
using the Apriori Algorithm and the FP-Growth Algorithm. For all case studies,
we have used realistic examples of an online e-commerce store to give insights
to users as to how these algorithms can be used in the real world.
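Association rules are commonly scored by support and confidence. The following is a minimal sketch in plain Java (not Spark, and not the book's own code) of how those two measures could be computed. The class name, method names, and the four sample baskets are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names and data are made up, not taken from the book.
final class AssociationRuleSketch {

    // A tiny, invented set of shopping baskets.
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> tx = new ArrayList<>();
        tx.add(new HashSet<>(Arrays.asList("bread", "milk")));
        tx.add(new HashSet<>(Arrays.asList("bread", "butter")));
        tx.add(new HashSet<>(Arrays.asList("bread", "milk", "butter")));
        tx.add(new HashSet<>(Arrays.asList("milk")));
        return tx;
    }

    // Support: fraction of transactions that contain every item in the itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // Confidence of the rule antecedent -> consequent:
    // support(antecedent and consequent together) / support(antecedent).
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        double antecedentSupport = support(transactions, antecedent);
        return antecedentSupport == 0.0 ? 0.0 : support(transactions, both) / antecedentSupport;
    }

    public static void main(String[] args) {
        List<Set<String>> tx = sampleTransactions();
        Set<String> bread = Collections.singleton("bread");
        Set<String> milk = Collections.singleton("milk");
        System.out.println("support(bread) = " + support(tx, bread));
        System.out.println("confidence(bread -> milk) = " + confidence(tx, bread, milk));
    }
}
```

Apriori and FP-Growth differ in how they search for frequent itemsets, but both aim to avoid scoring every possible item combination by brute force as this sketch does.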
Chapter 5, Regression on Big Data, explains how you can use linear regression
to predict continuous values and how you can do binary classification using
logistic regression. A real-world case study of house price evaluation based on
the different features of the house is used to explain the concepts of linear
regression. To explain the key concepts of logistic regression, a real-life case
study of detecting heart disease in a patient based on different features is used.
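As a rough illustration of the two ideas named above, the sketch below fits a straight line by ordinary least squares and evaluates the sigmoid function that logistic regression uses to turn a score into a probability. It is plain Java rather than Spark, the class and method names are hypothetical, and the sample numbers are invented.

```java
// Illustrative sketch only: names and data are made up, not taken from the book.
final class RegressionSketch {

    // Fits y = intercept + slope * x by ordinary least squares.
    // Returns a two-element array: {intercept, slope}.
    static double[] fitLine(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    // Sigmoid (logistic) function: maps any real score into the range (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Invented data lying exactly on the line y = 1 + 2x.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fitLine(x, y);
        System.out.println("intercept = " + line[0] + ", slope = " + line[1]);
        System.out.println("sigmoid(0) = " + sigmoid(0.0));
    }
}
```

A binary classifier would typically label an example positive when the sigmoid output exceeds a chosen threshold such as 0.5.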
Chapter 7, Decision Trees, explains that decision trees are like flowcharts and
can be programmatically built using concepts such as Entropy or Gini Impurity.
The golden egg in this chapter is a case study that shows how we can predict
whether a person's loan application will be approved or not using decision trees.
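Entropy and Gini impurity both measure how mixed the class labels are at a node of the tree. Here is a minimal sketch in plain Java of the two formulas, assuming a node is described only by its per-class sample counts. The class name, method names, and counts are hypothetical.

```java
// Illustrative sketch only: names and counts are made up, not taken from the book.
final class ImpuritySketch {

    // Shannon entropy in bits, given the number of samples in each class.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) {
                continue; // an empty class contributes nothing
            }
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gini impurity: 1 minus the sum of squared class proportions.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    public static void main(String[] args) {
        int[] mixed = {5, 5};  // e.g. 5 approved and 5 rejected applications (invented)
        int[] pure = {10, 0};  // every application has the same outcome
        System.out.println("entropy(mixed) = " + entropy(mixed) + ", gini(mixed) = " + gini(mixed));
        System.out.println("entropy(pure) = " + entropy(pure) + ", gini(pure) = " + gini(pure));
    }
}
```

When building a tree, a split is usually preferred if it lowers the weighted impurity of the child nodes compared with the parent.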
Chapter 8, Ensembling on Big Data, explains how ensembling plays a major role
in improving the performance of the predictive results. I cover different concepts
related to ensembling in this chapter, including techniques such as how multiple
models can be joined together using bagging or boosting thereby enhancing the
predictive outputs. We also cover the highly popular and accurate ensemble of
models, random forests and gradient-boosted trees. Finally, we predict loan
default by users, using a real-world dataset from Lending Club (an online lending
company), with these models.
Chapter 10, Clustering and Customer Segmentation on Big Data, speaks about
clustering and how it can be used by a real-world e-commerce store to segment
their customers based on how valuable they are. I have covered both k-Means
clustering and bisecting k-Means clustering, and used both of them in the
corresponding case study on customer segmentation.
Chapter 11, Massive Graphs on Big Data, covers an interesting topic, graph
analytics. We start with a refresher on graphs, with basic concepts, and later go
on to explore the different forms of analytics that can be run on the graphs,
whether path-based analytics involving algorithms such as breadth-first search,
or connectivity analytics involving degrees of connection. A real-world flight
dataset is then used to explore the different forms of graph analytics, showing
analytical concepts such as finding top airports using the page rank algorithm.
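To give a feel for path-based and connectivity analytics, the sketch below runs a breadth-first search over a tiny in-memory route map and reports an airport's out-degree. It is plain Java and does not use GraphFrames. The class name, method names, and routes are hypothetical and invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and routes are made up, not taken from the book.
final class FlightGraphSketch {

    // Adjacency list: airport code -> airports reachable by one direct flight.
    static Map<String, List<String>> sampleRoutes() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("JFK", Arrays.asList("LHR", "ORD"));
        g.put("ORD", Arrays.asList("SFO"));
        g.put("LHR", Arrays.asList("DEL"));
        g.put("SFO", Arrays.asList("DEL"));
        g.put("DEL", Collections.<String>emptyList());
        return g;
    }

    // Breadth-first search: fewest flights from source to target, or -1 if unreachable.
    static int minHops(Map<String, List<String>> graph, String source, String target) {
        if (source.equals(target)) {
            return 0;
        }
        Map<String, Integer> hops = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        hops.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String next : graph.getOrDefault(current, Collections.<String>emptyList())) {
                if (hops.containsKey(next)) {
                    continue; // already reached by a path that is at least as short
                }
                hops.put(next, hops.get(current) + 1);
                if (next.equals(target)) {
                    return hops.get(next);
                }
                queue.add(next);
            }
        }
        return -1;
    }

    // Out-degree: number of direct outgoing routes from an airport.
    static int outDegree(Map<String, List<String>> graph, String airport) {
        return graph.getOrDefault(airport, Collections.<String>emptyList()).size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = sampleRoutes();
        System.out.println("JFK -> DEL hops: " + minHops(g, "JFK", "DEL"));
        System.out.println("JFK out-degree: " + outDegree(g, "JFK"));
    }
}
```

On massive graphs the same questions are answered with a distributed graph library, since an in-memory adjacency map like this one does not scale beyond a single machine.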
Chapter 12, Real-Time Analytics on Big Data, introduces real-time analytics,
starting with a few examples from the real world. We also
learn about the products that are used to build real-time analytics systems on top
of big data. We particularly cover the concepts of Impala, Spark Streaming, and
Apache Kafka. Finally, we cover two real-life case studies on how we can build
trending videos from data that is generated in real-time, and also do sentiment
analysis on tweets by depicting a Twitter-like scenario using Apache Kafka and
Spark Streaming.
Chapter 13, Deep Learning Using Big Data, speaks about the wide range of
applications that deep learning has in real life, whether it's self-driving cars,
disease detection, or speech recognition software. We start with the very basics
of what a biological neural network is and how it is mimicked in an artificial
neural network. We also cover a lot of the theory behind artificial neurons and
finally cover a simple case study of flower species detection using a multi-layer
perceptron. We conclude the chapter with a brief introduction to the
Deeplearning4j library and also cover a case study on handwritten digit
classification using convolutional neural networks.
What you need for this book
There are a few things you will require to follow the examples in this book: a
text editor (I use Sublime Text), internet access, admin rights to your machine to
install applications and download sample code, and an IDE (I use Eclipse and
IntelliJ).
You will also need other software such as Java, Maven, Apache Spark, Spark
modules, the GraphFrames library, and the JFreeChart library. We mention the
required software in the respective chapters.
You also need a computer with ample RAM, or you can run the
samples on Amazon AWS.
Who this book is for
If you already know some Java and understand the principles of big data, this
book is for you. This book can be used by a developer who has mostly worked
on web programming or any other field to switch into the world of analytics
using machine learning on big data.
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
Dataset<Row> data = spark.read().csv("data/heart_disease_data.csv");
System.out.println("Number of Rows -->" + data.count());
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think
about this book—what you liked or may have disliked. Reader feedback is
important for us to develop titles that you really get the most out of.
If there is a topic that you have expertise in and you are interested in either
writing or contributing to a book, see our author guide on
www.packtpub.com/authors.
You can also download the code files by clicking on the Code Files button on
the book's webpage at the Packt Publishing website. This page can be accessed
by entering the book's name in the Search box. Please note that you need to be
logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the
folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
Cable Facilities
Metric System
The National City Bank, 55 Wall St., New York City, which led the
way, has branches in six of the South American Republics,
The Mercantile Bank of the Americas, 44 Pine St., New York,
The American Foreign Banking Corporation, 53 Broadway, New
York,
W. R. Grace and Company’s Bank, 7 Hanover Square, New York,
The First National Bank, 70 Federal St., Boston,
The American Express Company, 65 Broadway, New York, with
offices in Buenos Aires, Argentina; Montevideo, Uruguay; and
Valparaiso, Chile; and with correspondents in other cities, performs
some banking service.
British Banks
Important banks with New York offices and with many branches in
South America are:
The Anglo South American Bank, 49 Broadway, New York,
affiliated with
The British Bank of South America, and with
The Commercial Bank of Spanish America, 49 Broadway, New
York;
The London and River Plate Bank, 51 Wall St., New York,
The London and Brazilian Bank, 56 Wall St., New York,
The Royal Bank of Canada, 68 William St., New York.
Freight Only
Colombia
To Brazil Only