04/23/2020 1
Apache Ambari
• Apache Ambari is software for provisioning, managing and monitoring Hadoop
clusters. The Hadoop stack it supports is often described in three layers:
Core Hadoop, Essential Hadoop and Hadoop Support.
• Many companies are looking for Apache Ambari job candidates for several roles.
• An Apache Ambari job description might consist of developing a tool that uses
the technology to monitor and support Hadoop clusters.
Apache Ambari
• Ambari enables System Administrators to:
• Provision a Hadoop Cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop Cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop services
across the entire cluster.
• Monitor a Hadoop Cluster
• Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
• Ambari leverages Ambari Metrics System for metrics collection.
• Ambari leverages Ambari Alert Framework for system alerting and will notify you when your
attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Advantages of Ambari
• Its flexible and scalable user interface allows a range of tools such as Pig, MapReduce,
Hive, etc. to be installed on the cluster and administers their performance in a
user-friendly fashion. Some of the key features of this technology are:
• Instantaneous insight into the health of the Hadoop cluster using preconfigured
operational metrics
• User-friendly configuration providing an easy step-by-step guide for installation
• Installation of Apache Ambari is possible through Hortonworks Data Platform (HDP)
• Monitoring dependencies and performance by visualizing and analyzing jobs and tasks
• Authentication, authorization, and auditing by installing Kerberos-based Hadoop clusters
• Flexible and adaptive technology fitting perfectly in the enterprise environment
Ambari Vs Zookeeper
• Although their tasks may seem similar from a distance, these two technologies actually
perform different tasks on the same Hadoop cluster, making it agile, responsive,
scalable, and fault-tolerant.
• As an Apache Ambari Administrator, you will be creating and managing Ambari users and groups.
• You can also import users and groups from LDAP systems into Ambari.
Ambari Installation
• To build the cluster, the Install Wizard needs some general information about it, including
the fully qualified domain name (FQDN) of each host.
• Additionally, the wizard needs access to the private key file the user created in Set Up
Passwordless SSH. This is used to locate all the hosts in the system and to access and interact
with them securely.
• The list of hostnames, one per line, can be entered using the Target Hosts text box
• Select Provide Your SSH Private Key if you want Ambari to automatically install the Ambari
Agent on all your hosts using SSH. In the Host Registration Information, you can use
the Choose File button to find the private key file matching the public key installed earlier on
all your hosts. Alternatively, you can cut and paste the key into the text box manually
• Select Perform Manual Registration if you do not wish Ambari to automatically install the
Ambari Agent
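Before pasting a host list into the Target Hosts text box, it can help to check that every entry looks like an FQDN. The sketch below is only an illustration: the function name `parse_target_hosts` and the loose validation rule are assumptions, not part of Ambari.

```python
import re

# Loose FQDN check: dot-separated labels of letters, digits, and hyphens,
# with at least one dot. Illustrative rule only, not Ambari's own validation.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_FQDN = re.compile("^" + _LABEL + r"(?:\." + _LABEL + ")+$")


def parse_target_hosts(text):
    """Split a one-host-per-line block into (valid, rejected) lists."""
    valid, rejected = [], []
    for line in text.splitlines():
        host = line.strip()
        if not host:
            continue  # ignore blank lines
        if _FQDN.match(host):
            valid.append(host)
        else:
            rejected.append(host)
    return valid, rejected


if __name__ == "__main__":
    sample = "node1.example.com\nnode2.example.com\nnode3\n"
    ok, bad = parse_target_hosts(sample)
    print("valid:", ok)
    print("rejected:", bad)
```

Short names such as `node3` are rejected here because the wizard asks for fully qualified names.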
Business Intelligence
• Every business has data generated by its customers or stakeholders that is recorded electronically.
• All this data can be analyzed and mined using special tools and techniques to generate patterns and intelligence.
• This reflects how the business is functioning.
• Any business organization needs to continually monitor its business environment and its own
performance, and then rapidly adjust its future plans.
• Key performance indicators (KPIs) or key result areas (KRAs) are critical performance metrics for the
industries.
Business intelligence is a broad set of information technology (IT) solutions that includes tools for gathering,
analyzing, and reporting information to the users about performance of the organization and its environment.
History of Business Intelligence
• Data Integration
• The process of combining two or more data sets for sharing and analysis, in
order to support information management inside a business.
• Data Mining
• The process of extracting previously unknown and useful information from
large data sets or databases.
• Mined data of a company are then stored in a “Data Warehouse”
Gathering Business Intelligence
• Data Warehousing
• A data warehouse is a special database where all the useful information is
stored in one location for easy access.
• A data mart is essentially the same as a data warehouse, but holds a narrower,
predetermined selection of data.
• This makes for more efficient data analysis and reporting.
Gathering Business Intelligence
• Cognos
• Cognos 8 BI
• Information Builders
• WebFOCUS
• Microsoft
• Maestro BI
• ClearForest
• ClearResearch
• Sun Microsystems
• Sun StorEdge™ SAM-FS/QFS
Pattern Recognition
• A pattern is a design or model that helps grasp something
• Patterns help connect things that may not appear to be connected.
• Patterns help cut through complexity and reveal simpler understandable
trends.
• Patterns can be as definitive as hard scientific rules, like the rule that the
sun always rises in the east.
• A perfect pattern or model is one that
• accurately describes a situation,
• is broadly applicable, and
• can be described in a simple manner.
Pattern Recognition
• Patterns can be temporal, which is something that regularly occurs
over time.
• Patterns can also be spatial, such as things being organized in a
certain way.
• A spatial pattern, following the 80–20 rule, could be that the top 20 percent
of customers lead to 80 percent of the business.
• Patterns can be functional, in that doing certain things leads to certain
effects.
• A functional pattern may involve test-taking skills.
• Good patterns are often symmetric.
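The 80–20 pattern mentioned above can be checked against real figures with a few lines of code. This is a minimal sketch; the function name `top_share` and the revenue numbers are made up for illustration.

```python
def top_share(revenues, top_fraction=0.2):
    """Return the share of total revenue contributed by the top fraction of customers."""
    if not revenues:
        raise ValueError("revenues must not be empty")
    ranked = sorted(revenues, reverse=True)
    # Always include at least one customer in the "top" group.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


if __name__ == "__main__":
    # Hypothetical revenue per customer, for illustration only.
    revenues = [400, 400, 25, 25, 25, 25, 25, 25, 25, 25]
    print(f"Top 20% of customers bring in {top_share(revenues):.0%} of revenue")
```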
Pattern Recognition
• The economic meltdown of 2008 to 2009 was caused by the collapse of the
accepted pattern, that is, “housing prices always go up.”
• Knowing the business domain well is very important.
• One can predict the traffic pattern on highways from the movement of cell phones
(in cars) along the highway.
• Some patterns may be so sparse that a very large amount of diverse data has to be
seen together to notice any connections.
• For instance, locating the debris of a flight that may have vanished midcourse would require
bringing together data from many sources, such as satellites, ships, and navigation systems.
• The raw data may come with various levels of quality, and may even be conflicting.
• The data at hand may or may not be adequate for finding good patterns.
• Additional dimensions of data may need to be added to help solve the problem.
Data Processing Chain
• Data is the new natural resource. Implicit in this statement is the recognition
of hidden value in data.
• Data lies at the heart of business intelligence.
• Data can be modeled and stored in a database.
• Relevant data can be extracted from the operational data stores according to
certain reporting and analyzing purposes, and stored in a data warehouse.
• The data from the warehouse can be combined with other sources of data,
and mined using data mining techniques to generate new insights.
• The insights need to be visualized and communicated to the right audience in
real time for competitive advantage.
Data Processing Chain
• Anything that is recorded is data.
• Observations and facts are data. Anecdotes and opinions are also data, of a different kind.
• Data can be numbers, such as the record of daily weather or daily sales.
• Data can be alphanumeric, such as the names of employees and customers.
Data
• Data could come from any number of sources.
• It could come from operational records inside an organization, and it can come from records compiled by the
industry bodies and government agencies.
• Data could come from individuals telling stories from memory and from people’s interaction in social contexts.
• Data could come from machines reporting their own status or from logs of web usage.
• Data can come in many ways.
• It may come as paper reports.
• It may come as a file stored on a computer.
• It may be words spoken over the phone.
• It may be e-mail or chat on the Internet.
• It may come as movies and songs in DVDs, and so on.
• There is also data about data. It is called metadata.
• For example, people regularly upload videos on YouTube.
• The format of the video file (whether it was a high-def file or lower resolution) is metadata.
• The information about the time of uploading is metadata.
• The account from which it was uploaded is also metadata.
• The record of downloads of the video is also metadata.
Data Types
• Data could be an unordered collection of values.
• For example, a retailer sells shirts of red, blue, and green colors.
• There is no intrinsic ordering among these color values.
• One can hardly argue that any one color is higher or lower than the other.
• This is called nominal (means names) data.
• Data could be ordered values like small, medium, and large.
• For example, the sizes of shirts could be extra-small, small, medium, and large.
• There is clarity that medium is bigger than small, and large is bigger than medium. But the differences
may not be equal.
• This is called ordinal (ordered) data.
• Another type of data has discrete numeric values defined in a certain range, with the
assumption of equal distance between the values.
• Customer satisfaction score may be ranked on a 10-point scale with 1 being lowest and 10 being highest.
• This requires the respondent to carefully calibrate the entire range as objectively as possible and place
his or her own measurement in that scale.
• This is called interval (equal intervals) data.
Data Types
• The highest level of numeric data is ratio data that can take on any numeric value.
• The weights and heights of all employees would be exact numeric values.
• The price of a shirt will also take any numeric value.
• It is called ratio (any fraction) data.
• There is another kind of data that does not lend itself to much mathematical analysis,
at least not directly.
• Such data needs to be first structured and then analyzed.
• This includes data like audio, video, and graph files, often called BLOBs (Binary Large Objects).
• These kinds of data lend themselves to different forms of analysis and mining.
• Songs can be described as happy or sad, fast-paced or slow, and so on.
• They may contain sentiment and intention, but these are not quantitatively precise.
• Datafication is a new term that means that almost every phenomenon is now being
observed and stored.
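The four measurement levels described above (nominal, ordinal, interval, ratio) differ in which summary statistics are meaningful. The sketch below illustrates this with the shirt and satisfaction-score examples; the function names and sample values are assumptions chosen for illustration.

```python
from statistics import mean, mode

SIZE_ORDER = ["extra-small", "small", "medium", "large"]  # ordinal scale from the shirt example


def nominal_summary(colors):
    """Nominal data: only counting makes sense, so report the most frequent value."""
    return mode(colors)


def ordinal_summary(sizes):
    """Ordinal data: order is meaningful, so a middle rank (upper median) can be reported."""
    ranks = sorted(SIZE_ORDER.index(s) for s in sizes)
    return SIZE_ORDER[ranks[len(ranks) // 2]]


def interval_summary(scores):
    """Interval data: equal spacing is assumed, so a mean is meaningful."""
    return mean(scores)


def ratio_summary(prices):
    """Ratio data: values can be compared as multiples of one another."""
    return max(prices) / min(prices)


if __name__ == "__main__":
    print(nominal_summary(["red", "blue", "green", "blue"]))
    print(ordinal_summary(["small", "large", "medium", "medium", "extra-small"]))
    print(interval_summary([7, 9, 6, 8, 10]))
    print(ratio_summary([10.0, 20.0, 15.0]))
```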
Data Warehousing
Need for Data Warehousing
• Integrated, company-wide view of high-quality information
(from disparate databases)
• Separation of operational and informational systems and
data (for improved performance)
Standard DB (Operational) | Warehouse (Informational)
Mostly updates | Mostly reads
Many small transactions | Queries are long and complex
Mb-Gb of data | Gb-Tb of data
Current snapshot | History
Index/hash on p.k. | Lots of scans
Raw data | Summarized, reconciled data
Thousands of users (e.g., clerical users) | Hundreds of users (e.g., decision-makers, analysts)
[Figure: timeline from “Prehistoric Times” through the “Middle Ages” and the Data Revolution to Information-Based Management]
Data → Information: a process of transforming data into information and making it available to
users in a timely enough manner to make a difference
OLTP vs Data Warehouse
OLTP | Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Clerical user | Knowledge user (manager)
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
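The difference in access pattern between the two kinds of system can be illustrated with a tiny in-memory database. This is only a sketch: the `sales` table, its columns, and the sample rows are hypothetical, and a single SQLite file stands in for what would be separate systems in practice.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a few made-up sales records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    rows = [(1, "north", 120.0), (2, "south", 80.0), (3, "north", 50.0), (4, "east", 200.0)]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def oltp_lookup(conn, order_id):
    """OLTP-style access: fetch one detailed record by primary key."""
    cur = conn.execute(
        "SELECT order_id, region, amount FROM sales WHERE order_id = ?", (order_id,)
    )
    return cur.fetchone()


def warehouse_summary(conn):
    """Warehouse-style access: scan many rows and return summarized totals."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(oltp_lookup(conn, 2))
    print(warehouse_summary(conn))
```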
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture
[Figure: architecture diagram with the labels Source, Integration, Warehouse, Metadata, and Client]
Strategic uses of data warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
[Figure: levels from Data to Information, labeled individually structured, departmentally structured, and organizationally structured (data warehouse), with annotations less/more history, normalized, and detailed]
Warehouse vs. Data Mart
Gigabit Magazine
https://www.gigabitmagazine.com/top10/top-10-biggest-data-centres-world
Data Mining
• Art and science of discovering useful novel patterns from data
• E.g. seasonality of products
• E.g. customer segments with unique needs
• Supervised learning (right answer is known)
• Decision-making, e.g. approve loan or not
• Predictive patterns, e.g. sales next month
• Exploratory patterns (no right answer)
• Clusters, e.g. customer segments
• Association rules, e.g. products that sell together
Data Mining
Characteristics
• Selecting the right business problem is key
• High value problem
• Data should exist to solve the problem
• Data is the most critical ingredient for DM
• May include soft/unstructured data in addition to
structured (rectangular) data
• Data miner can be an analyst or the end user
• Striking it rich requires creative thinking
• Need effective and easy data mining tools
Data Mining – Major Techniques
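Category | Task | Techniques
Supervised Learning (predictive ability based on past data) | Classification – Machine Learning | Decision Trees, Neural Networks, Support Vector Machines, Naïve Bayes
Supervised Learning (predictive ability based on past data) | Classification – Statistics | Regression
Unsupervised Learning (exploratory analysis to discover patterns) | Clustering Analysis | K-Means
Unsupervised Learning (exploratory analysis to discover patterns) | Association Rules | Apriori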
What is data mining
• Data mining is the art and science of discovering
knowledge, insights and patterns in data.
• Predicting winning chances of a sports team
• Identifying friends and foes in warfare
• Forecasting rainfall patterns in a country or region
• Patterns must be valid, novel, potentially useful,
understandable
• E.g. “customers who buy cheese and milk also buy bread
90% of the time”
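A rule such as the cheese, milk, and bread example above is usually judged by its support (how often the whole rule occurs) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the function name `rule_stats` and the five baskets are made up for illustration, so the resulting numbers do not match the 90% in the example.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every antecedent item.
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    # Of those, baskets that also contain every consequent item.
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    # Hypothetical shopping baskets, for illustration only.
    baskets = [
        {"cheese", "milk", "bread"},
        {"cheese", "milk", "bread"},
        {"cheese", "milk"},
        {"milk", "bread"},
        {"eggs"},
    ]
    support, confidence = rule_stats(baskets, {"cheese", "milk"}, {"bread"})
    print(f"support={support:.2f}, confidence={confidence:.2f}")
```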
Why Data Mining
• Recognition of hidden value in data
• Field developed to help in science and defense
• Evolved to help develop competitive advantage in
business, fast, and at a global scale
• Ability to effectively gather quality data and
efficiently process it
• Availability of vast amounts of data on customers,
vendors, transactions, Web, machines, etc
• Technologies for consolidation and integration of data
sources into data warehouses
• Exponential increase in computing and storage
capabilities, and exponential decrease in costs
Supervised vs. Unsupervised Learning
• Supervised learning: classification is seen as
supervised learning from examples.
• Supervision: The data (observations, measurements, etc.)
are labeled with pre-defined classes, as if a
“teacher” provided the classes.
• Test data are classified into these classes too, and
predictive accuracy is checked.
• Unsupervised learning: e.g. clustering
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of
classes or clusters in the data
Supervised learning process: two steps
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy
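The two steps can be sketched with a very small classifier. The example below uses a 1-nearest-neighbour rule on a single numeric feature, chosen here only because it is short; the data points and labels are made up for illustration.

```python
def train(examples):
    """Learning step: for 1-nearest-neighbour, the model is the stored labeled examples."""
    return list(examples)


def predict(model, x):
    """Label a new value with the class of the closest training example."""
    nearest = min(model, key=lambda example: abs(example[0] - x))
    return nearest[1]


def accuracy(model, test_examples):
    """Testing step: fraction of unseen examples whose predicted label is correct."""
    correct = sum(1 for x, label in test_examples if predict(model, x) == label)
    return correct / len(test_examples)


if __name__ == "__main__":
    # Hypothetical (feature, label) pairs, for illustration only.
    training = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
    testing = [(1.5, "low"), (7.0, "high"), (4.0, "high")]
    model = train(training)
    print(f"accuracy on unseen data: {accuracy(model, testing):.2f}")
```

Keeping the test examples out of the training set is what makes the accuracy figure an estimate of performance on new data.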
Data mining methods/goals
• Decision Trees
• Popular, easy to use, machine learning technique
• Regression Analysis
• Statistical Technique to predict
• Artificial Neural Networks
• Sophisticated, versatile machine-learning technique
• Clustering
• Identifying a set of similarity groups in the data
• Association rules
• Discovering rules of the form X → Y, where X and Y are
sets of data items.
Confusion Matrix
Predicted class | True class: Positive | True class: Negative
Positive | True Positive (TP) | False Positive (FP)
Negative | False Negative (FN) | True Negative (TN)
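A minimal sketch of how the cells of a two-class confusion matrix can be counted: true and false positives, together with the false-negative and true-negative cells that complete the standard 2×2 layout. The function names and the sample labels are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, and TN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1  # predicted positive, truly positive
        elif p == positive:
            counts["FP"] += 1  # predicted positive, truly negative
        elif a == positive:
            counts["FN"] += 1  # predicted negative, truly positive
        else:
            counts["TN"] += 1  # predicted negative, truly negative
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    actual = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    counts = confusion_counts(actual, predicted)
    print(counts, f"accuracy={accuracy(counts):.2f}")
```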
Standard Data Mining Process
Generic Steps
• Understand the application domain
• Identify data sources and select target data
• Pre-process: cleaning, attribute selection
• Data mining to extract patterns or models
• Post-process: identifying interesting or useful patterns
• Incorporate patterns in real world tasks
(CRISP-DM)
Data Preparation – A Critical Task
• Quality of data is key to data mining effectiveness
• Breadth of data
• Structure / Schema
• Sparse / Missing values
• Information density
• Extract, Transform, Load (ETL) process
• Scripts for automation
• From operational to Data Warehouses
Steps from real-world data to well-formed data:
• Data Consolidation
· Collect data
· Select data
· Integrate data
• Data Cleaning
· Impute missing values
· Reduce noise in data
· Eliminate inconsistencies
• Data Transformation
· Normalize data
· Discretize/aggregate data
· Construct new attributes
• Data Reduction
· Reduce number of variables
· Reduce number of cases
· Balance skewed data
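Two of the preparation steps named on this slide, imputing missing values and normalizing data, can be sketched in a few lines. Mean imputation and min-max scaling are used here only as simple, common choices; the function names and sample values are assumptions for illustration.

```python
def impute_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = [10.0, None, 30.0, 20.0]  # hypothetical column with one missing value
    cleaned = impute_missing(raw)
    print(cleaned)
    print(min_max_normalize(cleaned))
```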
Comparison of Popular Data Mining Platforms
Feature | Excel | IBM SPSS Modeler | Weka
Ownership | Commercial | Commercial, expensive | Open-source, free
Data Mining Features | Limited; extensible with add-on modules | Extensive features, unlimited data sizes | Extensive, performance issues with large data
Stand-alone | Stand-alone | Embedded in BI software suites | Stand-alone
User skills needed | End-users | For skilled BI analysts | Skilled BI analysts
User interface | Text and click, easy | Drag & drop use, colorful, beautiful GUI | GUI, mostly b&w text output
Data formats | Industry-standard | Variety of data sources accepted | Proprietary
Data in Data Mining
• Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments
• Data may consist of numbers, words, images, …
• Data: lowest level of abstraction (from which information
and knowledge are derived)
[Figure: data divided into Categorical and Numerical]
Data Mining Best Practices
• Asking the right business questions.
• Being creative and open in proposing imaginative hypotheses
• Data should be clean and of high quality
• Continuously engaging with the data
• Dissemination and rollout of the solution
Data Mining Wisdom: Myths
• Data mining …
• provides instant solutions/predictions
• is not yet viable for business applications
• requires a separate, dedicated database
• can only be done by those with advanced degrees
• is only for large firms that have lots of customer data
• is another name for the good-old statistics
Data Mining Wisdom: Common Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
Data Mining Wisdom: Common Mistakes
6. Ignoring suspicious (good or bad) findings and
quickly moving on
7. Running mining algorithms repeatedly and blindly,
without thinking about the next stage
8. Naively believing everything you are told about
the data
9. Naively believing everything you are told about
your own data mining analysis
10. Measuring your results differently from the way
your sponsor measures them
Dimensions of Data Mining
• DM Inputs
• Data Domains (industry, function, etc)
• Types of Data field (categorical, numerical, blobs)
• Data sources (operations, web)
• Data quality (missing values, outliers)
• DM Outputs/Goals
• Objective functions (prediction, cluster definition etc)
• Output description types (trees, rules, etc)
• Data representation types
• DM Processes
• Methods (Classification, Clustering, etc.)
• Statistical vs AI machine learning
• Algorithm types (decision trees, rules, neural net, etc)
• Reliability/Accuracy of results (ROC, Confusion matrix)
Review Questions
1. Describe the business intelligence and data mining cycle.
2. Describe the data processing chain.
3. What are the similarities between diamond mining and data mining?
4. What are the different data mining techniques? Which of these would
be relevant in your current work?
5. What is a dashboard? How does it help?
6. Create a visual to show the weather pattern in your city. Could you
show temperature, humidity, wind, and rain/snow together over a
period of time?
Data Visualization
What is visualization and data mining?
http://mathworld.wolfram.com/Box-and-WhiskerPlot.html
http://en.wikipedia.org/wiki/Box_plot
http://www.upscale.utoronto.ca/GeneralInterest/Harrison/Visualisation/Visualisation.html
Distribution visualisation – US Crime Story
Data Visualization – Common Display Types
Pie Chart
– Emphasizing differences in proportion among a few numbers
Line Graph – Trend visualization
• Fundamental technique of data presentation
• Used to compare two variables
– X-axis is often the control variable
– Y-axis is the response variable
• Good at:
– Showing specific values
– Trends
– Trends in groups (using multiple line graphs)
[Figure: line graphs of students participating in sporting activities and of mobile phone use]
http://www.babynamewizard.com/voyager
Scatter Plot (XY)
• Used to present measurements of two variables
• Effective if a relationship exists between the two variables
• Bar graph
– Presents categorical variables
– Height of bar indicates value
– Double bar graph allows comparison
– Note spacing between bars
– Can be horizontal (when would you use this?)
[Figure: bar graph of the number of police officers, 1999 to 2003]
What is wrong with this graph?
Bad Visualization:
Spreadsheet with misleading Y-axis
[Figure: the same data, 1999 to 2003, plotted with a truncated Y-axis]
Y-axis scale gives WRONG impression of big change
Better Visualization
• Extensions of Scatterplots
• Parallel Coordinates
• Radar Figures
• Other tools
• …
Multiple Views
[Figure: separate views of a small data table with variables A to E]
Problem: does not show correlations
Tableau bar comparisons
Business Analytics Tools – Manager Dashboards
Parallel Coordinates
[Figure: parallel coordinates plot]
• Aggregate multidimensional observations
• Each observation gets a separate colour or graph symbol
• Variables correspond to angles
Selected domain
Radar chart – indicator scores within the domain
Level I assessment
F. Nightingale (1856) – abstract representation
Business Analytics Tools – Typical Reports
• Treemaps display hierarchical data using rectangles. Each branch of the tree
is assigned a rectangle. Then each sub-branch gets assigned to a rectangle
and this continues recursively until a leaf node is found.
• Depending on choice the rectangle representing the leaf node is colored,
sized or both according to chosen attributes.
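The recursive assignment of rectangles described above can be sketched as follows. This uses a slice-and-dice layout (splitting along alternating axes at each level), which is one simple way to do it and not necessarily what any particular tool uses; the tree, its sizes, and the function names are made up for illustration.

```python
def node_size(node):
    """A leaf carries its own size; a branch is as large as the sum of its children."""
    children = node.get("children")
    if not children:
        return node["size"]
    return sum(node_size(child) for child in children)


def layout(node, x, y, width, height, depth=0, out=None):
    """Assign a rectangle (x, y, width, height) to every node, recursing until leaves."""
    if out is None:
        out = []
    out.append((node["name"], x, y, width, height))
    children = node.get("children") or []
    total = node_size(node)
    offset = 0.0
    for child in children:
        share = node_size(child) / total
        if depth % 2 == 0:
            # Even depth: place children side by side along the x-axis.
            layout(child, x + offset * width, y, share * width, height, depth + 1, out)
        else:
            # Odd depth: stack children along the y-axis.
            layout(child, x, y + offset * height, width, share * height, depth + 1, out)
        offset += share
    return out


if __name__ == "__main__":
    # Hypothetical hierarchy, for illustration only.
    tree = {
        "name": "root",
        "children": [
            {"name": "A", "size": 4},
            {"name": "B", "children": [{"name": "B1", "size": 3}, {"name": "B2", "size": 1}]},
        ],
    }
    for rect in layout(tree, 0, 0, 100, 100):
        print(rect)
```

Each leaf's area is proportional to its size; colouring by a second attribute would be a separate step.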
Gapminder – Motion Charts
Botanical
ConeTree SunTree
Visualization of Search Results & Inter-Document Similarities
Abstract → Text – MetaSearch Previews
Grokker Kartoo
MSN
Lycos AltaVista
MetaCrystal → searchCrystal
Brushing and Linking
Census Data
Visualization of Association Rules in SGI/MineSet 3.0
IBM Miner – visualization of mining results
SGI – other tools
Graph-based Techniques
Narcissus
• Visualization of a large number of web pages
• Visualization of complex, highly interconnected data
Visualization of knowledge discovery process