
Big Data Analytics Internal II Answers

1. What is a Regression Model? Give an example.

A regression model determines a relationship between an independent variable and a
dependent variable by providing a function. Formulating a regression analysis helps you
predict the effects of the independent variable on the dependent one.

Yi = f(Xi, β)+ei
Yi = dependent variable
f = function
Xi = independent variable
β = unknown parameters
ei = error terms
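
A minimal sketch of fitting such a model, assuming a linear form for f and using scikit-learn; the library choice and the numbers are illustrative, not part of the original answer.

```python
# Fit Yi = f(Xi, beta) + ei with a linear f, using made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (independent) vs. sales (dependent)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # Xi
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])             # Yi

model = LinearRegression().fit(X, y)                 # estimates beta
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for X = 6:", model.predict([[6.0]])[0])
```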

2. What is the main purpose of Multivariate Analysis?

The purpose of multivariate data analysis is to study the relationships among the p
attributes, classify the n collected samples into homogeneous groups, and make
inferences about the underlying populations from the sample.

3. Define Statistical Methods?


Statistical methods are mathematical formulas, models, and techniques that are used in
statistical analysis of raw research data. The application of statistical methods extracts
information from research data and provides different ways to assess the robustness of
research outputs.
4. Difference between SVM and kernel?

SVM:
 The SVM algorithm tries to find the optimal hyperplane which can be used to classify new data points; in two dimensions the hyperplane is a simple line.
 The SVM is formulated as a system of equations that involves minimizing a cost function (defined by the kernel function and a regularization function).

Kernel:
 A kernel is a set of mathematical functions used in the Support Vector Machine that provides the window to manipulate the data.
 Compared to the other classification and regression algorithms, the kernel approach is completely different: SVMs and kernel functions are different things, the SVM being the classifier and the kernel being the function it uses to transform the data.
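
As an illustration of this relationship, the following sketch (assuming scikit-learn; the dataset is synthetic) trains the same SVM classifier with two different kernels.

```python
# Same SVM, two kernels: the classifier stays the same, only the kernel function changes.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)          # SVM = the classifier, kernel = the function it uses
    clf.fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```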

5. What are Cloud Deployment Models?

A cloud deployment model is defined according to where the infrastructure for the
deployment resides and who has control over that infrastructure. The commonly used
deployment models are public cloud, private cloud, hybrid cloud, and community cloud.

6. Define streams?

Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously and in small sizes (on the order of kilobytes).

7. What is stream data model?


A stream model explains how to reconstruct the underlying signals from individual
stream items. Thus, understanding the model is a prerequisite for stream processing and
stream mining.
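
As a toy illustration of processing a stream item by item, the following pure-Python sketch maintains a fixed-size sliding window over a simulated stream; the window size and data are made up for illustration.

```python
# Compute a running mean over the most recent items of a data stream.
from collections import deque

def sliding_window_means(stream, window_size=3):
    window = deque(maxlen=window_size)       # keeps only the most recent items
    for item in stream:
        window.append(item)
        yield sum(window) / len(window)      # aggregate over the current window

# Simulated stream of small records arriving one at a time
for mean in sliding_window_means([10, 12, 11, 15, 14, 13]):
    print(round(mean, 2))
```
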
8. What is real-time analytics platform in big data?
A real-time analytics platform enables organizations to make the most of real-time
data by helping them extract valuable information and trends from it. Such
platforms help in measuring data from the business point of view in real time, thereby
making the best use of the data.

9. What is real-time sentiment analysis and what are its applications?


Sentiment analysis is the automated process of analyzing text to determine the sentiment
expressed (positive, negative or neutral).
Some popular sentiment analysis applications include social media monitoring,
customer support management, and analyzing customer feedback.
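
A minimal sketch of sentiment classification, assuming NLTK with its VADER lexicon is installed (run nltk.download("vader_lexicon") once beforehand); the example texts and thresholds are illustrative.

```python
# Score each text as positive, negative, or neutral using NLTK's VADER analyzer.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for text in ["I love this product!", "The service was terrible.", "It arrived on time."]:
    score = analyzer.polarity_scores(text)["compound"]   # -1 (negative) .. +1 (positive)
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:8s} {score:+.2f}  {text}")
```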

10. What are the different techniques to collect the samples from a stream?
 Sliding Window.
 Unbiased Reservoir Sampling (sketched below).
 Biased Reservoir Sampling.
 Histograms.
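
A sketch of unbiased reservoir sampling (Vitter's Algorithm R), in which every item seen so far has the same probability k/n of being in the sample; the stream here is simulated for illustration.

```python
# Keep a uniform random sample of size k from a stream of unknown length.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randrange(n)           # uniform in 0 .. n-1
            if j < k:                         # replace with probability k/n
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```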

PART – B

11. A) What is R in big data analytics?

R analytics (or the R programming language) is free, open-source software used for all
kinds of data science, statistics, and visualization projects. The R programming language is
powerful, versatile, and able to be integrated into BI platforms like Sisense, to help you
get the most out of business-critical data.

Get the most out of data analysis using R

 R, and its sister language Python, are powerful tools to help you maximize your
data reporting.
 Instead of using programming languages through a separate development tool like
RStudio or Jupyter Notebooks, you can integrate R straight into your analytics
stack, allowing you to predict critical business outcomes, create interactive
dashboards using practical statistics, and easily build statistical models.
 Integrating R and Python means advanced analytics can happen faster, with
accurate and up-to-date data.
Uses of R analytics
 There are multiple ways for R to be deployed today across a variety of industries
and fields. One common use of R for business analytics is building custom data
collection, clustering, and analytical models.
 Instead of opting for a pre-made approach, R data analysis allows companies to
create statistics engines that can provide better, more relevant insights due to
more precise data collection and storage.
 In academia and more research-oriented fields, R is an invaluable tool, as these
fields of study usually require highly specific and unique modeling.

 As such, organizations can quickly custom-build analytical programs that can fit
in with existing statistical analyses while providing a much deeper and more
accurate outcome in terms of insights.

 Even when it comes to social media or web data, R can usually provide models
that deliver better or more specific insights than standard measures.

B) What is meant by clustering high-dimensional data? Describe its types.

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few
dozen to many thousands of dimensions. Such high-dimensional spaces of data are often
encountered in areas such as medicine, where DNA microarray technology can produce
many measurements at once, and the clustering of text documents, where, if a word-
frequency vector is used, the number of dimensions equals the size of the vocabulary.

Subspace clustering:

Subspace clustering is an extension of traditional clustering that seeks to find
clusters in different subspaces within a dataset. Often in high-dimensional data,
many dimensions are irrelevant and can mask existing clusters in noisy data.

Projected clustering:

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in
different subspaces. The general approach is to use a special distance function together
with a regular clustering algorithm.

Projection-based clustering:

 Projection-based clustering is based on a nonlinear projection of high-dimensional
data into a two-dimensional space.

 Typical projection methods like t-distributed stochastic neighbor embedding (t-SNE)
or the neighbor retrieval visualizer (NerV) are used to project data explicitly into
two dimensions, disregarding the subspaces of dimension higher than two and
preserving only relevant neighborhoods in the high-dimensional data.

 In the next step, the Delaunay graph between the projected points is calculated, and
each edge between two projected points is weighted with the high-dimensional
distance between the corresponding high-dimensional data points.
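
A simplified sketch of the projection-based idea, assuming scikit-learn: the data is projected to two dimensions with t-SNE and then clustered. The Delaunay-graph weighting step described above is omitted; DBSCAN on the 2-D projection is used here as an illustrative stand-in, so this is not the full method.

```python
# Project 64-dimensional digit images to 2-D with t-SNE, then cluster the projection.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X, _ = load_digits(return_X_y=True)                       # high-dimensional data
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_2d)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```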

Hybrid approaches:

 Not all algorithms try to either find a unique cluster assignment for each point or find all
clusters in all subspaces; many settle for a result in between, in which a number of
possibly overlapping, but not necessarily exhaustive, clusters are found.

 An example is FIRES, which in its basic approach is a subspace clustering
algorithm, but uses a heuristic that is too aggressive to credibly produce all subspace
clusters.

 Another hybrid approach is to include a human in the algorithmic loop: human
domain expertise can help to reduce an exponential search space through heuristic
selection of samples.

12. Describe stock market prediction. What algorithms are used for stock market
prediction?

The method involves collecting news and social media data and extracting the sentiments
expressed by individuals. Then the correlation between the sentiments and the stock values is
analyzed. The learned model can then be used to make future predictions about stock values.

A financial exchange has two essential functions:

 The first is to facilitate the process by which companies can trade.

 The second is to organize and manage the setting in which trading can properly take place.

Investing in and profiting from the market has never been simple, because of the inherent
uncertainty and highly volatile nature of the market, i.e. share values can rise and
fall quickly. Volatility is a statistical measure of the dispersion of returns for a given
security or market index. Usually, the higher the volatility, the riskier the security.

Historical volatility, also known as realized volatility, is the volatility of the actual prices of the
underlying stocks. Financial markets have proven to be among the most challenging yet rewarding
and profitable areas for investment.

 For example, futures and options have attracted a great deal of attention recently.

There are numerous factors that influence the financial markets, including
political events, natural disasters, economic conditions, etc. In spite of the complexity
of the movements in market prices, market behavior is not totally random. Rather, it is
governed by a highly nonlinear dynamical system.

Predicting future prices is carried out based on technical analysis, which studies the
market's past price movements.

Efficient Market Hypothesis

 Technical market analysis is in contradiction with the Efficient Market
Hypothesis (EMH). The EMH was developed in 1970 by the economist Eugene Fama,
whose theory states that it is impossible for an investor to outperform the market, as
all the available information is already reflected in stock prices.

 If the EMH were true, it would be impossible to use machine learning methods for
market prediction. Nevertheless, there are many successful technical analyses in the
financial world, and a growing number of studies in the academic literature use
machine learning techniques for market prediction.

Algorithms used for stock market prediction:

 There are three conventional approaches to stock price prediction: technical
analysis, traditional time series forecasting, and machine learning methods.

 Earlier, classical regression methods such as linear regression, polynomial regression,
etc. were used to predict stock trends.

 Also, traditional statistical models, which include exponential smoothing, moving
average, and ARIMA, make their predictions linearly.

 Nowadays, Support Vector Machines (Cortes & Vapnik, 1995) (SVM) and Artificial
Neural Networks (ANN) are widely used for the prediction of stock price
movements.

 Every algorithm has its own way of learning patterns and then predicting. The Artificial
Neural Network (ANN) is a popular and more recent method which also incorporates
technical analysis for making predictions in financial markets.

 An ANN includes a set of threshold functions. These functions are connected to each
other with adaptive weights, trained on historical data, and then used to make future
predictions.

In this study, we have used variations of ANN to predict the stock price, but the efficiency of
forecasting by ANNs depends upon the learning algorithm used to train the ANN. This paper
compares three training algorithms, i.e., Levenberg-Marquardt (LM), Scaled Conjugate Gradient
(SCG), and Bayesian Regularization; neural networks with 20 hidden layers and a delay of 50
data points are used. Thus, each prediction is made using the last 50 values.
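
A simplified sketch of the "last 50 values predict the next value" setup, using a synthetic price series and scikit-learn's MLPRegressor as a stand-in trainer (scikit-learn does not provide the Levenberg-Marquardt, SCG, or Bayesian Regularization training used in the study, and a single hidden layer of 20 neurons is used here purely for illustration).

```python
# Build lagged windows of 50 past prices and train a small ANN to predict the next price.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 2000)) + 100        # synthetic price series

delay = 50
X = np.array([prices[i:i + delay] for i in range(len(prices) - delay)])
y = prices[delay:]                                       # value following each window

model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
model.fit(X[:-200], y[:-200])                            # hold out the last 200 points
print("test R^2:", model.score(X[-200:], y[-200:]))
```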

One of the significant merits of the LM approach is that it performs similarly to gradient search
for large values of μ and to Newton's method for small values of μ. The LM algorithm
merges the best attributes of the steepest-descent algorithm and the Gauss-Newton technique,
and many of their limitations are avoided. Specifically, this algorithm handles the problem of slow
convergence efficiently.
Big Data Analytics Internal II Answers

The training phase stops when any of the following conditions occurs:

 The maximum number of iterations is reached.
 The maximum training time is exceeded.
 The performance is reduced to the target.
 The gradient of the performance is lower than the minimum gradient.
 The validation performance has failed more than the maximum number of times since the last time it
decreased (when using validation).

 The first algorithm is based on Levenberg-Marquardt optimization, which uses an
approximation to the Hessian matrix to approach second-order training speed. This gives
excellent results and takes a few hours to train.

 The second algorithm, Scaled Conjugate Gradient (SCG), is based on conjugate directions and
uses a step-size scaling mechanism to avoid a time-consuming line search per learning
iteration, which makes this algorithm much faster than second-order algorithms like
Levenberg-Marquardt. Training using SCG takes a few minutes, which is a significant
improvement over Levenberg-Marquardt, but the error also increases in the tick data
prediction.

 The third algorithm, Bayesian Regularization, takes a few days to train over a large dataset
but gives better results than both Levenberg-Marquardt and SCG.

 All three algorithms provide an accuracy of 99.9% using tick data, but the accuracy over the
15-minute test dataset changes completely. SCG takes the least time and gives the best results
compared to Levenberg-Marquardt and Bayesian Regularization. However, the results obtained
on the 15-minute dataset are significantly poorer in comparison with those obtained using
tick data.
