
AIDS HOME ASSIGNMENT-4

Name: ABBAS ALI SHAIK


SEC: 2
REG. NO: 2000030016

1) Explain the "bag of words" model for text representation.


A) In the bag-of-words model, a text (such as a sentence or a document) is represented
as the bag (multiset) of its words, disregarding grammar and even word
order but keeping multiplicity.
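As a minimal sketch of this idea, using a hypothetical two-sentence corpus and only Python's standard library, each document becomes a vector of word counts over the shared vocabulary:

from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})   # the unique words of the corpus

for d in docs:
    counts = Counter(d.split())                        # word order is discarded
    print([counts[w] for w in vocab])                  # multiplicity is kept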
2) Different sources result in outliers in the dataset. Discuss what they are.
A) The most common causes of outliers in a data set are:

-> Data entry errors (human errors)

-> Measurement errors (instrument errors)

-> Experimental errors

-> Intentional (dummy outliers added to test detection methods)

-> Data processing errors

-> Sampling errors

-> Natural variation (not an error)

3) Describe the tf-idf score of a word. Explain why it is considered (mostly) as
the weight of the word in similarity score computation.

A) The tf-idf score of a word is high when the word occurs frequently in a document but
rarely in the rest of the corpus, so it is a natural weight for the word when computing
similarity scores between documents: distinctive words dominate the comparison while
common words are discounted. TF-IDF can be computed from scratch in Python on a
real-world dataset; the main steps are listed below, and a minimal retrieval sketch
follows the list.

• What is TF-IDF?
• Preprocessing the data
• Assigning weights to title and body
• Document retrieval using the TF-IDF matching score
• Document retrieval using TF-IDF cosine similarity
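A minimal retrieval sketch along these lines, assuming scikit-learn is available and using a hypothetical toy corpus and query (not the real-world dataset referred to above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats living together",
        "the stock market fell today"]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)           # tf-idf weights as document vectors
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix)[0]  # similarity of the query to every document
print(scores, scores.argmax())                        # the first document is retrieved as most similar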

4) Describe autoregressive models. Why are these models applicable to a
large class of FMCG and drug items?
A) An autoregressive model is one in which a value from a time series is regressed on
previous values from that same time series; for example, regressing yt on yt−1:

yt = β0 + β1 yt−1 + ϵt.

In this regression model, the response variable in the previous time period has
become the predictor, and the errors satisfy our usual assumptions about errors
in a simple linear regression model. The order of an autoregression is the
number of immediately preceding values in the series that are used to predict
the value at the present time. So, the preceding model is a first-order
autoregression, written as AR(1).

If we want to predict global temperature this year (yt) using measurements of global
temperature in the previous two years (yt−1, yt−2), then the autoregressive
model for doing so would be:

yt = β0 + β1 yt−1 + β2 yt−2 + ϵt.

Sales of FMCG and drug items tend to be strongly correlated with their own recent
history (demand in one period is close to demand in the preceding periods), so
regressing such a series on its own past values captures most of its predictable
structure; this is why autoregressive models are applicable to a large class of
FMCG and drug items.
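A minimal sketch of fitting such a model, assuming statsmodels is available and using a synthetic series in place of a real FMCG/drug sales history:

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))                 # toy series standing in for a sales history

model = AutoReg(y, lags=2).fit()                    # AR(2): yt = b0 + b1*y(t-1) + b2*y(t-2) + et
print(model.params)                                 # estimated b0, b1, b2
print(model.predict(start=len(y), end=len(y) + 4))  # forecasts for the next five periods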

5) Justify the statement: The moving average method of forecasting relies on
a "window of past k observations".

A) A moving average of order 5 (a 5-MA) provides an estimate of the
trend-cycle. Each value in the 5-MA column is the average of the observations
in the five-year window centred on the corresponding year; that is, it is the
estimate T̂t with k = 2 and m = 2k + 1 = 5, so each estimate is based on a window
of k observations on either side of the current one. In R this is easily computed using:

autoplot(elecsales, series="Data") +
  autolayer(ma(elecsales, 5), series="5-MA") +
  xlab("Year") + ylab("GWh") +
  ggtitle("Annual electricity sales: South Australia") +
  scale_colour_manual(values=c("Data"="grey50","5-MA"="red"),
                      breaks=c("Data","5-MA"))
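The same kind of centred 5-MA can be sketched in Python with pandas; the series values below are hypothetical placeholders, not the actual elecsales data:

import pandas as pd

sales = pd.Series([2300.0, 2350.0, 2400.0, 2380.0, 2450.0, 2500.0, 2480.0],
                  index=range(1990, 1997))          # toy annual series

ma5 = sales.rolling(window=5, center=True).mean()   # centred window: k = 2 years on each side
print(ma5)                                          # NaN at the ends where the window is incomplete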
6) Define time series. Describe the necessary and sufficient conditions for a
stationary time series.
A) A time series is a series of data points indexed in time order. Most
commonly, a time series is a sequence taken at successive, equally spaced
points in time. Time series data have a natural temporal ordering. This makes
time series analysis distinct from cross-sectional studies, in which there is no
natural ordering of the observations. Time series analysis is also distinct from
spatial data analysis, where the observations typically relate to geographical
locations. A stochastic model for a time series will generally reflect the fact
that observations close together in time will be more closely related than
observations further apart.

A time series yt is (weakly) stationary when its statistical properties do not depend
on the time at which it is observed: its mean is constant over time, its variance is
finite and constant, and the covariance between yt and yt+h depends only on the lag h
and not on t. Equivalently, a series with a trend or with seasonality is not stationary,
while one whose fluctuations look the same at every point in time is.
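One common empirical check related to these conditions is the Augmented Dickey-Fuller unit-root test; a minimal sketch with statsmodels on a toy white-noise series (which is stationary by construction):

import numpy as np
from statsmodels.tsa.stattools import adfuller

y = np.random.default_rng(1).normal(size=300)   # white noise: stationary by construction

adf_stat, p_value = adfuller(y)[:2]             # adfuller returns (statistic, p-value, ...)
print(adf_stat, p_value)                        # a small p-value rejects the unit-root (non-stationary) null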

7) Give the mathematical derivation of the ARIMA model as a combination of
AR and MA models.

A) A random variable that is a time series is stationary if its statistical
properties are all constant over time. A stationary series has no trend, its
variations around its mean have a constant amplitude, and it wiggles in a
consistent fashion, i.e., its short-term random time patterns always look the
same in a statistical sense.

The predicted value of Y is a constant and/or a weighted sum of one or more recent
values of Y and/or a weighted sum of one or more recent values of the errors.

ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of the stationarized series in
the forecasting equation are called "autoregressive" terms, lags of the forecast
errors are called "moving average" terms, and a time series which needs to be
differenced to be made stationary is said to be an "integrated" version of a
stationary series. Random-walk and random-trend models, autoregressive
models, and exponential smoothing models are all special cases of ARIMA
models.

A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:

• p is the number of autoregressive terms,
• d is the number of nonseasonal differences needed for stationarity,
• q is the number of lagged forecast errors in the prediction equation.
The forecasting equation is constructed as follows. First, let y denote the dth
difference of Y, which means:

If d=0: yt = Yt

If d=1: yt = Yt - Yt-1

If d=2: yt = (Yt - Yt-1) - (Yt-1 - Yt-2) = Yt - 2Yt-1 + Yt-2

Note that the second difference of Y (the d=2 case) is not the difference from 2
periods ago. Rather, it is the first difference of the first difference, which is the
discrete analog of a second derivative, i.e., the local acceleration of the series
rather than its local trend.
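As a quick numerical check of the d=1 and d=2 cases above, using numpy on a toy sequence:

import numpy as np

Y = np.array([3.0, 5.0, 8.0, 12.0])   # toy values of the original series
print(np.diff(Y))                     # [2. 3. 4.]  -> yt = Yt - Yt-1          (d = 1)
print(np.diff(Y, n=2))                # [1. 1.]     -> yt = Yt - 2Yt-1 + Yt-2  (d = 2)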

In terms of y, the general forecasting equation is:

ŷt = μ + ϕ1 yt−1 + … + ϕp yt−p − θ1 et−1 − … − θq et−q

Here the moving-average parameters (the θ's) are defined so that their signs in the
equation are negative. The individual parameters are often referred to as AR(1),
AR(2), …, and MA(1), MA(2), …, etc.
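A minimal sketch of fitting a nonseasonal ARIMA(p,d,q) model, assuming statsmodels is available; the series and the chosen order (1,1,1) are illustrative only:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(loc=0.1, size=200))   # toy trending series, so one difference (d=1) is sensible

result = ARIMA(y, order=(1, 1, 1)).fit()       # p=1 AR term, d=1 difference, q=1 lagged error (MA) term
print(result.params)                           # fitted AR, MA and noise-variance parameters
print(result.forecast(steps=5))                # next five predicted values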

8) Differentiate between collaborative and content-based filtering
techniques for recommendation systems.

A) Content-based filtering makes recommendations based on user preferences
for product features. Collaborative filtering mimics user-to-user
recommendations: it predicts user preferences as a linear, weighted
combination of other user preferences. Both methods have limitations: content-based
filtering requires good descriptions of item features and tends to recommend more of
what the user already likes, while collaborative filtering suffers from the cold-start
problem for new users and new items with no interaction history.
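A toy illustration of the collaborative idea of predicting a preference as a similarity-weighted combination of other users' preferences; the ratings matrix here is hypothetical:

import numpy as np

ratings = np.array([[5, 4, 0, 1],    # user 0 (target user; item 2 is unrated)
                    [4, 5, 0, 2],    # user 1, very similar tastes to user 0
                    [1, 0, 5, 4]])   # user 2, quite different tastes

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's preference for item 2 as a similarity-weighted sum of the other users' ratings.
weights = np.array([cosine(ratings[0], ratings[u]) for u in (1, 2)])
prediction = weights @ ratings[1:, 2] / weights.sum()
print(prediction)    # low, because the only user who liked item 2 is dissimilar to user 0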

9) Justify the statement: Forecasting a time series requires (1) decomposition,
followed by (2) prediction of the decomposed components and (3)
reconstruction of the time series using the predicted components.

A) The three types of time series patterns are trend, seasonality and cycles.
When we decompose a time series into components, we usually combine the
trend and cycle into a single trend-cycle component. Thus, we think of a time series as
comprising three components: a trend-cycle component, a seasonal
component, and a remainder component. Forecasting then proceeds by predicting each
decomposed component separately (for example, repeating the seasonal pattern and
extrapolating the trend-cycle) and recombining the predicted components, additively or
multiplicatively, to reconstruct the forecast of the original series.
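A minimal sketch of this decompose/recombine idea, assuming statsmodels and pandas are available and using a synthetic monthly series:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2018-01", periods=48, freq="MS")
y = pd.Series(
    np.arange(48) * 0.5                                 # trend
    + 10 * np.sin(2 * np.pi * np.arange(48) / 12)       # yearly seasonality
    + np.random.default_rng(3).normal(size=48),         # remainder
    index=idx,
)

parts = seasonal_decompose(y, model="additive", period=12)   # trend-cycle, seasonal, remainder
recombined = parts.trend + parts.seasonal + parts.resid      # reconstruction of the series
print(np.allclose(recombined.dropna(), y[recombined.notna()]))  # True: components sum back to the data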
10) Give a quantitative analysis of the statement: "In the span of a few years,
customers could have instant access to half a million movies on a streaming
service, millions of products on an e-commerce platform, millions of news
articles and user-generated content. Thus recommendation systems are
critically required."

A) Churn should be defined with precision before it is reported, just as a product
interaction is. Once the threshold for churn has been set, the churn metric can
be calculated by counting the number of users from the entire user base that,
as of today, have not interacted with the product in exactly that number of
days. That value of churn can then be stored in the analytics system. As with
retention, this calculation delay creates an effect in which churn is not
available for any trailing days, relative to today, below the defined threshold.
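A minimal sketch of the churn count described above; the table, column names and 30-day threshold are hypothetical:

import pandas as pd

CHURN_THRESHOLD_DAYS = 30     # the precisely defined churn threshold

users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "days_since_last_interaction": [2, 30, 30, 45],
})

# Users that, as of today, have not interacted with the product in exactly the threshold number of days.
churned_today = (users["days_since_last_interaction"] == CHURN_THRESHOLD_DAYS).sum()
print(churned_today)          # this daily value would then be stored in the analytics system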
11) Justify the statement: “Intrusion Detection System (IDS) is a special class
of Anomaly Detection System (ADS) which is critically required in almost all
software systems security model”
A) An Intrusion Detection System (IDS) is a system that monitors network
traffic for suspicious activity and issues alerts when such activity is discovered.
It is a software application that scans a network or a system for harmful
activity or policy breaches. Any malicious venture or violation is normally
reported either to an administrator or collected centrally using a security
information and event management (SIEM) system. A SIEM system integrates
outputs from multiple sources and uses alarm-filtering techniques to
differentiate malicious activity from false alarms. Although intrusion detection
systems monitor networks for potentially malicious activity, they are also
prone to false alarms; hence, organizations need to fine-tune their IDS
products when they first install them. In particular, an anomaly-based IDS builds a
model of "normal" network or system behaviour and raises an alert whenever observed
behaviour deviates significantly from it, which is precisely the anomaly detection
approach; in this sense an IDS is a specialised Anomaly Detection System applied to
security, and it is critically required in almost all software systems' security models.
12) Compare and contrast distance-based and density-based outlier
detection.
A)
Distance-based and density-based definitions of outliers can rank the same points
differently. Under a distance-based definition, a point is an outlier if it lies far from
its neighbours (for example, if too few points fall within a chosen distance of it).
Under a density-based definition, a point is an outlier if its local density is low
compared with the densities of its neighbours. For example, if points p1 and p2 each
have 10 neighbours, some (distance-based) approaches give p1 and p2 the same outlier
ranking, while other (density-based) approaches give p1 a larger outlier ranking than
p2 because p1's local neighbourhood is sparser relative to the points around it.
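A sketch contrasting a distance-based score (distance to the k-th nearest neighbour) with a density-based score (Local Outlier Factor), assuming scikit-learn is available and using toy 2-D data:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0, 0.1, size=(50, 2)),   # tight (dense) cluster
    rng.normal(5, 2.0, size=(50, 2)),   # loose (sparse) cluster
    [[1.0, 1.0]],                       # candidate outlier near the tight cluster
])

# Distance-based view: how far away is the 10th nearest neighbour?
dist_to_kth = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)[0][:, -1]

# Density-based view: Local Outlier Factor compares a point's local density with its neighbours'.
lof = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_

print(dist_to_kth[-1], lof[-1])         # the two definitions can rank the same point differently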

16) Explain the terms: Term Frequency (TF) and Inverse Document Frequency
(IDF). What is the role played by TF-IDF in the formulation of a similarity index
between two paragraphs?

A) TF-IDF stands for Term Frequency–Inverse Document Frequency. It can be defined as
a calculation of how relevant a word in a series or corpus is to a text. The importance
of a word increases proportionally to the number of times the word appears in the
text, but is offset by the frequency of the word in the corpus (data set).

The weight of a term that occurs in a document is simply proportional to the
term frequency:

tf(t, d) = (count of t in d) / (number of words in d)
Inverse Document Frequency: this measures how informative a word is, i.e., how rare it
is across the corpus. The key aim of a search is to locate the relevant records that fit
the query. Since tf considers all terms equally significant, the term frequencies alone
cannot be used to measure the weight of a term in a document. First, find the document
frequency of a term t by counting the number of documents containing the term:

df(t) = N(t), where
df(t) = document frequency of the term t
N(t) = number of documents containing the term t

A common form of the inverse document frequency is then idf(t) = log(N / df(t)), where
N is the total number of documents, and the tf-idf weight of a term in a document is
tf(t, d) × idf(t). Because this weight is large only for terms that are frequent in the
document but rare in the corpus, representing two paragraphs as vectors of tf-idf
weights and taking the cosine similarity of those vectors gives a similarity index
driven by the paragraphs' distinctive words rather than by common words.
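A from-scratch sketch that follows the tf and df definitions above on a toy corpus (the idf form used here is one common variant):

import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "barked"]]
N = len(docs)                                   # total number of documents

def tf(t, d):
    return d.count(t) / len(d)                  # count of t in d / number of words in d

def df(t):
    return sum(1 for d in docs if t in d)       # number of documents containing the term t

def idf(t):
    return math.log(N / df(t))                  # one common idf variant; others add smoothing

print(tf("cat", docs[0]) * idf("cat"))          # tf-idf weight of "cat" in the first document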
19) Outline the importance of a word cloud as a way of interpreting a text
corpus in graphical form.
A) A word cloud is a collection, or cluster, of words depicted in different sizes.
The bigger and bolder a word appears, the more often it is mentioned within
a given text and the more important it is, so a word cloud gives a quick visual
summary of the most prominent terms in a corpus.
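A minimal sketch of generating such a visual, assuming the third-party wordcloud package and matplotlib are installed; the text is a toy example:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "forecast data model model model trend trend season"   # toy corpus
cloud = WordCloud(width=400, height=200, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")   # frequent words are drawn bigger and bolder
plt.axis("off")
plt.show()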
29) Explain how the convolution operation can be used as a method of
feature detection. Hence, explain how CNNs are efficient in the classification of
images / handwriting.

A) Artificial Intelligence has been witnessing monumental growth in bridging
the gap between the capabilities of humans and machines. Researchers and
enthusiasts alike work on numerous aspects of the field to make amazing
things happen. One of many such areas is the domain of Computer Vision.

The agenda for this field is to enable machines to view the world as humans
do, perceive it in a similar manner and even use the knowledge for a multitude
of tasks such as Image & Video recognition, Image Analysis & Classification,
Media Recreation, Recommendation Systems, Natural Language Processing,
etc. The advancements in Computer Vision with Deep Learning have been
constructed and perfected with time, primarily over one particular algorithm:
the Convolutional Neural Network (CNN).

In a CNN, each convolution filter slides across the image and takes dot products with
local patches of pixels, producing a feature map that responds strongly wherever the
pattern encoded by the filter (an edge, a corner, a stroke) appears. Because the same
small set of learned filters is reused across the entire image, the network detects
features regardless of where they occur and needs far fewer parameters than a fully
connected network, which is why CNNs are efficient at classifying images and handwriting.
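As a sketch of convolution acting as a feature (edge) detector, assuming scipy is available; in a CNN the kernel values would be learned rather than hand-crafted as they are here:

import numpy as np
from scipy.signal import convolve2d

image = np.zeros((5, 5))
image[:, 2:] = 1.0                          # toy image: dark left half, bright right half

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])             # hand-crafted vertical-edge filter

feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map)                          # nonzero only in windows containing the vertical edge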

30) Explain the following statement: A corpus is essentially a bag of words,
wherein unique words correspond to the dimensions of the vector space.

A) Machines are better at understanding numbers than actual text passed on as
tokens. This process of converting text to numbers is called vectorization.
Different approaches to vectorization exist; moving from the most primitive to the
most advanced, the representation models are listed below (a one-hot sketch follows
the list):

1. One-hot representations
2. Distributed representations
3. Singular Value Decomposition
4. Continuous bag-of-words model
5. Skip-gram model
6. GloVe representations
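A minimal sketch of approach 1 (one-hot representations) on a hypothetical three-word vocabulary, where each unique word gets its own dimension of the vector space:

import numpy as np

vocab = ["cat", "dog", "sat"]                 # the unique words define the dimensions
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0                      # a single 1 in the word's own dimension
    return v

print(one_hot("dog"))                         # [0. 1. 0.]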

31) Detail the significance of the "Trend" component of a time series. Compare
the trend component with the seasonality component.

A) The trend component is a long-term increase or decrease in the data, which
need not be linear; sometimes the trend changes direction as time increases.
The seasonal component, by contrast, is a pattern that repeats with a fixed and known
period (for example, the day of the week or the month of the year), so its fluctuations
recur at regular intervals, whereas the trend-cycle has no fixed period and captures
the longer-term level of the series.

32) Using a suitable example, explain how supervised learning techniques can
be applied for content-based filtering in recommendation engines.

A) The model should recommend items relevant to this user. To do so, you
must first pick a similarity metric (for example, the dot product). Then, you must
set up the system to score each candidate item according to this similarity
metric, using the dot product as a similarity measure.

Since ⟨x, y⟩ = x1y1 + x2y2 + … + xdyd, a feature appearing in both x and y contributes
a 1 to the sum. In other words, ⟨x, y⟩ is the number of features that are active in both
vectors simultaneously. A high dot product then indicates more common
features, and thus a higher similarity.
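A minimal sketch of this scoring step, with hypothetical binary feature vectors for one user and three candidate items:

import numpy as np

user = np.array([1, 0, 1, 1, 0])              # features the user has shown interest in
items = np.array([[1, 0, 1, 0, 0],            # candidate item A
                  [0, 1, 0, 0, 1],            # candidate item B
                  [1, 0, 1, 1, 0]])           # candidate item C

scores = items @ user                         # <x, y> = number of features active in both vectors
print(scores)                                 # item C shares the most features, so it scores highest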
Rough notes:

Autoregressive models are part of ARIMA; FMCG and drug items are the data sets they are applied to.
6) https://towardsdatascience.com/machine-learning-part-19-time-series-and-autoregressive-integrated-moving-average-model-arima-c1005347b0d7
9) Extracting the components based on trend and seasonality; without decomposition it takes more time.
13) by ju link with formulas
18) You can do a transformation, but ARIMA will work without any transformation.
24) analysis is required for
