Assignment:
Student Name:
University:
Course Name:
Instructor:
Date:
ASSIGNMENT 2
1. Clustering is a technique for identifying objects that share some similarity and
assigning them to the same group. In search engines, clustering helps users find what
they are searching for by grouping the search results by topic or category. The search
engine computes similarities between documents and clusters them based on the semantics
of the retrieved documents. To achieve this, when a query is submitted, the search
engine pre-processes the documents in the search results and extracts features from
each document. It then applies a clustering algorithm to the resulting similarity matrix
to obtain the clusters. For example, if the user enters the keyword “paintbrush”, the
search engine should display results that contain “paintbrush” as well as those that
contain related keywords such as “paint”, “brush”, and “canvas”.
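The pipeline described above (extract features, compute similarities, group the results) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how any real search engine works: the feature vectors are plain word counts, the similarity measure is cosine similarity, and the function names, the greedy grouping strategy, the 0.3 threshold, and the sample results are all hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_results(docs: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Greedy single-pass grouping (an assumed, simplified strategy).

    Each document joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster. Returns lists of indices.
    """
    # Feature extraction: lowercase word counts per document.
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical search results for the query "paintbrush".
results = [
    "paint brush set for canvas painting",
    "canvas and paint brush for oil painting",
    "paintbrush image editor download",
    "download free image editor software",
]
clusters = cluster_results(results)
print(clusters)
```

With these sample results, the two art-supply pages end up in one group and the two image-editor pages in another, which is the kind of topical grouping the paragraph describes.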
Classification is a technique for assigning each data element to a group or class based
on a set of predefined rules. Classification makes use of different mechanisms such as
decision trees, linear programming, statistics, and neural networks. In a search engine
setting, classification can help users find given information more quickly by organizing
it in a meaningful manner. For example, the search engine can display a list of
documents that are related to a given search keyword.
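To make the idea of classifying by predefined rules concrete, here is a small sketch. The categories, the keyword rules, and the function name are assumptions made up for illustration; a real system would more likely learn its rules from labeled data using one of the mechanisms listed above.

```python
# Hypothetical predefined rules: each class is signalled by a set of keywords.
RULES = {
    "shopping": {"buy", "price", "sale"},
    "tutorial": {"how", "guide", "learn"},
}


def classify(document: str, default: str = "other") -> str:
    """Assign the class whose keywords appear most often in the document.

    Falls back to `default` when no rule matches at all.
    """
    words = set(document.lower().split())
    best_label, best_hits = default, 0
    for label, keywords in RULES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


# Hypothetical documents returned for a search.
documents = [
    "buy paintbrush at sale price",
    "how to learn oil painting",
    "history of the paintbrush",
]
labels = [classify(d) for d in documents]
print(labels)
```

Unlike clustering, the groups here are fixed in advance; the classifier only decides which existing class each document belongs to.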
Anomaly detection is a technique for detecting surprising behavior hidden within the
data. In search engines, anomaly detection is used to filter out search engine spam,
i.e. on receiving the search query, the search engine assesses the relevance of each
webpage’s content to the keywords provided and presents the user with the most relevant
information. Any webpage that is
not related to the keyword provided is treated as an anomaly and is not displayed to the
user.
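The filtering step described above can be sketched as follows. This is a deliberately simple illustration, assuming that relevance is just the fraction of query keywords found on the page; the function names, the 0.5 cut-off, and the sample pages are hypothetical, and real spam filtering relies on far richer signals.

```python
def relevance(query: str, page_text: str) -> float:
    """Fraction of the query's keywords that also occur in the page text."""
    query_terms = set(query.lower().split())
    page_terms = set(page_text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & page_terms) / len(query_terms)


def filter_anomalies(query: str, pages: list, min_relevance: float = 0.5) -> list:
    """Keep only pages whose relevance reaches the (assumed) cut-off.

    Pages below the cut-off are treated as anomalies and dropped.
    """
    return [p for p in pages if relevance(query, p) >= min_relevance]


# Hypothetical candidate pages for the query "car engine".
pages = [
    "car engine repair manual",
    "cheap pills online casino",
    "diesel engine for a small car",
]
shown = filter_anomalies("car engine", pages)
print(shown)
```

Here the second page shares no keyword with the query, so it is treated as an anomaly and is not shown to the user.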
Association rule mining is a technique for discovering association rules that show
which attribute-value conditions occur together within a given dataset. Based on the
keyword provided by the user, the search engine can therefore provide more details. For
example, if the user searches for “car engine”, the search engine can provide more information
2. Color offers numerous benefits in visualization. For example, the designer of a
visualization can use different colors to differentiate the variables being shown. The
presenter may also use color to emphasize a given idea. This can be achieved by using
eye-catching colors such as red or by using bold text.
Color may also present several drawbacks when used for visualization. For example,
excessive use of color to represent the variables could result in a noisy and
incomprehensible visualization. This could lead to confusion and misinterpretation of
the data by the audience. Moreover, misuse of color, such as inconsistent or excessive
use, could be devastating to the audience and may result in communicating the wrong
message.
3. In this scenario, it is possible for a dataset to consist only of anomalous objects.
This is because, before deciding which anomaly detection algorithm to deploy, the data
scientist should first become familiar with the data and understand its nature, i.e. the
algorithm used to detect the anomalies should consider both the behavioral and
contextual attributes of the data. For example, the data scientist should understand the spatial data attributes
4. An instance in which considering only the presence of non-zero values might give a
more accurate view of the objects than considering the actual magnitudes of the values
is market basket analysis. In the market basket instance, by considering only the
presence of non-zero values, one can draw conclusions about the associations among a
small list of items instead of evaluating the entire store inventory list. In clustering analysis, considering the presence of non-zero values may
5. The shape of the cluster which consists of all documents with the cosine similarity to a
centroid would be almost circular but not completely circular. The data points will be