
Web Intelligence

&
Big Data

By:
Chanveer Singh
Harmanaq Singh
Special Thanks To
Google
Wikipedia
Pirates of the Caribbean: Salazar’s Revenge
Pav Bhaji
Dunkin’ Donuts
Coca Cola
Contents
Web-Scale AI
Big Data
Web Intelligence
Big Data Analytics
    Introduction:
    Tools & Technologies:
    Uses & Challenges
Web Intelligence & Big Data
Crawling, Indexing, Ranking
    Crawling:
    Indexing:
    Ranking:
Web Indexing
PageRank
Enterprise Search
    Introduction:
    Components of an Enterprise Search:
Structured Data
Locality Sensitive Hashing
Information & News
Online Advertising
AdSense
    Introduction:
    Restrictions:
    Locations:
TF-IDF
    Introduction:
    How to Compute:
Analysing Sentiment and Intent
Evolution of Database
    Evolution of Database Management System:
    Flat File Database:
    Relational Database:
    NoSQL Database:
Big Data Technology & Trends
MapReduce
    Introduction:
    Algorithm:
    Inputs & Outputs:
    Uses:
    Efficiency:
BigTable vs. HBase
Classification
Clustering
    Points to Remember
    Applications of Cluster Analysis
Data Mining
    Data Mining Process:
Information Extraction
Reasoning
    Introduction:
    Types of reasoning:
    Comparison of deductive and inductive reasoning
    Abductive Reasoning:
    Analogical Reasoning:
    Common-Sense Reasoning:
    Non-Monotonic Reasoning:
    Inference:
Dealing with Uncertainty
Prediction (Forecasting)
    Introduction:
    Types of Forecasting:
    Prediction & Classification Issues:
    Comparison of Classification & Prediction Methods
Neural Models
Deep Learning
Regression
    Simple Regression
    Multiple Regression Analysis
Feature Selection
UNIT 1

Web-Scale AI

Information and Communication Technology lies at the basis of innumerable innovations in our society and has provided remarkable new services (like social media) and new products
(like smart phones and tablets). Traditionally, applications of Artificial Intelligence used to be
limited to micro worlds and toy systems. The horizon has now been extended to widely
distributed mass applications of AI techniques. These developments are supported by a general
availability of computation power and connectivity in the form of the web, social media, big
data, wireless, and mobile platforms with input and output in many modalities.

Artificial Intelligence techniques can be applied to the problems encountered when interacting
on the Web or processing data derived from the Web. Examples of problems addressed by
web-scale AI are recommendation systems, click stream analysis, crowd sourcing and demand
aggregation, e-therapy, e-commerce, and avatars with speech synthesis and recognition.
Technical issues include, for example, the MapReduce architecture for massive data processing
and emerging technologies like the Semantic Web.

The past decade has witnessed the successful application of many AI techniques at
'web scale', on what are popularly referred to as big data platforms based on the
MapReduce parallel computing paradigm and associated technologies such as distributed file
systems, noSQL databases and stream computing engines. Online advertising, machine
translation, natural language understanding, sentiment mining, personalized medicine, and
national security are some examples of such AI-based web intelligence applications that are
already in the public eye. Others, though less apparent, impact the operations of large
enterprises from sales and marketing to manufacturing and supply chains.

Big Data

Big data is a term for data sets that are so large or complex that traditional data processing
application software is inadequate to deal with them. Challenges include capture, storage,
analysis, data curation, search, sharing, transfer, visualization, querying, updating and
information privacy. The term "big data" often refers simply to the use of predictive analytics,
user behavior analytics, or certain other advanced data analytics methods that extract value
from data, and seldom to a particular size of data set. "There is little doubt that the quantities
of data now available are indeed large, but that’s not the most relevant characteristic of this
new data ecosystem." Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, and combat crime and so on." Scientists, business executives, practitioners
of medicine, advertising and governments alike regularly meet difficulties with large data-sets
in areas including Internet search, fintech, urban informatics, and business informatics.
Scientists encounter limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, biology and environmental research.

Data sets grow rapidly - in part because they are increasingly gathered by cheap and
numerous information-sensing Internet of things devices such as mobile devices, aerial
(remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID)
readers and wireless sensor networks. The world's technological per-capita capacity to store
information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5
exabytes (2.5×10¹⁸ bytes) of data are generated. One question for large enterprises is determining
who should own big-data initiatives that affect the entire organization.

Relational database management systems and desktop statistics and visualization packages
often have difficulty handling big data. The work may require "massively parallel software
running on tens, hundreds, or even thousands of servers". What counts as "big data" varies
depending on the capabilities of the users and their tools, and expanding capabilities make
big data a moving target. "For some organizations, facing hundreds of gigabytes of data for
the first time may trigger a need to reconsider data management options. For others, it may
take tens or hundreds of terabytes before data size becomes a significant consideration."

Big data can be described by the following characteristics:

• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

• Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

• Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

• Variability: Inconsistency of the data set can hamper processes to handle and manage it.

• Veracity: The quality of captured data can vary greatly, affecting accurate analysis.

Web Intelligence

With the rapid growth of the Internet and the World Wide Web (WWW or the Web), we have
now entered a new information age. The Web provides a rich medium for communication,
which goes far beyond the traditional communication media, such as radio, telephone, and
television. The Web has significant impacts on both academic research and everyday life. It
revolutionizes the way in which information is gathered, stored, processed, presented,
shared, and used. It offers great opportunities and challenges for many areas, such as
business, commerce, marketing, finance, publishing, education, research, and development.

In spite of current technological advances, it is still rather unclear what the next
paradigm shift in the WWW will be. Web Intelligence (WI), since its conception in 2000, has become
the central field for exploring and developing answers to this question.

Broadly speaking, Web Intelligence (WI) is a new direction for scientific research and
development that explores the fundamental roles as well as practical impacts of Artificial
Intelligence (AI) and advanced Information Technology (IT) on the next generation of Web-
empowered products, systems, services, and activities. It is the key and the most urgent
research field of IT in the era of Web and agent intelligence.

Big Data Analytics

Introduction:
Big data analytics is the process of examining large and varied data sets -- i.e., big data -- to
uncover hidden patterns, unknown correlations, market trends, customer preferences and
other useful information that can help organizations make more-informed business decisions.

Tools & Technologies:


Unstructured and semi-structured data types typically don't fit well in traditional data
warehouses that are based on relational databases oriented to structured data sets.
Furthermore, data warehouses may not be able to handle the processing demands posed by
sets of big data that need to be updated frequently -- or even continually, as in the case of
real-time data on stock trading, the online activities of website visitors or the performance of
mobile applications.

As a result, many organizations that collect, process and analyse big data turn to
NoSQL databases as well as Hadoop and its companion tools, including:

• MapReduce: A software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

• Spark: An open-source parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.

• HBase: A column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).

• Hive: An open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.

Uses & Challenges


Big data analytics applications often include data from both internal systems and external
sources, such as weather data or demographic data on consumers compiled by third-party
information services providers. In addition, streaming analytics applications are becoming
common in big data environments, as users look to do real-time analytics on data fed into
Hadoop systems through Spark's Spark Streaming module or other open source stream
processing engines, such as Flink and Storm.

Web Intelligence & Big Data

AI techniques at web scale are largely predictive: the best examples are online advertising, predicting intent and interest, gauging consumer sentiment and predicting behaviour, detecting adverse events and predicting their impact, categorising and recognising places and faces, and personalised genomic medicine, where DNA samples shared on the web are used to investigate genetic diseases and ancestry so that medication can be better targeted.

The elements used to predict the future with AI and big data include:

• Look (Search): finding stuff.
• Listen (Machine Learning): figuring out what is important and what is not, i.e. classifying and clustering.
• Learn (Information Extraction): extracting facts from data.
• Connect (Reasoning): putting different facts together to reach a conclusion.
• Predict (Data Mining): mining rules from data.
• Correct (Optimization): figuring out the right thing to do.

Crawling, Indexing, Ranking
Crawling:
Crawling is the process by which search engines discover updated content on the web, such
as new sites or pages, changes to existing sites, and dead links.

To do this, a search engine uses a program that can be referred to as a 'crawler', ‘bot’ or
‘spider’ (each search engine has its own type) which follows an algorithmic process to
determine which sites to crawl and how often.

As a search engine's crawler moves through your site it will also detect and record any links it
finds on these pages and add them to a list that will be crawled later. This is how new content
is discovered.
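As a rough sketch of this crawl loop, the following Python fragment keeps a frontier of URLs, fetches pages, and queues any links it detects for a later visit. It is purely illustrative: the use of the requests and BeautifulSoup libraries, the page limit, and the absence of politeness rules (robots.txt handling, crawl delays) are all simplifying assumptions, not a description of any real search engine's crawler.

    # Minimal breadth-first crawl loop (illustrative sketch, not a production crawler).
    # Assumes the third-party 'requests' and 'beautifulsoup4' packages are installed.
    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_url, max_pages=10):
        frontier = deque([seed_url])   # URLs waiting to be visited
        seen = {seed_url}              # avoid re-crawling the same URL
        pages = {}                     # url -> raw HTML, handed to the indexer later

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue               # dead or unreachable link: skip it
            pages[url] = html

            # Detect links on the page and queue them to be crawled later.
            for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages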

Indexing:
Once a search engine processes each of the pages it crawls, it compiles a massive index of all
the words it sees and their location on each page. It is essentially a database of billions of
web pages.

This extracted content is then stored, with the information then organised and interpreted
by the search engine’s algorithm to measure its importance compared to similar pages.

Servers based all around the world allow users to access these pages almost
instantaneously. Storing and sorting this information requires significant space and both
Microsoft and Google have over a million servers each.
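A minimal sketch of such an index, assuming plain-text pages and simple whitespace tokenization, is an inverted index that maps each word to the pages and positions where it occurs (the two example pages are invented):

    # Build a tiny inverted index: word -> {page_id: [positions]}.
    from collections import defaultdict

    def build_index(pages):
        index = defaultdict(lambda: defaultdict(list))
        for page_id, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word][page_id].append(position)
        return index

    pages = {
        "page1": "big data needs big storage",
        "page2": "web intelligence uses big data",
    }
    index = build_index(pages)
    print(dict(index["big"]))   # {'page1': [0, 3], 'page2': [3]}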

Ranking:
Once a keyword is entered into a search box, search engines will check for pages within their
index that are the closest match; a score will be assigned to these pages based on an algorithm
consisting of hundreds of different ranking signals.

These pages (or images & videos) will then be displayed to the user in order of score.

So in order for your site to rank well in search results pages, it's important to make sure search
engines can crawl and index your site correctly - otherwise they will be unable to
appropriately rank your website's content in search results.

Web Indexing

Web indexing (or Internet indexing) refers to various methods for indexing the contents of a
website or of the Internet as a whole. Individual websites or intranets may use a back-of-the-
book index, while search engines usually use keywords and metadata to provide a more useful
vocabulary for Internet or onsite searching. With the increase in the number of periodicals
that have articles online, web indexing is also becoming important for periodical websites.

Back-of-the-book-style web indexes may be called "web site A-Z indexes". The implication
with "A-Z" is that there is an alphabetical browse view or interface. This interface differs from
that of a browse through layers of hierarchical categories (also known as a taxonomy) which
are not necessarily alphabetical, but are also found on some web sites. Although an A-Z index
could be used to index multiple sites, rather than the multiple pages of a single site, this is
unusual.

Metadata web indexing involves assigning keywords or phrases to web pages or web sites
within a metadata tag (or "meta-tag") field, so that the web page or web site can be retrieved
with a search engine that is customized to search the keywords field. This may or may not
involve using keywords restricted to a controlled vocabulary list. This method is commonly
used by search engine indexing.

PageRank

PageRank is a link analysis algorithm and it assigns a numerical weighting to each element of
a hyperlinked set of documents, such as the World Wide Web, with the purpose of
"measuring" its relative importance within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and references. The numerical weight that it
assigns to any given element E is referred to as the PageRank of E. Rank can contribute to the
importance of an entity.

A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide
Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com
or usa.gov. The rank value indicates the importance of a particular page. A hyperlink to a page counts
as a vote of support. The PageRank of a page is defined recursively and depends on the number and
PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages
with high PageRank receives a high rank itself.
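The recursive definition above can be approximated by simple power iteration. The sketch below is illustrative only: the three-page link graph, the damping factor of 0.85, and the fixed iteration count are assumptions for the example, not part of the algorithm description in the text.

    # Power-iteration PageRank over a small hand-made link graph (illustrative sketch).
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}           # start from uniform ranks
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                if not outgoing:                              # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outgoing:                   # each link acts as a "vote of support"
                        new_rank[target] += damping * rank[page] / len(outgoing)
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))   # C is linked to by both A and B, so it ends up with the highest rank

Each iteration redistributes every page's current rank over its outgoing links, so a page that is linked to by many highly ranked pages accumulates a high rank itself.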

Enterprise Search
Introduction:
Enterprise search is the practice of making content from multiple enterprise-type sources,
such as databases and intranets, searchable to a defined audience.

"Enterprise search" is used to describe the software of search information within an


enterprise (though the search function and its results may still be public). Enterprise search
can be contrasted with web search, which applies search technology to documents on the
open web, and desktop search, which applies search technology to the content on a single
computer.

Enterprise search systems index data and documents from a variety of sources such as: file
systems, intranets, document management systems, e-mail, and databases. Many enterprise
search systems integrate structured and unstructured data in their collections. Enterprise
search systems also use access controls to enforce a security policy on their users.

Enterprise search can be seen as a type of vertical search of an enterprise.

Components of an Enterprise Search:
In an enterprise search system, content goes through various phases from source repository
to search results:

• Content Awareness:
Content awareness (or "content collection") is usually either a push or pull model. In
the push model, a source system is integrated with the search engine in such a way
that it connects to it and pushes new content directly to its APIs. This model is used
when real-time indexing is important. In the pull model, the software gathers content
from sources using a connector such as a web crawler or a database connector. The
connector typically polls the source with certain intervals to look for new, updated or
deleted content.

• Content Processing & Analysis:
Content from different sources may have many different formats or document types,
such as XML, HTML, Office document formats or plain text. The content processing
phase processes the incoming documents to plain text using document filters. It is also
often necessary to normalize content in various ways to improve recall or precision.
These may include stemming, lemmatization, synonym expansion, entity extraction,
part of speech tagging.
As part of processing and analysis, tokenization is applied to split the content into
tokens, which are the basic matching units. It is also common to normalize tokens to lower
case to provide case-insensitive search, as well as to normalize accents to provide
better recall (a minimal sketch of these steps follows this list).

• Indexing:
The resulting text is stored in an index, which is optimized for quick lookups without
storing the full text of the document. The index may contain the dictionary of all
unique words in the corpus as well as information about ranking and term frequency.

• Query Processing:
Using a web page, the user issues a query to the system. The query consists of any
terms the user enters as well as navigational actions such as faceting and paging
information.

• Matching:
The processed query is then compared to the stored index, and the search system
returns results (or "hits") referencing source documents that match. Some systems
are able to present the document as it was indexed.
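As referenced under Content Processing & Analysis above, the following sketch shows one plausible normalization pipeline in Python. The exact steps (accent folding, lower-casing, splitting on non-alphanumeric characters) and the regular expression are illustrative assumptions; real systems add stemming, lemmatization, and language-specific rules.

    # Tokenize and normalize text before indexing (illustrative sketch).
    import re
    import unicodedata

    def normalize(text):
        # Fold accents: "café" -> "cafe", improving recall for accent-insensitive search.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Lower-case for case-insensitive matching, then split on non-alphanumeric characters.
        tokens = re.split(r"[^a-z0-9]+", text.lower())
        return [t for t in tokens if t]

    print(normalize("Café-Management REPORT, 2017"))
    # ['cafe', 'management', 'report', '2017']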

Structured Data

Structured data refers to kinds of data with a high level of organization, such as information
in a relational database. When information is highly structured and predictable, search
engines can more easily organize and display it in creative ways. Structured data markup is a
text-based organization of data that is included in a file and served from the web. It typically
uses the schema.org vocabulary—an open community effort to promote standard structured
data in a variety of online applications.

Structured data markup describes things on the web, along with their properties. For example,
if your site has recipes, you could use markup to describe properties for each recipe, such as
the summary, the URL to a photo for the dish, and its overall rating from users.

When you provide structured data markup for your online content, you make that content
eligible to appear in two categories of Google Search features:

• Rich results—Structured data for things like recipes, articles, and videos can appear
in Rich Cards, as either a single element or a list of items. Other kinds of structured
data can enhance the appearance of your site in Search, such as with Breadcrumbs,
or a Sitelinks Search Box.

• Knowledge Graph cards—If you're the authority for certain content, Google can treat
the structured data on your site as factual and import it into the Knowledge Graph,
where it can power prominent answers in Search and across Google properties.
Knowledge Graph cards appear for authoritative data about organizations and events.
Movie reviews and movie/music play actions, while based on ranking, can also appear in
Knowledge Graph cards once they are reconciled to Knowledge Graph entities.

Locality Sensitive Hashing
LSH is an algorithm used for solving the approximate and exact near neighbor search in high
dimensional spaces. The main idea of the LSH is to use a special family of hash functions,
called LSH functions, to hash points into buckets, such that the probability of collision is much
higher for the objects which are close to each other in their high dimensional space than for
those which are far apart. A collision occurs when two points are in the same bucket. Then,
query points can identify their near neighbors by using the hashed query points to retrieve
the elements stored in the same buckets.

For a domain S of a set of points and distance measure D, the LSH family is defined as:

DEFINITION: A family H = {h: S → U} is called (r1, r2, p1, p2)-sensitive for D if, for any points v, q ∈ S:

• if v ∈ B(q, r1), then PrH[h(q) = h(v)] ≥ p1,

• if v ∉ B(q, r2), then PrH[h(q) = h(v)] ≤ p2,

where r1, r2, p1, p2 satisfy p1 > p2 and r1 < r2.

LSH is a dimension reduction technique that projects objects in a high-dimensional space to a lower-dimensional space while still preserving the relative distances among objects. Different LSH families can be used for different distance functions.
Different LSH families can be used for different distance functions.

LSH does not need to search the entire dataset for a query. It shrinks the search scope to a group of records similar to the query and then refines within that group. Given n records in a dataset, traditional methods based on tree structures need O(log n) time per query, and linear search needs O(n) time. LSH can locate similar records in O(L) time, where L is a constant, which makes LSH more efficient on a massive dataset with a large number of dimensions and records.

The drawback of LSH is its large memory consumption, because it requires a large number of hash tables to cover most near neighbors. Since each hash table has as many entries as there are data records in the database, the size of each hash table is determined by the size of the database. When the space requirement for the hash tables exceeds the main memory size, disk I/O may be required to search additional hash tables, which causes query delay.
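One concrete LSH family, given here only as an illustration, is random-hyperplane hashing for cosine similarity: each hash bit records which side of a random hyperplane a vector falls on, so vectors pointing in similar directions tend to land in the same bucket. The dimensionality, the number of tables, and the bits per table below are arbitrary assumptions, not tuned values.

    # Random-hyperplane LSH for cosine similarity (illustrative sketch).
    import random
    from collections import defaultdict

    random.seed(0)
    DIM, TABLES, BITS = 8, 4, 6          # assumed parameters, not tuned

    # One set of BITS random hyperplanes per hash table.
    planes = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]
              for _ in range(TABLES)]

    def signature(vector, table):
        # Each bit records on which side of a random hyperplane the vector falls.
        bits = []
        for plane in planes[table]:
            dot = sum(p * v for p, v in zip(plane, vector))
            bits.append("1" if dot >= 0 else "0")
        return "".join(bits)

    def index(vectors):
        tables = [defaultdict(list) for _ in range(TABLES)]
        for vec_id, vec in vectors.items():
            for t in range(TABLES):
                tables[t][signature(vec, t)].append(vec_id)
        return tables

    def candidates(query, tables):
        # Union of the buckets the query falls into: the shrunken search scope
        # that is then refined with exact distance computations.
        found = set()
        for t in range(TABLES):
            found.update(tables[t][signature(query, t)])
        return found

    vectors = {"a": [1.0] * 8, "b": [1.0] * 7 + [0.9], "c": [-1.0] * 8}
    tables = index(vectors)
    print(candidates([1.0] * 8, tables))  # 'a' always collides with itself; 'b' very likely shares buckets; 'c' points the opposite way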

UNIT 2

Information & News

This is the age of information and we are bombarded with tons of information every day.
News, on the other hand, is specific information that is a communication in the form of print
or electronic media. We all know about newspapers and read them every morning or
whenever we get time. They are a collection of facts and information about recent
happenings though newspapers also have sections where precise information about various
subjects is also presented to the readers. There are many who find the dichotomy between
news and information confusing as they do not find any differences.

The word news is considered to have evolved from the word new. So any information about
an incident, event, occasion, mishap, disaster, or even financial results of a company are
considered to be pieces of news. You must have seen captions of breaking news running at
the bottom of news channels on TV where they carry information about any event that is
taking place at the same instant that another program is being beamed on your screen. Many
times, broadcast of regular programs is stopped and breaking news told to the audiences if it
is considered to be very important for the viewers.

Online Advertising

Online advertising, also called online marketing or Internet advertising or web advertising,
is a form of marketing and advertising which uses the Internet to deliver promotional
marketing messages to consumers. Consumers view online advertising as an unwanted
distraction with few benefits and have increasingly turned to ad blocking for a variety of
reasons.

It includes email marketing, search engine marketing (SEM), social media marketing, many
types of display advertising (including web banner advertising), and mobile advertising. Like
other advertising media, online advertising frequently involves both a publisher, who
integrates advertisements into its online content, and an advertiser, who provides the
advertisements to be displayed on the publisher's content. Other potential participants
include advertising agencies who help generate and place the ad copy, an ad server which
technologically delivers the ad and tracks statistics, and advertising affiliates who do
independent promotional work for the advertiser.

AdSense
Introduction:

AdSense is one of many ways to earn money from the Web. AdSense for content is a system
of Google contextual ads that you can place on your blog, search engine, or Web site. Google,
in return, will give you a portion of the revenue generated from these ads. The rate you are
paid varies, depending on the keywords on your Web site used to generate the ads.

Text ads come from Google AdWords, which is Google's advertising program.

Advertisers bid in a silent auction to advertise for each keyword, and then content providers
get paid for the ads they place in their content. Neither advertisers nor content providers are
in complete control over which ads go where. That's one of the reasons why Google has
restrictions on both content providers and advertisers.

Restrictions:

Google restricts AdSense to non-pornographic Web sites. In addition, you may not use ads
that may be confused with Google ads on the same page.

If you use AdSense ads on search results, the search results must use the Google search
engine.

You may not click on your own ads or encourage others to click on your ads with phrases like
"Click on my ads." You must also avoid mechanical or other methods of artificially inflating
your page views or clicks. This is considered to be click fraud.

Google also restricts you from disclosing AdSense details, such as how much you were paid
for a keyword.

Google has additional restrictions and may change its requirements at any time, so be sure to
check its policies regularly.

Locations:

AdSense is divided into two basic locations:

• AdSense for Content

• AdSense for Search

AdSense for Content covers ads placed in blogs and Web sites. You can also place ads in the
RSS or Atom feed from your blog.

AdSense for Search covers ads placed within search engine results. Companies such as Blingo
can create a custom search engine using Google search results.

TF-IDF
Introduction:

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight
often used in information retrieval and text mining. This weight is a statistical measure used
to evaluate how important a word is to a document in a collection or corpus. The importance
increases proportionally to the number of times a word appears in the document but is offset
by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often
used by search engines as a central tool in scoring and ranking a document's relevance given
a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term;
many more sophisticated ranking functions are variants of this simple model.

Tf-idf can be successfully used for stop-words filtering in various subject fields including text
summarization and classification.

How to Compute:

Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term
Frequency (TF), aka. the number of times a word appears in a document, divided by the total
number of words in that document; the second term is the Inverse Document Frequency
(IDF), computed as the logarithm of the number of the documents in the corpus divided by
the number of documents where the specific term appears.

• TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

• IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
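Putting the two formulas together on a toy corpus (the three documents below are invented purely for illustration):

    # Compute tf-idf weights for a toy corpus using the two formulas above (illustrative sketch).
    import math

    docs = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "cats and dogs".split(),
    ]

    def tf(term, doc):
        return doc.count(term) / len(doc)

    def idf(term, corpus):
        containing = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / containing)   # natural log, as in the IDF formula

    def tf_idf(term, doc, corpus):
        return tf(term, doc) * idf(term, corpus)

    print(tf_idf("cat", docs[0], docs))   # rare term: higher weight (about 0.18)
    print(tf_idf("the", docs[0], docs))   # appears in 2 of 3 docs: lower weight (about 0.14)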

Analysing Sentiment and Intent

The term sentiment analysis can be used to refer to many different, but related, problems.
Most commonly, it is used to refer to the task of automatically determining the valence or
polarity of a piece of text, whether it is positive, negative, or neutral. However, more
generally, it refers to determining one’s attitude towards a particular target or topic. Here,
attitude can mean an evaluative judgment, such as positive or negative, or an emotional or
affectual attitude such as frustration, joy, anger, sadness, excitement, and so on. Note that
some people consider feelings to be the general category that includes attitude, emotions,
moods, and other affectual states. We use ‘sentiment analysis’ to refer to the task of
automatically determining feelings from text, in other words, automatically determining
valence, emotions, and other affectual states from text.
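One very simple way to estimate valence is a lexicon lookup: count the words a text shares with positive and negative word lists. The tiny lists below are invented for illustration only; practical systems use large lexicons or trained classifiers.

    # Toy lexicon-based polarity scorer (illustrative sketch only).
    POSITIVE = {"good", "great", "joy", "excellent", "love"}
    NEGATIVE = {"bad", "terrible", "anger", "sadness", "hate"}

    def polarity(text):
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(polarity("great phone and I love the battery"))    # positive
    print(polarity("terrible service and bad support"))      # negative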

Automatic detection and analysis of affectual categories in text has wide-ranging applications.
Below we list some key directions of ongoing work:

• Public Health: Automatic methods for detecting emotions are useful in detecting depression, identifying cases of cyber-bullying, predicting health attributes at community level, and tracking well-being. There is also interest in developing robotic assistants and physio-therapists for the elderly, the disabled, and the sick—robots that are sensitive to the emotional state of the patient.

• Politics: There is tremendous interest in tracking public sentiment, especially in social
media, towards politicians, electoral issues, as well as national and international
events. Tweet streams have been shown to help identify current public opinion
towards the candidates in an election (nowcasting). Some research has also shown the
predictive power of analyzing electoral tweets to determine the number of votes a
candidate will get (forecasting).

• Brand management, customer relationship management, and the stock market: Sentiment analysis of blogs, tweets, and Facebook posts is already widely used to shape brand image, track customer response, and develop automatic dialogue systems for handling customer queries and complaints.

Evolution of Database
Evolution of Database Management System:
Following is a tree which will help you map all types of popular database management systems
in a timeline:

The timeline spans the 1980s to the present and is not exhaustive of all forms of data
management systems. However, it covers most of the popular data management systems.

Flat File Database:


This is probably the easiest to understand but is at present rarely used. You can think of this as
a single huge table. Such datasets were used back in the 1990s, when data was only
used to retrieve information in case of concerns. Very primitive analytics were possible on
these databases.

Relational Database:
Soon people started realizing that such tables would be almost impossible to maintain in the long
run. The flat file brought in a lot of redundant data at every entry. For instance, if we want
to make a single data set with all products purchased at a grocery store, together with all information
about the customer and product, we will have every single row consisting of all customer and
product information. Wherever we have a repeat product or customer, we have repeat data.
People therefore thought of storing this as different tables and defining a hierarchy to access all the data,
which came to be called a hierarchical database.

Hierarchical Database is very similar to your folder structure on the laptop. Every folder can
contain sub-folder and each sub-folder can still hold more sub-folders. Finally in some folders
we will store files. However, every child node (sub-folder) will have a single parent (folder or
sub-folder). Finally, we can create a hierarchy of the dataset:

Although hierarchical databases can serve many purposes, their applications are restricted to
simple one-to-one mapping structures. For example, it will work well if you are using this data
structure to show the job-profile hierarchy in a corporation. But the structure will fail if the reporting
becomes slightly more complicated and a single employee reports to many managers. Hence,
people thought of database structures which can have different kinds of relations. This type
of structure should allow one-to-many mapping. Such a system came to be known as a relational
database management system (RDBMS).

Following is an example RDBMS data structure:

As you can see from the above diagram, there are multiple keys which can help us merge different
data sets in this database. This kind of data storage optimizes the disk space occupied without
compromising on data details. This is the database which is generally used by the analytics
industry. However, when the data loses its structure, such a database will be of no help.

NoSQL Database:
NoSQL is often known as “Not Only SQL”. When people realized that unstructured text carries
tonnes of information which they were unable to mine using RDBMS, they started exploring
ways to store such datasets. Anything which is not an RDBMS today is loosely known as NoSQL.
After social networks gained importance in the market, such databases became common in
the industry. Following is an example where it becomes very difficult to store the data in an
RDBMS:

Facebook stores terabytes of additional data every day. Let’s try to imagine the structure in
which this data can be stored:

In the above diagram, boxes of the same color fall into the same category of object. For example, the user,
the user’s friends, the people who liked, and the authors of comments are all FB users. Now, if we try to store the
entire data in an RDBMS, then for executing a single query, which may just be the response to opening the
home page, we need to join multiple tables with trillions of rows to build a combined
table and then run algorithms to find the most relevant information for the user. This is clearly
not a job of seconds. Hence we need to move from a tabular understanding of
data to a more flow-based (graph) data structure. This is what brought about NoSQL structures.
Various types of NoSQL databases can then be compared to understand which fits a given use case.

Big Data Technology & Trends


1. Open Source:
Open source applications like Apache Hadoop, Spark and others have come to
dominate the big data space, and that trend looks likely to continue. One survey found
that nearly 60 percent of enterprises expect to have Hadoop clusters running in
production by the end of this year. And according to Forrester, Hadoop usage is
increasing 32.9 percent per year.

Experts say that in 2017, many enterprises will expand their use of Hadoop and NoSQL
technologies, as well as look for ways to speed up their big data processing. Many
will be seeking technologies that allow them to access and respond to data in real
time.

2. In-Memory Technology:

One of the technologies that companies are investigating in an attempt to speed their
big data processing is in-memory technology. In a traditional database, the data is
stored in storage systems equipped with hard drives or solid state drives (SSDs). In-
memory technology stores the data in RAM instead, which is many, many times faster.
A report from Forrester Research forecasts that in-memory data fabric will grow 29.2
percent per year.

Several different vendors offer in-memory database technology, notably SAP, IBM, and
Pivotal.

3. Machine Learning:

As big data analytics capabilities have progressed, some enterprises have begun
investing in machine learning (ML). Machine learning is a branch of artificial
intelligence that focuses on allowing computers to learn new things without being
explicitly programmed. In other words, it analyzes existing big data stores to come to
conclusions which change how the application behaves.

According to Gartner, machine learning is one of the top 10 strategic technology trends
for 2017. It noted that today's most advanced machine learning and artificial
intelligence systems are moving "beyond traditional rule-based algorithms to create
systems that understand, learn, predict, adapt and potentially operate
autonomously."

4. Predictive Analytics

Predictive analytics is closely related to machine learning; in fact, ML systems often
provide the engines for predictive analytics software. In the early days of big data
analytics, organizations were looking back at their data to see what happened and
then later they started using their analytics tools to investigate why those things
happened. Predictive analytics goes one step further, using the big data analysis to
predict what will happen in the future.

The number of organizations using predictive analytics today is surprisingly low—only
29 percent, according to a 2016 survey from PwC. However, numerous vendors have
recently come out with predictive analytics tools, so that number could skyrocket in
the coming years as businesses become more aware of this powerful tool.

5. Intelligent Apps

Another way that enterprises are using machine learning and AI technologies is to
create intelligent apps. These applications often incorporate big data analytics,
analyzing users' previous behaviors in order to provide personalization and better
service. One example that has become very familiar is the recommendation engines
that now power many ecommerce and entertainment apps.

In its list of Top 10 Strategic Technology Trends for 2017, Gartner listed intelligent apps
second. "Over the next 10 years, virtually every app, application and service will
incorporate some level of AI," said David Cearley, vice president and Gartner Fellow.
"This will form a long-term trend that will continually evolve and expand the
application of AI and machine learning for apps and services."

6. Intelligent Security

Many enterprises are also incorporating big data analytics into their security strategy.
Organizations' security log data provides a treasure trove of information about past
cyberattack attempts that organizations can use to predict, prevent and mitigate
future attempts. As a result, some organizations are integrating their security
information and event management (SIEM) software with big data platforms like
Hadoop. Others are turning to security vendors whose products incorporate big data
analytics capabilities.

7. Internet of Things

The Internet of Things is also likely to have a sizable impact on big data. According to
a September 2016 report from IDC, "31.4 percent of organizations surveyed have
launched IoT solutions, with an additional 43 percent looking to deploy in the next 12
months."

With all those new devices and applications coming online, organizations are going to
experience even faster data growth than they have experienced in the past. Many will
need new technologies and systems in order to be able to handle and make sense of
the flood of big data coming from their IoT deployments.

8. Edge Computing

One new technology that could help companies deal with their IoT big data is edge
computing. In edge computing, the big data analysis happens very close to the IoT
devices and sensors instead of in a data center or the cloud. For enterprises, this offers
some significant benefits. They have less data flowing over their networks, which can
improve performance and save on cloud computing costs. It allows organizations to
delete IoT data that is only valuable for a limited amount of time, reducing storage
and infrastructure costs. Edge computing can also speed up the analysis process,
allowing decision makers to take action on insights faster than before.

MapReduce
Introduction:
MapReduce is a framework with which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware, in a reliable manner.

MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the name
MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model.

Algorithm:
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

• A MapReduce program executes in three stages, namely the map stage, shuffle stage, and reduce stage.

o Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

• The framework manages all the details of data-passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

• Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

• After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

Inputs & Outputs:


The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and Output types of
a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

Phase        Input                 Output
Map          <k1, v1>              list(<k2, v2>)
Reduce       <k2, list(v2)>        list(<k3, v3>)
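The classic word-count example illustrates this <key, value> flow. The sketch below simulates the map, shuffle, and reduce phases in plain Python on a single machine; it is illustrative only, since a real Hadoop job would implement Mapper and Reducer classes and read its input from HDFS.

    # Simulated MapReduce word count: (k1, v1) -> map -> (k2, v2) -> shuffle -> reduce -> (k3, v3).
    from collections import defaultdict

    def map_phase(line_no, line):
        # (k1 = line number, v1 = line text) -> list of (k2 = word, v2 = 1)
        return [(word, 1) for word in line.split()]

    def shuffle(mapped):
        # Group all intermediate (k2, v2) pairs by key: k2 -> list(v2)
        grouped = defaultdict(list)
        for key, value in mapped:
            grouped[key].append(value)
        return grouped

    def reduce_phase(word, counts):
        # (k2, list(v2)) -> (k3 = word, v3 = total count)
        return word, sum(counts)

    lines = ["big data big clusters", "web intelligence and big data"]
    mapped = [pair for i, line in enumerate(lines) for pair in map_phase(i, line)]
    result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
    print(result)   # {'big': 3, 'data': 2, 'clusters': 1, 'web': 1, 'intelligence': 1, 'and': 1}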

Uses:

MapReduce is useful in a wide range of applications, including distributed pattern-based
searching, distributed sorting, web link-graph reversal, Singular Value Decomposition, web
access log stats, inverted index construction, document clustering, machine learning, and
statistical machine translation. Moreover, the MapReduce model has been adapted to several
computing environments like multi-core and many-core systems, desktop grids, multi-cluster,
volunteer computing environments, dynamic cloud environments, mobile environments, and
high-performance computing environments.

At Google, MapReduce was used to completely regenerate Google's index of the World Wide
Web. It replaced the old ad hoc programs that updated the index and ran the various analyses.
Development at Google has since moved on to technologies such as Percolator, FlumeJava
and MillWheel that offer streaming operation and updates instead of batch processing, to
allow integrating "live" search results without rebuilding the complete index.

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The
transient data are usually stored on local disk and fetched remotely by the reducers.

Efficiency:
The traditional MapReduce deployment becomes inefficient when source data along with the
computing platform is widely (or even partially) distributed. Applications such as scientific
applications, weather forecasting, click-stream analysis, web crawling, and social networking
applications could have several distributed data sources, i.e., large-scale data could be
collected in separate data center locations or even across the Internet. For these applications,
usually there also exist distributed computing resources, e.g. multiple data centers. In these
cases, the most efficient architecture for running MapReduce jobs over the entire data set
becomes non-trivial.

BigTable vs. HBase
HBase is a clone of BigTable; their design philosophies are almost exactly the same.

• HBase is open source; BigTable is not.

• BigTable is written in C++; HBase is written in Java.

• Currently, BigTable has richer features than HBase.

• BigTable supports transactions. HBase supports single-row locking, but transactions across multiple rows are not natively supported at present.

• BigTable's secondary index building is much more mature than HBase's.

But the development speed of HBase is very fast. Many top developers are contributing to
the project, so it is quite possible that HBase will catch up with, or even surpass, BigTable in a few
years.

UNIT 3

Classification

In machine learning and statistics, classification is the problem of identifying to which of a set
of categories (sub-populations) a new observation belongs, on the basis of a training set of
data containing observations (or instances) whose category membership is known. An
example would be assigning a given email into "spam" or "non-spam" classes or assigning a
diagnosis to a given patient as described by observed characteristics of the patient (gender,
blood pressure, presence or absence of certain symptoms, etc.). Classification is an example
of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised
learning, i.e. learning where a training set of correctly identified observations is available. The
corresponding unsupervised procedure is known as clustering, and involves grouping data
into categories based on some measure of inherent similarity or distance.

Often, the individual observations are analyzed into a set of quantifiable properties, known
variously as explanatory variables or features. These properties may variously be categorical
(e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-
valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a
measurement of blood pressure). Other classifiers work by comparing observations to
previous observations by means of a similarity or distance function.

An algorithm that implements classification, especially in a concrete implementation, is
known as a classifier. The term "classifier" sometimes also refers to the mathematical
function, implemented by a classification algorithm, that maps input data to a category.

Classifications are discrete and do not imply order. Continuous, floating-point values would
indicate a numerical, rather than a categorical, target. A predictive model with a numerical
target uses a regression algorithm, not a classification algorithm.

The simplest type of classification problem is binary classification. In binary classification, the
target attribute has only two possible values: for example, high credit rating or low credit
rating. Multiclass targets have more than two values: for example, low, medium, high, or
unknown credit rating.

In the model build (training) process, a classification algorithm finds relationships between
the values of the predictors and the values of the target. Different classification algorithms
use different techniques for finding relationships. These relationships are summarized in a
model, which can then be applied to a different data set in which the class assignments are
unknown.

Classification models are tested by comparing the predicted values to known target values in
a set of test data. The historical data for a classification project is typically divided into two
data sets: one for building the model; the other for testing the model.

Scoring a classification model results in class assignments and probabilities for each case. For
example, a model that classifies customers as low, medium, or high value would also predict
the probability of each classification for each customer.
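To make the build/test/score steps concrete, here is a minimal sketch in Python. It is only an illustration: it assumes the scikit-learn library and uses its built-in iris dataset in place of any real credit or customer data, and the choice of logistic regression is likewise arbitrary.

# Split historical data, build a classification model, then score new cases
# with class assignments and per-class probabilities.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # observations with known class labels

# One data set for building the model, another for testing it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)         # one possible classification algorithm
model.fit(X_train, y_train)                       # model build (training)

print(model.predict(X_test[:3]))                  # predicted class assignments
print(model.predict_proba(X_test[:3]))            # probability of each class per case
print("test accuracy:", model.score(X_test, y_test))

The same pattern, fitting on one data set and scoring another, carries over to the credit-rating and customer-value examples described above.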

Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.

Clustering
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember
 A cluster of data objects can be treated as one group.

 While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.

 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.

 Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns (as in the sketch at the end of this list).

 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.

 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.

 Clustering also helps in classifying documents on the web for information discovery.

 Clustering is also used in outlier detection applications such as detection of credit card fraud.

 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.
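As a concrete, deliberately tiny sketch, the following Python snippet groups six made-up customer records into two clusters with k-means; scikit-learn and the toy data are assumptions made purely for illustration.

# Group similar objects into clusters based on a distance measure.
import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual spend, number of purchases]
X = np.array([[200, 3], [220, 4], [210, 2],
              [900, 30], [950, 28], [880, 35]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label assigned to each customer
print(kmeans.cluster_centers_)   # one representative centre per cluster

No labels are supplied: the algorithm discovers the two groups from data similarity alone, which is what distinguishes clustering from classification.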

Data Mining
Data mining is the process of sorting through large data sets to identify patterns and establish
relationships to solve problems through data analysis. Data mining tools allow enterprises to
predict future trends.

Data Mining Process:


Problem definition

A data mining project starts with the understanding of the business problem. Data mining
experts, business experts, and domain experts work closely together to define the project
objectives and the requirements from a business perspective. The project objective is then
translated into a data mining problem definition.

In the problem definition phase, data mining tools are not yet required.

Data exploration

Domain experts understand the meaning of the metadata. They collect, describe, and explore
the data. They also identify quality problems of the data. A frequent exchange with the data
mining experts and the business experts from the problem definition phase is vital.

In the data exploration phase, traditional data analysis tools, for example, statistics, are used
to explore the data.

Data preparation

Domain experts build the data model for the modeling process. They collect, cleanse, and
format the data because some of the mining functions accept data only in a certain format.
They also create new derived attributes, for example, an average value.

In the data preparation phase, data is tweaked multiple times in no prescribed order.
Typical tasks in this phase include preparing the data for the modeling tool by selecting tables, records, and attributes. The meaning of the data is not changed.

Modeling

Data mining experts select and apply various mining functions because you can use different
mining functions for the same type of data mining problem. Some of the mining functions
require specific data types. The data mining experts must assess each model.

In the modeling phase, a frequent exchange with the domain experts from the data
preparation phase is required.

The modeling phase and the evaluation phase are coupled. They can be repeated several
times to change parameters until optimal values are achieved. When the final modeling phase
is completed, a model of high quality has been built.

Evaluation

Data mining experts evaluate the model. If the model does not satisfy their expectations, they
go back to the modeling phase and rebuild the model by changing its parameters until optimal
values are achieved. When they are finally satisfied with the model, they can extract business
explanations and evaluate the following questions:

 Does the model achieve the business objective?

 Have all business issues been considered?

At the end of the evaluation phase, the data mining experts decide how to use the data mining
results.

Deployment

Data mining experts use the mining results by exporting the results into database tables or
into other applications, for example, spreadsheets.

Together, these phases make up the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.

Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents. In most cases, this
activity concerns processing human language texts by means of natural language processing
(NLP). Recent activities in multimedia document processing like automatic annotation and
content extraction out of images/audio/video could be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted
domains. An example is the extraction from news wire reports of corporate mergers, such as
denoted by the formal relation:

MergerBetween(company1,company2,date)

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
A broad goal of IE is to allow computation to be done on the previously unstructured data. A
more specific goal is to allow logical reasoning to draw inferences based on the logical content
of the input data. Structured data is semantically well-defined data from a chosen target
domain, interpreted with respect to category and context.
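The sketch below shows the flavour of a narrow, rule-based extractor for this kind of sentence. It is an assumed, illustrative example using only Python's standard re module and a hypothetical AcquisitionOf relation; real IE systems rely on NLP pipelines rather than a single regular expression.

# Extract a structured relation from one narrowly phrased news sentence.
import re

sentence = ("Yesterday, New York based Foo Inc. announced their "
            "acquisition of Bar Corp.")

# Deliberately narrow pattern: "<Acquirer> announced its/their acquisition of <Target>"
pattern = re.compile(
    r"([A-Z][\w.]*(?:\s[A-Z][\w.]*)*)\s+announced\s+(?:its|their)\s+"
    r"acquisition\s+of\s+([A-Z][\w.]*(?:\s[A-Z][\w.]*)*)"
)

match = pattern.search(sentence)
if match:
    acquirer, target = match.group(1), match.group(2)
    print(f"AcquisitionOf({acquirer}, {target})")   # structured output from free text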

Reasoning
Introduction:

In information technology, a reasoning system is a software system that generates conclusions from available knowledge using logical techniques such as deduction and induction. Reasoning systems play an important role in the implementation of artificial intelligence and knowledge-based systems.

By the everyday usage definition of the phrase, all computer systems are reasoning systems
in that they all automate some type of logic or decision. In typical use in the information technology field, however, the phrase is usually reserved for systems that perform more complex kinds of reasoning: not, for example, for systems that do fairly straightforward calculations such as a sales tax or customer discount, but for systems that make logical inferences about a medical diagnosis or a mathematical theorem. Reasoning systems come in two modes:
interactive and batch processing. Interactive systems interface with the user to ask clarifying
questions or otherwise allow the user to guide the reasoning process. Batch systems take in
all the available information at once and generate the best answer possible without user
feedback or guidance.

Reasoning systems have a wide field of application that includes scheduling, business rule
processing, problem solving, complex event processing, intrusion detection, predictive
analytics, robotics, computer vision, and natural language processing.

Types of reasoning:
 Deductive Reasoning:
Deductive reasoning, as the name implies, is based on deducing new information from
logically related known information. A deductive argument offers assertions that lead
automatically to a conclusion, e.g.
If there is dry wood, oxygen and a spark, there will be a fire.
Given: There is dry wood, oxygen and a spark.
We can deduce: There will be a fire.
 Inductive reasoning:
Inductive reasoning is based on forming, or inducing a `generalization' from a limited
set of observations, e.g.
Observation: All the crows that I have seen in my life are black.
Conclusion: All crows are black

Comparison of deductive and inductive reasoning


We can compare deductive and inductive reasoning using an example. We conclude what will happen when we let a ball go, using each type of reasoning in turn.
The inductive reasoning is as follows: By experience, every time I have let a ball go, it falls
downwards. Therefore, I conclude that the next time I let a ball go, it will also come down.
The deductive reasoning is as follows: I know Newton's Laws. So I conclude that if I let a ball
go, it will certainly fall downwards.
Thus the essential difference is that inductive reasoning is based on experience, while deductive reasoning is based on rules; the latter will therefore always be correct, provided the rules themselves are correct.

Abductive Reasoning:
Deduction is exact in the sense that deductions follow in a logically provable way from the axioms. Abduction, by contrast, allows only plausible inference, i.e. the conclusion might be wrong, e.g.
Implication: She carries an umbrella if it is raining.
Axiom: She is carrying an umbrella.
Conclusion: It is raining.
This conclusion might be false, because there could be other reasons that she is carrying an
umbrella, e.g. she might be carrying it to protect herself from the sun.

Analogical Reasoning:
Analogical reasoning works by drawing analogies between two situations, looking for
similarities and differences, e.g. when you say driving a truck is just like driving a car, by
analogy you know that there are some similarities in the driving mechanism, but you also
know that there are certain other distinct characteristics of each.

Common-Sense Reasoning:
Common-sense reasoning is an informal form of reasoning that uses rules gained through
experience or what we call rules-of-thumb. It operates on heuristic knowledge and heuristic
rules.

Non-Monotonic Reasoning:


Non-Monotonic reasoning is used when the facts of the case are likely to change after some
time, e.g.
Rule:
IF the wind blows,
THEN the curtains sway.
When the wind stops blowing, the curtains should sway no longer. However, if we use
monotonic reasoning, this would not happen. The fact that the curtains are swaying would be
retained even after the wind stopped blowing. In non-monotonic reasoning, we have a `truth
maintenance system'. It keeps track of what caused a fact to become true. If the cause is
removed, that fact is removed (retracted) also.

Inference:
Inference is the process of deriving new information from known information. In the domain
of AI, the component of the system that performs inference is called an inference engine.
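The toy forward-chaining engine below makes this concrete: it repeatedly applies if-then rules to a set of known facts until no new conclusions can be derived. The first rule is the dry-wood example from above; the second (fire implies smoke) is added purely to show chaining. Everything here is an illustrative sketch, not a production rule engine.

# Derive new facts (deduction) from known facts using simple if-then rules.
facts = {"dry wood", "oxygen", "spark"}
rules = [
    ({"dry wood", "oxygen", "spark"}, "fire"),   # IF dry wood, oxygen and a spark THEN fire
    ({"fire"}, "smoke"),                         # IF fire THEN smoke (illustrative extra rule)
]

changed = True
while changed:                                   # keep going until nothing new is derived
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)                # infer a new fact
            changed = True

print(facts)   # now includes the derived facts "fire" and "smoke"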

Dealing with Uncertainty


Many reasoning systems provide capabilities for reasoning under uncertainty. This is
important when building situated reasoning agents which must deal with uncertain
representations of the world. There are several common approaches to handling uncertainty.
These include the use of certainty factors, probabilistic methods such as Bayesian inference
or Dempster–Shafer theory, multi-valued (‘fuzzy’) logic and various connectionist approaches.
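As a small worked example of the probabilistic approach, Bayes' rule can attach a degree of belief to the earlier umbrella abduction. The numbers below are made up purely for illustration.

# Reasoning under uncertainty with Bayes' rule (illustrative probabilities).
p_rain = 0.3                      # prior probability of rain
p_umbrella_given_rain = 0.9       # she usually carries an umbrella when it rains
p_umbrella_given_no_rain = 0.2    # ...and sometimes when it does not (e.g. for the sun)

# P(umbrella) by the law of total probability
p_umbrella = (p_umbrella_given_rain * p_rain
              + p_umbrella_given_no_rain * (1 - p_rain))

# P(rain | umbrella): the abductive conclusion, now with a degree of belief
p_rain_given_umbrella = p_umbrella_given_rain * p_rain / p_umbrella
print(round(p_rain_given_umbrella, 3))   # about 0.659

Rather than asserting "it is raining", the system concludes that it is raining with probability roughly 0.66, which is the kind of graded output certainty-factor and Bayesian reasoners produce.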

UNIT 4

Prediction (Forecasting)
Introduction:

In statistics, prediction is a part of statistical inference. One particular approach to such inference is known as predictive inference, but the prediction can be undertaken within any
of the several approaches to statistical inference. Indeed, one possible description of statistics
is that it provides a means of transferring knowledge about a sample of a population to the
whole population, and to other related populations, which is not necessarily the same as
prediction over time. When information is transferred across time, often to specific points in
time, the process is known as forecasting. Forecasting usually requires time series methods,
while prediction is often performed on cross-sectional data.

Statistical techniques used for prediction include regression analysis and its various sub-
categories such as linear regression, generalized linear models (logistic regression, Poisson
regression, Probit regression), etc. In case of forecasting, autoregressive moving average
models and vector autoregression models can be utilized. When these and/or related,
generalized set of regression or machine learning methods are deployed in commercial usage,
the field is known as predictive analytics.

To use regression analysis for prediction, data are collected on the variable that is to be
predicted, called the dependent variable or response variable, and on one or more variables
whose values are hypothesized to influence it, called independent variables or explanatory
variables. A functional form, often linear, is hypothesized for the postulated causal
relationship, and the parameters of the function are estimated from the data—that is, are
chosen so as to optimize in some way the fit of the function, thus parameterized, to the data.
That is the estimation step. For the prediction step, explanatory variable values that are
deemed relevant to future (or current but not yet observed) values of the dependent variable
are input to the parameterized function to generate predictions for the dependent variable.
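A minimal sketch of these two steps, assuming only the NumPy library and some made-up observations, might look like this:

# Estimation step, then prediction step, for a simple linear functional form.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # dependent (response) variable

# Estimation: choose parameters that optimize the fit (here, least squares).
b, a = np.polyfit(x, y, 1)                   # slope b and intercept a

# Prediction: feed a not-yet-observed explanatory value into the fitted function.
x_new = 6.0
print(a + b * x_new)                         # predicted value of the dependent variable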

Types of Forecasting:
1. Qualitative Methods:
Qualitative forecasting techniques are subjective, based on the opinion and judgment of consumers and experts; they are appropriate when past data are not available. They
are usually applied to intermediate- or long-range decisions. Examples of qualitative
forecasting methods are informed opinion and judgment, the Delphi method, market
research, and historical life-cycle analogy.
2. Quantitative Methods:
Quantitative forecasting models are used to forecast future data as a function of
past data. They are appropriate to use when past numerical data is available and
when it is reasonable to assume that some of the patterns in the data are expected
to continue into the future. These methods are usually applied to short- or intermediate-range decisions. Examples of quantitative forecasting methods are last-period demand, simple and weighted N-period moving averages, simple exponential smoothing, Poisson process model based forecasting, and multiplicative seasonal indexes (two of these are sketched below).
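Two of the quantitative methods just listed are simple enough to sketch directly in Python; the demand figures are invented for illustration.

# A 3-period moving average and simple exponential smoothing on toy demand data.
demand = [120, 135, 128, 150, 162, 158]

# 3-period moving average: forecast = mean of the last three observations.
window = demand[-3:]
ma_forecast = sum(window) / len(window)

# Simple exponential smoothing: forecast = alpha*actual + (1 - alpha)*previous forecast.
alpha = 0.3
forecast = demand[0]                 # initialise with the first observation
for actual in demand[1:]:
    forecast = alpha * actual + (1 - alpha) * forecast

print(round(ma_forecast, 1), round(forecast, 1))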

Prediction & Classification Issues:


The major issue is preparing the data for Classification and Prediction. Preparing the data involves
the following activities −

 Data Cleaning − Data cleaning involves removing noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.

 Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.

 Data Transformation and reduction − The data can be transformed by any of the following
methods.

o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range (see the sketch after this list). Normalization is used when the learning step employs neural networks or other methods involving distance measurements.

o Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use concept hierarchies.
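The sketch below shows min-max normalization, one common way of scaling an attribute into a small specified range; the values are illustrative, and z-score or decimal-scaling normalization would be applied in the same spirit.

# Scale all values of one attribute into the range [0, 1].
values = [56.0, 74.0, 61.0, 90.0, 45.0]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # every value now falls within [0, 1]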

Comparison of Classification & Prediction Methods


Here are the criteria for comparing the methods of Classification and Prediction −

 Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.

 Speed − This refers to the computational cost in generating and using the classifier or predictor.

 Robustness − It refers to the ability of the classifier or predictor to make correct predictions from noisy data.

 Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.

 Interpretability − It refers to the extent to which the classifier or predictor can be understood.

Neural Models

Artificial neural networks (ANNs) or connectionist systems are a computational model used
in machine learning, computer science and other research disciplines, which is based on a
large collection of connected simple units called artificial neurons, loosely analogous to axons
in a biological brain. Such systems can be trained from examples, rather than explicitly
programmed, and excel in areas where the solution or feature detection is difficult to express
in a traditional computer program. Like other machine learning methods, neural networks
have been used to solve a wide variety of tasks, like computer vision and speech recognition,
that are difficult to solve using ordinary rule-based programming.

Typically, neurons are connected in layers, and signals travel from the first (input), to the last
(output) layer. Modern neural network projects typically have a few thousand to a few million
neural units and millions of connections; their computing power is similar to a worm brain,
several orders of magnitude simpler than a human brain. The signals and state of artificial
neurons are real numbers, typically between 0 and 1. There may be a threshold function or
limiting function on each connection and on the unit itself, such that the signal must surpass
the limit before propagating. Back propagation is the use of forward stimulation to modify
connection weights, and is sometimes done to train the network using known correct outputs.
However, the success is unpredictable: after training, some systems are good at solving
problems while others are not. Training typically requires several thousand cycles of
interaction.

The goal of the neural network is to solve problems in the same way that a human would,
although several neural network categories are more abstract. New brain research often
stimulates new patterns in neural networks. One new approach is the use of connections that span further to connect processing layers rather than only adjacent neurons. Other research explores the different types of signal that axons propagate over time; approaches such as deep learning model greater complexity than a set of boolean variables that are simply on or off. Newer types of network are more free-flowing in terms of stimulation and inhibition, with
connections interacting in more chaotic and complex ways. Dynamic neural networks are the
most advanced, in that they dynamically can, based on rules, form new connections and even
new neural units while disabling others.
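The following sketch shows the basic signal flow through a small network; the weights are arbitrary illustrative numbers (nothing is trained here), and the NumPy library is assumed.

# Forward pass: input signals travel through one hidden layer to an output unit,
# with each unit's output squashed into (0, 1) by a sigmoid function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])                      # input signals (real numbers)

W1 = np.array([[ 0.2, -0.4],                  # input -> hidden connection weights
               [ 0.7,  0.1],
               [-0.3,  0.9]])
b1 = np.array([0.1, -0.2, 0.05])

W2 = np.array([0.6, -0.1, 0.8])               # hidden -> output connection weights
b2 = -0.3

hidden = sigmoid(W1 @ x + b1)                 # hidden-layer activations in (0, 1)
output = sigmoid(W2 @ hidden + b2)            # final output signal in (0, 1)
print(output)

Training with backpropagation would then adjust W1, W2, b1 and b2 so that the output moves towards known correct values, repeated over many cycles.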

Deep Learning

Deep learning (also known as deep structured learning, hierarchical learning or deep
machine learning) is the study of artificial neural networks and related machine learning
algorithms that contain more than one hidden layer. These deep nets:

 Use a cascade of many layers of nonlinear processing units for feature extraction and
transformation. Each successive layer uses the output from the previous layer as
input. The algorithms may be supervised or unsupervised and applications include
pattern analysis (unsupervised) and classification (supervised).
 Are based on the (unsupervised) learning of multiple levels of features or
representations of the data. Higher level features are derived from lower level
features to form a hierarchical representation.
 Are part of the broader machine learning field of learning representations of data.
 Learn multiple levels of representations that correspond to different levels of
abstraction; the levels form a hierarchy of concepts.

In a simple case, there might be two sets of neurons: one set that receives an input signal and
one that sends an output signal. When the input layer receives an input it passes on a
modified version of the input to the next layer. In a deep network, there are many layers
between the input and the output (and the layers are not made of neurons but it can help to
think of it that way), allowing the algorithm to use multiple processing layers, composed of
multiple linear and non-linear transformations.

Deep Learning has been applied to fields such as computer vision, automatic speech
recognition, natural language processing, audio recognition and bioinformatics where it has
been shown to produce state-of-the-art results on various tasks.

Regression

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing
several variables, when the focus is on the relationship between a dependent variable and
one or more independent variables (or 'predictors'). More specifically, regression analysis
helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Most commonly, regression analysis estimates the conditional
expectation of the dependent variable given the independent variables – that is, the average
value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the
dependent variable given the independent variables. In all cases, the estimation target is a
function of the independent variables called the regression function. In regression analysis,
it is also of interest to characterize the variation of the dependent variable around the
regression function which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Regression analysis is also used to understand
which among the independent variables are related to the dependent variable, and to explore
the forms of these relationships. In restricted circumstances, regression analysis can be used
to infer causal relationships between the independent and dependent variables. However this
can lead to illusions or false relationships, so caution is advisable; for example, correlation
does not imply causation.

Regression can be classified into two types −

 Simple regression − One independent variable

 Multiple regression − Several independent variables

Simple Regression
The following are the steps to build up a regression analysis −

 Specify the regression model


 Obtain data on variables
 Estimate the quantitative relationships
 Test the statistical significance of the results
 Usage of results in decision-making
Formula for simple regression is –

Y = a + bX + u

Y= dependent variable

X= independent variable

a= intercept

b= slope

u= random factor

Cross-sectional data provides information on a group of entities at a given time, whereas time series data provides information on one entity over time. Estimating the regression equation involves finding the best linear relationship between the dependent and the independent variables.
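To connect the formula Y = a + bX + u to actual numbers, here is a small sketch of the ordinary least squares estimates, using the closed-form expressions b = cov(X, Y) / var(X) and a = mean(Y) − b·mean(X); the data are invented for illustration.

# Estimate intercept a and slope b in Y = a + bX + u by least squares.
X = [10.0, 20.0, 30.0, 40.0, 50.0]
Y = [25.0, 44.0, 68.0, 81.0, 105.0]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x

print(round(a, 3), round(b, 3))   # estimated intercept and slope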

Multiple Regression Analysis


Unlike in simple regression, in multiple regression analysis each coefficient indicates the change in the dependent variable per unit change in that independent variable, assuming the values of the other variables are held constant.

Here, the null hypothesis is that there is no relationship between the dependent variable and the independent variables of the population. It is written as –

H0: b1 = b2 = b3 = …. = bk = 0

No relationship exists between the dependent variable and the k independent variables for the
population.

Feature Selection

In machine learning and statistics, feature selection, also known as variable selection,
attribute selection or variable subset selection, is the process of selecting a subset of relevant
features (variables, predictors) for use in model construction. Feature selection techniques
are used for four reasons:

 simplification of models to make them easier to interpret by researchers/users


 shorter training times,
 to avoid the curse of dimensionality,
 enhanced generalization by reducing overfitting (formally, reduction of
variance)

The central premise when using a feature selection technique is that the data contains many
features that are either redundant or irrelevant, and can thus be removed without incurring
much loss of information. Redundant or irrelevant features are two distinct notions, since one
relevant feature may be redundant in the presence of another relevant feature with which it
is strongly correlated.

Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).
domains where there are many features and comparatively few samples (or data points).
Archetypal cases for the application of feature selection include the analysis of written texts
and DNA microarray data, where there are many thousands of features, and a few tens to
hundreds of samples.

A feature selection algorithm can be seen as the combination of a search technique for
proposing new feature subsets, along with an evaluation measure which scores the different
feature subsets. The simplest algorithm is to test each possible subset of features, finding the
one which minimizes the error rate. This is an exhaustive search of the space, and is
computationally intractable for all but the smallest of feature sets.
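The toy exhaustive search below scores every feature subset by cross-validated accuracy, which is feasible only because the assumed example dataset (scikit-learn's iris data) has just four features; the model and scoring choices are illustrative.

# Exhaustive feature-subset search: a search technique plus an evaluation measure.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = 0.0, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):        # propose a feature subset
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()  # score the subset
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 3))

With d features there are 2^d − 1 non-empty subsets, which is why this brute-force approach becomes intractable for anything beyond very small feature sets.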
