
A

SEMINAR REPORT
On
Machine Learning

Submitted to Rajasthan Technical University


In partial fulfillment of the requirement for the award of the
degree of
Bachelor of Technology
in
COMPUTER SCIENCE & ENGINEERING

Submitted By-
Amrit Kumar Sah (16EVJCS020)

Under the Guidance of


Mr. Bharat Bhushan Singhal
(Asst. Professor, Department of CSE)
at

VIVEKANANDA INSTITUTE OF TECHNOLOGY, JAIPUR


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
RAJASTHAN TECHNICAL UNIVERSITY, KOTA
July, 2019

Certificate Copy

Candidate’s Declaration

I, Amrit Kumar Sah [16EVJCS020], B.Tech. (Semester VII) of Vivekananda Institute of Technology, Jaipur, hereby declare that the Seminar Report entitled "Machine Learning" is an original work and the data provided in the study are authentic to the best of my knowledge. This report has not been submitted to any other institute for the award of any other degree.

Amrit Kumar Sah


Roll No. 16EVJCS020

Place: Jaipur
Date:

ACKNOWLEDGEMENT

I take this opportunity to express my deepest gratitude to those who have generously helped me by providing valuable knowledge and expertise during my training.
It is a great pleasure to present this report on the project named "Machine Learning", undertaken by me as part of my B.Tech (CSE) curriculum. I am thankful to Vivekananda Institute of Technology for offering me such a wonderful and challenging opportunity.
It is a pleasure to pen down these lines to express my sincere thanks to the people who helped me along the way in completing this project. Words are inadequate to express my sincere gratitude towards them.
I express my sincere gratitude to Prof. (Dr.) N.K. Agarwal (Principal, VIT) for providing me an opportunity to undergo this major project as part of the curriculum. I am thankful to Miss Kirti Gupta for her support, co-operation and motivation provided to me during the training, and for her constant inspiration, presence and blessings. I would also like to thank my H.O.D., Mr. Tushar Vyas, for his valuable suggestions, which helped a lot in the completion of this project.
Lastly, I would like to thank the Almighty and my parents for their moral support, and my friends with whom I share my day-to-day experiences and receive lots of suggestions that improve the quality of my work.

Name : Amrit Kumar Sah


Roll No. : 16EVJCS020

ABSTRACT
In this project, we were asked to experiment with a real world dataset, and to explore how machine learning
algorithms can be used to find the patterns in data. We were expected to gain experience using a common
data-mining and machine learning library, Weka, and were expected to submit a report about the dataset and
the algorithms used. After performing the required tasks on a dataset of my choice, herein lies my final
report.

Keywords: Machine Learning, Pattern Recognition, Classification, Supervised learning, Artificial Intelligence.

TABLE OF CONTENTS

CONTENT                                        PAGE NO
Declaration                                    i
Acknowledgement                                ii
Abstract                                       iii
Table of Contents                              iv
Contents                                       v
Contents

ACKNOWLGEMENT.............................................................................................................iv

ABSTRACT................................................................................................................................v

CHAPTER 1 INTRODUCTION….......................................................................................10

1.1 Objectives...........................................................................................................................11
1.1.1 Supervised learning.........................................................................................................11
1.1.2 Unsupervised....................................................................................................................11
1.1.3 Decision time....................................................................................................................12
1.2 Motivation...........................................................................................................................12
1.3 Internship Goals.................................................................................................................14
1.4 Report Layout.....................................................................................................................14

CHAPTER 2 INTERNSHIP ENTERPRISE.........................................................................16

2.1 About the Company...........................................................................................................16


2.2 Head Office.........................................................................................................................16
2.3 IT Services Offered............................................................................................................16
2.4 Roles in Job Market...........................................................................................................17
CHAPTER 3 Internship Roles And Responsibilities............................................................18

3.1 Training Attended..............................................................................................................18

3.2 Assigned Responsibilities...................................................................................................18

3.3 Work Environment............................................................................................................18

3.4 Data Analyst Responsibilities..............................................................................................18


3.5 Data Analyst Job Duties....................................................................................................19
3.6 Responsibilities...................................................................................................................19
3.7 System Design.....................................................................................................................19
3.8 Performed Tasks................................................................................................................21

CHAPTER 4 INTERNSHIP OUTCOMES...........................................................................24

4.1 Problem & Solution...........................................................................................................24

4.1.1 Problems with their Solutions........................................................................................24

4.2 Learning Outcomes............................................................................................................26


4.2.1 Python Programming.................................................................................................................. 26
4.2.2 NumPy..............................................................................................................................26
4.2.3 Pandas......................................................................................................................................... 27
4.2.4 Data Visualisation........................................................................................................................ 29
4.2.5 Basic Stats And Regression.............................................................................................30
4.2.6 Machine Learning & ML project..................................................................................31
4.2.7 NLP & NLP project........................................................................................................................ 31
5.1 Conclusion...........................................................................................................................33
5.2 Future Scopes.....................................................................................................................33
REFERENCE...........................................................................................................................35

LIST OF FIGURES
FIGURES
Figure 1: ML
Figure 2: ML
Figure 3: ML
Figure 4: ML
Figure 5: NLP
Figure 6: NLP
Figure 7: NLP
Figure 8: NLP

Chapter 1
Introduction
What is Machine Learning? A Definition
Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide. The primary aim is to allow the
computers to learn automatically without human intervention or assistance and adjust
actions accordingly.

Some machine learning methods

Machine learning algorithms are often categorized as supervised or unsupervised.

 Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis
of a known training dataset, the learning algorithm produces an inferred function to
make predictions about the output values. The system is able to provide targets for any
new input after sufficient training. The learning algorithm can also compare its output
with the correct, intended output and find errors in order to modify the model
accordingly.
 In contrast, unsupervised machine learning algorithms are used when the information
used to train is neither classified nor labeled. Unsupervised learning studies how
systems can infer a function to describe a hidden structure from unlabeled data. The
system doesn’t figure out the right output, but it explores the data and can draw
inferences from datasets to describe hidden structures from unlabeled data.
 Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training –
typically a small amount of labeled data and a large amount of unlabeled data. The
systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when the acquired labeled data requires
skilled and relevant resources in order to train it / learn from it, whereas acquiring
unlabeled data generally doesn't require additional resources.
 Reinforcement machine learning algorithms are a learning method that interacts with
its environment by producing actions and discovering errors or rewards. Trial and error
search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically determine
the ideal behavior within a specific context in order to maximize its performance.

Simple reward feedback is required for the agent to learn which action is best; this is
known as the reinforcement signal.
Machine learning enables analysis of massive quantities of data. While it generally
delivers faster, more accurate results in order to identify profitable opportunities or
dangerous risks, it may also require additional time and resources to train it properly.
Combining machine learning with AI and cognitive technologies can make it even more
effective in processing large volumes of information.

1.1 Objectives
The purpose of machine learning is to discover patterns in your data and then make
predictions based on often complex patterns to answer business questions, detect and
analyse trends and help solve problems.

Machine learning is effectively a method of data analysis that works by automating the
process of building data models.

1.1.1 Supervised learning


Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. It infers a function
from labeled training data consisting of a set of training examples. In supervised
learning, each example is a pair consisting of an input object (typically a vector)
and a desired output value (also called the supervisory signal). A supervised
learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow
for the algorithm to correctly determine the class labels for unseen instances.
This requires the learning algorithm to generalize from the training data to
unseen situations in a "reasonable" way (see inductive bias).

The parallel task in human and animal psychology is often referred to as concept
Learning.

The aim of supervised machine learning is to build a model that makes predictions based on
evidence in the presence of uncertainty. Supervised learning uses classification and regression
techniques to develop predictive models.
• Classification techniques predict discrete responses
• Regression techniques predict continuous responses
Using Supervised Learning to Predict Heart Attack.
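
To make the classification workflow described above concrete, the following is a minimal sketch using scikit-learn (a common Python machine learning library). The data here is synthetic and purely illustrative; it is not the heart-attack data referred to above.

# Minimal supervised-learning sketch (synthetic, illustrative data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled examples: X holds the input vectors, y the known class labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn an inferred function from the training pairs, then predict unseen inputs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))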

1.1.2 Unsupervised learning


Unsupervised learning is a type of self-organized Hebbian learning that helps find
previously unknown patterns in a data set without pre-existing labels. It is also known
as self-organization and allows modeling probability densities of given inputs.[1] It is
one of the main three categories of
machine learning, along with supervised and reinforcement learning. Semi-
supervised learning has also been described, and is a hybridization of supervised
and unsupervised techniques.
Two of the main methods used in unsupervised learning are principal component analysis
and cluster analysis. Cluster analysis is used in unsupervised learning to group, or
segment, datasets with shared attributes in order to extrapolate algorithmic
relationships.[2] Cluster analysis is a branch of machine learning that groups
the data that has not been labelled, classified or categorized. Instead of
responding to feedback, cluster analysis identifies commonalities in the data and
reacts based on the presence or absence of such commonalities in each new piece
of data. This approach helps detect anomalous data points that do not fit into
either group.
A central application of unsupervised learning is in the field of density
estimation in statistics,[3] though unsupervised learning encompasses many other
domains involving summarizing and explaining data features.

The aim of unsupervised machine learning is to find hidden patterns or intrinsic
structures in data. It is used to draw inferences from datasets consisting of input data
without labeled responses. Unsupervised learning uses clustering techniques to develop
models.
• Clustering is the most common unsupervised learning technique. It is used for
exploratory data analysis to find hidden patterns or groupings in data.

Applications for clustering include gene sequence analysis, market research, and object
recognition.
Using Unsupervised Learning to Predict Heart Attack.
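
As a small illustration of the clustering approach described above, here is a minimal k-means sketch with scikit-learn (synthetic data, purely illustrative):

# Minimal unsupervised-learning sketch: k-means clustering on unlabeled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled inputs: no target values are provided to the algorithm.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the observations into 3 clusters based on shared structure alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])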

1.1.3 Decision Time


How to decide which algorithm to use?
• Choosing the right algorithm can seem overwhelming
• There is no best method or one size fits all. Finding the right algorithm is
partly just trial and error
• Algorithm selection also depends on the size and type of data you’re
working with, the insights you want to get from the data, and how those
insights will be used.

1.2 Motivation
To me it was motivating to learn because finally I could see how all the math I had
studied at university is applied in real life, and it's not only interesting, but also very
useful.

Also, just the thought that, given the data, you can extract something useful from it is
already very motivating. For example, if you measure your weight every day, then, when
you accumulate enough data, you can extract some helpful insights from it that otherwise
you won't be able to learn.
Another motivation could be money. Data science is quite a hot topic
nowadays and data scientists are paid quite well - companies have
tons and tons of data and they need people who know how to mine
something useful from this data. And there are more and more "data
driven" companies that need people who can mine insight from the
raw floods of information.

1.3 Internship Goal
1. Gain more professional connections.
We can’t stress this enough—the people you meet at your internship are important. Exchange contact
info, connect on LinkedIn, and make sure you’re not saying goodbye for good when you walk out the
door on your last day! These co-workers are your future references, mentors, and friends that can
alert you to any new job opportunities. You wouldn’t wanna miss out on that, would you?

2. Develop skills you can add to your resume.


It’s definitely a big win if you can add at least one hard skill to your resume, such as mastering a certain
computer program, learning how to analyze data, or something else that’s measurable. As for your
soft skills? Think of things like effective communication, your ability to work in a team, and your
problem-solving skills.

3. Learn what you do and don’t like.


You’re not only at your internship to learn skills; you also want to learn about yourself! Sometimes
you’ll find that you actually hate working on something you thought you’d enjoy, or you’ll realize
there’s an aspect of your job you’d absolutely love to do more of! This will help you when you’re
searching for future opportunities—you’ll know what kinds of job descriptions you’ll want to avoid,
and vice versa.

4. Gain more confidence in a professional setting.


It’s easy to feel a bit sheepish as an intern, but this is your chance to gain confidence. If right now
you shy away from things like sharing your opinions or speaking in front of large groups, make it a
goal to conquer those fears by the end of your internship. It’ll do you good if you embrace
opportunities that initially scare you!

5. Learn about your own working style.


Here’s another important chance to learn a thing or two about yourself. Are you most productive in
the mornings? Maybe that means you should show up a bit early each day! Do you work best by
collaborating with others? For your next opportunity, you can search for a role involving a lot of
teamwork. It’ll be easier steering your career in the right direction once you’ve got your working
style down pat.

1.4 Report Layout


Layout includes such things as the chapter objectives, the type of paper chosen, the
margins, the line spacing, the pagination, and the incorporation of equations, illustrations,
and references. Table 1 presents general specifications for the page layouts. For each
report that I create, I can assign a report layout. Via the report layout I define the layout
features of the report, such as the page format of the report and the numeric format of
the report data. When I use the layout functions, I can set parameters for the report.
When I define a report layout, I define a layout specifically for a report and I can change
the standard layout assigned to a report.
 In Chapter 1 I mention details about the introduction to machine learning. The
objectives cover supervised learning, unsupervised learning and how to decide which
algorithm to use. The motivation, the internship goals and the report layout, which is a
short overview of the full report chapters, are also covered.
 Chapter 2 describes the internship enterprise: about the company, IT services offered,
roles in the job market and responsibilities.
 Chapter 3 contains the internship roles and responsibilities, like the training attended,
assigned responsibilities, the work environment, the project templates used, and the
performed tasks with figures.
 Chapter 4 describes the internship outcomes; some outcomes of the internship are
problems and their solutions, learning outcomes and challenges.
 In Chapter 5 I simply present the internship discussion and conclusion.

Chapter 2
Internship Enterprise

2.1 About the Company: EduGrad
We are an edtech company with a strong belief in the "Learn by Doing" approach, building
an optimum learning experience and developing competencies in the area of emerging
technologies.

Journey
EduGrad was born in May 2018 in a bid to teach niche technologies through a learn-by-doing
approach. In Aug '18 our first course in Data Analytics was rolled out. By the end of December
our batches were running across 20 colleges spread across NCR, Hyderabad and Delhi.

Vision
Our vision is to empower students with problem-solving skills.

Mission
Our mission is to make our learners capable of facing real-world challenges by giving them
a world-class learning experience, expert mentor guidance and a learn-by-doing approach.

2.2 Head Office


Registered Address: 3rd Floor, Vakula Mansion, beside HP Petrol Pump, Telecom Nagar,
Gachibowli, Hyderabad - 500032

Business Address: H-85, 3rd Floor, Sector-63, Noida, Uttar Pradesh - 201301,
CIN: U72900TG2018PTC124426
www.edugrad.com

2.3 IT Services Offered


Technologies Languages
 Learn Data Analytics using Python
 Master Python for Data Science and Machine Learning
 Introduction to GIT
 Learn Web Scraping using Python
 Intro to Database Tools for Data Science
 Presentation Skills for Data Scientists
 Machine Learning
2.4 Roles in Job Market
The top respondents for the job title Data Analyst are from the companies
Accenture, EduGrad, Tata Consultancy Services Limited and EY (Ernst & Young), etc.
Reported salaries are highest at HSBC where the average pay is Rs 687,845. Other
companies that offer high salaries for this role include Accenture and Tata Consultancy
Services Limited, earning around Rs 484,711 and Rs 464,577, respectively. eClerx pays
the lowest at around Rs 204,419. Genpact and EY (Ernst & Young) also pay on the
lower end of the scale, paying Rs 350,000 and Rs 423,728, respectively.

Chapter 3
INTERNSHIP ROLES AND RESPONSIBILITIES
3.1 Training Attended
To build a project, several components and requirements come together; we know that we
can mix languages and tools like Python, OOPs concepts and Python libraries. The Anaconda/
Jupyter Notebook environment combines the results of the interpreted and executed Python
code, which may be any type of data, including images, with the generated analytical page.
Python code may also be executed with a command-line interface (CLI) and can be used to
implement standalone graphical applications. The trainings I attended at EduGrad are given
below:
1. Python Programming
2. NumPy
3. Pandas
4. Data Visualisation
5. Basic Stats & Regression Models
6. ML Overview & ML Project
7. NLP Overview & NLP Project Completed

3.2 Assigned Responsibilities:

Instructions -

1. Please read the questions carefully before attempting them.


2. Solve all the questions in a SINGLE jupyter notebook file.
3. In case name of the variable to be used is mentioned in the question, use
the same name while coding (marks are associated with it)
4. In your answers, include your descriptions as and when mentioned. Think of
yourself as a Data Analyst who needs to suggest and explain solutions to the
client based on data.

3.3 Work Environment


Front end Developer:
The front-end developer generally works on the client side, dealing with the web page design
and graphics that are accessible to the user.
Back end Developer:
The back-end developer is a person who is responsible for the back-end development
that interacts with the server. This type of developer specializes in languages like Python.

3.4 Data Analyst Responsibilities:


 Interpreting data, analyzing results using statistical techniques

 Developing and implementing data analyses, data collection systems and
other strategies that optimize statistical efficiency and quality
 Acquiring data from primary or secondary data sources and maintaining databases

3.5 Data Analyst Job Duties


Data analyst responsibilities include conducting full lifecycle analysis to
include requirements, activities and design. Data analysts will develop analysis
and reporting capabilities. They will also monitor performance and quality
control plans to identify improvements.

3.6 Responsibilities
 Interpret data, analyze results using statistical techniques and
provide ongoing reports
 Develop and implement databases, data collection systems, data analytics
and other strategies that optimize statistical efficiency and quality
 Acquire data from primary or secondary data sources and maintain
databases/data systems
 Identify, analyze, and interpret trends or patterns in complex data sets
 Filter and “clean” data by reviewing computer reports, printouts, and
performance indicators to locate and correct code problems
 Work with management to prioritize business and information needs
 Locate and define new process improvement opportunities

3.7 System Design


This section explains our methodology and the system architecture. Fig. 3 gives a
graphical representation of our prototype system. It consists of two main modules: one
that is language dependent and another that is language independent. The following
subsections explain the individual system components in detail.

A. Data pre-processing
This component is part of the language-dependent system module. We designed the
preprocessor in such a way that a change in the input language does not affect the rest of
the system components. First, we tokenize the raw survey questions with a tool that is
dependent on the survey's source language. For Latin-character based languages such as
Spanish, German, and French, we build the tokenizers using the Python Natural Language
Toolkit (NLTK) [8] and predefined regular expressions. For Asian languages such as
Japanese, we use morphology-based segmenters (e.g., MeCab and TinySegmenter for
Japanese text) to tokenize the survey text. Second, we standardize tokens by removing
noise terms and stop-words. We used language-dependent stop-word lists for this purpose.
Third, we represent each survey or question as a document in a sparse bag-of-words
format, after building a vocabulary of corpus words (separately for each language we
used). Finally, we use these documents as input to the topic learning model which, in
turn, learns clusters from the term co-occurrence frequencies of the corresponding
documents. See Fig. 3 for more details.

B. Topic learning
As discussed earlier, topic models have the ability to learn semantic relationships of
words from an observed text collection. In this system, topic modeling is used for three
main purposes: i) categorizing and ranking surveys, ii) survey sub-categorization and
ranking, and iii) clustering of survey questions under an identified survey sub-cluster.
Survey ranking is performed to identify relevant surveys that belong to general
(top-level) topics such as market research, education, and sports. To perform ranking,
we first compute the topic mixtures of the survey documents, which are formed by
combining survey questions. To estimate the topical structure from the survey documents,
we use HDP [3], which can learn the number of topics automatically (this is one of our
primary goals) along with the topic model from large document collections. A detailed
theoretical review of HDP and its inference methods is presented by Teh et al. [3]. We
use a modified version of the HDP implementation by Wang and Blei [9] in our
experiments. The major components of a learned HDP model are the corpus-level
topic-word association counts and the document-level topic mixtures. Each topic in the
estimated model is represented by its topic-word probabilities. These words are used by
language experts to name survey categories. The document-level topic mixtures give an
idea of the topicality of a particular survey to a given topic. This is also quite useful in
finding similar surveys and grouping them together. From the observations of the
top-level survey categorization explained above, we found that some of the topics found
by the HDP estimation process can be further divided into subtopics, and the
corresponding surveys can be ranked by subtopic relevance. For modeling survey
subtopics, we use the original LDA model [2] because it is more accurate and less
computationally expensive than HDP. We use the Gensim package's [10] online
variational inference implementation for the model estimation process. Conventional
topic modeling algorithms are designed to work on larger documents compared to survey
questions (section II). The chance of a term re-occurring in the same question is quite low
compared to typical documents used in the topic modeling literature. So, to cluster
questions to build question banks, we represent questions in a much simpler format such
as TF-IDF and perform LSI, which helps to represent the questions in the smaller LSI
space rather than the vocabulary space.

C. Survey relevance ranking
We use survey relevance ranking to group together surveys belonging to an estimated
topic (Fig. 1). We use the individual surveys' estimated document topic mixtures, θ̂_d, to
rank them on relevance given a topic or set of topics. For a given topic set T ⊂ K, we
calculate

m(d) = Σ_{k ∈ T} ln θ̂_{d,k} + Σ_{j ∉ T} ln(1 − θ̂_{d,j})    (1)

for all surveys d = 1, 2, ..., D in the corpus and sort them to rank their relevance. Here,
we assume that the document topic mixtures θ̂_d satisfy the multinomial property
Σ_{j=1}^{K} θ̂_{d,j} = 1. Intuitively, we can see that this equation maximizes the score of
a topic set T ⊂ K given a document. A document with a high value of this score is a
highly relevant document for that topic set.

D. Question clustering and ranking
One of the goals of this project is to design a system that can recommend useful,
relevant survey questions, given a selected survey topic (e.g., education) for building
question banks. Once we have the surveys that belong to a given topic, we group similar
survey questions into question groups and rank them within each group based on several
ranking scores. We first apply fuzzy C-means (FCM) clustering [4], [11] to the set of
survey questions represented in LSI space (section III-B). Second, we rank the questions
that belong to a given cluster based on measures such as string matching, fuzzy set
matching [12], and distance from the cluster centroid. Finally, we remove duplicate
questions and present the ranked questions to survey designers.
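
To make the TF-IDF + LSI representation step concrete, here is a minimal sketch using scikit-learn. It is only an illustration under assumed inputs; the actual system described above used Gensim, an HDP implementation and fuzzy C-means, and the example questions below are made up.

# Sketch of representing short survey questions in a small LSI space
# (TF-IDF followed by truncated SVD), as described in part B above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

questions = [
    "How satisfied are you with the course content?",
    "Would you recommend this course to a friend?",
    "How clear were the instructor's explanations?",
]

# Each question becomes a sparse TF-IDF vector over the vocabulary.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(questions)

# Project the questions into a 2-dimensional LSI space instead of the vocabulary space.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)
print(X_lsi.shape)  # (3, 2)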

3.8 Performed Tasks

TOPIC MODELLING

Analytics is all about obtaining useful Information from the data. With the growing
amount of data in recent years, which is mostly unstructured, it’s difficult to obtain the
relevant and desired information. But, with the help of technology, powerful methods
can be deployed to mine through the data and fetch the information that we are looking
for.

One such technique in the field of text mining/data mining is Topic Modelling. As the
name suggests, it is a process to automatically identify topics present in any text object
and to derive hidden patterns exhibited by the text corpus. This helps in assisting
better decision making.

Topic models are very useful for the purposes of document clustering, organizing large
blocks of textual data, information retrieval from unstructured text and feature selection.

While searching online for any news article on the web, by passing only some topics
the entire news article can be displayed.

Therefore, each news article or a document can be divided into several topics through
which that entire document can be recreated

This project deals with extracting topics for a couple of news articles and also extracts
details such as person name, location and organization for each story.

Project Credit: - Startup Byte

Dataset – The dataset contains two files Startup_data.xlsx and cities_r2.csv

The cities_r2.csv file would help in finding the cities for each startup.
The attributes of Startup_data.xlsx are:

STARTUP NEWS: The news article posted online on the website

SUMMARY: A shorthand summary of the news article

POSTED BY: Name of the person who posted the startup news

DESCRIPTION: The complete information or story of that news

The data is collected from different sources and stored into CSV files.

TASK 1: - Loading the dataset

1. Load the necessary libraries into Python.
2. Load the dataset Startup_data.xlsx into Python using a pandas data frame
and name it startup_data.
3. Print the top 5 rows of the data frame and perform exploratory analysis of
the data.
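
A minimal sketch of this loading step might look as follows (it assumes the file names given above and that the openpyxl engine for .xlsx files is installed):

# Task 1 sketch: load the startup dataset and take a first look at it.
import pandas as pd

startup_data = pd.read_excel("Startup_data.xlsx")
cities = pd.read_csv("cities_r2.csv")

print(startup_data.head())      # top 5 rows
print(startup_data.info())      # column types and missing values
print(startup_data.describe())  # basic exploratory statistics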

TASK 2: - Data Cleaning and Wrangling

1. Combine the Startup_News, Summary and Description columns into a new
column Content.
2. Convert the Content column to a list using a suitable method.
3. Clean the data by removing Unicode characters and blank spaces.
4. Make a function named clean that accepts a string and returns a string
containing only numbers, alphabets and special characters.
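
One possible version of the column merge and the clean function is sketched below. The column names are taken from the task description, and the set of allowed special characters is an assumption.

# Task 2 sketch: combine the text columns and strip unwanted characters.
import re

startup_data["Content"] = (
    startup_data["Startup_News"].fillna("") + " "
    + startup_data["Summary"].fillna("") + " "
    + startup_data["Description"].fillna("")
)
content = startup_data["Content"].tolist()

def clean(text):
    """Keep only numbers, alphabets and a few common special characters."""
    text = re.sub(r"[^\x00-\x7F]+", " ", text)            # drop Unicode characters
    text = re.sub(r"[^A-Za-z0-9 .,!?@#&%-]", " ", text)   # assumed allowed set
    return re.sub(r"\s+", " ", text).strip()              # collapse blank spaces

content = [clean(story) for story in content]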

TASK 3: - Natural Language Processing

1. After cleaning the text, apply natural language processing to each story.
2. Tokenize each story and remove stopwords.
3. Also remove punctuation marks and store the lemmatized words into the
final result.
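
A sketch of this preprocessing with NLTK (tokenization, stop-word and punctuation removal, lemmatization) could look like this; the required NLTK corpora must be downloaded first:

# Task 3 sketch: tokenize, remove stopwords/punctuation, lemmatize each story.
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(story):
    tokens = word_tokenize(story.lower())
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    return [lemmatizer.lemmatize(t) for t in tokens]

final_result = [preprocess(story) for story in content]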

TASK 4: - Text Visualization

After applying natural language processing to each text, take the top 5 stories and visualize
the most frequent words in those 5 stories using the Wordcloud library in Python.
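
For example, with the wordcloud and matplotlib packages (a sketch; it assumes the preprocessed final_result list from the previous task):

# Task 4 sketch: word cloud of the most frequent words in the top 5 stories.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

top5_text = " ".join(word for story in final_result[:5] for word in story)
cloud = WordCloud(width=800, height=400, background_color="white").generate(top5_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()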

TASK 5: - Topic Modelling

Apply the LDA topic modeling algorithm to each and every news article, extract 10 topics
for each news article, and store them into a new column corresponding to each news item.
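
A minimal Gensim LDA sketch for this step is shown below. It is illustrative only: it fits one small model per article, following the task statement literally, and keeps just the most probable word of each of the 10 topics as a compact label.

# Task 5 sketch: extract 10 LDA topics for each news article with Gensim.
from gensim import corpora
from gensim.models import LdaModel

def extract_topics(tokens, num_topics=10):
    dictionary = corpora.Dictionary([tokens])
    bow = [dictionary.doc2bow(tokens)]
    lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary, passes=10)
    # Keep the most probable word of each topic as a short label.
    return [lda.show_topic(t, topn=1)[0][0] for t in range(num_topics)]

startup_data["Topics"] = [extract_topics(story) for story in final_result]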

TASK 6: - Categorizing each news article

Categorize each story into the following ones:-

1. Games
2. Startup
3. Fund
4. Science
5. Women

Make use of the topics extracted in the above step and apply regular expressions over them
to categorize each news item.
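
A simple regular-expression categorizer might look like the sketch below; the keyword lists are assumptions, not the ones used in the project.

# Task 6 sketch: assign a category based on keywords found in the topics.
import re

CATEGORY_PATTERNS = {
    "Games":   r"game|gaming|esport",
    "Fund":    r"fund|invest|capital|seed",
    "Science": r"science|research|tech",
    "Women":   r"women|female|woman",
}

def categorize(topics):
    text = " ".join(topics).lower()
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.search(pattern, text):
            return category
    return "Startup"   # default category

startup_data["Category"] = startup_data["Topics"].apply(categorize)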

TASK 7: - Finding more insights of data

Use NLP to find the person name, location and organization name for each news article.

(Hint: For person name and organization, make use of Named Entity Recognition (NER),
whereas for location make use of the csv file cities_r2.csv.)
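
A sketch of this step using NLTK's named entity chunker, with the cities file as a lookup table for locations, is given below. The column name name_of_city is an assumption about cities_r2.csv.

# Task 7 sketch: extract person/organization names via NER and match cities.
import nltk

nltk.download("maxent_ne_chunker"); nltk.download("words")
nltk.download("averaged_perceptron_tagger")

city_names = set(cities["name_of_city"].str.lower())   # assumed column name

def extract_entities(story_text):
    persons, orgs = [], []
    tokens = nltk.word_tokenize(story_text)
    for chunk in nltk.ne_chunk(nltk.pos_tag(tokens)):
        if hasattr(chunk, "label"):
            name = " ".join(word for word, _ in chunk.leaves())
            if chunk.label() == "PERSON":
                persons.append(name)
            elif chunk.label() == "ORGANIZATION":
                orgs.append(name)
    locations = [t for t in tokens if t.lower() in city_names]
    return persons, orgs, locations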

Chapter 4

Internship Outcomes

4.1 Problem & Solution


What is Machine Learning? We can read authoritative definitions of machine learning, but
really, machine learning is defined by the problem being solved. Therefore the best way to
understand machine learning is to look at some example problems.
In this section we will first look at some well-known and understood examples of machine
learning problems in the real world. We will then look at a taxonomy (naming system) for
standard machine learning problems and learn how to identify a problem as one of these
standard cases. This is valuable, because knowing the type of problem we are facing allows
us to think about the data we need and the types of algorithms to try.

4.1.1 Problems with their Solutions


Machine learning problems abound. They make up core or difficult parts of the
software you use on the web or on your desktop every day. Think of the “do you want to
follow” suggestions on Twitter and the speech understanding in Apple’s Siri.

Below are 10 examples of machine learning that really ground what machine learning is all
about.

 Spam Detection: Given email in an inbox, identify those email messages that are spam
and those that are not. Having a model of this problem would allow a program to leave non-
spam emails in the inbox and move spam emails to a spam folder. We should all be familiar
with this example.
 Credit Card Fraud Detection: Given credit card transactions for a customer in a
month, identify those transactions that were made by the customer and those that were
not. A program with a model of this decision could refund those transactions that were
fraudulent.
 Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each
handwritten character. A model of this problem would allow a computer program to read
and understand handwritten zip codes and sort envelopes by geographic region.
 Speech Understanding: Given an utterance from a user, identify the specific
request made by the user. A model of this problem would allow a program to

understand and make an attempt to fulfil that request. The iPhone with Siri has this
capability.
 Face Detection: Given a digital photo album of many hundreds of digital photographs,
identify those photos that include a given person. A model of this decision process would
allow a program to organize photos by person. Some cameras and software like iPhoto have
this capability.
 Product Recommendation: Given a purchase history for a customer and a large
inventory of products, identify those products in which that customer will be
interested and likely to purchase. A model of this decision process would allow a
program to make recommendations to a customer and motivate product purchases.
Amazon has this capability. Also think of Facebook, GooglePlus and LinkedIn that
recommend users to connect with you after you sign-up.
 Medical Diagnosis: Given the symptoms exhibited in a patient and a
database of anonymized patient records, predict whether the patient is likely to have
an illness. A model of this decision problem could be used by a program to provide
decision support to medical professionals.
 Stock Trading: Given the current and past price movements for a stock, determine
whether the stock should be bought, held or sold. A model of this decision problem
could provide decision support to financial analysts.
 Customer Segmentation: Given the pattern of behaviour by a user during a trial
period and the past behaviours of all users, identify those users that will convert to
the paid version of the product and those that will not. A model of this decision
problem would allow a program to trigger customer interventions to persuade the
customer to convert early or better engage in the trial.
 Shape Detection: Given a user hand drawing a shape on a touch screen and a
database of known shapes, determine which shape the user was trying to draw. A
model of this decision would allow a program to show the platonic version of that
shape the user drew to make crisp diagrams. The Instaviz iPhone app
does this.
These 10 examples give a good sense of what a machine learning problem looks like. There
is a corpus of historic examples, there is a decision that needs to be modelled and a business
or domain benefit to having that decision modelled and efficaciously made automatically.

Some of these problems are some of the hardest problems in Artificial Intelligence, such as
Natural Language Processing and Machine Vision (doing things that humans do easily).
Others are still difficult, but are classic examples of machine learning such as spam
detection and credit card fraud detection.

Think about some of your interactions with online and offline software in the last week. I’m
sure you could easily guess at another ten or twenty examples of machine learning you have
directly or indirectly used.

4.2 Learning Outcomes

4.2.1 Python Programming

Python is a powerful multi-purpose programming language created by Guido van


Rossum.

It has simple easy-to-use syntax, making it the perfect language for someone trying to
learn computer programming for the first time.

This is a comprehensive guide on how to get started in Python, why you should learn it
and how you can learn it.

However, if you have knowledge of other programming languages and want to quickly
get started with Python, visit Python tutorial page.

4.2.2 NumPy
NumPy is the fundamental package for scientific computing with Python. It contains
among other things:
 a powerful N-dimensional array object
 sophisticated (broadcasting) functions
 tools for integrating C/C++ and Fortran code
 useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions.
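
A tiny example of the N-dimensional array and broadcasting features mentioned above (illustrative only):

# Small NumPy sketch: array creation, broadcasting and linear algebra.
import numpy as np

a = np.arange(6).reshape(2, 3)      # 2x3 N-dimensional array
b = np.array([10, 20, 30])

print(a + b)                        # broadcasting: b is added to each row of a
print(a.mean(axis=0))               # column-wise mean
print(np.linalg.norm(b))            # linear algebra helper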

Getting Started
To install NumPy, we strongly recommend using a scientific Python distribution. See
Installing the SciPy Stack for details.
Many high quality online tutorials, courses, and books are available to get started with
NumPy. For a quick introduction to NumPy we provide the NumPy Tutorial. We also
recommend the SciPy Lecture Notes for a broader introduction to the scientific Python
ecosystem.
For more information on the SciPy Stack (for which NumPy provides the
fundamental array data structure), see scipy.org.

Documentation
The most up-to-date NumPy documentation can be found at Latest (development)
version. It includes a user guide, full reference documentation, a developer guide, meta
information, and “NumPy Enhancement Proposals” (which include the NumPy
Roadmap and detailed plans for major new features).
A complete archive of documentation for all NumPy releases (minor versions; bug
fix releases don’t contain significant documentation changes) since 2009 can be
found at https://numpy.org/doc/
NumPy Enhancement Proposals (NEPs) can be found at https://numpy.org/neps

Support NumPy
If you have found NumPy to be useful in your work, research or company, please
consider making a donation to the project commensurate with your resources. Any
amount helps! All donations will be used strictly to fund the development of NumPy’s
open source software, documentation and community.
NumPy is a Sponsored Project of NumFOCUS, a 501(c)(3) nonprofit charity in the
United States. NumFOCUS provides NumPy with fiscal, legal, and administrative
support to help ensure the health and sustainability of the project. Visit
numfocus.org for more information.
Donations to NumPy are managed by NumFOCUS. For donors in the United States,
your gift is tax-deductible to the extent provided by law. As with any donation, you
should consult with your tax adviser about your particular tax situation.
NumPy’s Steering Council will make the decisions on how to best use any funds
received. Technical and infrastructure priorities are documented on the NumPy
Roadmap.

4.2.3 Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-
use data structures and data analysis tools for the Python programming language.
pandas is a NumFOCUS sponsored project. This will help ensure the success of
development of pandas as a world-class open-source project, and makes it possible to
donate to the project.

v0.25.1 Final (August 22, 2019)

This is a minor bug-fix release in the 0.25.x series and includes some regression fixes
and bug fixes. We recommend that all users upgrade to this version.
See the full whatsnew for a list of all the changes.
The release can be installed with conda from the defaults and conda-forge
channels:
conda install pandas
Or via PyPI:
python -m pip install --upgrade pandas
v0.25.0 Final (July 18, 2019)
This is a major release from 0.24.2 and includes a number of API changes, new features,
enhancements, and performance improvements along with a large number of bug fixes.
Highlights include:
 Dropped Python 2 support

 Groupby aggregation with relabeling


 Better repr for MultiIndex
 Better truncated repr for Series and DataFrame
 Series.explode to split list-like values to rows
The release can be installed with conda from conda-forge or the default channel:
conda install pandas
Or via PyPI:
python3 -m pip install --upgrade pandas
See the full whatsnew for a list of all the changes.
Best way to Install
The best way to get pandas is via conda
conda install pandas
Packages are available for all supported python versions on Windows, Linux, and
MacOS.
Wheels are also uploaded to PyPI and can be installed with
pip install pandas
Quick vignette
What problem does pandas solve?
Python has long been great for data munging and preparation, but less so for data
analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire
data analysis workflow in Python without having to switch to a more domain specific
language like R.
Combined with the excellent IPython toolkit and other libraries, the environment for
doing data analysis in Python excels in performance, productivity, and the ability to
collaborate.
pandas does not implement significant modeling functionality outside of linear and
panel regression; for this, look to statsmodels and scikit-learn. More work is still needed
to make Python a first class statistical modeling environment, but we are well on our
way toward that goal.
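
A short pandas sketch of the kind of data-munging workflow described above (made-up data, purely illustrative):

# Small pandas sketch: build a DataFrame, filter rows and aggregate.
import pandas as pd

df = pd.DataFrame({
    "city":  ["Jaipur", "Noida", "Jaipur", "Hyderabad"],
    "sales": [120, 80, 150, 200],
})

# Filter, then group and aggregate: a typical preparation step before modeling.
big = df[df["sales"] > 100]
print(big.groupby("city")["sales"].sum())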

4.2.4 Data Visualisation
Data visualization is a general term that describes any effort to help people understand
the significance of data by placing it in a visual context. Patterns, trends and correlations
that might go undetected in text-based data can be exposed and recognized more easily with
data visualization software.

Today's data visualization tools go beyond the standard charts and graphs used in
Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as
infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar,
pie and fever charts. The images may include interactive capabilities, enabling users to
manipulate them or drill into the data for querying and analysis. Indicators designed to
alert users when data has been updated or predefined conditions occur can also be
included.

Importance of data visualization

Data visualization has become the de facto standard for modern business intelligence
(BI). The success of the two leading vendors in the BI space, Tableau and Qlik -- both of
which heavily emphasize visualization -- has moved other vendors toward a more visual
approach in their software. Virtually all BI software has strong data visualization
functionality.

Data visualization tools have been important in democratizing data and analytics and
making data-driven insights available to workers throughout an organization. They are
typically easier to operate than traditional statistical analysis software or earlier versions
of BI software. This has led to a rise in lines of business implementing data visualization
tools on their own, without support from IT.

Data visualization software also plays an important role in big data and advanced
analytics projects. As businesses accumulated massive troves of data during the early
years of the big data trend, they needed a way to quickly and easily get an overview of
their data. Visualization tools were a natural fit.

Visualization is central to advanced analytics for similar reasons. When a data
scientist is writing advanced predictive analytics or machine learning algorithms, it
becomes important to visualize the outputs to monitor results and ensure that models are
performing as intended. This is because visualizations of complex algorithms are
generally easier to interpret than numerical outputs.

Examples of data visualization

Data visualization tools can be used in a variety of ways. The most common use today is
as a BI reporting tool. Users can set up visualization tools to generate
automatic dashboards that track company performance across key performance
indicators and visually interpret the results.

Many business departments implement data visualization software to track their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through
rate and conversion rate.

As data visualization vendors extend the functionality of these tools, they are
increasingly being used as front ends for more sophisticated big data environments. In
this setting, data visualization software helps data engineers and scientists keep track of
data sources and do basic exploratory analysis of data sets prior to or after more detailed
advanced analyses.
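
As a small illustration of the kind of chart such tools produce, here is a basic matplotlib example with made-up campaign metrics:

# Minimal visualization sketch: a bar chart of email campaign metrics.
import matplotlib.pyplot as plt

metrics = {"Open rate": 0.42, "Click-through": 0.11, "Conversion": 0.03}

plt.bar(list(metrics.keys()), list(metrics.values()))
plt.ylabel("Rate")
plt.title("Email campaign performance")
plt.show()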

4.2.5 Basic Stats & Regression Models


What is Regression?
• Regression is a statistical way to establish a relationship between a dependent variable
and a set of independent variable(s). e.g., if we say that Age = 5 + Height * 10 + Weight
* 13
• Here we are establishing a relationship between Height & Weight of a person with his/
her Age. This is a very basic example of Regression.
• Here Age is a dependent variable which depends upon height and weight.
• Height and Weight are independent variables, i.e., they do not depend upon any other
variable. In other words, we predict the value of the dependent variable using the
independent variables.

What is Linear Regression?


• “Linear Regression” is a statistical method to regress the data with dependent variable
having continuous values whereas independent variables can have either continuous or
categorical values.
• It is a method to predict dependent variable (Y) based on values of independent
variables (X).
• This technique can be used for the cases where we need to predict some continuous
quantity. E.g., Predicting traffic in a retail store, predicting rainfall in a region.

Multiple Linear Regression
• If we have more than one independent variable, the procedure for fitting a best-fit line
is known as “Multiple Linear Regression”.
• Fundamentally there is no difference between ‘Simple’ and ‘Multiple’ linear regression.
Both work on the OLS principle, and the procedure to get the best line is also similar. In
the case of the latter, the regression equation takes a shape like: Y = B0 + B1X1 + B2X2 + B3X3 + ...
• Where Bi are the different coefficients and Xi the various independent variables.
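
A short scikit-learn sketch of fitting such a multiple linear regression is given below. The Age/Height/Weight data is synthetic, generated only to illustrate the form Y = B0 + B1X1 + B2X2 used above.

# Multiple linear regression sketch (synthetic, illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
height = rng.uniform(1.0, 1.8, 100)                            # independent variable X1
weight = rng.uniform(20, 80, 100)                              # independent variable X2
age = 5 + 10 * height + 13 * weight + rng.normal(0, 1, 100)    # dependent variable Y

X = np.column_stack([height, weight])
model = LinearRegression().fit(X, age)
print("Intercept (B0):", model.intercept_)
print("Coefficients (B1, B2):", model.coef_)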

4.2.6 Machine Learning & ML Project


What is machine learning?
It is as much about ‘Learning’ as it is about ‘Machines’.
• Getting computers to program themselves.
• For simplicity, some literature defines it as ‘Automation+’, meaning ‘automating the
automation’.
• Machine learning uses algorithms that learn from data, continuously improving the
prediction of future consumer behaviour, with increasing levels of forecast accuracy as
the volumes of data increase.
“Learning is a process by which a system improves its performance by experience.” -
Herbert Simon. Definition by Tom Mitchell (1998): Machine Learning is the study of
algorithms that improve their performance P at some task T with experience E.

ML in a Nutshell: A Hard Nut to Crack


Machine learning teaches computers to do what comes naturally to humans and animals:
learn from experience. Machine learning algorithms use computational methods to
“learn” information directly from data without relying on a predetermined equation as a
model. The algorithms adaptively improve their performance as the number of samples
available for learning increases.

4.2.7 NLP & NLP Project


Natural language processing (NLP) is a subfield of linguistics, computer
science, information engineering, and artificial intelligence concerned with the
interactions between computers and human (natural) languages, in particular how to
program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition,
natural language understanding, and natural language generation.
Everything that we express, either verbally or in writing, carries huge amounts of
information. The topic that we choose, our selection of words, our tone: everything adds
to some type of information that can be interpreted, and some value can be extracted from
it.

Theoretically, we can understand and even predict human behavior using that
information. But there is one problem: a person may generate hundreds or thousands
of words in a declaration, each sentence with its corresponding complexity. If one wants
to scale up and analyze several hundreds, thousands or millions of people or
declarations in a given geography, then the situation becomes daunting and unmanageable.
The data that is being generated from different conversations, declarations or even
tweets is a type of unstructured data. Unstructured data can’t be represented in the
row-and-column structure of relational databases, but the irony is that most of the world’s
data is unstructured. It is messy and hard to manipulate. According to stats, 95% of the
world’s data is unstructured, which can’t be used for analysis and is regarded as dark
data.

Areas of Use
Simply, NLP can be used for automatic handling of natural human language like speech
or text. NLP can be used for recognizing and prediction of diseases based on e - health
records and patient’s own speech. This capability has been explored in severe health
conditions that go from cardiovascular diseases to depression and even schizophrenia. It
enables organizations to easily determine what customers are saying about a service or
product by identifying and extracting information in sources like social media using
sentiment analysis. This analysis can provide a lot of information about the customer’s
choices and their decisions. Also, an inventor at IBM developed a cognitive assistant
using NLP that works like a personalized search engine. It learns all about you and then
reminds you of a name, a song, or anything that you can’t remember the moment you
need it. Companies like Yahoo and Google filter and classify your emails as SPAM or
HAM (non-SPAM) using NLP, thereby protecting our privacy and security from hackers.
The NLP Group at MIT developed a new system to determine fake sources by identifying
whether a source is accurate or politically biased, thereby detecting if a news source can be trusted
or not. Amazon’s Alexa and Google Home are examples of intelligent voice driven
interfaces that extensively use NLP to respond to vocal prompts and do everything like
finding a particular shop, telling us the weather, suggesting the best route to a place or
controlling lights at home. NLP is also being used in talent identification, recruitment and
automated report generation or minutes of meetings. NLP is booming in the healthcare
industry. This technology is used in improving care delivery, disease diagnosis and
bringing down costs. Question answering, as used by IBM Watson to answer a query, is
another application.

5.1 Conclusion
This report has introduced Machine Learning and Natural Language Processing. Now I
know that Machine Learning is a technique of training machines to perform the activities
a human brain can do, albeit a bit faster and better than an average human being. Today
we have seen that machines can beat human champions in games considered very complex,
such as Chess and Go (for example, AlphaGo). We have seen that machines can be trained
to perform human activities in several areas and can aid humans in living better lives.
Machine Learning can be Supervised or Unsupervised. If we have a smaller amount of
data that is clearly labelled for training, we opt for Supervised Learning. Unsupervised
Learning would generally give better performance and results for large data sets. If we
have a huge data set easily available, we go for deep learning techniques. I have also
learned about Reinforcement Learning and Deep Reinforcement Learning, and I now know
what Neural Networks are, along with their applications and limitations.
Finally, when it comes to the development of machine learning models of our own, I
looked at the choices of various development languages, IDEs and platforms. The next
thing to do is to start learning and practicing each machine learning technique. The
subject is vast; that means there is width, but if you consider the depth, each topic can
be learned in a few hours. Each topic is independent of the others. I need to take one
topic at a time, learn it, practice it and implement the algorithm(s) in it using a language
of my choice. This is the best way to start studying Machine Learning. Practicing one
topic at a time, very soon one would acquire the width that is eventually required of a
Machine Learning expert.

5.2 Future Scope


 Improved cognitive services
With the help of machine learning services like SDKs and APIs, developers are able to
include and hone the intelligent capabilities into their applications. This will empower
machines to apply the various things they come across, and accordingly carry out an
array of duties like vision recognition, speech detection, and understanding of speech
and dialect. Alexa is already talking to us, and our phones are already listening to our
conversations— how else do you think the machine “wakes up” to run a google search
on 9/11 conspiracies for you? Those improved cognitive skills are something we could
not have ever imagined happening a decade ago, yet, here we are. Being able to engage
humans efficiently is under constant alteration to serve and understand the human
species better. We already spend so much time in front of screens that our mobiles have
become an extension of us, and through cognitive learning, it has literally become the
case. Your machine learns all about you, and then accordingly alters your results. No
two people’s Google search results are the same: why? Cognitive learning.
 The Rise of Quantum Computing
“Quantum computing”— sounds like something straight out of a science fiction movie,
no? But it has become a genuine phenomenon. Satya Nadella, the chief executive of
Microsoft Corp., calls it one of the three technologies that will reshape our
world. Quantum algorithms have the potential to transform and innovate the field of
machine learning. It could process data at a much faster pace and accelerate the ability to
draw insights and synthesize information.

Heavy-duty computation will finally be done in a jiffy, saving so much of time and
resources. The increased performance of machines will open so many doorways that will
elevate and take evolution to the next level. Something as basic as two numbers, 0 and 1,
changed the way of the world; imagine what could be achieved if we ventured into a
whole new realm of computers and physics.

 Rise of Robots
With machine learning on the rise, it is only natural that the medium gets a face on it—
robots! The sophistication of machine learning is not a ‘small wonder’ if you know
what I mean.

Multi-agent learning, robot vision and self-supervised learning will all be accomplished
through robotisation. Drones have already become a normality, and have now even
replaced human delivery men. With the rapid speed at which technology is moving forward, even
the sky is not the limit. Our childhood fantasies of living in an era of the Jetsons will
soon become reality. The smallest of tasks will be automated, and human beings will no
longer have to be self-reliant because you will have a bot following you like a shadow
at all times.

Career Opportunities in the field?


Now that you are aware of the reach of machine learning and how it can single-handedly
change the course of the world, how can you become a part of it?

Here are some job options that you can potentially think of opting –

1. Machine Learning Engineer – They are sophisticated programmers who develop the
systems and machines that learn and apply knowledge without having any specific
lead or direction.

2. Deep Learning Engineer – Similar to computer scientists, they specialise in using deep
learning platforms to develop tasks related to artificial intelligence. Their main goal is
to be able to mimic and emulate brain functions.
3. Data Scientist – Someone who extracts meaning from data and analyses and
interprets it. It requires both methods, statistics, and tools.
4. Computer Vision Engineer – They are software developers who
create vision algorithms for recognising patterns in images.
Machine learning already is and will change the course of the world in
the coming decade. Let’s eagerly prep and wait for what the future
awaits. Let’s hope that machines do not get the bright idea of taking
over the world, because not all of us are Arnold Schwarzenegger.
Fingers crossed!

Reference

www.edugrad.com

www.google.com

www.python.org

www.wikipedia.org

www.tutorialspoint.com

