
Mini Project

on

“Language Detector”
Submitted in partial fulfillment of the requirements

of the degree of

Bachelor of Engineering (Sem-VIII)


By

1. Tushar Wankhede (Roll.No. 75)
2. Devesh Upadhayay (Roll.No. 70)
3. Abhishek Singh (Roll.No. 64)
4. Prasad Shinde (Roll.No. 62)

Supervisor:

Dr. D. R. Ingle

Department of Computer Engineering


Bharati Vidyapeeth College of Engineering, Navi Mumbai
C.B.D Belapur, Navi Mumbai-400614

(Affiliated to University of Mumbai)

Academic Year 2021-2022


Department of Computer Engineering

Bharati Vidyapeeth College of Engineering, Navi Mumbai

CERTIFICATE
This is to certify that

1. Tushar Wankhede (Roll.No. 75)
2. Devesh Upadhayay (Roll.No. 70)
3. Abhishek Singh (Roll.No. 64)
4. Prasad Shinde (Roll.No. 62)

have satisfactorily completed the requirements of the mini project entitled

“Language Detector”
as prescribed by the University of Mumbai, for the award of the degree of Bachelor
of Engineering in Computer Engineering

Dr. D. R. Ingle, Head of Department

Dr. Sandhya Jadhav, Principal


TABLE OF CONTENTS

Sr No   Title

1       Introduction
1.1     Problem Definition
1.2     Scope of project
1.3     Users and their requirements
1.4     Technology to be used
2       Literature Survey
3       Conceptual system design
4       Implementation and Evaluation
5       Conclusion and Future Scope
6       References

ACKNOWLEDGEMENT
I take this opportunity to express my deepest gratitude and appreciation to all those who have helped me
directly or indirectly towards the successful completion of this dissertation report.
It is a great pleasure and a moment of immense satisfaction for me to express my profound gratitude to my
dissertation Project Guide, Prof. D. R. Ingle, whose constant encouragement enabled me to work enthusiastically.
His perpetual motivation, patience, and expertise in discussions during the progress of the dissertation work
have benefited me to an extent that is beyond expression. I am highly indebted to him for his invaluable
guidance and ever-ready support in the successful and timely completion of this dissertation. Working under his
guidance has been a fruitful and unforgettable experience. Despite his busy schedule, he was always available to
give me advice, support, and guidance during the entire period of my project. The completion of this project would
not have been possible without his encouragement, patient guidance, and constant support. I express my deepest
sense of gratitude and thanks to Prof. D. R. Ingle for his continuous support and guidance throughout this work.
I am thankful to Prof. D. R. Ingle, Head of the Computer Engineering Department, for his guidance,
encouragement, and support during my project. He was instrumental in making available all the needed
resources throughout my project, and I am highly indebted to him for his kind support.
I am also thankful to Dr. Sandhya Jadhav, Principal, for the encouragement, for providing an outstanding
academic environment, and for providing adequate facilities. I acknowledge all the staff members of the
Department of Computer Engineering for their valuable guidance, their interest, and their valuable suggestions.
No words are sufficient to express my gratitude to my beloved parents for their unwavering
encouragement in every work. I also thank all my friends for being a constant source of support.
Name : Tushar Wankhede (Roll.No. 75)
Devesh Upadhayay (Roll.No. 70)
Abhishek Singh (Roll.No. 64)
Prasad Shinde (Roll.No. 62)
Introduction:
Natural Language Processing (NLP) is the science of dealing with human language or
text data. One NLP application is Language Identification, a technique used to determine
the language of text documents. Many real-world applications, such as chatbots, comment
sections, and feedback forums, hold a lot of data in unstructured form and in several
different languages at once. It is important to analyze this data and extract essential
information from it in order to boost revenue, gain insights, or improve customer
support. Before anyone can analyze the data, however, the language it is written in
must be recognized. Another area of application is online video conferencing, where
speech in one language must be identified so that it can be translated into another.
For all these applications, the development of a language identifier is extremely
important.
About the dataset:

In this project we use the Language Detection dataset available on Kaggle. It is a small
language detection dataset that contains text samples for 17 different languages, which
lets us create an NLP model for predicting 17 different languages.

Languages

1) English
2) Malayalam
3) Hindi
4) Tamil
5) Kannada
6) French
7) Spanish
8) Portuguese
9) Italian
10) Russian
11) Swedish
12) Dutch
13) Arabic
14) Turkish
15) German
16) Danish
17) Greek
Using the text, we have to create a model that can predict the language of a given text.
Such a model is useful in many artificial intelligence applications and to computational linguists.
Prediction systems of this kind are widely used in electronic devices such as mobile phones
and laptops for machine translation, and also in robots. They also help in tracking and
identifying multilingual documents.
ALGORITHM USED FOR MODEL CREATION :

We use the Multinomial Naive Bayes algorithm (from scikit-learn's naive_bayes module) to
create our model. Multinomial Naive Bayes is a probabilistic learning method that is widely
used in Natural Language Processing (NLP). It is based on Bayes' theorem and predicts the tag
of a text, such as an email or a newspaper article: it calculates the probability of each tag
for a given sample and outputs the tag with the highest probability.
Naive Bayes is a family of algorithms that share one common principle: given the class, each
feature is assumed to be independent of every other feature. In other words, the presence or
absence of one feature is assumed not to affect the presence or absence of any other
feature.
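
To make the scoring rule concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is an illustration only, not the scikit-learn implementation used in this project. The function names train_multinomial_nb and predict_multinomial_nb, the smoothing parameter alpha, and the toy sentences are assumptions introduced for this example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({word for doc in docs for word in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {}
    log_likelihood = {}
    for label, n_docs in class_doc_counts.items():
        # Prior: fraction of training documents that carry this label.
        log_prior[label] = math.log(n_docs / len(docs))
        # Laplace smoothing: add alpha to every word count.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_likelihood[label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
    return log_prior, log_likelihood


def predict_multinomial_nb(doc, log_prior, log_likelihood):
    """Return the label with the highest log score for a tokenized doc."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for word in doc:
            # Words never seen in training are skipped (a simplification).
            if word in log_likelihood[label]:
                score += log_likelihood[label][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training data (not taken from the Kaggle dataset).
toy_docs = [
    "the cat sat on the mat".split(),
    "the dog is in the house".split(),
    "le chat est sur le tapis".split(),
    "le chien est dans la maison".split(),
]
toy_labels = ["English", "English", "French", "French"]

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
print(predict_multinomial_nb("the house".split(), log_prior, log_likelihood))
print(predict_multinomial_nb("le chat".split(), log_prior, log_likelihood))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when a text contains many words.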

Implementation

Importing libraries and dataset

So let’s get started. First of all, we will import all the required libraries.

import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.simplefilter("ignore")

Now let’s import the language detection dataset

data = pd.read_csv("Language Detection.csv")
data.head(10)


This dataset contains text samples for 17 different languages, so let's count the number
of samples for each language.

data["Language"].value_counts()

Output :

English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: Language, dtype: int64

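
The counts above are uneven across languages. As a rough, hedged sketch of how that imbalance could be quantified, the snippet below copies the counts from the output into a plain dictionary (the variable names are assumptions for illustration) and computes each language's share of the dataset.

```python
# Sample counts copied from the value_counts() output above
# (label spellings are kept exactly as they appear in the dataset).
language_counts = {
    "English": 1385, "French": 1014, "Spanish": 819, "Portugeese": 739,
    "Italian": 698, "Russian": 692, "Sweedish": 676, "Malayalam": 594,
    "Dutch": 546, "Arabic": 536, "Turkish": 474, "German": 470,
    "Tamil": 469, "Danish": 428, "Kannada": 369, "Greek": 365, "Hindi": 63,
}

total = sum(language_counts.values())
shares = {lang: count / total for lang, count in language_counts.items()}

# Languages with the most and the fewest samples.
largest = max(language_counts, key=language_counts.get)
smallest = min(language_counts, key=language_counts.get)

print(total)  # 10337
print(f"{largest}: {shares[largest]:.1%}")
print(f"{smallest}: {shares[smallest]:.1%}")
```

A class with very few samples, such as Hindi here, may end up with only a handful of rows in the test set, so its per-class scores should be read with caution.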
Separating Independent and Dependent features

Now we can separate the dependent and independent variables. Here the text data is the
independent variable and the language name is the dependent variable.

X = data["Text"]
y = data["Language"]
Label Encoding

Our output variable, the name of the language, is a categorical variable. To train the model
we have to convert it into numerical form, so we perform label encoding on the output
variable. For this, we import LabelEncoder from sklearn.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
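
As a rough illustration of what label encoding does, the pure-Python sketch below maps each distinct label to an integer index in sorted order, which is the ordering LabelEncoder uses. The helper fit_label_mapping and the sample labels are assumptions made for this example and are not part of the project code.

```python
def fit_label_mapping(labels):
    """Return the sorted class names and a name-to-index lookup."""
    classes = sorted(set(labels))
    to_index = {name: i for i, name in enumerate(classes)}
    return classes, to_index


# Made-up sample of language labels.
sample_labels = ["English", "Hindi", "Tamil", "English", "Greek"]

classes, to_index = fit_label_mapping(sample_labels)
encoded = [to_index[name] for name in sample_labels]
decoded = [classes[i] for i in encoded]  # reverse lookup

print(classes)  # ['English', 'Greek', 'Hindi', 'Tamil']
print(encoded)  # [0, 2, 3, 0, 1]
```

With the fitted scikit-learn encoder, le.classes_ holds the class names and le.inverse_transform turns predicted integers back into language names.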
Text Preprocessing

This dataset was created by scraping Wikipedia, so it contains many unwanted symbols and
numbers that will affect the quality of our model. We should therefore apply text
preprocessing techniques.
# list that collects the preprocessed text
data_list = []
# iterate through every text sample
for text in X:
    # replace symbols, digits, and newlines with spaces
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # replace square brackets with spaces
    text = re.sub(r'[\[\]]', ' ', text)
    # convert the text to lower case
    text = text.lower()
    # append the cleaned text to data_list
    data_list.append(text)
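
The same cleaning steps can be wrapped in a small function so they can be reused on new input at prediction time. This is a sketch under the assumption that the intended patterns remove newlines and square brackets; the name clean_text and the sample sentence are hypothetical.

```python
import re


def clean_text(text):
    """Replace symbols, digits, newlines, and brackets with spaces; lower-case."""
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    text = re.sub(r'[\[\]]', ' ', text)
    return text.lower()


# Made-up sample with a citation marker, digits, and punctuation.
sample = "Nature [1], in 2021!"
print(clean_text(sample).split())  # ['nature', 'in']
```

Each removed character becomes a single space, so the cleaned string keeps its length and the extra whitespace is ignored later during tokenization.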

Bag of Words

Not only the output feature but also the input feature must be in numerical form. So we
convert the text into numerical form by creating a Bag of Words model using CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
X.shape # (10337, 39419)
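
To show what a Bag of Words matrix contains, here is a minimal pure-Python sketch that builds a vocabulary and counts word occurrences per text. It splits on whitespace only, which is a simplification: CountVectorizer applies its own tokenization rules. The helper names and the toy texts are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({token for text in texts for token in text.split()})
    return {token: idx for idx, token in enumerate(vocab)}


def to_count_vector(text, vocab_index):
    """Return one row of the matrix: the count of each vocabulary token."""
    counts = Counter(text.split())
    return [counts.get(token, 0) for token in vocab_index]


# Made-up texts, already lower-cased.
toy_texts = ["the cat sat", "the cat saw the dog"]
vocab_index = build_vocabulary(toy_texts)
matrix = [to_count_vector(text, vocab_index) for text in toy_texts]

print(list(vocab_index))  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)             # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row corresponds to one text and each column to one vocabulary word, which is why the real matrix has one column per distinct token in the dataset.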

Train Test Splitting

We have preprocessed our input and output variables. The next step is to create the training
set, for training the model, and the test set, for evaluating the model. For this we use
train_test_split.

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
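
The idea behind the split can be sketched in pure Python: shuffle the row indices and hold out a fraction of them for testing. The function simple_train_test_split and its seed parameter are assumptions for illustration, not the scikit-learn implementation.

```python
import math
import random


def simple_train_test_split(features, targets, test_size=0.20, seed=0):
    """Shuffle indices with a fixed seed and hold out a test_size fraction."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_tr = [features[i] for i in train_idx]
    x_te = [features[i] for i in test_idx]
    y_tr = [targets[i] for i in train_idx]
    y_te = [targets[i] for i in test_idx]
    return x_tr, x_te, y_tr, y_te


# Made-up data: ten samples with alternating labels.
toy_x = [f"sample {i}" for i in range(10)]
toy_y = [i % 2 for i in range(10)]

x_tr, x_te, y_tr, y_te = simple_train_test_split(toy_x, toy_y)
print(len(x_tr), len(x_te))  # 8 2
```

Fixing the seed makes the split repeatable. In scikit-learn, train_test_split accepts random_state for the same purpose and stratify to preserve class proportions, which may be worth considering here because Hindi has only 63 samples.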

Model Training and Prediction

We are almost there: the model creation part. We use the Multinomial Naive Bayes algorithm
(MultinomialNB from sklearn.naive_bayes) for our model and train it on the training set.

from sklearn.naive_bayes import MultinomialNB


model = MultinomialNB()
model.fit(x_train, y_train)

So we’ve trained our model using the training set. Now let’s predict the output for the test set.

y_pred = model.predict(x_test)

Model Evaluation

Now we can evaluate our model

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy is :", ac)

Output :

Accuracy is : 0.9772727272727273
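
For readers who want to see what the two metrics compute, here is a small pure-Python sketch on made-up labels. The functions accuracy and confusion_counts are hypothetical helpers written for this example. Rows of the matrix are true classes and columns are predicted classes, the same orientation scikit-learn uses.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up integer labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

print(accuracy(toy_true, toy_pred))
print(confusion_counts(toy_true, toy_pred, 3))  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

The diagonal of the matrix holds the correct predictions, so off-diagonal entries show which pairs of classes get confused with each other.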
