Language Detector: Bachelor of Engineering (Sem-VIII)

Mini Project
on
“Language Detector”
Submitted in partial fulfillment of the requirements
of the degree of
Bachelor of Engineering (Sem-VIII)

By
1. Tushar Wankhede (Roll.No. 75)
2. Devesh Upadhayay (Roll.No. 70)
3. Abhishek Singh (Roll.No. 64)
4. Prasad Shinde (Roll.No. 62)
Supervisor:
Dr. D. R. Ingle
Department of Computer Engineering

Bharati Vidyapeeth College of Engineering, Navi Mumbai
C.B.D Belapur, Navi Mumbai-400614
(Affiliated to University of Mumbai)
Academic Year 2021-2022

Department of Computer Engineering
Bharati Vidyapeeth College of Engineering, Navi Mumbai
CERTIFICATE
This is to certify that
1. Tushar Wankhede (Roll.No. 75)
2. Devesh Upadhayay (Roll.No. 70)
3. Abhishek Singh (Roll.No. 64)
4. Prasad Shinde (Roll.No. 62)
have satisfactorily completed the requirements of the mini project entitled
“Language Detector”
as prescribed by the University of Mumbai, for the award of the degree of Bachelor
of Engineering in Computer Engineering
Dr. D. R. Ingle, Head of Department
Dr. Sandhya Jadhav, Principal

TABLE OF CONTENTS
1 Introduction
1.1 Problem Definition
1.2 Scope of project
1.3 Users and their requirements
1.4 Technology to be used
2 Literature Survey
3 Conceptual system design
4 Implementation and Evaluation
5 Conclusion and Future Scope
6 References
ACKNOWLEDGEMENT
I take this opportunity to express my deepest gratitude and appreciation to all those who have helped me,
directly or indirectly, towards the successful completion of this dissertation report.
It is a great pleasure and a moment of immense satisfaction for me to express my profound gratitude to my
dissertation Project Guide, Prof. D. R. Ingle, whose constant encouragement enabled me to work enthusiastically.
His perpetual motivation, patience and expertise in discussions during the progress of the dissertation work
have benefited me to an extent that is beyond expression. I am highly indebted to him for his invaluable
guidance and ever-ready support in the timely completion of this dissertation. Working under his
guidance has been a fruitful and unforgettable experience. Despite his busy schedule, he was always available to
give me advice, support and guidance during the entire period of my project. The completion of this project would
not have been possible without his encouragement, patient guidance and constant support. I express my deepest
sense of gratitude and thanks to Prof. D. R. Ingle for his continuous support and guidance throughout this work.
I am thankful to Prof. D. R. Ingle, Head of the Computer Engineering Department, for his guidance,
encouragement and support during my project. I would like to mention here that he was instrumental in making
available all the needed resources throughout my project. I am highly indebted to him for his kind support.
I am also thankful to Dr. Sandhya Jadhav, Principal, for encouragement and for providing an outstanding
academic environment and adequate facilities. I acknowledge all the staff members of the
Department of Computer Engineering for their valuable guidance, their interest
and their valuable suggestions.
No words are sufficient to express my gratitude to my beloved parents for their unwavering
encouragement in every work. I also thank all my friends for being a constant source of support.
Name : Tushar Wankhede (Roll.No. 75)
Devesh Upadhayay (Roll.No. 70)
Abhishek Singh (Roll.No. 64)
Prasad Shinde (Roll.No. 62)
Introduction:
Natural Language Processing (NLP) is the field concerned with processing human language,
whether as text or as speech. One NLP application is language identification, the task of
determining which language a text document is written in. Many real-world applications, such
as chatbots, comment sections and feedback forums, hold large amounts of unstructured data in
several languages at once. Analyzing this data and extracting essential information from it
can help boost revenue, yield insights or improve customer support. Before anyone can analyze
the data, however, the language it is written in has to be recognized. Another area of
application is online video conferencing, where speech in one language must be identified so
that it can be translated into another. For all these applications, a language identifier is
an essential building block.
About the dataset:
In this project we use the Language Detection dataset available on Kaggle. It is a small
dataset containing text samples in 17 different languages, which lets us build an NLP model
that predicts which of those 17 languages a given text is written in.
Languages
1) English
2) Malayalam
3) Hindi
4) Tamil
5) Kannada
6) French
7) Spanish
8) Portuguese
9) Italian
10) Russian
11) Swedish
12) Dutch
13) Arabic
14) Turkish
15) German
16) Danish
17) Greek
Using this text, we have to create a model that can predict the language of a given input.
Language prediction is a building block for many artificial intelligence applications and is
useful to computational linguists. Such prediction systems are widely used in electronic
devices such as mobile phones and laptops for machine translation, and also in robots. They
also help in tracking and identifying multilingual documents.
ALGORITHM USED FOR MODEL CREATION :
We use the Multinomial Naive Bayes algorithm (from scikit-learn's naive_bayes module) for our
model. Multinomial Naive Bayes is a probabilistic learning method that is widely used in
Natural Language Processing (NLP). The algorithm is based on Bayes' theorem and predicts the
tag of a text, such as an email or a newspaper article. It calculates the probability of each
tag for a given sample and outputs the tag with the highest probability.
Naive Bayes is a family of classifiers that share one common principle: each feature is
assumed to be independent of every other feature, given the class. Under this assumption, the
presence or absence of one feature does not affect the presence or absence of any other
feature.
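The scoring rule described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the four toy sentences, the whitespace tokenizer and the names train_multinomial_nb and predict_nb are assumptions made for this example, and add-one (Laplace) smoothing is assumed for unseen word and class combinations.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(samples, alpha=1.0):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the label with the highest log-posterior score for the text."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # log prior of the class
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            if word not in model["vocab"]:
                continue  # words never seen in training carry no evidence
            # smoothed log likelihood of the word given the class
            score += math.log((counts[word] + alpha)
                              / (total_words + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus, only for illustration
toy_samples = [
    ("the cat is on the table", "English"),
    ("where is the station", "English"),
    ("le chat est sur la table", "French"),
    ("où est la gare", "French"),
]
toy_model = train_multinomial_nb(toy_samples)
toy_prediction = predict_nb(toy_model, "the station is here")
print(toy_prediction)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.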
Implementation
Importing libraries and dataset
So let’s get started. First of all, we will import all the required libraries.
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter("ignore")
Now let’s import the language detection dataset
data = pd.read_csv("Language Detection.csv")
data.head(10)

The dataset contains text samples in 17 different languages, so let’s count the number of
samples for each language.
data["Language"].value_counts()
Output :
English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: Language, dtype: int64
Separating Independent and Dependent features
Now we can separate the independent and dependent variables: the text is the independent
variable and the language name is the dependent variable.
X = data["Text"]
y = data["Language"]
Label Encoding
Our output variable, the name of the language, is a categorical variable. To train the model
we have to convert it into numerical form, so we perform label encoding on the output
variable. For this step, we import LabelEncoder from sklearn.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
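A small sketch of what label encoding does, using a hypothetical list of four labels rather than the project dataset. The names toy_labels, toy_encoder and toy_codes are assumptions for this example. LabelEncoder stores the classes in sorted order, so the integer codes follow alphabetical order, and inverse_transform maps codes back to language names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, only for illustration
toy_labels = ["Hindi", "English", "French", "English"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# Classes are kept in sorted order: English -> 0, French -> 1, Hindi -> 2
classes = list(toy_encoder.classes_)

# inverse_transform recovers the original language names from the codes
decoded = list(toy_encoder.inverse_transform(toy_codes))

print(classes)
print(list(toy_codes))
print(decoded)
```

Keeping the fitted encoder around matters later, because the model predicts integer codes that have to be turned back into language names.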
Text Preprocessing
This dataset was created by scraping Wikipedia, so it contains many unwanted symbols and
numbers that would affect the quality of our model. We therefore apply text preprocessing
before training.
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols, newlines and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
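The cleaning step can also be wrapped in a function so that exactly the same rules are applied to new text at prediction time. This is a sketch under the assumption that the character class is meant to remove newlines, so it uses an escaped \n (a bare n inside the class would strip the letter "n" from every word) and escaped square brackets. The name clean_text is hypothetical.

```python
import re


def clean_text(text):
    """Replace symbols, newlines, digits and square brackets with spaces, then lowercase."""
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    text = re.sub(r'[\[\]]', ' ', text)
    return text.lower()


cleaned = clean_text('Hello, World [1]!\nBonjour')
# Splitting on whitespace shows the words that survive cleaning
print(cleaned.split())
```

With such a helper, the loop over X reduces to data_list = [clean_text(text) for text in X].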
Bag of Words
The input feature, like the output feature, must be in numerical form. We therefore convert
the text into numerical form by creating a Bag of Words model using CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
X.shape # (10337, 39419)
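To make the Bag of Words representation concrete, here is a sketch on two hypothetical sentences. Each column corresponds to one vocabulary word and each cell holds how often that word occurs in the sentence. The names toy_texts, toy_cv and toy_matrix are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, only for illustration
toy_texts = ["the cat sat", "the cat and the dog"]

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy_texts)  # scipy sparse matrix

# vocabulary_ maps each word to its column index; sort words by that index
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_matrix.toarray()

print(vocabulary)
print(dense)
```

A dense array of shape (10337, 39419) holds about 407 million entries, which is roughly 3 GB at 8 bytes per entry. If memory is a concern, one option is to skip .toarray() and keep the sparse matrix, since both train_test_split and MultinomialNB accept sparse input.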
Train Test Splitting
We have preprocessed our input and output variables. The next step is to create the training
set, for training the model, and the test set, for evaluating the model. For this step, we
use train_test_split.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
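The call above draws a different random split on every run and does not balance the languages between the two sets. As a hedged variation, not the setup used in this project, the sketch below fixes random_state for reproducibility and passes stratify so that a rare class (Hindi has only 63 samples in this dataset) keeps the same proportion in both sets. The toy data and the value 42 are arbitrary assumptions.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 samples of class 0 and 4 samples of class 1
toy_X = [[i] for i in range(20)]
toy_y = [0] * 16 + [1] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,    # 5 of the 20 samples go to the test set
    random_state=42,   # arbitrary seed, makes the split repeatable
    stratify=toy_y,    # keep the 4:1 class ratio in both sets
)

print(len(X_tr), len(X_te))
print(sum(y_tr), sum(y_te))
```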
Model Training and Prediction
We are almost there: the next part is model creation. We use the Multinomial Naive Bayes
algorithm for our model and train it on the training set.
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)
So we’ve trained our model using the training set. Now let’s predict the output for the test set.
y_pred = model.predict(x_test)
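Once trained, the same three fitted objects (vectorizer, model and label encoder) can classify a new sentence. The sketch below rebuilds a tiny pipeline on four hypothetical sentences so that it runs on its own. The helper predict_language and all toy_ names are assumptions for this example and are not part of the project code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus, only for illustration
toy_texts = [
    "the cat is on the table",
    "where is the station",
    "le chat est sur la table",
    "où est la gare",
]
toy_langs = ["English", "English", "French", "French"]

toy_le = LabelEncoder()
toy_targets = toy_le.fit_transform(toy_langs)

toy_cv = CountVectorizer()
toy_features = toy_cv.fit_transform(toy_texts)

toy_model = MultinomialNB()
toy_model.fit(toy_features, toy_targets)


def predict_language(text):
    """Vectorize the text with the fitted vocabulary and decode the predicted label."""
    features = toy_cv.transform([text.lower()])  # transform, not fit_transform
    code = toy_model.predict(features)
    return toy_le.inverse_transform(code)[0]


print(predict_language("the station is here"))
print(predict_language("la gare est ici"))
```

Note that new text goes through transform rather than fit_transform, so it is mapped onto the vocabulary learned during training.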
Model Evaluation
Now we can evaluate our model.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy is :", ac)
# Accuracy is : 0.9772727272727273
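Overall accuracy can hide weak performance on small classes, so it can help to look at the confusion matrix and the per-class report as well. The sketch below uses five hypothetical true and predicted labels, not the project's results, to show how the three metrics relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical labels, only for illustration (not the project's results)
toy_true = ["English", "English", "French", "French", "Hindi"]
toy_pred = ["English", "French", "French", "French", "Hindi"]
toy_names = ["English", "French", "Hindi"]

toy_acc = accuracy_score(toy_true, toy_pred)

# Rows are true languages, columns are predicted languages, in toy_names order
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_names)

# Precision, recall and F1 score for each language
toy_report = classification_report(toy_true, toy_pred, labels=toy_names, zero_division=0)

print(toy_acc)
print(toy_cm)
print(toy_report)
```

With the encoded labels used in this project, le.classes_ could be passed as target_names to classification_report so that the rows show language names instead of integer codes.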
