
Chapter 1

PREAMBLE
1.1 Introduction
Online news outlets such as The New York Times, question-answering platforms such as Stack Overflow, collaborative projects such as Wikipedia, and social networking sites such as Facebook all provide users with a forum for conversation in which content moderators work to maintain decency and encourage meaningful debate. Moderators are responsible for ensuring that users adhere to the platform's discourse guidelines, which prohibit the use of offensive language [1]. They enforce these guidelines by removing users' comments, either in full or in part. Research on comment classification typically relies on black-box models trained with supervised machine learning; for example, studies have sought to identify rude, aggressive, and abusive language as well as hate speech, racism, sexism, and other forms of discriminatory speech [3]. Near-complete pre-classification of messages is offered in order to assist moderators. However, black-box models are unable to provide any kind of explanation for the results that they produce automatically, and as a consequence they cannot be deployed effectively to filter comments. Users and moderators are rightly skeptical of automation that is difficult to understand. Explanations make a machine-learned classifier more acceptable because they inspire confidence, and they can help ensure that the moderation process is open and impartial [2]. Offensive language detection has been the subject of a significant amount of research, and the application of deep learning techniques to natural language processing has markedly improved classification performance for this task in recent years.

1.1.1 Artificial Intelligence


Artificial intelligence is the field of using intelligent algorithms to create intelligent machines [2]. Its goal is to study human intellect using computers, although it is not limited to biologically observable approaches. Artificial intelligence is commonly classified into four categories: reactive, limited memory, cognitive (theory of mind), and self-aware. Reactive AI simply responds to the inputs it receives and does not learn anything new; examples include IBM's chess-playing supercomputer, email spam filtering, and the Netflix recommender system. Limited-memory AI learns from historical data by observing behaviors, and develops its knowledge over time; autonomous vehicles that learn from their surroundings are an example. Cognitive AI refers to systems expected to develop genuine judgement comparable to that of humans; Sophia, a humanoid robot created by Hong Kong-based Hanson Robotics, can detect faces and respond to encounters with her own facial expressions. Self-aware systems would be sensitive to feelings and cognitive states and would be able to make inferences that other types of AI cannot; such advanced AI has not yet been developed, and the necessary machinery and mechanisms to implement it do not exist. The accelerated development of artificial intelligence has a profound impact on business and society. Many advances can considerably influence employees, employment, and competition by directly changing the development and characteristics of a diverse range of products. Although these direct consequences are projected to be significant, artificial intelligence also has the potential to alter the way inventions themselves are developed, which could have effects that eventually outweigh the direct ones. Some applications of artificial intelligence, such as machine learning, raise concerns about large-scale employment displacement, since they will undoubtedly provide lower-cost or higher-quality inputs into many existing production processes. At the same time, deep learning offers the potential to change the fundamental structure of the innovation process within those domains, in addition to productivity benefits across a wide range of sectors.

1.1.2 Supervised Learning [3]


Training a computer system on input data that has been labelled for a certain output is the core step in the supervised learning approach to artificial intelligence (AI) development. The classifier is trained until it detects the underlying relationships and trends between the inputs and the output labels, permitting it to generate appropriate categorization results when confronted with new data. Examples of regression and classification problems that supervised learning solves successfully include determining which category a news article belongs to and estimating the level of business on a particular future date. The objective of supervised learning is to give meaning to data within the framework of a specific investigation. Like all machine learning approaches, supervised learning is based on training. Throughout the training process, labelled data sets are fed to the system, teaching it how to connect each input value to its probable output. Labelled data that has not yet been shown to the model is held back as test data and then given to the trained model; this test data is used to evaluate how well the model predicts on unlabelled data. The supervised learning process is improved by tracking the model's outputs and modifying the architecture to bring the system closer to the target accuracy. The level of precision that can be attained is affected by two factors: the available labelled data and the algorithm. In particular:
● Training data should be balanced and organized. When choosing the data that the model is trained on, data scientists must exercise caution, because incorrect or duplicate data will skew the AI's understanding.
● If the training data set has too few examples, the model will struggle and be unable to deliver appropriate results when faced with novel situations. How successfully the AI responds to new cases depends on the diversity of the data.
● High prediction accuracy, somewhat counterintuitively, is not always a good sign; it may also indicate that the system is overfitting, that is, over-tuned to the specific training sample. Such a model can perform admirably in simulations but poorly when confronted with real-world problems. The test data must therefore be kept distinct from the training data, to prevent overfitting and to guarantee that the model does not simply memorize its prior examples but makes generalized inferences.
● On the other side, the algorithm determines what is done with the data. For instance, as shown by OpenAI's GPT-3, deep learning algorithms can be trained with billions of parameters on their data and achieve previously unheard-of levels of accuracy.
The following is the workflow of the supervised learning procedure:
Figure 1.1 – Flow of Supervised learning
A data scientist or engineer trains a piece of code known as a machine learning model to learn from data. As a result, if wrong data is fed to the model during training, it will produce inaccurate or incorrect predictions. Depending on the kind of project we want to produce, we can, for example, create an IoT system that gathers data from multiple sensors in order to develop a real-time machine learning project. The data set may originate from a wide range of sources, including files, databases, sensors, and more; nevertheless, it cannot be used directly for analysis, since it may contain a significant amount of missing data, extreme values, disordered text, or noise. Data preparation is therefore carried out to overcome this issue. Data preprocessing is one of the most important machine learning procedures and the most important step in building accurate machine learning models. The 80/20 rule holds true in machine learning: a data scientist should spend roughly 80% of their time preprocessing data and 20% doing analysis. Then, for model training, we divide the data into three categories: "training data", "validation data", and "testing data".
The "training data set" is used to train the classifier, the "validation set" to tweak the
parameters, and the "test data set" to evaluate the classifier's performance. It's vital to keep in
mind that just the training and/or validation set is available when the classifier is being
trained. The classifier's training process cannot make use of the test data set. Only when the
classifier is being assessed will the test set be accessible.
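As a minimal sketch of the three-way split described above, scikit-learn's train_test_split can be applied twice; the 70/15/15 proportions and the toy comments and labels below are assumptions for illustration, not values prescribed by this work.

```python
# Minimal train/validation/test split sketch (assumes scikit-learn is installed).
# The 70/15/15 proportions and the toy inputs are illustrative assumptions.
from sklearn.model_selection import train_test_split

texts = ["good video", "worst trailer ever", "super movie", "bad acting"]  # toy inputs
labels = [0, 1, 0, 1]                                                      # toy labels

# First split off the test portion; the classifier never sees it during training.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.15, random_state=42)

# Then split the remainder into training and validation sets
# (0.1765 of the remaining 85% is roughly 15% of the full data).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1765, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```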

Figure 1.2 – Model Evaluation

Model evaluation is a step in the model development process. It helps to find the model that represents our data most precisely and to predict how well it will perform in the future. We may adjust the model's hyper-parameters to increase accuracy, and the confusion matrix may also be examined in order to increase the proportion of true positives and true negatives.
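One way to carry out the hyper-parameter adjustment mentioned above is a grid search with cross-validation. The sketch below assumes scikit-learn and a logistic regression classifier; the toy features and the candidate parameter values are purely illustrative.

```python
# Illustrative hyper-parameter search with cross-validation (assumes scikit-learn).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy numeric features and labels; in practice these come from the preprocessed dataset.
X = [[0.1, 1.2], [1.5, 0.3], [0.2, 0.9], [1.4, 0.1], [0.3, 1.1], [1.6, 0.2]]
y = [0, 1, 0, 1, 0, 1]

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # candidate regularization strengths
search = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```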

1.1.3 Natural Language Processing [4]


Computer science, artificial intelligence, and computational linguistics all intersect in natural language processing (NLP). NLP allows computers to analyze, comprehend, and derive meaning from human language. It is used in text mining, machine translation, and automated question answering, and it makes real-world applications such as text summarization and sentiment analysis feasible.
NLP combines rule-based, lexicon-driven descriptions of human language with statistical, machine learning, and deep learning models. These advancements essentially allow computers to process conversational language in textual or spoken form and to 'interpret' its true meaning, as well as the intentions and feelings of the speaker or writer. NLP is used to power computer programs that translate text from one language to another, respond to spoken requests, and analyze large volumes of data in real time. Voice-driven NLP is used in fitness trackers, smart speakers, speech-to-text transcription tools, contact-center chatbots, and a variety of other commercial services. Additionally, natural language processing is increasingly being used in business solutions to help organizations streamline operations, boost employee productivity, and simplify complex but vital business processes.
Software that accurately captures the intended meaning of text or voice input is extremely difficult to build due to the ambiguity of human language. If such applications are to be productive, engineers must teach natural-language-driven systems to recognize and interpret homophones, word combinations, humor, accents, symbols, and language and usage anomalies, and to handle grammatical faults efficiently from the beginning. Many NLP systems break down natural speech and text to help the system understand what it is absorbing. Several examples of these tasks are as follows. Speech recognition, or speech to text, is the process of accurately converting voice input into text data; any program that accepts voice commands or responds to spoken queries must use it. Because people tend to speak quickly, run words together, vary their volume and cadence, speak in a variety of dialects, and use incorrect grammar, speech recognition is exceedingly difficult.
Part-of-speech tagging is the technique of identifying a word's part of speech in a given piece of text based on its use and context [5]. Word sense disambiguation is the semantic analysis process used to choose which of a word's several potential meanings makes the most sense in the context at hand. Named entity recognition (NER) is a technique for identifying words or phrases that refer to meaningful entities; NER determines, for example, that "Fred" is the name of a man and "Kentucky" is a state.
Coreference resolution is the task of deciding whether two expressions refer to the same entity. Finding out who or what a pronoun refers to is a typical example, and it can also involve identifying a metaphor or idiom in a text. Sentiment analysis aims to extract subjective qualities such as opinions, attitudes, sarcasm, confusion, and suspicion from text [5]. Natural language generation, often described as the opposite of speech recognition or speech to text, is the task of transforming structured data into human language.
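A brief sketch of two of these tasks, part-of-speech tagging and named entity recognition, is shown below. It assumes the spaCy library and its small English model (en_core_web_sm) are installed, and it reuses the "Fred"/"Kentucky" illustration from above.

```python
# Part-of-speech tagging and named entity recognition with spaCy
# (assumes `pip install spacy` and `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Fred moved to Kentucky last year.")

for token in doc:                 # part-of-speech tag for every token
    print(token.text, token.pos_)

for ent in doc.ents:              # named entities detected in the sentence
    print(ent.text, ent.label_)   # e.g. Fred -> PERSON, Kentucky -> GPE
```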

Figure 1.3 – NLP in Business


Natural language processing is extremely advantageous for companies looking for complex and all-encompassing solutions for job automation and orchestration, such as those in the banking, legal, healthcare, manufacturing, transportation, and energy sectors. Pattern matching, rule-based computing, statistical modelling, machine learning, and information retrieval are just a few of the NLP-based services that serve as the foundation for robotic process automation (RPA). By employing statistical methods such as word frequency analysis and stemming algorithms, computers may automatically convert user inputs, such as voice or text, into instructions that machines can understand. This aids in improving job performance as well as in making mistakes less costly in the future.
1.1.3.1 Business Applications in NLP
Presently, NLP is extensively used in a variety of industries, including healthcare, speech pattern recognition, document classification, and weather forecasting [6]. In fact, corporate NLP capabilities are so widespread that we frequently use them without even realizing it. Siri and Alexa, the car navigation system that shows us the quickest path, the OTT streaming service that suggests movies we should watch, autocomplete and predictive text on our phones, and translation software are just a few examples of how NLP has influenced our daily lives. Applying the fundamentals of NLP is necessary if you want to know how it may be used in your business to boost growth. The several technical subsets connected to NLP are frequently confused with one another; the most common misunderstanding is the confusion of text analysis with NLP, when in fact there is a significant distinction between the two. The applications are as follows:

1.1.3.1.1 Social Media Sentiment Analysis


NLP for social media listening is unique because it understands online shorthand (like LOL, BRB, and TL;DR), slang, code-switching, emoticons, emojis, and hashtags. Using NLP, you can gather data in whatever language your clients wish to use and then prepare it for consumption by an ML model. Based on the emotions it discovers in your social media mentions, positive, negative, or neutral, sentiment analysis gives you extra insight into how well your brand is performing, and in doing so it provides practical, helpful insights [7]. Based on customer sentiment discovered through social media monitoring, you may adjust your advertising campaign, enhance your brand reputation, improve aspects of your product or service, or reach out to influencers as part of your marketing plan. Sentiment analysis is used in a wide range of fields, such as business and marketing, politics, health, and social policy, and it has numerous applications that can support decision-making across different sectors. It can be used to examine global occurrences such as disasters, activities, sports, or events; examples include research comparing how individuals in western and eastern nations perceive ISIS, whose outcome demonstrates the variety of viewpoints from which ISIS is perceived globally as a terrorist organization. Sentiment analysis has the additional benefit of increasing public awareness of data security and the risk of security breaches.
Figure 1.4 – Sentiment Analysis in social media

Sentiment analysis can also be used to predict political elections: studies demonstrate that data from Twitter is a trustworthy source, with a 94 percent correlation to polling data and the potential to compete with more advanced polling approaches.
Finally, consumer input is vital when sentiment analysis is performed, because it may allow businesses and organizations to take the proper actions to improve their goods, services, and company strategy [9]. This is demonstrated by a study that examines social media users' opinions and experiences with drugs and cosmetics. Sentiment analysis benefits business owners by enabling them to assess client satisfaction with their goods or services, how effectively they interact with clients on social media, and how well their brand is doing in general.
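As a small illustration of sentiment analysis on informal social media text, the sketch below uses NLTK's VADER lexicon, which is designed for exactly this kind of shorthand-heavy content; the example posts and the score thresholds are invented for illustration.

```python
# Lexicon-based sentiment scoring of short social media posts (assumes NLTK is
# installed and the VADER lexicon can be downloaded).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

posts = [
    "LOL this brand is amazing, love it!",   # invented example posts
    "Worst customer service ever :(",
    "The package arrived today.",
]

for post in posts:
    scores = analyzer.polarity_scores(post)  # dict with neg/neu/pos/compound keys
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(label, scores["compound"], post)
```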

1.1.3.1.2 Language Translation


Information is available from many sources, but not everyone is multilingual, so online translation is frequently useful, especially for researchers. Additionally, if it were not for NLP technology's ability to rapidly and effectively translate audio to text at scale, we would not be able to watch the countless foreign films and documentaries with subtitles that are offered on our video streaming channels [10]. Linguists are particularly interested in the morphology, anthropological linguistics, philology, syntax, and phonology of languages because languages are so fascinating, distinctive, and complex. Data scientists are constantly gaining new knowledge of this kind, which enables them to develop AI/ML models that can comprehend language. The first machine translation (MT) programs were created in the mid-20th century. The main objective of MT research was to replace human translation because of the drawbacks of manual translation practices and the comparatively high cost of translation. The development and improvement of automatic language-to-language translation remains the main objective of MT. The corpus-based approach, the principal technique used in machine translation, compares words or passages of text in the source language with examples from other languages that have been compiled in a parallel corpus or database [11]. A statistical method is used to select the translations, which reduces the number of variables and increases the accuracy of the translation process. Due to its reliance on a large database and its higher translation accuracy compared to other machine translation programs, Google Translate (GT), released in 2006, is the most widely used machine translation program. The computer system that powers GT compares the input, which can take the form of text, media, speech, images, or real-time video, with the available output language within fractions of a second. When a user enters a query at the word, phrase, or sentence level, GT searches for those patterns across the millions of documents compiled in its online databases before offering the translation that most closely matches the sought-after words.

Figure 1.5 – Language Translation Process Flow

Because most traditional translation labor is done manually, the accuracy and quality of the product are generally ensured. However, in the case of close international contact, the efficiency and cost of human translation fall far short of the necessary standards [12]. Machine translation has improved substantially in speed and cost thanks to the rapid growth of the Internet and the enormous processing capacity of computers, although its translation quality remains slightly worse than that of human translation. One model uses the neural machine translation "Encoder-Decoder" architecture as its main body and adds a recommendation module that fuses in vocabulary knowledge from statistical machine translation. The vocabulary recommendation module observes the target language and attention data and uses them to provide historical data for word recommendations. The model's core technology is neural machine translation, and vocabulary alignment data from statistical machine translation is integrated using continuous word representations and neural networks. Statistical machine translation uses decoding data from neural machine translation to generate suggestions based on vocabulary alignment data at each decoding stage.
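The paragraph above describes encoder-decoder neural machine translation. A minimal way to experiment with such a model is the Hugging Face transformers pipeline shown below; the t5-small checkpoint and the English-to-French direction are assumptions made purely for illustration, not the system discussed in this work.

```python
# Quick neural machine translation demo with a pretrained encoder-decoder model
# (assumes `pip install transformers sentencepiece torch`; the checkpoint and
# language pair are illustrative choices only).
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("Machine translation has improved substantially in speed and cost.")
print(result[0]["translation_text"])
```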

1.1.4 Offensive Language Detection


Online news sources such as the New York Daily News, question-answering platforms such as Stack Overflow, collaborative projects such as Wikipedia, and media platforms such as Facebook all share one common characteristic [13]: these websites provide a conversation environment for users, with moderators on hand to preserve a courteous tone and encourage productive debate. Moderators verify that the platform's conversation rules, such as the restriction of inappropriate language, are upheld, and they enforce these rules by removing a participant's message in part or altogether. Anyone using social media, message boards, or other online forums runs the risk of being the target of taunts and even harassment. Expressions such as "They should all burn in hell for what they've done" are unfortunately frequently found online, and they can significantly harm a user's experience or the civility of a community. In addition to software that employs regular expressions and blacklists to detect offensive language and remove posts, many online communities have rules and regulations that users must adhere to in order to prevent abusive language. After the actor Robin Williams passed away, his daughter Zelda wrote a memorial to him; when she began to experience online harassment, she decided to close all of her internet accounts. Twitter revised its hate speech guidelines in response to this harassment. Although automatically detecting abusive language online is an important problem, earlier work has not been very coherent, which has hampered progress. Abusive language can be quite grammatical and is highly versatile. Given the many instances of offensive language used online, such as "Add another JEW fined a billion for stealing like a little maggot Put them all on hold", an automatic pre-classification signal of this kind can be useful.
The regulations of a forum are generally given as guidelines, sometimes in combination with general "computer ethics" or the basic norms of online communication. Nonetheless, this does not guarantee that commenters actually respect these rules when leaving comments [14]. Moderators of internet comment sections must then justify their interventions. They might, for example, replace a comment entirely with a message such as "Eliminated. Derogatory comments and personal insults are disallowed." or "Generalizations and character insults are explicitly banned." If a comment section is subsequently closed, moderators will leave a final remark, such as "The above comment section was forced to close owing to racist judgments, unsubstantiated claims, conspiracy theories, and harsh diatribes." On the one hand, the goal of these justifications is to remove harmful content; on the other hand, they strive to teach users how to follow the discussion guidelines. Comment classification research frequently uses supervised machine learning techniques and black-box models; there are studies on detecting hate speech, racism and sexism, and offensive, aggressive, or abusive language, for example. However, semi-automated comment moderation in the form of pre-classification of comments is insufficient to assist moderators [15]. Black-box models are incapable of providing justifications for their automated decisions, and as a result they cannot be used to moderate comments properly. Users and moderators are skeptical of such unfathomable automation. Explanations aid in the development of trust in, and acceptance of, machine-learned classifiers. Only then can a fair and transparent moderation procedure be guaranteed.
In general, there are two further reasons for explanations. To begin with, machine-learning classifiers may only lawfully be used if their results can be explained. Users who are affected by automated decision-making, such as when a credit application is denied, have the right under the EU's General Data Protection Regulation (GDPR) to "receive an explanation of the outcome obtained". From another point of view, explanations can reveal a model's strengths and weaknesses, and they may also aid in the detection of bias in a model's decisions. Based on such findings, researchers can attempt to improve the models. There has been a great deal of research into offensive language detection, and thanks to deep learning approaches to natural language processing, classification accuracy has improved dramatically in recent years. However, one aspect of this classification task has gone virtually unnoticed: the requirement to justify classification results. In the literature on explanation approaches, explainability is distinguished from interpretability.
Figure 1.6 – Offensive Language Detection Flow

To distinguish between abusive (the "abuse" class) and non-abusive (the "non-abuse" class) messages, features must be gathered from the content of each message under examination, including various morphological features such as the maximum word length, the average word length, and the message length, all expressed in numbers of characters. We tally the total number of characters in the message, categorize characters into groups (letters, digits, punctuation, spaces, and others), and calculate two attributes for each group: how often characters of that group occur and their share of the characters in the message overall [16]. Abusive texts frequently rely on copy and paste; this redundancy can be measured by compressing the message with the Lempel-Ziv-Welch (LZW) technique and computing the character-based ratio of the message's raw length to its compressed length. Furthermore, excessively elongated words are commonly used in hostile texts. These can be recognized by flattening the message, that is, reducing every run of a letter that appears more than twice in a row; for example, the word "looooooool" would be shortened to "lool". The difference between the lengths of the raw and flattened messages is then computed.
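The sketch below illustrates these surface features: character-class counts, the LZW-based raw-to-compressed length ratio, and the length difference after flattening elongated words. It is a simplified reimplementation for illustration, not the exact feature extractor of the cited work.

```python
# Illustrative surface features for abuse detection: character-class counts,
# an LZW compression ratio, and the raw-vs-flattened length difference.
import re
import string

def lzw_compressed_length(text: str) -> int:
    """Return the number of LZW codes needed to encode the text."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code, current, codes = len(dictionary), "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return len(codes)

def surface_features(message: str) -> dict:
    flattened = re.sub(r"(.)\1{2,}", r"\1\1", message)  # "looooooool" -> "lool"
    return {
        "n_chars": len(message),
        "n_letters": sum(c.isalpha() for c in message),
        "n_digits": sum(c.isdigit() for c in message),
        "n_punct": sum(c in string.punctuation for c in message),
        "n_spaces": sum(c.isspace() for c in message),
        # ratio of raw length to LZW-compressed length: high for copy/pasted text
        "lzw_ratio": len(message) / max(lzw_compressed_length(message), 1),
        # length drop after removing letter runs longer than two
        "flatten_diff": len(message) - len(flattened),
    }

print(surface_features("looooooool spam spam spam spam!!!"))
```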

1.1.5 Dravidian Text Offensive Language Detection


The usage of social networking sites for a variety of purposes, including product promotion, news dissemination, and the celebration of successes, has grown significantly. On the other hand, social media is also used to verbally abuse, threaten, and vilify specific racial or ethnic groups [17]. It is crucial to locate and remove such posts from social media platforms as soon as possible, because they typically have bad effects on people and spread quickly. Research on recognizing inflammatory and hateful content has grown in popularity in recent years. The usage of code-mixed language is one of the many difficulties in recognizing hate speech on social networking platforms. In the last ten years, user-generated content on social media platforms such as Twitter, YouTube, and Instagram has increased considerably. These platforms provide a space for discussion and interaction where users can mingle, express their opinions, and share their knowledge. One of the recurring problems with online social media platforms is the use of offensive messages or statements that may be directed at a particular person or group of people. Such messages act as triggers for people to publish further offensive content, which can be harmful and detrimental to people's mental health. In recent years, the automatic detection of such harmful comments and posts has emerged as a critical field of study in natural language processing. Transformer models have already been applied to distinguish offensive language in Tamil, Kannada, and Malayalam. To preserve the context of the users' intent, text preprocessing techniques such as lemmatization, stemming, and stop-word removal are restricted. Since transformer representations (e.g., BERT, XLM-RoBERTa) are contextual models, it can be observed that stop words receive essentially the same attention as non-stop words when they are employed. In multilingual environments, code-mixing happens frequently, and the texts that result from it occasionally use scripts other than that of the local language. Because of the challenging nature of code-switching at various linguistic levels in the text, systems trained on monolingual data have difficulty processing code-mixed data. A comment or post may contain several sentences, but the corpora are annotated at the comment or post level. The datasets also raise class-imbalance issues that are consistent with real-world situations. Each YouTube comment is categorized as Not-offensive, Offensive-Untargeted, Offensive-Targeted-Individual, Offensive-Targeted-Group, Offensive-Targeted-Other, or Not-in-intended-language. The use of offensive language in social media communication has become a significant social problem. Such phrases might have a detrimental effect on readers' psychological health and could negatively influence people's emotions and actions. Hate speech has ignited riots in numerous locations around the world. Therefore, it is imperative to curb the use of derogatory language on social media. Automated tools for spotting offensive language have been the subject of much research, and the use of code-mixed text is one of the numerous challenges that such systems must overcome. "Code-mixing" is the practice of using words from several different languages within a sentence or between sentences.
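A minimal sketch of how a contextual transformer such as XLM-RoBERTa can be set up for this six-class comment classification is shown below, assuming the Hugging Face transformers library; the xlm-roberta-base checkpoint, the untrained classification head, and the code-mixed example comment are assumptions for illustration only.

```python
# Six-class offensive-language classification head on top of XLM-RoBERTa
# (assumes `pip install transformers torch`; the model is not yet fine-tuned,
# so the predicted label is meaningless until training is done).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Not-offensive", "Offensive-Untargeted", "Offensive-Targeted-Individual",
          "Offensive-Targeted-Group", "Offensive-Targeted-Other",
          "Not-in-intended-language"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS))

comment = "Padam semma mass da!"           # invented code-mixed Tamil-English comment
inputs = tokenizer(comment, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, 6)
print(LABELS[int(logits.argmax(dim=-1))])
```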
Users have the option to produce content in casual situations on social networking sites and product review websites [18]. Additionally, in order to improve the user experience, these platforms make sure that users can communicate their thoughts in a way that makes them feel comfortable, either by using their native language or by switching between one or more languages within the same conversation. However, since most NLP systems are trained on formal languages with proper syntax, issues arise when such "user-created" comments are analyzed. While user-generated content in low-resource settings is frequently mixed with English or other high-resource languages, most breakthroughs in sentiment analysis and abusive language identification are based on monolingual data for high-resource languages. All the Dravidian languages are highly agglutinative and have distinctive writing systems. Kannada and Malayalam use a phonemic abugida written from left to right. Dravidian characters dating to 580 BCE were first uncovered on pottery from the Keezhadi, Sivagangai, and Madurai regions of Tamil Nadu by the Archaeological Survey of India and the Tamil Nadu State Department of Archaeology [19]. The Tamil writing system was historically derived from the Tamili script, which was not exclusively an abugida, an abjad, or an alphabet. The Jaina texts Samavayanga Sutta and Pannavana Sutta, dated to the third and fourth centuries BCE, and the ancient grammar book Tolkappiyam, dated between the ninth and sixth centuries BCE, both describe Tamil's writing system. At various times throughout history, Tamil has been written in the Tamili, Vattezhuthu, Chola, Pallava, and Chola-Pallava scripts. The present Tamil script has its roots in the Chola-Pallava script, which became the norm in the northern region of Tamilakam in the seventh century CE. The Malayalam script, which is based on the older Vatteluttu script, adds letters from the Grantha script to write loan words. The Kannada and Telugu scripts were developed from the Bhattiprolu script, a southern variant of the Brahmi script; the Bhattiprolu script evolved into an early form of Kannada writing known as the Kadamba script, which later gave rise to the Telugu and Kannada scripts. Despite the fact that these languages have their own writing systems, social media users typically use the Latin alphabet, since it is accessible on computers and mobile devices and is easy to use.

1.1.6 Code-Mixed Data


Most of the recent research on sentiment analysis and offensive language detection has focused on the high-resource languages used on social media networks. With such extensive monolingual training data, models have been successful in predicting sentiment and offensiveness. However, considering the increase in social media usage by multilingual users, a system trained on sparsely resourced, code-mixed data is necessary [20]. Because large datasets for Tamil-English, Kannada-English, and Malayalam-English are lacking despite the need, we gathered and created a code-mixed dataset from YouTube. This work describes how to create corpora for low-resource Dravidian languages from YouTube comments, as a follow-up to two earlier workshop papers and associated group projects.

Figure 1.7 – Data Gathering of Dravidian Text

The reputation of a person or an organization can be significantly impacted by the rapidly changing information provided by millions of users on online platforms such as Twitter, Facebook, or YouTube [21]. This emphasizes the importance of computerized sentiment analysis and abusive language detection on online social media. Due to the variety of content it offers, including songs, courses, product reviews, trailers, and more, YouTube is one of the most widely used social networking sites in the Indian subcontinent. YouTube allows users to submit videos and receive comments from other users, and resource-constrained languages now account for a growing share of this user-generated content. As a result, we chose to gather YouTube comments for our dataset. Comments were gathered from a number of Tamil, Kannada, and Malayalam movie trailers posted on YouTube in 2019, using the YouTube Comment Scraper tool. By manually labelling these comments, we were able to produce datasets for sentiment analysis and the recognition of offensive language. We sought comments that involve code-mixing at various textual levels, with enough representation for each sentiment and offensive language class in all three languages. Examples of code-mixing in the Tamil, Kannada, and Malayalam corpora are shown alongside their English translations in Figures 1.8, 1.9, and 1.10. Taking data privacy into account, we made sure to remove all user-related data from the corpus.
Figure 1.8 – Code-Mixed Tamil Language Dataset Examples

Figure 1.9 – Code-Mixed Kannada Language Dataset Examples


Figure 1.10 – Code-Mixed Malayalam Language Dataset Examples

The corpora were originally gathered from social media, so they include a variety of real-world code-mixed data. Inter-sentential switching occurs when the language changes between sentences, that is, when each sentence is written or spoken in a different language. Intra-sentential switching occurs when the switch takes place within a single sentence. In addition to texts with a variety of scripts, vocabulary, morphology, and inter- and intra-sentential switches, our corpora also include texts that are entirely monolingual in their original languages [21]. To accurately reflect real-world usage, all instances of code-mixing were kept. In order to account for the various levels of offensiveness in the comments, the three-level hierarchical annotation scheme of this work was flattened into a scheme with five offensiveness labels plus a sixth label. Comments written in a language other than the intended language are assigned to the "not in intended language" label; comments written in other Dravidian languages using Roman script are examples of this. To make the annotation decisions easier to understand, the six categories into which each comment is divided are as follows:
● Not offensive: The comment is free of offensive language.
● Offensive Untargeted: The comment contains offensive language or profanity but is not intended to harm anyone specifically. These are remarks that use improper language without referring to any specific individual.
● Offensive Targeted Individual: The comment intentionally offends or uses vulgarity against a specific person.
● Offensive Targeted Group: The offensiveness is directed at a group or community that is being insulted or profanely spoken about in the comment.
● Offensive Targeted Other: The comment contains offensive language or slurs aimed at something that does not fall under the preceding two categories (e.g., a situation, an issue, an organization, or an event).
● Not in intended language: The comment is not written in the intended language. For instance, a comment in the Tamil task is not considered Tamil if it does not contain Tamil written in Tamil script or in Latin characters. Once the data was annotated, these comments were removed.
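To make the label scheme concrete, the following sketch maps the six categories above to integer ids and inspects the class distribution of a toy annotated frame; the exact label strings in the released corpus files, the column names, and the example comments are assumptions, not the actual corpus layout.

```python
# Mapping the six annotation labels to ids and checking class imbalance
# (assumes pandas; label strings, columns, and comments are illustrative).
import pandas as pd

LABEL2ID = {
    "Not-offensive": 0,
    "Offensive-Untargeted": 1,
    "Offensive-Targeted-Individual": 2,
    "Offensive-Targeted-Group": 3,
    "Offensive-Targeted-Other": 4,
    "Not-in-intended-language": 5,
}

# Toy annotated frame standing in for the real file of YouTube comments.
df = pd.DataFrame({
    "comment": ["Super trailer!", "Worst acting ever", "Vera level song"],
    "label":   ["Not-offensive", "Offensive-Untargeted", "Not-offensive"],
})

df["label_id"] = df["label"].map(LABEL2ID)
print(df["label"].value_counts())   # reveals the class imbalance mentioned above
```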

1.1.7 Translation and Transliteration


Transliteration is the process of transforming a word from one language's alphabet to that of another. It is distinct from translation, in which the content or meaning of the term is preserved by replacing the words with their equivalents in the target language.

Fig 1.11 Differences between Translation and Transliteration


The goal of translation is to preserve as much of the utterance's semantic information as possible without being bound to its original form [22], while the goal of transliteration is to preserve as much of the source word's original sound as possible while adhering to the orthographic conventions of the target language. For instance, speakers of languages other than English are now familiar with the transliterated name "Manchester" for the city. In addition to posing difficulties for spoken language technologies such as automatic speech recognition, spoken keyword search, and text-to-speech, many such words or phrases are named entities that are crucial in cross-lingual information retrieval, information extraction, and computational linguistics. Transliteration methods incorporate two alternative neural machine translation (NMT) architectures: convolutional sequence-to-sequence NMT and recurrent neural networks. Typically, new words and named entities are transliterated between writing systems. Numerous machine transliteration algorithms have been created during the past several years, based on the pronunciations of the source and target languages as well as on statistical and language-specific techniques. Numerous multidisciplinary applications, such as corpus alignment, cross-lingual text processing, cross-lingual knowledge discovery and extraction, and, most crucially, machine translation systems, require transliteration. Automated, machine learning-based transliteration systems are urgently needed, particularly given the prominence of numerous languages and the growing number of bilingual and multilingual users. Transliteration is also a technique that machine translation programs can use to handle terms that are not part of their vocabulary. Transliteration is a complicated and difficult task because of the various problems that may arise; since pronunciation differs between languages as well as between dialects of the same language, transliteration is a daunting challenge.
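As a quick illustration of scheme-to-script transliteration, the sketch below assumes the third-party indic-transliteration package and its sanscript API; the package choice, the scheme names, and the example word are assumptions for illustration, not the transliteration system used in this work.

```python
# Romanised-to-native-script transliteration sketch, assuming the third-party
# indic-transliteration package (`pip install indic_transliteration`) and its
# sanscript API; scheme names and the example word are illustrative only.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

word = "rAma"  # Harvard-Kyoto romanisation of a name
print(transliterate(word, sanscript.HK, sanscript.DEVANAGARI))  # Devanagari script
print(transliterate(word, sanscript.HK, sanscript.TAMIL))       # Tamil script
print(transliterate(word, sanscript.HK, sanscript.KANNADA))     # Kannada script
```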

1.1.8 NLP Framework Implementation


Every day we use words that other people interpret to mean tens of thousands of different things. Although we like to think of communication as simple, we all know that words carry far more nuance than that [24]. NLP in AI does not concentrate on voice modulation; instead, it relies on contextual patterns. We invariably convey some context through our words and our delivery.
Fig 1.12 NLP Framework Components
Lexical and Morphological Analysis
Lexical analysis is the study of a vocabulary's terms and expressions. It involves analyzing, recognizing, and defining word structures. To do this, a text is divided into paragraphs, sentences, and words; each word is then broken down into its constituent components, and non-word elements, such as punctuation, are separated from the words.
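A small sketch of this lexical step is shown below, assuming NLTK's punkt tokenizer models are available; it splits a text into sentences and words and strips pure-punctuation tokens.

```python
# Lexical analysis sketch: sentence splitting, word tokenization, and removal of
# punctuation tokens (assumes NLTK; newer NLTK versions may also need 'punkt_tab').
import string
import nltk

nltk.download("punkt", quiet=True)

text = "The trailer looks great! Fans are already celebrating, aren't they?"
for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    words = [t for t in tokens if t not in string.punctuation]  # drop pure punctuation
    print(words)
```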
Semantic Analysis
Semantic analysis assigns meanings to the structures created by the syntactic analyzer. This stage turns linear sequences of words into structures that show how the words relate to one another. Semantics addresses only the literal sense of words, phrases, and sentences, that is, the dictionary meaning detached from context. The structures assigned by the syntactic analyzer do not always have an obvious meaning: the phrase "colorless green thinking", for example, would be rejected by the semantic analyzer, because something cannot be both colorless and green, so the phrase is absurd in this context.
Pragmatic Analysis
Pragmatic analysis deals with the overall communicative and social content and how it influences interpretation. The intended use of words in context must be extracted; the emphasis is on what was said and how it is understood. By applying a set of guidelines that characterize cooperative dialogue, pragmatic analysis helps identify the intended meaning. "Shut the window?", for instance, is more appropriately understood as a request than as a command.
Syntax analysis
Words are often regarded as the smallest syntactic units. The rules and conventions that dictate how sentences are put together are referred to as a language's syntax. The study of syntax focuses on how word arrangement can alter meaning, which requires carefully studying the words of a sentence as well as its grammatical construction. To show how the words relate to one another, they are transformed into a structured representation.

1.1.8.1 NLP Framework Implementation Standard workflow


Beginning with a corpus of text documents, we normally go through the standard steps of text pre-processing, text wrangling, and basic exploratory data analysis [25]. Based on the preliminary findings, we typically represent the text using appropriate feature engineering techniques. Depending on the problem, we can concentrate on building predictive supervised models, or on unsupervised models, which frequently focus more on pattern mining and grouping. Finally, we evaluate the finished model against the overall success criteria with the pertinent stakeholders or consumers before making it available for future use.

Fig 1.13 NLP Standard Workflow


Textual data pre-processing and cleaning can involve numerous steps, and we make heavy use of the modern NLP libraries nltk and spacy for this purpose. Because text is the least organized data type, it contains a range of noise and cannot be analyzed directly, so pre-processing is required: text is cleaned, standardized, and stripped of noise before analysis. The preprocessed data must then be transformed into features. Depending on the intended use, text features can be created in a variety of ways, including syntactical parsing, entity/N-gram/word-based features, statistical features, and word embeddings. Syntactical parsing involves examining the syntax and organization of the words in a sentence in order to determine the relationships between them.
Dependency Trees - Sentences are made up of several words strung together, and dependency grammar determines how the words in a phrase relate to one another. Dependency grammar is a subfield of syntactic text analysis that uses (labelled) asymmetrical binary relations between two lexical items (words). Each relationship can be thought of as a triplet (relation, governor, dependent). Consider the sentence "Republican Senator Brownback of Kansas introduced proposals on ports and immigration." Representing the links between the words as a tree reveals the following:

Fig 1.14 Dependency Trees
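The (relation, governor, dependent) triplets for the example sentence can be produced with spaCy's dependency parser. The sketch below assumes the en_core_web_sm model is installed; the exact relation labels it prints may differ slightly from the figure.

```python
# Dependency triplets (relation, governor, dependent) for the example sentence,
# assuming spaCy with the en_core_web_sm model installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Republican Senator Brownback of Kansas introduced proposals "
          "on ports and immigration.")

for token in doc:
    # token.dep_ is the relation, token.head is the governor, token is the dependent
    print(f"({token.dep_}, {token.head.text}, {token.text})")
```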


Depending on the data type and the task at hand, different text cleaning procedures are used.
Before text is tokenized, the string is typically lowercased, and the punctuation removed. The
technique of tokenization involves breaking a string up into a collection of strings (or
"tokens").
Classification
Logistic regression makes it incredibly simple to train the model, and the results are quite good [22], given that the binary target variable in the data can only take the values 1 (success) or 0 (failure). In order to construct our system, we divided all our data into two groups: training data and testing data (the system is evaluated on this unseen data).
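A compact sketch of this classification setup, assuming scikit-learn: TF-IDF features feed a logistic regression trained on binary labels, with a held-out test split. The toy comments and labels are invented for illustration.

```python
# Binary comment classification with TF-IDF features and logistic regression
# (assumes scikit-learn; the toy comments and labels are invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

comments = ["great movie, loved it", "you are an idiot", "nice trailer",
            "total garbage, shut up", "what a lovely song", "disgusting people"]
labels = [0, 1, 0, 1, 0, 1]   # 1 = offensive, 0 = not offensive

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.33, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)   # train on the seen data only
print("test accuracy:", model.score(X_test, y_test))
```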
Evaluation
Model evaluation is the practice of using several evaluation metrics to understand how effectively a machine learning model performs, as well as its benefits and drawbacks. The performance of a model must be assessed early in the research process, and model evaluation also aids in monitoring a model over time. Accuracy, precision, the confusion matrix, log-loss, and AUC (area under the ROC curve) are the most popular metrics for measuring classification performance.

Figure 1.15: Example of how Model Evaluation is performed


The confusion matrix should frequently be a machine learning model's evaluation criterion. It provides a very straightforward yet precise way to assess how well the model is functioning; the top performance metrics derived from the confusion matrix are shown below. In an ideal world, we would want a model with a precision and recall of 1, that is, an F1-score of 1, which corresponds to an always-accurate machine learning model and is rare in practice. Therefore, we should strive for greater precision while keeping the recall value high. Now that we are familiar with the confusion matrix performance measurements, let us investigate how we may apply them in a multi-class machine learning model.
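The sketch below shows how the same confusion-matrix metrics extend to a multi-class setting using scikit-learn; the class names follow the annotation scheme above, and the true and predicted label vectors are invented for illustration.

```python
# Multi-class evaluation with a confusion matrix and per-class precision/recall/F1
# (assumes scikit-learn; the label vectors are invented for illustration).
from sklearn.metrics import classification_report, confusion_matrix

classes = ["Not-offensive", "Offensive-Untargeted", "Offensive-Targeted-Individual"]
y_true = [0, 0, 1, 2, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 1, 2, 0, 1, 0, 2, 2, 0]

print(confusion_matrix(y_true, y_pred))   # rows: true classes, columns: predictions
print(classification_report(y_true, y_pred, target_names=classes))
```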
