
%%
%% Copyright 2007-2020 Elsevier Ltd
%%
%% This file is part of the 'Elsarticle Bundle'.
%% ---------------------------------------------
%%
%% It may be distributed under the conditions of the LaTeX Project Public
%% License, either version 1.2 of this license or (at your option) any
%% later version. The latest version of this license is in
%% http://www.latex-project.org/lppl.txt
%% and version 1.2 or later is part of all distributions of LaTeX
%% version 1999/12/01 or later.
%%
%% The list of all files belonging to the 'Elsarticle Bundle' is
%% given in the file `manifest.txt'.
%%
%% Template article for Elsevier's document class `elsarticle'
%% with harvard style bibliographic references

%\documentclass[preprint,12pt]{elsarticle}

%% Use the option review to obtain double line spacing


%% \documentclass[authoryear,preprint,review,12pt]{elsarticle}

%% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn


%% for a journal layout:
%\documentclass[final,1p,times,authoryear]{elsarticle}
%\documentclass[final,1p,times,twocolumn]{elsarticle}
\documentclass[final,3p,times,authoryear]{elsarticle}
%\documentclass[final,3p,times,twocolumn,authoryear]{elsarticle}
%% \documentclass[final,5p,times,authoryear]{elsarticle}
%% \documentclass[final,5p,times,twocolumn,authoryear]{elsarticle}

%% For including figures, graphicx.sty has been loaded in
%% elsarticle.cls. If you prefer to use the old commands
%% please give \usepackage{epsfig}

%% The amssymb package provides various useful mathematical symbols


\usepackage{amssymb}
%% The amsthm package provides extended theorem environments
%% \usepackage{amsthm}
%\usepackage[ruled,vlined]{algorithm2e}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{float}
\usepackage{makecell}
\usepackage{multirow}
%\usepackage{multicol}
\usepackage{adjustbox}
\usepackage{lscape}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{sidenotes}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage{pifont}
\usepackage{graphicx}
\usepackage{color}
\usepackage{longtable}

\usepackage{changepage}
\usepackage{lipsum}
%% geometry provides \newgeometry/\restoregeometry (used around the landscape
%% table); the `pass' option preserves the class's own page layout
\usepackage[pass]{geometry}
%% The lineno packages adds line numbers. Start line numbering with
%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
%% for the whole article with \linenumbers.
%% \usepackage{lineno}

\journal{Heliyon}

\begin{document}

\begin{frontmatter}

%% Title, authors and addresses

%% use the tnoteref command within \title for footnotes;
%% use the tnotetext command for the associated footnote;
%% use the fnref command within \author or \affiliation for footnotes;
%% use the fntext command for the associated footnote;
%% use the corref command within \author for corresponding author footnotes;
%% use the cortext command for the associated footnote;
%% use the ead command for the email address,
%% and the form \ead[url] for the home page:
%% \title{Title\tnoteref{label1}}
%% \tnotetext[label1]{}
%% \author{Name\corref{cor1}\fnref{label2}}
%% \ead{email address}
%% \ead[url]{home page}
%% \fntext[label2]{}
%% \cortext[cor1]{}
%% \affiliation{organization={},
%% addressline={},
%% city={},
%% postcode={},
%% state={},
%% country={}}
%% \fntext[label3]{}

\title{Deep Learning-Based Natural Language Processing in Human-Robot Interaction: A Survey}

%% use optional labels to link authors explicitly to addresses:


%% \author[label1,label2]{}
%% \affiliation[label1]{organization={},
%% addressline={},
%% city={},
%% postcode={},
%% state={},
%% country={}}
%%
%% \affiliation[label2]{organization={},
%% addressline={},
%% city={},
%% postcode={},
%% state={},
%% country={}}

\author[a]{}
\ead{}
\author[b]{}
\ead{}
\author[c]{}
\ead{}
\author[c]{}
\ead{}

\address[a]{Department of Computer Science, American International University-Bangladesh, Dhaka-1229, Bangladesh}

\address[b]{Department of Computer Science \& Engineering, Bangladesh University of Business \& Technology, Dhaka-1216, Bangladesh}

% \affiliation{organization={},%Department and Organization


% addressline={},
% city={},
% postcode={},
% state={},
% country={}}

\begin{abstract}
Human-robot interaction (HRI) is at the forefront of rapid development, and the
integration of deep learning techniques into natural language processing (NLP)
offers significant potential. This survey addresses the complicated
dynamics of HRI and highlights the central role of deep learning in shaping
communication between humans and robots. In contrast to a narrow focus on sentiment
analysis, the study encompasses various HRI facets, including dialogue systems,
language understanding, and contextual communication. The study systematically
examines applications, algorithms, and models that define the current landscape of
deep learning-based NLP in HRI. It also presents common pre-processing techniques,
datasets, and customized evaluation metrics. Insights into the benefits and
challenges of machine learning and deep learning algorithms in HRI are provided,
complemented by an overview of the current state of the art. The
manuscript concludes with an in-depth discussion of specific HRI challenges and
suggests thoughtful research directions. This work aims to provide a balanced
understanding of models, applications, challenges, and research directions in the
field of deep learning-based NLP in human-robot interaction, with a focus on recent
contributions to the field.

\end{abstract}

%%Graphical abstract
%\begin{graphicalabstract}
%\includegraphics{grabs}
%\end{graphicalabstract}
\iffalse
%%Research highlights
\begin{highlights}
\item Research highlight 1
\item Research highlight 2
\end{highlights}
\fi
\begin{keyword}
%% keywords here, in the form: keyword \sep keyword
Human-Robot Interaction, Deep Learning, Natural Language Processing
%% PACS codes here, in the form: \PACS code \sep code

%% MSC codes here, in the form: \MSC code \sep code


%% or \MSC[2008] code \sep code (2000 is the default)

\end{keyword}

\end{frontmatter}

%% \linenumbers
\section{Introduction}
Human-robot interaction has recently gained significant popularity and become a
trending topic, reflecting growing fascination with the dynamics between humans and
robots. This trend is reinforced by the recognition of the central role played by
the combination of deep learning and natural language processing (NLP), a synergy
that improves the quality and nuance of human-robot interaction. The convergence of
deep learning-based NLP and human-robot interaction represents a transformative
intersection where advances in language understanding and processing contribute
significantly to the seamless integration of robots into various aspects of human
life. As we delve into this intersection, it becomes clear that the evolving
technology landscape promises exciting developments that are yet to unfold.

The combination of human-robot interaction (HRI) with deep learning-based natural language processing (NLP) holds immense potential to advance robotics and improve the user experience. First, the integration of deep learning with NLP enables
robots to understand and respond to human speech in a more nuanced and
contextualized way. This leads to improved dialog understanding and enables robots
to interpret not only the literal meaning, but also the feelings and intentions
behind human utterances. In addition, deep learning facilitates continuous learning
and adaptation, allowing robots to refine their understanding of language over time
as they are exposed to different interactions. This adaptability is crucial in
dynamic environments where language patterns can evolve. Moreover, the combination of HRI and deep learning-based NLP improves the personalization of
interactions, as the robots can tailor their responses to the preferences and
communication styles of individual users. This synergy also contributes to the
development of socially intelligent robots that are able to recognize and respond
appropriately to social cues, fostering more natural and engaging human-robot
relationships. Ultimately, the fusion of HRI and NLP based on deep learning
represents a powerful convergence that not only expands the technical capabilities
of robots, but also significantly increases their utility and adoption in various
real-world applications.

The fusion of human-robot interaction (HRI) and deep learning-based natural language processing (NLP) is at the forefront of technological innovation, leading
to transformative advances in various fields. \cite{li2023manufacturing} highlights
the integration of NLP based on deep learning in manufacturing, particularly in
robotic assembly lines that enable seamless communication and task coordination
between human operators and robots. This application highlights the potential for
improved efficiency and collaboration in dynamic production environments.
Similarly, recent breakthroughs in service robots, as demonstrated by \cite{smith2023customerservice}, have brought virtual assistants equipped with NLP based on deep learning into mainstream use. This revolutionizes customer dialogue, enabling greater effectiveness and personalization by understanding and responding to users' needs more intuitively. In education, more and
more interactive learning experiences are being enabled by robots equipped with
NLP, as the example of \cite{brown2022education} shows. This represents a
significant shift towards improved adaptability in answering student questions and
creating dynamic and personalized educational content.

In addition to these areas, healthcare is making significant strides with the integration of deep learning-based NLP into robotic systems, as applications such as Mabu by \cite{fadhil2019healthcare} show. This innovation enhances the
communication between patient and robot, helping to improve healthcare assistance
and patient engagement. Furthermore, in smart homes, the seamless interaction
facilitated by voice-activated virtual assistants such as Amazon's Alexa and Google
Assistant \cite{luger2016virtualassistants} exemplifies the merging of HRI and NLP.
Intuitive communication through natural language commands reflects a growing trend
in the design of user experiences in intelligent environments. The use of humanoid
robots such as Pepper, equipped with advanced NLP capabilities, in retail
environments further demonstrates the potential of interactive and intuitive
communication \cite{hoffmann2019retail}. These diverse examples collectively
highlight the growing prevalence and impactful applications of combining HRI with
NLP-based deep learning across manufacturing, customer service, education,
healthcare, and smart environments.

However, the integration of deep learning (DL)-based natural language processing (NLP) into human-robot interaction (HRI) introduces unique concerns and challenges,
creating new opportunities for research exploration. A comprehensive investigation
of the application of DL-based NLP in HRI is needed, including a thorough
exploration of DL algorithms, preprocessing methods, datasets, diverse
applications, related challenges, and future research directions. While some
related work exists, none of it comprehensively covers this area, especially from a
holistic research perspective. Therefore, we conducted a systematic review focusing
exclusively on recent cutting-edge research articles to gather insights on the
latest developments in the field of DL-based NLP in HRI. This review aims to
support practitioners and researchers in advancing DL-based NLP applications in HRI
by providing a snapshot of current progress in this dynamic and evolving field.
\\
\\
The main contributions of this study are:
\begin{itemize}
\item This study presents a comprehensive exploration of the different
application domains where deep learning (DL)-based natural language processing
(NLP) and human-robot interaction (HRI) intersect. The analysis addresses the
nuanced advances and innovations within this intersection and provides insights
into a variety of real-world applications that utilize the synergy of DL-based NLP
in HRI.

\item The study offers an in-depth exploration that provides valuable insights
into methods of data preprocessing to optimize model training. It also highlights
commonly used datasets that are important for research and benchmarking in the
context of DL-based NLP in human-robot interaction.

\item A detailed examination of the predominant DL-based NLP algorithms used in human-robot interaction is presented, highlighting the individual strengths and limitations of each algorithm. This thorough investigation improves our understanding of the applications of DL-based NLP in the complex dynamics of human-robot interaction.

\item A thorough analysis that explores and examines recent advances and contributions by researchers, highlighting the latest experimental results shaping the evolving landscape of deep learning-based natural language processing (NLP) in human-robot interaction (HRI).
\item A comprehensive discussion of the challenges in the field of deep learning-based NLP in HRI, identifying future research opportunities to overcome these obstacles.

\end{itemize}

Throughout the rest of this paper, key sections are systematically navigated, each of which contributes to a comprehensive understanding of the study. Section \ref{sec:Methodology}, Methodology, explains the systematic approach used to investigate the interplay between deep learning-based natural language processing (NLP) and human-robot interaction (HRI). Section \ref{sec:Dataset}, Datasets, describes in detail the key data sources for the subsequent analyses. Section \ref{sec:Models}, Models, provides a detailed insight into the various deep learning models used to improve natural language understanding in the context of HRI. Section \ref{sec:Application}, Application, examines the practical implementation of these models in various real-world scenarios. Section \ref{sec:Results_analysis}, Results and Analysis, rigorously breaks down the results and provides nuanced insights into the findings of the study. The pre-processing methods are described in detail in Section \ref{sec:Pre-processing}. Section \ref{sec:Challenges}, Challenges, critically discusses challenges and advances in the field to ensure the study is up to date. Finally, Section \ref{sec:Conclusion}, Conclusion, captures the essence of the study by summarizing the main findings, reflecting on their implications, and suggesting possible avenues for future research in the field of deep learning-based NLP in human-robot interaction.

\iffalse % draft introduction subsections, disabled
% \subsubsection{1.1}
% In the modern era of technological advancement, dynamic interaction between
humans and robots, known as human-robot interaction (MRT), has moved from the realm
of science fiction into mainstream daily life. This fascinating interaction
involves studying how humans communicate and interact with robots. Originally
concerned with simple control interfaces, HRI has undergone a remarkable evolution
driven by the advent of Deep Learning - a subset of machine learning that employs
complex neural networks. In particular, in the captivating field of natural
language processing (NLP), a subfield of artificial intelligence, the journey of
seamless interpretation, understanding, and generation of human language by
machines is upon this field. The convergence of Deep Learning and NLP in the field
of HRI is like an artistic symphony orchestrating intricate patterns and
relationships within the complex fabric of human communication.
% \subsubsection{1.2}
% This technological fusion is not only pushing the boundaries of human-robot
interaction, but also creating a new paradigm - a world where robots not only
understand the nuances of human speech but also anticipate and respond to our
intentions with uncanny familiarity. Imagine virtual assistants that not only
understand words spoken by a person but also recognize his or her underlying
emotions, paving the way for unprecedented personalized technological interactions.
As the world makes progress through this ever-evolving landscape, examples of
monumental achievements such as BERT \cite{n1BERT} and GPT-3 \cite{n1GPT} come to
mind, transforming machines into language virtuosos capable of coherent dialogues,
accurately translating languages, and even rendering human emotions. This synergy
of advanced technologies is changing the landscape of HRI, leading to a future
where robots are not just tools, but true companions to our human experience.
% \subsubsection{1.3}
% The evolution of human-robot interaction (HRI) through the lens of Deep Learning
is closely intertwined with the contributions of influential models. Convolutional
neural networks (CNNs) \cite{n1CNN}, developed in the 1990s by LeCun et al.
redefined image analysis for robots by enabling the extraction of complicated
patterns and advancing tasks such as object recognition. Recurrent neural networks
(RNNs)\cite{n1RNN}, dating back to Elman's 1990 work and advanced by Hochreiter and
Schmidhuber in 1997 with the introduction of long short-term memory (LSTM) cells,
revolutionized sequential data analysis and enabled robots to understand human
gestures and speech nuances. The landscape changed further with the advent of
Transformer models, e.g., the introduction of Bidirectional Encoder Representations
from Transformers (BERT) \cite{n1BERT} by Devlin et al. in 2018, which captured
contextual linguistic subtleties. This was followed in 2020 by OpenAI's Generative
Pre-trained Transformer 3 (GPT-3) \cite{n1GPT}, a milestone in language generation
that enables machines to produce coherent, contextual text. The historical
development of these models is intertwined with the growth of Deep Learning, which
is reshaping the evolution of HRI and setting the stage for future innovations that
will improve human-robot interaction through the convergence of Deep Learning
methods.
% \subsubsection{}
% The need to explore the field of human-robot interaction (HRI) in the context of
Deep Learning stems from its transformative potential for numerous real-world
applications. As robots increasingly become an integral part of our lives,
improving their ability to seamlessly interact with humans is critical. Improved
HRI facilitates collaborative tasks in industrial automation, where robots work
with human workers to optimize efficiency and safety. In healthcare, robots can
support patient care, accompany the elderly, or assist people with limited
mobility. In addition, personalized virtual assistants based on Deep Learning
provide tailored answers and recommendations, revolutionizing customer service and
the use of technology. These examples show that the advancement of HRI through Deep
Learning is not only redefining the human-robot dynamic, but also holds the key to
unprecedented efficiency, accessibility, and convenience in various sectors of our
society.
% \subsubsection{1.5}
% This review examines the evolution of human-robot interaction (HRI) in the
context of Deep Learning and demonstrates its transformative potential in various
real-world applications. Human-robot interaction, known as HRI, has evolved from
simple control interfaces to complex interactions driven by advances in Deep
Learning, particularly in the area of natural language processing (NLP). The
convergence of Deep Learning and NLP is transforming HRI by enabling robots to not
only understand human language, but also anticipate intentions and respond with
deep understanding. This fusion has produced virtual assistants capable of
understanding emotions and paving the way for personalized interactions. Key
influential models such as Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), and Transformer models such as BERT and GPT-3 are highlighted as
milestones that have shaped the development of HRI. The transformative potential of
Deep Learning-powered HRI is evident in applications ranging from industrial
automation to healthcare to personalized customer service. As robots become an
integral part of our lives, this review paper highlights the role of Deep Learning
in redefining the human-robot dynamic and shaping a future where robots serve as
efficient and accessible companions in various aspects of society.
% \subsubsection{1.5}
\fi
\section{Methodology}
\label{sec:Methodology}

% need to change the bold text


The systematic literature review (SLR) technique used in this study was developed by \textcolor{red}{Keele et al. [8, 9]}. The SLR process comprises three separate phases: planning, execution, and reporting of the review.

\subsection{Planning the Review}


This subsection briefly outlines the research questions, the sources of the reviewed articles, and the inclusion and exclusion criteria.
\subsubsection{Research questions}
The key research questions were:\\
RQ1: What is the operational process of HRI? \\
RQ2: What are the application domains of HRI?\\
RQ3: What HRI-related research publications were published in 2022 and 2023, and which applications did they address?\\
RQ4: What types of datasets have been used for HRI experiments?\\
RQ5: What pre-processing techniques were used for HRI experiments?\\
RQ6: What are the most popular HRI algorithms and models, and what are their benefits and drawbacks?\\
RQ7: What HRI-related developments were made in 2022 and 2023, and what are their features?\\
RQ8: What are the research problems and future prospects of HRI?\\
\subsubsection{Source of review articles}
Only high-quality academic articles indexed in databases such as ScienceDirect, SpringerLink, MDPI, PLOS ONE, ACM Digital Library, and IEEE Xplore, together with certain renowned conferences, are considered in this survey.
\subsubsection{Criteria for inclusion and exclusion}
\textcolor{red}{As shown in Figure 2, the essential sources for this study were gathered in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) standards. The inclusion and exclusion criteria used under PRISMA} are also described in Table 2, which shows the criteria used to decide whether to accept or reject a paper.\\
\textbf{Table 2. Inclusion and exclusion criteria used to select articles.}
\begin{center}

\begin{tabular}{|c|c|}
\hline
Inclusion Criteria& Exclusion Criteria \\
\hline
IC1: Research articles written in English. & EC1: Duplicate articles. \\
\hline
IC2: Articles published in 2022 and 2023. & EC2: Articles not related to the theme of the review. \\
\hline
IC3: Articles published in scholarly journals (with a few exceptional conference papers). & EC3: Articles lacking sufficient information, and review papers.\\
\hline
\end{tabular}
\end{center}
\subsection{Executing the Review}
This phase focuses on extracting the key information from the selected articles. The five sub-phases listed below ensure an organized and systematic literature review.
\subsubsection{Topical relationship}
We chose the most pertinent original research papers conducted in this field, with an emphasis on publications about deep learning-based NLP in human-robot interaction.
\subsubsection{Goals and results}
\textcolor{red}{Sections 3, 7, and 8 cover the primary objectives, contributions,
experimental findings, and limitations of several relevant publications.}
\subsubsection{Evaluation metrics, Datasets, Pre-Processing methods, and
Algorithms}
\textcolor{red}{Sections 4 to 7 provide all evaluation measures, datasets,
preprocessing techniques, and algorithms utilized in the HRI studies.}
\subsubsection{Research type}
The paper type is described, such as scholarly journals, conference/workshop
proceedings, book chapters, or thesis work. For a thorough, systematic evaluation,
our study mostly chose scholarly articles.
\subsubsection{Publication year and type}
\textcolor{red}{We initially collected 176 publications for this study, all of which were published in 2022 and 2023. After careful inspection, 83 articles focusing on applications and breakthroughs in HRI were chosen for the review. Notably, more than 60\% of these papers were published in 2023. Our deliberate decision to consider only current papers underscores our commitment to offering an expert and cutting-edge overview of HRI.}
\subsection{Outcome}
Lastly, the collected data were analyzed, addressing current issues and challenges while suggesting new directions for future study.
\section{Dataset}
\label{sec:Dataset}
The need for high-quality datasets in any given field cannot be overstated. Table \ref{tab:dataset} lists the most important datasets used in the field of deep learning-based natural language processing for human-robot interaction, underlining their paramount importance in driving research and innovation.
\begin{table}[H]
\begin{adjustwidth}{-4cm}{-4cm} % Comment out/remove adjustwidth environment if
table fits in text column.
\centering
\caption{
{\bf Datasets}}
\begin{tabular}{p{.5cm}p{1.5cm}p{2cm}p{0.8cm}p{2cm}p{8cm}}
\hline
\textbf{Ref.} & \textbf{Year} & \textbf{Dataset} & \textbf{Type} & \textbf{Application} & \textbf{Feature}\\ \hline
\cite{dataset.mscoco} &2014& MS COCO dataset & Image & Interdisciplinary & A dataset of 328,000 photos containing 2.5 million labeled instances of 91 object types easily recognizable by a four-year-old; it was subjected to detailed statistical analysis and performance evaluation, and differences from other datasets were identified.\\ \hline
\cite{dataset.IMSDb} &2020& IMSDb & Text & Entertainment & Uses fine-tuned GPT-2 and BART models that incorporate specific genre tags to produce scripts tailored to each genre.\\ \hline
\cite{dataset.VirtualAssistantMax} &2022& Virtual-Assistant-Max & CSV & Industrial & The data is divided into two folders containing 10 and 6 intents respectively, each with 40 labels, which are used to train robust models for different user interactions in different service environments.\\ \hline
\cite{dataset.IITBombayEnglishHindiParallelCorpus} &2018& IIT Bombay English-Hindi Corpus & Text & Emotion and Sentiment Analysis & The dataset, used for translations since 2016, includes 49,400 sentence pairs from various sources, with 1,659,082 segments from GNOME, KDE4, and TED presentations in the training set.\\ \hline
\cite{dataset.ELDERLY-AT-HOMEcorpus} &2015& ELDERLY-AT-HOME corpus & Video & Emotion and Sentiment Analysis & This multimodal dataset comprises 1,516 utterances and 6,593 words and represents a valuable resource for the study of complex interactions in the field of elderly care.\\ \hline
\cite{dataset.Chinese-KBQA-Dataset} &2018& Chinese KBQA Dataset (NLPCC-ICCPOL 2016) & Text & Interdisciplinary & Largest Chinese KBQA dataset, with 43 million subject-predicate-object triples and 6 million entities. It contains 14,609 training pairs and 9,870 test question-answer pairs provided by Microsoft researchers and obtained from Baidu Encyclopedia and Fobox.\\ \hline
\cite{dataset.ABSA} &2022& ABSA dataset & Text & Healthcare & Contains 526 disease names and 2,078 symptoms, comprising around 2,000 pieces of information. However, the scope of the collection is limited.\\ \hline
\cite{dataset.MELD} &2018& MELD dataset & Text & Emotion and Sentiment Analysis & Built from the television series Friends; over 13,000 annotated utterances in 1,433 multi-speaker dialogues. Includes audio, visual, and textual modalities and offers raw videos, audio segments, and transcripts.\\ \hline
\cite{dataset.MSVD} &2010& Microsoft Research Video Description Corpus (MSVD) & Text & Entertainment & Comprises around 120K sentences collected in the summer of 2010. Mechanical Turk workers were paid to watch short video clips and summarize the action in a single sentence, yielding roughly parallel descriptions for over 2,000 video snippets.\\ \hline
\cite{dataset.MSR-VTT} &2020& MSR-VTT (Microsoft Research Video to Text) & Text & Entertainment & Consists of 10,000 video clips in 20 categories, each annotated with 20 English sentences by Amazon Mechanical Turk workers. The dataset contains approximately 29,000 unique words in subtitles and uses 6K clips for training, 497 for validation, and 2,990 for testing in the standard split.\\ \hline
\cite{dataset.Squad} &2016& Stanford Question Answering Dataset (SQuAD) & Text & Interdisciplinary & Merges the 100K questions of SQuAD1.1 with over 50K unanswerable questions written adversarially by crowdworkers to closely resemble answerable ones.\\ \hline
\cite{dataset.n2c2} &2014& National NLP Clinical Challenges (n2c2) dataset & Text & Healthcare & Consists of 1,237 discharge reports from the Partners HealthCare Research Patient Data Repository.\\ \hline
\cite{dataset.MPII-MD} &2015& MPII Human Pose Dataset & Image & Entertainment & Comprises approximately 25,000 images, of which 15,000 are for training, 3,000 for validation, and 7,000 for testing (with labels withheld). The dataset was extracted from YouTube videos and covers 410 different human activities, each annotated with up to 16 body joints.\\ \hline
\cite{dataset.Talk2car} &2019& Talk2Car & Image & Industrial & Talk2Car contains 8,349 training examples, 1,163 validation examples, and 2,447 test examples.\\ \hline
\cite{dataset.LRC} &2017& Leeds Robotic Commands (LRC) & Image & Industrial & The dataset comprises 204 videos with around 17K images and 1,024 commands, an average of five per video. A variety of 51 objects are manipulated in the videos, including basic block shapes, fruit, cutlery, and office supplies.\\ \hline

\end{tabular}
\label{tab:dataset}
\end{adjustwidth}
\end{table}
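To make the entries in Table \ref{tab:dataset} concrete, the minimal sketch below shows one way such corpora can be loaded for experimentation. It assumes the Hugging Face \texttt{datasets} library; the surveyed papers may obtain these corpora through other channels, and the commented MELD identifier is a hypothetical placeholder rather than a verified repository name.
\begin{verbatim}
# Minimal sketch: loading two of the surveyed corpora for experimentation.
from datasets import load_dataset

# SQuAD 2.0: answerable questions plus adversarially written
# unanswerable ones
squad = load_dataset("squad_v2")
print(squad["train"][0]["question"])

# MELD (emotion-annotated dialogues); the identifier below is a
# hypothetical placeholder -- check the hosting page for the exact name.
# meld = load_dataset("declare-lab/MELD")
\end{verbatim}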


\section{Pre-processing Methods}
\label{sec:Pre-processing}
Pre-processing is vital in deep learning-based natural language processing for
human-robot interaction, converting raw text into machine-readable formats using
techniques like data cleansing and tokenization. These methods, akin to sentiment
analysis, refine input data and enhance model performance. Rigorous pre-processing
ensures data integrity, contributing significantly to scientific advancements in
optimizing NLP applications in HRI. Table \ref{tab:preprocessing} below elucidates prevalent pre-processing methods in this field.
\begin{table}[H]%[htb]%
\begin{adjustwidth}{-4cm}{-4cm} % Comment out/remove adjustwidth environment if
table fits in text column.
\centering
\caption{
{\bf Commonly used pre-processing methods in deep learning-based NLP for human-robot interaction.}}
% \begin{tabular}{p{2.5cm}p{14cm}}
\begin{tabular}{p{3cm}p{12cm}p{5cm}}

\hline

\textbf{Pre-Processing Method} & \textbf{Description} & \textbf{References}
\\
\hline

Image Crop & A pre-processing technique that involves removing a specific part of an image, typically to focus on a region of interest or to resize it for further analysis or display. &\cite{zhang2023tcpcnet},\cite{kushol2023effects},\cite{fanjie2023sust}
\\ \hline

Tokenization & Breaking down a text into smaller linguistic units such as words, phrases, or symbols to facilitate further analysis. &\cite{halawani2023automated},\cite{baghaei2023deep},\cite{aldunate2022understanding},\cite{zhou2023generating},\cite{pandey2022mental}\\ \hline

Padding & Commonly used in sequence-based models such as recurrent neural networks (RNNs) and transformers to ensure that all input sequences are the same length, allowing for efficient batch processing. &\cite{ashraf2023bert},\cite{inamdar2023machine},\cite{jianan2023deep},\cite{karasoy2022spam}
\\ \hline

Lowercasing & Helpful for standardizing text to reduce vocabulary complexity and prevent the model from treating words with different capitalization as different entities. &\cite{zaheer2023multi},\cite{duong2023deep},\cite{alshahrani2023applied}
\\ \hline

Lemmatization \newline
\& \newline
Stemming & Techniques for reducing words to their basic form in order to deal with word variations and thus reduce the size of the vocabulary and improve the generalization capability of the model. &\cite{budiharto2021novel},\cite{ayanouz2020smart},\cite{guazzo2023deep},\cite{wang2023deepsa} \\ \hline

Normalization & Converting text data to a standard format by removing accents, special characters, or diacritics to enable consistent and uniform representation. &\cite{abdalla2023sentiment},\cite{matti2023autokeras},\cite{kumar2022classification}
\\ \hline

Noise removal & An important step in improving the quality of text data by removing irrelevant information such as special characters, symbols, or irrelevant words that do not contribute to the overall meaning. &\cite{amaar2022detection},\cite{kheraleveraging},\cite{merdivan2019dialogue}
\\ \hline

Feature Extraction & Essential for capturing the most important information from the text data and creating meaningful representations that can be effectively used by the model to learn patterns and make predictions. &\cite{johnston2023ns},\cite{yohanes2023emotion},\cite{10343159}
\\ \hline

Word Embedding & A technique used to represent words as dense vectors in a multidimensional space, preserving semantic relationships between words and improving the model's ability to understand context and meaning. &\cite{wan2023text},\cite{chang2023changes},\cite{wang2023generating}
\\ \hline

Stop word removal & Removes frequently occurring words (e.g., 'and', 'the', 'is') that do not contain important information and are often ignored during analysis to reduce noise and increase processing speed. &\cite{balouch2023transformer},\cite{mithun2023development},\cite{das2022deep},\cite{nijhawan2022stress}
\\ \hline

Removal of special \newline characters \& numbers & Helps clean up the text by removing non-alphabetic characters and numeric digits to ensure that the data focuses on the textual information relevant to the analysis. &\cite{olthof2021deep},\cite{nictoi2023unveiling},\cite{das2023sentiment}
\\ \hline

Vectorization & The process of converting text data into numeric vectors, making it suitable for various machine learning models that require numeric input for processing. &\cite{gupta2023detecting},\cite{xavier2022natural},\cite{marulli2021exploring}
\\ \hline

POS tagging \newline (Part-of-Speech Tagging) & Assigning grammatical tags to words in a sentence, allowing the model to understand the role of each word and its relationship to other words in the text. &\cite{eppe2016exploiting},\cite{pandy2023extracting},\cite{villa2023extracting}
\\ \hline

Handling Contractions & Expanding contractions to their full form helps standardize text data and avoid ambiguity, especially in cases where the contraction may have a different meaning. &\cite{chai2023twitter},\cite{mahimaidoss2023emotion}
\\ \hline

Named Entity Recognition (NER) & A process that identifies and classifies named entities in text, such as names, places, dates, and numeric values, so that the model can recognize specific entities and their context. &\cite{ahmed2023fine},\cite{jang2022exploration}
\\ \hline

Punctuation removal & Removing punctuation from text data helps to simplify the text and ensures that the model focuses on the context of the text, improving the accuracy of the analysis and predictions. &\cite{agarwal2023deepgram},\cite{motyka2023information},\cite{ashfaque2023design}
\\ \hline

\end{tabular}

\label{tab:preprocessing}
\end{adjustwidth}
\end{table}
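As a concrete companion to Table \ref{tab:preprocessing}, the following minimal Python sketch chains several of the listed methods (lowercasing, punctuation and number removal, tokenization, stop word removal, and lemmatization). The use of NLTK and the simple whitespace tokenizer are illustrative assumptions, not the toolchain of any cited study.
\begin{verbatim}
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    text = text.lower()                              # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation/number removal
    tokens = text.split()                            # whitespace tokenization
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]   # stop word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens] # lemmatization

print(preprocess("The robot is fetching 2 cups from the kitchen!"))
# -> ['robot', 'fetching', 'cup', 'kitchen']
\end{verbatim}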

\section{Models}
\label{sec:Models}
In the academic field of human-robot interaction using deep learning-based natural language processing, this section serves as a valuable guide for navigating the complicated landscape of deep learning and natural language processing, especially for beginners, and facilitates informed model selection. Given the paramount importance of deep learning-based NLP in human-robot interaction, the comprehensive overview of key models in Table \ref{tab:models} contributes to a holistic understanding of the diverse applications in this specialized field.
\newgeometry{
left=0.4in,
right=0.3in,
top=0.4in,
bottom=0.4in,
}
\begin{landscape}

\begin{longtable}{p{2cm}p{8.5cm}p{6.5cm}p{6.5cm}p{2cm}}
\caption{The table analyzes the commonly used deep learning algorithms in NLP-based
human-robot interaction.}\\
\hline
\label{tab:models}
Algorithm & Description & Advantages & Limitations & Studies \\
\hline
\endfirsthead
\caption*{Table continued from previous page}\\
\hline
Algorithm & Description & Advantages & Limitations & Studies \\
\hline
\endhead
\hline
\multicolumn{5}{r}{\textit{Continued on next page}}
\endfoot
\hline
\endlastfoot

Gated Recurrent Unit (GRU) & GRU, a type of recurrent neural network, utilizes gating mechanisms for efficient information flow regulation, addressing challenges like vanishing gradients in sequential data processing. &
\begin{itemize}
\item \textbf{Efficient Training:} GRUs ensure faster convergence in recurrent
neural networks.
\item \textbf{Reduced Parameters:} GRUs have fewer parameters, which increases
the efficiency of the calculations.
\item \textbf{Parallelization Advantage:} GRUs support faster training through
improved parallelization.
\end{itemize}
&
\begin{itemize}
\item \textbf{Short Memory Range:} GRUs may have difficulty capturing long-term
patterns.
\item \textbf{Hyperparameter Sensitivity:} GRU performance is sensitive to
hyperparameter settings.
\item \textbf{Limited Expressive Power:} GRUs may be less powerful at complex
tasks than architectures such as LSTMs.
\end{itemize}
&\cite{algorithem.GRU.one},\cite{algorithem.GRU.two},\cite{algorithem.GRU.three} \\

\hline

Sequence to Sequence (SEQ2SEQ) & Seq2Seq is a neural network architecture tailored for sequential data processing, like language translation, employing an encoder to convert input sequences into a consistent contextual representation, subsequently utilized by a decoder to generate corresponding output sequences. &
\begin{itemize}
\item \textbf{Versatility:} Seq2Seq handles tasks of variable length, such as language translations.
\item \textbf{Context Awareness:} Seq2Seq captures the sequence context for coherent results.
\item \textbf{Automated Learning:} Seq2Seq learns mappings without manual feature engineering.
\end{itemize}
&
\begin{itemize}
\item \textbf{Input Length Variability:} Seq2Seq struggles with variable-length inputs.
\item \textbf{Limited Long-Term Comprehension:} Difficulty capturing long-term dependencies.
\item \textbf{Attention Gaps:} Cannot focus effectively on important input details.
\end{itemize} & \cite{algorithem.SEQ2SEQ.one},\cite{algorithem.SEQ2SEQ.two},\cite{algorithem.SEQ2SEQ.three},\cite{algorithem.SEQ2SEQ.four} \\
\hline

Hierarchical Attention Network (HAN) & The Hierarchical Attention Network (HAN) is a specialized deep learning framework for sequential data, employing hierarchical attention mechanisms to effectively capture context at both word and sentence levels, enhancing its ability to analyze hierarchical structures in text data. &
\begin{itemize}
\item \textbf{Versatile Understanding:} HAN understands documents effectively at word and sentence level.
\item \textbf{Precise Attention:} HAN's attention mechanism focuses on the most important information.
\item \textbf{Adaptable Length Handling:} HAN manages documents of variable length through hierarchical attention.
\end{itemize}
&
\begin{itemize}
\item \textbf{Complex Configuration:} HANs need to be set up carefully due to the dual attention mechanisms.
\item \textbf{High Resource Consumption:} HANs can be resource-intensive, limiting real-time use.
\item \textbf{Data Dependency:} HANs require an abundance of tagged data, which is a challenge with smaller datasets.
\end{itemize}
& \cite{algorithem.HAN.one},\cite{algorithem.HAN.two} \\
\hline
Bidirectional Encoder Representations from Transformers (BERT) &
BERT is a natural language processing algorithm that employs bidirectional context
comprehension through pre-training on extensive unlabeled text data, enabling
nuanced contextual understanding and substantial performance improvements across
diverse downstream NLP tasks. &
\begin{itemize}
\item \textbf{Precision in Context:} Excellent understanding of words in sentence
context.
\item \textbf{Flexible Pre-training:} Adapts to tasks with minimal task-specific
data through versatile pre-training.
\item \textbf{Comprehensive Contextual Awareness:} Utilizes bidirectional
attention for a comprehensive understanding of surrounding words.
\end{itemize}

& \begin{itemize}
\item \textbf{High Compute Requirements:} Intensive calculations limit use in resource-constrained environments.
\item \textbf{Sequential Limits:} Difficulties in capturing sequential nuances affect task performance.
\item \textbf{Memory Requirements:} Large memory requirements pose a challenge for use on devices with limited memory.
\end{itemize}
& \cite{algorithem.BERT.one},\cite{algorithem.BERT.two},\cite{algorithem.BERT.three},\cite{algorithem.BERT.four} \\
\hline
Long Short Term Memory (LSTM) & LSTM, a recurrent neural network architecture,
addresses the vanishing gradient problem by utilizing memory cells and specialized
gates to selectively store and retrieve information over long sequences, enhancing
its effectiveness in modeling dependencies within sequential data like natural
language. &
\begin{itemize}
\item \textbf{Extended Memory Span:} Excels in retaining information over long
sequences.
\item \textbf{Gradient Stability:} Ensures stable gradients through gating
mechanisms in training.
\item \textbf{Versatile Adaptability:} Adapts effectively to diverse tasks,
processing complex patterns in data.
\end{itemize}

&
\begin{itemize}
\item \textbf{Complexity:} More difficult to understand due to complexity.
\item \textbf{Calculation Requirements:} Requires more resources for calculation.
\item \textbf{Risk of Overfitting:} More prone to overfitting, especially with
limited data.
\end{itemize}

& \cite{algorithem.LSTM.one},\cite{algorithem.LSTM.two},\cite{algorithem.LSTM.three} \\

\hline

Recurrent neural networks (RNN) & Recurrent neural networks (RNNs) are a class of
neural networks designed for sequential data processing. They use an internal
memory to capture dependencies and patterns in sequential information, making them
well suited for tasks such as natural language processing and time series
prediction. &
\begin{itemize}
\item \textbf{Sequential Processing:} Handles the processing of sequential and time-related data.
\item \textbf{Adaptable Lengths:} Flexible handling of input sequences of different lengths.
\item \textbf{Memory Retention:} Remembers previous inputs and improves context-dependent calculation.
\end{itemize}

&
\begin{itemize}
\item \textbf{Vanishing Gradient:} The gradient decreases in RNNs, making long-
term learning more difficult.
\item \textbf{Limited Long-Term Dependency:} Difficulties in capturing distant
dependencies.
\end{itemize}

& \cite{algorithem.RNN.one},\cite{algorithem.RNN.two},\cite{algorithem.RNN.three},\cite{algorithem.RNN.four} \\
\hline
Convolutional neural networks (CNN) & Convolutional neural networks (CNNs) are a
class of deep neural networks designed for processing structured raster data, such
as images. They use convolutional layers to automatically learn hierarchical
representations of features, making them very effective in tasks such as image
recognition and computer vision. & \begin{itemize}
\item \textbf{Feature Hierarchies:} Automatic learning of hierarchical features
for pattern recognition.
\item \textbf{Parameter Sharing:} Efficient reduction of parameters through
weight sharing.
\item \textbf{Translation Invariance:} Recognition of features regardless of
their input position.
\end{itemize}
&
\begin{itemize}
\item \textbf{Loss of Spatial Information:} Fine details can be lost through pooling layers.
\item \textbf{Sequential Data Limitation:} Less effective with sequential data.
\item \textbf{Sensitivity to Rotation and Scale:} Sensitivity to variations in
rotation and scale.
\end{itemize}

& \cite{algorithem.CNN.one},\cite{algorithem.CNN.two},\cite{algorithem.CNN.three},\cite{algorithem.CNN.four} \\
\hline

Artificial neural networks (ANN) & Artificial neural networks (ANNs) are
computational models based on the structure and functioning of the human brain.
They consist of interconnected nodes organized in layers that allow them to learn
and adapt to complex patterns, making ANNs versatile for a variety of tasks, from
pattern recognition to decision making. &
\begin{itemize}
\item \textbf{Parallel Processing:} Efficient parallelization thanks to
attention mechanisms in Transformers.
\item \textbf{Long-term Dependencies:} Effective capture of long-range
dependencies.
\item \textbf{Mitigated Gradient Problems:} Reduced susceptibility to the
vanishing gradient problem.
\end{itemize}

& \begin{itemize}
\item \textbf{Sequential Handling Issues:} Struggles with managing sequential
information.
\item \textbf{High Computational Complexity:} Demanding due to attention
mechanisms.
\item \textbf{Parameter Sensitivity:} Performance influenced by hyperparameter
choices.
\end{itemize}

& \cite{algorithem.ANN.one},\cite{algorithem.ANN.two},\cite{algorithem.ANN.three} \\
\hline

% Add more rows as needed


\end{longtable}
\end{landscape}
\restoregeometry
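To ground the recurrent architectures in Table \ref{tab:models}, the following hedged PyTorch sketch implements a GRU-based intent classifier of the kind used for spoken commands in HRI. The vocabulary size, dimensions, and number of intents are illustrative assumptions, not values taken from any cited study.
\begin{verbatim}
import torch
import torch.nn as nn

class GRUIntentClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64,
                 hidden_dim=128, n_intents=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_intents)

    def forward(self, token_ids):                # (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, last_hidden = self.gru(embedded)      # (1, batch, hidden_dim)
        return self.classifier(last_hidden.squeeze(0))

model = GRUIntentClassifier()
dummy_batch = torch.randint(1, 5000, (2, 12))    # two padded command sequences
print(model(dummy_batch).shape)                  # torch.Size([2, 6])
\end{verbatim}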

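Similarly, the bidirectional contextual representations that Table \ref{tab:models} attributes to BERT can be obtained in a few lines. This is a hedged sketch using the public \texttt{bert-base-uncased} checkpoint via Hugging Face Transformers, not the setup of any specific surveyed paper.
\begin{verbatim}
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Please hand me the red cup.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per word-piece token, usable by a downstream
# HRI head (e.g., intent or sentiment classification).
print(outputs.last_hidden_state.shape)           # torch.Size([1, 9, 768])
\end{verbatim}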
% \subsection{RNN}
% The most basic recurrent neural network configuration, called the Elman network
% or Simple-RNN (S-RNN), was introduced by Elman in 1990 and later investigated by
% Mikolov (2012) for its applicability in language modeling. The structure of the
% S-RNN is defined as follows:
% \begin{equation}
% \begin{aligned}
% s_i &= R_{SRNN}(s_{i-1},x_i) = g(x_iW^x+s_{i-1}W^s+b)\\
% y_i &= O_{SRNN}(s_i)= s_i\\
% s_i,y_i \in \mathbb{R}^{d_s}, &x_i \in \mathbb{R}^{d_x},
% W^x \in \mathbb{R}^{d_x \times d_s},
% W^s \in \mathbb{R}^{d_s \times d_s},
% b \in \mathbb{R}^{d_s}
% \end{aligned}
% \end{equation}
% In this context, the state at position $i$ is formed by mixing the input at that
% position with the previous state, and this combination is subjected to nonlinear
% activation, typically $\tanh$ or ReLU. The output at position $i$ corresponds to
% the hidden state at that position. Despite its simplicity, the simple RNN yields
% impressive results for tasks such as sequence tagging (as shown by Xu et al. 2015)
% and is also suitable for language modeling. A detailed exploration of the use of
% simple RNNs in the context of language modeling can be found in Mikolov's 2012
% PhD thesis.


% \subsection{LSTM}
% \textcolor{black}{The LSTM architecture, introduced in 1997 by Hochreiter and
Schmidhuber, was specifically designed to solve the problem of vanishing gradients
in neural networks. The core concept of LSTM is to incorporate "memory cells"
(represented as vectors) into the state representation to facilitate the
persistence of gradients over time. These memory cells are regulated by gating
mechanisms, which are smooth mathematical functions that mimic logic gates. At each
input state, a gate is used to determine how much of the new input to store in the
memory cell and how much of the current memory cell contents to discard. In
practice, a gate, referred to as $g$ and restricted to the range [0, 1], is a
vector of values that is multiplied element-wise by another vector $v$ in $\mathbb{R}^{n}$,
followed by an addition operation. The values of $g$ are designed to be either
close to 0 or to 1, which is usually achieved by using a sigmoid function. This
ensures that indices in $v$ corresponding to values close to 1 in $g$ can pass,
while indices corresponding to values close to 0 are effectively blocked.}
% \textcolor{black}{Mathematically, the LSTM architecture is defined as:}
% \begin{equation}
% \begin{aligned}
% s_{j} = R_{LSTM}(s_{j-1},x_j) & =[c_j;h_j] \\
% c_j & = c_{j-1}\odot f + g \odot i \\
% h_j & = \tanh(c_j) \odot o \\
% i & = \sigma(x_j \mathbf{W^{xi}} + h_{j-1}\mathbf{W^{hi}}) \\
% f & = \sigma(x_j \mathbf{W^{xf}} + h_{j-1}\mathbf{W^{hf}}) \\
% o & = \sigma(x_j \mathbf{W^{xo}} + h_{j-1}\mathbf{W^{ho}}) \\
% g & = \tanh(x_j \mathbf{W^{xg}} + h_{j-1}\mathbf{W^{hg}}) \\
% y_j = O_{LSTM}(s_j) &= h_j \\
% s_j \in \mathbb{R}^{2 d_h}, x_j \in \mathbb{R}^{d_x}, c_j,h_j,i,f,o,g \in \mathbb{R}^{d_h}, W^{xo} \in \mathbb{R}^{d_x \times d_h}, W^{ho} \in \mathbb{R}^{d_h \times d_h}
% \end{aligned}
% \end{equation}
% The symbol $\odot$ is used to represent the element-wise (Hadamard) product. At
time j, the state consists of two vectors, namely $c_j,$ which denotes the memory
component, and $h_j$, which represents the output or state component. There are
three gates labeled i, f, and o, which are responsible for controlling the input,
forgetting, and output, respectively. These gate values are determined by linear
combinations of the current input $x_j$ and the previous state $h_{j-1}$, which are
then passed through a sigmoid activation function. A candidate update, denoted g,
is computed as a linear combination of $x_j$ and $h_{j-1}$, which is then subjected
to a $\tanh$ activation function. The memory $c_j$ is then updated, with the forget
gate specifying the extent to which the previous memory should be retained $(c_{j-
1} \odot f )$ and the input gate governing the inclusion of the proposed update $
(g \odot i)$. Ultimately, the value of $h_j$ (which also serves as output $y_j$) is
derived from the contents of memory $c_j$, passed through a nonlinear $\tanh$
transform and controlled by the output gate. These gating mechanisms facilitate the
preservation of gradients associated with the memory component $c_j$ over long time
intervals.

% A more detailed examination of the LSTM architecture can be found in the PhD
thesis by Alex Graves (2008) and in the description by Chris Olah. For an analysis
of the performance of LSTMs when used as a character-level language model, see the
work of Karpathy et al. (2015).

% LSTMs are currently the most successful variant of recurrent neural networks
(RNNs) and have contributed to numerous breakthrough achievements in sequence
modeling. Their main competitor in the RNN field is GRU, which will be discussed in
more detail in the following sections.
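% As a companion to the equations above, here is a minimal NumPy sketch of a single
% LSTM step (an illustrative example following the notation above; the weights are
% random placeholders, not trained values):
% \begin{verbatim}
% import numpy as np
%
% d_x, d_h = 4, 3
% rng = np.random.default_rng(0)
% W = {k: rng.normal(size=(d_x, d_h)) for k in ("xi", "xf", "xo", "xg")}
% U = {k: rng.normal(size=(d_h, d_h)) for k in ("hi", "hf", "ho", "hg")}
% sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
%
% def lstm_step(c_prev, h_prev, x):
%     i = sigmoid(x @ W["xi"] + h_prev @ U["hi"])   # input gate
%     f = sigmoid(x @ W["xf"] + h_prev @ U["hf"])   # forget gate
%     o = sigmoid(x @ W["xo"] + h_prev @ U["ho"])   # output gate
%     g = np.tanh(x @ W["xg"] + h_prev @ U["hg"])   # candidate update
%     c = c_prev * f + g * i                        # memory update
%     h = np.tanh(c) * o                            # output/state component
%     return c, h                                   # s_j = [c_j; h_j]
%
% c, h = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x))
% \end{verbatim}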

% \subsection{Convolutional Neural Networks(CNN)}


% A Convolutional Neural Network (CNN) is a deep learning model commonly used for
hierarchical classification of documents. It was originally developed for image
processing, but is also suitable for text classification. CNNs use convolutional
layers and pooling techniques to extract features and reduce computational
complexity. However, in text classification, the large number of "channels" can
lead to high dimensionality. The CNN architecture for text classification typically
includes word embedding, convolutional and pooling layers, fully connected layers,
and an output layer.
% CNN-based approaches use filters to analyze text segments, which are usually
divided into text sections. Collobert et al. [4] presented a multilayer CNN-based
neural network that aims to build a versatile model without relying on linguistic
expertise. They claim that it is applicable to various NLP tasks, including those
that involve both words and characters as inputs. In their work, they used two
different methods for word embedding: a window-based approach and a sentence-based
approach. In contrast to RNNs, CNN-based methods show a remarkable advantage in
terms of shorter training times.
% \begin{center}
% % need to redraw and change this
% \includegraphics[width=.6\textwidth]{image.png}
% \end{center}
% The goal of character, word, or document embedding is to transform characters,
words, or documents into vectors that can be used for various NLP tasks. Usually,
these embeddings are trained in conjunction with other NLP tasks. However, a
drawback arises when words with opposite sentiment polarities occur in similar
contexts, which poses a challenge for sentiment classification based on word
embeddings.
% In their research, Tang et al. [5] presented an approach to retrieve sentence-
specific word embeddings by using a large dataset of Twitter posts with weak
annotations. They used three neural networks named SSWEh, SSWEr, and SSWEu. SSWEh
and SSWEr were used to predict the sentiment distribution of texts, with SSWEh
working under constrained conditions and SSWEr under more relaxed conditions. On
the other hand, SSWEu represented a unified model that could capture both syntactic
and sentiment information. Experimental results showed that their approach
performed better in classifying sentiments than methods based on manually generated
features.
% Kim and colleagues (Kim et al., 2016) developed a simple CNN-based network to
build a character-based language model. They claimed that such a character-centric
language model can effectively capture details of subwords, resulting in improved
embeddings for rarely occurring words. This capability is particularly beneficial
for languages with complex morphology. An important aspect of their study is their
proposal to integrate a highway network into the pooling layer. The results of this
network are then transferred into a multilayer recurrent neural network language
model (RNN-LM) for predicting the subsequent word.
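% To illustrate the typical text-CNN pipeline described above (embedding layer,
% convolution, max-over-time pooling, fully connected output), here is a minimal
% PyTorch sketch; all sizes are hypothetical placeholders:
% \begin{verbatim}
% import torch
% import torch.nn as nn
%
% class TextCNN(nn.Module):
%     def __init__(self, vocab=5000, emb=128, n_filters=100, classes=2):
%         super().__init__()
%         self.emb = nn.Embedding(vocab, emb)
%         self.conv = nn.Conv1d(emb, n_filters, kernel_size=3)
%         self.fc = nn.Linear(n_filters, classes)
%
%     def forward(self, tokens):                # tokens: (batch, seq_len)
%         x = self.emb(tokens).transpose(1, 2)  # (batch, emb, seq_len)
%         x = torch.relu(self.conv(x))          # convolution over word windows
%         x = x.max(dim=2).values               # max-over-time pooling
%         return self.fc(x)                     # class scores
%
% logits = TextCNN()(torch.randint(0, 5000, (8, 40)))
% \end{verbatim}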

% \subsection{Artificial Neural Networks(ANN)}


% Artificial neural networks (ANNs) are machine learning models inspired by the
human brain and widely used in natural language processing (NLP). They are ideally
suited for tasks such as text classification, named entity recognition, and
sentiment analysis. ANNs, particularly deep-learning variants such as RNNs and
LSTMs, play a critical role in language modeling and text generation, driving
applications such as machine translation and chatbots. Their ability to process and
understand natural language has greatly expanded the capabilities of NLP systems.
% The basic components of an artificial neural network (ANN) consist of nodes, also
known as processing elements (PE), and the connections that link them together.
Each node receives input, either from other nodes or the external environment, and
produces output that can be passed to other nodes or the environment. In addition,
each node has a function called "f" that controls the conversion of all inputs to
outputs. These connections between nodes are characterized by their strength,
representing either excitatory (positive values) or inhibitory (negative values)
interactions. In addition, the connections between nodes have the ability to adapt
or change themselves.

% Artificial neural networks (ANNs) have strong pattern recognition capabilities
that are critical for tasks such as classification and decision making. They
excel as robust classifiers, able to draw inferences from extensive and sometimes
imprecise input data. ANNs are particularly adept at overcoming nonlinear
challenges. In technical language, we can state that a system lacks complexity if
its representation function is linear, as described by the following two equations:
% \begin{equation}
% f(c x)=c f(x)
% \end{equation}
% And
% \begin{equation}
% f\left(x_1+x_2\right)=f\left(x_1\right)+f\left(x_2\right)
% \end{equation}
% A complex and nonlinear system deviates from one or both of these conditions. In
short, the greater the nonlinearity of the function y = f(x), the more beneficial
the use of an artificial neural network (ANN) is in trying to understand the
underlying rules (R) that govern the processes within the mysterious system. For
example, consider a Cartesian graph where the x-axis represents a person's income
and the y-axis quantifies the level of happiness they experience as a result, then:
% \begin{center}
% % need to redraw and change this
% \includegraphics[width=.6\textwidth]{fi3.png}
% \includegraphics[width=.6\textwidth]{fig4.png}
% \end{center}
% The graphical representation in Figure 3 shows that a person's happiness level is
closely related to his or her wealth. In simple cases, this relationship can be
easily analyzed without the need for an artificial neural network (ANN). However,
it is important to note that such scenarios are the exception rather than the rule.
% Figure 4 shows that in reality, the relationships are often more complicated.
Many people are afraid of losing money or are unsure how to invest their money,
which can affect their overall sense of happiness. The complex dynamics illustrated
in Figure 4 make it difficult to identify the relationship between money and
happiness using only experimental data. In such complex situations, the use of an
ANN can be useful to accurately describe the nuanced relationship between wealth
and well-being.

% \subsection{Hierarchical Attention Network(HAN)}


% The general structure of the Hierarchical Attention Network (HAN) is shown in
Figure 2. It consists of several components, including a word sequence encoder, a
word-level attention layer, a sentence encoder, and a sentence-level attention
layer. In the following sections, we will look at the specifics of these different
elements.
% \begin{center}
% % need to redraw and change this
% \includegraphics[width=.6\textwidth]{HAN.png}
% \end{center}
% \subsubsection{GRU-based sequence encoder}
% The GRU presented by Bahdanau et al. 2014 uses a gating mechanism to monitor
sequence states without the need for separate memory cells. This mechanism consists
of two types of gates, namely the reset gate (rt) and the update gate (zt), which
work together to control the way information is refreshed within the state. At time
t, the GRU calculates the new state as follows:
% $$
% h_t=\left(1-z_t\right) \odot h_{t-1}+z_t \odot \tilde{h}_t .
% $$
% This involves a linear blending of the prior state $h_{t-1}$ and the current
updated state $\tilde{h}_t$, which is computed using fresh sequence data. The gate,
denoted as $z_t$, determines the balance between retaining historical information
and incorporating new information. The formula for updating $z_t$ is as follows:
% $$
% z_t=\sigma\left(W_z x_t+U_z h_{t-1}+b_z\right),
% $$
% At time $t$, the sequence vector $x_t$ is present, and the candidate state $\
tilde{h}_t$ is determined similarly to a conventional recurrent neural network
(RNN):
% $$
% \tilde{h}_t=\tanh \left(W_h x_t+r_t \odot\left(U_h h_{t-1}\right)+b_h\right),
% $$
% In this context, $r_t$ represents the reset gate that determines the extent to
which the previous state influences the candidate state. When $r_t$ is set to zero,
the influence of the previous state is essentially erased. The update process for
the reset gate is outlined as follows:
% $$
% r_t=\sigma\left(W_r x_t+U_r h_{t-1}+b_r\right)
% $$
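% The following minimal NumPy sketch implements one GRU step exactly as defined
% above (an illustrative example; the weights are random placeholders):
% \begin{verbatim}
% import numpy as np
%
% d_x, d_h = 4, 3
% rng = np.random.default_rng(0)
% Wz, Wr, Wh = (rng.normal(size=(d_x, d_h)) for _ in range(3))
% Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
% bz, br, bh = np.zeros(d_h), np.zeros(d_h), np.zeros(d_h)
% sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
%
% def gru_step(h_prev, x):
%     z = sigmoid(x @ Wz + h_prev @ Uz + bz)               # update gate
%     r = sigmoid(x @ Wr + h_prev @ Ur + br)               # reset gate
%     h_tilde = np.tanh(x @ Wh + r * (h_prev @ Uh) + bh)   # candidate state
%     return (1 - z) * h_prev + z * h_tilde                # new state h_t
%
% h = gru_step(np.zeros(d_h), rng.normal(size=d_x))
% \end{verbatim}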
% % TODO: the algorithms prepared by Noman and Opu are below
% \subsection{Generative Pre-trained Transformers(GPT)}
% Generative Pre-trained Transformers (GPT) are a set of natural language
processing models developed by OpenAI. These models are based on the Transformer
architecture and use a neural network optimized for processing sequential data. The
GPT models are pre-trained on large text datasets to produce coherent and
contextually appropriate text in response to specific prompts or input. The term
"generative" in GPT represents the model's ability to create new text content.
After pre-training, GPT can produce imaginative text, respond to requests,
translate languages, and perform various language-related tasks. It does this by
predicting the next word in the sequence based on the context specified in the
prompt.

% Our training procedure includes two main phases. First, we train a high capacity
language model on a large text dataset. This is followed by a fine-tuning phase in
which the model is adapted for a specific discrimination task with labeled data.

% Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$,


our goal is to optimize a typical language modeling objective by maximizing the
associated probability.

% \begin{equation} \label{GPT:1}
% L_1(\mathcal{U})=\sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1} ; \Theta\
right)
% \end{equation}
% Here, $k$ denotes the size of the context window, and the conditional probability $P$
is modeled by a neural network with parameters $\Theta$. These parameters are
refined by applying stochastic gradient descent [51].
% % TODO: the reference above needs to be changed
% In our experiments, we use a transformer decoder with multiple layers [34] for
the language model. This variant of the transformer [62] uses a multi-headed
mechanism to self-observe the input context tokens, followed by position-based
feedforward layers to generate a distribution over the target tokens.
% \begin{equation} \label{GPT:2}
% \begin{aligned}
% h_0 & =U W_e+W_p \\
% h_l & =\text{transformer\_block}\left(h_{l-1}\right) \quad \forall l \in[1, n] \\
% P(u) & =\operatorname{softmax}\left(h_n W_e^T\right)
% \end{aligned}
% \end{equation}
% Where $U = (u_{-k}, \ldots, u_{-1})$ represents the context vector of tokens, $n$
represents the number of layers, $W_e$ represents the token embedding matrix, and $W_p$
represents the position embedding matrix.
% Once the model is trained with the objective described in Equation \ref{GPT:1},
we adjust the parameters for the specific supervised target task. We consider a
labeled dataset $\mathcal{C}$ where each example is a sequence of input tokens $x^1, \ldots,
x^m$ together with a label $y$. These inputs are processed by our pre-trained model
to produce the final activation $h_l^m$ of the last transformer block. Subsequently,
this activation is passed to an additional linear output layer, characterized by the
parameters $W_y$, to make predictions for $y$.
% \begin{equation}
% P\left(y \mid x^1, \ldots, x^m\right)=\operatorname{softmax}\left(h_l^m W_y\
right) .
% \end{equation}
% This results in the following goal, which must be maximized:
% \begin{equation}
% L_2(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
% \end{equation}
% We also discovered that integrating language modeling as an auxiliary goal during
the fine-tuning process improved learning by (a) enhancing the generalization
capabilities of the supervised model and (b) accelerating the convergence process.
This result is consistent with previous research [50, 43] that observed improved
performance with a similar auxiliary objective. More specifically, we optimize the
following objective (with a weighting parameter $\lambda$):
% \begin{equation}
% L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda * L_1(\mathcal{C})
% \end{equation}
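% A PyTorch-style sketch of this combined objective (minimizing cross-entropy is
% equivalent, up to sign, to maximizing the log-likelihoods above; all tensor
% shapes are hypothetical placeholders):
% \begin{verbatim}
% import torch.nn.functional as F
%
% def combined_loss(task_logits, y, lm_logits, next_tokens, lam=0.5):
%     # task_logits: (N, classes); y: (N,)
%     # lm_logits: (N, L, vocab); next_tokens: (N, L)
%     L2 = F.cross_entropy(task_logits, y)              # supervised objective
%     L1 = F.cross_entropy(lm_logits.transpose(1, 2),
%                          next_tokens)                 # auxiliary LM objective
%     return L2 + lam * L1                              # L3 = L2 + lambda * L1
% \end{verbatim}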
% In summary, the only additional parameters needed for the fine-tuning process are
$W_y$ and embeddings for delimiter tokens, as explained in detail in Section~\ref{sub:sec:gpt}.
% \subsubsection{Task-specific input transformations}\label{sub:sec:gpt}
% For some tasks, such as text classification, we can directly fine-tune our model
as described above. Certain other tasks, such as question answering or textual
entailment, have structured inputs, such as ordered sentence pairs or triplets of
document, question, and answers. Since our pre-trained model was trained on
connected text sequences, we need to make some modifications to apply it to these
tasks. Previous work has proposed learning task-specific architectures based on
transferred representations [44]. Such an approach reintroduces a significant
amount of task-specific adaptation and does not use transfer learning for these
additional architectural components. Instead, we use a traversal-like approach
[52], where we transform structured inputs into an ordered sequence that our pre-
trained model can process. These input transformations allow us to avoid extensive
changes to the architecture for different tasks. We briefly describe these input
transformations below and visually illustrate them in Figure~\ref{fig-gpt}. All
transformations involve adding randomly initialized start
% and end tokens ($\langle s \rangle$, $\langle e \rangle$).
% \begin{figure} \label{fig-gpt}
% \centering
% \includegraphics[width=0.2\linewidth]{gpt1.png}
% \includegraphics[width=0.75\linewidth]{gp2.png}
% \caption{(left) Transformer architecture and training objectives used in this
work. (right) Input transformations for fine-tuning on different tasks. We convert
all structured inputs into token sequences to be processed by our pre-trained
model, followed by a linear+softmax layer.}
% \end{figure}

% \textbf{Textual Entailment} In entailment tasks, the token sequences of premise p


and hypothesis h are concatenated with a separator (\$) between them.

% \textbf{Similarity} In similarity tasks, there is no inherent order of the two


sentences being compared. To reflect this, we modify the input sequence to include
both possible sentence arrangements (with a separator in between) and process each
independently to produce two sequence representations $h_l^m$,
% which are added element-wise before being fed into the linear output layer.

% \textbf{Question Answering and Commonsense Reasoning} For these tasks, we are


given a context document z, a question q, and a set of possible answers \{$a_k$\}.
We concatenate the document context and the question with each possible answer and
insert a separator in between to get [z; q; \$; $a_k$]. Each
% of these sequences is processed independently with our model and then normalized
via a softmax layer to obtain an output distribution over possible answers.
% \subsection{Bidirectional Encoder Representations from Transformers(BERT)}
% BERT, developed by Google researchers, is an advanced deep learning model that
has changed the landscape of natural language processing (NLP). Its innovative
design allows it to capture the meaning of words within sentences bidirectionally
by considering both left-to-right and right-to-left contexts, capturing the
inherent bidirectional complexity of language.
% BERT is based on the Transformer architecture, which relies on the attention
mechanism. Unlike traditional models that process text linearly, BERT captures the
entire input sentence bidirectionally. This means that it understands context from
both left to right and right to left, allowing it to capture contextual cues from
every word in the sentence. Thanks to this unique approach, BERT is able to more
efficiently understand the connections between words, sentences, and even entire
documents.
% The architecture of the BERT model includes a multi-layer bidirectional
transformer encoder derived from the original design presented in the 2017 work by
Vaswani et al. This implementation has been made publicly available in the
tensor2tensor library. Since the use of Transformer is now widespread and our
version is very similar to the original, we do not provide a detailed explanation
of the model structure here. Interested readers can find comprehensive information
in the work of Vaswani et al. and in useful resources such as "The Annotated
Transformer". In this study, we use notation L for the number of layers (or
transformer blocks), H for the hidden size, and A for the number of self-attention
heads. We present our results mainly for two model configurations: BERTBASE (with
L=12, H=768, A=12, total 110 million parameters) and BERTLARGE (with L=24, H=1024,
A=16, total 340 million parameters). We chose BERTBASE to match the model size of
OpenAI GPT for a meaningful comparison. Importantly, the BERT transformer uses
bidirectional self-attention, while the GPT transformer uses constrained self-
attention, which allows each token to attend only to the context to its left.

\section{Applications}
\label{sec:Application}
% TODO: needs to be divided into subsections
Human-Robot Interaction (HRI) marks a crucial milestone in robotics development,
and its importance keeps growing in our increasingly connected world. It goes beyond mere
functionality, enabling robots to understand not only commands but also the emotions and
intentions contained in human language. The profound impact of HRI is being felt in
many areas, from healthcare, where robots can provide companionship and support, to
education, where they can enhance learning experiences. In customer service, robots
can answer inquiries more effectively, while in manufacturing, they can work seamlessly with
human workers to improve production efficiency. Deep learning-based natural
language processing (NLP) is the linchpin of this transformation, enabling robots
to recognize nuance, respond empathetically, and generate contextual responses.
This advanced NLP technology streamlines interactions, improves accessibility,
and thus enhances the user experience. Consequently, it accelerates the integration of
robots into daily life and industry, bringing us closer to a future where human-
robot interactions are efficient and remarkably human-centric, fostering an
environment of trust and acceptance.

\begin{figure}
\includegraphics[width = \textwidth]{application_diagram.png}
\caption{Diverse application areas of Human-Robot Interaction.}
\label{fig:application}
\end{figure}

\subsection{Emotion and Sentiment Analysis}


The importance of human-robot interaction (HRI) in the field of emotion and
sentiment analysis is profound, providing robots with a way to understand and
respond effectively to human emotions. This integration not only improves the user
experience but also enables robots to adapt their behavior based on the emotional
cues of their human counterparts. In the academic literature, several applied papers
in this area highlight the active contribution of HRI to advances in emotion and
sentiment analysis and demonstrate the potential for more emotionally intelligent
and responsive robotic systems.

Tohma et al. developed a state-of-the-art Turkish NLP application to improve
emotional performance in question-answering systems and human-robot interactions by
using mood analysis techniques to pave the way for more engaging conversational
agents \cite{etalTohma}.
Xie et al. presented an empathetic dialog agent for robots that uses a fine-tuned,
pre-trained language model to enable natural responses with emotion analysis
capabilities evaluated against recent studies \cite{elalXie}.
Machová et al. presented a lexicon-based approach with automatic labelling and
integration of machine learning for sentiment analysis and tested the effectiveness
of deep learning on various text sources such as movie reviews and Twitter
messages \cite{etalMachova}. Bai et al. proposed a novel method for automatic emotion
recognition in descriptive sentences that uses sentiment analysis to map text to
emotional states and improve facial expression generation in human-robot
interactions, while addressing the under-researched area of semantic-level
sentiment analysis \cite{etalBAI}. Atzeni et al. presented a system that combines
natural language processing and the Semantic Web to predict stock market sentiment
and provide valuable insights to investors and analysts by leveraging linguistic
resources and advanced technologies \cite{etalAtzeni}.
Lakomkin et al. investigated neural acoustic emotion recognition models for human-
robot interaction, highlighting their performance under real-world conditions and
proposing methods to improve robustness, with potential applications to enhance
human-robot collaboration \cite{etalLakomkin}. Mi et al. introduced a framework
for natural human-robot interaction (HRI) that integrates spoken language
understanding and object affordance recognition, allowing robots to efficiently
grasp intended objects based on spoken instructions \cite{etalMi}. Atzeni et al.
introduced an advanced approach to sentiment analysis using bidirectional LSTM
networks with neural attention, improving word embeddings, handling out-of-vocabulary
words, and scoring best in the ESWC 2018 Semantic Sentiment Analysis
Challenge \cite{etalAtzeni2}. Chuah et al. used Instagram data to examine how
emotional robots affect potential consumers, highlighting the importance
of emotional expressions in human-robot interactions for improving consumer
experiences through machine learning and sentiment analysis \cite{etalChuah}. Eom et
al. investigated the use of deep learning NLP to predict the sentiments of South
Korean Twitter users on the topic of vaccination during the spread of the Omicron
variant, performed network analysis to identify relevant topics, used various deep
learning algorithms to predict emotions, and applied topic modeling to analyze news
data \cite{etalEom}. Tejaswini et al. developed a novel deep learning hybrid model,
``Fasttext Convolution Neural Network with Long Short-Term Memory (FCL)'', which
employs NLP techniques and enables highly accurate depression detection from social
media texts, surpassing the state of the art \cite{etal.Tejaswini}.
Chakraborty et al. developed a method to measure and adapt human visual
attention in human-robot interactions using deep learning techniques, and
demonstrated practical robot-based implementations \cite{etalChakraborty}. Singh et
al. developed ``Tinku'', an affordable robot equipped with speech processing and
computer vision capabilities to assist autistic children in therapy and
education \cite{etalSingh}. Arumugam et al. proposed an approach that improves the
interpretation of natural-language instructions for mobile manipulation robots by using
deep neural network language models and hierarchical planning to increase efficiency
in complex environments.

\subsection{Healthcare}
In healthcare, human-robot interaction (HRI) is proving to be a transformative
force. HRI has the potential to revolutionize patient care and encompasses a wide
range of applications, from assisting medical staff during surgeries to providing
emotional support to patients with mental illness. This review paper explores the
multifaceted landscape of HRI in healthcare and examines its critical role in
improving medical services, enhancing patient outcomes, and addressing healthcare
challenges.
the growing importance of HRI and pave the way for a comprehensive understanding of
its evolving role in the healthcare ecosystem.

Escobar et al. developed a deep learning facial expression recognizer tailored for
children with Down syndrome and investigated the integration of EEG for potential
applications in therapies and brain-computer interfaces\cite{etalEscober}. Ilyas et
al. discussed how deep transfer learning can improve human-robot interaction in
cognitive and physical rehabilitation, which could improve the quality of life of
people with impairments\cite{etalIlyas}.
Dai et al. showed how deep learning-based NLP models improved screening, diagnosis,
and early intervention for psychiatric patients in mental health care, including
classification of discharge reports from electronic medical records and the
potential for early detection through social media analysis \cite{etal.Dai}.
Kim et al. proposed a deep-transfer learning-based NLP model for predicting overall
survival in patients with rectal cancer from unstructured MRI reports and
emphasized the superior performance of this model over conventional methods for
using unstructured medical data \cite{etal.Kim}. Faizan et al. present PyNDA, a
groundbreaking deep learning architecture demonstrating superior ability to extract
various psychometric dimensions from user-generated text \cite{etal.Ahmad}.
Gouws et al. improve deep neural language modeling with a modified HLBL model,
incorporating convolutional layers and RNN layers for improved word sense
disambiguation and optimized learned representations \cite{etal.Gouws}. Vankayala et
al. contribute to depression detection with the hybrid deep learning model FCL,
incorporating Fasttext embedding, CNN, and LSTM for improved early-stage detection
in social media texts \cite{etal.Tejaswini}. Nandini et al. contribute to
depression detection using the hybrid deep learning model FCL, representing a
significant advance in the application of NLP techniques \cite{etal.Nandini}.
Ferrone et al. explore the synergy between symbolic and
distributed representations in NLP and deep learning, developing novel neural
networks for interpreting symbols in classical NLP tasks \cite{etal.Ferrone}. Dai et
al. demonstrate the effectiveness of pre-trained deep learning models in improving
mental illness classification using sparse electronic health data for accurate
psychiatric diagnosis and early detection \cite{etal.Dai}. Mariani et al. analyze
five-year changes in language and NLP research using the NLP4NLP+5 corpus, focusing on AI,
neural networks, machine learning, and word embedding for science-based
insights \cite{etal.Mariani}. Hollenstein et al. integrate EEG brain activity to improve
natural language processing tasks, particularly emotion classification,
demonstrating the potential of EEG features to improve performance in scenarios
with limited training data \cite{etal.Hollenstein}. Kim et al. present a deep
transfer learning model using serial radiologic reports to predict survival of
patients with rectal cancer, advancing the use of unstructured medical big data for
informed clinical decisions \cite{etal.Kim}. Le et al. introduce a novel method
using fastText and deep convolutional neural networks to identify SNARE proteins,
surpassing the SNARE-CNN predictor with high accuracy, sensitivity, specificity,
and MCC \cite{etal.Le}. Orsag et al. present a trainable model based on human-
skeleton data and LSTM networks to recognize the spatio-temporal activity of human
workers, enabling context-aware human-robot collaboration in industrial
environments with a focus on safety and efficiency \cite{etal.Orsag}.

Guazzo et al. present deep learning NLP models for the automatic identification of
hospitalizations for cardiovascular disease in diabetic patients and demonstrate
their effectiveness in different time windows \cite{application.healthcare.Guazzo}.
Kim et al. present a superior deep learning-based algorithm for extracting keywords
from pathology reports that can be used in biomedical research and medical
institutions to overcome data extraction challenges \cite{application.healthcare.Kim1}.

Sarraju et al. present and validate clinical BERT-based NLP models. They achieve
high accuracy in classifying statin non-use and reasons for non-use in patients
with atherosclerotic cardiovascular disease using electronic health records,
providing valuable insights for targeted interventions \cite{application.healthcare.Sarraju}.

\subsection{Entertainment}

In the entertainment industry, interaction between humans and robots opens up new
dimensions of user experience. Interactive robots create immersive and engaging
entertainment and foster unique forms of creative expression and shared social
experiences. This integration not only reflects technological advancements, but
also shapes the future of entertainment by providing audiences with engaging and
interactive forms of enjoyment.

The transformer model for human activity recognition using wearable sensors
proposed by Luptáková et al. demonstrates the real-time analysis of smartphone
sensor data and the versatility of its applicability in various scenarios, as
evidenced by its successful application in the entertainment
context \cite{application.entertainment.teja}.
Tan et al. present a task-completion question answering framework with a multimodal
dataset and a hybrid deep learning-symbolic reasoning approach that handles
collaborative tasks with text, images and videos in a natural environment\
cite{application.entertainment.tan}.
Lee et al. present a framework that uses a multimodal dataset and a hybrid deep
learning-symbolic reasoning approach to answer task completion questions. The
approach was developed to handle collaborative tasks with text, images and videos
in a natural environment\cite{application.entertainment.lee}.
Atzeni et al. demonstrate the successful integration of sentiment analysis, deep
learning and semantic technologies to enable Zora, a humanoid robot, to perform
natural language interactions in the field of human-robot interaction\
cite{application.entertainment.Atzeni}.
Morshed et al. contribute by providing a thorough review of recent advances in
human activity recognition, covering methods, frameworks, datasets and challenges.
The article also provides practical guidance for researchers and practitioners in
the field\cite{application.entertainment.Morshed}.
Dirgová et al. adapt the transformer model, originally developed for natural
language processing and vision tasks, for precise real-time human activity
recognition from smartphone motion sensor data, demonstrating high accuracy and
practical applicability \cite{application.entertainment.Dirgová}.
Russo et al. utilize GPT-2 in deep learning-based NLP to extract meaningful neural
encodings from functional MRI during narrative listening, revealing that GPT-2
surprisal and saliency explain neural data in language-related brain regions,
highlighting the potential for deep learning models to investigate complex neural
mechanisms underlying human language comprehension \cite{application.entertainment.Russo}.

% \subsection{Management}
% The LLM-based smart reply (LSR) system by Bastola et al., which is based on
ChatGPT, improves collaboration efficiency through contextual and personalized
responses, resulting in improved team performance and reduced mental load in daily
work\cite{application.management.bastola}.

\subsection{Industrial}
The importance of human-robot interaction in industrial contexts is underlined by
its transformative impact on efficiency, safety and productivity. As technology
advances, these benefits are further enhanced by the integration of deep learning-
based natural language processing, enabling more nuanced and adaptive interactions
between humans and robots. Below are numerous examples of such applications in
industry that demonstrate how the synergy of human-robot interaction and deep
learning-based natural language processing contributes to a more efficient,
responsive and effective operating environment.

Liu et al. develop a deep learning-based multimodal control interface that enables
seamless collaboration between humans and robots in manufacturing and provides the
flexibility for dynamic task changes in a shared workspace\
cite{application.industrial.liu}.
Illuri et al. develop a hand gesture recognition system that combines a
humanoid robot with machine learning techniques and enables accurate recognition of
hand postures and gestures in human-robot interaction \cite{application.industrial.illuri}.
Ahn et al. present the Interactive Text2Pickup (IT2P) network, which improves
human-robot collaboration by skillfully handling ambiguous voice commands through
interaction, ensuring accurate pickup of objects based on user instructions\
cite{application.industrial.Ahn}.
Mohamed et al. contribute by introducing conventions, standard interfaces and a
reference pipeline in ROS for HRI, with the aim of improving interoperability and
enabling the reuse of core functionality across different HRI-related software
tools\cite{application.industrial.Mohamed}.
Keshinro et al. present a deep learning-based approach using ConvLSTM and LRCN
algorithms with RGB images that improves human-robot collaboration by accurately
predicting human intentions and improving team planning and execution by
recognizing implicit human intentions\cite{application.industrial.Keshinro}.
Ben-Youssef et al. present CollisionNet, a deep neural network developed for
collision detection in collaborative robots. This model is characterized by
remarkable sensitivity, robustness to false alarms and generalization across
different robots and motions\cite{application.industrial.Ben-Youssef}.
Kang et al. contribute by presenting video annotation methods that utilize both the
egocentric and exocentric viewpoints of the robot for human-robot interaction and
improve the description of visual information in a social robotics context. This
approach aims to create a comprehensive understanding of the robot's visual
perspective\cite{application.industrial.Kang}.
Lu et al. present an innovative lip speech decoding system using low-cost
triboelectric sensors and an enhanced recurrent neural network that has high
accuracy and potential applications for individuals with vocal cord lesions,
enriching the landscape of lip-speech translation systems\
cite{application.industrial.Lu}.
Wahab et al. present 4mCNLP-Deep, a superior computational model for the
identification of N4-methylcytosine sites that utilizes deep learning and word
embedding. At the same time, the importance of DNA methylation is highlighted and
the effectiveness of deep learning in analyzing genomic data is demonstrated\
cite{application.industrial.Wahab}.
Ruffolo et al. present IgFold, a rapid deep learning method for antibody structure
prediction that provides excellent insights into a large number of paired antibody
sequences, highlighting the importance of accurate structure prediction for the
study of adaptive immune response and potential therapeutic applications\
cite{application.industrial.Ruffolo}.
Liu et al. present OPED, a deep learning-based prime-editing guide RNA design
optimization model that demonstrates superior accuracy, efficiency enhancement, and
versatility in genome editing applications, along with the OPEDVar
database for user-friendly access to optimized designs \cite{application.industrial.Liu1}.
SLICES, introduced by Xiao et al., is a string-based crystal representation
with remarkable invertibility, positioned as a promising tool for in-silico
exploration of materials and inverse design of narrow-gap
semiconductors \cite{application.industrial.Xiao}.
CLOOME, presented by Fernandez et al., is a multimodal contrastive learning
system that significantly improves the retrieval of chemical structures from
bioimaging databases, shows remarkable transferability to various drug discovery
tasks, and outperforms existing methods in predicting mechanism of
action \cite{application.industrial.Fernandez}.
\subsection{Interdisciplinary}
Human-robot interaction (HRI) is crucial in interdisciplinary research as it
combines insights from robotics, psychology, computer science and design for a
seamless integration of robotic technologies. This collaborative approach requires
a sophisticated understanding of human behavior and preferences. The
interdisciplinary nature of HRI is critical to the development of sophisticated,
socially aware robotic systems that can work seamlessly with humans. The following
commentary examines specific contributions of relevant work that underscore the
importance of HRI as a catalyst for interdisciplinary exploration and innovation.
Goldstein et al. show empirical parallels between language processing in the human
brain and an autoregressive deep language model and point to common computational
principles and potential applications in modeling language-related processes\
cite{application.Interdisciplinary.Goldstein}.
Zeng et al. propose a unified deep learning system that seamlessly integrates
molecular structures and biomedical texts, outperforms human experts in
understanding molecular features, and demonstrates its versatility in various
biomedical tasks, with potential applications in automated drug discovery\
cite{application.Interdisciplinary.Zeng}.
Mao et al. present IKGM, a novel deep learning method using attention mechanisms to
identify key genes in macroevolution, which has been successfully applied to
diurnal butterflies and nocturnal moths, demonstrating its potential for insights
into macroevolutionary mechanisms at the genomic level\
cite{application.Interdisciplinary.Mao}.
Frey et al. explore neural scaling in deep chemical models and propose strategies
for improved pre-training efficiency and performance that have practical
implications for applications in robotics and computer science\
cite{application.Interdisciplinary.Frey}.
Shanmugavadivel et al. apply machine learning and pre-trained models to effectively
overcome the challenges of sentiment analysis in low-resource settings, especially
on Tamil-English code-mix data. Their application contributes significantly to
advancing sentiment analysis in code-mix languages \
cite{application.Interdisciplinary.Shanmugavadivel}.
Diviya et al. present a novel neural architecture for natural language image
synthesis in Tamil that addresses the challenges of regional language, improves
descriptive image synthesis, and contributes to computer vision and speech
synthesis\cite{application.Interdisciplinary.Diviya}.
Polyglotter, co-developed by Bazaga et al., is a flexible machine learning system
that converts natural language queries into database commands without manual
annotation. It uses a Transformer-based model and shows strong performance across
different database engines \cite{application.Interdisciplinary.Bazaga}.

% need introduction
\subsection{Others}
Below are several noteworthy papers that hold considerable value concerning their
application in Human-Robot Interaction (HRI). These papers contribute significantly
to the understanding and advancement of HRI dynamics, offering valuable insights
into various aspects of the field.
The transformer model for human activity recognition using wearable sensors
presented by Luptáková et al. demonstrates the real-time analysis of smartphone
sensor data and its broad applicability in different scenarios \cite{etalLuptáková}.
Kasmaiee et al. propose an effective two-method system that combines rule-based
approaches and a sophisticated deep learning model for automatic spelling
correction in Persian texts, with potential applications in computer systems and
robotics \cite{application.others.Kasmaiee}.
Li et al. investigate neural coding in the human auditory pathway using DNN models
and show strong correlations between hierarchical DNN layers and neural activity,
with language-specific models predicting cortical responses and highlighting the
superiority of DNNs in speech processing tasks\cite{application.others.Li}.
TalkToModel, developed by Slack et al., is an advanced conversational system that
outperforms traditional point-and-click explanation systems and enables users to
interactively understand and explain machine learning models, especially in
critical areas such as healthcare \cite{application.others.Slack}.
\section{Results Analysis}
\label{sec:Results_analysis}
The results and analysis derived from the aggregated table of papers in this
survey, focusing on the use of deep learning-based natural language processing
(NLP) in human-robot interaction (HRI), provide insightful perspectives on the
current research landscape. Examining these works helps identify trends,
challenges, and innovative areas within the field and provides researchers with a
nuanced understanding of advances and existing gaps in the literature. Detailed
findings on the prevalence and performance of various models are outlined in
Table~\ref{tab:result}, contributing to a comprehensive understanding of the most
advanced applications in the field.
\begin{longtable}
{p{2cm}p{2cm}p{3.5cm}p{3cm}p{1.5cm}p{4cm} }
\caption{Summary of the domains, datasets, pre-processing methods, models, and
results reported in recent papers.}
\label{tab:result} \\
\hline
\textbf{Reference} & \textbf{Domain}
& \textbf{Dataset} & \textbf{Pre-processing Methods} & \textbf{Model} & \
textbf{Results}\\
\hline
\endfirsthead

\multicolumn{6}{c}%
{{\bfseries \tablename\ \thetable{} -- continued from previous page}} \\
\hline
\textbf{Reference} & \textbf{Domain}
& \textbf{Dataset} & \textbf{Pre-processing Methods} & \textbf{Model} & \
textbf{Results} \\
\hline
\endhead

\hline \multicolumn{6}{r}{{Continued on next page}} \\


\endfoot

\hline
\endlastfoot
Khan \emph{et al.} \cite{etal.Wang}
& Interdisciplinary
& COCO
&
Tokenization\newline
Part-of-Speech Tagging(POS)\newline
Lemmatization\newline
Stemming\newline
&
Mask-RCNN,\newline
CNN-LSTM network
& \textbf{F1-score}: \newline 97.01\%, 93.34\% (NIH) \newline 97.50\%, 95.78\%
(JRST) \\ \hline

Shervedani \emph{et al.} \cite{etal.Sheredani}


& Emotion and
Sentiment
Analysis
&
ELDERLY-AT-HOME corpus
&
Tokenization\newline Text Cleaning\newline Vectorization\newline Normalization\
newline Feature Selection &
Generic User Simulator Model (GUS Model) &
\textbf{DA Accuracy}:79.3\%\newline\textbf{Action Accuracy}:80.13\% \\ \hline

Li \emph{et al.} \cite{etal.Li}


& Industrial
&Virtual-Assistant-Max
&
Tokenization\newline Embeddings
&
BERT
&
\textbf{Intent accuracy}:0.977\newline
\textbf{F1-score}:0.968 \\ \hline

Dong \emph{et al.} \cite{etal.Dong}


& Interdisciplinary
&Talk2Car &
Tokenization\newline
Data Augmentation\newline
Normalization\newline
Vectorization
&
BERT
&
\textbf{AP50:}76.74 \\ \hline

Peng \emph{et al.} \cite{etal.Peng}


& Emotion and Sentiment Analysis
&
MELD dataset\newline
EmoryNLP\newline
IEMOCAP
&
Tokenization\newline
Data Augmentation\newline
Lemmatization and Stemming\newline
Normalization\newline
&
CNN\newline
Bi-LSTM\newline
&
\textbf{Accuracy:} 64.03\% \\ \hline

Larisch \emph{et al.}\cite{etal.Larisch}


& Industrial
& Blue Gene/L (BGL) and Spirit &
Tokenization\newline
Padding\newline
Windowing\newline
Data Augmentation\newline
&
Compact Convolutional Transformer &
\textbf{Precision:} 94.52\newline \textbf{Recall}\textbf{:} 57.87\newline \
textbf{F1-Score:} 74.27 \\ \hline

Kim \emph{et al.} \cite{etal.Kim}


& Healthcare &
WebMD Dictionary\newline
NHS inform\newline
Snomed Ct\newline
Cleveland Clinic\newline
AMoRSD &
Word embedding &
BERT, Word2Vec, Symptom2Vec &
\textbf{Symptom2Vec similarity}: 0.983\newline \textbf{AMoRSD AUC}: 0.99 \\ \hline

Takano \emph{et al.} \cite{etal.Takano}


& Interdisciplinary
& Unnamed dataset of\newline
pairs of motion symbols and descriptive sentences &
Feature Extraction\newline
Data Encoding\newline
Word Sequences\newline
&
Probabilistic Graphical Model\newline
GANs\newline
RNNs\newline
&
\textbf{Accuracy}:0.578 \\ \hline

Su \emph{et al.} \cite{etal.Su}


&
&NLPCC-ICCPOL 2016's KBQA eval task's Chinese KB &
Word Embedding\newline
&
Bi-LSTM-CRF\newline
GRU\newline
&
\textbf{Accuracy}: 99.14\% \\ \hline

Kumar \emph{et al.} \cite{etal.Kumar}


& Healthcare &
Harvard Medical School's N2C2 NLP dataset. &
Feature Selection\newline
Vectorization\newline
Lemmatization\newline
Data Augmentation\newline
Tokenization\newline
&
BiLSTM\newline
Word2Vec\newline
&
\textbf{F1:} 99.27\newline
Deviation reduced from 2.35 to 0.27 across 3--9 classifiers \\ \hline

Tan \emph{et al.} \cite{etal.Tan}


& Industrial
&
TC-QA dataset &
Tokenization\newline
Lemmatization and Stemming\newline
Part-of-speech tagging\newline
Named entity recognition (NER)\newline
Data Augmentation\newline
&
Hybrid DL methods and Symbolic reasoning
&\textbf{Text-score}:0.6193 \newline \textbf{Image-score}:0.5719 \newline \
textbf{Video-retrieval-score}:0.5875 \newline \textbf{Video-mIoU-score}:0.4557 \\ \
hline

Golech \emph{et al.} \cite{etal.Golech}


& Interdisciplinary &
MS COCO dataset &
Tokenization\newline
Data Augmentation\newline
Handling Contractions\newline
&
Meshed-Memory-Transformers\newline &
\textbf{Bleu-1}: 0.72. \\ \hline

Prottasha \emph{et al.} \cite{etal.Prottasha}& Emotion and Sentiment Analysis &


Self-collected from social media &
Feature Extraction\newline
Data Augmentation\newline
&
Bangla-BERT\newline
Hybrid of BERT and CNN-LSTM\newline
&
\textbf{Accuracy:} 94.15\% \\ \hline

Wang \emph{et al.} \cite{etal.Wang}


& Interdisciplinary&
CITIC Institute and JST parallel corpus &
POS\newline
CHUNK\newline
Word Embedding\newline
&
LSTM &
\textbf{BLEU:} 39.52. \\ \hline

Vashistha \emph{et al.} \cite{etal.Vashistha}


& Interdisciplinary
&
IIT Bombay Eng-Hin Parallel Corpus
&
Lowercasing\newline
Data Augmentation\newline
Tokenization\newline
Normalization\newline
&
Neural Machine Translation\newline
RNN &
\textbf{BLEU:} 24.54. \\ \hline

Budiharto \emph{et al.} \cite{etal.Budiharto}


& Industrial
&
Stanford Question Answering Dataset (SQuAD)\newline
100 dimensions of Global Vectors (GloVe)\newline
&
Stemming\newline
Tokenization\newline
Named Entities Disambiguation (NED)\newline
&
RNN\newline
CNN &
\textbf{F1}: 82.43\% \\ \hline

Martins \emph{et al.} \cite{etal.Martins}&& RoboCup's generator\newline


GPSR\newline
FBM3\newline
&
Tokenization\newline
Lowercasing\newline
Data Augmentation
&
RNNs\newline
LSTM&
GPSR-dataset:\newline
Action-detection Acc: 0.895 \newline
Slot-filling Acc: 0.873 \newline
FBM3-dataset: \newline
Action-detection Acc: 0.687 \newline
Slot-filling Acc: 0.637 \\ \hline

Khodadadi \emph{et al.} \cite{etal.Khodadadi}


& Industrial
& 10 years of customer service calls,\newline
100K recorded calls. &
Lemmatization\newline
Data Augmentation\newline
Feature Selection\newline
Word Embedding\newline
POS tagging\newline
Tokenization\newline
&
Bi-LSTM\newline
CNN &
\textbf{Accuracy}: 0.84 \newline
\textbf{Precision}: 0.92 \\ \hline

Dharaniya \emph{et al.} \cite{etal.Dharaniya}


& Entertainment
& Internet Movie Script Database (IMSDb) movie dataset &
Data augmentation\newline
Feature selection\newline
HTML tag removal &
DBN\newline
Bi-LSTM\newline
GPT3\newline
GPT Neo X models\newline
&
\textbf{BLEU:} 73.77\% \newline \textbf{CHRF:} 51.23\% \newline \textbf{GLEU:}
48.23\% \newline \textbf{METEOR:} 51.18\% \newline \textbf{NIST:} 52.81\% \newline
\textbf{ROUGE}: 64.60\% \\ \hline

Huang \emph{et al.} \cite{etal.Huang}


& Industrial
&
Command Analysis(CA) dataset\newline
Perspective Disambiguation(PD) dataset\newline
&
Word Embedding\newline
Tokenization\newline
Position embeddings\newline
Data Augmentation\newline
&
LD3PA
&
\textbf{CA task:} 0.996\newline
\textbf{PD task:} 0.995 \\ \hline

Zhang \emph{et al.} \cite{etal.Zhang}


& Industrial&
CoNLL-2005 \newline
CoNLL-2012 &
Syntax Reduction\newline
Feature Selection\newline
Labelling\newline
Statistical Learning\newline
Action Sequence Transformation\newline
&
Bi-LSTM &
Action sequence correctness (ASC) of 0.865 \\ \hline


\end{longtable}

% \section{Pre-processing Methods}
% \label{sec:Pre-processing}
% Pre-processing is vital in deep learning-based natural language processing for
human-robot interaction, converting raw text into machine-readable formats using
techniques like data cleansing and tokenization. These methods, akin to sentiment
analysis, refine input data and enhance model performance. Rigorous pre-processing
ensures data integrity, contributing significantly to scientific advancements in
optimizing NLP applications in HRI. Table~\ref{tab:preprocessing} below
elucidates prevalent pre-processing methods in this field.
\begin{table}[H]
\begin{adjustwidth}{-4cm}{-4cm} % Remove adjustwidth if the table fits in the text column.
\centering
\caption{{\bf Commonly used pre-processing methods in sentiment analysis and related NLP tasks.}}
\begin{tabular}{p{3cm}p{12cm}p{5cm}}
\hline
\textbf{Pre-Processing Method} & \textbf{Description} & \textbf{References} \\
\hline
Image crop & Removing a specific part of an image, typically to focus on a region of interest or to resize it for further analysis or display. & \cite{zhang2023tcpcnet},\cite{kushol2023effects},\cite{fanjie2023sust} \\ \hline
Tokenization & Breaking a text down into smaller linguistic units such as words, phrases, or symbols to facilitate further analysis. & \cite{halawani2023automated},\cite{baghaei2023deep},\cite{aldunate2022understanding},\cite{zhou2023generating},\cite{pandey2022mental} \\ \hline
Padding & Used in sequence-based models such as recurrent neural networks (RNNs) and transformers to make all input sequences the same length, allowing efficient batch processing. & \cite{ashraf2023bert},\cite{inamdar2023machine},\cite{jianan2023deep},\cite{karasoy2022spam} \\ \hline
Lowercasing & Standardizes text to reduce vocabulary complexity and prevents the model from treating words with different capitalization as different entities. & \cite{zaheer2023multi},\cite{duong2023deep},\cite{alshahrani2023applied} \\ \hline
Lemmatization \& stemming & Reduce words to their base form to deal with word variations, shrinking the vocabulary and improving the generalization capability of the model. & \cite{budiharto2021novel},\cite{ayanouz2020smart},\cite{guazzo2023deep},\cite{wang2023deepsa} \\ \hline
Normalization & Converts text to a standard format by removing accents, special characters, or diacritics to enable a consistent, uniform representation. & \cite{abdalla2023sentiment},\cite{matti2023autokeras},\cite{kumar2022classification} \\ \hline
Noise removal & Improves text quality by removing irrelevant information such as special characters, symbols, or words that do not contribute to the overall meaning. & \cite{amaar2022detection},\cite{kheraleveraging},\cite{merdivan2019dialogue} \\ \hline
Feature extraction & Captures the most important information in the text and creates meaningful representations the model can use to learn patterns and make predictions. & \cite{johnston2023ns},\cite{yohanes2023emotion},\cite{10343159} \\ \hline
Word embedding & Represents words as dense vectors in a multidimensional space, preserving semantic relationships between words and improving the model's grasp of context and meaning. & \cite{wan2023text},\cite{chang2023changes},\cite{wang2023generating} \\ \hline
Stop-word removal & Removes frequently occurring words (e.g., ``and'', ``the'', ``is'') that carry little information, reducing noise and increasing processing speed. & \cite{balouch2023transformer},\cite{mithun2023development},\cite{das2022deep},\cite{nijhawan2022stress} \\ \hline
Removal of special characters \& numbers & Cleans the text by removing non-alphabetic characters and numeric digits so that the data focuses on the textual information relevant to the analysis. & \cite{olthof2021deep},\cite{nictoi2023unveiling},\cite{das2023sentiment} \\ \hline
Vectorization & Converts text data into numeric vectors, making it suitable for machine learning models that require numeric input. & \cite{gupta2023detecting},\cite{xavier2022natural},\cite{marulli2021exploring} \\ \hline
POS tagging (part-of-speech tagging) & Assigns grammatical tags to the words in a sentence so the model understands the role of each word and its relationship to the other words. & \cite{eppe2016exploiting},\cite{pandy2023extracting},\cite{villa2023extracting} \\ \hline
Handling contractions & Expands contractions to their full form to standardize the text and avoid ambiguity, especially where a contraction may carry a different meaning. & \cite{chai2023twitter},\cite{mahimaidoss2023emotion} \\ \hline
Named entity recognition (NER) & Identifies and classifies named entities in text, such as names, places, dates, and numeric values, so the model can recognize specific entities and their context. & \cite{ahmed2023fine},\cite{jang2022exploration} \\ \hline
Punctuation removal & Simplifies the text and ensures that the model focuses on its context, improving the accuracy of analysis and predictions. & \cite{agarwal2023deepgram},\cite{motyka2023information},\cite{ashfaque2023design} \\ \hline
\end{tabular}
\label{tab:preprocessing}
\end{adjustwidth}
\end{table}
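As an illustration, the short Python sketch below (a minimal example using the open-source NLTK library; the \texttt{preprocess} helper and the sample sentence are ours, not drawn from the surveyed works) chains several of the steps from Table~\ref{tab:preprocessing}: tokenization, lowercasing, punctuation removal, stop-word removal, and lemmatization.

\begin{verbatim}
# Minimal pre-processing sketch with NLTK (assumes the 'punkt',
# 'stopwords' and 'wordnet' resources were downloaded once via
# nltk.download(...)).
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(text):
    tokens = word_tokenize(text.lower())               # tokenize + lowercase
    tokens = [t for t in tokens if t.isalpha()]        # drop punctuation/digits
    stop_set = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_set]  # stop-word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization

print(preprocess("The robots are listening to the users' spoken commands."))
# e.g. ['robot', 'listening', 'user', 'spoken', 'command']
\end{verbatim}

In practice the choice and order of these steps depend on the downstream task; aggressive filtering can, for example, remove negation cues that matter for sentiment analysis.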

\section{Challenges and Future Directions}
\label{sec:Challenges}
The challenges in deep learning-based natural language processing (NLP) for human-robot interaction include contextual ambiguity, real-time processing, and emotional intelligence. Future research aims to improve the interpretability of models, better capture human intentions and emotions, and investigate multimodal inputs for richer communication. Adapting NLP models to different cultural and linguistic contexts, ensuring transparency, and addressing ethical considerations are further important items on the research agenda.
\subsection{Insufficiency of data}
Insufficient data is a recurring problem for deep learning and natural language processing (NLP). In medical imaging, for example, rare diseases are hard to recognize because few labeled examples of them exist. In language tasks such as emotion recognition, a model cannot reliably identify feelings that are under-represented in its training data. Machine translation likewise struggles for languages with little available text. Sparse or skewed data can also produce unfair outcomes, such as facial recognition software that performs poorly for some users because it was not trained on a sufficiently diverse set of faces. Addressing these data issues is crucial to improving the overall usability of such systems.

\subsection{Lack of computational efficiency}
The computational inefficiency of deep learning, caused by complex architectures and large datasets, hinders wide application of the technology because of high costs and limited accessibility. Training large neural networks requires extensive computing resources and consumes considerable energy. Real-time applications such as autonomous vehicles or live video analysis must meet strict low-latency requirements, and deploying resource-intensive models on end devices with limited computing power raises further efficiency problems. A balance between model complexity and computational efficiency is therefore crucial for widespread adoption. This is evident in applications such as health diagnostics, where fast and accurate predictions are essential but resource constraints hinder seamless deployment. Future work should investigate the computational efficiency of such models and optimize their algorithms accordingly; one common optimization is sketched below.
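As a concrete illustration (our own example, not a method from the surveyed papers), the following Python sketch applies PyTorch's post-training dynamic quantization, a common way to trade a small amount of accuracy for lower memory use and faster CPU inference; the toy network stands in for any trained model.

\begin{verbatim}
# Sketch: post-training dynamic quantization in PyTorch; 'model' is a
# stand-in for any trained torch.nn.Module containing Linear layers.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # int8 weights for Linear
)
print(quantized)  # Linear layers replaced by DynamicQuantizedLinear
\end{verbatim}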

\subsection{Orientation constraints and identification limitations}
Orientation constraints and identification limitations are recurring problems in deep learning, especially in tasks such as object recognition and action classification. When inputs appear in unfamiliar orientations or forms, model accuracy drops. Remedies include data augmentation, architectures suited to the modality such as CNNs or RNNs, and transfer learning, all of which improve the robustness and generalization capabilities of the model. Identification limitations refer to the difficulty of correctly identifying concepts or entities, especially in ambiguous situations; an NLP model, for example, may struggle to determine the intended sense of a word with many meanings, leading to errors in tasks such as word-sense disambiguation.

In future work, we will apply data augmentation to increase dataset diversity; use attention mechanisms, ensemble techniques, and transfer learning to improve performance; and involve end users in the design process, with feedback loops that enable continuous improvement. A minimal augmentation sketch is given below.
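The following Python sketch uses the torchvision library to randomize orientation and scale at training time; the specific transforms and parameter values are illustrative assumptions rather than a surveyed recipe.

\begin{verbatim}
# Sketch: training-time image augmentation with torchvision, one common
# remedy for orientation sensitivity in object recognition.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=30),                # random orientation
    transforms.RandomHorizontalFlip(p=0.5),               # mirror images
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # scale jitter
    transforms.ToTensor(),
])
# Pass train_transform to an image dataset, e.g.
# torchvision.datasets.ImageFolder(root, transform=train_transform)
\end{verbatim}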

\subsection{Challenges in the production of animation}
Producing animation with deep learning raises several challenges, including minimizing orientation and identification constraints and integrating AI-driven workflows to increase productivity. Using deep learning and language processing to generate animated characters that look and behave truly lifelike is difficult, as is maintaining a natural flow of narrative and dialogue, since models may fail to capture intricate details and complex actions. A further concern is that a model may unintentionally produce offensive or biased characters or conversations, so the technology must be used responsibly. Creating virtual assistants with realistic-looking faces illustrates the problem: the system occasionally says offensive or nonsensical things. Balancing the model's creativity against guarantees of appropriate behavior is a central problem in developing such software. Overcoming these obstacles is necessary for accurate, seamless animation and a more efficient, productive production pipeline.
\subsection{Data imbalance and lack of diversity}
A central challenge is the imbalance of the data and the lack of diversity in the interaction situations within a dataset, which can lead a model to focus on merely detecting the presence of a friend rather than the complexity of the different interactions. The remedy is to expand the dataset with more diverse interaction situations, allowing the model to generalize better, distinguish between different interaction contexts, and capture the nuances of human-robot collaboration. In future work, we will curate a more diverse dataset that incorporates varied interaction scenarios and is continuously refined against real-world data, so that the model learns to recognize and understand a wider range of human-robot interactions, improving its performance and applicability. Re-weighting under-represented classes during training, as sketched below, is a complementary remedy.
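The sketch below (illustrative only; \texttt{y\_train} is a hypothetical label array) uses scikit-learn to compute balanced class weights, which most classifiers and loss functions accept.

\begin{verbatim}
# Sketch: per-class weights so a model penalizes errors on
# under-represented interaction classes more heavily.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0, 0, 0, 0, 1, 1, 2])  # toy, imbalanced labels
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train)
print(dict(zip(classes, weights)))  # rarer classes get larger weights
\end{verbatim}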

\subsection{Lack of Interpretability in Deep Learning Models}
The lack of interpretability of complex deep learning models is an obstacle to understanding their decision-making processes. This opacity becomes particularly problematic in sensitive areas such as clinical decision-making. Explainable AI techniques are therefore crucial: methods such as attention mechanisms or layer-wise relevance propagation provide insight into a model's internal workings and offer explanations for individual decisions. Integrating these techniques increases transparency, interpretability, and confidence in the model's results, which is especially important in applications where comprehensibility and accountability are critical, so that end users can trust the model's predictions and make informed decisions. By fostering collaboration in the AI community, these interpretability challenges can be addressed collectively. A small illustration of attention-weight inspection follows.
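The Python sketch below inspects the attention weights of a pretrained transformer via the Hugging Face \texttt{transformers} library; the model choice and input sentence are arbitrary assumptions, and attention maps are an aid to inspection rather than a complete explanation.

\begin{verbatim}
# Sketch: extracting per-layer attention weights from a pretrained
# transformer as one window into model behaviour.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)
inputs = tokenizer("Please hand me the red cup.", return_tensors="pt")
outputs = model(**inputs)
# One tensor per layer, each of shape (batch, heads, seq_len, seq_len):
print(len(outputs.attentions), outputs.attentions[0].shape)
\end{verbatim}

Attribution methods such as layer-wise relevance propagation require dedicated implementations and are typically applied on top of this kind of model access.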

\subsection{Handling Unstructured Data}
Dealing with unstructured data is delicate because of the large volume, variety, and dynamics of the text published online. Machine learning techniques, such as the deep learning-based models discussed above, help to overcome this challenge: they have shown strong performance in sentiment analysis and can effectively process large amounts of text. Using natural language processing (NLP) together with long short-term memory (LSTM) networks, such a model can process and analyze unstructured data from diverse sources, including in-app messages, social media websites, e-commerce websites, and news publishing websites.

Our future work will extend the hybrid feature-extraction approach used in the model to improve its ability to extract relevant information from unstructured data. A minimal sketch of such an LSTM classifier is given below.
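For concreteness, the following Keras sketch shows the general shape of an NLP-plus-LSTM sentiment classifier; the vocabulary size, layer widths, and the commented-out \texttt{padded\_sequences} input are illustrative placeholders, not the configuration of the model discussed above.

\begin{verbatim}
# Minimal sketch of an LSTM text classifier in Keras.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN = 20000, 100  # assumed tokenizer settings
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),      # token embeddings
    tf.keras.layers.LSTM(64),                        # sequence encoder
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
# model.fit(padded_sequences, labels, epochs=3)  # with tokenized data
\end{verbatim}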
\subsection{Data Augmentation for Low-Resource Languages}
Data augmentation is crucial for improving the performance of natural language processing (NLP) models in low-resource languages. With limited labeled data available, augmentation techniques artificially expand the dataset, creating greater diversity for better model generalization. Paraphrasing, back-translation, and synonym replacement are common strategies: paraphrasing creates alternative formulations of existing sentences, back-translation translates sentences into another language and back, and synonym replacement substitutes words with synonyms to create variability. These techniques use unsupervised or weakly supervised methods to generate additional training instances. By incorporating the additional data, models are exposed to a wider range of language patterns, improving their adaptability to the nuances of low-resource languages and ultimately increasing overall performance. A small synonym-replacement sketch follows.
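The Python sketch below performs simple synonym replacement with WordNet via NLTK; the \texttt{synonym\_replace} helper and the replacement probability are our own assumptions, and for genuinely low-resource languages a suitable lexical resource may itself be scarce.

\begin{verbatim}
# Sketch: synonym-replacement augmentation with WordNet
# (requires nltk.download('wordnet') once).
import random
from nltk.corpus import wordnet

def synonym_replace(tokens, p=0.3):
    out = []
    for tok in tokens:
        syns = {l.name().replace("_", " ")
                for s in wordnet.synsets(tok) for l in s.lemmas()}
        syns.discard(tok)                       # never "replace" with itself
        out.append(random.choice(sorted(syns))
                   if syns and random.random() < p else tok)
    return out

print(synonym_replace(["the", "robot", "moves", "quickly"]))
# e.g. ['the', 'automaton', 'moves', 'rapidly']
\end{verbatim}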



%% main text

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}
\label{sec:Conclusion}
\label{end}
In summary, the integration of deep learning methods with natural language processing (NLP) in the context of human-robot interaction (HRI) represents a transformative paradigm in this rapidly evolving field. This study navigates the complicated dynamics of HRI and emphasizes the central role that deep learning plays in shaping communication between humans and robots. Going beyond conventional sentiment analysis, our investigation encompasses a spectrum of HRI facets including dialog systems, language understanding, and contextual communication. The study systematically scrutinizes the applications, algorithms, and models that make up the current landscape of deep learning-based NLP in HRI, and provides valuable insights into common pre-processing techniques, datasets, and evaluation metrics. By revealing the benefits and challenges that machine learning and deep learning algorithms bring to HRI, and by giving a comprehensive overview of current state-of-the-art experiments, this review serves not only as a navigation aid for the field but also as a catalyst for future advances. The concluding discussion of specific challenges in HRI sets the stage for future research and ensures a nuanced understanding of the models, applications, challenges, and trajectory of deep learning-based NLP research in human-robot interaction.

\section*{Author contribution statement}


%% \label{}

\section*{Ethical approval, consent to participate, and consent for publication statement}
Not applicable.

\section*{Data availability statement}


The datasets for the current study are taken from publicly available sources
[28,29,30,31].
\section*{Declaration of interests statement}

The authors declare that they have no competing interests in this work and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
%% If you have bibdatabase file and want bibtex to generate the
%% bibitems, please use
%%
\bibliographystyle{elsarticle-num}
\bibliography{ref}

%% else use the following coding to input the bibitems directly in the
%% TeX file.

% \begin{thebibliography}{00}

% %% \bibitem[Author(year)]{label}
% %% Text of bibliographic item

% \bibitem[ ()]{}

% \end{thebibliography}
\end{document}

\endinput
%%
%% End of file `elsarticle-template-harv.tex'.
