%\documentclass[preprint,12pt]{elsarticle}
\usepackage{changepage}
\usepackage{lipsum}
\usepackage{lscape}
%% The lineno packages adds line numbers. Start line numbering with
%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
%% for the whole article with \linenumbers.
%% \usepackage{lineno}
\journal{Heliyon}
\begin{document}
\begin{frontmatter}
\author[a]{}
\ead{}
\author[b]{}
\ead{}
\author[c]{}
\ead{}
\author[c]{}
\ead{}
\begin{abstract}
Human-robot interaction (HRI) is at the forefront of rapid development, with the
integration of deep learning techniques into natural language processing (NLP)
representing significant potential. This research addresses the complicated
dynamics of HRI and highlights the central role of deep learning in shaping
communication between humans and robots. In contrast to a narrow focus on sentiment
analysis, the study encompasses various HRI facets, including dialogue systems,
language understanding and contextual communication. The study systematically
examines applications, algorithms and models that define the current landscape of
deep learning-based NLP in HRI. It also presents common pre-processing techniques,
datasets and customized evaluation metrics. Insights into the benefits and
challenges of machine learning and deep learning algorithms in HRI are provided,
complemented by a comprehensive overview of the current state of the art. The
manuscript concludes with a comprehensive discussion of specific HRI challenges and
suggests thoughtful research directions. This work aims to provide a balanced
understanding of models, applications, challenges and research directions in the
field of deep learning-based NLP in human-robot interaction, with a focus on recent
contributions to the field.
\end{abstract}
%%Graphical abstract
%\begin{graphicalabstract}
%\includegraphics{grabs}
%\end{graphicalabstract}
\iffalse
%%Research highlights
\begin{highlights}
\item Research highlight 1
\item Research highlight 2
\end{highlights}
\fi
\begin{keyword}
%% keywords here, in the form: keyword \sep keyword
Human-Robot Interaction \sep Deep Learning \sep Natural Language Processing
%% PACS codes here, in the form: \PACS code \sep code
\end{keyword}
\end{frontmatter}
%% \linenumbers
\section{Introduction}
Human-robot interaction (HRI) has recently gained significant popularity and become a
trending topic, reflecting a growing fascination with the dynamics between humans and
robots and the increasing importance of deep learning and natural language processing
(NLP) in the design of these interactions. This trend is reinforced by the recognition
that the combination of deep learning and NLP plays a central role in improving the
quality and nuance of human-robot interaction. The convergence of deep learning-based
NLP and HRI represents a transformative intersection where advances in language
understanding and processing contribute significantly to the seamless integration of
robots into various aspects of human life. As we delve into this fascinating
intersection, it becomes clear that the evolving technology landscape promises
exciting developments that are yet to unfold.
The main contributions of this study are summarized below:
\begin{itemize}
\item The study offers an in-depth exploration that provides valuable insights
into methods of data preprocessing to optimize model training. It also highlights
commonly used datasets that are important for research and benchmarking in the
context of DL-based NLP in human-robot interaction.
\item A thorough analysis of recent advances and contributions by researchers,
highlighting the latest experimental results shaping the evolving landscape of deep
learning-based natural language processing (NLP) in human-robot interaction (HRI).
\item A comprehensive discussion of the challenges in the field of deep
learning-based NLP in HRI, identifying future research opportunities to overcome
these obstacles.
\end{itemize}
Throughout the rest of this paper, key sections are systematically navigated, each of
which contributes to a comprehensive understanding of the study. Section
\ref{sec:Methodology}, Methodology, explains the systematic approach used to
investigate the interplay between deep learning-based natural language processing
(NLP) and human-robot interaction (HRI). Section \ref{sec:Dataset}, Datasets,
describes in detail the key data sources for the subsequent analyses. Section
\ref{sec:Models}, Models, provides a detailed insight into the various deep learning
models used to improve natural language understanding in the context of HRI. Section
\ref{sec:Application}, Application, examines the practical implementation of these
models in various real-world scenarios. Section \ref{sec:Results_analysis}, Results
and Analysis, rigorously breaks down the results and provides nuanced insights into
the findings of the study. The preprocessing methods are described in detail in
Section \ref{sec:Pre-processing}. Section \ref{sec:Challenges}, Challenges,
critically discusses challenges and advances in the field to ensure the study is up
to date. Finally, Section \ref{sec:Conclusion}, Conclusion, captures the essence of
the study by summarizing the main findings, reflecting on their implications, and
suggesting possible avenues for future research in the field of deep learning-based
NLP in human-robot interaction.
% \subsubsection{1.1}
% In the modern era of technological advancement, dynamic interaction between
humans and robots, known as human-robot interaction (MRT), has moved from the realm
of science fiction into mainstream daily life. This fascinating interaction
involves studying how humans communicate and interact with robots. Originally
concerned with simple control interfaces, HRI has undergone a remarkable evolution
driven by the advent of Deep Learning - a subset of machine learning that employs
complex neural networks. In particular, in the captivating field of natural
language processing (NLP), a subfield of artificial intelligence, the journey of
seamless interpretation, understanding, and generation of human language by
machines is upon this field. The convergence of Deep Learning and NLP in the field
of HRI is like an artistic symphony orchestrating intricate patterns and
relationships within the complex fabric of human communication.
% \subsubsection{1.2}
% This technological fusion is not only pushing the boundaries of human-robot
interaction, but also creating a new paradigm - a world where robots not only
understand the nuances of human speech but also anticipate and respond to our
intentions with uncanny familiarity. Imagine virtual assistants that not only
understand words spoken by a person but also recognize his or her underlying
emotions, paving the way for unprecedented personalized technological interactions.
As the world makes progress through this ever-evolving landscape, examples of
monumental achievements such as BERT \cite{n1BERT} and GPT-3 \cite{n1GPT} come to
mind, transforming machines into language virtuosos capable of coherent dialogues,
accurately translating languages, and even rendering human emotions. This synergy
of advanced technologies is changing the landscape of HRI, leading to a future
where robots are not just tools, but true companions to our human experience.
% \subsubsection{1.3}
% The evolution of human-robot interaction (HRI) through the lens of Deep Learning
is closely intertwined with the contributions of influential models. Convolutional
neural networks (CNNs) \cite{n1CNN}, developed in the 1990s by LeCun et al.
redefined image analysis for robots by enabling the extraction of complicated
patterns and advancing tasks such as object recognition. Recurrent neural networks
(RNNs)\cite{n1RNN}, dating back to Elman's 1990 work and advanced by Hochreiter and
Schmidhuber in 1997 with the introduction of long short-term memory (LSTM) cells,
revolutionized sequential data analysis and enabled robots to understand human
gestures and speech nuances. The landscape changed further with the advent of
Transformer models, e.g., the introduction of Bidirectional Encoder Representations
from Transformers (BERT) \cite{n1BERT} by Devlin et al. in 2018, which captured
contextual linguistic subtleties. This was followed in 2020 by OpenAI's Generative
Pre-trained Transformer 3 (GPT-3) \cite{n1GPT}, a milestone in language generation
that enables machines to produce coherent, contextual text. The historical
development of these models is intertwined with the growth of Deep Learning, which
is reshaping the evolution of HRI and setting the stage for future innovations that
will improve human-robot interaction through the convergence of Deep Learning
methods.
% \subsubsection{}
% The need to explore the field of human-robot interaction (HRI) in the context of
Deep Learning stems from its transformative potential for numerous real-world
applications. As robots increasingly become an integral part of our lives,
improving their ability to seamlessly interact with humans is critical. Improved
HRI facilitates collaborative tasks in industrial automation, where robots work
with human workers to optimize efficiency and safety. In healthcare, robots can
support patient care, accompany the elderly, or assist people with limited
mobility. In addition, personalized virtual assistants based on Deep Learning
provide tailored answers and recommendations, revolutionizing customer service and
the use of technology. These examples show that the advancement of HRI through Deep
Learning is not only redefining the human-robot dynamic, but also holds the key to
unprecedented efficiency, accessibility, and convenience in various sectors of our
society.
% \subsubsection{1.5}
% This review examines the evolution of human-robot interaction (HRI) in the
context of Deep Learning and demonstrates its transformative potential in various
real-world applications. Human-robot interaction, known as HRI, has evolved from
simple control interfaces to complex interactions driven by advances in Deep
Learning, particularly in the area of natural language processing (NLP). The
convergence of Deep Learning and NLP is transforming HRI by enabling robots to not
only understand human language, but also anticipate intentions and respond with
deep understanding. This fusion has produced virtual assistants capable of
understanding emotions and paving the way for personalized interactions. Key
influential models such as Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), and Transformer models such as BERT and GPT-3 are highlighted as
milestones that have shaped the development of HRI. The transformative potential of
Deep Learning-powered HRI is evident in applications ranging from industrial
automation to healthcare to personalized customer service. As robots become an
integral part of our lives, this review paper highlights the role of Deep Learning
in redefining the human-robot dynamic and shaping a future where robots serve as
efficient and accessible companions in various aspects of society.
% \subsubsection{1.5}
\section{Methodology}
\label{sec:Methodology}
\begin{center}
\begin{tabular}{|c|c|}
\hline
Inclusion Criteria & Exclusion Criteria \\
\hline
IC1: Research articles written in English. & EC1: Duplicate articles. \\
\hline
IC2: Articles published in 2022 and 2023. & EC2: Articles not related to the theme of
the review. \\
\hline
IC3: Articles published in scholarly journals (with a few exceptional conference
papers). & EC3: Articles lacking sufficient information, and review papers. \\
\hline
\end{tabular}
\end{center}
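As an illustrative sketch only (not the authors' actual screening tooling), the inclusion/exclusion criteria in the table above can be expressed as a simple filter over article records; the record field names (\texttt{title}, \texttt{language}, \texttt{year}, \texttt{venue}) are assumed for illustration:

```python
def screen(articles):
    """Apply IC1-IC3 and EC1 from the criteria table: keep English journal
    articles from 2022-2023 and drop duplicate titles."""
    seen_titles = set()
    selected = []
    for art in articles:
        if art["language"] != "English":      # IC1: English-language articles
            continue
        if art["year"] not in (2022, 2023):   # IC2: published in 2022 or 2023
            continue
        if art["venue"] != "journal":         # IC3: scholarly journals
            continue                          # (exceptional conference papers aside)
        key = art["title"].strip().lower()
        if key in seen_titles:                # EC1: duplicate articles
            continue
        seen_titles.add(key)
        selected.append(art)
    return selected
```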
\subsection{Executing the Review}
This phase focuses on extracting the key information from the selected articles. The
five sub-phases listed below ensure an organized and systematic literature review.
\subsubsection{Topical relationship}
We selected the most pertinent original research papers in this field, with an
emphasis on publications on recognizing human activities.
\subsubsection{Goals and results}
\textcolor{red}{Sections 3, 7, and 8 cover the primary objectives, contributions,
experimental findings, and limitations of several relevant publications.}
\subsubsection{Evaluation metrics, Datasets, Pre-Processing methods, and
Algorithms}
\textcolor{red}{Sections 4 to 7 provide all evaluation measures, datasets,
preprocessing techniques, and algorithms utilized in the HRI studies.}
\subsubsection{Research type}
This sub-phase records the publication type, such as scholarly journal,
conference/workshop proceedings, book chapter, or thesis. For a thorough, systematic
evaluation, our study mostly selected scholarly journal articles.
\subsubsection{Publication year and type}
\textcolor{red}{We collected 176 publications for this study's initial pool, all
published in 2022 and 2023. After careful inspection, 83 articles focusing on
applications and breakthroughs in HRI were ultimately chosen for the review. Notably,
more than 60\% of these papers were published in 2023. Our conscious decision to
consider only current papers underscores our commitment to offering an expert and
cutting-edge overview of HRI.}
\subsection{Outcome}
Finally, the collected data was analyzed, addressing current issues and challenges
while suggesting new directions for future study.
\section{Dataset}
\label{sec:Dataset}
The need for datasets in any given field cannot be overstated. Table
\ref{tab:dataset} lists the most important datasets used in the field of deep
learning-based natural language processing for human-robot interaction, underlining
their paramount importance in driving research and innovation.
\begin{table}[H]
\begin{adjustwidth}{-4cm}{-4cm} % Comment out/remove adjustwidth environment if
table fits in text column.
\centering
\caption{
{\bf Datasets}}
\begin{tabular}{p{.5cm}p{1.5cm}p{2cm}p{0.8cm}p{2cm}p{8cm}}
\hline
\end{tabular}
\label{tab:dataset}
\end{adjustwidth}
\end{table}
\section{Pre-processing Methods}
\label{sec:Pre-processing}
Pre-processing is vital in deep learning-based natural language processing for
human-robot interaction, converting raw text into machine-readable formats using
techniques such as data cleansing and tokenization. As in sentiment analysis, these
methods refine the input data and enhance model performance. Rigorous pre-processing
ensures data integrity and contributes significantly to optimizing NLP applications
in HRI. Table \ref{tab:preprocessing} below elucidates the prevalent pre-processing
methods in this field.
\begin{table}[H]%[htb]%
\begin{adjustwidth}{-4cm}{-4cm} % Comment out/remove adjustwidth environment if
table fits in text column.
\centering
\caption{
{\bf Commonly used pre-processing methods in deep learning-based NLP for human-robot
interaction.}}
% \begin{tabular}{p{2.5cm}p{14cm}}
\begin{tabular}{p{3cm}p{12cm}p{5cm}}
\hline
Method & Description & Studies \\
\hline
Image Crop & Image cropping is a pre-processing technique that involves removing a
specific part of an image, typically to focus on a region of interest or to resize
it for further analysis or display.
&\cite{zhang2023tcpcnet},\cite{kushol2023effects},\cite{fanjie2023sust}
\\ \hline
Tokenization & Breaking down a text into smaller linguistic units such as words,
phrases, or symbols to facilitate further analysis. &\cite{halawani2023automated},\
cite{baghaei2023deep},\cite{aldunate2022understanding},\cite{zhou2023generating},\
cite{pandey2022mental}\\ \hline
Padding & Commonly used in sequence-based models such as recurrent neural networks
(RNNs) and transformers to ensure that all input sequences are the same length,
allowing for efficient batch processing.
&\cite{ashraf2023bert},\cite{inamdar2023machine},\cite{jianan2023deep},\
cite{karasoy2022spam}
\\ \hline
Lowercasing & Helpful for standardizing text to reduce vocabulary complexity and
prevent the model from treating words with different capitalization as different
entities.
&\cite{zaheer2023multi},\cite{duong2023deep},\cite{alshahrani2023applied}
\\ \hline
Lemmatization \newline
\& \newline
Stemming & Techniques for reducing words to their basic form in order to deal with
word variations and thus reduce the size of the vocabulary and improve the
generalization capability of the model. &\cite{budiharto2021novel},\
cite{ayanouz2020smart},\cite{guazzo2023deep},\cite{wang2023deepsa} \\ \hline
Noise removal & An important step in improving the quality of text data by removing
irrelevant information such as special characters, symbols, or irrelevant words
that do not contribute to the overall meaning.
&\cite{amaar2022detection},\cite{kheraleveraging},\cite{merdivan2019dialogue}
\\ \hline
Feature-Extraction & Essential for capturing the most important information from
the text data and creating meaningful representations that can be effectively used
by the model to learn patterns and make predictions. &\cite{johnston2023ns},\
cite{yohanes2023emotion},\cite{10343159}
\\ \hline
Stop word removal & Removes frequently occurring words (e.g., 'and', 'the', 'is')
that do not contain important information and are often ignored during analysis to
reduce noise and increase processing speed. &\cite{balouch2023transformer},\
cite{mithun2023development},\cite{das2022deep},\cite{nijhawan2022stress}
\\ \hline
Removal of special \newline characters \& numbers & Helps clean up the text by
removing non-alphabetic characters and numeric digits to ensure that the data
focuses on the textual information relevant to the analysis. &\
cite{olthof2021deep},\cite{nictoi2023unveiling},\cite{das2023sentiment},\
cite{olthof2021deep}
\\ \hline
Vectorization & The process of converting text data into numeric vectors, making it
suitable for various machine learning models that require numeric input for
processing. &\cite{gupta2023detecting},\cite{xavier2022natural},\
cite{marulli2021exploring}
\\ \hline
Handling Contractions & Expanding contractions to their full form helps standardize
text data and avoid ambiguity, especially in cases where the contraction may have a
different meaning. &\cite{chai2023twitter},\cite{mahimaidoss2023emotion}
\\ \hline
Named Entity Recognition (NER) & A process that identifies and classifies named
entities in text, such as names, places, dates, and numeric values, so that the
model can recognize specific entities and their context. &\cite{ahmed2023fine},\
cite{jang2022exploration}
\\ \hline
Punctuation removal & Removing punctuation from text data helps to simplify the
text and ensures that the model focuses on the context of the text, improving the
accuracy of the analysis and predictions. &\cite{agarwal2023deepgram},\
cite{motyka2023information},\cite{ashfaque2023design}
\\ \hline
\end{tabular}
\label{tab:preprocessing}
\end{adjustwidth}
\end{table}
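Several of the methods in Table \ref{tab:preprocessing} (lowercasing, tokenization, removal of special characters and numbers, stop word removal, and vectorization) can be chained into one pipeline. The following is a minimal standard-library sketch, not a production NLP pipeline; the small stop-word list is illustrative only:

```python
import re
from collections import Counter

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to"}

def preprocess(text):
    """Lowercase, tokenize, strip special characters/numbers, remove stop words."""
    text = text.lower()                                # lowercasing
    tokens = re.findall(r"[a-z]+", text)               # tokenization; drops digits and punctuation
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

def vectorize(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary (simple vectorization)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

tokens = preprocess("The robot is understanding the user's request!")
vocab = sorted(set(tokens))
vec = vectorize(tokens, vocab)
```

In practice the count vectors produced this way (or TF-IDF / embedding variants of them) are what the deep learning models in the next section consume as numeric input.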
\section{Models}
\label{sec:Models}
In the field of human-robot interaction using deep learning-based natural language
processing, this section serves as a guide for navigating the complicated landscape
of deep learning and NLP, especially for beginners, and facilitates informed model
selection. Given the paramount importance of deep learning-based NLP in human-robot
interaction, the comprehensive overview of key models in Table \ref{tab:models}
contributes to a holistic understanding of the diverse applications in this
specialized field.
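As a concrete illustration of the recurrent models surveyed in Table \ref{tab:models}, one step of an Elman/Simple-RNN (the new state is a nonlinearity applied to a weighted combination of the current input and the previous state) can be sketched in pure Python; the weights and dimensions below are illustrative, and a column-vector convention is used:

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def srnn_step(s_prev, x, Wx, Ws, b):
    """One Simple-RNN step: s_i = tanh(Wx x_i + Ws s_{i-1} + b); the output y_i
    equals the hidden state s_i."""
    pre = [xi + si + bi for xi, si, bi in
           zip(matvec(Wx, x), matvec(Ws, s_prev), b)]
    return [math.tanh(p) for p in pre]
```

Iterating `srnn_step` over a token sequence carries the hidden state forward; the vanishing-gradient limitation noted in the table arises because repeated `tanh` squashing shrinks gradients over long sequences, which is exactly what LSTM gating mitigates.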
\newgeometry{
left=0.4in,
right=0.3in,
top=0.4in,
bottom=0.4in,
}
\begin{landscape}
\begin{longtable}{p{2cm}p{8.5cm}p{6.5cm}p{6.5cm}p{2cm}}
\caption{The table analyzes the commonly used deep learning algorithms in NLP-based
human-robot interaction.}\\
\hline
\label{tab:models}
Algorithm & Description & Advantages & Limitations & Studies \\
\hline
\endfirsthead
\caption*{Table continued from previous page}\\
\hline
Algorithm & Description & Advantages & Limitations & Studies \\
\hline
\endhead
\hline
\multicolumn{5}{r}{\textit{Continued on next page}}
\endfoot
\hline
\endlastfoot
\hline
Hierarchical Attention Network (HAN) & HAN applies attention at both the word and the
sentence level, building document representations hierarchically, which makes it
effective for document-level text classification tasks. &
\begin{itemize}
\item \textbf{Interpretable Attention:} Attention weights indicate which words and
sentences drive a prediction.
\item \textbf{Hierarchical Structure:} Mirrors the word-sentence-document structure
of text.
\end{itemize}
&
\begin{itemize}
\item \textbf{Segmentation Dependence:} Requires reliable sentence segmentation.
\item \textbf{Computational Cost:} Two attention levels increase training cost.
\end{itemize}
& \cite{algorithem.HAN.one},\cite{algorithem.HAN.two} \\
\hline
Bidirectional Encoder Representations from Transformers (BERT) &
BERT is a natural language processing algorithm that employs bidirectional context
comprehension through pre-training on extensive unlabeled text data, enabling
nuanced contextual understanding and substantial performance improvements across
diverse downstream NLP tasks. &
\begin{itemize}
\item \textbf{Precision in Context:} Excellent understanding of words in sentence
context.
\item \textbf{Flexible Pre-training:} Adapts to tasks with minimal task-specific
data through versatile pre-training.
\item \textbf{Comprehensive Contextual Awareness:} Utilizes bidirectional
attention for a comprehensive understanding of surrounding words.
\end{itemize}
& \begin{itemize}
\item \textbf{High Compute Requirements:} Intensive calculations limit use in
resource-constrained environments.
\end{itemize}
& \\
\hline
Long short-term memory (LSTM) & LSTM networks are a recurrent architecture that
augments the RNN state with gated memory cells, allowing gradients to persist over
long sequences and enabling the network to capture long-range dependencies in
sequential data. &
\begin{itemize}
\item \textbf{Long-Term Dependencies:} Gated memory cells retain information over
long sequences.
\item \textbf{Mitigated Vanishing Gradient:} Gating alleviates the vanishing
gradient problem of simple RNNs.
\end{itemize}
&
\begin{itemize}
\item \textbf{Complexity:} More difficult to understand due to complexity.
\item \textbf{Calculation Requirements:} Requires more resources for calculation.
\item \textbf{Risk of Overfitting:} More prone to overfitting, especially with
limited data.
\end{itemize}
& \cite{algorithem.LSTM.one},\cite{algorithem.LSTM.two},\cite{algorithem.LSTM.three} \\
\hline
Recurrent neural networks (RNN) & Recurrent neural networks (RNNs) are a class of
neural networks designed for sequential data processing. They use an internal
memory to capture dependencies and patterns in sequential information, making them
well suited for tasks such as natural language processing and time series
prediction. &
\begin{itemize}
\item \textbf{Sequential Processing:} Handles the processing of sequential and
time-related data.
\end{itemize}
&
\begin{itemize}
\item \textbf{Vanishing Gradient:} The gradient decreases in RNNs, making
long-term learning more difficult.
\item \textbf{Limited Long-Term Dependency:} Difficulties in capturing distant
dependencies.
\end{itemize}
& \cite{algorithem.RNN.one},\cite{algorithem.RNN.two},\cite{algorithem.RNN.three},\cite{algorithem.RNN.four} \\
\hline
Convolutional neural networks (CNN) & Convolutional neural networks (CNNs) are a
class of deep neural networks designed for processing structured raster data, such
as images. They use convolutional layers to automatically learn hierarchical
representations of features, making them very effective in tasks such as image
recognition and computer vision. & \begin{itemize}
\item \textbf{Feature Hierarchies:} Automatic learning of hierarchical features
for pattern recognition.
\item \textbf{Parameter Sharing:} Efficient reduction of parameters through
weight sharing.
\item \textbf{Translation Invariance:} Recognition of features regardless of
their input position.
\end{itemize}
&
\begin{itemize}
\item \textbf{Loss of Spatial Information:} Fine details can be lost by merging
layers.
\item \textbf{Sequential Data Limitation:} Less effective with sequential data.
\item \textbf{Sensitivity to Rotation and Scale:} Sensitivity to variations in
rotation and scale.
\end{itemize}
& \cite{algorithem.CNN.one},\cite{algorithem.CNN.two},\
cite{algorithem.CNN.three},\cite{algorithem.CNN.four} \\
\hline
Artificial neural networks (ANN) & Artificial neural networks (ANNs) are
computational models based on the structure and functioning of the human brain.
They consist of interconnected nodes organized in layers that allow them to learn
and adapt to complex patterns, making ANNs versatile for a variety of tasks, from
pattern recognition to decision making. &
\begin{itemize}
\item \textbf{Universal Approximation:} Can approximate complex nonlinear
functions given sufficient capacity.
\item \textbf{Adaptability:} Learns patterns directly from data and adapts to
new inputs.
\item \textbf{Parallel Processing:} Computations within a layer can be
parallelized efficiently.
\end{itemize}
& \begin{itemize}
\item \textbf{Black-Box Behavior:} Learned representations are difficult to
interpret.
\item \textbf{Data Requirements:} Prone to overfitting when training data is
limited.
\item \textbf{Parameter Sensitivity:} Performance influenced by hyperparameter
choices.
\end{itemize}
& \cite{algorithem.ANN.one},\cite{algorithem.ANN.two},\
cite{algorithem.ANN.three} \\
\hline
% \subsection{RNN}
% The most basic recurrent neural network configuration, called the Elman network
or Simple-RNN (S-RNN), was introduced by Elman in 1990 and later investigated by
Mikolov (2012) for its applicability in language modeling. The structure of the S-
RNN is defined as follows:
% \begin{equation}
% \begin{aligned}
% s_i &= R_{SRNN}(s_{s-1},x_i) = g(x_iW^x+s_{i-1}W^s+b)\\
% y_i &= O_{SRNN}(s_i)= s_i\\
% s_i,y_i \in \mathbb{R}^{d_s}, &x_i \in \mathbb{R}^{d_x},
% W^x \in \mathbb{R}^{d_x \times d_s},
% W^s \in \mathbb{R}^{d_s \times d_s},
% b \in \mathbb{R}^{d_s}
% \end{aligned}
% \end{equation}
% In this context, the state at position $i$ is formed by mixing the input at that
position with the previous state, and this combination is subjected to nonlinear
activation, typically $\tanh$ or ReLU. The output at position $i$ corresponds to
the hidden state at that position. Despite its simplicity, simple RNN yields
impressive results for tasks such as sequence tagging (as shown by Xu et al. 2015)
and is also suitable for language modeling. A detailed exploration of the use of
Simple RNNs in the context of language modeling can be found in Mikolov's 2012 PhD
thesis.
% % \begin{equation}
% % \hat{g} = \arg\min_g \sum_{z'\in Z} w_z (f(x') - g(z'))^2 + \Omega(g)
% % \end{equation}
% % \begin{equation}
% % \text{argmax}_{A\subseteq X} \left[ P(f(x) = c | x \in A) \cdot \
text{precision}(A) \right]
% % \end{equation}
% % \begin{equation}
% % \text{argmax}_{A \subseteq X} \left[ P(f(x) = c | x \in A) \cdot \
text{precision}(A) \right]
% % \end{equation}
% % \begin{equation}
% % \text{Importance}(x_i) = \left| \frac{w_i}{\sum_{j=1}^{n} |w_j|} \right|
% % \end{equation}
% % \begin{equation}
% % \text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\
text{Support}(X)}
% % \end{equation}
% % \begin{equation}
% % \text{Score}(H, D) = \text{Accuracy}(H, D) - \lambda \cdot \
text{Complexity}(H)
% % \end{equation}
% % \begin{equation}
% % \text{Gain}(A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D_v|}{|D|} \cdot H(D_v)
% % \end{equation}
% % \begin{equation}
% % S^{c}_{\text{Grad-CAM}}(x) = \text{ReLU} \left( \sum_{k} \sum_{i} \sum_{j} \alpha_{i,j}^{k,c} \cdot \frac{\partial y^{c}}{\partial A_{i,j}^{k}} \right)
% % \end{equation}
% % \begin{equation}
% % \begin{adjustbox}{max width=.44\textwidth}
% % S_{\text{Integrated Gradients}}(x) = (x - x') \times \int_{\alpha = 0}^{1} \frac{\partial F(x' + \alpha \times (x - x'))}{\partial x} \, d\alpha
% % \end{adjustbox}
% % \end{equation}
% % \begin{equation}
% % R_{\text{Guided}}(x) = \text{ReLU}\left(\frac{\partial y^{c}}{\partial x}\right)
% % \end{equation}
% % \begin{equation}
% % \frac{\partial y^{c}}{\partial x} = \text{ReLU}\left(\frac{\partial y^{c}}{\partial A}\right) \circ \frac{\partial A}{\partial x}
% % \end{equation}
% % \begin{equation}
% % \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_{\phi}(z|x)} \left[ \log p_{\theta}(x|z) \right] - \text{KL}\left( q_{\phi}(z|x) \| p(z) \right)
% % \end{equation}
% % \begin{equation}
% % \mathcal{L}_{\text{Beta-VAE}} = \mathcal{L}_{\text{VAE}} + \beta \cdot \text{KL}(q_{\phi}(z|x) \| p(z))
% % \end{equation}
% % \begin{equation}
% % \mathcal{L}_{\text{FactorVAE}} = \mathcal{L}_{\text{VAE}} + \gamma \cdot \text{TC}(q_{\phi}(z|x))
% % \end{equation}
% % \begin{equation}
% % h_i^{(l+1)} = \sigma \left( \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)} \right)
% % \end{equation}
% % \begin{equation}
% % \alpha_{ij} = \frac{\exp\left( \text{LeakyReLU}(\vec{a}^T [W h_i \| W h_j]) \right)}{\sum_{k \in \mathcal{N}(i)} \exp\left( \text{LeakyReLU}(\vec{a}^T [W h_i \| W h_k]) \right)}
% % \end{equation}
% % \begin{equation}
% % h_i' = \sigma \left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j \right)
% % \end{equation}
% % where \( \alpha_{ij} \) represents the attention weight for the edge connecting nodes \( i \) and \( j \), \( \vec{a} \) represents the attention parameter vector, \( W \) represents the weight matrix, \( h_i \) represents the hidden representation of node \( i \), and \( \sigma \) represents an activation function.
% % \begin{equation}
% % h_i^{(l+1)} = \sigma \left( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{c_{ijr}} W_r^{(l)} h_j^{(l)} \right)
% % \end{equation}
% \subsection{LSTM}
% \textcolor{black}{The LSTM architecture, introduced in 1997 by Hochreiter and Schmidhuber, was specifically designed to solve the problem of vanishing gradients in neural networks. The core concept of LSTM is to incorporate ``memory cells'' (represented as vectors) into the state representation to facilitate the persistence of gradients over time. These memory cells are regulated by gating mechanisms, which are smooth mathematical functions that mimic logic gates. At each input state, a gate determines how much of the new input is stored in the memory cell and how much of the current memory cell contents is discarded. In practice, a gate $g$, restricted to the range $[0, 1]$, is a vector of values that is multiplied element-wise by another vector $v \in \mathbb{R}^{n}$, followed by an addition operation. The values of $g$ are designed to be close to either 0 or 1, which is usually achieved by using a sigmoid function. This ensures that indices in $v$ corresponding to values close to 1 in $g$ can pass, while indices corresponding to values close to 0 are effectively blocked.}
% \textcolor{black}{Mathematically, the LSTM architecture is defined as:}
% \begin{equation}
% \begin{aligned}
% s_{j} = R_{LSTM}(s_{j-1},x_j) & =[c_j;h_j] \\
% c_j & = c_{j-1}\odot f + g \odot i \\
% h_j & = \tanh(c_j) \odot o \\
% i & = \sigma(x_j \mathbf{W^{xi}} + h_{j-1}\mathbf{W^{hi}}) \\
% f & = \sigma(x_j \mathbf{W^{xf}} + h_{j-1}\mathbf{W^{hf}}) \\
% o & = \sigma(x_j \mathbf{W^{xo}} + h_{j-1}\mathbf{W^{ho}}) \\
% g & = \tanh(x_j \mathbf{W^{xg}} + h_{j-1}\mathbf{W^{hg}}) \\
% y_j = O_{LSTM}(s_j) &= h_j \\
% s_j \in \mathbb{R}^{2 d_h}, x_j \in \mathbb{R}^{d_x}, c_j, h_j, i, f, o, g \in \mathbb{R}^{d_h}, W^{xo} \in \mathbb{R}^{d_x \times d_h}, W^{ho} \in \mathbb{R}^{d_h \times d_h}
% \end{aligned}
% \end{equation}
% The symbol $\odot$ is used to represent the element-wise product. At time $j$, the state consists of two vectors, namely $c_j$, which denotes the memory component, and $h_j$, which represents the output or state component. There are three gates labeled $i$, $f$, and $o$, which are responsible for controlling the input, forgetting, and output, respectively. These gate values are determined by linear combinations of the current input $x_j$ and the previous state $h_{j-1}$, which are then passed through a sigmoid activation function. A candidate update, denoted $g$, is computed as a linear combination of $x_j$ and $h_{j-1}$, which is then subjected to a $\tanh$ activation function. The memory $c_j$ is then updated, with the forget gate specifying the extent to which the previous memory should be retained $(c_{j-1} \odot f)$ and the input gate governing the inclusion of the proposed update $(g \odot i)$. Ultimately, the value of $h_j$ (which also serves as output $y_j$) is derived from the contents of memory $c_j$, passed through a nonlinear $\tanh$ transform and controlled by the output gate. These gating mechanisms facilitate the preservation of gradients associated with the memory component $c_j$ over long time intervals.
% A more detailed examination of the LSTM architecture can be found in the PhD thesis by Alex Graves (2008) and in the description by Chris Olah. For an analysis of the performance of LSTMs when used as a character-level language model, see the work of Karpathy et al. (2015).
% LSTMs are currently the most successful variant of recurrent neural networks (RNNs) and have contributed to numerous breakthrough achievements in sequence modeling. Their main competitor in the RNN field is the GRU, which will be discussed in more detail in the following sections.
% Our training procedure includes two main phases. First, we train a high-capacity language model on a large text dataset. This is followed by a fine-tuning phase in which the model is adapted for a specific discrimination task with labeled data.
% \begin{equation} \label{GPT:1}
% L_1(\mathcal{U})=\sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1} ; \Theta\right)
% \end{equation}
% Here, $k$ denotes the size of the context window, and the conditional probability $P$ is represented by a neural network with parameters $\Theta$. These parameters are refined by applying stochastic gradient descent [51].
% % TODO: update the reference above
% In our experiments, we use a transformer decoder with multiple layers [34] for the language model. This variant of the transformer [62] applies a multi-headed self-attention mechanism over the input context tokens, followed by position-wise feedforward layers, to generate a distribution over the target tokens.
% \begin{equation} \label{GPT:2}
% \begin{aligned}
% h_0 & = U W_e + W_p \\
% h_l & = \text{transformer\_block}\left(h_{l-1}\right) \quad \forall l \in [1, n] \\
% P(u) & = \operatorname{softmax}\left(h_n W_e^T\right)
% \end{aligned}
% \end{equation}
% where $U = (u_{-k}, \ldots, u_{-1})$ represents the context vector of tokens, $n$ represents the number of layers, $W_e$ represents the token embedding matrix, and $W_p$ represents the position embedding matrix.
% Once the model is trained with the objective described in Equation \ref{GPT:1}, we adjust the parameters for the specific supervised target task. We consider a labeled dataset $\mathcal{C}$ where each example is a sequence of input tokens $x^1, \ldots, x^m$ together with a label $y$. These inputs are processed by our pre-trained model to produce the final activation $h_l^m$ of the transformer block. Subsequently, this activation is passed to an additional linear output layer, characterized by the parameters $W_y$, to make predictions for $y$.
% \begin{equation}
% P\left(y \mid x^1, \ldots, x^m\right)=\operatorname{softmax}\left(h_l^m W_y\right)
% \end{equation}
% This results in the following objective, which must be maximized:
% \begin{equation}
% L_2(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
% \end{equation}
% We also discovered that integrating language modeling as an auxiliary objective during the fine-tuning process improved learning by (a) enhancing the generalization capabilities of the supervised model and (b) accelerating the convergence process. This result is consistent with previous research [50, 43] that observed improved performance with a similar auxiliary objective. More specifically, we optimize the following objective (with a weighting parameter $\lambda$):
% \begin{equation}
% L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda \cdot L_1(\mathcal{C})
% \end{equation}
% In summary, the only additional parameters needed for the fine-tuning process are $W_y$ and embeddings for delimiter tokens, as explained in detail in \ref{sub:sec:gpt}.
% \subsubsection{Task-specific input transformations}\label{sub:sec:gpt}
% For some tasks, such as text classification, we can directly fine-tune our model as described above. Certain other tasks, such as question answering or textual entailment, have structured inputs, such as ordered sentence pairs or triplets of document, question, and answer. Since our pre-trained model was trained on contiguous text sequences, we need to make some modifications to apply it to these tasks. Previous work has proposed learning task-specific architectures on top of transferred representations [44]. Such an approach reintroduces a significant amount of task-specific adaptation and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we transform structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid extensive changes to the architecture for different tasks. We briefly describe these input transformations below and visually illustrate them in Figure \ref{fig-gpt}. All transformations involve adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$).
% \begin{figure} \label{fig-gpt}
% \centering
% \includegraphics[width=0.2\linewidth]{gpt1.png}
% \includegraphics[width=0.75\linewidth]{gp2.png}
% \caption{(left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.}
% \label{fig:enter-label}
% \end{figure}
\section{Application}
\label{sec:Application}
% TODO: divide into subsections
Human-Robot Interaction (HRI) is a crucial milestone in robotics development and
underlines its importance in our increasingly connected world. It goes beyond mere
functionality, enabling robots to understand not only commands but also the
emotions and intentions conveyed in human language. The profound impact of HRI is being felt in
many areas, from healthcare, where robots can provide companionship and support, to
education, where they can enhance learning experiences. In customer service, robots
can better answer inquiries, while in manufacturing, they can work seamlessly with
human workers to improve production efficiency. Deep Learning-based natural
language processing (NLP) is the linchpin of this transformation, enabling robots
to recognize nuance, respond empathetically, and generate contextual responses.
This advanced NLP technology streamlines interactions and improves accessibility
and, thus, the user experience. Consequently, it accelerates the integration of
robots into daily life and industry, bringing us closer to a future where human-
robot interactions are efficient and remarkably human-centric, fostering an
environment of trust and acceptance.
\begin{figure}
\includegraphics[width = \textwidth]{application_diagram.png}
\caption{Diverse application domains of human-robot interaction.}
\label{fig:application}
\end{figure}
\subsection{Healthcare}
In healthcare, human-robot interaction (HRI) is proving to be a transformative
force. HRI has the potential to revolutionize patient care and encompasses a wide
range of applications, from assisting medical staff during surgeries to providing
emotional support to patients with mental illness. This section explores the
multifaceted landscape of HRI in healthcare and examines its critical role in
improving medical services, enhancing patient outcomes, and addressing healthcare
challenges. We address the various applications and innovations that demonstrate
the growing importance of HRI and pave the way for a comprehensive understanding of
its evolving role in the healthcare ecosystem.
Escobar et al. developed a deep learning facial expression recognizer tailored for
children with Down syndrome and investigated the integration of EEG for potential
applications in therapies and brain-computer interfaces\cite{etalEscober}. Ilyas et
al. discussed how deep transfer learning can improve human-robot interaction in
cognitive and physical rehabilitation, which could improve the quality of life of
people with impairments\cite{etalIlyas}.
Dai et al. showed how deep learning-based NLP models improved screening, diagnosis,
and early intervention for psychiatric patients in mental health care, including
classification of discharge reports from electronic medical records and the
potential for early detection through social media analysis \cite{etal.Dai}.
Kim et al. proposed a deep transfer learning-based NLP model for predicting overall
survival in patients with rectal cancer from unstructured MRI reports and
emphasized the superior performance of this model over conventional methods for
using unstructured medical data \cite{etal.Kim}. Faizan et al. present PyNDA, a
groundbreaking deep learning architecture demonstrating superior ability to extract
various psychometric dimensions from user-generated text \cite{etal.Ahmad}.
Gouws et al. improve deep neural language modeling with a modified HLBL model,
incorporating convolutional and RNN layers for improved word sense disambiguation
and optimized learned representations \cite{etal.Gouws}. Vankayala et al.
contribute to depression detection with the hybrid deep learning model FCL,
incorporating fastText embeddings, a CNN, and an LSTM for improved early-stage
detection in social media texts \cite{etal.Tejaswini}. Nandini et al. likewise
contribute to depression detection using the hybrid deep learning model FCL,
representing a significant advance in the application of NLP techniques
\cite{etal.Nandini}. Ferrone et al. explore the synergy between symbolic and
distributed representations in NLP and deep learning, developing novel neural
networks for interpreting symbols in classical NLP tasks \cite{etal.Ferrone}. Dai
et al. demonstrate the effectiveness of pre-trained deep learning models in
improving mental illness classification using sparse electronic health data for
accurate psychiatric diagnosis and early detection \cite{etal.Dai}. Mariani et al.
analyze five-year changes in language and NLP research using the NLP4NLP+5 corpus,
focusing on AI, neural networks, machine learning, and word embeddings for
science-based insights \cite{etal.Mariani}. Hollenstein et al. integrate EEG brain
activity to improve natural language processing tasks, particularly emotion
classification, demonstrating the potential of EEG features to improve performance
in scenarios with limited training data \cite{etal.Hollenstein}. Le et al.
introduce a novel method using fastText and deep convolutional neural networks to
identify SNARE proteins, surpassing the SNARE-CNN predictor in accuracy,
sensitivity, specificity, and MCC \cite{etal.Le}. Orsag et al. present a trainable
model based on human-skeleton data and LSTM networks to recognize the
spatio-temporal activity of human workers, enabling context-aware human-robot
collaboration in industrial environments with a focus on safety and efficiency
\cite{etal.Orsag}.
Guazzo et al. present deep learning NLP models for the automatic identification of
hospitalizations for cardiovascular disease in diabetic patients and demonstrate
their effectiveness in different time windows\cite{application.healthcare.Guazzo}.
Kim et al. present a superior deep learning-based algorithm for extracting keywords
from pathology reports that can be used in biomedical research and medical
institutions to overcome data extraction challenges \cite{application.healthcare.Kim1}.
Sarraju et al. present and validate clinical BERT-based NLP models. They achieve
high accuracy in classifying statin non-use and reasons for non-use in patients
with atherosclerotic cardiovascular disease using electronic health records,
providing valuable insights for targeted interventions \cite{application.healthcare.Sarraju}.
\subsection{Entertainment}
In the entertainment industry, interaction between humans and robots opens up new
dimensions of user experience. Interactive robots create immersive and engaging
entertainment and foster unique forms of creative expression and shared social
experiences. This integration not only reflects technological advancements, but
also shapes the future of entertainment by providing audiences with engaging and
interactive forms of enjoyment.
The transformation model for human activity recognition using wearable sensors
proposed by Luptáková et al. demonstrates the real-time analysis of smartphone
sensor data and the versatility of its applicability in various scenarios, as
evidenced by its successful application in the entertainment context \cite{application.entertainment.teja}.
Tan et al. present a task-completion question answering framework with a multimodal
dataset and a hybrid deep learning-symbolic reasoning approach that handles
collaborative tasks with text, images and videos in a natural environment \cite{application.entertainment.tan}.
Lee et al. present a framework that uses a multimodal dataset and a hybrid deep
learning-symbolic reasoning approach to answer task completion questions. The
approach was developed to handle collaborative tasks with text, images and videos
in a natural environment\cite{application.entertainment.lee}.
Atzeni et al. demonstrate the successful integration of sentiment analysis, deep
learning and semantic technologies to enable Zora, a humanoid robot, to perform
natural language interactions in the field of human-robot interaction \cite{application.entertainment.Atzeni}.
Morshed et al. contribute by providing a thorough review of recent advances in
human activity recognition, covering methods, frameworks, datasets and challenges.
The article also provides practical guidance for researchers and practitioners in
the field\cite{application.entertainment.Morshed}.
Dirgová et al. adapt the transformer model, originally developed for natural
language processing and vision tasks, for precise real-time human activity
recognition from smartphone motion sensor data, demonstrating high accuracy and
practical applicability \cite{application.entertainment.Dirgová}.
Russo et al. utilize GPT-2 in deep learning-based NLP to extract meaningful neural
encodings from functional MRI during narrative listening, revealing that GPT-2
surprisal and saliency explain neural data in language-related brain regions,
highlighting the potential for deep learning models to investigate complex neural
mechanisms underlying human language comprehension \cite{application.entertainment.Russo}.
% \subsection{Management}
% The LLM-based smart reply (LSR) system by Bastola et al., which is based on ChatGPT, improves collaboration efficiency through contextual and personalized responses, resulting in improved team performance and reduced mental load in daily work \cite{application.management.bastola}.
\subsection{Industrial}
The importance of human-robot interaction in industrial contexts is underlined by
its transformative impact on efficiency, safety and productivity. As technology
advances, these benefits are further enhanced by the integration of deep learning-
based natural language processing, enabling more nuanced and adaptive interactions
between humans and robots. Below are numerous examples of such applications in
industry that demonstrate how the synergy of human-robot interaction and deep
learning-based natural language processing contributes to a more efficient,
responsive and effective operating environment.
Liu et al. develop a deep learning-based multimodal control interface that enables
seamless collaboration between humans and robots in manufacturing and provides the
flexibility for dynamic task changes in a shared workspace \cite{application.industrial.liu}.
Illuri et al. develop a hand gesture recognition system that combines a
humanoid robot with machine learning techniques and enables accurate recognition of
hand postures and gestures in human-robot interaction \cite{application.industrial.illuri}.
Ahn et al. present the Interactive Text2Pickup (IT2P) network, which improves
human-robot collaboration by skillfully handling ambiguous voice commands through
interaction, ensuring accurate pickup of objects based on user instructions \cite{application.industrial.Ahn}.
Mohamed et al. contribute by introducing conventions, standard interfaces and a
reference pipeline in ROS for HRI, with the aim of improving interoperability and
enabling the reuse of core functionality across different HRI-related software
tools\cite{application.industrial.Mohamed}.
Keshinro et al. present a deep learning-based approach using ConvLSTM and LRCN
algorithms with RGB images that improves human-robot collaboration by accurately
predicting human intentions and improving team planning and execution by
recognizing implicit human intentions\cite{application.industrial.Keshinro}.
Ben-Youssef et al. present CollisionNet, a deep neural network developed for
collision detection in collaborative robots. This model is characterized by
remarkable sensitivity, robustness to false alarms and generalization across
different robots and motions\cite{application.industrial.Ben-Youssef}.
Kang et al. contribute by presenting video annotation methods that utilize both the
egocentric and exocentric viewpoints of the robot for human-robot interaction and
improve the description of visual information in a social robotics context. This
approach aims to create a comprehensive understanding of the robot's visual
perspective\cite{application.industrial.Kang}.
Lu et al. present an innovative lip-speech decoding system using low-cost
triboelectric sensors and an enhanced recurrent neural network that achieves high
accuracy and has potential applications for individuals with vocal cord lesions,
enriching the landscape of lip-speech translation systems \cite{application.industrial.Lu}.
Wahab et al. present 4mCNLP-Deep, a superior computational model for the
identification of N4-methylcytosine sites that utilizes deep learning and word
embedding, highlighting the importance of DNA methylation and demonstrating the
effectiveness of deep learning in analyzing genomic data \cite{application.industrial.Wahab}.
Ruffolo et al. present IgFold, a rapid deep learning method for antibody structure
prediction that provides excellent insights into a large number of paired antibody
sequences, highlighting the importance of accurate structure prediction for the
study of the adaptive immune response and potential therapeutic applications \cite{application.industrial.Ruffolo}.
Liu et al. present OPED, a deep learning-based prime-editing guide RNA design
optimization model that demonstrates superior accuracy, efficiency enhancement, and
versatility in genome editing applications, together with the OPEDVar database for
user-friendly access to optimized designs \cite{application.industrial.Liu1}.
Xiao et al. introduce SLICES, a string-based crystal representation with remarkable
invertibility, and position it as a promising tool for in-silico exploration of
materials and inverse design of narrow-gap semiconductors \cite{application.industrial.Xiao}.
Fernandez et al. present CLOOME, a multimodal contrastive learning system that
significantly improves the retrieval of bioimaging databases of chemical
structures, shows remarkable transferability to various drug discovery tasks, and
outperforms existing methods in predicting mechanism of action \cite{application.industrial.Fernandez}.
\subsection{Interdisciplinary}
Human-robot interaction (HRI) is crucial in interdisciplinary research as it
combines insights from robotics, psychology, computer science and design for a
seamless integration of robotic technologies. This collaborative approach requires
a sophisticated understanding of human behavior and preferences. The
interdisciplinary nature of HRI is critical to the development of sophisticated,
socially aware robotic systems that can work seamlessly with humans. The following
commentary examines specific contributions of relevant work that underscore the
importance of HRI as a catalyst for interdisciplinary exploration and innovation.
Goldstein et al. show empirical parallels between language processing in the human
brain and an autoregressive deep language model and point to common computational
principles and potential applications in modeling language-related processes \cite{application.Interdisciplinary.Goldstein}.
Zeng et al. propose a unified deep learning system that seamlessly integrates
molecular structures and biomedical texts, outperforms human experts in
understanding molecular features, and demonstrates its versatility in various
biomedical tasks, with potential applications in automated drug discovery \cite{application.Interdisciplinary.Zeng}.
Mao et al. present IKGM, a novel deep learning method using attention mechanisms to
identify key genes in macroevolution, which has been successfully applied to
diurnal butterflies and nocturnal moths, demonstrating its potential for insights
into macroevolutionary mechanisms at the genomic level \cite{application.Interdisciplinary.Mao}.
Frey et al. explore neural scaling in deep chemical models and propose strategies
for improved pre-training efficiency and performance that have practical
implications for applications in robotics and computer science \cite{application.Interdisciplinary.Frey}.
Shanmugavadivel et al. apply machine learning and pre-trained models to effectively
overcome the challenges of sentiment analysis in low-resource settings, especially
on Tamil-English code-mixed data. Their application contributes significantly to
advancing sentiment analysis in code-mixed languages \cite{application.Interdisciplinary.Shanmugavadivel}.
Diviya et al. present a novel neural architecture for natural language image
synthesis in Tamil that addresses the challenges of regional language, improves
descriptive image synthesis, and contributes to computer vision and speech
synthesis\cite{application.Interdisciplinary.Diviya}.
Polyglotter, co-developed by Bazaga et al, is a flexible machine learning system
that converts natural language queries into database commands without manual
annotation. It uses a Transformer-based model and shows strong performance across
different database engines\cite{application.Interdisciplinary.Bazaga}.
% need introduction
\subsection{Others}
Below are several noteworthy papers that hold considerable value concerning their
application in Human-Robot Interaction (HRI). These papers contribute significantly
to the understanding and advancement of HRI dynamics, offering valuable insights
into various aspects of the field.
The transformation model for human activity recognition using wearable sensors
presented by Luptáková et al. demonstrates the real-time analysis of smartphone
sensor data and the broad applicability in different scenarios\cite{etalLuptáková}.
Kasmaiee et al. propose an effective two-method system that combines rule-based
approaches and a sophisticated deep learning model for automatic spelling
correction in Persian texts, with potential applications in computer systems and
robotics \cite{application.others.Kasmaiee}.
Li et al. investigate neural coding in the human auditory pathway using DNN models
and show strong correlations between hierarchical DNN layers and neural activity,
with language-specific models predicting cortical responses and highlighting the
superiority of DNNs in speech processing tasks\cite{application.others.Li}.
TalkToModel, developed by Slack et al, is an advanced conversational system that
outperforms traditional point-and-click explanation systems and enables users to
interactively understand and explain machine learning models, especially in
critical areas such as healthcare\cite{application.others.Slack}.
\section{Results analysis}
\label{sec:Results_analysis}
The results and analysis derived from the aggregated table of papers in this
survey, focusing on the use of deep learning-based natural language processing
(NLP) in human-robot interaction (HRI), provide insightful perspectives on the
current research landscape. Examining these works helps identify trends,
challenges, and innovative areas within the field and provides researchers with a
nuanced understanding of advances and existing gaps in the literature. Detailed
findings on the prevalence and performance of various models are outlined in
Table~\ref{tab:result}, contributing to a comprehensive understanding of the most
advanced applications in the field.
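Many of the results aggregated here are reported as F1-scores. As a reminder of how that metric is computed from raw prediction counts, the following is a minimal sketch; the counts in the example are invented for illustration only:

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are recovered
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 90 true positives, 5 false positives, 10 false negatives.
print(round(f1_score(90, 5, 10), 4))  # 0.9231
```

Because F1 balances precision against recall, it is a more informative summary than raw accuracy for the imbalanced datasets common in this literature.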
\begin{longtable}
{p{2cm}p{2cm}p{3.5cm}p{3cm}p{1.5cm}p{4cm} }
\caption{Summary of the domains, datasets, pre-processing methods, models, and results reported in recent papers.}
\label{tab:result} \\
\hline
\textbf{Reference} & \textbf{Domain}
& \textbf{Dataset} & \textbf{Pre-processing Methods} & \textbf{Model} & \textbf{Results}\\
\hline
\endfirsthead
\multicolumn{6}{c}%
{{\bfseries \tablename\ \thetable{} -- continued from previous page}} \\
\hline
\textbf{Reference} & \textbf{Domain}
& \textbf{Dataset} & \textbf{Pre-processing Methods} & \textbf{Model} & \textbf{Results} \\
\hline
\endhead
\hline
\endlastfoot
Khan \emph{et al.} \cite{etal.Wang}
& Interdisciplinary
& COCO
&
Tokenization\newline
Part-of-Speech Tagging (POS)\newline
Lemmatization\newline
Stemming
&
Mask-RCNN,\newline
CNN-LSTM network
& \textbf{F1-score}: \newline 97.01\%, 93.34\% (NIH) \newline 97.50\%, 95.78\% (JRST) \\ \hline
\end{longtable}
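To make the pre-processing steps listed in the table concrete, the following sketch chains tokenization, stop-word removal, and suffix-stripping stemming. It is a deliberately simplified pure-Python illustration, not the pipeline of any surveyed paper; real systems would use a library such as NLTK or spaCy, and the suffix rules below are toy assumptions:

```python
import re

def tokenize(text):
    # Lowercase and split into alphabetic word tokens.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy suffix-stripping stemmer (far cruder than Porter's algorithm).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text, stopwords=frozenset({"the", "a", "is", "and"})):
    # Tokenize, drop stop words, then reduce each token to a stem.
    return [stem(t) for t in tokenize(text) if t not in stopwords]

print(preprocess("The robot is grasping the objects"))  # ['robot', 'grasp', 'object']
```

Each step reduces vocabulary size and noise before the text reaches a downstream model, which is precisely why these operations recur across the surveyed papers.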
% \section{Pre-processing Methods}
% \label{sec:Pre-processing}
% Pre-processing is vital in deep learning-based natural language processing for
human-robot interaction, converting raw text into machine-readable formats using
techniques like data cleansing and tokenization. These methods, akin to sentiment
analysis, refine input data and enhance model performance. Rigorous pre-processing
ensures data integrity, contributing significantly to scientific advancements in
optimizing NLP applications in HRI. The table \ref{tab:preprocessing}below
elucidates prevalent pre-processing methods in this field.
% \begin{table}[H]%[htb]%
% \begin{adjustwidth}{-4cm}{-4cm} % Comment out/remove adjustwidth environment if
table fits in text column.
% \centering
% \caption{
% {\bf The table discusses the commonly used pre-processing methods in sentiment
analysis.}}
% % \begin{tabular}{p{2.5cm}p{14cm}}
% \begin{tabular}{p{3cm}p{12cm}p{5cm}}
% \hline
% \hline
% Image Crop & Image cropping is a pre-processing technique that involves removing
a specific part of an image, typically to focus on a region of interest or to
resize it for further analysis or display.
% &\cite{zhang2023tcpcnet},\cite{kushol2023effects},\cite{fanjie2023sust}
% \\ \hline
% Tokenization & Breaking down a text into smaller linguistic units such as words,
% phrases, or symbols to facilitate further analysis.
% &\cite{halawani2023automated},\cite{baghaei2023deep},\cite{aldunate2022understanding},\cite{zhou2023generating},\cite{pandey2022mental}
% \\ \hline
% Lowercasing & Helpful for standardizing text to reduce vocabulary complexity and
% prevent the model from treating words with different capitalization as different
% entities.
% &\cite{zaheer2023multi},\cite{duong2023deep},\cite{alshahrani2023applied}
% \\ \hline
% Lemmatization \newline \& \newline Stemming & Techniques for reducing words to
% their basic form in order to deal with word variations and thus reduce the size
% of the vocabulary and improve the generalization capability of the model.
% &\cite{budiharto2021novel},\cite{ayanouz2020smart},\cite{guazzo2023deep},\cite{wang2023deepsa}
% \\ \hline
% Noise removal & An important step in improving the quality of text data by
% removing irrelevant information such as special characters, symbols, or
% irrelevant words that do not contribute to the overall meaning.
% &\cite{amaar2022detection},\cite{kheraleveraging},\cite{merdivan2019dialogue}
% \\ \hline
% Feature extraction & Essential for capturing the most important information from
% the text data and creating meaningful representations that can be effectively
% used by the model to learn patterns and make predictions.
% &\cite{johnston2023ns},\cite{yohanes2023emotion},\cite{10343159}
% \\ \hline
% Stop word removal & Removes frequently occurring words (e.g., 'and', 'the',
% 'is') that do not contain important information and are often ignored during
% analysis to reduce noise and increase processing speed.
% &\cite{balouch2023transformer},\cite{mithun2023development},\cite{das2022deep},\cite{nijhawan2022stress}
% \\ \hline
% Removal of special \newline characters \& numbers & Helps clean up the text by
% removing non-alphabetic characters and numeric digits to ensure that the data
% focuses on the textual information relevant to the analysis.
% &\cite{olthof2021deep},\cite{nictoi2023unveiling},\cite{das2023sentiment}
% \\ \hline
% Vectorization & The process of converting text data into numeric vectors,
% making it suitable for various machine learning models that require numeric
% input for processing.
% &\cite{gupta2023detecting},\cite{xavier2022natural},\cite{marulli2021exploring}
% \\ \hline
% Named Entity Recognition (NER) & A process that identifies and classifies named
% entities in text, such as names, places, dates, and numeric values, so that the
% model can recognize specific entities and their context.
% &\cite{ahmed2023fine},\cite{jang2022exploration}
% \\ \hline
% Punctuation removal & Removing punctuation from text data helps to simplify the
% text and ensures that the model focuses on the context of the text, improving
% the accuracy of the analysis and predictions.
% &\cite{agarwal2023deepgram},\cite{motyka2023information},\cite{ashfaque2023design}
% \\ \hline
% \end{tabular}
% \label{tab:preprocessing}
% \end{adjustwidth}
% \end{table}
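Several of the preprocessing techniques discussed in this section (lowercasing, removal of special characters and numbers, tokenization, and stop-word removal) can be chained into a single pipeline. The sketch below is purely illustrative and uses a tiny hand-picked stop-word list; it is not the pipeline of any specific reviewed work, and real systems draw stop lists from resources such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger
# lists (e.g., from NLTK or spaCy). This choice is our assumption.
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to"}

def preprocess(text):
    """Apply common preprocessing steps in sequence: lowercasing,
    removal of special characters and digits, whitespace tokenization,
    and stop-word removal."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation, digits, symbols
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("The robot understands 3 commands, and it replies!"))
# → ['robot', 'understands', 'commands', 'it', 'replies']
```

Each step reduces vocabulary size and noise before vectorization, which is why these operations almost always precede model training.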
In future work, we will apply data augmentation to increase dataset diversity, and
employ attention mechanisms, ensemble techniques and transfer learning to improve
performance. Consumers should also be involved in the design process, with feedback
loops established to enable continuous improvement. These obstacles must be
overcome to produce accurate and seamless animation, ensuring a more efficient and
productive production pipeline.
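The text-level data augmentation mentioned above is often implemented with simple perturbation operators. The sketch below uses random word deletion and random position swaps; these two operators are common baseline choices and our own assumption here, not a method proposed in the reviewed literature.

```python
import random

def augment(tokens, p_delete=0.1, n_swaps=1, seed=0):
    """Generate an augmented variant of a token list using two simple
    operators: random word deletion (each token is dropped with
    probability p_delete) and random position swaps."""
    rng = random.Random(seed)
    # Random deletion; fall back to the first token if everything is dropped.
    out = [t for t in tokens if rng.random() > p_delete] or tokens[:1]
    # Random swap: exchange two randomly chosen positions n_swaps times.
    for _ in range(n_swaps):
        if len(out) > 1:
            i, j = rng.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    return out

base = ["the", "robot", "grasps", "the", "red", "cube"]
print(augment(base))          # a perturbed copy of the sentence
print(augment(base, seed=7))  # a different variant for another seed
```

Varying the seed yields many distinct variants per sentence, which is how augmentation increases dataset diversity without new annotation effort.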
\subsection{Data imbalance and lack of diversity}
The main challenge lies in the imbalance of the data and the lack of diversity of
interaction situations within the dataset, which leads to the model focusing on the
detection of the presence of a friend rather than the complexity of the different
interactions. To solve this problem, it is essential to expand the dataset by
introducing more diverse interaction situations. This expanded dataset will allow
the model to better generalize, distinguish between different interaction contexts,
and understand the nuances of human-robot collaboration.
In future work, we will actively curate a more diverse dataset that incorporates
different interaction scenarios. By continuously refining this dataset based on
real-world observations, we can ensure that the model learns to recognize and
understand a wider range of human-robot interactions, improving its performance
and applicability.
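Until such an expanded dataset exists, a standard mitigation for the class imbalance described above is to oversample minority classes at training time. The following is a minimal sketch of random oversampling; the sample data is invented for illustration.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Draw extra samples (with replacement) from the minority class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

X = ["hand-over", "hand-over", "hand-over", "greeting"]
y = ["collab", "collab", "collab", "social"]
Xb, yb = oversample(X, y)
print(Counter(yb))  # each class now has 3 samples
```

Oversampling balances class frequencies but duplicates examples verbatim, which is why it is usually combined with the augmentation and dataset-expansion strategies discussed above.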
The challenge lies in the lack of interpretability of complex deep learning models,
which is an obstacle to understanding their decision-making processes. This opacity
becomes particularly problematic in sensitive areas such as clinical decision-making.
To overcome this challenge, the inclusion of Explainable AI techniques is
crucial. Methods such as attention mechanisms or layer-wise relevance propagation
provide insights into the internal workings of the model and offer explanations for
certain decisions. By integrating these techniques, we increase transparency,
interpretability and confidence in the results of the model. This is particularly
important for applications where comprehensibility and accountability are critical
to ensure that end users can trust the model's predictions and make informed
decisions.
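Attention weights are one of the interpretability signals mentioned above: because they form a probability distribution over input tokens, they can be inspected directly to see which tokens the model attends to. The following is a minimal, self-contained sketch of dot-product attention scoring, not the mechanism of any specific reviewed model.

```python
import math

def attention_weights(query, keys):
    """Dot-product attention: score each key vector against the query
    and normalize with a softmax. The resulting weights sum to 1 and
    can be read as 'how much the model attends to each input token'."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-d embeddings for three input tokens (invented for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
w = attention_weights(query, keys)
print(w)  # weights sum to 1; the highest weight falls on the most similar token
```

Visualizing such weights over the input sequence is a common first step toward the transparency requirements described above, though attention alone is not a complete explanation of model behavior.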
Dealing with unstructured data is particularly delicate, given the large volume,
variety and dynamics of unstructured data published online. Machine learning
techniques, such as the proposed deep learning-based model, can help to overcome
this challenge: they have shown strong performance in sentiment analysis and can
effectively process large amounts of text data. By using natural language
processing (NLP) and long short-term memory (LSTM) techniques, the proposed model
can process and analyze unstructured data from various sources, including in-app
messages, social media websites, e-commerce websites, and news publishing websites.
In particular, our future work will refine the hybrid feature extraction approach
used in the model to improve its ability to extract relevant information from
unstructured data.
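A hybrid feature extraction step of the kind referred to above typically combines statistical features with handcrafted ones. The exact features of the reviewed model are not specified, so the combination below (bag-of-words counts plus a sentiment-lexicon score) and the tiny lexicon are purely illustrative assumptions.

```python
from collections import Counter

# Tiny illustrative sentiment lexicon; real systems use resources
# such as VADER or SentiWordNet. These word lists are our assumption.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "slow", "broken"}

def hybrid_features(tokens, vocabulary):
    """Concatenate a bag-of-words vector (statistical feature) with a
    lexicon-based sentiment score (handcrafted feature)."""
    counts = Counter(tokens)
    bow = [counts[w] for w in vocabulary]
    sentiment = (sum(t in POSITIVE for t in tokens)
                 - sum(t in NEGATIVE for t in tokens))
    return bow + [sentiment]

vocab = ["robot", "good", "slow"]
print(hybrid_features(["the", "robot", "is", "good"], vocab))
# → [1, 1, 0, 1]
```

The resulting vector can then feed an LSTM or any downstream classifier; the appeal of the hybrid approach is that lexicon features inject domain knowledge the statistical features may miss.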
\subsection{Data Augmentation for Low-Resource Languages}
\section{Result Analysis}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}
\label{sec:Conclusion}
\label{end}
In summary, the integration of deep learning methods with natural language
processing (NLP) in the context of human-robot interaction (HRI) represents a
transformative paradigm in this rapidly evolving field. This study navigates
the complicated dynamics of HRI and emphasizes the central role that deep learning
plays in shaping communication between humans and robots. Departing from
conventional sentiment analysis, our investigation encompasses a spectrum of HRI
facets that include dialog systems, language understanding, and contextual
communication. The study systematically scrutinizes applications, algorithms and
models that describe the current landscape of deep learning-based NLP in HRI. It
also provides valuable insights into common pre-processing techniques, datasets and
specific evaluation metrics. By revealing the benefits and challenges that machine
learning and deep learning algorithms bring to HRI, as well as providing a
comprehensive overview of current state-of-the-art experiments, this review serves
not only as a navigation aid for the field, but also as a catalyst for future
advances. The concluding discussion of specific challenges in the field of HRI sets
the stage for future research and ensures a nuanced understanding of models,
applications, challenges, and the trajectory of deep learning-based NLP research in
the field of human-robot interaction.
The authors declare that they have no competing interests in this work, and no
commercial or associative interest that represents a conflict of interest in
connection with the work submitted.
%% If you have bibdatabase file and want bibtex to generate the
%% bibitems, please use
%%
\bibliographystyle{elsarticle-num}
\bibliography{ref}
%% else use the following coding to input the bibitems directly in the
%% TeX file.
% \begin{thebibliography}{00}
% %% \bibitem[Author(year)]{label}
% %% Text of bibliographic item
% \bibitem[ ()]{}
% \end{thebibliography}
\end{document}
\endinput
%%
%% End of file `elsarticle-template-harv.tex'.