
GOVERNMENT POLYTECHNIC, SAKOLI

ACADEMIC SESSION 2022-2023

A CAPSTONE PROJECT PLANNING REPORT ON


“AI VOICE ASSISTANT APPLICATION”
This project planning report is submitted in partial fulfilment of the requirements
for the award of the Diploma in

COMPUTER TECHNOLOGY

MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION, MUMBAI

Submitted by

Gitesh N. Ujgaonkar Pratik N. Vairagade


Sagar R. Meshram Hardik V. Bawankule

Under the guidance of


Dr. U. B. Aher
(Lecturer, Computer Technology)

DEPARTMENT OF COMPUTER TECHNOLOGY

GOVERNMENT POLYTECHNIC, SAKOLI
ACADEMIC SESSION 2022-2023

CERTIFICATE

This is to certify that the project report entitled “AI Voice Assistant Application” using
Kotlin was successfully completed by the following students of the fifth semester Diploma in Computer Technology:

1. Gitesh Narendra Ujgaonkar


2. Pratik N. Vairagade
3. Sagar R. Meshram
4. Hardik V. Bawankule

in partial fulfilment of the requirements for the award of the Diploma in Computer Technology,
submitted to the Department of Computer Technology, Government Polytechnic, Sakoli. The work was
carried out during the academic year 2022-2023 as per the curriculum.

Subject Teacher: Shri. A. A. Bhajpayee

Guide: Dr. U. B. Aher

Head of the Department: Shri. V. B. Khobragade

Principal: Shri. S. P. Lambhade

DEPARTMENT OF COMPUTER TECHNOLOGY

GOVERNMENT POLYTECHNIC, SAKOLI
ACADEMIC SESSION 2022-2023

SUBMISSION
We, the students of the third year of the Computer Technology course, sincerely submit this report of
our Capstone Project. We have completed the project work from time to time as described in this report,
using our skills, under the guidance of Dr. U. B. Aher.

Roll No Enrolment No Name Signature


48 2000910159 Gitesh Narendra Ujgaonkar
53 2100910087 Pratik Nilkanth Vairagade
54 2100910177 Meshram Sagar Rajendra
59 2100910190 Hardik Vijaykumar Bawankule

DEPARTMENT OF COMPUTER TECHNOLOGY

GOVERNMENT POLYTECHNIC, SAKOLI
ACADEMIC SESSION 2022-2023

ACKNOWLEDGEMENT

With a deep sense of gratitude, we take this opportunity to express our sincere thanks to our
project guide, Dr. U. B. Aher, for her continuous guidance, assistance and co-operation. We also
express our gratitude to the Head of Department, Prof. V. B. Khobragade, for his inspiration and
encouragement. We also express our gratitude to the Principal of Government Polytechnic, Sakoli, for
reviewing our project as well as giving guidance, inspiration and encouragement.

DEPARTMENT OF COMPUTER TECHNOLOGY

TABLE OF CONTENT

1. Introduction…………………………………………………………………………....01
1.1. Study of AI Assistants…………………………………………………………….02
1.2. Capabilities of AI assistant………………………………………………………..05
1.3. How does an AI virtual assistant work……………………………………….05
1.4. The operating principles of AI assistant…………………………………………. 06
2. Literature survey………………………………………………………………………07
3. Proposed system……………………………………………………………………….09
3.1. Proposed architecture……………………………………………………………..09
4. System architecture……………………………………………………………………10
4.1. Basic Workflow…………………………………………………………………...10
4.2. Detailed workflow………………………………………………………………...10
5. Flowchart of the app…………………………………………………………………...12
6. Voice processing……………………………………………………………………… 13
6.1. Introduction to speech recognition………………………………………………...13
6.2. Text to Speech……………………………………………………………………..18
7. Image Recognition……………………………………………………………………. 21
7.1. Introduction to image recognition…………………………………………………21
7.2. Image recognition algorithm………………………………………………………22
7.3. How does image recognition work………………………………………………...22
7.4. How AI is used for image recognition………….………………………………….23
8. Song recognition……………………………………….………………………………. 25
8.1. Introduction to song recognition…………………….……………………………..25
8.2. Song recognition…………………………………….……………………………..25
8.3. Methods……………………………………………….……………………………27
9. User Interface……………………………………………….…………………………. 29
9.1. Introduction to Interface………………………………….………………………...29
9.2. Types of user interface…………………………………….……………………….29
9.3. UI and UX………………………………………………….………………………30
9.4. History of UI……………………………………………….………………………30
9.5. Graphical UIs……………………………………………….……………………...31
9.6. Mobile UIs………………………………………………….………………………31
10. References………………………………..……………………....32

ABSTRACT

AI Voice Assistant is an application which uses voice commands to perform several tasks like
calling, sending messages, reading messages out loud or opening applications. In this project we are
going to get an overview of how this application works and how the operations are performed. Voice
assistants are software agents that can interpret human speech and respond via synthesized voices.
Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's Assistant are the most popular voice
assistants and are embedded in smartphones or dedicated home speakers. Users can ask their assistants
questions, control home automation devices and media playback via voice, and manage other basic
tasks such as email, to-do lists, and calendars with verbal commands. This report will explore the basic
workings and common features of today's voice assistants. In this report we will study and understand
how some famous AI voice assistant applications which are available in the market work.

This application uses the Android API's Speech to Text feature to recognize human voice
and its Text to Speech feature to communicate with the user. Several algorithms are used to convert
human voice to text data and text data to human voice.

Our application is capable of recognizing images, understanding which objects are in the picture
with the help of AI, and sorting them based on it. This feature can be very useful for managing the
user's photos intelligently and for finding and recognizing things which the user does not know of. This
feature can also be used to search for similar images on the internet. Image recognition uses several
algorithms, which we will see ahead in the report.

Our application can also hear the beats of a song and recognize the song to which those beats
belong. We will see which methods are used to achieve this feature.

The UI of the application is going to be very easy and simple to understand. We don't want our
users to get confused while using our application. There are several types of UI, which we will see ahead
in the report.

1. INTRODUCTION

Talking to artificial intelligence is no longer science fiction. A virtual assistant, also called an
AI assistant or digital assistant, is an application program that understands human voice commands
and completes tasks for the user. A virtual assistant is a technology based on artificial intelligence. The
software uses a device’s microphone to receive voice requests while the voice output takes place at the
speaker.

These virtual assistants can be found in all gadgets such as smartphones, tablets and smart
watches now. The increasing competition in this area has led to many improvements.

Recently the usage of virtual assistants to control our surroundings is becoming a common
practice. We make use of Google AI, Siri, Alexa, Cortana, and many other similar virtual assistants to
complete tasks for us with a simple voice or audio command. We could ask them to play music or
open a particular file or any other similar task, and they would perform such actions with ease.

A voice assistant is a digital assistant that uses voice recognition, language processing algorithms,
and speech synthesis to listen for specific voice commands and return relevant information or
perform specific functions as requested by the user. Based on commands, commonly known as
intents, spoken by the user, voice assistants return relevant information by listening for specific
keywords and filtering out the ambient noise. While voice assistants may be entirely software based
and able to integrate into all devices, some assistants are designed specifically
for a single device application, like the Amazon Alexa clock. Nowadays, voice assistants are
integrated into many of the devices we tend to use daily, like cell phones, computers, and smart
speakers.

The main motive of this project is to develop an application for physically challenged persons.
In this project we are going to develop a static voice assistant using the Kotlin language which will perform
operations like copying and pasting files from one location to another, sending a message from the
user's mobile, and also ordering food and mobile phones using voice commands.

The mass adoption of AI in users' everyday lives is additionally fuelling the shift towards
voice. Among the most popular voice assistants are Siri from Apple; Amazon Echo, which responds
to the name Alexa, from Amazon; Cortana from Microsoft; Google Assistant from Google; and the
recently appeared intelligent assistant named AIVA. This report presents a brief introduction
to the architecture and construction of voice assistants.

1.1 STUDY OF AI ASSISTANTS

Siri, Google Now, and Cortana are world-known names. Of course, there are many mobile
assistant apps on the shelves of app stores. However, we are going to focus on studying the three
technologies mentioned above because, according to the Mind Meld research, they are preferred by
the majority of users.

Let us understand what these apps are and how they work.

1.1.1 STUDYING SIRI

If we study Siri, we certainly notice that it was unavailable to most third-party
applications. With the iOS 10 release, the situation changed a lot. At WWDC 2016, it was announced
that Siri can be integrated with apps that work in the following areas:

• Audio and video calls


• Messaging and contacts
• Payments via Siri
• Photos search
• Workout
• Car booking

To enable the integration, Apple introduced a special SiriSDK that consists of two
frameworks. The first one covers the range of tasks to be supported in our app, and the second one
advises on a custom visual representation when one of the tasks is performed.

Each of the app types above defines a certain range of tasks which are called intents. The term
refers to the users' intentions and, as a result, to the particular scenarios of their behavior.

In SiriSDK, all the intents have corresponding custom classes with defined properties. The
properties accurately describe the task they belong to. For instance, if a user wants to start a workout,
the properties may include the type of exercises and the time length of a session. Having received the
voice request, the system completes the intent object with the defined characteristics and sends it to
the app extension. The latter processes the data and shows the correct result at the output.

We can find more information about how to work with intents and objects on Apple's official
website.

Below is a scheme of intent processing:

fig 1.1.1 How Siri processes intents

1.1.2 STUDYING GOOGLE NOW AND VOICE ACTIONS

Google has always been the first to show maximum loyalty to developers. Unlike
Apple, Google doesn't have strict requirements for design. The review period in the Play Market
is much shorter and not as fastidious as in the Apple App Store either.
Nevertheless, on the question of smart assistant integration, it appears quite conservative. For
now, Google Assistant works with selected apps only. The list includes such hot names as eBay,
Lyft, Airbnb, and others. They are allowed to make their own Now Cards via a special API.

The good news is that we still have a chance to create a Google Assistant command for our
own app. For that, we need to register the application with Google.

Remember, we must not confuse Google Now with the voice commands. Now is not just about
listening and responding; it is an intelligent creature that can learn, analyze, and conclude. Voice
Actions are the narrower concept: they work on the basis of speech recognition followed by
information search.

Google provides the developers with a step-by-step guide for integrating such functionality
into an app. The Voice Actions API teaches how to include a voice mechanism in both mobile and
wearable apps.

fig 1.1.2 What is Google's intelligent mechanism

1.1.3 STUDYING CORTANA

Microsoft encourages developers to use the Cortana voice assistant in their mobile and desktop
apps. We can provide the users with an opportunity to set up voice control without directly calling
Cortana. The Cortana Dev Center describes how to make a request to a specific application.
Basically, it offers three ways to integrate the app name into a voice command:

• Prefixal, when the app name stands in front of the speech command, e.g., ‘Fitness Time, choose a workout for me!'

• Infixal, when the app name is placed in the middle of the vocal command, e.g., 'Set a Fitness Time workout for me, please!'

• Suffixal, when the app name is put at the end of the command phrase, e.g., 'Adjust some workout in Fitness Time!'

We can activate either a background or a foreground app with speech commands through
Cortana. The first type is suitable for apps with simple commands that don't require additional
instructions, e.g., 'Show the current date and time!'.
The second is for apps that work with more complex commands, like 'Send the Hello
message to Ann'.
In the last case, besides setting the command, we specify its parameters:

• What message? - Hello message

• Who should it be sent to? - To Ann.

1.2. CAPABILITIES OF AI ASSISTANTS

Tasks performed by a personal assistant or secretary include reading text or email messages
aloud, looking up phone numbers, scheduling, placing phone calls and reminding the end user about
appointments. Popular virtual assistants currently include Amazon Alexa, Apple's Siri, Google
Assistant and Microsoft's Cortana -- the digital assistant built into Windows Phone 8.1 and Windows
10.

Virtual assistants typically perform simple jobs for end users, such as adding tasks to a calendar,
providing information that would normally be searched for in a web browser, or controlling and checking
the status of smart home devices, including lights, cameras and thermostats.

Users also task virtual assistants to make and receive phone calls, create text messages, get
directions, hear news and weather reports, find hotels or restaurants, check flight reservations, hear
music, or play games.

1.3. HOW DOES AN AI VIRTUAL ASSISTANT WORK

A virtual assistant is a technology based on artificial intelligence. The software uses a device’s
microphone to receive voice requests while the voice output takes place at the speaker. But the most
exciting thing happens between these two actions. It is a combination of several different technologies:
voice recognition, voice analysis and language processing.

When a user asks a personal assistant to perform a task, the natural language audio signal is
converted into digital data (Speech to Text) that can be analyzed by the software. Then this data is
compared with a database of the software using an innovative algorithm to find a suitable answer. This
database is located on distributed servers in cloud networks. For this reason, most personal assistants
cannot work without a reliable Internet connection.

With the increasing number of queries, the software’s database gets expanded and optimized,
which improves voice recognition and increases the response time of the system.

1.4. THE OPERATING PRINCIPLE OF AI ASSISTANT

The general operating principle of artificial intelligence assistants is the ability to make
personal decisions based on incoming data. The software has to include an advanced set of tools for
processing received data, in order to make proper individual choices.

Artificial neural networks were invented to help develop the discussed software. Such networks
imitate the human brain’s ability to remember, to help the assistant recognize and classify data and
customize predicting mechanisms based on thorough analysis.

The memory process is executed deductively, i.e., top-down: first, the app analyzes several
variants of outcome; then, it remembers the variants applied by a human (i.e., the system remembers
proper answers to the question “How are you?” such as “I’m fine”, “Not very well” etc., and ignores
answers like “Yes”, “No” and others) and “self-educates” to be able to generate situation-based
algorithms later. It is not necessary to manually enter information into the app to build our own
personal artificial intelligence assistant. API software was developed for that, and the application
programming interface aids the apps in the recognition of faces, speech, documents and other external
factors. There are a number of APIs on the market, the most popular of which are api.ai, Wit.ai, Melissa,
Clarifai, TensorFlow, Amazon AI, IBM Watson, etc., with less widespread options including Cogito,
DataSift, iSpeech, Microsoft Project Oxford, Mozscape and OpenCalais.

2. LITERATURE SURVEY

The field of voice-based assistants has observed major advancements and innovations. The
main reason behind such rapid growth in this field is its demand in devices like smartwatches or fitness
bands, speakers, Bluetooth earphones, mobile phones, laptop or desktop, television, etc. Most of the
smart devices that are brought to the market today have built-in voice assistants. The amount of
data generated nowadays is huge, and in order to make our assistant good enough to tackle these
enormous amounts of data and give better results, we should incorporate machine learning into our
assistants and train our devices according to their uses. Along with machine learning, other technologies
which are equally important are IoT, NLP and big data access management. The use of voice assistants
can ease a lot of tasks for us: we simply give a voice command to the system, and all tasks are
completed by the assistant, starting from converting our speech command to a text command, then
extracting the keywords from the command and executing queries based on those keywords. In the paper
“Speech recognition using flat models” by Patrick Nguyen et al., a novel direct modelling approach
for speech recognition is brought forward which eases the measurement of consistency in the
spoken sentences. They have termed this approach the Flat Direct Model (FDM). They did not follow
the conventional Markov model, and their model is not sequential. Using their approach, a key problem
of defining features has been solved. Moreover, the template-based features improved the sentence
error rate by 3% absolute over the baseline.

Again, in the paper “On the track of Artificial Intelligence: Learning with Intelligent Personal
Assistant” by Nil Goksel et al., the potential use of intelligent personal assistants (IPAs), which use
advanced computing technologies and Natural Language Processing (NLP), for learning is
examined. Basically, they have reviewed the working system of IPAs within the scope of AI.

The application of voice assistants has been taken to a higher level in the paper “Smart
Home Using Internet of Things” by Keerthana S et al., where they have discussed how the application
of smart assistants can lead to developing a smart home system using Wireless Fidelity (Wi-Fi) and
the Internet of Things. They have used the CC3200 MCU, which has in-built Wi-Fi modules and temperature
sensors. The temperature sensed by the temperature sensor is sent to the microcontroller unit
(MCU), which then posts it to a server, and using that data the status of electronic equipment like fans,
lights etc. is monitored and controlled.

The application of voice assistants has been beautifully discussed in the paper “An Intelligent
Voice Assistant Using Android Platform'' by Sutar Shekhar et al., where they have stressed the fact
that mobile users can perform their daily tasks using voice commands instead of typing or using
keys on their mobiles. They have also used a prediction technology that makes recommendations based
on user activity.

The incorporation of natural language processing (NLP) in voice assistants is really necessary
and will also lead to the creation of a trendsetting assistant. These factors have been the key focus
of the paper “An Intelligent Chatbot using Natural Language Processing” by Rishabh Shah et al.
They have discussed how NLP can help make assistants smart enough to understand commands in
any native language and thus does not prevent any part of society from enjoying its perks.

We also studied the systems developed by Google Text To Speech – Electric Hook Up (GTTS-
EHU) for the Query-by-Example Spoken Term Detection (QbE-STD) and Spoken Term Detection (STD)
tasks of the Albayzin 2018 Search on Speech Evaluation. For representing audio documents and
spoken queries, stacked bottleneck features (sBNF) are used as the frame-level acoustic representation.
Spoken queries are synthesized, the average of the sBNF representations is taken, and then the average
query is used for QbE-STD.

We have seen the integration of technologies like gTTS and AIML (Artificial Intelligence Mark-
up Language) in the paper “JARVIS: An interpretation of AIML with integration of gTTS and Python”
by Tanvee Gawand et al., where they have adopted the Python text-to-speech conversion library
pyttsx which, unlike alternative libraries, works offline.

The main focus of voice assistants should be to reduce the use of input devices and this fact
has been the key point of discussion in the paper

3. PROPOSED SYSTEM

This proposed concept is an effective way of implementing a personal voice assistant. The speech
recognition library has many in-built functions that let the assistant understand the command
given by the user, and the response is sent back to the user as voice, using Text to Speech functions. When
the assistant captures the voice command given by the user, the underlying algorithms convert the voice
into text.

3.1. Proposed Architecture:

The system design consists of

1. Taking the input as speech patterns through microphone.


2. Audio data recognition and conversion into text.
3. Comparing the input with predefined commands.
4. Giving the desired output.

The initial phase includes the data being taken in as speech patterns from the microphone. In
the second phase the collected data is processed and transformed into textual data using Natural
Language Processing (NLP). In the next step, the resulting string data is matched against
‘when conditions’ written in the app to finalize the required output process. In the last phase, the
produced output is presented either in the form of text or converted from text to speech using TTS
features.
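Below is a minimal, hypothetical Kotlin sketch of this ‘when conditions’ matching step; the keywords and responses are invented purely for illustration and are not the final app logic.

import java.time.LocalTime
import java.util.Locale

// Matches the recognized text against predefined command keywords (illustrative only)
// and returns the response string that would then be spoken back via Text to Speech.
fun handleCommand(recognizedText: String): String {
    val command = recognizedText.lowercase(Locale.getDefault())
    return when {
        "time" in command -> "The current time is ${LocalTime.now()}"
        "message" in command -> "Opening the messaging app"
        "call" in command -> "Placing the call"
        "song" in command -> "Listening for the song"
        else -> "Sorry, I did not understand that command"
    }
}

fun main() {
    println(handleCommand("What is the time?"))     // time query
    println(handleCommand("Send a message to Ann")) // messaging command
}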

The system shall be developed to offer the following features:

1) It keeps listening continuously while inactive and wakes up into action when called with a particular
predetermined functionality.
2) Browsing the web based on the individual's spoken parameters and then issuing the desired
output through audio, while at the same time printing the output on the screen.

4. SYSTEM ARCHITECTURE

4.1 Basic Workflow:

The figure below shows the workflow of the main method of the voice assistant. Speech
recognition is used to convert speech input to text. This text is then sent to the processor, which
determines the nature of the command and calls the appropriate script for execution. But that's not
the only complexity. No matter how many hours of input, another factor plays a big role in whether
the assistant notices you: background noise can easily throw the speech recognition engine off target,
because it may be unable to distinguish your voice from the bark of a dog or the sound of a helicopter
flying overhead.

Fig. 4.1 Block Diagram of Voice Assistant

4.2 Detailed Workflow:

Voice assistants such as Siri, Google Voice, and Bixby are already available on our phones.
According to a recent NPR study, around one in every six Americans already has a smart speaker in
their home, such as the Amazon Echo or Google Home, and sales are growing at the same rate as smart
phone sales a decade ago. At work, though, the voice revolution may still seem a long way off. The
move toward open workspaces is one deterrent: nobody wants to be that obnoxious idiot who can't
stop ranting at his virtual assistant.

Fig. 4.2 Detailed Workflow of Voice Assistant

5. FLOWCHART OF APP

Fig. 5.1 Program Outcome Flow Chart of AI Assistant app.

6. VOICE PROCESSING

6.1 Introduction to speech recognition:

Speech recognition technology is a type of artificial intelligence that involves understanding
what a person says. It usually does this by looking at the words being said and then comparing them
to a predefined list of acceptable phrases. Speech recognition software has an extensive list of words
and phrases programmed into it, including things like proper names, slang, numbers, letters from the
alphabet, and other common phrases. When a person speaks into a device that uses speech recognition
software, the software will analyze what is being said and then compare it to the list of acceptable
phrases. If it finds a match, it will respond accordingly. If there is no match, the software may still be
able to interpret what was said based on the context of the conversation.

6.1.1 How Does Speech Recognition Work

There are three primary components to speech recognition: the microphone, the software, and
the language database. The microphone is used to capture the sound of a person’s voice. The software
takes that sound and breaks it down into individual words. The language database stores all of the
information about the words and phrases that the software is looking for.

Once these three components are set up, they work together to decipher what a person has said
and convert it into text. If the microphone picks up enough of the sound and if all of the pre-
programmed rules have been met, then the words can be converted into text.

That processed text can then be used in a number of different ways, such as being displayed on
a screen or being used to control a device.

6.1.2 Various algorithms used in speech recognition

6.1.2.1 Natural Language Processing (NLP):

Natural language processing (NLP) is a field of computer science and linguistics that deals
with the interactions between computers and human languages.

It involves programming computers to understand human language and to produce results that
are understandable by humans.
This type of algorithm analyzes data and looks for the possible word choice. It then applies
linguistics concepts, such as grammar and sentence structure, to complete your request.

6.1.2.3 Hidden Markov Model (HMM):

Hidden Markov Model (HMM) is a statistical technique for analyzing sequences of data. This
type of model creates a chain of states, each with an associated probability so that the next state can
be predicted from the current state.
Each system has many states, and there are usually overlapping chains so that transitions are
not visible to outside observers.
This algorithm converts speech to text by assigning probabilities to every possible character
that might follow any sequence of characters, in order to predict what should come next.
First, it breaks up the spoken text into phonemes, the basic sounds that represent an individual letter
or symbol in written language, and then assigns probabilities to each one.
One example is the word “receive,” which is often mispronounced and not written consistently in text messages.
The term has several sounds that can be associated with the following characters: “C, C E I, E A U.”
The Hidden Markov Model (HMM) calculates the probability of each sound represented by these
letters to determine the appropriate word choice. It then applies probabilities to each character after
“receive.”
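As a toy illustration of this idea (not an actual HMM with hidden states; the probabilities below are invented purely for the example), the following Kotlin sketch picks the most probable next character from a small transition table:

// Toy character-level Markov chain: transition probabilities are made up for illustration.
val transitions: Map<Char, Map<Char, Double>> = mapOf(
    'r' to mapOf('e' to 0.9, 'a' to 0.1),
    'e' to mapOf('c' to 0.5, 'i' to 0.3, 'a' to 0.2),
    'c' to mapOf('e' to 0.7, 'i' to 0.3)
)

// Returns the most probable character to follow the current one, or null if unknown.
fun predictNext(current: Char): Char? =
    transitions[current]?.entries?.maxByOrNull { it.value }?.key

fun main() {
    println(predictNext('r')) // e
    println(predictNext('e')) // c
}

A real speech recognizer also models hidden acoustic states and searches over whole word sequences, but the chain-of-probabilities principle is the same.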

6.1.2.4 Neural Networks

Neural networks are sophisticated software algorithms that can “learn” to recognize patterns in
data. They are modeled after the brain and consist of many interconnected processing nodes, or
neurons, that can “train” themselves to recognize specific patterns.
When you speak into a microphone, your voice is converted into digital form by a process
called sampling. This involves measuring the amplitude (volume) and frequency (pitch) of the sound
waves at fixed intervals, usually every 20 milliseconds, and recording them as digital data.
The data is then sent to a neural network, which “reads” it and compares it to the templates
stored in its memory. If it finds a match, it will report that you said a specific word or phrase.

Some computing tasks require the computer to ask for repetition. This involves using voice
recognition software to select an alternative from among two possibilities, such as yes and no, and
requesting clarification when necessary.

For example: “Did you say ‘yes’?”
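As a concrete illustration of the sampling step described above, the sketch below (an illustration only, not the report's implementation) reads one 20 ms frame of microphone audio with Android's AudioRecord API and computes its RMS amplitude; it assumes the RECORD_AUDIO permission has already been granted.

import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import kotlin.math.sqrt

// Captures one 20 ms frame of 16 kHz mono audio and returns its RMS amplitude,
// the kind of sampled measurement a neural network would receive as input.
@SuppressLint("MissingPermission")
fun readOneFrameRms(): Double {
    val sampleRate = 16000
    val frameSize = sampleRate / 50 // 20 ms of samples
    val minBuffer = AudioRecord.getMinBufferSize(
        sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC, sampleRate,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT,
        maxOf(minBuffer, frameSize * 2)
    )
    val buffer = ShortArray(frameSize)
    recorder.startRecording()
    val read = recorder.read(buffer, 0, frameSize)
    recorder.stop()
    recorder.release()

    // Root-mean-square of the 16-bit samples approximates the frame's loudness.
    var sumSquares = 0.0
    for (i in 0 until read) sumSquares += buffer[i].toDouble() * buffer[i]
    return sqrt(sumSquares / maxOf(read, 1))
}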

6.1.3 APIs used in speech to text

6.1.3.1 android.speech.SpeechRecognizer:

This class provides access to the speech recognition service. This service allows access to the
speech recognizer. Do not instantiate this class directly, instead,
call SpeechRecognizer#createSpeechRecognizer(Context),
or SpeechRecognizer#createOnDeviceSpeechRecognizer(Context). This class's methods must be
invoked only from the main application thread.

The implementation of this API is likely to stream audio to remote servers to perform speech
recognition. As such this API is not intended to be used for continuous recognition, which would
consume a significant amount of battery and bandwidth.

Please note that the application must have the Manifest.permission.RECORD_AUDIO permission
to use this class.

Fig. 3.1 Speech to Text

6.1.3.2 android.speech.RecognizerIntent :

This activity converts the speech into text and sends the result back to the calling activity.

6.1.3.3 android.speech.RecognitionListener :

Used for receiving notifications from the SpeechRecognizer when the recognition related
events occur. All the callbacks are executed on the Application main thread.
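A minimal sketch combining the classes above is shown below; it assumes the RECORD_AUDIO permission has already been granted and that the code runs on the main application thread, as SpeechRecognizer requires. (The full RecognizerIntent-based sample follows in the next section.)

import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import android.util.Log

// Creates a SpeechRecognizer and logs the best transcription via RecognitionListener.
fun startListening(context: Context) {
    val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onResults(results: Bundle?) {
            // The recognizer returns a ranked list of candidate transcriptions.
            val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            Log.d("STT", "Heard: ${matches?.firstOrNull()}")
        }
        override fun onError(error: Int) { Log.e("STT", "Recognition error code: $error") }
        // The remaining callbacks are not needed for this sketch.
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onPartialResults(partialResults: Bundle?) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })

    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
    }
    recognizer.startListening(intent)
}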

6.1.3.4 Sample Code of Speech to Text

package com.gtappdevelopers.kotlingfgproject

import android.content.Intent
import android.os.Bundle
import android.speech.RecognizerIntent
import android.widget.ImageView
import android.widget.TextView
import android.widget.Toast
import androidx.appcompat.app.AppCompatActivity
import java.util.*

class MainActivity : AppCompatActivity() {

    // Views for displaying the recognized text and triggering recognition.
    lateinit var outputTV: TextView
    lateinit var micIV: ImageView

    // Request code used to identify the speech input result.
    private val REQUEST_CODE_SPEECH_INPUT = 1

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        // Initialize the views with their ids.
        outputTV = findViewById(R.id.idTVOutput)
        micIV = findViewById(R.id.idIVMic)

        // Start speech recognition when the mic icon is clicked.
        micIV.setOnClickListener {
            val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)

            // Use the free-form language model.
            intent.putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )

            // Use the device's default language.
            intent.putExtra(
                RecognizerIntent.EXTRA_LANGUAGE,
                Locale.getDefault()
            )

            // Prompt message shown in the recognition dialog.
            intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak to text")

            // Launch the recognizer activity; handle devices without speech support.
            try {
                startActivityForResult(intent, REQUEST_CODE_SPEECH_INPUT)
            } catch (e: Exception) {
                Toast
                    .makeText(
                        this@MainActivity, " " + e.message,
                        Toast.LENGTH_SHORT
                    )
                    .show()
            }
        }
    }

    // Receive the recognition result returned by the recognizer activity.
    override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
        super.onActivityResult(requestCode, resultCode, data)

        if (requestCode == REQUEST_CODE_SPEECH_INPUT) {
            if (resultCode == RESULT_OK && data != null) {

                // Extract the list of recognized phrases.
                val res: ArrayList<String> =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS) as ArrayList<String>

                // Show the best match in the output text view.
                outputTV.setText(
                    Objects.requireNonNull(res)[0]
                )
            }
        }
    }
}

6.1.3.5 Flowchart of Speech to Text:

Fig. 3.2 Flowchart of Speech to Text

6.2 Text to Speech

Speech synthesis (TTS) consists of the artificial production of human voices. The main use
(and what induced its creation) is the ability to translate a text into spoken speech automatically.
Speech recognition systems use phonemes (the smallest units of sound) in the first place to cut
out sentences. On the contrary, TTS is based on what are known as graphemes: the letters and groups
of letters that transcribe a phoneme. This means that the basic resource is not the sound, but the text.
This is usually done in two steps:

• The first one will cut the text into sentences and words (our famous graphemes) and assign phonetic transcriptions, the pronunciation, to all these groups;

• Once the different text/phonetic groups have been identified, the second step consists of converting these linguistic representations into sound. In other words, to read these indications to produce a voice that will read the information.

6.2.1 Sample code of Text To Speech

MainActivity.kt
package com.tutorialkart.texttospeechapp

import android.os.Bundle
import android.speech.tts.TextToSpeech
import android.support.v7.app.AppCompatActivity
import android.util.Log
import android.widget.Button
import android.widget.EditText
import kotlinx.android.synthetic.main.activity_main.*
import java.util.*

class MainActivity : AppCompatActivity(), TextToSpeech.OnInitListener {

    private var tts: TextToSpeech? = null
    private var buttonSpeak: Button? = null
    private var editText: EditText? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        buttonSpeak = this.button_speak
        editText = this.edittext_input

        // Keep the button disabled until the TTS engine has initialized.
        buttonSpeak!!.isEnabled = false

        tts = TextToSpeech(this, this)

        buttonSpeak!!.setOnClickListener { speakOut() }
    }

    override fun onInit(status: Int) {
        if (status == TextToSpeech.SUCCESS) {
            // Set US English as the language for TTS.
            val result = tts!!.setLanguage(Locale.US)

            if (result == TextToSpeech.LANG_MISSING_DATA || result == TextToSpeech.LANG_NOT_SUPPORTED) {
                Log.e("TTS", "The language specified is not supported!")
            } else {
                buttonSpeak!!.isEnabled = true
            }
        } else {
            Log.e("TTS", "Initialization failed!")
        }
    }

    private fun speakOut() {
        // Read out whatever the user typed in the EditText.
        val text = editText!!.text.toString()
        tts!!.speak(text, TextToSpeech.QUEUE_FLUSH, null, "")
    }

    public override fun onDestroy() {
        // Shut down the TTS engine to release its resources.
        if (tts != null) {
            tts!!.stop()
            tts!!.shutdown()
        }
        super.onDestroy()
    }
}
6.2.2 Flowchart of Text to Speech

Fig. 6.2.2 Flowchart of Text to Speech

7. IMAGE RECOGNITION

7.1 Introduction to Image Recognition:


Image recognition is the ability of software to identify objects, places, people, writing and actions in
pictures. Computers can utilize machine vision technologies, together with a camera and artificial
intelligence software, to achieve image recognition. It is used to perform a large number of
machine-based visual tasks, like labelling the content of images with meta-tags.

Now, what Firebase ML Kit offers us is already possible to implement yourself using various
machine-learning technologies.

The thing with Firebase ML Kit is that, as well as offering these abilities under a kind of
wrapper, it takes these technologies and offers their capabilities within a single SDK.

Fig. 7.1 Image Recognition: In the Context Of ML

In light of this, it can be hard to know where to begin. This is one of the principal
objectives of Firebase ML Kit: to make machine learning more accessible to developers and available
in more Android and iOS applications. A short image-labelling sketch is given after the list below. Right now, ML Kit offers the capacity to:

• Image Labelling to classify common elements in pictures.

• Text Recognition to process and recognize text from pictures.

• Face Detection to help you know if a face is smiling, tilted, or frowning in pictures.

• Barcode Scanning to read data encoded in standard barcode formats like QR Codes.

• Landmark Recognition to identify popular places in images.
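As a brief sketch of the Image Labelling capability, the snippet below uses the standalone ML Kit image-labelling API; the exact dependency and package names are an assumption here and should be checked against the official ML Kit documentation.

import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.label.ImageLabeling
import com.google.mlkit.vision.label.defaults.ImageLabelerOptions

// Labels the objects found in a bitmap and passes readable results to a callback.
fun labelImage(bitmap: Bitmap, onResult: (List<String>) -> Unit) {
    val image = InputImage.fromBitmap(bitmap, /* rotationDegrees = */ 0)
    val labeler = ImageLabeling.getClient(ImageLabelerOptions.DEFAULT_OPTIONS)

    labeler.process(image)
        .addOnSuccessListener { labels ->
            // Each label carries a text tag (e.g. "Dog") and a confidence score between 0 and 1.
            onResult(labels.map { label -> "${label.text} (score: ${label.confidence})" })
        }
        .addOnFailureListener { e ->
            onResult(listOf("Labelling failed: ${e.message}"))
        }
}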

7.2 Image recognition Algorithm

Artificial Intelligence has transformed the image recognition features of applications. Some
applications available on the market are intelligent and accurate to the extent that they can elucidate
the entire scene of the picture. Researchers are hopeful that with the use of AI they will be able to
design image recognition software that may have a better perception of images and videos than
humans.

Image recognition comes under the banner of computer vision which involves visual search,
semantic segmentation, and identification of objects from images. The bottom line of image
recognition is to come up with an algorithm that takes an image as an input and interprets it while
designating labels and classes to that image. Most of the image classification algorithms, such as
bag-of-words, support vector machines (SVM), face landmark estimation, K-nearest neighbours
(KNN), and logistic regression, are used for image recognition as well. Another algorithm, the Recurrent
Neural Network (RNN), performs complicated image recognition tasks, for instance writing
descriptions of the image.
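For illustration only, here is a toy k-nearest-neighbours sketch over pre-extracted feature vectors; the data and features are invented, and a real system would first extract such vectors from the images.

import kotlin.math.sqrt

// A labelled training sample: a feature vector plus its class label.
data class Sample(val features: DoubleArray, val label: String)

// Euclidean distance between two equal-length feature vectors.
fun distance(a: DoubleArray, b: DoubleArray): Double {
    var sum = 0.0
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sqrt(sum)
}

// Classifies the query vector by majority vote among its k nearest training samples.
fun knnClassify(query: DoubleArray, training: List<Sample>, k: Int = 3): String =
    training.sortedBy { distance(query, it.features) }
        .take(k)
        .groupingBy { it.label }
        .eachCount()
        .entries
        .maxByOrNull { it.value }!!
        .key

fun main() {
    val training = listOf(
        Sample(doubleArrayOf(0.9, 0.1), "cat"),
        Sample(doubleArrayOf(0.8, 0.2), "cat"),
        Sample(doubleArrayOf(0.1, 0.9), "dog"),
        Sample(doubleArrayOf(0.2, 0.8), "dog")
    )
    println(knnClassify(doubleArrayOf(0.85, 0.15), training)) // cat
}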

7.3 How Does Image Recognition Work

Image recognition algorithms make image recognition possible. In this section, we will see
how to build an AI image recognition algorithm. The process commences with accumulating and
organizing the raw data. Computers interpret every image either as a raster or as a vector image;
therefore, they are unable to spot the difference between different sets of images. Raster images are
bitmaps in which individual pixels that collectively form an image are arranged in the form of a grid.
On the other hand, vector images are a set of polygons that have explanations for different colours.
Organizing data means to categorize each image and extract its physical features. In this step,
a geometric encoding of the images is converted into the labels that physically describe the images.
The software then analyses these labels. Hence, properly gathering and organizing the data is critical
for training the model because, if the data quality is compromised at this stage, it will be incapable of
recognizing patterns at the later stage.

The next step is to create a predictive model. The final step is utilizing the model to decipher
the images. The algorithms for image recognition should be written with great care as a slight anomaly
can make the whole model futile. Therefore, these algorithms are often written by people who have
expertise in applied mathematics. The image recognition algorithms use deep learning datasets to
identify patterns in the images. These datasets are composed of hundreds of thousands of labelled
images. The algorithm goes through these datasets and learns what an image of a specific object looks
like.

7.4 How AI is used for Image Recognition

7.4.1. Facial Recognition

We as humans easily discern people based on their distinctive facial features. However, without
being trained to do so, computers interpret every image in the same way. A facial recognition system
utilizes AI to map the facial features of a person. It then compares the picture with the thousands and
millions of images in the deep learning database to find the match. This technology is widely used
today by the smartphone industry. Users of some smartphones have an option to unlock the device
using an inbuilt facial recognition sensor. Some social networking sites also use this technology to
recognize people in the group picture and automatically tag them. Besides this, AI image recognition
technology is used in digital marketing because it facilitates the marketers to spot the influencers who
can promote their brands better.

Though the technology offers many promising benefits, users have expressed their
reservations about the privacy of such systems, as they collect data without the user's permission.
Since the technology is still evolving, one cannot guarantee that the facial recognition feature
in mobile devices or social media platforms works with 100% accuracy.

7.4.2. Object Recognition

We can employ two deep learning techniques to perform object recognition. One is to train a
model from scratch and the other is to use an already trained deep learning model. Based on these
models, we can build many useful object recognition applications. Building object recognition
applications is an onerous challenge and requires a deep understanding of mathematical and machine
learning frameworks. Some of the modern applications of object recognition include counting people
from the picture of an event or products from the manufacturing department. It can also be used to spot
dangerous items from photographs such as knives, guns, or related items.

7.4.3. Text Detection

AI trains the image recognition system to identify text from the images. Today, in this highly
digitized era, we mostly use digital text because it can be shared and edited seamlessly. But it does not
mean that we do not have information recorded on the papers. We have historic papers and books in
physical form that need to be digitized. There is an entire field of research in Artificial Intelligence
and Computer Vision known as Optical Character Recognition that deals with the creation of
algorithms to extract the text from the images and convert them into machine-readable characters.

8. SONG RECOGNITION

8.1. Introduction to song recognition

The task of recognizing a song as a cover version of another is relatively easy for a human
being when the song is known. However, making a machine do this job is complex because of the
number of variables involved in the creation of a cover; these include variations in tempo,
instrumentation, genre, and duration with respect to the original version. A methodology is proposed
that aims to identify covers through the application and analysis of machine learning techniques, sparse
codification, signal processing and second-order statistics, in order to obtain the best configuration.
Acoustic features such as pitches and timbres, as well as beat information of the cover songs, were
obtained from the Million Song Dataset, a metadata database oriented to music information retrieval. Throughout
the experimentation it was possible to try different analysis configurations on the metadata and to
observe the effects on the comparisons between original and cover versions. According to the results,
a system that integrates frequency processing on the pitches with beat alignment, a sparse codification
and a clustering technique was obtained, with correct cover identification similar to state-of-the-art
results. It was also possible to get information about combinations of learning techniques with different
metrics that allows future experiments to improve the results.

Fig. 8.1 How song recognition works

8.2. Song Recognition

The intersection among music, machine learning and signal processing has led to addressing a wide
range of tasks, such as the automatic identification of melodies, chords and instruments, the identification and
characterization of long-term tempos and structures, or the recognition of musical genres and covers.

There are organizations, like the International Society for Music Information Retrieval
(ISMIR) or the Music Information Retrieval Evaluation Exchange (MIREX), that have
promoted the use of these fields for the access, organization and understanding of musical information,
focusing on the research and development of computational systems that aim to solve this series of
tasks. A musical version, or cover, is defined as a new interpretation, live or in studio, of a song
previously recorded by another artist.
This implies that a musical cover may have shifts in rhythm, tempo, instrumentation
ranges, genre or duration with respect to the original version. As an example, the song Summertime,
originally performed by Abbie Mitchell in 1935, has up to 1200 musical covers to date according to the
project Second Hand Songs. Some of these are in general similar to the original song and some others
are quite different; the Million Song Dataset reports versions in the musical genres of jazz, rock-pop,
rhythm & blues and even country.

A Cover Identification System (CIS) is an automatic system that ideally determines whether a song
is a cover version of some musical piece located in a database. This problem has been addressed by
applying several methods based on two stages: the first stage consists of the extraction and analysis of
the most important characteristics of the song, such as its melodic representation, harmonic progression
or pitch; the second stage aims at measuring the degree of similarity between the features extracted
from each piece of music.

Some previous work on the subject has aimed to solve these steps by proposing different methods.
Lee showed an extraction method based on Hidden Markov Models applied to the sequence of chords
of each song, followed by a similarity degree measurement between chord sequences using dynamic
time warping. The problem with this technique is that it needs a huge amount of time and computational
resources.
Jensen et al. calculated a chroma-gram, a matrix built from the chroma vector sequences that is
sensitive neither to instrumentation nor to time changes, and obtained the minimum distance between the
matrices using the Frobenius norm. Ravuri and Ellis proposed to obtain the chroma-gram and calculate
three characteristics per song to classify by means of support vector machines (SVM) or a multilayer
perceptron (MLP), while Chuan proposed to calculate a chroma-gram that keeps the partial harmonics
of the melody and maintains volume invariance, to make a framework that measures similarity by
means of a binary classifier.

A method inspired by the creation of digital fingerprints, used to minimize the execution times
in the search for covers, is proposed by Bertin-Mahieux and Ellis. This research makes use for the first
time of the MSD (Million Song Dataset) database, which consists of characteristics and metadata for a
million songs under the Creative Commons (CC) license. An introduction to MSD, as well as its creation
process and possible uses, is presented by the same authors plus Lamere. The search process in a
database may be accelerated by dividing a song into small fragments that may be used as hashes;
Grosche and Müller applied this technique, but with larger segments for each song in order
to minimize the number of searches.
Bertin-Mahieux and Ellis used the 2D Fourier Transform to procure a representation of the
chroma patches and obtain an efficient nearest-neighbour algorithm; this scheme makes the nearest
neighbours more likely to be related to the same song. A couple of modifications of
this method are presented to improve the classification; the authors used data dispersion to enhance
separability, followed by a dimensionality reduction. Two methods that help make queries on a
large database faster are proposed for cover identification: the Basic Local Alignment Search Tool
(BLAST), a bio-sequence indexation technique that Martin et al. used to increase
efficiency, and a database pruning method that Osmalskyj et al. used to reduce the set where the search
is made.

The Locality-Sensitive Hashing (LSH) method was used by Khadkevich and Omologo to
obtain songs with similar chords and then to apply a progression method to refine the search ranking. On the
other hand, Salamon et al. focus on extracting tonal representations (melody, bass line and harmonic
progression) by using state-of-the-art algorithms, and a dynamic programming algorithm to measure
the degree of similarity.
From these works it is concluded that harmonic representations are more reliable for cover
identification, although tonal representations further improve the recognition accuracy. Van Balen
et al. use three descriptors for content-based music retrieval: the pitch bihistogram, chroma correlation
coefficients and harmonization features. Serrà explains the steps involved in a CIS: feature extraction,
key invariance, tempo invariance, structure invariance and similarity calculation, while the authors
group these steps into a two-phase system: calculation of harmonic features and pitch for each song,
and comparison of the similarity.

This report presents the results obtained from a CIS based on machine learning techniques. It is
based on state-of-the-art feature processing plus the introduction of a sparse codification in order to
obtain a reduced-size feature vector. The components to separate the main melody from the
accompaniment were selected, the machine learning architecture for cover recognition was determined
and the first test adjustments were carried out. Finally, a final test and the performance assessment were
developed.
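To make the similarity-measurement step concrete, the sketch below (an illustration only, not the cited authors' code) computes the Frobenius distance between two equal-sized, beat-aligned chroma matrices, the kind of matrix comparison mentioned above for chroma-grams.

import kotlin.math.sqrt

// Frobenius distance between two matrices of identical dimensions,
// e.g. beat-aligned chroma-grams of shape [beats][12].
fun frobeniusDistance(a: Array<DoubleArray>, b: Array<DoubleArray>): Double {
    require(a.size == b.size) { "Matrices must have the same number of rows" }
    var sum = 0.0
    for (i in a.indices) {
        for (j in a[i].indices) {
            val d = a[i][j] - b[i][j]
            sum += d * d
        }
    }
    return sqrt(sum)
}

The smaller the distance, the more similar the two songs' harmonic content; a CIS would rank candidate originals by this kind of score.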

8.3. Methods

The CIS presented here was based on the work of Bertin-Mahieux and Ellis. Although some
other research is also aimed at classifying musical covers, these authors used the MSD database by applying
a very precise methodology that allows obtaining a framework comparable with our results. In
general, the tasks are to get the chroma features from MSD, to align the chroma features with
the beat, to apply the power law on the resulting matrix, to generate patches and calculate the 2D
Fourier transform, to calculate the median and finally to apply PCA. Figure 1 shows the methodology
used here on each of the songs. In general, the characteristics of the songs were obtained from an
adaptation of the chroma and timbre data taken directly from the database. A spectral analysis was
applied to the resulting matrices in order to get a 2-D Fourier Transform (2D-FFT), a Wavelet
Transform (WT) and patterns generated from the algorithm named Sparse Coding.

9. USER INTERFACE (UI)

9.1 Introduction to Interface:

The user interface (UI) is the point of human-computer interaction and communication in a
device. This can include display screens, keyboards, a mouse and the appearance of a desktop. It is
also the way through which a user interacts with an application or a website.

The growing dependence of many businesses on web applications and mobile applications has
led many companies to place increased priority on UI in an effort to improve the user's overall
experience.

9.2 Types of user interfaces

The various types of user interfaces include:

• Graphical User Interface (GUI)

• Command Line Interface (CLI)

• Menu-Driven User Interface

• Touch User Interface

• Voice User Interface (VUI)

• Form-Based User Interface

• Natural Language User Interface

Examples of user interfaces

Some examples of user interfaces include:

• Computer Mouse

• Remote Control

• Virtual Reality

• ATMs

• Speedometer

• The Old iPod Click Wheel

Websites such as Airbnb, Dropbox and Virgin America display strong user interface design.
Sites like these have created pleasant, easily operable, user-centered designs (UCD) that focus on the
user and their needs.

9.3 UI and UX

The UI is often talked about in conjunction with user experience (UX), which may include the
aesthetic appearance of the device, response time and the content that is presented to the user within
the context of the user interface. Both terms fall under the concept of human-computer interaction
(HCI), which is the field of study focusing on the creation of computer technology and the interaction
between humans and all forms of IT design. Specifically, HCI studies areas such as UCD, UI design
and UX design.

An increasing focus on creating an optimized user experience has led some to carve out careers
as UI and UX experts. Certain languages, such as HTML and CSS, have been geared toward making
it easier to create a strong user interface and experience.

9.4 History of UI

In early computers, there was very little user interface except for a few buttons at an operator's
console. Many of these early computers used punched cards, prepared using keypunch machines, as
the primary method of input for computer programs and data. While punched cards have been
essentially obsolete in computing since 2012, some voting machines still use a punched card system.

The user interface evolved with the introduction of the command line interface, which first
appeared as a nearly blank display screen with a line for user input. Users relied on a keyboard and a
set of commands to navigate exchanges of information with the computer. This command line interface

led to one in which menus (lists of choices written in text) predominated. Finally, the GUI arrived,
originating mainly in Xerox's Palo Alto Research Center (PARC), adopted and enhanced by Apple and
effectively standardized by Microsoft in its Windows operating systems.

9.5 Graphical UIs

Elements of a GUI include such things as windows, pull-down menus, buttons, scroll bars
and icons. With the increasing use of multimedia as part of the GUI, sound, voice, motion video and
virtual reality are increasingly becoming the GUI for many applications.

9.6 Mobile UIs

The emerging popularity of mobile applications has also affected UI, leading to something
called mobile UI. Mobile UI is specifically concerned with creating usable, interactive interfaces on
the smaller screens of smartphones and tablets and improving special features, like touch controls.

10. REFERENCES

1. https://www.ijert.org/voice-assistant-using-artificial-intelligence
2. https://www.techreviewadvisor.com/what-is-speech-recognition/
3. https://developer.android.com
4. https://vivoka.com/how-to-speech-synrhesis-tts/
5. https://www.techtarget.com/searchapparchitecture/definition/user-interface-UI

