
Building a ChatGPT-4 Voice Assistant With Vivid Unit

By Gavin85 in Circuits > Computers
Published Feb 26th, 2024

Introduction: Building a ChatGPT-4 Voice Assistant With Vivid Unit


Vivid Unit comes with a screen, speaker and microphone, which makes it an ideal piece of hardware for implementing a voice assistant. Lucy is a voice assistant powered by Google Speech Recognition, Google Text-to-Speech and ChatGPT-4, running on a Vivid Unit.
Supplies

1 x Vivid Unit
1 x DC 5V USB type-C adapter (or 1 x Ethernet cable if you have a PoE port available)

Vivid Unit has everything needed on the hardware side; we just need to power it and connect it to the Internet.

Vivid Unit is not that power hungry: a 5V/2A power adapter should be good enough. If you still worry, give it 2.5A.

You also need an Internet connection, either wired or wireless. You need Internet access to install the software, and Lucy also needs an Internet connection while in use.
Step 1: The Idea

ChatGPT is an advanced artificial intelligence developed by OpenAI. It is designed to engage in natural language conversations. ChatGPT can assist users with tasks, answer questions, brainstorm ideas, and even generate text in different styles, making it a versatile tool for communication, learning, and problem-solving.

Text is the bridge between ChatGPT and humans. If we use speech recognition to convert what we say into text, we can talk to ChatGPT. If we use Text-to-Speech to read out the text generated by ChatGPT, we can hear ChatGPT too.

The idea is straightforward and really nothing unique, but I like it and did it for fun. I will use free APIs/services only, so the only investment is time; I learned a lot from it and enjoyed the process.

Step 2: Preparation
The program will be written in Python, and I also need to find packages that provide those functionalities.

Speech to Text (STT)

In order to convert our speech to text, we need a speech recognition package. I stumbled across the SpeechRecognition project and I am very impressed: it offers APIs to access different Speech-to-Text (STT) tools, and some of them can even work offline. I decided to use it because it makes it very easy to switch between different STT tools, which maximizes the fun.
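
For instance, switching engines is mostly a matter of calling a different recognize_*() method on the same audio. A minimal sketch (recognize_sphinx assumes the optional pocketsphinx package is installed):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as mic:
    audio = r.listen(mic)

# Same audio, different STT engines:
txt = r.recognize_google(audio)    # Google Web Speech API (online)
# txt = r.recognize_sphinx(audio)  # CMU Sphinx (offline, needs pocketsphinx)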

ChatGPT (or alike)

There are many projects that provide APIs to access ChatGPT. My favorite is the gpt4free project. It provides APIs to access different AI engines from various providers. Again, I chose it because that maximizes the fun.

Text to Speech (TTS)

After ChatGPT responds with text, we need to convert it to speech (usually in MP3 format). I was hoping to find a project that allows me to easily switch between different TTS engines, but I could not find one. I tried pyttsx3 but felt its voice quality (using espeak) was terrible. I eventually chose gTTS, which offers much better voice quality. The downside, however, is that it needs a network connection during use. Considering we need a network connection for the ChatGPT service anyway, this is acceptable.

Playback

We still need to play back the MP3 generated by the TTS engine. The simplest way is to save the MP3 as a file and use the os.system() function to call any player that can play an MP3 file. However, I feel it is less elegant to generate an MP3 file. I finally used the mixer in pygame, which can play back MP3 data without actually generating a file.

Packages Installation

Vivid Unit comes with Python 3 installed, but it doesn't have PIP (the Python package manager) yet. It will be convenient to have PIP to help install packages, so we install PIP first:

sudo apt install pip

We install the "SpeechRecognition" package, and we will use it to convert our speech to text:

pip install SpeechRecognition

Install the "gpt4free" package, which provides access to ChatGPT4:

pip install -U g4f

Install the "pygame" package, which is used for playback without actually generating the MP3 file.

pip install pygame

Install the "gTTS" package, so it can actually read the text aloud.

pip install gTTS

We also need to install some packages related to sound playback:

sudo apt install flac

pip install sounddevice

sudo apt-get install portaudio19-dev python3-pyaudio

Now we have installed all required packages.

Step 3: A Very Simple Prototype


Here is a very simple and straightforward Python program (simple.py):

from io import BytesIO

import re
import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS
import sounddevice

pygame.init()
r = sr.Recognizer()

def speak(txt):
    mp3_file_object = BytesIO()
    speech = gTTS(text=txt, slow=False, lang='en', tld='us')
    speech.write_to_fp(mp3_file_object)
    mp3_file_object.seek(0)
    pygame.mixer.init()
    pygame.mixer.music.load(mp3_file_object, 'mp3')
    pygame.mixer.music.play()

if __name__ == '__main__':
    while True:
        try:
            with sr.Microphone() as mic:
                print("Say something please...")

                voice = r.listen(mic)
                txt = r.recognize_google(voice)
                print('\n\nQ: ' + txt + '\n')

                resp = g4f.ChatCompletion.create(
                    model=g4f.models.gpt_4,
                    provider=g4f.Provider.Bing,
                    messages=[{"role": "user", "content": txt}],
                    stream=False,
                )
                resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)

                print('A: ' + resp + '\n')

                speak(resp)

        except sr.UnknownValueError:
            print("Something went wrong.\n")

The speak() function accepts a text parameter and calls gTTS to generate the corresponding MP3, then uses pygame.mixer to play the MP3 without saving it to a file.

This program demonstrates how to convert human speech (from the microphone) to text, and then forward that text to ChatGPT-4. As a very simple example, it works and you can chat with it already. Try asking it some simple questions like "what day is today?" or "which country has the tallest people?", and you will find it actually answers your questions.

However, there are still some issues to address:

Lack of context
This is the biggest problem. The AI starts a new conversation whenever you ask a new question, which is frustrating because it doesn't remember anything you have previously said. This prevents you from chatting deeply about a topic with the AI.

Long waiting time when ChatGPT output is big

This is also very obvious. ChatGPT likes to talk, a lot. It sometimes generates thousands of words to answer your short and simple question. You may have to wait a long time while all that output is processed.

Lack of GUI

It will be much nicer if the voice assitant has its own GUI, instead of printing the output on the console.

In the following steps I will address these issues one by one.


Step 4: Keep the Context
In order to let ChatGPT remember what we previously discussed, we need to save the chat history and send it to ChatGPT every time we ask a new question. The chat history plays the "assistant" role in this case.

What to do?

If you look at the function call that gets the response from ChatGPT:

resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=[{"role": "user", "content": txt}],
    stream=False,
)

You can see the "messages" parameter is actually an array, and each element is an object. In the simple example above, we always provide a new array containing only one element (the new question) as the messages parameter.

If we always use the same array as the messages parameter, and append the answer from ChatGPT to that same array, then the AI will know what we have discussed before. Of course, the newly asked question should also be appended to the same array.
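
To illustrate, here is roughly what the accumulated array looks like after a couple of turns (the contents are made-up examples):

# the same chat_data array grows as the conversation goes on
chat_data = [
    {'role': 'user', 'content': 'Which country has the tallest people?'},
    {'role': 'assistant', 'content': 'The Netherlands, on average.'},
    {'role': 'user', 'content': 'How tall are they on average?'},
]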

You can imagine this puts some pressure on the device, on the network, and also on ChatGPT, because you are sending more and more data as the chat goes on. Although the data is just plain text and not that big, we should still control how much context we keep -- and it is practical too: the AI is unlikely to need information you mentioned 58 questions ago. So we can define a constant, say MAX_CONTEXT, and set its value to 32. Every time we append something to the array, we check the array size; if it is bigger than MAX_CONTEXT, we delete its first two elements (the question and the answer).

Below is the code snippet:

chat_data.append({'role': 'user', 'content': txt})

resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=chat_data,
    stream=False,
)
resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)

print('A: ' + resp + '\n')

speak(resp)

chat_data.append({'role': 'assistant', 'content': resp})

if len(chat_data) > MAX_CONTEXT:
    del chat_data[0]
    del chat_data[0]
Step 5: Speak During Processing
Don't you think it would be a good idea to play part of the MP3 data before the whole response from ChatGPT is fully processed?

The good news is that ChatGPT supports streamed output: instead of outputting everything from the buffer at once, it outputs text piece by piece. You eventually get the same output text, but this way you have a chance to access the early text while the rest is still being generated.
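
As a rough sketch of what that looks like with g4f (the same call as before, just with stream=True, iterating over the generator it returns):

resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,
    messages=chat_data,
    stream=True,                      # yield text piece by piece
)
for piece in resp:
    print(piece, end='', flush=True)  # each piece arrives as soon as it is generated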

The bad news, however, is that I could not find a TTS engine that can generate streamed audio from streamed text.

The solution

I start two threads: one as a text generator and the other as a text consumer.

The text generator thread runs a loop that keeps getting output text from ChatGPT and putting it into a queue.

The text consumer thread runs a loop that keeps taking text from the queue and assembling it into a sentence. Once a sentence is complete, it calls the speak() function to read it out.

This way each sentence is read out as soon as it is ready, with no need to wait for the remaining sentences to be processed.
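
The pattern looks roughly like this (a minimal sketch; stream_pieces() and speak() are stand-ins for the real streaming call and the TTS playback):

import queue
import threading

q = queue.Queue()

def generator():
    # producer: put streamed text pieces into the queue
    for piece in stream_pieces():       # hypothetical streaming source
        q.put(piece)
    q.put(None)                         # sentinel: no more text

def consumer():
    # consumer: assemble pieces into sentences, speak each complete one
    sentence = ''
    while True:
        piece = q.get()
        if piece is None:
            break
        sentence += piece
        if sentence.rstrip().endswith(('.', '!', '?')):
            speak(sentence)             # blocking TTS playback
            sentence = ''
    if sentence.strip():
        speak(sentence)                 # flush any trailing text

threading.Thread(target=generator, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()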

The speak() function

The speak() function becomes a member function of the QueueProcessingThread class (the text consumer thread). Because it calls mixer.music.play(), which does not block during playback, I have to add a while loop to make it blocking; otherwise it would try to play the next sentence before the current playback is done.

def speak(self, txt):
    if txt and txt.strip():
        mp3_file_object = BytesIO()
        speech = gTTS(text=txt, slow=False, lang='en', tld='us')
        speech.write_to_fp(mp3_file_object)
        mp3_file_object.seek(0)
        pygame.mixer.music.load(mp3_file_object, 'mp3')
        pygame.mixer.music.play()
        while pygame.mixer.music.get_busy(): pass  # wait until playback is done
Step 6: Make the GUI

I create a fullscreen window as the GUI for this voice assistant. The conversation is displayed on the screen as the chat goes on. I also define three states for the program: inactive, active and listening.

The three states

When the program has just launched, it is in the "inactive" state: the screen is black and it will not react to what you say.

If you say something that contains "Lucy", that triggers it: its state becomes "active" and immediately goes to "listening": the screen is green and it listens to your question.

After you ask a question, its state goes back to "active" while ChatGPT outputs the answer: the screen is purple and your speech is ignored. After all the output has been read out, the state goes to "listening" again.

If you haven't said anything for a while, the state goes back to "inactive".

The GUI uses CSS to change the colors of the widgets.
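
As a rough sketch of the mechanism (the complete program in Step 8 does exactly this in set_state(); window here stands for the Gtk window instance): a CSS class is swapped on the widget's style context, and the loaded CSS maps that class to a background color.

# switch the window from "inactive" to "listening"
context = window.get_style_context()
context.remove_class('inactive')
context.add_class('listening')   # the CSS rule window.listening sets the color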


Step 7: GPT-4 Providers
By using the gpt4free API, we can easily choose different providers for ChatGPT. We would have many more choices if we accepted using ChatGPT-3.5, but I still prefer ChatGPT-4 because it is indeed available, and it is better.

There are several providers for the ChatGPT-4 service, and the gpt4free project gives a very detailed list of them. In that list, two of them (openai and raycast) need authentication, which makes them harder to use and (most probably) not free. Also, GeekGpt is no longer available, so there are currently only three remaining:

Bing (bing.com)
Liaobots (liaobots.site)
You (you.com)

When I did my testing, I could not make Liaobots work. I am not sure if it was a temporary issue.

Bing and You both work quite well. I personally like You better, because of the way it speaks: it tends to speak less and more simply. Bing, on the other hand, likes to talk more, sometimes a little too much.

Switching provider

Switching providers is very easy: you just replace the "provider" parameter when calling the g4f.ChatCompletion.create() function. If you want to use Bing, set provider to "g4f.Provider.Bing"; to use You, set provider to "g4f.Provider.You".

resp = g4f.ChatCompletion.create(
    model=g4f.models.gpt_4,
    provider=g4f.Provider.Bing,  # set provider here
    messages=chat_data,
    stream=True,
)

Step 8: The Complete Program


Below you can find the complete Python program.

import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk, Gdk, Pango, GLib
import threading
import time
import re
import queue
from io import BytesIO
import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS
import sounddevice

NAME = 'Lucy'
RESP = 'Yes?'
BYE = 'Talk to you later.'
MAX_CONTEXT = 32
MAX_INACTIVE = 60

pygame.init()
q = queue.Queue()
r = sr.Recognizer()

get_sentence = False
sentence = ''
output_done = False
speech_done = True
chat_data = []
active_ts = 0

class ChatView(Gtk.TextView):
    def __init__(self):
        Gtk.TextView.__init__(self)
        self.set_wrap_mode(Gtk.WrapMode.WORD)
        self.set_editable(False)
        self.set_cursor_visible(False)
        text_buffer = self.get_buffer()
        text_iter_end = text_buffer.get_end_iter()
        self.text_mark_end = text_buffer.create_mark("", text_iter_end, False)

    def append_text(self, text):
        text_buffer = self.get_buffer()
        text_buffer.insert_markup(text_buffer.get_end_iter(), text, -1)
        self.scroll_to_mark(self.text_mark_end, 0, False, 0, 0)

    def clear_text(self):
        text_buffer = self.get_buffer()
        text_iter_start = text_buffer.get_start_iter()
        text_iter_end = text_buffer.get_end_iter()
        text_buffer.delete(text_iter_start, text_iter_end)

class LucyWindow(Gtk.Window):
    active = False
    listening = False
    chat_view = ChatView()

    def __init__(self):
        Gtk.Window.__init__(self)
        self.set_title('Lucy')
        self.fullscreen()

        self.set_default_size(640, 360)
        self.grid = Gtk.Grid()
        self.scrolled_win = Gtk.ScrolledWindow()

        self.scrolled_win.set_hexpand(True)
        self.scrolled_win.set_vexpand(True)
        self.scrolled_win.add(self.chat_view)
        self.scrolled_win.set_policy(Gtk.PolicyType.NEVER, Gtk.PolicyType.AUTOMATIC)
        text_box = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=20)
        text_box.set_margin_top(20)
        text_box.set_margin_bottom(20)
        text_box.set_margin_start(20)
        text_box.set_margin_end(20)
        text_box.add(self.scrolled_win)

        self.grid.add(text_box)
        self.add(self.grid)
        self.connect('destroy', Gtk.main_quit)
        self.show_all()

    def set_state(self, active, listening):
        self.active = active
        self.listening = listening

        window_context = self.get_style_context()
        window_context.remove_class('inactive')
        window_context.remove_class('active')
        window_context.remove_class('listening')

        view_context = self.chat_view.get_style_context()
        view_context.remove_class('inactive')
        view_context.remove_class('active')
        view_context.remove_class('listening')

        if active:
            if listening:
                window_context.add_class('listening')
                view_context.add_class('listening')
            else:
                window_context.add_class('active')
                view_context.add_class('active')
        else:
            window_context.add_class('inactive')
            view_context.add_class('inactive')

class QueueProcessingThread(threading.Thread):
    window = None

    def __init__(self, win):
        threading.Thread.__init__(self)
        self.window = win
        self.daemon = True

    def speak(self, txt):
        global active_ts
        if txt and txt.strip():
            active_ts = time.time()
            mp3_file_object = BytesIO()
            speech = gTTS(text=txt, slow=False, lang='en', tld='us')
            speech.write_to_fp(mp3_file_object)
            mp3_file_object.seek(0)
            pygame.mixer.music.load(mp3_file_object, 'mp3')
            pygame.mixer.music.play()
            while pygame.mixer.music.get_busy(): pass  # wait until playback is done
            active_ts = time.time()

    def run(self):
        global get_sentence
        global sentence
        global output_done
        global speech_done
        while True:
            if get_sentence:
                item = q.get()
                sentence += item
                q.task_done()
                if item.endswith(".") or item.endswith("!") or item.endswith("?") or (output_done and q.empty()):
                    self.speak(sentence)
                    sentence = ''
                    get_sentence = False
            else:
                if q.empty():
                    if output_done:
                        if not speech_done:
                            speech_done = True
                            if self.window.active:
                                self.window.set_state(True, True)
                else:
                    if output_done:
                        get_sentence = True

class VoiceRecognizingThread(threading.Thread):
    window = None

    def __init__(self, win):
        threading.Thread.__init__(self)
        self.window = win
        self.daemon = True

    def run(self):
        global get_sentence
        global output_done
        global speech_done
        global chat_data
        global active_ts

        while True:
            Gtk.main_iteration_do(False)
            try:
                with sr.Microphone(sample_rate=44100) as mic:
                    if not self.window.active and not self.window.listening:
                        self.window.set_state(False, False)

                    if not speech_done:
                        continue

                    ts = time.time()
                    if self.window.active and active_ts and (ts - active_ts) > MAX_INACTIVE:
                        active_ts = 0
                        self.window.set_state(False, False)
                        speech_done = False
                        q.put(BYE)
                        get_sentence = True
                        output_done = True
                        GLib.idle_add(self.window.chat_view.clear_text)

                    voice = r.listen(mic)
                    txt = r.recognize_google(voice)
                    active_ts = ts

                    if not self.window.active:
                        if NAME in txt:
                            self.window.set_state(True, False)
                            speech_done = False
                            q.put(RESP)
                            get_sentence = True
                            output_done = True
                    else:
                        active_ts = ts
                        output_done = False
                        speech_done = False
                        GLib.idle_add(self.window.chat_view.append_text, '\n\nQ: ' + txt + '\nA: ')
                        self.window.set_state(True, False)

                        chat_data.append({'role': 'user', 'content': txt})

                        resp = g4f.ChatCompletion.create(
                            model=g4f.models.gpt_4,
                            provider=g4f.Provider.You,
                            messages=chat_data,
                            stream=True,
                        )

                        answer = ''
                        for message in resp:
                            msg = re.sub('[^A-Za-z0-9 ,.:_\'\"\+\-\*\/=]+', '', message.replace('**', ''))
                            GLib.idle_add(self.window.chat_view.append_text, msg)
                            answer += msg
                            q.put(msg)
                            if msg.endswith("."):
                                get_sentence = True

                        output_done = True
                        chat_data.append({'role': 'assistant', 'content': answer})

                        if len(chat_data) > MAX_CONTEXT:
                            del chat_data[0]
                            del chat_data[0]

                        active_ts = time.time()

            except sr.UnknownValueError:
                output_done = True

if __name__ == '__main__':

    # load CSS
    screen = Gdk.Screen.get_default()
    provider = Gtk.CssProvider()
    style_context = Gtk.StyleContext()
    style_context.add_provider_for_screen(
        screen, provider, Gtk.STYLE_PROVIDER_PRIORITY_APPLICATION
    )
    css = b"""
    textview {
        font: 25px Arial;
        background: transparent;
    }
    textview text {
        color: white;
        background: transparent;
    }
    textview.inactive text {
        color: black;
    }
    window.inactive {
        background: black;
    }
    window.active {
        background: #7700df;
    }
    window.listening {
        background: #008c8c;
    }
    """
    provider.load_from_data(css)

    # Lucy window
    win = LucyWindow()

    # voice recognizing thread
    thread1 = VoiceRecognizingThread(win)
    thread1.start()

    # queue processing thread
    thread2 = QueueProcessingThread(win)
    thread2.start()

    Gtk.main()

In the attachment you can also find the lucy.py source file.

Step 9: The Result
Below is a video that shows how Lucy works. As you can see, it does remember the context during the conversation.

[Video: Lucy, the voice assistant, in action]

Sometimes sentences are incorrectly joined without a period or comma, and the gTTS engine just reads them out that way. I think this can be improved by tuning the text consumer thread (QueueProcessingThread).

Lucy's performance can be significantly affected by network conditions. Lucy uses several APIs that require an Internet connection. If the network is slow, or a service server responds late, Lucy may answer much later than you expect.

Offline version?

I can't help wondering: can this voice assistant work offline?

The SpeechRecognition library does provide some APIs that can work offline (e.g. the Vosk API). I tried them and can confirm they do work locally on Vivid Unit. However, the recognition accuracy is not as good as Google Speech Recognition.
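
For reference, a minimal sketch of offline recognition, assuming your SpeechRecognition version includes recognize_vosk, the vosk package is installed, and a Vosk model is unpacked as "model" in the working directory. Note that recognize_vosk returns a JSON string rather than plain text:

import json
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as mic:
    print("Say something please...")
    voice = r.listen(mic)

result = r.recognize_vosk(voice)          # offline; returns a JSON string
txt = json.loads(result).get('text', '')  # extract the plain text
print('Q: ' + txt)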

The text-to-speech engine can also be switched to an offline one: pyttsx3. But the voice quality is really bad and you will not like it.

As for the ChatGPT-4 service, it definitely needs an Internet connection. It may be possible to run a simplified LLM locally on Vivid Unit, but that would be very slow and not practical.

That said, if we really made Lucy offline, she would unfortunately be quite unusable.

Can Lucy do more?

Definitely! Vivid Unit comes with GPIOs and ADC channels, so it is possible to let Lucy control external circuits, read data from sensors, and so on. It could actually become the central unit of a home automation system.
