
Audiogram
Speech recognition and synthesis platform

AUDIOGRAM IS A SPEECH RECOGNITION AND SYNTHESIS PLATFORM BASED ON NEURAL NETWORKS AND MACHINE LEARNING METHODS.

Audiogram is designed for developers of services and applications, as well as for end customers who need to recognize large volumes of audio files and generate audio content: telecom operators, banks, media and other companies.

Audiogram automatically converts speech to text in real time and offline, and, in the other direction, voices text with a selected voice, with specific intonation and stress placement.

Based on the platform developed by MTS AI, various speech services can be created.
These include:

speech recognition and voice assistants;
creating audiobooks;
automatic generation of subtitles for video;
transcribing audio to text;
recording audio messages with a synthesized voice.

Audiogram can be delivered as software or as a cloud-based service. The platform can be easily integrated with other MTS AI solutions, including speech analytics.


1. MARKET OVERVIEW OF SPEECH RECOGNITION AND SYNTHESIS TECHNOLOGIES

Market Trends in Speech Recognition and Synthesis Technologies

There is a growing need worldwide for software that can understand and reproduce the
human voice, as well as communicate with users.

The following factors have driven the development of such technologies:

demand for remote customer service in retail, medicine, telecom and other industries;
business aspiration to improve the efficiency of communications with customers and increase the speed of processing audience requests;
introduction of speech recognition technology into consumer products (smartphones, laptops, tablets and smart home devices) and the proliferation of voice-activated devices;
the growing need for voice authentication in applications and devices.

The global speech and voice recognition market is expected to grow from $9.4 billion in 2022 to $28.1 billion by 2027, an average annual growth rate of 24.4%.

$9.4 billion - global speech recognition market in 2022; $28.1 billion - forecast for 2027.

In Russia, the market for conversational AI, including speech recognition and synthesis technologies, is also growing, and this trend will continue in the coming years.

Conversational AI market size in Russia, 2021-2025: 80 (2021), 142 (2022), 196 (2023), 322 (2024), 561 (2025, forecast). Market growth from 2021 to 2025: 600%.

Representatives of different companies are interested in such solutions:

Large corporations with more than $1 billion in revenue are launching pilots to create virtual assistants and bots.
Mid-sized businesses need customizable solutions for specific needs.
Small businesses are interested in boxed products that require minimal customization and in service support from partners.

A large volume of orders for conversational AI solutions comes from the government - there are
several players in the market that almost entirely specialize in government orders.

Business effects of implementing products created with Audiogram analogs:

2x increase in the speed of informing clients;
16-25% improvement in the accuracy of recognizing user requests;
96% sales growth.

Additional benefits:

50% load reduction for call centers;
20% reduction of labor costs for employees;
20% repeat sales.
Data source: case studies of Nanosemantica, Fonemica, Just AI, MarketsandMarkets studies


2. AUDIOGRAM BENEFITS

Audiogram is a speech recognition and synthesis platform based on neural networks and machine learning methods. It is designed for developers of services and applications, as well as for end customers who need to recognize large volumes of audio files and generate audio content: telecom companies, banks, media and others.

With Audiogram, business customers can create services that convert speech to text and text to speech in real time and offline, build voice bots, generate subtitles and voice texts with a selected voice, intonation and other parameters.

Multi-sectoral model

The unique speech recognition model can be used in any field (retail, telecom, banks, etc.) and does not require additional training.

Domain model pre-training

The speech recognition model can be adapted to the business customer's specific vocabulary (marketing names, specific terms) in 14 days, based on 300 hours of audio recordings.

Advanced speech synthesis functions

Create the voice of your brand, quickly and efficiently voice fiction books and commercials with automatic stress and intonation placement.

High quality work

Audiogram recognizes speech in different noise conditions: the platform supports dialog with customers when they are talking very quietly or in places where there are extraneous noises.

Easy integration with customer systems

The platform supports interaction with external applications using the gRPC API, as well as the UniMRCP and SIP protocols for integration with telephony.

Flexible licensing

Payment is available on a pay-as-you-go basis: for a minute of audio recognition and for every million characters synthesized. Package tariffs are also available, and system enhancements for the client's domain are provided for a fee.


3. HOW CAN BUSINESSES USE AUDIOGRAM?

Developers of voice bots and smart assistants:

voicing the bot's or assistant's response with a synthesized voice that is indistinguishable from a human one, making it pleasant and comfortable for the user to communicate with the bot;
using an off-the-shelf, multi-sectoral speech recognition model with which a bot or assistant can maintain a dialog with a user on any topic.

Media and other content creators:

automatic generation of subtitles for videos;
voiceover of articles and videos;
voiceover of site navigation for people with impaired vision;
transcription of audio and video footage - interviews and conferences with one or multiple participants.

EdTech companies:

transcription of audio and video lectures;
creating subtitles for training videos;
voice-overs for course videos;
voice-overs for articles.

Publishers and digital libraries:

fast voice-over of fiction, popular science and other literature to create audiobooks.



Social media and messengers:

transcribing voice messages;
converting text messages into audio messages;
automatic generation of subtitles for videos.

Call centers:

user speech recognition and generation of bot responses with a synthesized voice;
rapid changes to the intelligent voice menu (IVR) without the use of an announcer;
call transcription.

Software and video game developers:

introduction of speech recognition and synthesis into applications and end-user programs;
audio navigation for users;
subtitle generation for videos;
character voiceovers in computer games.

Transportation and retail facilities (airports, train stations, subways, shopping centers, stores):

voiceover of informational messages to attract the attention of passengers and customers;
generation of audio prompts for comfortable navigation of visitors.


4. CASE STUDIES

4.1. Audiogram implementation in the MTS product "Defender"

The customer turned to MTS AI for a solution for recording and transcribing calls, as well as for maintaining a dialog with spammers. It had to be integrated into the service that protects customers from phone spam and unwanted calls. MTS AI proposed using Audiogram for this purpose.

What was done?

Audiogram has been integrated with MTS's internal systems.

The speech synthesis and recognition platform was deployed on CPUs (central processing units) within the customer's infrastructure;

Audiogram was connected to MTS' contact center software using the gRPC protocol, which facilitates messaging between customers and internal services.

The features of the platform have been customized.

A punctuation module was added to make it easier for end users to understand messages;

The platform was taught to convert spelled-out numerals into numeric values (e.g. the spoken words "three hundred and twenty-five" into 325) - a simplified sketch of this step follows the list below;

The "Antimat" service for converting foul language into special symbols was implemented.
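The numeral conversion described above can be pictured as a small post-processing pass over the recognized text. The sketch below is a hypothetical Python illustration (the actual Audiogram module is not public) and handles only simple English cardinal numbers:

# Simplified sketch of the numeral post-processing idea described above.
# The real Audiogram module is not public; this hypothetical example only
# handles small English cardinal numbers ("three hundred and twenty-five").

WORD_VALUES = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000}

def words_to_number(tokens):
    """Convert number words to an integer, e.g. ["three", "hundred", "twenty", "five"] -> 325."""
    total, current = 0, 0
    for token in tokens:
        if token in WORD_VALUES:
            current += WORD_VALUES[token]
        elif token == "hundred":
            current *= 100
        elif token == "thousand":
            total += current * 1000
            current = 0
        # connector words such as "and" are simply skipped
    return total + current

def normalize_numerals(text):
    """Replace runs of number words in a transcript with digits."""
    words = text.lower().replace("-", " ").split()
    out, run = [], []
    for word in words:
        if word in WORD_VALUES or word in MULTIPLIERS or (run and word == "and"):
            run.append(word)
        else:
            if run:
                out.append(str(words_to_number(run)))
                run = []
            out.append(word)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)

print(normalize_numerals("the price is three hundred and twenty-five rubles"))
# -> "the price is 325 rubles"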

Result

With the Audiogram integration, MTS has the options it needs for its Defender product:

recording and accurate transcription of spam calls;
listening to the message to the end.

Customers receive SMS messages with the text of the conversation and indication of the
spam category, and MTS subscribers can be sure that important information will not be lost,
as they will receive information about the call and the content of the conversation.


The business impact of implementing Audiogram in MTS Defender:

The number of users of the service more than tripled in 5 months;

ROI was 73%;

User loyalty increased thanks to fewer unwanted conversations.

Audiogram integration took about 5 months and cost 2.1 million rubles
including the development of customized services.
The product has been launched commercially and continues to evolve.


4.2. Using Audiogram to voice MTS Library books

MTS had the task of increasing the number of audiobooks in the library and reducing the time needed to prepare them, so that users would not have to wait three to six months for new releases and stop using the service.

MTS turned to MTS AI to find a way to create audio content without hiring an announcer, renting a studio, paying for sound processing, etc.

MTS AI offered to use the capabilities of Audiogram.

The platform can voice literary texts with the necessary intonation and stress placement. At the customer's request, one of four voices can be used: one female and three male.

What was done?

The process of creating audiobooks from electronic versions of publications in the common EPUB format was automated (a simplified sketch of the text extraction step follows this list);

The speech synthesis model was improved: intonation characteristic of literary texts, including interrogative intonation and stress placement*.
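The EPUB automation mentioned above can be pictured as follows. This is a minimal, hypothetical sketch rather than the MTS AI pipeline: an EPUB file is a ZIP archive of XHTML chapters, so the Python standard library is enough to pull out plain text that could then be sent to a synthesis endpoint.

# Minimal sketch of extracting plain text from an EPUB for voicing.
# An EPUB file is a ZIP archive containing XHTML documents; this is not
# the actual MTS AI pipeline, just an illustration of the idea.
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an XHTML chapter, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def epub_to_paragraphs(path):
    paragraphs = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(book.read(name).decode("utf-8", errors="ignore"))
                paragraphs.extend(parser.chunks)
    return paragraphs

# Each text chunk could then be sent to the synthesis API chapter by chapter:
# for text in epub_to_paragraphs("book.epub"):
#     synthesize(text)   # hypothetical TTS call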

Result

Audiogram allowed MTS Library to optimize the audiobook preparation process:

The time to produce an audio version of an electronic publication was reduced from several months to between 30 minutes and an hour;

The quality of the voice acting remains at a high level.

Choosing between a synthesized voice and a human narrator, users preferred the former**.

Based on the results of the experiment, an MVP was launched to voice 300-500 fiction books in MTS Library.

* According to a study conducted by MTS AI, Audiogram places intonations and accents better than the
industry leaders: Yandex.Reader and spichki.org.
** The survey was conducted by MTS AI among MTS Library users.


4.3. Creating an AI operator for the MTS contact center with Audiogram

MTS approached MTS AI with a proposal to create a customer service tool for its contact centers that would work in parallel with the current IVR. The purpose was to reduce the load on operators and improve the quality of service to subscribers.

Using Audiogram, MTS AI developers created a voice assistant that answers customer calls using speech synthesis and recognition.

What was done?

We connected a chatbot created on the JAICP platform to the PBX.

Audiogram speech recognition and synthesis modules were connected to the software and to
the MTS contact center chatbot using UniMRCP.

We configured the voice activity detector (VAD), an algorithm designed to distinguish between
intervals of active speech and pauses.
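The VAD step described in the last item can be illustrated with the open-source webrtcvad package; this is only a sketch of the general technique, since the detector and settings actually used in the project are not described in this document.

# Sketch of voice activity detection on 16 kHz, 16-bit mono PCM audio.
# Uses the open-source webrtcvad package (pip install webrtcvad); the VAD
# and settings used in the MTS project are not described in the document.
import webrtcvad

SAMPLE_RATE = 16000          # Hz
FRAME_MS = 30                # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples

def speech_frames(pcm_audio, aggressiveness=2):
    """Yield (is_speech, frame) pairs for consecutive 30 ms frames."""
    vad = webrtcvad.Vad(aggressiveness)   # 0 = permissive, 3 = strict
    for offset in range(0, len(pcm_audio) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm_audio[offset:offset + FRAME_BYTES]
        yield vad.is_speech(frame, SAMPLE_RATE), frame

# Typical use: forward only speech frames to the recognizer and treat a
# long run of non-speech frames as the end of the caller's utterance.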

Result

An experiment to serve customers of the MTS ecosystem with the help of an AI-operator voice assistant was launched in two regions of Russia. The voice assistant receives incoming calls, transcribes them, sends them to a bot and synthesizes a verbal response.

The voice assistant has processed more than 200 thousand requests since the beginning of the year.

Improved customer service resulted in a 17-20% increase in customer loyalty.

In the future, it is planned to extend the experiment to new regions and other areas - banking, digital products, servicing of fixed-line telephone subscribers - and to transfer the project to commercial operation.

The implementation of the voice assistant took about 7 months including the pilot.


5. BASIC FUNCTIONALITY

ASR - Automatic Speech Recognition

Streaming speech-to-text conversion, which allows you to transcribe audio in real time and get the results in text format.

File-based speech conversion - asynchronous speech-to-text transcription for large volumes of audio files or audio archives.

There are two types of models available in Audiogram:

a domain model, which enables effective speech recognition in the fields of medicine, telecom and finance;

a general model with increased resource consumption, suitable for any application in a wide range of noisy environments.
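The streaming mode described above maps naturally onto a client-streaming gRPC call. The sketch below is a hypothetical illustration only: the actual Audiogram proto definitions, service and message names are not given in this document, so the stub, message and field names are assumptions.

# Hypothetical sketch of a streaming recognition client over gRPC.
# The Audiogram proto files are not included in this document, so the
# imported modules and stub names below are illustrative assumptions.
import grpc
# generated from the platform's .proto files (names are assumptions):
import audiogram_asr_pb2 as asr_pb2
import audiogram_asr_pb2_grpc as asr_grpc

def stream_requests(wav_path, chunk_size=32000):
    """Yield audio chunks from a WAV PCM 16bit file as streaming requests."""
    with open(wav_path, "rb") as audio:
        while chunk := audio.read(chunk_size):
            yield asr_pb2.RecognizeRequest(audio_content=chunk)

def recognize(wav_path, host="audiogram.example.internal:443"):
    channel = grpc.secure_channel(host, grpc.ssl_channel_credentials())
    stub = asr_grpc.SpeechRecognitionStub(channel)
    # Intermediate (partial) results arrive while audio is still streaming.
    for response in stub.StreamingRecognize(stream_requests(wav_path)):
        print(response)

if __name__ == "__main__":
    recognize("call.wav")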

TTS (Text-to-Speech) - text-to-speech conversion

Text voicing with a female voice or one of three male synthesized voices. Automatic ML-based markup for literary voicing of books and articles.

The platform supports the SSML speech synthesis markup language, which allows you
to achieve a more natural sound by controlling intonation, speed, accents and other
parameters.
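As an illustration of the SSML support described above, the snippet below builds an SSML string in Python using standard W3C SSML elements (speak, break, prosody, say-as); the exact subset of elements supported by Audiogram is not listed in this document, so treat it as a sketch.

# Illustrative SSML snippet; the W3C SSML elements shown here are standard,
# but the exact subset supported by Audiogram is not listed in this document.
ssml_text = """
<speak>
  Your order number <say-as interpret-as="cardinal">325</say-as> is ready.
  <break time="300ms"/>
  <prosody rate="slow" pitch="+2st">Thank you for waiting!</prosody>
</speak>
"""
# The string would be passed to the synthesis request instead of plain text,
# e.g. as a hypothetical SynthesizeRequest(ssml=ssml_text) over the gRPC API.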

Auxiliary Services:

Collecting statistics on platform usage;

billing - bill generation based on statistics of platform services usage and tariffs;

Connector services to support interaction with external applications.


6. SOFTWARE COMPONENTS

6.1. Description of program components

The Audiogram platform is built on the principle of microservice architecture.

Audiogram is provided to the user as an API through which they can interact with the platform directly, as well as a set of connectors for conversion to other protocols: a SIP connector, a UniMRCP connector and a REST gateway.

[Architecture diagram: telephony connects via the SIP connector and uniMRCP connector, while mobile apps and smart devices connect via the gRPC API; Audiogram comprises the ASR speech recognition module, the TTS speech synthesis module and the administrator's personal cabinet.]

On-premise and cloud versions of Audiogram

MTS AI can provide the Audiogram platform in two formats:

ON PREMISE. The customer receives a software distribution and a license for installation on their own servers.

CLOUD. The solution is deployed in the cloud on MTS AI facilities; the customer receives only a link to the API and a link to the personal account, and does not need to install any additional equipment.

The developers recommend the on-premise option for companies for which it is important to ensure the confidentiality of customer data (e.g. banks and telecom companies), so that no information leaves the company.


MAIN CHARACTERISTICS OF THE VOICE PLATFORM MODULES

6.2. ASR module

General characteristics:

integration with customer systems using the standard gRPC data transfer protocol;

response delay from 500 ms*;

integration with the Asterisk IP telephony platform via the SIP connector software module;

integration with Genesis contact center software using the UniMRCP connector software;

supported audio formats: WAV PCM 16bit, WAV MULAW, WAV ALAW;

available language - Russian.

Modes of operation:

File: suitable for recognizing single-channel audio of small size; the response is sent at the end of audio transmission, and the response delay is not less than the length of the audio itself.

Streaming: allows, within a single connection, sending audio clips and receiving results, including intermediate recognition results; response latency - from 500 ms.

Recognizing long audio: gives the ability to recognize long multichannel audio recordings; the speed of the response depends on the length of the audio.


MTS AI has language models available for areas such as telecom and medicine, as well as a general conversational model.

General speech recognition model:

no pre-training is required;

works "out of the box";

higher recognition accuracy than the domain model.

Domain speech recognition model:

focused on users from a specific sphere: telecom, retail, medicine, education and so on;

pre-training is required for new subject areas;

more than 200 hours of audio material is needed for pre-training.

6.3 TTS module

General Characteristics:

4 basic voices: Marvin (male), Gleb (male), Islam (male), Maria (female);

SSML markup for fine-grained control of synthesis, allowing intonation, speed, stress and other parameters to be corrected and stresses to be placed in bot phrases;

response delay from 500 ms*;

available language - Russian;

automatic markup for artistic voiceover.


*Total delay depends on the number of characters.


7. RECOMMENDED DEPLOYMENT MODELS

7.1. ASR

Processor: 2x GPU Nvidia V100 16Gb, 64 vCPU 2.3GHz
Operating memory: 160 Gb RAM
Hard disk capacity: 1.5 Tb HDD

With the ASR domain model, each server with the above configuration can handle single-channel audio at the following throughput:

in file mode - 700 RTF (Real-Time Factor) across 1000 concurrent threads, with 512 Gb RAM consumption;

in streaming mode - 500 simultaneous streams with 128 Gb RAM consumption. For five-second audio the delay is 0.271 sec; for ten-second audio - 0.343 sec; for fifteen-second audio - 0.364 sec.

With the general ASR model, each server with the above configuration can handle single-channel audio:

in file mode - 80 RTF (Real-Time Factor) throughput, while consuming 512 Gb of RAM (a worked example of these RTF figures follows below);

streaming utilization data will be provided once productization is complete.
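As a worked example of the throughput figures above (assuming an RTF value here means how many times faster than real time the server processes audio, which is how the document appears to use it):

# Worked example of the RTF throughput figures quoted above. Assumption: an
# RTF of 700 means the server processes audio 700 times faster than real time
# (the document uses RTF as a throughput figure, not as a latency ratio).
rtf_domain_file_mode = 700     # domain model, file mode
rtf_general_file_mode = 80     # general model, file mode

hours_of_audio_per_hour_domain = 1 * rtf_domain_file_mode
hours_of_audio_per_hour_general = 1 * rtf_general_file_mode

print(f"Domain model: ~{hours_of_audio_per_hour_domain} h of audio per server-hour")
print(f"General model: ~{hours_of_audio_per_hour_general} h of audio per server-hour")
# A 10,000-hour call archive would need roughly 10000/700 ≈ 14.3 server-hours
# on the domain model, versus 10000/80 = 125 server-hours on the general one.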

7.2 TTS:

Processor: 1x GPU Nvidia V100 16Gb, 32 vCPU 2.3GHz
Operating memory: 64 Gb RAM
Hard disk capacity: 1.7 Tb HDD

On the current TTS model, each server with the above configuration synthesizes audio
at 290 characters per second.
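A quick worked example of the 290 characters per second figure (the book length used below is an illustrative assumption, not a number from this document):

# Worked example for the 290 characters/second synthesis rate quoted above.
# The book length is an assumption for illustration only.
chars_per_second = 290
book_characters = 500_000          # a typical novel-length text (assumption)

seconds = book_characters / chars_per_second
print(f"~{seconds / 60:.0f} minutes to synthesize the whole book")   # ≈ 29 min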


7.3 Auxiliary modules:

Personal cabinet: 4 vCPU 2.3GHz, 8 Gb RAM, 256 Gb HDD

Statistics service: 8 vCPU 2.3GHz, 16 Gb RAM, 2 Tb HDD

In the proposed scheme, the ASR/TTS servers can be horizontally scaled based on the expected load:

the gRPC API gateway scales by increasing vCPU and RAM in proportion to the increase in load;

the "Personal cabinet" and "Statistics service" modules are also scaled by increasing vCPU and RAM.

8. COMPETITIVE COMPARISON

The MTS AI developers are constantly improving the Audiogram platform and adding new features. To see the current competitive analysis, scan the QR code.

9. LICENSING

The Audiogram platform offers two types of tariffs:

pay-as-you-go;
package.

Contacts for requesting prices and a demo

To request pricing information, access to a demo or additional information, contact sales@mts.ai, or scan the QR code to go to the MTS AI website.
