
Audiogram
Speech recognition and synthesis platform

AUDIOGRAM IS A SPEECH RECOGNITION AND SYNTHESIS PLATFORM BASED ON NEURAL NETWORKS AND MACHINE LEARNING METHODS.

Audiogram is designed for developers of services and applications, as well as for end customers who need to recognize large volumes of audio files and generate audio content: telecom operators, banks, media and other companies.

Audiogram automatically converts speech to text in real time and offline, and, in the other direction, voices text with a selected voice, with specific intonation and stress placement.

Based on the platform developed by MTS AI, various speech services can be created.
These include:

speech recognition and voice assistants;
creating audiobooks;
automatic generation of subtitles for video;
transcribing audio to text;
recording audio messages with a synthesized voice.

Audiogram can be delivered as software or as a cloud-based service. The platform can be easily integrated with other MTS AI solutions, including speech analytics.


1. MARKET OVERVIEW OF SPEECH RECOGNITION AND SYNTHESIS TECHNOLOGIES

Market Trends in Speech Recognition and Synthesis Technologies

There is a growing need worldwide for software that can understand and reproduce the
human voice, as well as communicate with users.

The following factors have driven the development of such technologies:

demand for remote customer service in retail, medicine, telecom and other industries;
business aspiration to improve the efficiency of communications with customers and increase the speed of processing audience requests;
introduction of speech recognition technology into consumer products (smartphones, laptops, tablets and smart home devices) and the proliferation of voice-activated devices;
the growing need for voice authentication in applications and devices.

The global speech and voice recognition market is expected to grow from $9.4 billion in 2022 to $28.1 billion by 2027, an average annual growth rate of 24.4%.

$9.4 billion - global speech recognition market in 2022; $28.1 billion - forecast for 2027.

In Russia, the market for conversational AI, including speech recognition and synthesis technologies, is also growing, and this trend will continue in the coming years.

Conversational AI market size in Russia, 2021-2025: 80 (2021), 142 (2022), 196 (2023), 322 (2024), 561 (2025, forecast). Market growth from 2021 to 2025: 600%.

Representatives of different companies are interested in such solutions:

Large corporations with more than $1 billion in revenue are launching pilots to create virtual assistants and bots.
Mid-sized businesses need customizable solutions for specific needs.
Small businesses are interested in boxed products that require minimal customization and in service support from partners.

A large volume of orders for conversational AI solutions comes from the government - there are
several players in the market that almost entirely specialize in government orders.

Business effects of implementing products created with Audiogram analogs:

2x increase in the speed of informing clients;
16-25% improvement in the accuracy of recognizing user requests;
96% sales growth.

Additional benefits:

50% load reduction for call centers;
20% reduction of labor costs for employees;
20% repeat sales.
Data source: case studies of Nanosemantica, Fonemica, Just AI, MarketsandMarkets studies


2. AUDIOGRAM BENEFITS

Audiogram is a speech recognition and synthesis platform based on neural networks and machine learning methods. It is designed for developers of services and applications, as well as for end customers who need to recognize large volumes of audio files and generate audio content: telecom companies, banks, media and others.

With Audiogram, business customers can create services that convert speech to text and text to speech in real time and offline, build voice bots, generate subtitles and voice texts with a selected voice, intonation and other parameters.

Multi-sectoral model

The unique speech recognition model can be used in any field (retail, telecom, banks, etc.) and does not require additional training.

Domain model pre-training

The speech recognition model can be adapted to the business customer's specific vocabulary (marketing names, specific terms) in 14 days, based on 300 hours of audio recordings.

Advanced speech synthesis functions

Create the voice of your brand, quickly and efficiently voice fiction books and commercials with automatic stress and intonation placement.

High quality work

Audiogram recognizes speech in different noise conditions: the platform supports dialog with customers when they are talking very quietly or in places where there are extraneous noises.

Easy integration with customer systems

The platform supports interaction with external applications using the gRPC API, as well as the UniMRCP and SIP protocols for integration with telephony.

Flexible licensing

Payment is available on a pay-as-you-go basis: for a minute of audio recognition and for every million characters synthesized. Package tariffs are also available, and system enhancements for the client's domain are provided for a fee.


3. HOW CAN BUSINESSES USE AUDIOGRAM?

Developers of voice bots and smart assistants:

voicing the bot's or assistant's response with a synthesized voice that is indistinguishable from a human one, making it pleasant and comfortable for the user to communicate with the bot;
using an off-the-shelf, multi-sectoral speech recognition model with which a bot or assistant can maintain a dialog with a user on any topic.

Media and other content creators:

automatic generation of subtitles for videos;
voiceover of articles and videos;
voiceover of site navigation for people with impaired vision;
transcription of audio and video footage - interviews and conferences with one or multiple participants.

EdTech companies:

transcription of audio and video lectures;
creating subtitles for training videos;
voice-overs for course videos;
voice-overs for articles.

Publishers and digital libraries:

fast voice-over of fiction, popular science and other literature to create audiobooks.



Social media and messengers:

transcribing voice messages;
converting text messages into audio messages;
automatic generation of subtitles for videos.

Call centers:

user speech recognition and generation of bot responses with a synthesized voice;
rapid changes to the intelligent voice menu (IVR) without the use of an announcer;
call transcription.

Software and video game developers:

introduction of speech recognition and synthesis into applications and end-user programs;
audio navigation for users;
subtitle generation for videos;
character voiceovers in computer games.

Transportation and retail facilities (airports, train stations, subways, shopping centers, stores):

voiceover of informational messages to attract the attention of passengers and customers;
generation of audio prompts for comfortable navigation of visitors.


4. CASE STUDIES

4.1. Audiogram implementation in the MTS product "Defender"

The customer turned to MTS AI for a solution for recording and transcribing calls, as well as for maintaining a dialog with spammers. It had to be integrated into the service that protects customers from phone spam and unwanted calls. MTS AI proposed using Audiogram for this purpose.

What was done?

Audiogram has been integrated with MTS's internal systems.

The speech synthesis and recognition platform was deployed on CPUs (central processing units) within the customer's infrastructure;

Audiogram was connected to MTS' contact center software using the gRPC protocol, which facilitates messaging between customers and internal services.

The features of the platform have been customized.

A punctuation module was added to make it easier for end users to understand messages;

The platform was taught to convert spelled-out numerals into numeric values (e.g. the spoken words "three hundred and twenty-five" into 325) - a simplified sketch of this step follows the list below;

The "Antimat" service for converting foul language into special symbols was implemented.
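The numeral conversion described above can be pictured as a small post-processing pass over the recognized text. The sketch below is a hypothetical Python illustration (the actual Audiogram module is not public) and handles only simple English cardinal numbers:

# Simplified sketch of the numeral post-processing idea described above.
# The real Audiogram module is not public; this hypothetical example only
# handles small English cardinal numbers ("three hundred and twenty-five").

WORD_VALUES = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000}

def words_to_number(tokens):
    """Convert number words to an integer, e.g. ["three", "hundred", "twenty", "five"] -> 325."""
    total, current = 0, 0
    for token in tokens:
        if token in WORD_VALUES:
            current += WORD_VALUES[token]
        elif token == "hundred":
            current *= 100
        elif token == "thousand":
            total += current * 1000
            current = 0
        # connector words such as "and" are simply skipped
    return total + current

def normalize_numerals(text):
    """Replace runs of number words in a transcript with digits."""
    words = text.lower().replace("-", " ").split()
    out, run = [], []
    for word in words:
        if word in WORD_VALUES or word in MULTIPLIERS or (run and word == "and"):
            run.append(word)
        else:
            if run:
                out.append(str(words_to_number(run)))
                run = []
            out.append(word)
    if run:
        out.append(str(words_to_number(run)))
    return " ".join(out)

print(normalize_numerals("the price is three hundred and twenty-five rubles"))
# -> "the price is 325 rubles"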

Result

With the Audiogram integration, MTS has the options it needs for its Defender product:

recording and accurate transcription of spam calls;
listening to the message to the end.

Customers receive SMS messages with the text of the conversation and indication of the
spam category, and MTS subscribers can be sure that important information will not be lost,
as they will receive information about the call and the content of the conversation.


The business impact of implementing Audiogram in MTS Defender:

The number of users of the service more than tripled in 5 months;

ROI was 73%;

User loyalty increased thanks to fewer unwanted conversations.

Audiogram integration took about 5 months and cost 2.1 million rubles
including the development of customized services.
The product has been launched commercially and continues to evolve.


4.2. Using Audiogram to voice MTS Library books

MTS had the task of increasing the number of audiobooks in the library and reducing the time needed to prepare them, so that users would not have to wait three to six months for new releases and stop using the service.

MTS turned to MTS AI to find a way to create audio content without hiring an announcer, renting a studio, paying for sound processing, etc.

MTS AI offered to use the capabilities of Audiogram.

The platform can voice literary texts with the necessary intonation and stress placement. At the customer's request, one of four voices can be used: one female and three male.

What was done?

The process of creating audiobooks from electronic versions of publications in the common EPUB format was automated (a simplified sketch of the text extraction step follows this list);

The speech synthesis model was improved: intonation characteristic of literary texts, including interrogative intonation and stress placement*.
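The EPUB automation mentioned above can be pictured as follows. This is a minimal, hypothetical sketch rather than the MTS AI pipeline: an EPUB file is a ZIP archive of XHTML chapters, so the Python standard library is enough to pull out plain text that could then be sent to a synthesis endpoint.

# Minimal sketch of extracting plain text from an EPUB for voicing.
# An EPUB file is a ZIP archive containing XHTML documents; this is not
# the actual MTS AI pipeline, just an illustration of the idea.
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an XHTML chapter, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def epub_to_paragraphs(path):
    paragraphs = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(book.read(name).decode("utf-8", errors="ignore"))
                paragraphs.extend(parser.chunks)
    return paragraphs

# Each text chunk could then be sent to the synthesis API chapter by chapter:
# for text in epub_to_paragraphs("book.epub"):
#     synthesize(text)   # hypothetical TTS call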

Result

Audiogram allowed MTS Library to optimize the audiobook preparation process:

The time to produce an audio version of an electronic publication was reduced from several months to between 30 minutes and an hour;

The quality of the voice acting remains at a high level.

Choosing between a synthesized voice and a human narrator, users preferred the former**.

Based on the results of the experiment, an MVP was launched to voice 300-500 fiction books in MTS Library.

* According to a study conducted by MTS AI, Audiogram places intonations and accents better than the
industry leaders: Yandex.Reader and spichki.org.
** The survey was conducted by MTS AI among MTS Library users.


4.3. Creating an AI operator for the MTS contact center with Audiogram

MTS approached MTS AI with a proposal to create a customer service tool for its contact centers that would work in parallel with the current IVR. The purpose was to reduce the load on operators and improve the quality of service to subscribers.

Using Audiogram, MTS AI developers created a voice assistant that answers customer calls using speech synthesis and recognition.

What was done?

We connected a chatbot created on the JAICP platform to the PBX.

Audiogram speech recognition and synthesis modules were connected to the software and to
the MTS contact center chatbot using UniMRCP.

We configured the voice activity detector (VAD), an algorithm designed to distinguish between
intervals of active speech and pauses.
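The VAD step described in the last item can be illustrated with the open-source webrtcvad package; this is only a sketch of the general technique, since the detector and settings actually used in the project are not described in this document.

# Sketch of voice activity detection on 16 kHz, 16-bit mono PCM audio.
# Uses the open-source webrtcvad package (pip install webrtcvad); the VAD
# and settings used in the MTS project are not described in the document.
import webrtcvad

SAMPLE_RATE = 16000          # Hz
FRAME_MS = 30                # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples

def speech_frames(pcm_audio, aggressiveness=2):
    """Yield (is_speech, frame) pairs for consecutive 30 ms frames."""
    vad = webrtcvad.Vad(aggressiveness)   # 0 = permissive, 3 = strict
    for offset in range(0, len(pcm_audio) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm_audio[offset:offset + FRAME_BYTES]
        yield vad.is_speech(frame, SAMPLE_RATE), frame

# Typical use: forward only speech frames to the recognizer and treat a
# long run of non-speech frames as the end of the caller's utterance.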

Result

An experiment to serve customers of the MTS ecosystem with the help of an AI-operator voice assistant was launched in two regions of Russia. The voice assistant receives incoming calls, transcribes them, sends them to a bot and synthesizes a verbal response.

The voice assistant has processed more than 200 thousand requests since the beginning of the year.

Improved customer service resulted in a 17-20% increase in customer loyalty.

In the future, it is planned to extend the experiment to new regions and other areas - banking, digital products, servicing of fixed-line telephone subscribers - and to transfer the project to commercial operation.

The implementation of the voice assistant took about 7 months including the pilot.


5. BASIC FUNCTIONALITY

ASR - Automatic Speech Recognition

Streaming speech-to-text conversion, which allows you to transcribe audio in real time and get the results in text format.

File-based speech conversion - asynchronous speech-to-text transcription for large volumes of audio files or audio archives.

There are two types of models available in Audiogram:

a domain model, which enables effective speech recognition in the fields of medicine, telecom and finance;

a general model with increased resource consumption, suitable for any application in a wide range of noisy environments.
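The streaming mode described above maps naturally onto a client-streaming gRPC call. The sketch below is a hypothetical illustration only: the actual Audiogram proto definitions, service and message names are not given in this document, so the stub, message and field names are assumptions.

# Hypothetical sketch of a streaming recognition client over gRPC.
# The Audiogram proto files are not included in this document, so the
# imported modules and stub names below are illustrative assumptions.
import grpc
# generated from the platform's .proto files (names are assumptions):
import audiogram_asr_pb2 as asr_pb2
import audiogram_asr_pb2_grpc as asr_grpc

def stream_requests(wav_path, chunk_size=32000):
    """Yield audio chunks from a WAV PCM 16bit file as streaming requests."""
    with open(wav_path, "rb") as audio:
        while chunk := audio.read(chunk_size):
            yield asr_pb2.RecognizeRequest(audio_content=chunk)

def recognize(wav_path, host="audiogram.example.internal:443"):
    channel = grpc.secure_channel(host, grpc.ssl_channel_credentials())
    stub = asr_grpc.SpeechRecognitionStub(channel)
    # Intermediate (partial) results arrive while audio is still streaming.
    for response in stub.StreamingRecognize(stream_requests(wav_path)):
        print(response)

if __name__ == "__main__":
    recognize("call.wav")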

TTS (Text-to-Speech) - text-to-speech conversion

Text voicing with a female voice or one of three male synthesized voices. Automatic ML-based markup for literary voicing of books and articles.

The platform supports the SSML speech synthesis markup language, which allows you
to achieve a more natural sound by controlling intonation, speed, accents and other
parameters.
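As an illustration of the SSML support described above, the snippet below builds an SSML string in Python using standard W3C SSML elements (speak, break, prosody, say-as); the exact subset of elements supported by Audiogram is not listed in this document, so treat it as a sketch.

# Illustrative SSML snippet; the W3C SSML elements shown here are standard,
# but the exact subset supported by Audiogram is not listed in this document.
ssml_text = """
<speak>
  Your order number <say-as interpret-as="cardinal">325</say-as> is ready.
  <break time="300ms"/>
  <prosody rate="slow" pitch="+2st">Thank you for waiting!</prosody>
</speak>
"""
# The string would be passed to the synthesis request instead of plain text,
# e.g. as a hypothetical SynthesizeRequest(ssml=ssml_text) over the gRPC API.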

Auxiliary Services:

Collecting statistics on platform usage;

billing - bill generation based on statistics of platform services usage and tariffs;

Connector services to support interaction with external applications.


6. SOFTWARE COMPONENTS

6.1. Description of program components

The Audiogram platform is built on the principle of microservice architecture.

Audiogram is provided to the user as an API through which they can interact with the platform directly, as well as a set of connectors for conversion to other protocols: a SIP connector, a UniMRCP connector and a REST gateway.

[Architecture diagram: telephony connects via the SIP connector and uniMRCP connector, while mobile apps and smart devices connect via the gRPC API; Audiogram comprises the ASR speech recognition module, the TTS speech synthesis module and the administrator's personal cabinet.]

On-premise and cloud versions of Audiogram

MTS AI can provide the Audiogram platform in two formats:

ON PREMISE. The customer receives a software distribution and a license for installation on their own servers.

CLOUD. The solution is deployed in the cloud on MTS AI facilities; the customer receives only a link to the API and a link to the personal account, and does not need to install any additional equipment.

The developers recommend the on-premise option for companies for which it is important to ensure the confidentiality of customer data (e.g. banks and telecom companies), so that no information leaves the company.


MAIN CHARACTERISTICS OF THE VOICE PLATFORM MODULES

6.2. ASR module

General characteristics:

integration with customer systems using the standard gRPC data transfer protocol;

response delay from 500 ms*;

integration with the Asterisk IP telephony platform via the SIP connector software module;

integration with Genesis contact center software using the UniMRCP connector software;

supported audio formats: WAV PCM 16bit, WAV MULAW, WAV ALAW;

available language - Russian.

Modes of operation:

File: suitable for recognizing single-channel audio of small size; the response is sent at the end of audio transmission, and the response delay is not less than the length of the audio itself.

Streaming: allows, within a single connection, sending audio clips and receiving results, including intermediate recognition results; response latency - from 500 ms.

Recognizing long audio: gives the ability to recognize long multichannel audio recordings; the speed of the response depends on the length of the audio.


MTS AI has language models available for areas such as telecom and medicine, as well as a general conversational model.

General speech recognition model:

no pre-training is required;

works "out of the box";

higher recognition accuracy than the domain model.

Domain speech recognition model:

focused on users from a specific sphere: telecom, retail, medicine, education and so on;

pre-training is required for new subject areas;

more than 200 hours of audio material is needed for pre-training.

6.3 TTS module

General Characteristics:

4 basic voices: Marvin (male), Gleb (male), Islam (male), Maria (female);

SSML markup for fine-grained control of synthesis, allowing intonation, speed, stress and other parameters to be corrected and stresses to be placed in bot phrases;

response delay from 500 ms*;

available language - Russian;

automatic markup for artistic voiceover.


*Total delay depends on the number of characters.


7. RECOMMENDED DEPLOYMENT MODELS

7.1. ASR

Processor: 2x GPU Nvidia V100 16Gb, 64 vCPU 2.3GHz
Operating memory: 160 Gb RAM
Hard disk capacity: 1.5 Tb HDD

With the ASR domain model, each server with the above configuration can handle single-channel audio at the following throughput:

in file mode - 700 RTF (Real-Time Factor) across 1000 concurrent threads, with 512 Gb RAM consumption;

in streaming mode - 500 simultaneous streams with 128 Gb RAM consumption. For five-second audio the delay is 0.271 sec; for ten-second audio - 0.343 sec; for fifteen-second audio - 0.364 sec.

With the general ASR model, each server with the above configuration can handle single-channel audio:

in file mode - 80 RTF (Real-Time Factor) throughput, while consuming 512 Gb of RAM (a worked example of these RTF figures follows below);

streaming utilization data will be provided once productization is complete.
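As a worked example of the throughput figures above (assuming an RTF value here means how many times faster than real time the server processes audio, which is how the document appears to use it):

# Worked example of the RTF throughput figures quoted above. Assumption: an
# RTF of 700 means the server processes audio 700 times faster than real time
# (the document uses RTF as a throughput figure, not as a latency ratio).
rtf_domain_file_mode = 700     # domain model, file mode
rtf_general_file_mode = 80     # general model, file mode

hours_of_audio_per_hour_domain = 1 * rtf_domain_file_mode
hours_of_audio_per_hour_general = 1 * rtf_general_file_mode

print(f"Domain model: ~{hours_of_audio_per_hour_domain} h of audio per server-hour")
print(f"General model: ~{hours_of_audio_per_hour_general} h of audio per server-hour")
# A 10,000-hour call archive would need roughly 10000/700 ≈ 14.3 server-hours
# on the domain model, versus 10000/80 = 125 server-hours on the general one.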

7.2 TTS:

Processor: 1x GPU Nvidia V100 16Gb, 32 vCPU 2.3GHz
Operating memory: 64 Gb RAM
Hard disk capacity: 1.7 Tb HDD

On the current TTS model, each server with the above configuration synthesizes audio
at 290 characters per second.
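A quick worked example of the 290 characters per second figure (the book length used below is an illustrative assumption, not a number from this document):

# Worked example for the 290 characters/second synthesis rate quoted above.
# The book length is an assumption for illustration only.
chars_per_second = 290
book_characters = 500_000          # a typical novel-length text (assumption)

seconds = book_characters / chars_per_second
print(f"~{seconds / 60:.0f} minutes to synthesize the whole book")   # ≈ 29 min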


7.3 Auxiliary modules:

Personal cabinet: 4 vCPU 2.3GHz, 8 Gb RAM, 256 Gb HDD

Statistics service: 8 vCPU 2.3GHz, 16 Gb RAM, 2 Tb HDD

In the proposed scheme, the ASR/TTS servers can be horizontally scaled based on the expected load:

the gRPC API gateway scales by increasing vCPU and RAM in proportion to the increase in load;

the "Personal cabinet" and "Statistics service" modules are also scaled by increasing vCPU and RAM.

8. COMPETITIVE COMPARISON

The MTS AI developers are constantly improving the Audiogram platform and adding new features. To see the current competitive analysis, scan the QR code.

9. LICENSING

The Audiogram platform offers two types of tariffs:

pay-as-you-go;
package.

Contacts for requesting prices and a demo

To request pricing information, access to a demo or additional information, contact sales@mts.ai, or scan the QR code to go to the MTS AI website.
