Audiogram
Speech recognition and synthesis platform
Audiogram is designed for service developers, as well as for end customers
who need to recognize large volumes of audio files and generate audio content: telecom
operators, banks, media and other companies.
Audiogram automatically converts speech to text, in real time or offline,
and converts text to speech in a selected voice with the required intonation and
stress.
Based on the platform developed by MTS AI, various speech services can be created.
These include:
Audiogram can be delivered as software or as a cloud-based service. The platform integrates
easily with other MTS AI solutions, including speech analytics.
There is a growing need worldwide for software that can understand and reproduce the
human voice, as well as communicate with users. The main drivers are:
demand for remote customer service in retail, medicine, telecom and other industries;
the aspiration of businesses to communicate with customers more efficiently and to process audience requests faster;
the introduction of speech recognition technology into consumer products (smartphones, laptops, tablets and smart home devices) and the proliferation of voice-activated devices;
the growing need for voice authentication in applications and devices.
The global speech and voice recognition market will grow from $9.4 billion in 2022 to $28.1
billion by 2027, an average annual growth rate of 24.4%.
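The growth rate can be cross-checked from the two endpoint figures with the standard compound annual growth rate formula. The sketch below is only an illustration of that arithmetic; the function name `cagr` is hypothetical, and the result is close to, but not exactly, the quoted 24.4% because the endpoint values are rounded.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.244 means 24.4%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text: $9.4 billion (2022) and $28.1 billion (2027).
rate = cagr(9.4, 28.1, 2027 - 2022)
print(f"Implied annual growth: {rate:.1%}")
```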
[Chart: market size by year, 2021 to 2025. Bar values: 80, 142, 196, 322, 561. Market growth from 2021 to 2025: 600%.]
A large volume of orders for conversational AI solutions comes from the government: several
players in the market specialize almost entirely in government orders.
Additional benefits:
Data source: case studies of Nanosemantica, Fonemica, Just AI, MarketsandMarkets studies
2. AUDIOGRAM BENEFITS
Audiogram is a speech recognition and synthesis platform based on neural networks and machine
learning methods. It is designed for service developers, as well as for end customers
who need to recognize large volumes of audio files and generate audio content: telecom companies,
banks, media and others.
Create the voice of your brand, quickly and efficiently voice over fiction books and commercials with automatic accent and intonation placement.
Audiogram recognizes speech in different noise conditions: the platform supports dialog with customers when they are talking very quietly or in places where there are extraneous noises.
4. CASE STUDIES
The speech synthesis and recognition platform was deployed on CPUs (central
processing units) within the customer's own infrastructure;
Audiogram was connected to MTS's contact center software using the gRPC
protocol, which handles messaging between clients and internal services;
The platform was taught to convert spoken numerals into numeric values (e.g.,
the spoken words "three hundred and twenty-five" become 325);
The "Antimat" service, which replaces foul language with special symbols, was
implemented.
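The two text post-processing steps listed above, numeral conversion and masking of foul language, can be illustrated with a minimal sketch. It is not Audiogram's implementation: the function names `words_to_number` and `mask_words`, the English word tables and the blocklist are all hypothetical, and the platform itself works with Russian.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(phrase: str) -> int:
    """Convert an English cardinal written in words to an integer."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


def mask_words(text: str, blocklist: set, symbol: str = "*") -> str:
    """Replace every blocklisted word with symbols of the same length."""
    def repl(match):
        word = match.group(0)
        return symbol * len(word) if word.lower() in blocklist else word
    return re.sub(r"\w+", repl, text)


print(words_to_number("three hundred and twenty-five"))
print(mask_words("What a darn mess", {"darn"}))  # hypothetical blocklist
```

A production system would apply such rules to the recognizer's output stream and would need language-specific grammar for numerals; the sketch only shows the shape of the transformation.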
Result
With the Audiogram integration, MTS gained the capabilities it needs for
its Defender product:
recording and accurate transcription of spam calls;
listening to the message to the end.
Customers receive SMS messages with the text of the conversation and the
spam category. MTS subscribers can be sure that important information will not be lost,
because they receive both the details of the call and the content of the conversation.
The number of users of the service more than tripled in 5 months;
ROI was 73%;
Audiogram integration took about 5 months and cost 2.1 million rubles
including the development of customized services.
The product has been launched commercially and continues to evolve.
MTS turned to MTS AI to find a way to create audio content without hiring a
voice actor, renting a studio or paying for sound processing.
The speech synthesis model was improved to reproduce the intonation characteristic of literary
texts, including interrogative intonation and stress placement*.
Result
Based on the results of the experiment, an MVP was launched to voice 300 to 500 fiction books in MTS
Library.
* According to a study conducted by MTS AI, Audiogram places intonation and stress better than the
industry leaders Yandex.Reader and spichki.org.
** The survey was conducted by MTS AI among MTS Library users.
Using Audiogram, MTS AI developers created a voice assistant that answers customer calls
with speech synthesis and recognition.
Audiogram speech recognition and synthesis modules were connected to the software and to
the MTS contact center chatbot using UniMRCP.
We configured the voice activity detector (VAD), an algorithm designed to distinguish between
intervals of active speech and pauses.
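To illustrate what a voice activity detector does, here is a minimal energy-threshold sketch. It is not the detector used in Audiogram, whose algorithm and settings are not described here; the function names, the 20 ms frame size and the threshold of 500 (for 16-bit samples) are assumptions chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of consecutive non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples, sample_rate=8000, frame_ms=20, threshold=500.0):
    """Return (start_sec, end_sec) intervals whose frame energy exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_rms(samples, frame_len)
    intervals, start = [], None
    for idx, rms in enumerate(energies):
        active = rms > threshold
        if active and start is None:
            start = idx  # a speech interval begins
        elif not active and start is not None:
            intervals.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None  # the interval ended with a pause
    if start is not None:
        intervals.append((start * frame_ms / 1000, len(energies) * frame_ms / 1000))
    return intervals


# Synthetic test signal: 0.5 s of silence, 1 s of a 440 Hz tone, 0.5 s of silence.
rate = 8000
silence = [0] * (rate // 2)
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
print(detect_speech(silence + tone + silence, sample_rate=rate))
```

Real detectors usually add smoothing (minimum speech and pause durations) and adapt the threshold to background noise; tuning such parameters is the kind of configuration the case describes.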
Result
The voice assistant has processed more than 200 thousand requests since the
beginning of the year.
In the future, the experiment is planned to expand to new regions and other areas
(banking, digital products, servicing of fixed telephone network subscribers), and the
project is to be transferred to commercial operation.
The implementation of the voice assistant took about 7 months including the pilot.
5. BASIC FUNCTIONALITY
Streaming speech-to-text conversion, which allows you to transcribe audio in real time and get
the results in text format.
a domain model, which enables effective speech recognition in the fields of medicine, telecom
and finance;
a general model with increased resource consumption, suitable for any application in a
wide range of noisy environments.
Text-to-speech with one female or one of three male synthesized voices. Automatic ML markup.
The platform supports the SSML speech synthesis markup language, which lets you
achieve a more natural sound by controlling intonation, speed, stress and other
parameters.
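As an illustration of SSML control, the sketch below builds a short SSML document with Python's standard XML library. The elements `speak`, `emphasis`, `break` and `prosody` come from the W3C SSML specification; which of them Audiogram accepts is not stated here, so treat the element choice, the helper name `build_ssml` and the sample phrase as assumptions.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, stressed: str, after: str,
               pause_ms: int = 300, rate: str = "slow") -> str:
    """Build SSML that emphasizes one fragment, pauses, then slows the ending."""
    speak = ET.Element("speak")
    speak.text = before
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = stressed
    emphasis.tail = " "
    ET.SubElement(speak, "break", time=f"{pause_ms}ms")  # explicit pause
    prosody = ET.SubElement(speak, "prosody", rate=rate)  # speaking speed
    prosody.text = after
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your balance is ", "325", "rubles.")
print(ssml)
```

Building the markup through an XML library rather than string concatenation keeps the document well-formed and escapes special characters in the text automatically.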
Auxiliary Services:
billing - bill generation based on statistics of platform services usage and tariffs;
6. SOFTWARE COMPONENTS
Audiogram is provided to the user as an API for interacting with the platform
directly, together with a set of connectors for conversion to other protocols: SIP connector,
UniMRCP connector, REST gateway.
Personal account
Administrator
ON PREMISE
The customer receives a software distribution and license for installation on their servers.
CLOUD
The solution will be deployed in the cloud, on MTS AI facilities, and the customer will receive only a link to the API for work and a link to the personal account.
The developers recommend the on-premise option for companies that must ensure the
confidentiality of customer data (e.g. banks and telecom companies). With this option,
no information leaves the company.
General Characteristics:
supported audio formats: WAV PCM 16bit, WAV MULAW, WAV ALAW;
available language: Russian.
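Before sending a file for recognition, a client can check that it is in one of the listed formats by reading the WAV header. The sketch below parses the RIFF `fmt ` chunk directly; the format tags 1 (PCM), 6 (A-law) and 7 (mu-law) are the standard WAVE values, while the function names are hypothetical and this check is not part of the Audiogram API.

```python
import struct

# Standard WAVE format tags matching the formats listed above.
FORMAT_TAGS = {1: "PCM", 6: "ALAW", 7: "MULAW"}


def wav_format(data: bytes):
    """Return (format_name, bits_per_sample) read from the 'fmt ' chunk."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            # Layout: format tag, channels, sample rate, byte rate, block align, bits.
            tag, _ch, _rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return FORMAT_TAGS.get(tag, f"unknown({tag})"), bits
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")


def is_supported(data: bytes) -> bool:
    """True for 16-bit PCM, A-law or mu-law WAV data."""
    name, bits = wav_format(data)
    return (name == "PCM" and bits == 16) or name in ("ALAW", "MULAW")
```

Typical use would be `is_supported(open(path, "rb").read())`, where `path` points to the audio file to be checked.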
Modes of operation:
MTS AI has language models available for topics such as telecom and medicine, as well as a general
conversational model;
recognition: no pre-training is required;
General Characteristics:
4 BASIC VOICES
MARVIN: a man's voice
GLEB: a man's voice
ISLAM: a man's voice
MARIA: a woman's voice
SSML markup for fine-grained control of synthesis, which makes it possible to adjust intonation, speed,
stress and other parameters, and to place emphasis in bot phrases;
7. RECOMMENDED DEPLOYMENT MODELS
7.1. ASR
With the ASR domain model, each server with the above configuration can handle single-channel
audio with the following throughput:
in file mode: 700 RTF (Real-Time Factor) across 1000 concurrent threads, with
512 GB of RAM consumed;
in streaming mode: 500 simultaneous streams, with 128 GB of RAM consumed. The delay is
0.271 sec for five-second audio, 0.343 sec for ten-second audio and 0.364 sec for
fifteen-second audio.
With the general ASR model, each server with the above configuration can
handle single-channel audio:
in file mode: throughput of 80 RTF (Real-Time Factor), with 512 GB of
RAM consumed;
streaming-mode figures will be provided once the model is brought to production.
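The file-mode figures can be turned into a rough processing-time estimate. The sketch below assumes that "700 RTF" and "80 RTF" mean a server processes 700 or 80 seconds of audio per second of wall-clock time, which is how the numbers read in this section; the archive size and the function name are hypothetical.

```python
def processing_hours(audio_hours: float, speedup: float, servers: int = 1) -> float:
    """Wall-clock hours needed to transcribe an archive in file mode."""
    return audio_hours / (speedup * servers)


archive_hours = 10_000  # hypothetical archive size, in hours of audio
for name, speedup in (("domain model", 700), ("general model", 80)):
    hours = processing_hours(archive_hours, speedup)
    print(f"{name}: about {hours:.1f} h on one server")
```

The estimate ignores file transfer, queuing and uneven load, so it is a lower bound rather than a service-level figure.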
7.2. TTS
With the current TTS model, each server with the above configuration synthesizes audio
at 290 characters per second.
In the proposed scheme, the servers for ASR/TTS can be horizontally scaled based on the
expected load.
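A first estimate of the number of servers follows from dividing the expected peak load by the per-server capacity quoted in section 7. The sketch below uses the figures of 500 simultaneous streams per ASR server (domain model, streaming mode) and 290 characters per second per TTS server; the function names and the sample loads are hypothetical, and no safety margin is included.

```python
import math

STREAMS_PER_ASR_SERVER = 500        # domain model, streaming mode (section 7.1)
CHARS_PER_SEC_PER_TTS_SERVER = 290  # current TTS model (section 7.2)


def asr_servers(peak_streams: int) -> int:
    """Servers needed to carry the peak number of simultaneous ASR streams."""
    return math.ceil(peak_streams / STREAMS_PER_ASR_SERVER)


def tts_servers(peak_chars_per_sec: float) -> int:
    """Servers needed to synthesize the peak text volume per second."""
    return math.ceil(peak_chars_per_sec / CHARS_PER_SEC_PER_TTS_SERVER)


print(asr_servers(1200), "ASR servers for 1200 simultaneous streams")
print(tts_servers(1000), "TTS servers for 1000 characters per second")
```

In practice the result would be rounded up further to leave headroom for load spikes and server failures.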
8. COMPETITIVE COMPARISON
9. LICENSING