United States Patent
Patent No.: US 10,891,948 B2
Date of Patent: Jan. 12, 2021

IDENTIFICATION OF TASTE ATTRIBUTES FROM AN AUDIO SIGNAL

Applicant: SPOTIFY AB, Stockholm (SE)
Inventor: Stéphane Hulaud, Stockholm (SE)
Assignee: SPOTIFY AB, Stockholm (SE)

Appl. No.: 16/901,870
Filed: Feb. 21, 2018

Related U.S. Application Data
Continuation of application No. 15/365,018, filed on Nov. 30, 2016, now Pat. No. 9,934,785.

Int. Cl.
G10L 15/22 (2006.01)
G10L 25/63 (2013.01)
G10L 15/18 (2013.01)
G06F 3/16 (2006.01)
G10L 25/84 (2013.01)
G06F 16/683 (2019.01)

U.S. Cl.
G10L 15/22 (2013.01); G06F 3/167 (2013.01); G06F 16/683 (2019.01); G10L 15/1822 (2013.01); G10L 25/63 (2013.01); G10L 15/1815 (2013.01); G10L 25/84

Field of Classification Search
CPC: G10L 15/00; G10L 15/26; G10L 15/265; G10L 2015/00; G10L 25/48; G10L 25/51; G10L 25/54; G10L 25/63

Primary Examiner: Paras D. Shah
Attorney, Agent, or Firm: Merchant & Gould P.C.

ABSTRACT

A system, method and computer product are provided for processing audio signals. An audio signal of a voice and background noise is input, and speech recognition is performed to retrieve speech content of the voice. There is retrieval of content metadata corresponding to the speech content, and environmental metadata corresponding to the background noise. There is a determination of preferences for media content corresponding to the content metadata and the environmental metadata, and an output is provided corresponding to the preferences.

[Drawing sheets 1 to 3 omitted: FIG. 1 (flow diagram of the audio signal processing steps), FIG. 2 (block diagram of a system for processing audio signals), FIG. 3 (block diagram of a preference determination system including a processor device, main memory, digital signal processor, storage and peripheral devices, graphics subsystem, output display, and input, speech recognition, metadata, content preference and output engines).]
IDENTIFICATION OF TASTE ATTRIBUTES FROM AN AUDIO SIGNAL

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. Non-Provisional patent application Ser. No. 15/365,018 filed Nov. 30, 2016. To the extent appropriate, the above-disclosed application is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

Field of the Invention

Example aspects described herein relate generally to acoustic analysis, and more particularly to systems, methods and computer products for identifying taste attributes of a user from an audio signal.

Description of Related Art

In the field of on-demand media streaming services, it is common for a media streaming application to include features that provide personalized media recommendations to a user. These features typically query the user to identify preferred content among a vast catalog of media that is predicted to match the consumer taste, i.e., listening or viewing preferences of the user.

For example, one approach to identifying consumer taste is to query the user for basic information such as gender or age, to narrow down the number of possible recommendations. The user is then further asked to provide additional information to narrow down the number even further. In one example, the user is pushed to a decision tree including, e.g., artists or shows that the user likes, and fills in or selects options to further fine-tune the system's identification of their tastes.

One challenge involving the foregoing approach is that it requires significant time and effort on the part of the user.
In particular, the user is required to tediously input answers to multiple queries in order for the system to identify the user's tastes.

What is needed is an entirely different approach to collecting taste attributes of a user, particularly one that is rooted in technology so that the above-described human activity (e.g., requiring a user to provide input) is at least partially eliminated and performed more efficiently.

There is also a need to improve the operation of a computer or special purpose device that provides content based on user tastes by minimizing the processing time needed to compile taste profile information.

BRIEF DESCRIPTION

The example embodiments described herein provide methods, systems and computer products for processing audio signals. An audio signal of a voice and background noise is input, and speech recognition is performed to retrieve speech content of the voice. Content metadata corresponding to the speech content, and environmental metadata corresponding to the background noise, are retrieved. In turn, a determination is made of preferences for media content corresponding to the content metadata and the environmental metadata, and an output is provided corresponding to the preferences.

In one example aspect, the content metadata indicates an emotional state of a speaker providing the voice.

In another example aspect, the content metadata indicates a gender of a speaker providing the voice.

In yet another example aspect, the content metadata indicates an age of a speaker providing the voice.

In still another example aspect, the content metadata indicates an accent of a speaker providing the voice.

In another aspect, the environmental metadata indicates aspects of a physical environment in which the audio signal is input.

In yet another aspect, the environmental metadata indicates a number of people in the environment in which the audio signal is input.

In another aspect, the input audio signal is filtered and formatted before the speech content is retrieved.
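
By way of illustration only, the filtering in this aspect might include trimming silence from the beginning and end of the input, which is the example given for step 102 in the description. The following Python sketch shows one possible form of that trimming. The function name `trim_silence`, the amplitude threshold of 0.02, and the representation of the audio signal as a sequence of normalized floating-point samples are assumptions made for this example and are not taken from this disclosure.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.02) -> List[float]:
    """Return samples with leading and trailing quiet samples removed.

    A sample counts as quiet when its magnitude is below `threshold`.
    Both the threshold value and the float-sample format are hypothetical.
    """
    start = 0
    end = len(samples)
    # Advance past leading samples that are quieter than the threshold.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Step back past trailing samples that are quieter than the threshold.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    # Quiet samples between the first and last loud sample are kept.
    return list(samples[start:end])
```

A call such as `trim_silence([0.0, 0.01, 0.3, -0.4, 0.0, 0.2, 0.005, 0.0])` would keep only the span from the first to the last sample at or above the threshold, leaving any pause between spoken words untouched. A practical system would more likely measure energy over short frames than over single samples.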
In one aspect, the speech content is normalized to remove duplicated and filler words, and to parse and format the speech content.

In another example aspect, the audio signal is input from a user in response to querying the user to provide an audio signal.

In still another example aspect, the output is audio output of music corresponding to the preferences.

In another example aspect, the output is a display of recommended next music tracks corresponding to the preferences.

In another example aspect, the preferences also correspond to historical listening practices of a user who provides the voice.

In another example aspect, the preferences also are associated with preferences of friends of a user who provides the voice.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the example embodiments of the invention presented herein will become more apparent from the detailed description set forth below when taken in conjunction with the following drawings.

FIG. 1 is a flow diagram illustrating a process for processing audio signals according to an example embodiment.

FIG. 2 is a block diagram of a system for processing audio signals according to an example embodiment.

FIG. 3 is a block diagram illustrating an example preference determination system constructed to determine preferences for media content from an audio signal according to an example embodiment.

DESCRIPTION

The example embodiments of the invention presented herein are directed to methods, systems and computer program products for processing audio signals to determine taste attributes. Generally, example aspects provide a framework that provides media content preferences (such as a next musical track to play) from an audio signal such as a user's voice.

FIG. 1 is a flow diagram illustrating a process for determining taste attributes for media content according to an example embodiment.

Briefly, according to FIG. 1, audio signals of a voice (also referred to as voice signals) and background noise are received by a microphone.
In turn, an audio signal processor converts the voice signals to digital or other representations for storing or further processing. In one embodiment an audio signal processor performs a voice recognition process on the voice signals to generate digitized speech content of the voice signals that can be stored and further processed. The digitized speech content and background noise are, in turn, further processed to retrieve voice content metadata corresponding to the speech content and environmental metadata corresponding to the background noise. A processing unit (e.g., a computer processor) processes the voice content metadata and environmental metadata to generate taste attributes that can be used to determine preferences for media content. The taste attributes are output, for example, through a network interface.

Thus, in step 101, audio signals are received by a microphone communicatively coupled to an audio signal processor. In one example, the audio signals include voice signals and noise received via a mobile device (e.g., via a mobile device or through an application that causes the mobile device to receive and process audio signals). In one embodiment, the mobile device transmits the audio signals to another remote system for processing, or processing might be performed in the mobile device itself. The audio signals may be recorded in real-time, or may correspond to previously-recorded audio signals.

In step 102, the input audio signals are filtered and formatted. Thus, for example, the audio signal might be processed to remove silences from the beginning and/or the end of the audio input.

In step 103, speech recognition is performed to retrieve content. Prior to extracting the content or performing speech recognition, additional processing can be applied to the input audio signal, such as using frequency-