You are on page 1of 11
«2 United States Patent Hulaud US 10,891,948 B2 *Jan, 12, 2021 (10) Patent No.: (4s) Date of Patent: oa oy ™ 03) wo an @) 6s) @) Gl (2) IDENTIFICATION OF TASTE ATTRIBUTES FROM AN AUDIO SIGNAL Applicant: SPOTIFY AB, Stocklolm (SE) Inventor: Stéphane Hulaud, Stockholm (SE) Assignee: SPOTIFY AB, Stockholm (SP) Notice: Subject to aay disclaimer, the team ofthis patent is extended or adjusted under 35 USC. 154(b) by 61 days. ‘This patent is subject to a terminal dis- claim, Appl. No.: 18901,870 Filed: Feb, 21, 2018 Prior Publication Data US 201810182304 Al Jun. 28, 2018, Related U.S. Application Data Continuation of application No. 15/365 018, fled on Nov: 30, 2016, now Pat, No, 9,934,785. Int. Cl, G01 1522 (2006.01) G01. 25/63 (201301), G01 158 (201301), GOOF 3716 (2006.01) G01. 2585 (201301), GOO 16643 (201901) Us. cL cc GI0L, 15/22 (2013.01); GOOF 3/167 (QO13.01): GO6F 161683 (2019 01): GOL 15/822 (2013.01), GIOL 35/63 (2013.01); GIOL, 15/1815 (2013.01); GIOL 25/84 (58) Field of Classification Seareh CPC os GIOL 15100; GIOL 15/26, G1OL 15/265; {G10L. 2015/00; G101. 25/4%; GOL. 25/51 GAOL. 25/54; G1OL 25/63: GOIC 21/3608: (GOSP 17/30; GOSE 17/3011 ‘See application file for complete seurcl history. 66) References Cited US. PATENT DOCUMENTS 591822) A 61999 Blum eal Soivoa7 & 51999. Sone 6539395 BL 32003. Gjendingen a ‘65083803 B2 82003, Inoue eta {6801906 BL 102001, Bates eta ‘8865.73 BL '32005. Caja tal (Continued) (OTHER PUBLICATIONS T.Polzin et al, "Detecting Emotions in Spech, bps www smuedu/pubfiles;publpotzin thomas 1998 L/polzin thomas 1998-1 pai (1998), (Continved) Primary Examiner — Paras D Sha (74) Attornes, Agent, or Firm — Merchant & Gould PC. on ABSTRACT A system, method and computer product are provided for processing audio signals. An audio signal of a voice and background noise is input, and speech recognition is per formed to retrieve speech content of the voice. There is retrieval of content metadata corresponding to the speech content, and environmental metadata corresponding to the background aoise. There is a determination of preferences ‘ormedia content corresponding to the eontent metadata and the enviroamental metadata, and an output is provided corresponding to the preferences. aad —— ao =m i Taenzearmnana ] [Rares nese eae 7 Somer Reseyann ba | J+ siemens te ate Sea So Be | ee st | 66) 7013301 Thrs000 Zsi s79 Tipsas Taam 7277766 Tsi8oss Tons Tas 9e Tesx.009 sss105 20020002809 | aonv019974 an20172379| sopsooxetat an0so191764 dooson0sina 20030220908 aooson29ss7| sonwoosaat 2oov01t1262 20090122660 amngotoa4or aopuoni sist doosorsiaos sos00s0017| dnosorearo? n0s0n89075 ao0s03¥9600 2aonedoest02 3050085105 aonsonTses aonwo14oss? 20080183286 2nns02k9622 soon o0t3 768 doonoo47n9| 007 0088215 al a a al US 10,891,948 B2 Page 2 References Cited USS. PATENT DOCUMENTS. 32006 a6 72006 $007 Sam 02007 "42008 112009 32010 010 Sania 12002 1/2002 13003 1bauos 102003 003 1/2003 12003 "004 2008 ‘2004 voauos Todo 2004 "ys Fa00s 12005 100s 2006 ¥i006 4006 06 7006 122006 2007 32007 007 Home al Gang e ‘eae ea Cremer ol thie Khun ea Docketson ct Seppanen ea. Camere ote a Vanian rot isis 700270 ‘Gntinge oa ‘Aol et cin Ta et Walls ca: Dunning «al Eaton et Maison GioL 18s 704251 Crockatt ‘ait Orbach thet Lone eta tang Lone eta Kawa ta xu het ot a Cremer Bopdsnoy ‘Ameren Khor ta Kang etal. Dhan etal Toms @ a 2007 0106405 AL Cook e aonvisosts Al Brave ut a gnv7nnseons Al ile 2o0n0261587 Al Soowo01ss0n AL Kime a aniwonoten Al Seheluon et a 20080101782 AL Kellock ea. Souw0124082 AL $2008 Divakaran ol Songoisos40 AL 82008 Hoos ea 20080193017 AL 2008 Wilson ea. goowneses7t AL 102008 Eronen 3o090240379 AL 102008 Malsos ot al googorrssto AL 11/2008 Andersen dorowosita? At '22010 Nagatome doru0pesss9 AL 32010 _Atsmon ot al SoUMOLyIS AL 62010 Goldstein tal ao1woies0o1 AL 72010 Zureket ao120201362 AL 82012 Crossan ta goia0e24706 AL 92012 Hivang eta 20190167029 AL 6 201% Friesen et SO1MOR3H AL 122013 Brown cal aovaunn39%4 AL 102014 Ashiawa ea ao1sbisiong Al 62015. Weinsan ea. 20150332667 AL* 112015 Mason 101. 1902 704249 20160101486 AL 42016 Penile. 2O1GDI2S7 AL $2016. Schroeter et a soils AL 62016 Uae a, 2O160R3600S AL 112016 Cao eal aoua0s7s747 AL 122016 Orr eal. aorn0si7ss AL 22017 Line 2ITOLSTSTG AL* $2017 Des Farias... GOOF 20170213247 AL 72017 Balasbramanian eta. (OTHER PUBLICATIONS [Ma Alita, “Gender Recognition System Using Speech Signal’ np aireese ong.journl jst papers 11 2jesei0| pl (2012), Jeancue Rouas ot al, “Language ani varity verification on todas news fr Portuese, hp wwwines- px plinlcadores Ficheinoe 4971 pl (208). * cited by examiner US 10,891,948 B2 Sheet 1 of 3 Jan. 12, 2021 U.S. Patent 204- L ‘Sls eyepeieWy leyvewuownug -B 1USILOD TUEIUOD sindino indy yeuuo pue esieg © SpION J9ji'4 eAOWOY © spiom pereaiidng exowey * qUsIUOD ez/eWON, 901 = 7 (hued 'dnos9, squscoy PWS ‘aUO|y ‘8'!) WEWUOUIAU |BIO0g © eby « (ped ‘88109 ‘Jooupg "s0OpINO ‘UIELL ‘ouj2W ‘sng '3')) WauIUONAUS edIsAYe « [EIEPEIOIN [eWOUUOIAUS OAOLIEY BVPEIEIN JUSTO ersinEYy sepueg + |y. aleig |euonowy « onjuB0oey yoseds « quauog ensuiey / 01 zor~] reubig Bugyeuuo4 ue Buoy bor ynduj opny i US 10,891,948 B2 Sheet 2 of 3 Jan. 12, 2021 U.S. Patent ‘Sy081, KON Loz ‘Ae|q 0} UBUD, pepuauiwiosay (jeuodo) é old soz 802 —] (“"wsjuog panes ‘s6u0s Jo BuNey) a1s21 ss8s7) ° Buluayst) jeauo}sI} S087) © ssyndul jeiuewuo.1AU3 oIpny sindu eepeyew yseds « fe— ‘8109 pue sinduy snoinaig © indy} (peyewuo4 pue pesied) }xo1 0} yoseds « ‘up Uo sjustoIys09 BuIsf Aq “8! jue} joeIeg, ‘eyse | Spueiy « i ayeig feuonOW/e\ePeIeW NYSOd JO} YOO] (BAINSOg) 2uQ qUaLNg oy JeNWIS 81 (s)nduj 3827 4! 4007 (BANEBEN) © ‘91 Indu] snoIeld 109g sinduy Aresg7y pu0n981}09 oisnw voz Aisi Suneu 9 Sues! pues pue Jas) 807 ‘sanding BIEPeISY\ |B]USWUOUAU 22 JUBIUOD ‘GxeL) jUETVOD Loz oz sisenboy snoinasd ssf) XN 002 U.S. Patent Jan, 12, 2021 Sheet 3 of 3 US 10,891,948 B2 300. oN af 0 315 i Processor Digital Main Device Signal Memory Processor 305 Y eae Peripheral eal Swrage Graphics a Device(s) | | Devices) Medium | | Subsystem Device 330 340 380 360 361 320 Output 370] Display 4 Input Engine }4— 388 ‘Speech Recognition Engine _|-7~ 390 Metadata Engine Ly 392 Content Preference Engine _[/~ 994 Output Engine }y- 396 FIG. 3 US 10,891,948 B2 1 IDENTIFICATION OF TASTE ATTRIBUTES FROM AN AUDIO SIGNAL CROSS-REFERENCE TO RELATED "APPLICATION ‘This application is» continuation of US. Non-Provisional patent application Ser. No. 15/365,018 filed Nov. 30, 2016. ‘To the extent appropriate, the above-disclosed application is, hereby incorporated by reference in its entirety BACKGROUND OF THE INVENTION Field of the Invention Example aspects described herein relate generally 10 ‘acoustic analysis, and more particular to systems, methods ‘and computer produets for identifying taste attributes of a user fom an audio signal Description of Related Art In the fie of on-demand media steaming services, its ‘common for a media streaming application to inlude fea tures that provide personalized media recommendations to 8 user, These features typically query the wser to identify preferred content among a vast catalog of media that is predicted to match the consumer taste, ie, Hstening or viewing preferences of the see. For example, one approach to identifying consumer taste js to query the user for basic information such ws gender oF age, © narrow down the number of possible ecommenda- tions. The user is then further asked to provide additional {information 1o narrow down the numbor even further. In one ‘example, the user pushed toa decision tee including, eg. nists oF shows that the use like, and fills in or selects ‘options to further fine-tune the system's identification of their tastes ‘One challenge involving the foregoing spproach i tht it requires significant time and effort on the pat of the user. Ia particular, the user is required to tediously input answers to utile queries inorder forthe system to identify the wser's ‘What is acedod is an entirely different approsch to cole lecting taste attributes of a user, particularly one tbat is rooted in technology so that the above-described human ‘setivity (eg, requiring a user to provide input) is atleast Parially eliminated and performed more elicently There is also a need (0 improve the operation of a ‘computer or special purpose device that provide content based on user tastes by minimizing the processing time roeded to compile taste profile information. BRIEF DESCRIPTION ‘The example embodiments described herein provide methods, systems and computer products for processing audio signals. An audio signal of a voice and background noise is input, and speoch recognition is periommed retrieve speech content of the voice. Content metadata ‘corresponding to the speech content, and environmental metadata corresponding to the background noise is retrieved. In tum, a determination of preferences for media ‘content comesponding t0 the content metadata and the ‘environmental metadata, and an output is provided corr sponding to the preferences 0 o 2 In one example aspect, the content metadata indicates an ‘emotional state ofa specker providing the voice In another example aspect, the content metadata indicates 4 gender of speaker providing the voice Tn yet another example aspect, the content metadata indicates an age of a speaker providing the voiee. In tll another example aspect, the content metadata indicates an aecent of a speaker providing the voice ‘In another aspect, the environmental metadata indicates aspects of a physical environment in which the audi signal is input, In yet another aspect, the envitonmental metadata jndi- cates & number of people ia the enviroament ia which the audio signal is input. Tn another aspect, the input audio signal is fered and {formatted before the speech content is retrieved. none aspect, the speech content is normalized 1 remove duplicated and ‘filler words, and to parse and format the speech content, In another example aspect, the audio signal is input fom ‘a user in esponse to querying the user o provide a audio signal ‘In stl another example aspect, the output is audio output ‘of mas comesponding tothe preferences, In another example aspect, the output is a display of recommended next matsc tacks comresponding to the pref In another example aspect, the preferences also corre- spout! to historical listening practiees ofa user who provides the voice, ‘In another example aspect, the preferences also ate 2880- ciated with preferences of fiends ofa user who provides the [BRIEF DESCRIPTION OF THE DRAWINGS The features and advantages ofthe example embodiment ofthe invention presented herein will hecome more appar trom the detailed description et forth below when taken in csonjusction with the Following drain FIG. 1 is flow diagram illustrating a process for pro cessing audio signals aeording to an example embodiment. TFIG. 2 is block diagram ofa system for processing aco signals cording o an example embodiment. FIG. ¥ isa block disgram illistating an example prefer- ence determination system constructed to determine prefer fences for media content rom an audi signal according oan ‘example embodiment bi "RIPTION ‘The example embodiments of the invention presented boerein are directed (© methods, systems and computer pro- gram products for processing audio signals to determine {asteatributes. Generally, example aspects provide a frame- ‘work tha provides media content preferences (such as next musical track to play) from an audio signal suchas a user's PIG. 1 is flow digram illustrating a process for deter- mining taste atributes for media content according to an ‘example embodiment, ‘Briefly, according 1 FIG. 1, audio signals ofa voice (also referred fo as voice signals) and background noise are recived by a microphone. In tur, an audio signal processor to converts the voice signals to digital or other represen tions for storing or further processing. In ane embodiment an ‘dio signal processor performs a voice recognition process US 10,891,948 B2 3 ‘on the voice signals to generate digitized speech content of the voice signals that can be stored and further processed. “The digitized speech content andl background noise are, ia turn, further processed to retrieved voiee content metadata ‘corresponding 10 the speech content and environmental mictadata corresponding tothe background noise. A process- ing unit (eg, a computer processor) processes the voice ‘content metadata and environmental metadata to generate taste attributes that ean be used to determine preferences for media content. The taste attributes are outpat, for example, through a network interes “This, instep 101, audio signals are received by a micro- phone communicatively coupled to an audi signal proces- for. Tn one example, the audio signals inchude voice Signals nnd nose received via a mobile device (et, via mobile device or through an application that ‘causes the mobile device to receive and process audio, signals) Inone embodiment, the mobile device transmits the suo signals to another remote system for processing, of processing might be performed in the mobile device itself ‘The audio signals may be recorded in real-time, or may correspond to previously-recorded audio signals. Tn step 102, the input audio signals are filtered and formatted. Thus, for example, the audio signal might be processed to remave silences from the beginning and/or the ‘end of the aio inp In step 103, speech recognition is performed to retrieve ‘content. Prior to extracting the content or performing speech recognition, addtional processing can be applied 10 the ‘input audio signal, such as using frequency-