Find your next favorite book

Become a member today and read free for 30 days
Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think

Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think

Read preview

Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think

ratings:
3/5 (2 ratings)
Length:
434 pages
7 hours
Released:
Mar 26, 2019
ISBN:
9781328715555
Format:
Book

Description

**To chat with the author, ask your Alexa device to "open the voice computing book."**

The next great technological disruption is coming


The titans of Silicon Valley are racing to build the last, best computer that the world will ever need. They know that whoever successfully creates it will revolutionize our relationship with technology—and make billions of dollars in the process. They call it conversational AI.
 
Computers that can speak and think like humans may seem like the stuff of science fiction, but they are rapidly moving toward reality. In Talk to Me, veteran tech journalist James Vlahos meets the researchers at Amazon, Google, and Apple who are leading the way. He explores how voice tech will transform every sector of society: handing untold new powers to businesses, overturning traditional notions of privacy, upending how we access information, and fundamentally altering the way we understand human consciousness. And he even tries to understand the significance of the voice-computing revolution first-hand — by building a chatbot version of his terminally ill father.
 
Vlahos’s research leads him to one fundamental question: What happens when our computers become as articulate, compassionate, and creative as we are?
Released:
Mar 26, 2019
ISBN:
9781328715555
Format:
Book

About the author

JAMES VLAHOS covers the frontiers of technological change for publications like Wired, the New York Times Magazine, Popular Science, and Scientific American; he also writes for The Atlantic and GQ. In 2017 he engineered a chatbot that helps preserve the personality and tell the life story of his late father, a project that resulted in a Wired cover story and attracted attention from around the world. He lives in El Cerrito, California.


Related to Talk to Me

Related Books

Related Articles

Book Preview

Talk to Me - James Vlahos

Contents


Title Page

Contents

Copyright

Dedication

Introduction: Visionaries

Competition

Game Changers

Assistants

Titans

Innovation

Voices

Rule Breakers

Personalities

Conversationalists

Revolution

Friends

Oracles

Overseers

Immortals

Afterword: The Last Computer

Notes

Index

About the Author

Connect with HMH

Copyright © 2019 by James Vlahos

All rights reserved

For information about permission to reproduce selections from this book, write to trade.permissions@hmhco.com or to Permissions, Houghton Mifflin Harcourt Publishing Company, 3 Park Avenue, 19th Floor, New York, New York 10016.

hmhbooks.com

Library of Congress Cataloging-in-Publication Data

Names: Vlahos, James, author.

Title: Talk to me : how voice computing will transform the way we live, work, and think / James Vlahos.

Description: Boston : Houghton Mifflin Harcourt, 2019. | Includes bibliographical references and index.

Identifiers: LCCN 2018043595 (print) | LCCN 2018052370 (ebook) | ISBN 9781328715555 (ebook) | ISBN 9781328799302 (hardback) | ISBN 9781328606716 (international edition)

Subjects: LCSH: Voice computing. | Ubiquitous computing. | Computer networks—Social aspects. | BISAC: COMPUTERS / Intelligence (AI) & Semantics. | TECHNOLOGY & ENGINEERING / Social Aspects. | TECHNOLOGY & ENGINEERING / Inventions. | BUSINESS & ECONOMICS / Corporate & Business History.

Classification: LCC QA76.9.V65 (ebook) | LCC QA76.9.V65 V54 2019 (print) | DDC 006.2/48392—DC23

LC record available at https://lccn.loc.gov/2018043595

Cover design by Mark R. Robinson

Cover images © Shutterstock

Author photograph © Barbara Butkus Photography

v2.0319

To my father, John, because he wasn’t able

to make it to the end of this journey, and to my wife, Anne, because she was

Introduction:

Visionaries

The reason we are asking people to be stealthlike, said the man in the green shirt, is this is a big fucking idea.

Eight people sat around him on sofas and chairs that had been set up in the corner of an airy loft at Twenty-Fifth and Broadway in New York. They nodded in earnest agreement, little Yeah, man thought bubbles over their heads. And what’s interesting about this big fucking idea, the man continued, "is that it’s like a lot of big fucking ideas: It’s simple. It’s a simple idea that everybody should have thought of. But we thought about it first."

The man making the pronouncements was Peter Levitan, the CEO of a tech start-up called ActiveBuddy. It was March of 2000, and ActiveBuddy had $4 million of venture capital in the bank, a dartboard on the wall, and pricey artwork in the reception area. History was in the offing, the people at the meeting believed, and a documentary film crew hovered around the office to chronicle it.

The big idea had been the inspiration of the company’s president, Robert Hoffer, and the chief technical officer, Tim Kay. The genesis story went like this: Hoffer and Kay were internet veterans, having created an online version of the yellow pages in the mid-1990s. Brainstorming for new ideas as the decade drew to a close, Hoffer and Kay were texting over AOL’s instant messaging platform, AIM, when Hoffer asked Kay to check the stock price for Apple.

Kay was about to look up the information and write Hoffer back. But then he had an idea. An ingenious programmer, he spent a few minutes cranking out some code that would enable a computerized agent, or bot, to automatically write Hoffer back instead. It worked, and Hoffer got his stock quote.

This tiny exchange suggested something much bigger to Hoffer and Kay. At the time, the world was obsessed by the web. Netscape fought Internet Explorer in the browser wars. The search engines AltaVista, Yahoo, and a newcomer called Google battled for the public’s allegiance. Going on quests to find information online was such a cultural phenomenon that it had acquired a sporty nickname—surfing the web.

Hoffer and Kay, though, weren’t sold on surfing. The stock quote bot had given them a very different idea, one they believed would make interacting with computers more natural, powerful, and fun. Screw search. Instead, what if people could access the riches of the digital world simply by having conversations, in everyday language, with someone who seemed like a friend?

That someone, of course, would not be an actual human but rather an entity that imitated one: a conversation-making bot, or chatbot. It would communicate via text over AIM and other instant-messaging platforms. People would just need to add the chatbot as a contact like they would their human friends. They could then ask the bot for stock quotes, news updates, sports scores, movie times, dictionary definitions, and horoscopes. They could play games, get trivia, and fetch listings from the yellow pages. They could even initiate a web search.

After developing the technology, ActiveBuddy unveiled its first product in March 2001—a chatbot called SmarterChild. The company spent no money to market it, but inexplicably, the bot took off. Users, delighted by being able to have rudimentary conversations with a computer, shared transcripts of their chats online and encouraged friends to give SmarterChild a go. Then, in May, the company got a promotional opportunity that Levitan viewed as a gift from God. It seemed that the band Radiohead wanted ActiveBuddy to create a chatbot called GooglyMinotaur to promote its upcoming Amnesiac album.

Before long, SmarterChild and its creators were being written up in publications across the country and interviewed on television by the likes of Ted Koppel. Madonna and other musicians wanted bots; Yahoo and Microsoft were sniffing out the company for a possible acquisition. Within a year, SmarterChild had amassed 9 million users. The company estimated that a staggering 5 percent of all instant-messaging traffic in the United States was between people and SmarterChild.

Pure success on the outside, however, was more nuanced behind the scenes. Conversation logs with users revealed that the vision of a helpful, information-providing chatbot was not coming to life quite as the founders had envisioned. Executives asking for stock quotes and couples seeking movie times did not comprise a sizable portion of those millions of users. Instead, a vast number of them seemed to be bored teenagers who delighted in hurling expletives, racial slurs, and sexual propositions at SmarterChild.

This was a profound disappointment. But the logs also showed a pattern that validated the founders’ grandest dreams for what conversational computers could ultimately become. Amid the seas of puerility, there were islands of genuine, thoughtful discussion. Or, at least, attempts at them. People wanted to talk about their hobbies and favorite bands. They were confessing things. They were lonely and just wanted to chat with SmarterChild—and sometimes did so for hours.

Hoffer was intrigued. Science fiction teemed with portrayals of AIs gone bad—Frankenstein, HAL, the Terminator—but rosier scenarios resonated more with him. He especially loved the 1999 movie Bicentennial Man, in which Robin Williams plays a sensitive, intelligent android who wants to become human. If people really wanted to talk to SmarterChild, then Hoffer figured that he should make it his mission to enable that. From the very beginning, he later recalled, we always had this notion of having your best friend on the internet.

The trick was figuring out how to make that happen. Retrieving factual tidbits—phone numbers, sports scores—from digital databases and regurgitating them was not enough to make SmarterChild a personable friend. The bot needed to be able to make small talk. So ActiveBuddy hired a small stable of creative writers who scripted tens of thousands of responses in advance; SmarterChild could automatically say them when the right conversational moments arrived.

One of those writers was Pat Guiney, who had given up a rock musician’s life to pursue a career in new media. He curated a consistent personality for SmarterChild, transforming dry, pro forma responses into juicier ones. He gave the bot a sense of humor, his own sardonic one, which came through so clearly that colleagues would joke that when you chatted with SmarterChild you were essentially chatting with Guiney. He and the other writers also built out the bot’s stores of knowledge so it could have at least a few intelligent things to say about any topic that was popular with users, whether that was baseball or reality TV. SmarterChild even gained the ability to remember pieces of information—that user A liked the White Stripes, while user B preferred Jay-Z.

To Hoffer, this was only the beginning. He believed that with further development there was virtually no limit to how conversationally capable, emotionally perceptive, and personalized chatbots could become. The relationships between people and their bots might last for decades: friends for life, and perhaps even after, with the bots remembering key things about the deceased.

Hoffer’s dreams, unfortunately, were destroyed by the dot-com crash in 2001. The investors who had put up $4 million weren’t thinking about life years into the future. They wanted to know how the company was going to make money right away. Hoffer and Levitan believed that once the user base got big enough it surely could be monetized. But they were vague on precisely how. The counterargument, from Kay and the investors, was that millions of cursing thirteen-year-olds were never going to pay the bills. After many shouting matches, Hoffer’s camp lost the fight, and in early 2002, both he and Levitan left the company.

Stephen Klein took over as CEO; ActiveBuddy was eventually rechristened as Colloquis, a name redolent with Office Space–style corporate tedium. Its business was to build chatbots that answered customer-service questions for corporations, including big ones such as Time Warner Cable, Vonage, and Comcast. Three years later, Colloquis was acquired by Microsoft. It was a successful exit for the original investors. But Microsoft, strangely, soon lost interest in its new prize. A scandal that broke at the end of 2007 certainly didn’t help matters. A Santa chatbot for kids, powered by the Colloquis technology, told one user that it’s fun to talk about oral sex.

Guiney and the last remaining bot builders were laid off in 2008. Hoffer had long since moved on, but he hadn’t forgotten the original vision—a vision that was now entombed in the bowels of Microsoft. Conversational computing was a big, expletive-worthy idea.

Las Vegas, 2018. The city is hosting the annual Consumer Electronics Show, and all around, the attendees—a record 180,000 of them—are gabbing with computers. Palm-size slabs, flower vase–like cylinders, and what look like cigarette lighters emblazoned with brand logos. Devices with screens and ones without. Cars, ceiling fans, electrical outlets, and clothes dryers. Cameras, door locks, garden sprinklers, and coffee makers. If Hoffer had pulled a Rip Van Winkle, lying down for a nap in 2008 only to wake up here a decade later, he might think that he had been asleep for more like thirty years.

People aren’t just typing messages, either, as in the days of SmarterChild. The trade show’s 2.7 million square feet of exhibitions reverberate with the sounds of people actually talking to technologies that obediently fulfill commands and often talk back. Amid the cacophony, people might be instructing blinds to close, air conditioners to start, and speakers to play the song, Hot in Herre. Asking a countertop screen for a carnitas recipe, telling a refrigerator to add pork shoulder to the shopping list, and firing up a slow cooker. Controlling security cameras, robot vacuum cleaners, printers, ovens, and scent dispensers. Asking the mailbox if it has a letter in it, the car if the oil needs to be changed, and the lawn if it is thirsty.

All told, chatty helpers are accessible from thousands of the devices being shown off at CES, and it seems that they can do just about anything. Imagine what they can do for you: They can be commanded to start your car, check the tank, and find the nearest gas station. To pass time during the drive, they pull up audio from NPR, CNN, or the Wall Street Journal. They cue up slow jams and speed metal—virtually any song by any artist. They produce the sound of breaking waves, a ticking grandfather clock, or rain on a tin roof.

Talking digital genies can suggest baby names, order diapers, and read bedtime stories. They monitor how long the baby slept and how many times he pooped. They tell kids to clear their dishes, clean up their rooms, and look both ways before they cross the street. They remind seniors to take their medications and play memory games to keep their brains sharp.

In the bathroom—CES brims with ideas for this room of the house—chatty mirrors can share makeup recommendations, provide traffic information for the morning commute, and give affirmations: Damn, girl, you have got all of the right moves. Responding to spoken commands, showers spring to life. Toilets open, heat their seats, and even make small talk.

In the bedroom, when you wake up, talking computers can report on how well you slept, ask how you are feeling, and make mood-brightening suggestions: Perhaps some exercise would be uplifting? These assistants help pick a hiking trail and monitor how many steps you take. Or, if something calmer is in order, they lead yoga sessions at home.

If the activity has stimulated your appetite, talking computers can tell Starbucks to have a latte and pumpkin bread waiting at the counter. Or get Denny’s to whip up a Grand Slam. They order pizza and get six packs of beer delivered. They track the leftovers in the fridge and nag you to wash the dishes.

If the members of your family are away, voice assistants can share their current whereabouts. To pass the time until they get back, the assistants serve as virtual friends. They suggest Mother’s Day gift ideas or date-night ideas. They direct the fish tank to feed the fish, the cat bowl to feed the cat, and the birdfeeder to feed the birds. If you are away, they automatically tell the dog that you love him via a speaker on his collar.

For productivity, voice-controlled assistants can instruct your bank to make payments, ask insurance companies for updates on claims, and search for flights. They help to find plumbers, real estate agents, and roofers. They place orders for virtually any product ever made.

Endlessly helpful, the talking contrivances at CES also are boundlessly knowledgeable. Many of them can answer questions pertaining to daily life: When is my next meeting? What’s the traffic like on I-80? Or What time does Gordo Taqueria close? But they also successfully tackle questions that require broader knowledge: When was Alexander Hamilton born? How tall is the Burj Khalifa? Who is the quarterback for the 49ers? Or How many calories are in an avocado?

The companies unveiling all of these voluble devices comprise a roll call of familiar names: Ford, Toyota, and BMW. Sony and LG. Honeywell, Kohler, and Westinghouse. HP, Lenovo, and Acer. But these companies typically make only the vessels through which computers speak—their bodies. Their artificially intelligent brains, in an overwhelming majority of cases, are made either by Amazon or Google. The name of Amazon’s AI is Alexa; her rival is the Google Assistant.

The two tech giants are going about their business in very different ways at CES. Google is pulling out all of the promotional stops to declare that this is its trade show, its moment. All around Las Vegas, Google has made sure that a certain two words are ubiquitous. They are the ones that tell the Assistant to listen to users through any connected device: Hey, Google.

The words are spelled out in giant letters on the monorail train that glides past the Strip: Hey, Google. On billboard-size video screens, murals, and walls: Hey, Google. Above a two-story-high slide, a tabletop townscape, and a fifteen-foot-tall gumball machine. In a lavish promotional video projected onto the inside of a dome. On the hats of the company’s white-jumpsuit-clad emissaries. Repeated like a mantra, the phrase simultaneously feels like an introduction to a technology and a declaration of its dominance.

Amazon doesn’t bludgeon conferencegoers with as much branding, perhaps because the company feels that it has less to prove. The company enters the show having captured around 75 percent of the American market for smart home speakers (those featuring a voice assistant). At the time of the show, some 1,200 different companies are integrating Alexa into 4,000 smart home products while Google claims partnerships with 225 brands and 1,500 products. (Counting Android phones, though, the Assistant is available through 400 million devices worldwide, so Google isn’t exactly hurting.)

But if Amazon isn’t flaunting itself with any giant gumball machines, the company isn’t lying low, either. Amazon’s name is on the lips of virtually every product rep and journalist. The company hosts a daylong set of talks with titles such as Amazon’s Quest for Alexa to Be Everywhere.

As costars of the show, Amazon and Google are the ones setting the overall tone. But the two companies have not come to hawk any particular product. Instead, they are articulating a holistic view: that the world is now ruled by voice. In a packed talk, David Isbitski, Amazon’s chief evangelist for Alexa, summarizes the theme. We’re living in that future now where we can talk to technology like human beings, he says.

// Part One //

Competition

1

Game Changers

Every decade or so, there is a tectonic shift in how people interact with technology. Multibillion-dollar fortunes await the companies that define the paradigm of the new era while the also-rans go bankrupt or, worse, become passé. IBM ruled the days of mainframe computers; Microsoft dominated the desktop era; Google exploded with search in the internet age; and Apple and Facebook skyrocketed when computing went mobile.

The latest paradigm shift is underway.

The latest platform war is being fought.

The latest technological disruption is happening, and it promises to be one of the most sizable and momentous that the world has ever seen.

We are entering the era of voice computing.

Voice is becoming the universal remote to reality, a means to control any and every piece of technology. Voice allows us to command an army of digital helpers—administrative assistants, concierges, housekeepers, butlers, advisors, babysitters, librarians, and entertainers. Voice disrupts the business models of the world’s most valuable companies and creates opportunities for new applications. Voice puts artificial intelligence directly in the control of consumers. And voice introduces the world to relationships long prophesied by science fiction—ones in which personified AIs become our helpers, watchdogs, oracles, and friends.

The advent of voice computing is a watershed moment in human history because using words is the defining trait of our species—the ability that sets us apart from everyone and everything else. Our internal awareness centers not on the air in our lungs or the blood in our veins but on the words in our brains. Words mediate our relationships. They shape thoughts, express feelings, and communicate needs. They launch revolutions, save lives, and inspire hatred or love. They embody and record all that we know.

Computers, by comparison, have always been linguistically feeble. To be sure, they have an unsurpassed ability to warehouse words and shuttle them around; the internet has more than 4.5 billion pages of content. But until very recently, computers have scarcely begun to understand humanity’s torrents of words. To make sense of the texts, emails, documents, and speech that we volley back and forth. To hear and talk back.

Thanks to a recent range of breakthroughs, however, the fantasy of teaching computers to communicate in natural language—a field known as conversational artificial intelligence—has gained purchase in reality. The list of advances starts with the exponentially improved computing power as predicted by Moore’s Law. The rise of mobile—the fact that we all carry around potent pocket-size computers—has also been a significant enabler of voice.

Machine learning—in which computers gain capabilities by analyzing data rather than being explicitly taught—has also been critical, allowing developers to blast through problems that have lingered for decades. And cloud computing is a final and often overlooked factor. Conversational AI requires immense power. Attempting to embed all of it on a phone is difficult; putting it into something like a dog collar would be nearly impossible and absurdly expensive. But thanks to the cloud, any device can become a voice-enabled one with the simple addition of a microphone and a Wi-Fi chip. Everything from showerheads to children’s dolls can leverage the might of thousands of globally distributed computers.

Backed by all of these advances, voice is ushering in what’s known as ambient computing, which will ultimately make the rectangular slabs of today’s smartphones look as clunky as old VCRs. To date, computers have been, well, computers—discrete devices that we put atop our desks or hold in our hands. But when the bulk of the technological machinery can be far away rather than physically present, and when voice rather than clunky external peripherals can be used for control, the primacy of objects is diminished. As Google CEO Sundar Pichai put it in a letter to stockholders, The next big step will be for the very concept of the ‘device’ to fade away. With voice, computers are to be ubiquitous rather than discrete, invisible rather than embodied. Digital intelligence will be everywhere, like the air we breathe.

Voice also reverses an onerous status quo that has existed for thousands of years going back to the very dawn of human toolmaking. Our inventions have always demanded that we adapt to them. Whether with planes or guitars, lawn mowers or video games, we have to learn unnatural commands and movements to get devices to do our bidding. We have to determine which buttons to press, levers to slide, wheels to turn, and pedals to press.

On computers, we squeeze our fingers over jumbles of letters, numbers, and symbols—state-of-the-art technology when the QWERTY-layout typewriter was patented in 1867, but not so much now. The user slides around a mouse—a cute name for a hand-cramping contraption that was invented five decades ago—to reveal drop-down menus. We point and click. On smartphones, we tap, swipe, and pinch. Above all, we sit or stand motionless, our spines akimbo, the captives of eye-straining screens.

With voice, however, computers are finally doing it our way. They are learning our preferred way of communication: through language. Voice, optimally realized, has the potential to be so easy to use that it hardly feels like an interface at all. We know how to speak because we have been doing it for all of our lives.

Screens and smartphones won’t disappear in the conversational era, just as the jet airplane didn’t kill off the car. And voice will be integrated with current and emerging technologies, such as augmented reality. But for many applications, people will ditch keyboards and screens, and opt instead for the more natural, liberating interface of voice. Computers will follow us around rather than needing us to come to them.

It’s about time.

Voice, ultimately, is ushering humanity into the age of artificial intelligence. AI already lurks in the background of a wide range of applications, from internet search to automotive braking systems. But voice brings AI to the foreground—we speak to it, and it speaks back in a humanlike tone. Computing power that was previously only accessible to those in the innermost sanctums of academia, the military, and the world’s leading technology companies is now available to everyone.

What’s more, voice brings us artificial intelligence not as an academic might define it (the term is notoriously squishy) but as it has long been depicted in science fiction. So-called virtual assistants like Alexa are presented as intelligent, lifelike entities who do the biddings of their flesh-and-blood masters. They can be engineered to convey humor, friendliness, support, and empathy. In response, people will reflexively and mostly unconsciously start to reciprocate warm feelings. Our relationships with voice assistants will inevitably attain a depth and emotional complexity that a mobile phone or desktop computer would never inspire.

To be sure, these are still fledgling days for voice, as anyone who has cursed at their phone for failing to understand a simple utterance can attest. For some people, the technology suffers from the who would use that? stigma that has greeted new inventions from the automobile to Snapchat. Talking to your virtual assistant in public can feel awkward. But people used to think that having a cell phone conversation as you walked down the street was lame, too. The situation with voice computing is comparable to when the public was first hearing about a strange new technology called the World Wide Web in 1993. Or that of January 8, 2007, the day before Steve Jobs first announced the iPhone. The voice revolution has started, and it will change how we live.

Let’s run the numbers.

There are around 2 billion desktop and laptop computers in the world and 5 billion mobile phones. The number of deployed smart speakers, including Google Home or Amazon Echo, is much smaller but climbing fast, with an estimated 100 million of them worldwide. Now add in the types of gadgets showcased at CES—light bulbs, televisions, toilets, and the rest. All of the above can be portals for conversational-computing technology. This means that the total potential market for voice is exponentially larger than even mobile, climbing toward one hundred billion different devices globally.

Across the landscape of business, companies ranging from Facebook to 1-800-Flowers are asking: How will the voice revolution affect us? Is this an opportunity or a threat? Voice creates new ways to sell things, advertise, and monetize peoples’ attention. To interact with consumers for marketing or customer service. To collect data and profit from it. To make bookings and provide services from matchmaking to therapy.

The stakes are huge, so this book will devote the first of its three parts, Competition, to telling the story of voice from a business perspective. The primary focus will be on the campaigns by Apple, Amazon, Google, and Microsoft to develop voice platforms and dominate the emerging paradigm, which has the potential to imperil their empires or propel them to even greater heights.

ActiveBuddy’s vision had two prescient components. The first was people communicating with computers in natural language. The second, equally important, was users no longer having to expend so much effort online. Someone, or rather something else, would do a lot of the digital searching and doing instead.

Both components of that vision would famously come together in the world’s first mass-market, voice-enabled virtual assistant—Apple’s Siri. As we will see, her roots run deep. Prior to debuting on the iPhone in 2011, Siri and her component technologies were more than twenty-five years in the making, the passion project of a magic-loving technologist whose work was supported in part by the U.S. military.

The vast majority of people in the world had never spoken to an AI before Siri, and she blew minds. But as people spent more time with her, they quickly realized that Siri wasn’t some superintelligent AI with human-caliber skills. The bulk of her original functionality consisted of basic utilities: setting timers, getting weather forecasts, and sending off messages. And because she was essentially ahead of her time—conversational AI is much better today—her bugs in the early going led to a lot of disappointed users.

Siri’s shortcomings meant that most people failed to appreciate the magnitude of the revolution she had instigated. But Apple’s rivals were not oblivious. In fact, by the time Siri was unveiled, they were all working on their own voice-enabled assistants. Microsoft was first to market, in the spring of 2014, with the mellifluously named Cortana. Amazon shocked the tech world in November that year when it released the Echo smart home speaker, which was powered by an AI named Alexa. Google, which had offered internet search by voice since 2008, came out with a full-fledged voice AI, the Assistant, in 2016.

What is unfolding now is a textbook platform war, one that poses existential risks and tantalizing opportunities as the top combatants crack $1 trillion market valuations. Historically, Google and Facebook have made the vast majority of their fortunes from advertising. Amazon is the world’s largest digital store. Apple sells its own products, none more important than the iPhone. And Microsoft provides services and software for business applications. All of these business models are being disrupted by voice, and nobody is battling merely to create a new product or service. The companies are in a war to create the dominant new operating system for life itself.

ActiveBuddy wound up as a historical footnote for a variety of reasons—market downturns and management disputes. But possibly the most important factor was that the technology wasn’t good enough yet. Computers couldn’t listen well enough. They couldn’t talk naturally.

People, in fact, have been struggling to give voices to machines for centuries. This quest is the subject of the second of this book’s three parts, Innovation, which tells the story of voice from a technological perspective. Millennia ago, people shared myths about inanimate objects that sprang to life and spoke. In the Middle Ages, people recorded fantastical tales of so-called brazen heads, which sagely provided advice to holy men. Then, in the eighteenth century, inventors rolled out contraptions, functionally primitive but mechanically ingenious, that emulated human speech. As we will see, their creators were more likely to be seen as madmen or charlatans than legitimate inventors. But the original talking machines inspired subsequent generations of tinkerers all the way up to the digital age.

As soon as there were computers, beginning in the mid-twentieth century, people began laboring to teach them natural language. Early on, overconfident scientists promised that their creations would be able to help America win the Cold War, aid the mentally distraught, and maybe even explore space.

Reality intervened. What people perceived as a simple, unified experience—hearing something and replying—is, in fact, anything but. Dialogue involves subprocesses that branch out with fractal complexity. Sound waves must be transformed into words, a task known as automated speech recognition (ASR). Figuring out the meaning of those words is called natural-language understanding. Devising replies is natural-language generation. And, finally, speech synthesis is what allows computers to produce words aloud.

From the 1970s onward, most researchers limited themselves to one of these subspecialties. Other less constrained people began building simple text-based chatbots. They did so to engage players in video games or amuse themselves. They created chatbots that competed in competitions where the goal was to fool people into thinking that the computers were actually alive.

Both the niche-constrained researchers and the chatbot builders advanced the state of the art. But it took recent advances in machine learning to finally produce accelerating gains in voice. The theoretical appeal of the approach has long been obvious. Machines learn for themselves, trial-and-error style from data, rather than being explicitly programmed. The latter requires immense manual labor from programmers—too much, ultimately, to keep up with the enormous variety and complexity of human dialogue.

But if the promise was long there, the payoffs started coming only in the last five years or so. How this came about is a study of scientific perseverance. Researchers, including a trio known as the Canadian Mafia, spent decades working on machine-learning algorithms even when their colleagues mocked them for doing so.

Tech companies now scrap for the talents of machine-learning gurus and shower them with bank-busting salaries. The experts deserve abundant credit for cracking enduring problems such as speech recognition. Other challenges, like getting computers to formulate intelligible replies, are still works in progress. But the range of what’s possible is already astounding. Computers are learning to detect both meaning and emotion when we speak. They craft emails, advertisements, and poems. They talk with voices so realistic that they can convincingly emulate those of specific real people.

Creating voice interfaces, however, requires more than hard science. Early in the process, the inventors of Siri, Cortana, and other virtual assistants realized that their engineering work would be wasted if people couldn’t relate naturally to voice AIs and enjoy the experience. Enter the personality and user-interface designers, people with backgrounds in linguistics, anthropology, and philosophy; in comedy, acting, and playwriting.

"When you hear a voice, you automatically as a human being make

You've reached the end of this preview. Sign up to read more!
Page 1 of 1

Reviews

What people think about Talk to Me

3.0
2 ratings / 1 Reviews
What did you think?
Rating: 0 out of 5 stars

Reader reviews

  • (4/5)
    This book talks about how the major ai powered chatbots were conceived and built (cortana, siri, google assistant, alexa, etc).

    Vlahos contextualizes the development of such technologies in a timeline, explains some of tha negotiations that happened that led to the acquisition of Siri by Apple, for example, and also comments on the technology behind these artificial voices.

    He doesn't go too deep in the technicality so a layperson can understand the book easily. Also, he explores some social impacts of these technologies and how it was addressed by the major assistants (e.g. how the assistants react to sensitive themes such as harassment), which was very interesting.

    In short, I recommend this book for anyone interested in the topic.