Defining the Future of AI-Powered Voice Recognition with SoundHound

27m 0s

Darin Clark, Director of Business Development at SoundHound discusses how SoundHound has evolved from a pioneering music recognition app into an independent voice AI platform with proprietary conversational intelligence. SoundHound's technology allows brands in a wide range of industries to add conversational interfaces and wake words to any hardware, software, or mobile app.

Episode Transcript

“When we see these things working, even those of us that are in the business, sometimes we take a step back and are like, wow, that’s really amazing. It works so well and provides such a level of convenience and usability for the customer. It’s a little bit mind-blowing at times. The technology has really come so far from—not even 20 years ago—but just 5 years ago. It’s made leaps and bounds of progress since then,” says Darin Clark, SoundHound’s Director of Business Development.

Recently, Darin sat down with Scott Stanford of The Infrastructors Podcast to discuss conversational AI technology. In the podcast, they covered the evolution of voice AI, natural language understanding, wake words, personalization, monetization, and how convenience and value are at the core of a good voice solution.

Scott Stanford:

How does your platform recognize speech now as opposed to in the past? Has it changed much?

Darin Clark:

So, our solution is very different from the traditional approach to speech recognition. Our founders decided that we needed to own all the technology ourselves. In order for it to work, we needed to build it all ourselves. In doing that, we developed a new approach to speech recognition.

In the traditional approach, audio comes in, and the automatic speech recognition (ASR) engine converts that audio to text. That text is then passed to some type of natural language understanding to try to derive the meaning. But then there are issues of latency, because you have two processes that have to run serially. Then, you have issues with accuracy. Any errors that happen in the ASR component are then propagated to the user side.

What we did was we actually combined those two processes to work simultaneously. So they actually feed off of each other. So it’s much more akin to the way humans understand. As I’m talking, you’re not waiting until I get to the end of a sentence and then trying to understand everything that I just said in the last sentence. You’re constantly referring back to what I just said, thinking about what might be coming later, and understanding holistically everything I’ve said.

So even if you miss a word here or there, you get the overall meaning because it’s taking the entire context of what I’ve said into consideration. That’s really how we developed our solution, and it’s how we do things differently than anybody else on the market today.

Scott Stanford:

What is Deep Meaning Understanding?

Darin Clark:

It’s another differentiation between our approach and some of our competitors when it comes to the natural language side of things. The traditional solutions on the market are typically keyword based.

For example, if you say something like, “Find Asian restaurants in San Francisco,” almost all of the good products in the market will have no problem with that. It’s a local search for a restaurant with a qualifier of Asian and a location in San Francisco. Pretty easy to extract the right pieces and keywords to put them into your engine and provide a good result.

The issue is that if you go even slightly more complex, any of the other solutions on the market are going to break. Say you said, “Find Asian restaurants, excluding Chinese, in San Francisco.” For all of the other solutions, you’re going to get a list of Chinese restaurants, because the system thinks “Chinese” is just another keyword.

Instead, what we do is look holistically at the query and understand exclusions, double negatives, multiple contexts, and aspects of the query. So I could take that same query and go even more complex: “Find Asian restaurants, excluding Chinese and Japanese, that are in San Francisco, open after 5:00 PM, have wifi and outdoor seating, and have four stars on Yelp.”

Then I would get a list of responses within the exact parameters which I’ve requested. So it’s that holistic approach that we do differently than the other solutions on the market.
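
To make the contrast concrete, here is a toy Python sketch of the two behaviors described in this answer. It is an illustration only, not SoundHound's implementation: the restaurant data, the `keyword_search` and `structured_search` functions, and the `StructuredQuery` fields are all hypothetical, and the structured query is written by hand rather than parsed from speech.

```python
from dataclasses import dataclass, field


@dataclass
class Restaurant:
    name: str
    cuisine: str
    city: str
    stars: float
    tags: set = field(default_factory=set)


# Hypothetical sample data, invented for this illustration.
RESTAURANTS = [
    Restaurant("Golden Wok", "chinese", "San Francisco", 4.5, {"wifi"}),
    Restaurant("Thai Orchid", "thai", "San Francisco", 4.2, {"wifi", "outdoor seating"}),
    Restaurant("Sakura", "japanese", "San Francisco", 4.7, {"outdoor seating"}),
    Restaurant("Pho Corner", "vietnamese", "San Francisco", 3.6, {"wifi", "outdoor seating"}),
]

ASIAN = {"chinese", "thai", "japanese", "vietnamese"}


def keyword_search(query, restaurants):
    """Naive keyword matching: any cuisine word in the query counts as wanted,
    so the word "excluding" in front of it is ignored."""
    words = set(query.lower().replace(",", " ").split())
    wanted = {cuisine for cuisine in ASIAN if cuisine in words}
    return [r.name for r in restaurants if r.cuisine in wanted]


@dataclass
class StructuredQuery:
    """A hand-built representation of the query's meaning (an assumption:
    a real system would derive this from the spoken request)."""
    include_cuisines: set
    exclude_cuisines: set = field(default_factory=set)
    city: str = ""
    min_stars: float = 0.0
    required_tags: set = field(default_factory=set)


def structured_search(q, restaurants):
    """Apply inclusions, exclusions, and every other constraint together."""
    return [
        r.name
        for r in restaurants
        if r.cuisine in q.include_cuisines
        and r.cuisine not in q.exclude_cuisines
        and (not q.city or r.city == q.city)
        and r.stars >= q.min_stars
        and q.required_tags <= r.tags
    ]


naive = keyword_search(
    "Find Asian restaurants, excluding Chinese in San Francisco", RESTAURANTS
)
precise = structured_search(
    StructuredQuery(
        include_cuisines=ASIAN,
        exclude_cuisines={"chinese", "japanese"},
        city="San Francisco",
        min_stars=4.0,
        required_tags={"wifi", "outdoor seating"},
    ),
    RESTAURANTS,
)
print(naive)    # the keyword matcher returns the excluded cuisine
print(precise)  # the structured query honors every constraint
```

With this sample data, the keyword matcher returns only the Chinese restaurant, which is the failure mode described above, while the structured query returns only the restaurant that satisfies all of the stated conditions.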

Scott Stanford:

Can you customize your wake word?

Darin Clark:

Yes, and there’s a lot of value in that for brands. There are a lot of companies that spend a lot of time building brand equity associated with their brand name, product suite, solutions, and services. Being able to customize the wake word as an entry point into a voice experience is critical for brands to maintain that ownership and create a customized experience.

The branded voice experience, from the first interaction the customer has with your brand all the way through the responses, is critical. When I’m talking to a specific device, for example, I don’t want to say Alexa to my car. I want to say Mercedes or Hyundai. Or if I’m using Pandora on my smartphone, I don’t want to say Siri. The ability to customize the wake word is essential, and we see a ton of value for a number of our customers.

Scott Stanford:

How far along are we on personalization? For example, a quick-service restaurant asking if you want the same burger and fries as last time.

Darin Clark:

It’s available today. We are doing voice ordering in drive-thrus. We have a voice-enabled drive-thru with Mastercard that is implemented at White Castle.

Then there are technologies on the horizon, such as license plate recognition, that could be associated with a loyalty program. If a customer signs up for a loyalty program and opts in, then the next time they pull up to the drive-thru, they can be asked if they would want their last order.

A credit card could also be a part of that, where it asks if you just want to charge the card on file. Speed in the interaction is really important. Not to mention the excitement of the loyalty program.

As far as voice technology, customers can make changes and customize items. It’s efficient and convenient for the customer. Specifically, with quick-service restaurants, there is a huge labor shortage right now. They can’t even hire people. Voice technology helps solve that problem.

Scott Stanford:

How do voice assistants work with regional dialects and accents, especially with voice recognition systems?

Darin Clark:

It’s critical that the solution works well from the get-go. Part of the advantage of our solution is that we get a lot of data that allows us to feel very confident that we can recognize regional accents.

The other part of your question is about speaker identification and voice recognition. Could a voice assistant know that it’s me versus my wife or son? That’s definitely something that’s possible today.

It can be done through the wake word. If I say the wake word and it recognizes my voice, then it could bring up my preferences, whether that be radio stations or work location, or parental controls if my son asks something. Giving that convenience to the user is incredibly valuable.
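
As a rough illustration of that idea, the Python sketch below matches a voice sample taken from the wake word against enrolled household profiles and loads the matching preferences. Everything in it is an assumption made for the example: the three-number "voiceprints", the cosine-similarity comparison, the 0.8 threshold, and the profile fields are hypothetical and do not describe how SoundHound identifies speakers.

```python
import math

# Hypothetical enrolled profiles. Real systems would use much longer
# voice embeddings produced by a trained model; these are made up.
PROFILES = {
    "parent": {"voiceprint": [0.9, 0.1, 0.2], "radio": "news", "parental_controls": False},
    "child": {"voiceprint": [0.1, 0.9, 0.3], "radio": "kids", "parental_controls": True},
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, threshold=0.8):
    """Return the best-matching profile name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, profile in PROFILES.items():
        score = cosine(embedding, profile["voiceprint"])
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def preferences_for(embedding):
    """Load preferences for a recognized speaker, or safe defaults for a stranger."""
    name = identify_speaker(embedding)
    if name is None:
        return {"radio": None, "parental_controls": True}
    profile = PROFILES[name]
    return {"radio": profile["radio"], "parental_controls": profile["parental_controls"]}


print(identify_speaker([0.85, 0.15, 0.25]))  # close to the "parent" voiceprint
print(preferences_for([0.0, 0.0, 1.0]))      # unknown voice falls back to defaults
```

Falling back to restrictive defaults for an unrecognized voice is a design choice in this sketch, not something stated in the interview.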

Scott Stanford:

Is monetization of voice assistants in the works?

Darin Clark:

Yeah, 100%. Pandora, for example, has voice ads that are an integral part of that solution. If you’re a free Pandora user, every five songs or so, there’ll be an interactive advertisement that asks if you want more information about the ad, and if you say yes, it will send information to the email on file.

That’s the first step in a much broader effort towards monetization across advertising and especially with food ordering. The critical thing from our perspective is to do it in a way that’s not intrusive to the user. You don’t want it to be a distraction or detriment to the user. You want monetizable moments that are consistent with what the user is doing.

For an in-car voice assistant, if the user says “Navigate home” or “I’m hungry,” then the voice assistant can say something like, “Here are some pizza places on your way home.” Or, for a voice-enabled TV, if I say I want to watch a movie, then it could show an ad for $5 off pizza if you use your voice remote to order. This ability to do things consistent with what the user is doing without being intrusive or distracting is at the core of our solution.

Scott Stanford:

Where do you see SoundHound going next?

Darin Clark:

We like to think that there’s not really a ceiling. There are 7.96 billion people in the world, and there will be 8.4 billion voice assistants by 2024. So that’s more voice assistants than the world’s population.

A lot of people are very familiar with Alexa or Google, and they’ve worked well, but we also want to give customers the ability to talk to all devices and have very specific interactions with them. For example, changing the temperature on a thermostat.

That’s the market that we’re going to be moving into. People want to have specific interactions with devices.

Scott Stanford:

Is there voice technology out right now that really amazes you that you think customers will really love?

Darin Clark:

That’s a great question. It kind of comes back to what resonates most with our customer base and prospects that we’re talking to. One of those is the voice-enabled drive-thru experience or food ordering in general.

Most people adapt right away to a voice system at a drive-thru, but some people freeze up, wondering what they’re supposed to say to this. There’s some user education involved. Once we can get users to recognize they can order just like talking to a person, they’re going to be blown away by how it responds. It gives them all the items, customization, corrections, and more.

When we see these things working, even those of us that are in the business, sometimes we take a step back and are like, wow, that’s really amazing. It works so well and provides such a level of convenience and usability for the customer. It’s a little bit mind-blowing at times. Voice technology has really come so far, from not even 20 years ago but 5 years ago. It’s made leaps and bounds of progress since then.

Scott Stanford:

Before you go, can you give us any information about what’s coming out in the future?

Darin Clark:

As a company, SoundHound just went public, so that limits my ability to talk about anything that might be coming. But we continue to have exciting conversations and develop products with new and existing customers, rolling out new features and functionality. I think the best is definitely yet to come. We’re just so excited about the future.

SoundHound has all the tools and expertise needed to create custom voice assistants and a consistent brand voice. Explore SoundHound’s independent voice AI platform at SoundHound.com, where you can register for a free account. Want to learn more? Talk to them about how they can help bring your voice strategy to life.
