April 1, 2024

Deepgram's product-market fit journey

Sandhya Hegde
Editor's note: 

SFG 43: Deepgram's CEO Scott Stephenson on Speech AI

Deepgram is a voice AI company that has built an incredible reputation in the market for the quality of its speech recognition. Last valued at over $250M, Deepgram has over 500 customers, including NASA, Spotify, and Twilio.

In this episode, Sandhya Hegde chats with Scott Stephenson, CEO and co-founder of Deepgram.

Be sure to check out more Startup Field Guide Podcast episodes on Spotify, Apple, and YouTube. Hosted by Unusual Ventures General Partner Sandhya Hegde (former EVP at Amplitude), the SFG podcast uncovers how the top unicorn founders of today really found product-market fit.

Episode Transcript

Sandhya Hegde

Welcome to the Startup Field Guide, where we learn from successful founders of high-growth startups how their companies truly found product-market fit. I'm your host, Sandhya Hegde, and today we'll be diving into the story of Deepgram. So, Deepgram is a voice AI company that has built an incredible reputation in the market for the quality of its speech recognition.

Last valued at over 250 million, Deepgram has over 500 companies in its customer portfolio today, including NASA, Spotify, and Twilio. Joining us today is Scott Stephenson, CEO and co-founder of Deepgram. Welcome to the Field Guide, Scott.

So I'm so excited to learn more about how you ended up starting Deepgram. You were a particle physicist, right? You were working on detecting dark matter less than 10 years ago. How did you end up becoming an enterprise software company CEO?

Scott Stephenson

Yeah, I was building deep underground dark matter detectors in a government-controlled region of China. And it was basically like a James Bond lair two miles underground. No kidding. And it was mostly no rules as well. I look back now and think it's a lot like building a startup, and I could talk about that in a minute. But yeah, I was a particle physicist looking for dark matter, and I was a graduate student when we were working on that. And the reason that you go underground is because you're trying to run away from radiation. You're trying to run away from background radiation, cosmic radiation.

And so you build this extremely sensitive detector that has very little background inside, and we called it the quietest place in the universe, because basically, from a radioactive perspective, there's nothing going on inside that detector. And that's what you were hoping for, except for finding dark matter. So you wanted nothing to be going on except for dark matter hits, basically. But the type of thing that you had to do to build that detector, and detectors like it, was waveform signal processing. You had to use a lot of machine learning to understand real-time data feeds, and all of these things had never been built before.

You had to tell signal from background and build all your systems around that. And then manage the experiment and run it over a long period of time because it takes data for six months, a year and a half, and all the systems have to be like perfect throughout that time. And then you're taking data to understand what's going on.

What that turns into is: we're an audio company now, Deepgram is, and what we were doing underground in that detector was the equivalent of processing 60,000 conversations at once. And it was a team of 15 people that were working on it. And so we had this high-leverage dark matter experiment going on.

And an equivalent experiment might have a hundred people or a thousand people on other particle physics experiments. But we leveraged deep learning and other machine learning techniques to keep the team small and discover a lot of things. And it's interesting: the type of signal that we were analyzing was a waveform. It was an analog waveform that was digitized by an FPGA. Very similar to how a microphone works. It's an analog waveform, then it gets digitized by an analog-to-digital converter on your computer. And we just noticed a lot of similarities between these, but that's not what got us into audio as a company.
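The parallel is easy to see in code. A toy sketch (the sample rates and signal shapes below are illustrative assumptions, not numbers from the actual experiment): both a detector pulse and a microphone signal come out of the analog-to-digital step as the same kind of object, a sampled 1-D array, which is why the same waveform and ML tooling applies to both.

```python
import numpy as np

def digitize(analog, sample_rate_hz: float, duration_s: float) -> np.ndarray:
    """Toy ADC: sample a continuous signal at fixed intervals, the way
    an FPGA digitizer (detector) or a sound card (microphone) does."""
    t = np.arange(0.0, duration_s, 1.0 / sample_rate_hz)
    return analog(t)

def detector_pulse(t):
    # A decaying oscillation, roughly the shape of a scintillation pulse.
    return np.exp(-t * 2000.0) * np.sin(2 * np.pi * 5e5 * t)

def microphone_tone(t):
    # A 440 Hz tone, standing in for speech.
    return np.sin(2 * np.pi * 440.0 * t)

# Different physics, same digital representation: a sampled 1-D array.
pulse = digitize(detector_pulse, 2e6, 0.005)      # 2 MHz for 5 ms
audio = digitize(microphone_tone, 16_000, 0.005)  # 16 kHz for 5 ms
print(pulse.shape, audio.shape)                   # (10000,) (80,)
```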

What got us into audio as a company was that we were sitting around building this experiment, letting it take data for six months, a year, etc. And while you're doing that, all the work is done at that point. You did all the hard work. You're taking data, right? And what we were doing with that time was building these little devices. It's funny, because we were doing this eight or nine years ago, but it's what you see now: the devices that people are building that record your life and make a backup copy of it. Today these are life-logging devices, and there are a few of them out there now that are starting, but that's what we were building back then, basically a Raspberry Pi that recorded your life all day, every day, 24/7. And we got thousands of hours this way, and we wanted to understand what was inside it, because we were building the detector, taking data, having all sorts of interesting conversations, and all sorts of not-interesting conversations as well. And we wanted to sort the signal from the noise, basically.

And so we were thinking, hey, we're doing these end-to-end deep learning techniques on our particle physics data that looks a lot like audio. Surely there's some company out there in the world that is doing that too. And so we went looking for that, so that we could understand these thousand-plus-hour data sets that we had. And the only real company out there was Nuance. Nuance, as folks know now, is the largest AI acquisition, even though it's old-world AI; they got bought for 17 or 20 billion dollars or whatever by Microsoft. Nuance is a speech-to-text technology company, but when we went looking through their product portfolio, we were just like, whoa, this technology is not the new technology. This is not end-to-end deep learning. It's not going to be able to do the things that we want, et cetera. We were just developers looking for a developer tool to go do this thing. And so we're looking around saying, why is nobody building this?

But scientists are generally pretty pessimistic. We would think, yeah, surely somebody's building it, even if it's not out yet. Google or Microsoft or somebody like that, they're probably building this. So we just emailed their speech teams from our academic email addresses and met with them independently, hoping that they would give us access to something that we could use, maybe early access or something. A naive request, but nevertheless, they took the meeting and talked with us. And we were pitching to them: end-to-end deep learning is going to take over speech recognition and conversational AI, so surely you guys are working on that, right? And they basically laughed us out of the room. They were like, language is too complicated. End-to-end deep learning will never work. This data-driven technique will never work, etc. And we couldn't believe it. We were just like, all the reasons that you're giving for why it won't work are all the reasons it will work. That's why it has to work. We had a meeting with Microsoft; we heard that. We had a meeting with Google; we heard the same thing. And we were just like, okay, I guess we have to start a company, because these other people who should be doing it don't get it. We'll go do it now. And so then we went from being pessimistic to extremely optimistic going into the company, saying, okay, the people who should get it don't get it. That's a really good sign for building a product. So let's go build that thing.

Sandhya Hegde

And this was 2015 or '16. What was the first year of being a founder like?

Scott Stephenson

We started with a demo. We're just physicists. We used coding as a tool. We used machine learning as a tool. We used whatever we could as a tool to get to the end goal that we were looking for. And so we built these underlying models that were what people would think of today as embeddings models with a vector database search.

So we built embeddings models, and we built a fuzzy database search on top of them to search inside the audio we were looking at, and that was what we started with, and we built a demo around that. At the time, Silicon Valley was a popular show. So we took all, maybe, two seasons up until that point, because it was still ongoing, concatenated them together, and made one extremely large file. And then we built a search across that, so that if you wanted to search for a specific phrase, like "the middle-out technique" or this or that, like many of the things that were mentioned in the show, you could search across that entire thing, be taken to specific moments, and then view those moments inside.
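A minimal sketch of how a demo like that can be wired together: embed fixed windows of the audio, then do nearest-neighbor search over the vectors. The `embed` stand-in below is hypothetical (log-magnitude FFT features); Deepgram's actual models were learned, not hand-built.

```python
import numpy as np

def embed(window: np.ndarray) -> np.ndarray:
    """Stand-in for a learned audio embeddings model.
    Here: unit-normalized log-magnitude spectrum of the window."""
    vec = np.log1p(np.abs(np.fft.rfft(window)))
    return vec / (np.linalg.norm(vec) + 1e-9)

def index_audio(audio: np.ndarray, sr: int, win_s: float = 1.0):
    """Slide a fixed window over the audio and embed each position."""
    win = int(sr * win_s)
    hop = win // 2                                   # 50% overlap
    starts = range(0, max(len(audio) - win, 1), hop)
    vecs = np.stack([embed(audio[s:s + win]) for s in starts])
    times = np.array([s / sr for s in starts])
    return vecs, times

def search(query: np.ndarray, vecs: np.ndarray, times: np.ndarray, k: int = 5):
    """Cosine-similarity nearest neighbors: the 'fuzzy database search'."""
    q = embed(query)
    sims = vecs @ q                # rows are unit-norm, so dot = cosine
    best = np.argsort(-sims)[:k]
    return [(float(times[i]), float(sims[i])) for i in best]

# Usage: index a long recording, then find moments that match a clip.
sr = 16_000
show = np.random.randn(sr * 60)        # pretend: 60s of concatenated episodes
clip = show[sr * 30 : sr * 31]         # a phrase we want to find again
vecs, times = index_audio(show, sr)
print(search(clip, vecs, times))       # top hit should land at t = 30.0s
```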

But it was a demonstration of the technology, right? And then we went and shopped it around to a couple of our friends at university, and one of them said, hey, I know an investor. They're an early seed-stage investor. And that person also was a founder of a Bitcoin company that eventually got hacked and failed, but nevertheless, they're like, hey, I know this person. Do you want to meet them? And so we were just graduate-student physics kids that didn't have any connections with Silicon Valley whatsoever, but that was our one single connection. And then once we met them, they're like, hey, you should come out to the Bay Area. And this is 1517; it's Danielle and Michael from 1517. You should come out to the Bay Area, we'll give you a thousand bucks for free, we won't even invest in the company. Just come out, keep doing the things that you're doing, and meet some people. And when we did that, our eyes just opened up like, wow, okay.

Okay, there's a lot of funding around. There are a lot of people working on a lot of different things. We need to move here and start working on it. And so it was just two founders sitting in a room all day, every day, taking that demo and turning it into an API that we then built other demos on top of. Then we went and shopped it around to different seed investors, and we eventually applied to YC as well and got in. And that was when everything started to go really fast, because up until that point, at least for us, we were gathering our allies, meeting a few people here and there, figuring out our strategy. But then we get into YC and it's, okay, game on, basically, right? And as soon as we did that, we were still trying to figure out: should we go B2B or should we go consumer? I didn't really know. I knew that we could be successful in B2B, but is there a consumer product in 2015 that should be built around voice AI? And so we went looking for it, and we built these products that would seem pretty quaint now.

So one was where you would write down a certain phrase, and then it would assemble video clips based on the audio, so it could make anybody say anything. Our target there was Donald Trump, just make Donald Trump say anything, and this is 2015, 2016, right before that first presidential election. And we released that on Product Hunt. Then we did another one that was like: search all of the video as if it were text. It's funny, Google is Google, so we called it Hoogly. You could go there and search across a broad set of YouTube videos and search directly into them. Again, all these technologies have come around in the last eight years; if you do a Google search now, you'll find that. But so we were trying to figure out, do these things make sense? And the answer was no, not at the time. Basically what happens is you build something that's popular. It probably turns into a 10 or 20 or 50 million dollar company. And then you're forced to be acquired, because there's no way to distribute yourself. You have to piggyback off YouTube or something like that, and then Google will just crush you, or whatever it is. And so we're like, okay, are we really good at building infrastructure and backends? Yeah. Are we really good at building models from scratch, foundational models? Yeah. Okay, that's what we should do. We should go B2B, and we should supply the voice AI backend that powers the next generation of apps like the ones we were just talking about. So we'll build the underlying infrastructure to do it. But we knew we were signing up for a long march then. There's no quick path to success when you do that. It's, okay, we're going to be grinding for two, three, four years until we release a product, probably. But that product is going to be a step function better than competitors, and then they won't be able to catch up. So we were betting on this voice AI market coming sometime in the next four or five years, and it worked out.

It worked out. We all know how it goes now, but at the time everybody thought we were crazy. But go back to right around the time of the pandemic: meeting tools and everything started to become very popular, and then voice AI was on everybody's mind. And we had released our product maybe a year and a half before that. And then we just saw massive uptake. And it hasn't stopped since, essentially, because everybody's realizing the uses for voice AI.

Sandhya Hegde 

And at the same time, there were other companies, I would say, also piggybacking off just the status quo speech-to-text technology, whether that's Gong or these other companies as well. So the application and the business value started becoming more apparent. I'm curious: obviously, building deep tech takes a while, and one of the risks of hunkering down for two or three years is that you don't have a quick feedback loop from the market on, okay, actually this is the application of our tech that will have the most demand, and this is the right category, et cetera. How did you approach that? Did you have design partners that were your feedback loop? For the two to three years that you were really building, what did customer feedback and customer focus look like at Deepgram?

Scott Stephenson

I don't recommend this to any AI company right now. I'll tell you how it went for us, but I don't recommend it right now, because the world is moving too fast, and you need to build on other people's infrastructure to be relevant. But what we did was pick the fundamental things that we thought definitely mattered. And it was not one thing: it was latency, it was accuracy, and it was COGS, basically. If you can get really good latency, really good accuracy, and low COGS, then you can provide that product that we were looking for, that we were personally looking for.

You can go provide that to the world. And so in the first three years we weren't focused on trying to monetize that at all. We just opened it up for free. We didn't open-source it, but we opened it up for free; anybody could sign up and use it. And once we got maybe around a dozen or so true believers, people that believed in the product and were using it, we took that down. We just said, okay, we've got all the product development partners that we need. And literally it was down for maybe a year and a half. We didn't allow anybody to sign up for Deepgram or any of that stuff. We were just working with these 12 people. And again, I'm saying I don't recommend this right now, because the world is moving too quickly, but back then everybody was really slow to figure out how AI works.

And so you needed to coach them along on how they could utilize it and everything. Anyway, we worked with those 12 people and figured out what they wanted: what languages they care about, what latencies they care about, how they wanted it all to work, et cetera. And then, of course, we were thinking about, okay, what about the next 100 customers? Because we're building a product that can actually scale. And so what happened was, with those first dozen or so customers, I just went to them after they had been using it for maybe a year and a half and said, we want you guys to pay now. And much to my surprise, they were like, I can't believe you let us use it this long for free. Your product is amazing. Sure, we want to pay. And it's, oh, okay, that was easy. And that got us maybe about halfway to the ARR goal that we had for raising an A. But at that point it was like, okay, we just cashed in on the beta users, basically. But we hadn't been going out into the world and getting new users.

So at that point in time, we were pretty much 100 percent a technical team. I had not hired anybody from a go-to-market perspective; there was nobody at the company. I was the go-to-market person: the particle physics PhD, Scott, was the go-to-market person. And that still works, but nevertheless, you want somebody who, I think of them like an athlete. You want somebody who gets up in the morning, ties up their practice shoes, and is ready to go for the run. They're ready to put in the work on going out and educating the world about Deepgram, and then pushing it all the way through and closing the deal at the end, too.

And then making sure they're happy and continue. You need a person who's going to do that. So I tapped into my network of people, which was not all that big at that point, but YC helps. I tapped into the YC network and said, does anybody know how to hire a salesperson? And I met these great folks, Cory and Hilmon from ClozeLoop, that's the name of their company. They came on as sales advisors for Deepgram, and we gave them a few basis points, basically, and paid them a little bit of cash. And I highly recommend this to technical founders: go find folks like this who have scar tissue in that area. It's not just for their ideas, it's also for their network. Because I went to them and said, hey, I don't want you just to figure out how to help us put our go-to-market together. I want you to help me hire the person who's going to run it, too. And so we did that, and that person is our VP of sales now at Deepgram, Chris Dyer. He was a great hire then.

He's the athlete guy. He gets up every morning and that's what he's doing, and he's been doing it for six years, and it pays off for Deepgram. But nevertheless, around that time we needed to start getting a real go-to-market going, not just the free-beta-user-to-paid-user method. And so we got that going, and that's when we said, okay, we know this is repeatable, we know we can go out, we know we can hire go-to-market people to sell it. So now is the time to go raise an A. Hopefully that gives you an idea.

Sandhya Hegde

I'm curious, who were your early customers? What industries or use cases did they represent? And what product feedback did you get that's memorable, maybe surprising?

Scott Stephenson

The typical market, especially then, was all call centers. They were the only ones buying speech recognition at scale. That's not true today anymore, but back then, they were the only ones who had recognized it. And what they were doing was analytics. They'd have a call center handling ten thousand or a hundred thousand calls a day, and folks in the call center who were maybe trained a week ago, right? And those folks are on calls talking to customers, and the company wants to figure out: hey, are they saying things they shouldn't be saying? Are they following the script we want them to follow? Is the customer happy in certain areas, so we should go reward them? Whatever it is, they just wanted to figure that out. And the way it was previously done was a manager listening to random calls. They don't have much time, so for each person they managed, they'd listen to maybe one or two calls per week, and that's how the feedback would happen. Those people are making calls all day, every day, right? So anyway, they're automating that process. And in the early deals you find love triangles. This is what we saw: it was Uber, Deepgram, and a company called Randall Riley. Randall Riley does hiring for companies like Uber, or trucking companies, that kind of thing. Folks who are interested in being a driver, a contract driver or whatever, for those companies call in and have a conversation about their qualifications and everything. And they wanted massive-scale lead qualification there, basically.
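As a concrete illustration of that analytics use case, here is a hedged sketch of a script-adherence check run over transcribed calls; the transcript format and the phrase lists are assumptions for illustration, not any vendor's real schema.

```python
# Hypothetical transcript format: a call is a list of (speaker, text) turns.
Call = list[tuple[str, str]]

REQUIRED = ["this call may be recorded", "is there anything else"]
PROHIBITED = ["guaranteed returns", "just trust me"]

def audit_call(call: Call) -> dict:
    """Flag script adherence for the agent side of one call."""
    agent_text = " ".join(text.lower() for spk, text in call if spk == "agent")
    return {
        "missing_required": [p for p in REQUIRED if p not in agent_text],
        "prohibited_said": [p for p in PROHIBITED if p in agent_text],
    }

# Instead of a manager sampling one or two calls per agent per week,
# every transcribed call gets checked.
calls: list[Call] = [
    [("agent", "Hi, this call may be recorded for quality."),
     ("customer", "I want to close my account."),
     ("agent", "I can help. Is there anything else I can do today?")],
]

for i, call in enumerate(calls):
    report = audit_call(call)
    if report["missing_required"] or report["prohibited_said"]:
        print(f"call {i}: review needed -> {report}")
    else:
        print(f"call {i}: clean")
```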

And so anyway, that's the type of thing we started out with. But I think it's really important to have a beachhead like that, a market that is already there, especially when you're trying to build deep tech that is new. If you have to create the market and the tech and label your own data and do all this stuff, your chance of success keeps going down. You should have some kind of stable market that you know you can sell into if you're going to take on all these other risky things.

Sandhya Hegde

And speech recognition is notoriously difficult to get the last few points of improvement in, right? Whether that's accuracy, cost, or latency. I'm curious what your approach has been to keep pushing those boundaries out. And what is it that customers care about? Is everybody focused on these performance metrics, or are there completely separate criteria, like maybe how you deploy, as with other APIs? Were there other things that ended up being important that maybe you hadn't considered at first?

Scott Stephenson

Yeah, there are many things that are important. The big three, though, are accuracy, speed, and cost. Accuracy of the model is the top one that everybody cares about to begin with. Then there's speed, and speed is linked with cost, because if you have a really fast model, it costs less to run, so you don't have to charge a lot and you can still give them a really fast model. And if that model is also super accurate, you've hit the holy trifecta, basically, right? Those companies are like, I can't think of a reason not to go with you. You don't have to convince them about anything. The one thing you do have to convince them about is trusting you over Google or something like that, but that's not about the product specifically. And this is the perspective of a foundational AI model-building company that provides strategic infrastructure. You have to think that way: if it's not better in all three, then what are you doing? Now those three are the table stakes, essentially, if you want to win the market. But if you want to ensure higher success, take the money that you save in COGS and put it toward really good support, really good onboarding, really good reference architectures, and open-source code, too. You reinvest that back into the company to make it easier to onboard, because when you're building a developer tool or a strategic piece of infrastructure like this, there is a learning curve. People have to know what's possible. You have to make it easy for them to sign up, so they can just switch from Google or Amazon or whatever to you easily, and get all the feedback they need in doing that, and all the good feelings that go along with it. So this comes down to cost as well. If you have poor margins on the service you're providing for your model, that cuts into your ability to give really good support.
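Accuracy for speech-to-text is typically scored as word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the length of the reference. A minimal implementation, for anyone who wants to run the comparison themselves:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion ("the") and one insertion ("please") against 4 ref words: 0.5.
print(wer("turn the volume up", "turn volume up please"))
```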

And so this is one of the problems the big tech providers have right now: for speech-to-text especially, and for text-to-speech, they have a lot of cost on their side to run it themselves. Companies like Deepgram are an order of magnitude better, or more, from a COGS perspective. So they can't drop their prices too much, but to compete, that's what they try to do, and then they can't provide any support. And if you want to win as a strategic infrastructure provider in AI, you have to do it all. This is why I was talking earlier about signing up for a long march: you know you're not going to get every single piece at once. But you find the innovators that believe in your vision, and you collect them along the way. Eventually it turns into a hockey stick, because you have all the coverage and nobody has an attack vector.

Sandhya Hegde

And now you have a lot of startups building different types of enterprise conversational AI solutions, branched out a lot from the early days of just the analytics use case. I feel like you have a good bird's-eye view of how the contact center slash customer support industry might evolve. What are your thoughts?

Scott Stephenson

Real-time bots. That's where it's going to go, for sure, no doubt about it. Some of those real-time bots will be text; many of them are going to be voice. Anything that is text will eventually add voice to it, because it's just easier for humans. If you can choose: you have your best friend in life sitting next to you, are you going to type on your computer and write them stuff, or are you just going to turn and talk to them? You know what I mean? You're going to turn around and talk to them. And so you want to get that level of trust, or something close to it, with voice AI systems, where you're just like: I know you're going to understand what I'm saying. I know you're going to have my best interest in mind.

And I know that you're going to come up with good solutions to my problems. Okay, so I'm just going to turn and talk to you about that thing. The way this is going to turn out is: everybody dreads calling into their Comcast or something like that now. Literally, you think about it and you dread the prospect of doing it. That won't be the future; you'll be happy to do it. Okay, I'm going to call in, and I know I'm not going to wait for 45 minutes, I'm not going to get hung up on four times, and my problem is going to get solved in two minutes. So why wouldn't I call them and try to get that solved? There are no business hours, there's no waiting in line, there's none of it. You just do it whenever; it's always available. And yeah, for call centers and other customer success things, that's going to be a real game changer. But there's a whole other world too, because it's not just about customer service. It's any consumer type of interaction. Take your best salesperson, your best whatever, and ask: if you could clone them, would you? And if you could clone the happiest, best version of them, would you? In the next five years, all of this stuff is going to be rolled out. And we'll be glad to call those services, whether it's in an app, whether it's literally a telephone call, whether it's in the browser while you're typing, whatever it is. All these things are happening right now. So it's real-time bots, real-time agents. That's the future.

Sandhya Hegde

Shifting costs from man hours to GPU hours. One GPU at a time.

Scott Stephenson

 Yeah, and using the best part of human creativity to train those models and make them better, right? So whenever there's like some dull grunt work that you dread, that's probably what you should be automating, right? And then there's other parts of that where you're like, Oh, that's an interesting question. I haven't really thought about that. That's where a human is really valuable. You can hand it off to them. They can help solve the problem, and then they make that system even better. So yeah, that's the future.

Sandhya Hegde

Yeah. I see a lot of folks, especially on the text side, trying to automate customer support and service-style use cases over the past year. I think the trickiest thing so far, leaving aside all of the hallucination and accuracy questions that everyone has talked to death, has been the handoff. When should we rely on the bot, versus when do we really need to escalate? If we can get really good at that, it becomes easier and easier to adopt the bot. But yeah, I hear you. I think it's such a win in terms of better customer experience. It's work that is not really a career path for anybody, and it's expensive to do. Obviously the impact on labor in general is a big question mark and a point of concern, but you can see why this is going to happen; all of the forces are aligned to make it happen. Maybe speaking of cost a little more, I'm curious, given your deep expertise extending to the hardware layer of this: how are you thinking about general-purpose hardware versus specialized hardware in this ecosystem for better performance? What changes would you expect in the GPU and data center segments of the AI industry?

Scott Stephenson

Yeah, how specialized? Because specialized is definitely going to be a thing, but it's actually a question of how specialized. I think everybody has gotten the memo that they need GPUs. They're not going anywhere. They are 10 times or more better than CPUs for deep learning. And there will now be serious competition from other GPU manufacturers like AMD.

They're starting to bring that into the market. Nvidia is still by far the leader, and it makes sense: they invested in this early, they built the ecosystem around CUDA and all of that. So they're enjoying that benefit right now, but it's going to get very competitive. But there's a whole other side as well, which is the application-specific chips that are out there, like Groq and others. I think they have a real fighting chance as companies, because there's this big capability boom in AI right now based on different architectures that people have found: CNNs, RNNs, transformers, attention, that kind of thing. But I look at this a lot like electricity, or the internet. You don't have to have it all figured out to get a ton of value. What you need are the basic components to get that value. For the internet, it was connectivity across the entire globe at a certain bandwidth, basically. As long as you hit that and the reliability is there, the thing is going to emerge on top of it. Similar kind of thing with electricity: whether it's DC or AC or whatever, you've got to lay some copper wires down, you've got to do this and that, but once you do, the entire world expands on that idea. And then maybe decades later, at least in the case of the internet, or cars, or electricity, is when you go back to rework and reoptimize, once you've extracted everything you can out of the first elements that you found. For that reason, I think chip manufacturers right now can rest on transformers, recurrent neural networks, convolutional neural networks, and linear layers, and just say, hey, we're going to build for that, and we're going to build for inference on that side, not training.

And we're going to go serve that to the world. They'll probably have massive uptake because of that, and they probably won't be caught out two years from now, four years from now, by some new architecture that kills everything. Probably not, actually. We're at the point, and there are some science reasons why this would be true, where basically: with convolutional neural networks, you figured out the spatial portion. With recurrent neural networks, you figured out the temporal portion. And with attention, you figured out how to focus on certain areas. And it's like, what else do you need to build something that is good? That's what you need. So anyway, they'll just build chips around that. And I expect companies like them, and others like Intel, to get into the game; they've purchased companies and whatnot. But I think there's real hope there, because we're at a more stationary point in AI from an underlying model architecture perspective.
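For concreteness, the three primitives he lists map onto standard neural-network layers. A toy PyTorch sketch, an illustration of the building blocks a chip would target, not any real production model:

```python
import torch
import torch.nn as nn

class ToySpeechBlock(nn.Module):
    """The three primitives, stacked on a 1-D signal: convolution
    (spatial/local), recurrence (temporal), attention (focus),
    plus a linear layer."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=9, padding=4)    # spatial
        self.rnn = nn.GRU(dim, dim, batch_first=True)              # temporal
        self.attn = nn.MultiheadAttention(dim, num_heads=4,
                                          batch_first=True)        # focus
        self.out = nn.Linear(dim, dim)                             # linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples), a raw waveform-like input
        h = self.conv(x.unsqueeze(1)).transpose(1, 2)  # -> (batch, time, dim)
        h, _ = self.rnn(h)
        h, _ = self.attn(h, h, h)
        return self.out(h)

x = torch.randn(2, 1600)           # two fake 0.1s clips at 16 kHz
print(ToySpeechBlock()(x).shape)   # torch.Size([2, 1600, 64])
```

An inference-only chip that runs these four layer types fast and cheap covers an enormous share of deployed models, which is the bet he's describing.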

We're totally not at a stationary point from an application adoption perspective, though. So there's going to be so much demand just for the underlying elements that make it up. GPUs are going to do well. The specialized ones are going to do well. I think everybody's going to do well on the hardware side, selling picks and shovels.

Sandhya Hegde

Have you thought about what additional hardware specialization would do for Deepgram? Is there a fantasy chip that would specifically be better for you as a voice AI company than it would be for even a multimodal or an LLM company? And how much of a difference would it make? Is it, okay, this will give me 20 percent better performance, or is it 10x better performance? What are the thresholds here for voice?

Scott Stephenson

I like that you brought it up that way, because there really are only two types of gains, the 20 percent or the 10x. It's got to be multiples, or in the tens-of-percents range; otherwise you won't think about it. The tens-of-percents range is just the iterative, hey, can we get a little bit better without a lot of effort? And the 2x or 10x or whatever is when you're betting on new architectures or new ways of solving the problem. Here's the reality, at least for Deepgram, inner secrets here: our COGS are so much better than our competitors' that we don't need a new chip to be effective. When our competitors figure that out and do something about it, which is a very hard thing to do, and we have eight or nine years of research behind all of it, then there will be a, okay, we should be thinking about a new chip. We do think about that already, but in the current market, and for the next year or two, there probably is no pressure there.

But there will be pressure in two or three years or whatever it is. So we do think about what silicon partners we should be working with. NVIDIA is an investor in Deepgram, and we work closely with them. But there are others as well; you have to keep your options open on the underlying substrate that you run your models on, just because of the competitive dynamics of the market.

And so we'll see how it all turns out. But specifically for Deepgram, there's no one thing that makes it so much better. Others probably would hope for one; I don't see one coming to save them, though. It's really just, you've got to do the hard work to optimize and rethink all your problems.

Sandhya Hegde

How do you see Deepgram's product vision evolving? And if you had to look 10 years into the future, what do you hope Deepgram will be?

Scott Stephenson

Speech-to-speech. Right now we're talking a lot about speech-to-text, which is Deepgram's first product. We're the world-leading API for speech-to-text. More data is pumped through Deepgram than any other provider, and we're known for that. But that's not the end; that's not the only thing. We are a voice AI platform, and we recently released our text-to-speech to supplement our speech-to-text. So it's actually funny: for a little bit here, for a year or two, it'll be STT, TTT, TTS. That is: speech to text, text to text, and text to speech. For text to speech, we released a human-like, very expressive TTS, and it fits with what I talked about before: low latency and low cost as well. So that can power the speech-to-speech experience, a full voice bot that can listen to your conversation and respond in a human-like way, where you're pumped to be talking with it. There is a piece in the middle, the TTT, what most people would call an LLM right now, where we partner with others. We have some LLMs that are owned internally at Deepgram, where we do sentiment analysis and intent detection and topic detection and that kind of thing. But if you want an open-ended LLM to drive your voice AI agent, then you would use OpenAI or Anthropic or a Mistral model under the hood or something. And that's something we're completing the voice platform on right now. TTS goes GA in a couple of days, and it's already been in early access with a lot of successful companies using it. A lot of YC companies are using it; a huge fraction of the new YC batch are AI agent companies. So we see a lot of success there, and with other companies as well. We'll be releasing that in a week. But yeah, it's speech to speech, in three words.
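A minimal sketch of that broken-out STT, LLM, TTS loop; the three functions below are placeholder stubs standing in for real provider calls (none of these are real SDK calls), but the shape is the point: the stages are connected only by text, which sets up exactly the context loss Scott describes next.

```python
def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real system would call an STT API here.
    return audio.decode("utf-8")           # toy: pretend the audio is text

def text_to_text(user_words: str, history: list[str]) -> str:
    # Placeholder for the LLM step; any hosted model would slot in here.
    return f"I heard: {user_words}. How can I help?"

def text_to_speech(text: str) -> bytes:
    # Placeholder: a real system would synthesize audio here.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, history: list[str]) -> bytes:
    words = speech_to_text(audio_in)       # only words survive this hop,
    history.append(f"user: {words}")       # so tone and emotion are dropped
    reply = text_to_text(words, history)   # before the LLM ever sees the turn
    history.append(f"agent: {reply}")
    return text_to_speech(reply)           # and the TTS never learns the mood

history: list[str] = []
out = handle_turn(b"my package never arrived", history)
print(out.decode("utf-8"))
```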

But there will be versions of that. The first version will be that broken-out version: speech-to-text, then an LLM, then TTS. The next version will be fully contextualized speech-to-speech, and that's what we work on in our research team. I think next year around this time, people will be complaining about the standard speech-to-speech voice AI bots, and they'll say things like this: when it responded to that message, it didn't really sound like it cared, or like it was empathetic, or it didn't really sound like a human. Like, we crossed over, and a lot of things felt right, but then when I made a statement that was very negative, it responded in a very positive way, and that doesn't make sense. What is going on here? And that's because when you build a system in the speech-to-text, text-to-text, text-to-speech way, you're only passing words between the stages; you're not passing any context through them. You're not saying: the goal of this conversation is this, and you should respond in a very caring, empathetic way.

Or: I detected that they were upset in the speech-to-text, so I'm going to pass that along to the text-to-speech. That's what the next version will be. That's what speech-to-speech V2 will be, and that's what we'll be talking about in a year. But the one thing I've learned is that you don't jump too far ahead, because if the market is not educated on something simple to begin with, they won't understand the more complicated thing. I mentioned the first product we ever built was an embeddings model with a fuzzy vector database search. Nobody knew what to do with that eight years ago. So we're not going to build the full speech-to-speech model and release it right now, even though we have research that shows we could, because the world is not ready yet. The world needs to see the speech-to-text: they need to debug it, read the words, send them to their LLM, choose which LLM, prompt their LLM, do all of that. Then they get over that tuition hump, and a year from now, two years from now, three years from now, they say, I would love a model that is even more expressive and better. That's the trajectory of the market, and that's what we'll be doing.

Sandhya Hegde

And you want to be a year or two ahead of the market adoption cycle in terms of what you're building, not four years ahead, and definitely not two years too late. Any advice for new founders planning to start an AI company in 2024?

Scott Stephenson

So, what I said about our journey early on, taking two, three, four years to really build the foundation: I still think that's possible, but it's an even harder road now, and you have to be thinking next-generation again. For some founders it's possible, you can do it, you may be successful; the odds are lower, but there are ways to do it. But most AI companies should be thinking about building on top of other people's frameworks and infrastructure, very similar to how tech companies build on top of CPUs and GPUs and memory and hard drives and networks and that kind of thing. You should be thinking that way in AI now, rather than trying to reinvent the wheel in every single spot, because if you do, you'll probably be lagging behind, unless you have some amazing insight that other people are not seeing. So find really good strategic infrastructure partners. Most of them are probably really interested in partnering with you, because you can be a really good product development partner for them as well. The market is still early. Everybody's still talking to each other. It isn't the knife-in-the-back stage or anything like that; it's rising tides lift all boats. Take advantage of that right now and get in with the right frameworks and strategic infrastructure providers. And once you think that way, the standard product advice follows: find something that people love, and obsess over those 10 or 100 or whatever people.

Make sure that market could expand into a much larger market. Communicate that vision to your investors. Find really great investors that can back you up on it and then introduce you to the people you want to interface with on the next step. Somebody was asking me recently: should I not care about who my seed investors are right now? Just take the money, because it's just money, right? For some of them, probably, yes, you can do that. But you want at least some of them, whether angels or seed funds or even corporates, to have a good network, so they can introduce you to who's going to lead your A and vouch for you. I think it's really important for folks to think about that. You brought up this point: it's really easy to raise money in AI right now. Don't make the mistake of taking the easiest money when you could get money that's a little harder to get but comes with a really good network. Those are the pieces of advice I would give.

Sandhya Hegde

Yeah, I second all of that for our listeners. I think it's not just about taking the easy money. It's also confusing investor interest with idea validation. That makes me very nervous: you raise money easily, and you assume it's because your idea is perfect, as opposed to, they're just hoping you will do the hard work of iterating on it and figuring it out.

Scott Stephenson 

Yeah. Investors are betting on a market. They're betting on you and a market, and the market they're betting on is eight years out. Especially at the seed stage, they're betting on 50 companies for that. And yeah, it's a really good point that investor validation is not customer validation.

Even when we first raised our money based on the search that I talked about, the embeddings and that kind of thing, investors were super pumped about it. Then we go to the market, and we just get punched in the face over and over, because everybody is asking us for speech-to-text. And we're like, okay, we should just build the simple thing that they need right now.

And then we'll turn that into something bigger later. And it could take way longer than you think. But the important thing is that you get the market. You get the pull from the market.

All posts

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

All posts
April 1, 2024
Portfolio
Unusual

Deepgram's product-market fit journey

Sandhya Hegde
No items found.
Deepgram's product-market fit journeyDeepgram's product-market fit journey
Editor's note: 

SFG 43: Deepgram's CEO Scott Stephenson on Speech AI

Deepgram is a voice AI company that has built an incredible reputation in the market for the quality of its speech recognition. Last valued at over $250M, Deepgram has over 500 customers, including NASA, Spotify, and Twilio.

In this episode, Sandhya Hegde chats with Scott Stephenson, CEO and co-founder of Deepgram.

Be sure to check out more Startup Field Guide Podcast episodes on Spotify, Apple, and Youtube. Hosted by Unusual Ventures General Partner Sandhya Hegde (former EVP at Amplitude), the SFG podcast uncovers how the top unicorn founders of today really found product-market fit.

Episode Transcript

Sandhya Hegde

Welcome to the Startup Field Guide, where we learn from successful founders of high-growth startups how their companies truly found product market fit. I'm your host, Sandhya Hegde, and today we'll be diving into the story of Deepgram. So, Deepgram is a voice AI company that has built an incredible reputation in the market for the quality of its speech recognition.

Last valued at over 250 million, Deepgram has over 500 companies in its customer portfolio today, including NASA, Spotify, and Twilio. Joining us today is Scott Stephenson, CEO and co-founder of Deepgram. Welcome to the Field Guide, Scott.

So I I'm so excited to learn more about how you ended up starting Deepgram. So you were a particle physicist, right? You were like working on detecting dark matter less than 10 years ago. How did you end up becoming an enterprise software company CEO?

Scott Stephenson:

 Yeah, I was building deep underground dark matter detectors in a government-controlled region of China. And it was basically like a James Bond lair two miles underground. No kidding. And it was mostly no rules as well. And‌ I look back now and think it's a lot like building a startup. And I can, I could talk about that. In a minute, but yeah, I was a particle physicist looking for dark matter. And I was a graduate student when we were working on that. And the reason that you go underground is because you're trying to run away from radiation. You're trying to run away from background radiation, cosmic radiation.

And so you build this extremely sensitive detector that has very little background inside, and so we called it the quietest place in the universe, because basically, from a radioactive perspective, there's nothing going on inside that detector. And that's what you were hoping for, except for finding dark matter. So you wanted nothing to be going on except for dark matter hits, basically. But the type of thing that you had to do to build that detector and detectors like it, what was waveform signal processing? You had to use a lot of machine learning to understand real-time data feeds, and all of these things had never been built before.

You had to tell signal from background and build all your systems around that. And then manage the experiment and run it over a long period of time because it takes data for six months, a year and a half, and all the systems have to be like perfect throughout that time. And then you're taking data to understand what's going on.

What that turns into is, we're an audio company, Deegram is, and it's the equivalent of doing 60, 000 conversations at once is what we were doing underground in that detector. And it was a team of 15 people that were working on it. And so we had this high-leverage dark matter experiment that was going on.

And an equivalent experiment might have a hundred people or a thousand people on other particle physics experiments. But we leveraged deep learning and other machine learning techniques to keep the team small and discover a lot of things. And it's interesting, the type of signal that we were analyzing was a waveform. It was an analog waveform that was digitized by an FPGA. Very similar to like how a microphone works. It's an analog waveform, then it gets digitized by an analog to digital conversion on your computer. And we just noticed a lot of similarities between these, but that's not‌ what got us into audio as a company.

What got us into audio as a company was we were sitting around building this experiment, letting it take data for six months, a year, etc. And while you're doing that, all the work is done, at that point. You did all the hard work. You're taking data, right? And what we were doing with that time was we built these little devices that were… it's‌ funny because we were doing this like eight or nine years ago. But what you see now, so the devices that people are building now that record your life and make a backup copy of it. Today, this is these life-logging devices and there's a few of them that are out there now that are starting, but that's what we were building back then, basically like a Raspberry Pi that recorded your life all day, every day, 24-7. And we got thousands of hours this way, and we wanted to understand what was inside it because we were building the detector, taking data, having all sorts of interesting conversations, all sorts of not interesting conversations as well. And we wanted to sort the signal from the noise, basically.

And so we were thinking, Hey, we're doing this machine learning, these end-to-end deep learning machine learning techniques on our particle physics data that looks a lot like audio. Surely there's some company out there in the world that is doing that too. And so we went looking for that, so that we could understand these thousand-plus-hour data sets that we had. And the only real company out there was Nuance, and Nuance, as folks know now, they're the largest AI acquisition even though they're like old world acquisition, or sorry, old world AI. They got bought for 17 or 20 billion dollars or whatever by Microsoft, but Nuance is a speech-to-text technology company, but when we went looking through their product portfolio, we were just like, whoa, this technology is not the new technology. This is not end-to-end deep learning. It's not going to be able to do the things that we want, et cetera. We were just developers, looking for a developer tool to go do this thing. And so we're looking around saying why is nobody building this?

 But as scientists, scientists are generally pretty pessimistic. We would think yeah, surely somebody's building it, even if it's not out yet, Google or Microsoft or somebody like that. They're probably building this. So we just emailed them from our academic email addresses and emailed their speech teams and met with them independently, and we were just hoping that they would give us access to something that we could use, maybe early access or something, a naive request, but nevertheless. They took the meeting and talked with us, and we were pitching to them, end-to-end deep learning is going to take over speech recognition and conversational AI, and so surely you guys are working on that, right? And they basically laughed us out of the room. They were like, language is too complicated. End-to-end deep learning will never work. This data-driven technique will never work, etc. And we couldn't believe it. We were just like all the reasons that you're giving for why it won't work are all the reasons it will work. That's why it has to work. And we had a meeting with Microsoft, we heard that. We had a meeting with Google. We heard the same thing. And we were just like, okay, I guess we have to start a company because these other people who should be doing it don't get it. And we will go, we'll go do it now. And so then we went from being pessimistic to extremely optimistic into the company saying, okay, no, the people who should get it, don't get it. That's a really good sign for building a product. And so let's go build that thing.

Sandhya Hegde

And this was‌ 2015 or 16. What was the first year of being a founder like?

Scott Stephenson

We started with a demo. So we're just physicists. We used coding as a tool. We used machine learning as a tool. We use whatever we can as a tool to get to the end goal that we're looking for. And so we built these underlying models that were what people would think of today as embeddings models with a vector database search.

So we built embeddings models and we built a fuzzy database search on top of them to search inside the audio that we were looking for, and so the that was what we started with and we built a demo around that. At the time, Silicon Valley was a popular show. And so we took, like, all maybe 2 seasons up until that point, because it was still ongoing and we concatenated them all together and made 1 extremely large file.And then built a search across that so that if you wanted to search for a specific phrase like the middle-out technique or this or that, like many of these things that were mentioned in the show, then you could search across that entire thing and be taken to specific moments and then view those moments inside.

But it was a demonstration of the technology, right? And then we went and shopped around to a couple of our friends at university and one of them said, Hey, I know an investor. They're an early seed-stage investor. And that person also was a founder of a Bitcoin company that eventually got hacked and failed, but nevertheless, they're like, Hey, I know this person. Do you want to meet them? And so we were just graduate student physics kids that didn't connections with Silicon Valley whatsoever, but that was our like one single connection. And then once we met them, they're like, hey, you should come out to the Bay Area. And this is 1517, it's Danielle and Michael from 1517. And you should come out to the Bay Area, we'll give you a thousand bucks for free, we won't even invest in the company. Just come out and do like the things that you're doing and meet some people. And when we did that, our eyes just opened up like, wow, okay.

Okay. There's a lot of funding around. There's a lot of people working on a lot of different things. We need to move here and start working on it. And so it was just two founders sitting in a room all day, every day, taking that demo and turning it into an API. That we then built other demos on top of, and then we went and shopped around two different seed investors. And then we eventually applied to YC as well and got into YC. And then that was when everything started to go really fast because up until that point, at least for us, we were gathering our allies, met a few people here, et cetera, we're figuring out our strategy, but then we get into YC and it's okay, game on, basically, right? And so as soon as we did that, we were still trying to figure out, should we go B2B or should we go consumer? I don't really know. I knew that we could be successful in B2B, but is there a consumer product in 2015 that should be built around voice AI? And so we went looking for it, and we built these products that would‌ seem pretty quaint now.

So one was where you would write down a certain phrase. And then it would assemble video clips based on the audio and that it could make anybody say anything. So our target there was Donald Trump, just make Donald Trump say anything, and this is 2016, or 2015, 2016. Right before that first presidential election. And then we released that on Product Hunt.. And then we did another one that was like search all of the video as if it were text. It's funny. Google is Google. We called it Hoogly. You could go there and then you could search across a broad amount of YouTube videos and search directly into them. Again, all these technologies have now in the last eight years, they've come around and if you do a Google search, you'll find that but so we were trying to figure out, do these things make sense? And‌, it was like, no, not at the time. Basically what you're going to do is you'll build something that's popular. It'll probably turn into a 10 or 20 or 50 million dollar company. And then you'll be forced to be acquired because there's no way to distribute yourself. You have to piggyback off like YouTube or something like that. And then, Google will just crush you or whatever it is. And so we're like, okay, are we really good at building infrastructure and backends? Yeah. Are we really good at building models from scratch and foundational models? Yeah. Okay. That's what we should do. We should go B2B and we should supply the voice AI backend that powers the next generation of apps like we were just talking about. But so we'll build the underlying infrastructure to do it. But we knew we were signing up for a long march then. There's no like quick path to success when you do that, it’s okay, we're going to be grinding for two, three, four years until we release a product probably. But that product is going to be like a step function far better than competitors and then they won't be able to catch up and then, so we're betting on this voice AI market coming sometime in the next four or five years and and it worked out.

It worked out. We all know how it goes now, but at the time everybody thought we were crazy. But go back to pre-pandemic, basically. Right around the time of the pandemic, meeting tools and everything started to become very popular, and voice AI was on everybody's mind. We had released our product maybe a year and a half before that, and then we just saw massive uptake. And it hasn't stopped since, essentially, because everybody's realizing the uses for voice AI.

Sandhya Hegde 

And at the same time, there were other companies, I would say, also piggybacking off just the status quo speech-to-text technology, whether that's Gong or these other companies as well. So the application and the business value started becoming more apparent. I'm curious, obviously building deep tech takes a while, and one of the risks of hunkering down for two or three years is you don't have a quick feedback loop from the market on whether this is the application of your tech that will have the most demand, whether you're in the right category, et cetera. How did you approach that? Did you have design partners that were your feedback loop? For the two to three years that you were really building, what did the customer feedback and customer focus look like at Deepgram?

Scott Stephenson

I don't recommend this to any AI company right now. I'll tell you how it went for us, but I don't recommend this for any AI company right now, because the world is moving too fast, and so you need to build on other people's infrastructure to be relevant. But what we did was we just picked the fundamental things that we thought definitely mattered. And it wasn't one thing. It was latency, accuracy, and COGS, basically. If you can get really good latency, really good accuracy, and you can have low COGS, then you can provide that product that we were looking for, that we were personally looking for.

You can go provide that to the world. And so we weren't focused in the first three years on trying to monetize that at all. We just opened it up for free. We didn't open-source it, but we opened it up for free. Anybody could sign up and use it. And once we got maybe around a dozen or so true believers, people that believed in the product and were using it, then we took that down. We just said, okay, we've got all the product development partners that we need. And literally it was down for maybe a year and a half or something. We didn't allow anybody to sign up for Deepgram or any of that stuff. We were just working with these 12 people. And again, I'm saying I don't recommend this right now because the world is moving too quickly, but back then, everybody was really slow to figure out how AI works.

And so you needed to help coach them along on how they could utilize it and everything. Anyway, we worked with those 12 people and figured out what they wanted, what languages they cared about, what latencies they cared about, how they wanted it all to work, et cetera. And then, of course, we were thinking about, okay, what about the next 100 customers? Because we're building a product that can actually scale. And so what happened was, with those first dozen or so customers, I just went to them after they had been using it for maybe a year and a half or something, and said, we want you guys to pay now. And much to my surprise, they're like, I can't believe you let us use it this long for free. Your product is amazing. Sure, we want to pay. And it's like, oh, okay, that was easy. And that got us maybe about halfway to the ARR goal that we had for raising an A. But at that point, it was like, okay, we just cashed in on the beta users, basically. We hadn't been going out into the world and getting new users.

So at that point in time, we were pretty much 100 percent a technical team. I had not hired anybody from a go-to-market perspective; there was nobody at the company. I was the go-to-market person, the particle physics PhD. And that still works, but nevertheless, you want somebody who's, I think of them like an athlete. You want somebody who gets up in the morning, ties up their practice shoes, and is ready to go for the run. Ready to put in the work of going out and educating the world about Deepgram, and then pushing it all the way through and closing the deal at the end, too.

And then making sure they're happy and stay. You need a person who's going to do that. And so I tapped into my network of people, which was not all that big at that point, but YC helps. So I tapped into the YC network and said, does anybody know how to hire a salesperson? And I met these great folks, Cory and Hilmon, from a company called ClozeLoop. They came on as sales advisors for Deepgram, and we gave them a few basis points, basically, and paid them a little bit of cash. And I highly recommend this to technical founders: go find folks like this who have scar tissue in that area. It's not just for their ideas, it's also for their network. Because I went to them and said, hey, I don't want you just to figure out how to help us put our go-to-market together. I want you to help me hire the person who's going to run it, too. And so we did that, and that person is our VP of sales now at Deepgram, Chris Dyer, and he was a great hire then.

He's the athlete guy. He gets up every morning and that's what he's doing, and he's been doing that for six years, and it pays off for Deepgram. But nevertheless, around that time we needed to start getting a real go-to-market going, not just the free-beta-user-to-paid-user method. And so we got that going, and that's when we said, okay, we know this is repeatable, we know we can go out, we know we can hire go-to-market people to go sell it. So now is the time to go raise an A. Hopefully that gives you an idea.

Sandhya Hegde

I'm curious, who were your early customers? What industries or use cases did they represent? And what product feedback did you get that's memorable, maybe surprising?

Scott Stephenson

The typical market, especially then, was all call centers. They were the only ones who were buying speech recognition at scale. That's not true today anymore, but back then, they were the only ones who had recognized the value. And what they were doing was analytics. They would have a call center handling ten thousand or a hundred thousand calls a day, and they would have folks in the call center who were maybe trained a week ago, right? They're on calls talking to customers, and the company wants to figure out: hey, are they saying things they shouldn't be saying? Are they following the script you want them to follow? Is the customer happy in certain areas, so we should go reward them? Whatever it is, they just wanted to figure that out. And the way it was previously done was a manager would listen to random calls. They don't have much time, so for each person they managed, they would listen to maybe one or two calls per week, and that's how the feedback would happen. Those people are making calls all day, every day, right? So anyway, they were automating that process. And one of the early customers, there you find love triangles, this is what we saw, was Uber, Deepgram, and a company called Randall Riley. Randall Riley does hiring for companies like Uber, or trucking companies, that kind of thing. Folks who are interested in being a contract driver or whatever it is for those companies call in and have a conversation about what their qualifications are and everything. And they wanted massive-scale lead qualification there, basically.
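To make that concrete, here is a minimal sketch of what automated call QA like this can look like once every call is transcribed: run simple checks over each transcript instead of a manager sampling one or two calls per agent per week. The phrase lists and the review_call helper are made up for illustration; this is not anything Deepgram ships.

```python
# A minimal sketch of automated call-center QA over STT transcripts.
# PROHIBITED and REQUIRED_SCRIPT are illustrative phrase lists only.

PROHIBITED = ["guaranteed refund", "that's not my problem"]
REQUIRED_SCRIPT = ["thanks for calling", "is there anything else"]

def review_call(transcript: str) -> dict:
    """Flag prohibited phrases and missed script lines in one transcript."""
    text = transcript.lower()
    return {
        "violations": [p for p in PROHIBITED if p in text],
        "missed_script": [s for s in REQUIRED_SCRIPT if s not in text],
    }

# At ten thousand calls a day, this covers every call, not a weekly sample.
print(review_call("Thanks for calling! I can offer a guaranteed refund today."))
# {'violations': ['guaranteed refund'], 'missed_script': ['is there anything else']}
```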

And so anyway, that's the type of thing we started out with. But I think it's really important to have a beachhead like that, a market that is already there, especially when you're trying to build deep tech that is new. If you have to create the market and the tech, and label your own data, and do all this stuff, your chance of success keeps going down, basically. You should have some kind of stable market that you know you can sell into if you're going to take on all these other risky things.

Sandhya Hegde

And speech recognition is notoriously difficult when it comes to getting the last few points of improvement, right? Whether that's accuracy, cost, or latency. I'm curious what your approach has been to keep pushing those boundaries out. And what is it that customers care about? Is everybody focused on these performance metrics, or are there completely separate criteria? Like with how you deploy other APIs, were there other things that also ended up being important that maybe you hadn't considered at first?

Scott Stephenson

Yeah, there are many things that are important. The big three, though, are, first, accuracy of the model. That's the top one that everybody cares about to begin with. Then there's speed, and speed is linked with cost, because if you have a really fast model, your cost to serve is low, so you don't have to charge a lot and can still give them that really fast model, basically. And then if that model is super accurate, you've hit the holy trifecta, right? Those companies are like, I can't think of a reason not to go with you. You don't have to convince them about anything. The one thing you do have to convince them about is trusting you over Google or something like that, but that's not about the product specifically. And this is the perspective of an infra company, a foundational AI model-building company that provides strategic infrastructure. You have to think that way. If it's not better in all three, then what are you doing? Now, those three are the table stakes, essentially, if you want to win the market. But if you want to ensure higher success, take the money that you save in COGS and put it toward really good support, really good onboarding, really good reference architectures, and open-source code too. You reinvest that back into the company to make it easier to onboard, because when you're building a developer tool or a strategic piece of infrastructure like this, there is a learning curve for people to come into it. They have to know what's possible. You have to make it easy for them to sign up so that they can just switch from Google or Amazon or whatever to you easily. And then they get all the feedback that they need in doing that, and all the good feelings that go along with it. So this comes down to cost as well. If you have poor margins on the service you're providing for your model, then that cuts into your ability to give really good support.
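Scott doesn't name a metric here, but the industry-standard way to score the accuracy leg of that trifecta is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the reference length. A minimal sketch:

```python
# Word error rate (WER), the standard speech-to-text accuracy metric:
# edit distance over words, normalized by reference length.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (Levenshtein) over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn left at the light", "turn left at light"))  # 0.2
```

Lower is better, and a model that is cheaper and faster but posts a worse WER than the incumbent fails Scott's "better in all three" test.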

And so this is one of the problems that the big tech providers have right now: for speech-to-text especially, and for text-to-speech, they have a lot of cost on their side to run it themselves. Companies like Deepgram are an order of magnitude better, or more, from a COGS perspective. So they can't drop their prices too much. To compete, that's what they try to do, but then they can't provide any support. If you want to win as a strategic infrastructure provider in AI, you have to do it all, basically. And this is why, as I was saying earlier, you're signing up for a long march, because you know you're not going to get every single piece right away. But you find the innovators that believe in your vision, and you collect them along the way. And eventually it turns into a hockey stick, because you have all the coverage and nobody has an attack vector.

Sandhya Hegde

And now you have a lot of startups building different types of enterprise conversational AI solutions, branched out a lot from the early days of just the analytics use case. I feel like you have a good bird's-eye view of how the contact center slash customer support industry might evolve. What are your thoughts?

Scott Stephenson

Real-time bots. That's where it's going to go for sure, no doubt about it. And some of those real-time bots will be text; many of them are going to be voice. Anything that is text will eventually add voice to it, because it's just easier for humans. If you can choose, if you have your best friend in life sitting next to you, are you going to type on your computer and write them stuff? Or are you just going to turn and talk to them? You're going to just turn around and talk to them. And so you want to get that level of trust, or something close to it, with voice AI systems, where you're like, I know you're going to understand what I'm saying. I know you're going to have my best interests in mind.

And I know that you're going to come up with good solutions to my problems. Okay, so I'm just going to turn and talk to you about that thing. And the way this is going to turn out is, everybody dreads calling into their Comcast or something like that. Right now you literally think about it and just dread the prospect of doing it. That won't be the future; you'll be happy to do it. Okay, I'm going to call in, and I know I'm not going to wait for 45 minutes, I'm not going to get hung up on four times, and my problem is going to get solved in two minutes. So why wouldn't I call them and try to get that solved? There are no business hours, there's no waiting in line, there's none of it. You just do it whenever. For call centers and other customer success things, that's going to be a real game changer. But there's a whole other world beyond customer service: it's any consumer type of interaction. Take your best salesperson, your best whatever, and say, if you could clone them, would you? If you could clone the happiest, best version of them, would you? In the next five years, all of this stuff is just going to be rolled out. And we'll be glad to call those services, whether it's on an app, whether it's literally a telephone call, whether it's in the browser while you're typing, whatever it is. All these things are happening right now. So it's real-time bots, real-time agents. That's the future.

Sandhya Hegde

Shifting costs from man hours to GPU hours. One GPU at a time.

Scott Stephenson

Yeah, and using the best part of human creativity to train those models and make them better, right? Whenever there's some dull grunt work that you dread, that's probably what you should be automating. And then there are other parts where you're like, oh, that's an interesting question, I haven't really thought about that. That's where a human is really valuable. You can hand it off to them, they can help solve the problem, and then they make that system even better. So yeah, that's the future.

Sandhya Hegde

Yeah. I see a lot of folks, especially on the text side, who have been trying to automate customer support and service-style use cases over the past year. The trickiest thing so far, leaving aside all of the hallucination and accuracy questions that everyone has talked to death, has been the handoff: when should we rely on the bot versus when do we really need to escalate? If we can get really good at that, it becomes easier and easier to adopt the bot. But yeah, I hear you. I think it's such a win in terms of better customer experience. It's work that is not really a career path for anybody, and it's expensive to do. Obviously the impact on labor in general is a big question mark and a point of concern, but you can see why this is going to happen; all of the forces are aligned to make it happen. Maybe speaking of cost a little bit more, I'm curious, given that your deep expertise extends to the hardware layer of this, how are you thinking about general-purpose versus specialized hardware in this ecosystem? For performance to keep improving, what changes would you expect in the GPU and data center segments of the AI industry?

Scott Stephenson

Yeah, how specialized? Because specialized is definitely going to be a thing, but it's actually a question of how specialized. I think everybody has gotten the memo that they need GPUs. They're not going anywhere; they are ten times better than CPUs for deep learning, or more. And there will now be serious competition from other GPU manufacturers like AMD.

They're starting to bring that into the market. Nvidia is still by far the leader, and it makes sense: they invested in this early, they built the ecosystem around CUDA and all of that. So they're enjoying that benefit right now, but it's going to get very competitive. And there's a whole other side as well, which is the application-specific chips that are out there, like Groq and others. I think they have a real fighting chance as companies, because there's this big capability boom in AI right now based on different architectures that people have found: CNNs, RNNs, transformers, attention, that kind of thing. But I look at this a lot like electricity, or the internet. You don't have to have it all figured out to get a ton of value. What you need are the basic components to get that value. For the internet, it was connectivity across the entire globe at a certain bandwidth, basically. As long as you hit that and the reliability is there, the thing is going to emerge on top of it. Similar kind of thing with electricity: whether it's DC or AC or whatever it is, you've got to lay some copper wires down, you've got to do this and that, but once you do, the entire world expands on that idea. And then maybe decades later, at least in the case of the internet, or cars, or electricity, is when you go back and rework and reoptimize, once you've extracted everything you can out of the first elements that you found. And for that reason, I think chip manufacturers right now can rest on transformers, recurrent neural networks, convolutional neural networks, and linear layers, and just say, hey, we're going to build for that, and we're going to build for inference on that side, not training.

They're going to go serve that to the world, and they'll probably see massive uptake because of that. And they probably won't be caught out two years from now, four years from now, by some new architecture that kills everything. Probably not, actually. We're at a point, and there are some science reasons why this would be true, where basically: with convolutional neural networks, you figured out the spatial portion; with recurrent neural networks, you figured out the temporal portion; and with attention, you figured out how to focus on certain areas. And it's like, what else do you need to build something that is good? That's what you need. So anyway, they'll just build chips around that. And I expect companies like them, and others like Intel, to get into the game; they've purchased companies already. I think there's real hope there, because we're at a more stationary point in AI from an underlying model architecture perspective.

We're totally not at a stationary point from an application adoption perspective, though. So there's going to be so much demand just for the underlying elements that make it up. GPUs are going to do well. The specialized ones are going to do well. I think everybody's going to do well on the hardware side, selling picks and shovels.

Sandhya Hegde

Have you thought about what additional hardware specialization would do for Deepgram? Is there a fantasy chip that would be better specifically for you as a voice AI company than it would be for even a multimodal or an LLM company? And how much of a difference would it make? Is it, okay, this will give me 20 percent better performance, or is it 10x better performance? What are the thresholds here for voice?

Scott Stephenson

I like that you brought it up that way, because there really are only two types of gains, the 20 percent or the 10x, basically. It's got to be multiples, or in the tens-of-percent range, otherwise you won't think about it. The tens-of-percent range is just the iterative, hey, can we get a little bit better without a lot of effort? And the 2x or 10x or whatever is when you're betting on new architectures or new ways of solving the problem. Here's the reality, at least for Deepgram, inner secrets here: our COGS are so much better than our competitors' that we don't need a new chip to be effective. When our competitors figure that out and do something about it, which is a very hard thing to do, because we have eight or nine years of research behind all of it, then it will be like, okay, we should be thinking about a new chip. And we do think about that already, but in the current market, and for the next year or two, there probably is no pressure there.

But there will be pressure in two or three years, or whatever it is. And so we do think about what silicon partners we should be working with. NVIDIA is an investor in Deepgram, and so we work closely with them. But there are others as well; you have to keep your options open on the underlying substrate that you run your models on, just because of the competitive dynamics of the market.

And so we'll see how it all turns out. But specifically for Deepgram, there's no one thing that makes it so much better. Others would probably hope for one; I don't see one coming to save them, though. You just have to do the hard work of optimizing and rethinking all your problems.

Sandhya Hegde

How do you see Deepgram's product vision evolving? And if you had to look 10 years into the future, what do you hope Deepgram will be?

Scott Stephenson

Speech-to-speech. Right now, we're talking a lot about speech-to-text, which is Deepgram's first product. We're the world-leading API for speech-to-text. More data is pumped through Deepgram than any other provider, and we're known for that. But that's not the end, that's not the only thing. We are a voice AI platform, and we recently released our text-to-speech to supplement our speech-to-text. It's actually funny: for a little bit here, for a year or two, it'll be STT, TTT, TTS. So, speech to text, text to text, and text to speech. For text-to-speech, we released a human-like, very expressive TTS, and it fits with what I talked about before: low latency, and low cost as well. So that can power the speech-to-speech experience, a full voice bot that can listen to your conversation and respond in a human-like way, where you're pumped to be talking with it. There is a piece in the middle, the TTT, what most people would call an LLM right now, where we partner with others. We have some LLMs that are owned internally at Deepgram, where we do sentiment analysis and intent detection and topic detection, that kind of thing. But if you want an open-ended LLM to drive your voice AI agent, then you would use OpenAI or Anthropic or a Mistral model under the hood, or something. And that's how we are completing the voice platform right now. TTS goes GA in a couple of days. It's already been in early access with a lot of successful companies using it, a lot of YC companies; a huge fraction of the new YC batch are AI agent companies, and we see a lot of success there, and with other companies as well. We'll be releasing that in a week. So yeah, it's speech to speech, in three words.
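For readers who want to picture that broken-out STT, TTT, TTS loop, here is a minimal sketch of one conversational turn. The transcribe, complete, and synthesize functions are hypothetical placeholders for whichever STT, LLM, and TTS providers you wire in; this is not Deepgram's actual SDK.

```python
# A sketch of the broken-out voice-agent loop: speech-to-text (STT),
# text-to-text (an LLM), then text-to-speech (TTS). All three functions
# are placeholders for real providers.

def transcribe(audio: bytes) -> str:
    """STT: turn the caller's audio into words (plug in a provider)."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """TTT: an LLM decides what the agent should say (plug in a provider)."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """TTS: turn the reply text back into audio (plug in a provider)."""
    raise NotImplementedError

def handle_turn(audio_in: bytes) -> bytes:
    # Note what crosses each boundary: words only. Tone, sentiment,
    # and conversational goals are dropped at every hop, which is
    # exactly the limitation the "V2" described below addresses.
    user_text = transcribe(audio_in)
    reply_text = complete(f"You are a helpful support agent. The user said: {user_text}")
    return synthesize(reply_text)
```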

But there will be versions of that. The first version will be that broken-out pipeline of speech-to-text, then an LLM, then TTS. The next version will be fully contextualized speech-to-speech, and that's what we work on in our research team. I think around this time next year, people will be complaining about the standard speech-to-speech voice AI bots, and they'll say things like: when it responded to that message, it didn't really sound like it cared, or like it was empathetic. It didn't really sound like a human. We crossed over, and a lot of things felt right, but then when I made a very negative statement, it responded in a very positive way, and that doesn't make sense. What is going on here? And the reason is, when you build a system in the speech-to-text, text-to-text, text-to-speech way, you're only passing words between the stages. You're not passing any context through them. You're not saying, the goal of this conversation is this, and you should respond in a very caring, empathetic way.

Or, I detected in the speech-to-text that they were upset, so I'm going to pass that along to the text-to-speech. That's what the next version will be. That's what speech-to-speech V2 will be, and that's what we'll be talking about in a year. But the one thing I've learned is that you don't jump too far ahead, because if the market is not educated on something simple to begin with, they won't understand the more complicated thing. As I mentioned, the first product we ever built was an embeddings model with a fuzzy vector database search, and nobody knew what to do with that eight years ago. So we're not going to build the full speech-to-speech model and release it literally right now, even though we have research that shows we could, because the world is not ready yet. The world needs to see the speech-to-text. They need to debug it, read the words, send that to their LLM, choose which LLM, prompt their LLM, do all of that, and get over that tuition hump, basically. And then, a year from now, two years from now, three years from now, they say, I would love a model that is even more expressive and better. So that's the trajectory of the market and what we'll be doing.
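As a rough illustration of that contextualized handoff, here is a sketch where each stage passes structured context, detected sentiment and the conversation's goal, alongside the words, so the TTS can choose how to say the reply, not just what to say. The TurnContext fields and the toy sentiment check are illustrative assumptions, not a published Deepgram interface.

```python
# A sketch of a contextualized handoff between pipeline stages:
# pass sentiment and goal along with the words, instead of words alone.
from dataclasses import dataclass

@dataclass
class TurnContext:
    text: str        # what the user said
    sentiment: str   # e.g. "upset", detected alongside transcription
    goal: str        # the conversation's goal, carried across stages

def enrich(transcript: str) -> TurnContext:
    # Toy stand-in: a real system would detect sentiment from audio and text.
    sentiment = "upset" if "frustrated" in transcript.lower() else "neutral"
    return TurnContext(text=transcript, sentiment=sentiment,
                       goal="resolve the billing issue")

def delivery_style(ctx: TurnContext) -> str:
    # TTS can now match the moment instead of defaulting to upbeat.
    return "calm, empathetic" if ctx.sentiment == "upset" else "friendly, upbeat"

ctx = enrich("I'm really frustrated, my bill is wrong again")
print(ctx.sentiment, "->", delivery_style(ctx))  # upset -> calm, empathetic
```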

Sandhya Hegde

And you want to be a year or two ahead of the market adoption cycle in terms of what you're building, not four years ahead, and definitely not two years too late. Any advice for new founders planning to start an AI company in 2024?

Scott Stephenson

So, what I said about our journey early on, taking two, three, four years to really build the foundation: I still think that's possible, but it's an even harder road now, and you have to be thinking next generation again. For some founders it's possible, you can do it, you may be successful, but the odds are lower. Most AI companies, though, should be thinking about building on top of other people's frameworks and infrastructure, very similar to how tech companies build on top of CPUs and GPUs and memory and hard drives and networks. You should be thinking that way in AI now, rather than trying to reinvent the wheel in every single spot, because if you try to do that, you'll probably be lagging behind, unless you have some amazing insight that other people are not seeing. So find really good strategic infrastructure partners. Most of them are probably really interested in partnering with you, because you can be a really good product development partner for them as well. The market is still early; everybody's still talking to each other. It isn't the knife-in-the-back stage or anything like that. It's rising tides lift all boats. Take advantage of that right now and get in with the right frameworks and strategic infrastructure providers. And once you think that way, the standard product advice follows: find something that people love, and obsess over those 10 or 100 or however many people.

Make sure that market could expand into a much larger market. Communicate that vision to your investors, and find really great investors who can back you up on it and introduce you to the people you want to interface with at the next step. Somebody was asking me recently, should I not care about my seed investors right now? Should I just take them because it's just money? For some of them, probably, yes, you can do that. But you want at least some of them, angels or seed funds or even corporates, to have a good network, so that they can introduce you to who's going to lead your A and vouch for you. I think it's really important for folks to think about that. You brought up this point: it's really easy to raise money in AI right now. Don't make the mistake of taking the easiest money when you could take money that's a little harder to get but comes with a really good network. So those are the pieces of advice I would give.

Sandhya Hegde

Yeah, I second all of that for our listeners. I think it's not just about taking the easy money. It's also about confusing investor interest with idea validation. That makes me very nervous: you raise money easily, and you assume it's because your idea is perfect, as opposed to investors just hoping you will do the hard work of iterating on it and figuring it out.

Scott Stephenson 

Yeah. Investors are betting on a market. They're betting on you and a market, and that market they're betting on is eight years out. And especially at the seed stage, they're betting on 50 companies for that. So it's a really good point that investor validation is not customer validation.

Even when we first raised our money based on the search that I talked about, embeddings and that kind of thing, investors were super pumped about it. Then we go to the market, and we just get punched in the face over and over, because everybody is asking us for speech-to-text. And we're like, okay, we should just build the simple thing that they need right now.

And then we'll turn that into something bigger later. It could take way longer than you think. But the important thing is that you get the market. You get the pull from the market.
