The Analytics Edge

Key Trends in Databases with Nikita Shamgunov, Founder and CEO at Neon

Episode Summary

This episode features an interview with Nikita Shamgunov, legendary founder of MemSQL (now SingleStore). His latest endeavor, Neon, offers serverless Postgres as a fully managed multi-cloud database that separates storage and compute, with auto scaling, branching, and bottomless storage. Nikita recounts the founding stories behind both MemSQL and Neon, and elaborates on the key trends driving database technologies today, from serverless and generative AI, to open data and the convergence of transactional and analytical workloads.

Episode Notes

This episode features an interview with Nikita Shamgunov, legendary founder of MemSQL (now SingleStore). His latest endeavor, Neon, offers serverless Postgres as a fully managed multi-cloud database that separates storage and compute, with auto scaling, branching, and bottomless storage.

Nikita is also a Partner at Khosla Ventures, where he incubated Neon, which has raised $104M to date. He is passionate about deep tech, data infrastructure, and systems software. Prior to Neon, Nikita co-founded MemSQL (now SingleStore), a unicorn data and analytics company valued at over $1.3 billion. He served as founding CTO and then CEO, successfully scaling the company to over $40 million in ARR. Before SingleStore, he worked as a senior engineer at Facebook and on SQL Server at Microsoft. Nikita earned a Ph.D. in computer science from the National Research University in St. Petersburg, Russia.

In this episode, Nikita recounts the founding stories behind both MemSQL and Neon, and elaborates on the key trends driving database technologies today, from serverless and generative AI, to open data and the convergence of transactional and analytical workloads.

-----------

Key Quotes

"Amplitude and Mixpanel, they basically are a time series database underneath with the UI. Time series data tends to be, you know, 'write once', most of it. And so, you need to take advantage of those techniques that data warehouses are basically born with, right? They are in the business of storing data relatively cheaply. And every enterprise, unless it's an archaic enterprise, should have a data warehouse. So it makes only too much sense to put this into a data warehouse rather than a custom database, you know, like a platform like Datadog, Mixpanel, Amplitude. Plus you have additional benefits from it because you can cross-reference that data with the rest of the business data." - Nikita Shamgunov

-----------

Episode Timestamps

(01:41) Founding stories behind MemSQL and Neon

(03:39) Addressing new challenges for databases

(09:20) Criteria for evaluating databases

(12:36) HTAP and zero ETL between transactional and analytical applications

(19:07) Evolving standards around table formats

(24:07) Thoughts on Generative AI and LLM-native approaches in the data warehouse

(26:38) Warehouse-centric approaches to data storage

(29:45) Open source for data warehouses

(33:54) Potential for new applications built around real time

(38:10) Managing large volumes of data

(40:59) Serverless Postgres is as easy as Stripe

(45:40) Takeaways

-----------

Links

Nikita Shamgunov's LinkedIn

Neon Website

Thomas Dong’s LinkedIn

Vijay Ganesan’s LinkedIn

NetSpring Website

Episode Transcription

Nikita Shamgunov Final Transcript

[00:00:00] Narrator: Hello and welcome to The Analytics Edge, sponsored by NetSpring. 

[00:00:08] Thomas Dong: The Analytics Edge is a podcast about real-world stories of innovation. We're here to explore how data-driven insights can help you make better business decisions. I'm your host, Thomas Dong, VP of Marketing at NetSpring. And for today's episode, my co-host is Vijay Ganesan, Co-founder and CEO at NetSpring. Thank you for joining me today, Vijay.

[00:00:25] Vijay Ganesan: Great to be here, Tom. Really looking forward to this podcast. 

[00:00:30] Thomas Dong: Today's topic is modern database architectures, and we're joined by Nikita Shamgunov, who in his career has founded not one, but two database companies: MemSQL, now SingleStore, and most recently, Neon, which offers true serverless Postgres. Neon is a fully managed multi-cloud database that separates storage and compute to offer auto scaling, branching, and bottomless storage. Nikita, we're delighted you're able to join us today. Welcome.

[00:00:53] Nikita Shamgunov: Happy to be here. 

[00:00:57] Vijay Ganesan: Welcome, Nikita. Great to have you on our podcast. You're a thought leader in this space, with amazing work at MemSQL, now SingleStore, and very interesting work you're doing at Neon. So really looking forward to the conversation today. 

[00:01:09] Nikita Shamgunov: Likewise, likewise. Thank you. Thank you guys for having me. 

[00:01:14] Thomas Dong: Well, according to Forrester's Noel Yuhanna, the current average for large enterprises with over a billion in revenue is 40 databases per DBA. Now, this database sprawl is obviously because different workloads require different types of databases. You've got relational databases, key-value stores, and databases for graph, time series, document, wide-column, in-memory, or search. So as a data leader, you're deciding between SQL, NoSQL, OLTP, OLAP, HTAP, on-premises versus cloud. So Nikita, with all these requirements, how did you come up with the ideas behind MemSQL, an analytical system, and Neon, a transactional system?

[00:01:52] Nikita Shamgunov: Yeah, I think those are very different. And I was also a very different individual when I started MemSQL versus when I started Neon. I think MemSQL was, at the start, very technology driven, thinking, you know, wouldn't it be really cool to build an in-memory system. And that's what MemSQL started with.

Now, SingleStore has evolved and supports full tiered storage: memory, disk, and object storage. So SingleStore separated storage and compute. Neon, on the other hand, is very market driven. There was a realization that there's an opportunity in the market to create the bar-none, best-in-class, cloud-native Postgres, with a special sauce behind it, which is the separation of storage and compute.

So Neon is an incredibly deliberate attempt to go and address the OLTP market, whereas MemSQL started with cool technology, a scalable in-memory database. Eventually, we found the market, and that market ended up being first real-time analytics, then analytics, and then just supporting mixed workloads in the enterprise.

SingleStore is a very enterprise-forward company, finding very important and high-value use cases in the enterprise, usually around scale, performance, and reliability, and addressing those. Neon, by contrast, is very much a developer-first, bottom-up Postgres company, and there's so much Postgres out there.

Now that we're onboarding one database a minute, we're realizing the gigantic scale of the overall Postgres impact and usage in the world. So those are two different things. 

[00:03:52] Thomas Dong: So many different types of databases have emerged to address, you know, evolving business and technical requirements.

What do you think are the next set of challenges that need to be addressed? 

[00:04:05] Nikita Shamgunov: Well, you're absolutely right. It is a mature space. And if you look at the past 10 years, I think the driver was mostly the cloud transition, and the technology enabler for the cloud transition is separation of storage and compute.

That's probably the largest piece that was missing for traditional databases. It allows them to be more elastic, and the cost equation comes out right when you separate storage and compute. As for the new set of challenges: I think for the analytical systems, the journey is for the most part complete.

Separation of storage and compute now exists with all major providers: Redshift separated storage and compute; BigQuery and Snowflake, obviously; SingleStore separated storage and compute. Then for the operational systems, it's not quite complete, right? The only prominent OLTP service that separates storage and compute is AWS Aurora.

I think SQL Server followed suit with SQL Server Hyperscale; I would call it number two. And then the rest of the world is catching up, with AlloyDB recently introduced by Google. So Neon may be slightly behind, but not that much behind. And so Neon barged into that space and separated storage and compute for OLTP.

And by the way, separating storage and compute for Snowflake and separating storage and compute for an OLTP system like Neon are architecturally different, because an OLTP system has to be transactional and low latency, whereas for an analytical system it's more about large volumes of data, scalability, and separation of workloads.

So I think we're not that far behind, even though separation of storage and compute had been done by others in the past. I think the other battle that is happening is serverless. And when I think about serverless, I think about the why. The why behind serverless is that you reduce the number of parameters you have to think about when you go and provision a database.

And that's very important, because you don't have one database anymore. You have a fleet. At the minimum, you have your production database, staging database, and dev database. Then you probably have as many dev databases as you have developers, or close to it.

And then you multiply it by the number of apps. And if you want to get really modern, you multiply it by the number of your CI/CD pipelines or PRs, pull requests, right? Ideally, each PR has its own dedicated database. So suddenly you have a fleet, and to manage that fleet you need to choose the size for each one, and you need to think about waking those things up and shutting them down.

And now all your dev, test, and production workflows need to really account for those things. That becomes really complicated. The moment you step into the world of fleets, you realize how important serverless is, because it just removes that manageability overhead. Also, we've seen the industry moving towards serverless consumption.

So today I can build an app without logging into AWS. I can put my code into GitHub, I can put the front-end part of the application into a front-end cloud like Vercel, I can put my back end onto something like Fly.io or Render, and I can put my data into Neon. And everything's serverless, and everything is consumption based.

And everything is built for developers and integrates with each other. So, I think we're moving more into this world we call developer cloud. And serverless is kind of a prerequisite. Because developers are not your SREs or DevOps people. They don't live in the AWS world. They live in the code world. So, it needs to be compatible with the modern developer workflow.

Which is CI/CD pipelines. I don't want to think about scalability; it should be given to me, and those databases scale up and down, all the way to zero. Now that we have fleets, I don't want to sweat over the fact that I'm blowing through my budget. Finally, I think there is a massive, massive wave coming at us with generative AI.

And the way to think about this one is: what are the job titles that might not exist in the new world at all? What job titles will emerge in the new world of generative AI, and who can become 10 times more productive? I think the combination of AI and the developer cloud movement, which is the combination of Vercel, Replit, and Neon, will decrease the number of SREs and DevOps engineers; in the database world, it will decrease the number of DBAs. And it will increase the number of developers, both physical developers, like humans, and potentially bots that will be building new applications, basically just sharing prompts with each other.

So I think that's where we're heading. Serverless is one, and plugging into the AI wave is another. And I think the implications are going to be different for OLTP and OLAP, as usual. I have some thoughts on those subjects.
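To make the fleet idea concrete, here is a minimal sketch of the per-pull-request workflow Nikita describes, written in TypeScript against a hypothetical REST API for database branches; the endpoint, field names, and project ID are illustrative, not any particular vendor's actual API:

    // Sketch: one ephemeral database per pull request, called from CI.
    const API = "https://api.example-db-provider.com/v2";
    const headers = {
      Authorization: `Bearer ${process.env.DB_API_KEY}`,
      "Content-Type": "application/json",
    };

    // When a PR opens: branch the production database so the PR gets
    // its own isolated copy of schema and data.
    async function createPreviewDatabase(prNumber: number): Promise<string> {
      const res = await fetch(`${API}/projects/my-project/branches`, {
        method: "POST",
        headers,
        body: JSON.stringify({ branch: { name: `pr-${prNumber}` } }),
      });
      const { connection_uri } = await res.json();
      return connection_uri; // hand this URL to the preview deployment
    }

    // When the PR merges or closes: tear the branch down again.
    async function deletePreviewDatabase(prNumber: number): Promise<void> {
      await fetch(`${API}/projects/my-project/branches/pr-${prNumber}`, {
        method: "DELETE",
        headers,
      });
    }

With serverless scale-to-zero, idle preview branches cost next to nothing between CI runs, which is what makes a fleet like this manageable at all.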

[00:10:02] Thomas Dong: talked about kind of these evolving, you know, developer workflow, um, where we are today and where things may be going.

What criteria do you think data leaders should be evaluating a database on today? We talked about cost, multi-cloud, performance, and security always being concerns when deciding on a database. What are your thoughts on how that's evolving?

[00:10:20] Nikita Shamgunov: So previously you made a point that larger enterprises have 40-plus databases.

And that's kind of the premise for SingleStore as well: just put everything in one. Well, I think the reality is more sophisticated than that. I think for a while we'll still live in the world of OLTP and OLAP, right? And for OLAP right now, the leader is Snowflake, and the leading OLTP technology, not company, is Postgres.

So, if you take it to the extreme, I think modern applications can be built on just those two, right? You can use Postgres for OLTP and Snowflake for OLAP. Obviously we'll integrate Neon with Snowflake; that's in the works. Is it going to be one system, like SingleStore? Maybe, but that will be another five to ten years, right?

Because there's a technology part to it, and there's a market part to it. And then there's a developer ecosystem part to it, and a developer preferences part to it. So if tomorrow, for example, Snowflake introduces an OLTP system, or evolves their Unistore into a fully low-latency, transactional OLTP, well, they still need to win all the developers who start their applications in Postgres.

And if tomorrow Postgres builds out perfect column stores, on the same level as, I don't know, DuckDB or something like that, well, the ecosystem for analytical workloads has already been conquered by Snowflake, and for Postgres it will be a while before everything gets built out.

Now, I think the trend lines are pointing in that direction. We certainly have Mongo; it's not going anywhere. So document databases have a share of operational workloads, and we have all these other databases. We have Oracle, we have SQL Server, we have MySQL, right? The trend lines are pointing to Postgres, but database wars are battles fought underwater, in slow motion.

So, I think a certain amount of consolidation is happening. And if it is happening, then Postgres is a very natural place for OLTP workloads to consolidate. One reason is the very favorable license. Another reason is that it's a well-built system that developers love, and that love I wouldn't underestimate.

And then there's the plugin ecosystem. So Postgres is a platform now, not just a database. I think that's where the puck is going, and what it means for Neon is that we want to skate to where the puck is going: the puck is going into the cloud, the puck is going into serverless, and the OLTP puck is going into Postgres.

And we're seeing the AI disruption, which will put pressure on modern database features and modern database workloads, such as vector workloads. So we're going and investing in all of those things to basically arrive where the puck is going. 

[00:13:35] Vijay Ganesan: You talked about this hybrid transactional/analytical processing, HTAP, concept.

It's a concept that's been around for a long time, with SAP HANA and Oracle, and SingleStore right from the beginning was an HTAP-type system. And, as you mentioned, Snowflake's Unistore now is essentially a similar concept: converging the two, this idea that you build transactional applications and do analytical processing on the same system. I just want to double-click on some things you said around that.

There's the technology aspect, but there's also the ecosystem aspect, right? You talked about developer love; you have to win over developers who are very used to, very committed to, Postgres, for example. So maybe double-click a little bit on that. And the thing you were saying about five to ten years before these things come together; it's very interesting to hear your further thoughts on that.

[00:14:28] Nikita Shamgunov: For sure. First of all, I have a lot of learnings, both scars and wins, around HTAP. Very early on, MemSQL found its product-market fit in the HTAP market. And then we realized that HTAP can be either big T and small A, or big A and small T. Where MemSQL arrived and became SingleStore is big A and small T.

But that T is real, right? SingleStore is a transactional system. I think the reality is that the dollars spent are distributed across the spectrum of databases. A lot of dollars are spent in pure OLTP, and a lot of dollars are spent in pure analytics. And then a certain number of dollars is spent on HTAP. Those workloads are more expensive because nothing else works, but there are fewer of them. There are definitely fewer of them. And how real-time is 'real time' matters. Say you're on the analytical side of things, where you have mostly analytical workloads and you've consumed all the data in the enterprise.

And certain workloads just need to be more real-time than others; you need to know certain things sooner than other things, like daily reports or something. And then you start moving from real time toward perfect real time. SingleStore, by the way, is perfect real time: the moment you insert a record, it's instantly available, and it's transactional.

It's in low-latency storage. You never go through S3 first and then bring it into low-latency storage; you insert it and it's instantly available. Those workloads are really good from the dollar standpoint; the price per workload is very high. There are just not that many of them. And then if you stay in the analytical space and slowly relax the real-time requirements: can I be down to a second, 10 seconds, one minute, one hour?

You can call all of those things real time; there's just an asterisk to it, of how real-time is real time. And the reality is that there's a dollar distribution across the degrees of real time, of how much money is allocated to those workloads in the market. Now, on the transactional side, it's the opposite.

The primary workload is OLTP, but then you need to do some reporting. Think about, I don't know, Salesforce and Salesforce reports. Today those things are separate; two different systems power Salesforce and Salesforce reports: you log in to Salesforce and you see your pipeline and whatever.

And wouldn't it be nice if it was just one system, because fundamentally it's one set of data. People have been solving for those workloads for a while as well: your core system is transactional, but certain use cases are such that you want to run reports on the same data. With SQL Server, I think 10 to 15 percent of all SQL Server deployments in the enterprise use columnstores.

Which is an indication that HTAP matters. We have very well-funded companies like PingCAP, which is behind TiDB, that also position themselves as an HTAP system with a big T: we're a transactional system, we're going to run your transactions at scale. But then you have a dashboard section, a reporting section, in your app.

And you don't want to move data out, so there's a market for that as well. I know, for example, that ServiceNow built a super custom solution; they acquired a company to satisfy this use case. And of course, another example is SAP HANA. I find those custom solutions work well when the company investing in them controls the app.

If you're SAP and you have SAP, then SAP HANA kind of makes sense. Otherwise you're spending way too much of your people's time maintaining it, and paying way too much money to Oracle or SQL Server, whichever system it sits on top of. From that standpoint, you custom-build the database for the application, not the other way around.

And otherwise, again, the dollar distribution for HTAP is not very favorable compared to core OLTP. 

[00:19:06] Vijay Ganesan: Great. That's great perspective. Let's talk a little bit about vendor lock-in with data for enterprises. One of the biggest concerns for enterprises is: my data gets locked into a vendor's store, right? Whether it's Snowflake or BigQuery, the fact is that it's in some proprietary format, and you can only use those tools to query it. And there is an interesting trend with standards around table formats. It's been around for a while, but it seems to be gaining more momentum with Iceberg and Delta Lake and Hudi and things like that, where the promise for enterprises is to land data in a format that is standardized, and then any tool can compute on it, right?

I could use SingleStore, I could use Snowflake, I could use BigQuery, whatever, but I control the data. I don't get locked in. It's in a standard format that any tool can consume. How do you see this evolving? Is this an interesting trend that's going to change the way we think about data warehousing?

[00:20:05] Nikita Shamgunov: Well, first of all, I'm a huge fan of open formats. With regards to Iceberg, I think it's a great idea, a great format, and we should do Iceberg. But I think the magnitude of the war is smaller than Parquet versus ORC in the past, where Parquet won over time. At the time Parquet was introduced, the need was so much stronger, right?

The difference between Parquet and CSV was humongous: much better compression, a much better query processor on top of compressed columnstore data. The incremental value of Iceberg over Parquet is relatively small in comparison, right? So while it's a very good idea, and the transition should happen over time, in my opinion it's an overall goodness.

It's not going to grab headlines, right? The transition will happen over the years. A lot more interesting, I think, is what's going on in this world of proprietary versus open data: proprietary data is bad, open data is good. For a user, I think that's just obvious, right? I don't want to be locked in.

I think I can have similar performance with a system with open data compared to a system with proprietary data. I think Databricks is proving that to the world with their investments in Photon and all these query processing engines. I think they're still behind Snowflake and BigQuery in places.

But I don't see a real reason why this wouldn't converge over time, performance-wise. So I think we'll arrive at that future, where the data is open. Vendors don't want the data to be open, and it's easier to be a vendor with closed data than with open data, because of all the statistics and maintenance and things like that; the technology gets simpler.

And you can provide a very, very nice, iPhone-like experience to your users if you control everything end to end. But of course, a good number of users want open data. There's another thing coming that I think is very, very powerful: I'm not sure that hoarding data is going to be as valuable in the future, because one of the reasons to hoard data, to just store very, very large volumes of it, has traditionally been to use it for training machine learning models for various predictions.

You define very large transformations to clean that data and eventually feed it into your machine learning models. And that's why you want to have the data not in a closed data warehousing format like Snowflake's, but in an open format like Parquet: because you feed it to your machine learning models and derive some sort of predictions from it.

But now we live in the world of large language models. They come pre-trained, and they're smart already. A lot of those training pipelines become inferior to a large language model that is mildly fine-tuned, or what people call zero-shot or one-shot, where you just provide a handful of examples and the large language model performs just as well as a custom-trained model.

And then you can rm -rf a gigantic amount of data that sits in your data warehouse. I think that's kind of huge. The other thing I'm observing is that for the data you do store in the data warehouse, the access method is no longer crunching a report so you can extract a clean portion to feed into your machine learning models. It's more about quick access to various portions of that data, so you can pull relevant data and put it in the context window of a large language model. So I think retrieval systems are going to benefit tremendously. And I'm looking at Snowflake acquiring Neeva, which is a retrieval system, and thinking, well, that's probably why.

[00:24:25] Vijay Ganesan: Interesting, what you said about rm -rf'ing a bunch of data in your data warehouse. That's a very fascinating perspective in the context of the hottest topic in tech right now, generative AI and LLMs. Let's double-click on that a little bit. If I understand what you're saying, the need for bringing data into the data warehouse and then doing some processing on top of it may be diminished, because if you have good retrieval systems, and all you need is some context to pass to a prompt, you don't really need to hold that data in a central place, right?

And that just changes the way we even start thinking about data warehouses.

[00:25:03] Nikita Shamgunov: Correct. Yeah. So first of all, reporting is not going anywhere. As the CEO of Neon, I'm looking at reports and dashboards and KPIs and business metrics every day, so reporting is here to stay. But the amount of data that participates in that reporting is relatively small for us.

You know, Neon is obviously a startup and a relatively small company; it hasn't accumulated a very large data footprint. But the company is going to grow, the reporting needs are going to grow, and we'll probably 10x our spend on Snowflake; we're not at millions spent on Snowflake today. What I think we might not be doing, though, is storing all our telemetry forever, storing our logs forever.

And that need is probably going to go down, because, again, we are not going to be training models ourselves. What we are going to do, though, is what I described earlier: we'll augment our reporting and business decision-making using large language models. And I do want to have a retrieval system over all the data that we have.

Not necessarily historical data, but all our present data, so those large language models are aware of the context of the present data that is relevant to Neon and the business. But I'm not in the business of training that model from scratch on a purely historical basis. The model comes pre-trained and smart.

[00:26:41] Vijay Ganesan: Interesting. So it's not that reporting is going to go away, or that data warehouses are going to go away. There's a class of data that you can retrieve through these retrieval systems and feed into prompts, and you can do that very easily without having to bring it into some intermediate store and so on.

A sort of related question, around warehouse-centric approaches to building products. We're doing a lot of work with product telemetry and product instrumentation, clickstreams, things like that. That data historically never came into data warehouses; in the Teradata era, you didn't bring that data into Teradata, right?

But with cloud data warehouses, it's feasible to bring petabyte-scale data in and store it cheaply and securely. You pay only if and when you use it, and it's very elastic and cost efficient and so on. So there's this notion of warehouse-first, warehouse-centric approaches, where the warehouse becomes the central hub where data resides.

What's your point of view on that?

[00:27:38] Nikita Shamgunov: Well, I'm a huge fan. I think that makes a ton of sense. Traditionally, when you think about the product analytics space, Amplitude and Mixpanel are basically a time series database underneath, with a UI on top, right?

And as a database technologist, I know for a fact that you benefit tremendously if you compress that data, and if you store it in the right medium, as in memory, disk, or object store, depending on how frequently you need to access it. And time series data tends to be 'write once', most of it.

So you need to take advantage of the techniques that data warehouses are basically born with, right? They are in the business of storing data relatively cheaply; otherwise the whole thing doesn't work. And every enterprise, unless it's an archaic enterprise, should have a data warehouse.

So it makes only too much sense to put this into a data warehouse rather than a custom database in a platform like Datadog, Mixpanel, or Amplitude. It kind of belongs there. Plus you have additional benefits, because you can cross-reference that data with the rest of the business data.

And the cost is right. The value capture for all the time series databases is typically in the platform above anyway. So putting this in the data warehouse does not threaten you as a product analytics company; you're competing with other product analytics companies.

Time series databases never became a huge category, unfortunately; that's just the reality of the market. So as a business strategy for NetSpring, I'm fully on board. Because now, if you think about Neon, where will we want to put all our product analytics data?

It's not even a question: it has to be a data warehouse. If it's not a data warehouse, I'm not doing this. The other bit is: well, is it performant enough? Is it real-time enough? All these HTAP conversations that we have. And I think the answer to that is that the user wants it in the data warehouse, full stop.

So either the data warehouse technology catches up, or the product analytics vendor puts a shim in between and makes it real-time enough for the use case. 
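To illustrate the cross-referencing benefit Nikita describes, here is a hypothetical warehouse query, shown as a TypeScript constant, joining write-once clickstream events against an ordinary business table; every table and column name is made up:

    // Sketch: product events and business data side by side in one warehouse.
    // "events" is the cheap, compressed, write-once clickstream table;
    // "accounts" is regular business data. One join cross-references them.
    const weeklyActiveByPlan = `
      SELECT a.plan,
             COUNT(DISTINCT e.user_id) AS weekly_active_users
      FROM events e
      JOIN accounts a ON a.user_id = e.user_id
      WHERE e.event_name = 'feature_used'
        AND e.occurred_at > CURRENT_TIMESTAMP - INTERVAL '7 days'
      GROUP BY a.plan;
    `;

A standalone product analytics store can answer "how many users did X"; only the warehouse can also answer "how many paying users on which plan did X" without moving data around.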

[00:30:37] Vijay Ganesan: Let's talk a little bit about open source. You know, obviously you're working on Postgres, open source OLTP, and in the OLTP world there are very good open source solutions. But isn't it odd that we don't have a commercial-grade, robust open source data warehouse?

[00:30:54] Nikita Shamgunov: It is odd, right? There's ClickHouse, there's DuckDB. I think those are the two horses, and ClickHouse is the one I would bet on. Yeah, it's hard. 

[00:31:04] Vijay Ganesan: Why do you think that is? Or will there be one soon? Doesn't the world deserve a good open source data warehouse?

[00:31:10] Nikita Shamgunov: Well, first of all, it's a lot of work. We both know that, right?

It's a combination, for the cloud specifically, of separating storage and compute, as we discussed, and then database fundamentals: query processor, query optimizer, robustness, workload management. That's a lot of work. There are people who are willing to put in the work, right? The DuckDB people are willing to put in the work; the ClickHouse people are willing to put in the work.

And then, staying in the technology lane rather than the business lane, there's one more complexity when you build open source: you presume two types of deployments. One is self-serve, and the other one is cloud.

And it's kind of hard for a company to sit in both chairs. You either build a cloud company, where you have SREs, observability, the right cloud architecture; you're running a service, you're a service company. Or you are a packaged software company. So you're constantly battling that tension of: okay, I built the feature, but now I need to package this feature and make sure somebody out there in the world can roll it out themselves.

Again, that's tricky. From the business standpoint, I think with data warehouses, data people don't care as much about open source, because they're not developers. Developers do care about open source, for a variety of reasons. One is flexibility. Two is just the proliferation of the thing: I want to run it on my laptop.

I want to run it in the lab. I don't want to think about licenses or anything like that; I just want to consume. Open source creates that ubiquity of consumption. From that standpoint, I think it's a lot easier to be DuckDB than ClickHouse, because single node versus cluster is just easier to deploy.

Yeah, I think in OLTP, the requirement to be open source is just stronger than in analytics. And because an analytical system is a distributed system, there's a big manageability piece that comes with it. Running a distributed system, no matter how well it's built, is hard, and there are going to be quirks. That's why we either have fully proprietary systems like Snowflake, or half-proprietary systems like Databricks, where Spark is open source, but Databricks is not just Spark as a service; it's a lot more.

That's the tension. Are we going to have an open source data warehouse? Again, ClickHouse and DuckDB are on track. But are we going to have a dominant technology in that space? I don't know, because the market does not scream and demand an open source one. The market seems to be okay with a proprietary one.

[00:34:18] Thomas Dong: Shall we switch gears a little bit here for our final segments? We love stories, and you've been a veteran in the database industry for many, many years now. You talked about reporting, obviously, not going anywhere. One area that really interests us is how the pace of business continues to accelerate, and real-time analytics.

We know that was a passion of yours. Moving from near real time to pure real time, what are your thoughts on the potential for new applications that can be built around real time?

[00:34:55] Nikita Shamgunov: Well, I have to talk about AI in this one, right? Because that's where the conversation is going.

We see an emergence of vector databases right now. Pinecone was just funded at a crazy valuation, and it looks like there's another vector database showing up every five seconds. No matter what we do, there are a few things that need to be solved, and this AI world at least holds a promise of how they are going to be solved.

My own observation from standing up a data team is that Snowflake is a great product, and it solves querying Snowflake very well. If I need to go and write a SQL statement, I trust this thing will run the SQL statement well. I think they solved that very well. But all the rest of it, before I get to a place where my data is clean in the data warehouse and I can run SQL statements, is the hard part.

That path starts with: I have 20 different systems already at Neon, and they all put data into Snowflake. So I need a data team; there's no way to bypass having a data team, even in a relatively small company. So I have one, and the person who runs that team guards the knowledge of how everything is organized, what the data model is, what the semantic model is, how to query this, and what is fresh and what is not fresh.

I think that problem is best addressed with AI. We're going to have the data warehouse as a calculator, but the humans driving the calculator should be empowered by AI: helping you with that semantic model, helping you with the ETL, with the stuff that nobody actually wants to deal with. And then we have a job title called data engineer that lives in the world of ETL and ELT, right?

They live in the world of Fivetran or Airbyte to move data and bring data in, and then they use dbt for data transformation. That's just not glamorous; that's a lot of work. And I think AI is in a great position to help with that part. The other thing is what I talked about: retrieval.

The vector databases are there for a reason. They allow you to retrieve relevant information that you feed into large language models. Large language models know everything, and you can bring their attention to something by prompting, but sometimes they need data that they are not aware of.

So that they can reason over that data, fast retrieval with good recall over that data is provided by a vector database. That's a new workload that every enterprise is thinking about right now: how do I onboard all those LLMs, how do I make them useful, and how do I make sure they know about my data?

So how do we connect those things? We either go data warehouse centric, where we still move everything into the data warehouse, and maybe we move it into the data warehouse using those LLMs too. And then: how do I make it useful and instantly available for my LLMs? Do I tune them? Do I provide context? Do I have some sort of smart retrieval system?

Do I have a retrieval system that connects to and lives in the same embedding space as my LLMs? I think that's where a ton of interesting stuff is happening right now in the data world. We'll definitely participate in that at Neon. We're already supporting pgvector, which is a vector extension for Postgres.

We have some ideas of how to evolve this further, and we'll be building lots of partnerships, between Neon and Pinecone potentially, or Neon and other vector databases. 
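For readers who haven't seen it, here is a minimal sketch of the pgvector workload Nikita mentions, using the Node.js "pg" client; the table, the toy three-dimensional vectors, and the connection string are illustrative, while the SQL follows pgvector's documented surface:

    // Sketch: similarity search in Postgres via the pgvector extension.
    import { Client } from "pg";

    async function main() {
      const client = new Client({ connectionString: process.env.DATABASE_URL });
      await client.connect();

      await client.query(`CREATE EXTENSION IF NOT EXISTS vector`);
      await client.query(`
        CREATE TABLE IF NOT EXISTS documents (
          id bigserial PRIMARY KEY,
          body text,
          embedding vector(3)  -- real embeddings use hundreds of dimensions
        )`);

      await client.query(
        `INSERT INTO documents (body, embedding) VALUES ($1, $2)`,
        ["hello world", "[0.1, 0.2, 0.3]"]
      );

      // k-nearest-neighbor search; "<->" is pgvector's distance operator.
      const { rows } = await client.query(
        `SELECT body FROM documents ORDER BY embedding <-> $1 LIMIT 5`,
        ["[0.1, 0.2, 0.25]"]
      );
      console.log(rows);

      await client.end();
    }

    main().catch(console.error);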

[00:39:02] Vijay Ganesan: Nikita, great point about vector databases and their increasing relevance in the AI world we live in. One question for you on context. Vector databases exist primarily to pull context that is relevant to the particular question you're asking of the language model, right, to get a more accurate answer. But oftentimes in data, that context tends to be very large, right? It's not like I can take all the data in my data warehouse and give it to the LLM to answer an analytical business question. How do you manage context? Vector databases are an approach, but how does that scale when you're talking about large volumes of data?

[00:39:45] Nikita Shamgunov: Well, the approach that I know works is the following. First, we've seen the size of that context window continuously increasing, and hopefully one day, and there's some research about it, with certain algorithms going from N squared to N log N, models will support much larger contexts. But they're still not unlimited, and you're certainly not going to populate that context window with 10 terabytes on a single request. So you populate it once and then continue; that's one possibility.

And that's where retrieval comes in. You get a prompt, you turn it into an embedding. In fact, what I'm learning lately is that the larger companies iterate over the types of embedding algorithms for better recall. You go and query your search system, saying, hey, give me a bunch of stuff, and it uses approximate nearest neighbor algorithms for that. Recall is the percentage of relevant stuff among all the stuff you need to retrieve.

And because it's retrieval, and because retrieval can be built to be very fast, as in Google-search fast: based on the prompt and on what you need to find out about your data, using that vector embedding approach, you can get a subset of that data very quickly, as in milliseconds, put it in the context window, and then interact with your LLM.

Are we going to continue doing that, or will the LLM and the search index maybe merge? I know there are rumors that OpenAI might be working on their own search index, but OpenAI is not just one model; there are so many models out there. So I think that approach of quickly populating the context window with relevant information seems quite durable.

So I see a lot of companies doing that. 
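Pulling the pieces of that answer together, here is a sketch of the retrieval loop in TypeScript; embed(), vectorSearch(), and complete() are hypothetical stand-ins for whatever embedding model, vector store, and LLM API are actually in use:

    // Sketch: populate the context window with retrieved data, then ask.
    declare function embed(text: string): Promise<number[]>;
    declare function vectorSearch(v: number[], k: number): Promise<string[]>;
    declare function complete(prompt: string): Promise<string>;

    async function answer(question: string): Promise<string> {
      // 1. Turn the prompt into an embedding.
      const queryVector = await embed(question);

      // 2. Approximate-nearest-neighbor retrieval: fast, with good recall.
      const passages = await vectorSearch(queryVector, 5);

      // 3. Put only the relevant slice of data into the context window.
      const prompt = [
        "Answer the question using the context below.",
        ...passages.map((p, i) => `Context ${i + 1}: ${p}`),
        `Question: ${question}`,
      ].join("\n\n");

      // 4. The pre-trained model reasons over data it was never trained on.
      return complete(prompt);
    }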

[00:42:01] Vijay Ganesan: A question on Neon and serverless Postgres. You said in an interview recently that serverless Postgres is as easy as Stripe, which is very thought-provoking; we don't think of databases that way, right? For payments in your app, you just put in a line of JavaScript and off you go. You're talking about an OLTP database along those same lines, which is very, very fascinating. I'd love to hear a little bit more about that. 

[00:42:30] Nikita Shamgunov: Definitely. If you go on Amazon today and you say, I want a database, then after you've gotten to the dashboard and are making an API call, just the provisioning is going to take on the order of minutes to tens of minutes.

If you go to Neon right now, you push a button, and three seconds later you have a database. This is just the start, and that database is just a URL. It scales with you: it scales up, scales down, scales to zero. You don't need to do anything with it. But you're also right that that's just provisioning.

That's not tuning your database, that's not creating indexes, that's not twisting the knobs that increase or decrease the performance of your workload compared to all the other workloads. But I think that's where AI and autotuning come in. There are separate companies here; Andy Pavlo, a famous database professor, started a company, OtterTune, which I think is a good idea.

We'll bring some of those ideas into Neon and the autotuning world. But I don't think anything fundamentally stops every database from being just a URL in the cloud. It's technology, it's execution, but that's what the world wants. And if you imagine we had such a lightweight approach to having databases, then you can start asking yourself questions.

What if every GitHub repo came with a database? What if we had a Postgres database where it doesn't cost you anything to create one of those URLs? What if we started to share those URLs between teammates? What if we started to share those URLs with people in the world, to make it trivial to collaborate on applications with people you might not even know?

What if it was trivial to share data, trivial to share real-time data? That's what I mean by as simple as Stripe. There are no knobs. You push a button, you get a database URL, and you can put data in. This data never disappears; it's yours forever. You can share that data, and it doesn't cost you much.
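A sketch of what "the database is just a URL" looks like from the application side, using the standard "pg" client; the only piece of configuration anywhere is a connection string in an environment variable:

    // Sketch: no instance sizes, no knobs; the URL is the whole setup.
    import { Client } from "pg";

    async function main() {
      const client = new Client({ connectionString: process.env.DATABASE_URL });
      await client.connect();
      const { rows } = await client.query("SELECT now() AS connected_at");
      console.log(rows[0]);
      await client.end();
    }

    main().catch(console.error);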

[00:44:54] Vijay Ganesan: That's fascinating. So the business value is developer productivity, plus it opens up opportunities for a new class of use cases, and cost. 

[00:45:03] Nikita Shamgunov: It's cost, and it's the speed at which your organization can move. You know, Vinod Khosla said that enterprise spend looks like Pac-Man, right? Like the old game: a sphere with a mouth.

The slice that corresponds to the mouth of Pac-Man is what you spend on tools, and the rest is what you spend on people. So if you take the cost of something to zero, and here we're taking the cost of provisioning and managing to zero, it frees things up. Maybe you're spending slightly more on the tool, not because the tool is more expensive, but because you're using it a lot more, because the people in the other part of the pie are more productive.

That, I think, is the biggest opportunity. And within that people spend, there are people responsible for provisioning in the enterprise and people responsible for tuning in the enterprise. Now the tool comes with self-serve capabilities and autotuning capabilities. I think there's a pretty massive opportunity here.

[00:46:13] Thomas Dong: This new world that you're building and that you're dreaming of, that's a great analogy to leave our audience with today. Nikita, Vijay, any final thoughts here? 

[00:46:23] Vijay Ganesan: No, this has been fantastic. We're out of time, but we'd love to continue talking for hours. This is fascinating. Nikita, thank you so much. 

[00:46:31] Nikita Shamgunov: Really enjoyed it. Anytime, Vijay. Anytime, Thomas. Thank you so much for having me. And good luck to both of us building our companies. 

[00:46:40] Thomas Dong: Great, thank you. Yeah, thank you very much. 

[00:46:44] Vijay Ganesan: Alright, yes, a couple of takeaways for me, Thomas. One, what he said about vector databases, their increasing relevance in the AI world, and their association with retrieval systems.

I think that's an interesting area that is fairly new for a lot of people, so I thought that was very, very interesting. The other thing he said, about rm -rf'ing a whole bunch of data in the data warehouse, and really rethinking data warehouses as the central places of data in the context of LLMs, I thought was very interesting.

It's a radically different way of thinking; it breaks all the conventional wisdom about data warehouses. So we may be thinking about data warehouses very, very differently a few years down the line. And on that generative AI and LLM topic, there's what he said about the data warehouse being like a calculator, right?

There's a lot of stuff that happens before you input something into the calculator. And this is the whole data engineering area of data cleansing, ETL, ELT, dbt, where he sees a lot of potential for AI making those jobs much easier. So I do agree that the data engineering world is going to see a massive impact from generative AI and LLMs.

[00:48:10] Thomas Dong: Yeah, absolutely. He's definitely building this next generation of databases. It's fantastic to hear his thoughts, especially around AI and autotuning, kind of automating out the hard parts of data modeling and that semantic layer. I think that's really going to move the needle when it comes to self-service for databases.

Every application needs a database. And besides those great insights that you culled from the conversation, for me it's about all the actual insights he was able to share here. The one that caught my mind, since I know many database vendors have been struggling to tackle this OLTP, OLAP, HTAP world, was when he said it's like five to ten years out, right? We're still going to be choosing OLTP for our primarily transactional workloads, and OLAP for our primarily analytical workloads. It's still an area rife with research, but I think for data leaders that's a good, important takeaway: don't stress about it too much; it'll come, but that's five to ten years out.

[00:49:24] Vijay Ganesan: Yeah, and adding to what you said about self-service and building applications and databases, this idea that using an OLTP database is as easy as Stripe, that's just fascinating, right? And it's going to change the world of application development, I think.

[00:49:40] Thomas Dong: Absolutely. Well, that concludes today's show. Thank you for joining us, and feel free to reach out to Vijay or me on LinkedIn or Twitter with any questions or suggested topics for the future. Until next time, goodbye.