This episode of The Analytics Edge (sponsored by NetSpring), features Sanjay Agrawal, Co-Founder and CEO of Revefi. Revefi's data operations cloud offers a zero-touch data quality, spend, usage, and performance co-pilot for monitoring and optimizing cloud data warehouses. With Revefi, one customer reduced warehouse spend by 30% and their data team saw zero escalations from the business for data quality related issues, despite data adoption increasing 35%. Throughout the conversation Sanjay explores the continuing challenges in managing data quality, the emergence of zero-touch observability enabled by AI, and the need to control data warehouse costs despite the anticipated cost reductions with the cloud.
Throughout the episode, Sanjay discusses the rapidly evolving field of data observability. He delves into the challenges and costs of data quality, emphasizing the importance of the right data at the right time and cost. Sanjay explores the concept of zero-touch data observability, likening it to level 4 automation in autonomous vehicles, and touches on the role of AI and ML in this context. The conversation also turns to a new dilemma: even though the cloud was supposed to reduce costs, businesses now find themselves seeking innovative ways to control spending within their cloud data warehouses.
Bio:
Sanjay Agrawal is a two-time founder, having co-founded both Revefi and ThoughtSpot. Sanjay has spent over two decades building foundational database technologies, SQL optimizers, and performance automation for entire warehouses. His latest endeavor, Revefi, offers a zero-touch, 360-degree data observability and monitoring solution for cloud data warehouses. At ThoughtSpot, he was instrumental in building a self-managing, distributed, in-memory, ACID-compliant data warehouse capable of operating at 100 nanoseconds per input table.
“Cloud data warehouses like Snowflake, Redshift, BigQuery, Databricks, and Azure have become the de facto place where businesses pull data out and use it for a business purpose. So the more compute you push on the cloud data warehouse, the closer it stays to the ecosystem and the easier it is for anyone to even consume such a system.” - Sanjay Agrawal
(Segment 1) Challenges
(1:25) Motivations as a two-time founder
(2:37) Defining data observability
(5:32) Quantifying impact of poor data quality
(8:47) Understanding the problem of bad data
(13:08) Organizational responsibilities for data quality
(15:30) Data quality and/or analytics
(Segment 2) Solutions
(18:17) Challenges to zero-touch data observability
(21:15) Data observability in centralized warehouses
(23:52) Managing cloud data warehouse costs
(29:07) Leveraging AI/ML for data quality
(32:06) Building a non-invasive observability platform
(Segment 3) Business Opportunities
(34:39) Product vision for data observability
(Segment 4) (37:56) Takeaways
Announcer: [00:00:00] Hello, and welcome to the Analytics Edge, sponsored by NetSpring.
Tom: The Analytics Edge is a podcast about real-world stories of innovation. We're here to explore how data-driven insights can help you make better business decisions. I'm your host, Thomas Dong, VP of Marketing at NetSpring. And for today's episode, my co-host is Vijay Ganesan, co-founder and CEO at NetSpring.
Thanks for joining me on the show today, Vijay.
Vijay: Thank you, Tom. Great to be here. I'm really looking forward to this conversation with Sanjay.
Tom: That's right. Today's topic is data observability, and our guest is Sanjay Agrawal, Co-Founder and CEO at Revefi. Revefi offers a zero-touch, 360-degree data observability and monitoring solution for cloud data warehouses.
And with built-in data load monitoring, their customers are able to reduce their warehouse spend by at least 20%. Sanjay, we're delighted to have you with us today.
Sanjay: Thank you, Thomas. Thank you, Vijay. Great to be part of this conversation.
Vijay: I've seen you in action building ThoughtSpot's database and other cool things that we did, and I've always enjoyed the perspectives you bring to data and analytics. I'm sure our audience today will enjoy hearing your thoughts.
Tom: Yes. And speaking of ThoughtSpot, Revefi is the second company you've now started. ThoughtSpot, of course, is now a $4.2 billion unicorn in the BI space. Our listeners would love to hear your career journey and what motivated you here as a two-time founder.
Sanjay: I think the best way to answer this is, you know, what really makes an entrepreneur, right? The DNA itself. So, well, I'm Sanjay. I'm the co-founder and CEO of Revefi. We started this company in the data observability plus-plus space, and we'll talk about that later; it goes way beyond that. I think we were extremely fortunate when we started ThoughtSpot: everyone who came in as part of the co-founding team had an amazing [00:02:00] entrepreneurial DNA.
And it's actually no surprise at all that all of them, with no exceptions, have gone on to start new companies. So to me, that's really the ideal persona: a person who comes in with a strong entrepreneurial DNA will always be a builder at heart. And so at ThoughtSpot, after nine years, we asked a simple question: in the data stack, there are a lot more things to do.
What else can we do? It's time to build again, and that's where we started our journey again. So this is why I'm here now at Revefi.
Tom: And so you mentioned data observability plus-plus. So let's actually start out with a traditional or classic definition of data observability: what it is, and why should data leaders care?
Sanjay: Yeah, so for this one, I would walk through what we saw. I have spent my entire career in the data space. Even my graduate work was in databases, back when it was not considered, like, really hot. At that time, people were saying networking and compilers [00:03:00] were the good, hard things.
But I really had a great time there. And then, after that, when I was at Microsoft Research, again, I was very fortunate to be part of an amazing database group there, among the people who have actually given this whole area the shape we see today. Now, one thing we saw during our journey, whether we were building different technologies, doing deep SQL query optimizations, or building analytics at ThoughtSpot, was that while everyone really strives to be a data-driven organization, there are a lot of obstacles and friction points along the way.
As a very common example, let's say I'm in a CXO meeting. I see a number and say, I'm going to use this to make some strategic business decisions. If the number matches what's in my mind, it's there, no questions asked, nobody cares. And if it doesn't match, then immediately the [00:04:00] direction is to say, no, the data must be wrong.
Why don't we go and look? That never gelled with us. If you are going to build a data-driven organization, how can it be a data-for-convenience kind of model? It has to be something better. For example, if we don't see something expected, we don't jump and say the software is wrong.
We don't go and say the CPU instructions are out of whack, it's giving me something completely off. So why is it, what is it about data? Why can't it have the same first-class fidelity and, I would say, trust? When the trust is not there, in our view, it's not the right way to build the organization.
So what we saw was that you can give people the best data warehouses. You can give them the best ETL and ELT solutions, right? Analytics solutions. But unless you build that bridge of trust, it's really going to hold you back as an organization. So for us, when we looked at the space [00:05:00] of data, we said, here is a big need.
And we have been data practitioners forever, and we love solving hard problems there. So how do we help build this bridge of trust? That's why we said, okay, this is really classic data observability, or data quality, for us.
And once you have that, that is how you get the trust.
Vijay: So let's talk about data quality, right? You go to any organization, you'll hear talk of data quality. It's obviously a problem that everybody faces. But how big of a problem is it? Is there a way to quantify it?
Sanjay: When we were exploring this space, we wanted to get a point of view, right? How important it is in an organization, how pervasive it is. And one thing we found is it doesn't matter whether you're a Fortune 5 company, even Fortune 1 or 2, or [00:06:00] Fortune 500 or 2000, or a small startup.
Everyone comes and says data quality is super important to us. There is clearly a big need. So we wanted to see how deep it goes. Everybody says it's a problem, and in fact everybody has their own hat, their own sense of what it means to them. The question is how important it is for them.
And what we found was there are two dimensions we broke it into. One is, how much does it really cost an organization, when some bad data has come in, to undo its changes and its impact? And the studies there, done in fact a few years ago, show that on average it costs an organization upwards of, I think, 10 million dollars, which means that, across different companies and different distributions, some see a much, much larger impact.
And the idea is, yes, it is very real.
Vijay: Wow, 10 million dollars a year.
Sanjay: That's easily, I mean, it has to be. By now, I'm [00:07:00] pretty sure it is way, way above that, because in the last 5 to 10 years everybody has been using data, and they want to use it even more and more to make decisions, so it can only increase.
And the second dimension, Vijay, which you sort of alluded to, as if this number is not staggering enough, is around time. We generally have so much data coming in. Everybody wants to use data, and the small data teams, absolutely strapped for resources, are trying to make it all work.
And what we found, through studies but also from talking to close to 100 leaders in this space so far, is that [00:08:00] at times it costs the team about half of their time, real wall-clock time, chasing down issues related to data. And that really is another big thing, because what they should be focusing on is ensuring that the right data is coming to the right team.
Not just chasing down issues. It cannot be just fixing things. You've got to get new stuff there, and build the right business processes so the problem doesn't happen. So the point here is it's too much of a resource drain as well, the time drain. That's why we said we've got to do something here.
Vijay: So 10 million dollars a year down the drain and 50 percent of resources and time spent on just quality issues. That's staggering. But let's discuss this topic of why it is so bad, right? Obviously data teams are smart, they have resources, and enterprises put quite a bit of investment into data machinery. And this data quality issue has been around for years, right? It's not something new, and the numbers are staggering. Why is it still such a big challenge in this day and age, when there are so many tools and technologies, and data stacks have matured so much?
Sanjay: I wish I [00:09:00] could tell you how it can be solved trivially. It's a very, very hard problem, right? Let's walk back a little. I would argue that if you go back about 15 to 20 years, the problem has always been there.
And to be very fair, we'll continue to see it in some shape and form, because the need and the hunger for data just continue to grow. But look at the world about 15 or 20 years ago. There were very few warehouses, classically the Teradatas and Oracles of the world, with controlled pushes into the warehouse, very few pipelines, the Informaticas of the world, and even reporting was very curated.
Now, just recently we were talking to a company, and we asked the go-to-market team how many data sources they use just for understanding [00:10:00] marketing, the features and signals there. They said they are using more than 40-plus sources, completely cloud-API-driven data coming from all over the place.
The world now is all about data coming in from everywhere, literally; there's no other way to describe it. When it comes to the variety of data, the sources just continue to grow. I walk into a store, and the person there wants to know my spending history, everything about me, my likelihood of buying this, right? And the fancier the store, chances are the more they want to know. It also dictates what happens next when a person walks into the store. All of these things are driven by data.
And imagine this, right? So much data coming in at such velocity. Streaming data, like everything in the universe. A multitude of data sources, the [00:11:00] volume has increased so enormously. And the need for organizations to consume it is so high; they want to integrate it all.
So I would say it's no surprise at all that this problem of data observability and quality has caught on big time in the last few years, more so in the last three to four years, because it just mirrors what we are seeing in the industry as the need for data grows. In some ways, we could rationalize that when SaaS systems came up, there was a need for a certain class of observability systems, and the Datadogs and New Relics of the world came out.
That was about a decade ago, and now we are seeing the same trend with data observability.
Vijay: Yeah, interesting you talk about the number of SaaS tools companies use. You know, we're a small startup, but I think Tom's team probably uses like 25 tools already.
Sanjay: I think it will grow. I would love to know, Thomas.
Vijay: If I get to 100, you know.
Tom: Yeah, I certainly have a data observability problem here in my [00:12:00] marketing team.
Sanjay: One other thing I'll call out is that because we are now consuming this data truly ferociously, for lack of a better word, and in so many shapes and forms, we also see that data has a much shorter lifespan.
Which means we also need to understand what data makes sense today, and when it doesn't make sense, we've got to kick it out as well. That rigor and discipline is needed more and more, specifically because the number of stakeholders dealing with data has grown by orders of magnitude.
It's no longer a few data teams owning this. Everybody wants data. People can create their own pipelines as well. Everything has been democratized.
Tom: So in terms of that, actually, with so many different stakeholders that you mentioned, organizationally, who owns data quality? Is it just the data teams? What about the business teams who are trying to leverage that data? What are your thoughts on that?
Sanjay: [00:13:00] Yeah, our belief is simple. If I look at it from a practitioner's point of view, data quality, to me, is a shared sport. In some ways, I always draw the analogy to how the software life cycle evolves, right?
We have design, we have code reviews, then naturally unit tests, then integration and end-to-end tests, then we put it on canary in production, then we observe what's running in production, and we have all of these things going on. If I look at the data stakeholders, it's very similar, right?
There will be data teams, some responsible for getting data from your APIs or Kafka or equivalent streams into S3, or pushing it into a cloud data warehouse or somewhere else. And some teams will pick it up from there and use it in their businesses. So I have a strong view that data quality and observability is a shared sport.
Everybody really plays a part, and [00:14:00] the intent is they should all stay on top of this. They should all have a point of view of what quality means to them, and it has to percolate upstream to serve that purpose. When we built Revefi, that was one of the things we had in mind, but naturally, I would say it's not really a single place.
And that's why it's also much harder, because now you have to make sure they have a single, unified point of view when they have different ways of consuming the same information.
Tom: Yeah, that makes a lot of sense in this new data-driven organization that most of us are in now, obviously leveraging that data for analytics. And that actually brings me to my next question. There's a saying that perfect is oftentimes the enemy of good, and this obviously applies to data quality when it comes to analytics in particular. Oftentimes we hear our customers telling us, hey, come back six months later, after we solve our data quality problem.
But [00:15:00] the reality is, you probably don't need to wait for that. It can be an iterative process, where oftentimes your business people are the best people to identify issues in your data. I would love to hear your opinion on this topic. Do you believe enterprises should wait for perfect data before leveraging it for analytics?
Sanjay: So, the short answer is absolutely no, unequivocally no. It doesn't make sense to me at all, and the reason is straightforward, right? In some ways, when we see an outcome, we are able to immediately draw the conclusion that this is a problem, and where it started. Unless we see something, it's very hard to draw that conclusion.
So I'm a firm believer in a simple dictum. By the way, if you're in a startup, you know that perfection is not something you reach. You definitely strive for it, but you can't get there by saying this is [00:16:00] where I have to be at all points. It's a journey.
One example that comes to mind is when we were working with a customer, six or seven years ago, based in Europe. ThoughtSpot itself is an extremely visual tool, and it had concepts like SpotIQ, which brought anything anomalous in your data right to your doorstep.
And what was happening was that they had this notion of weights on pallets. There were a few weights that were through the roof, actually three to four orders of magnitude larger than expected. And of course, those numbers were used in all their calculations, like averages.
But when they brought ThoughtSpot on board and it surfaced this to them, they realized, oh, this shouldn't even be there. The point is that once you see an outcome, it is much easier to go back and reason out what's the [00:17:00] right thing to do. So waiting, putting everything behind, let's say, six months for data quality to be perfect?
The question is, perfect for what? If it's going to serve a business purpose, you need to get the business involved as soon as possible. And the way to do that is to create this holistic view. So I would say, never wait for six months. In fact, the idea is to build an experience where they can see it quickly and work together, which goes back to the shared-sport model. They see something wrong, they tell the data teams; the data teams have some questions, they go and ask the business. Get on this virtuous cycle of learn and iterate as soon as possible.
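To make the pallet-weights example concrete, here is a minimal sketch of the kind of check that surfaces values three to four orders of magnitude off. It uses a median-based robust z-score so the bad values cannot hide by inflating the mean; this is only an illustration of the idea, not ThoughtSpot's SpotIQ algorithm, and the data and threshold are hypothetical.

```python
import numpy as np

def flag_outliers(values, z_thresh=3.5):
    """Flag values that deviate wildly from the rest using a robust
    (median-based) z-score, so a few huge weights can't mask themselves
    by inflating the mean and standard deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        return values != median  # degenerate case: everything identical
    robust_z = 0.6745 * (values - median) / mad  # 0.6745: MAD consistency constant
    return np.abs(robust_z) > z_thresh

# Hypothetical pallet weights: two entries are orders of magnitude too large.
pallet_weights = [410, 395, 402, 388, 420_000, 405, 397, 1_200_000]
print(flag_outliers(pallet_weights))  # flags only the two bad entries
```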
Vijay: Great. Sanjay, let's talk a little bit about solutions to the problem. Clearly, building a zero-touch data observability product, which is what you're trying to do at Revefi, is super hard. What makes zero-touch data observability so hard?[00:18:00]
Sanjay: The analogy I draw here is about where you want to be as a company, right? If you want to take on easy problems, the easiest way is to ask people to identify: hey, here is my warehouse. Here are my tables. Pick what you care about. For each table, find the thresholds. Set it up, and if it goes over, just let me know.
To us, this is not observability. This is really level zero automation, or level one at best, if I think of the analogy in autonomous vehicles. Where we want to be is taking on hard problems, and building zero-touch is a very hard problem, because it pushes the onus of all the choices onto us.
We have to know what's relevant, what's important, and bring it to the user at the right time. Definitely a hard problem, but absolutely worth it. Because imagine the following: you have this warehouse, data is changing, and setting manual thresholds [00:19:00] all the time across everything is not even an option.
We know what happens, right? People will set it up, then they will get a lot of alerts, then they say, well, I'm going to turn it off, or I'll make the thresholds 10x higher and see what happens next. That is not the way to build an awesome and usable observability system. Our approach from day one, our mantra, was: how do you make it delightful and truly easy to consume?
We are here to save you time and money. That's what we tell the data teams. How can I save you time if I'm going to ask you for weeks and months to set up a system? It doesn't make sense. So we said, we are going to take this on ourselves and see how far we can test-drive the system, how much value you can get without spending any effort on your side.
So we have stayed extremely close to our mantra of zero touch. In fact, we draw the same analogy: we are at [00:20:00] level three or level four automation at this stage, and that is what helps the teams.
Because it puts them in a place of saying: not only do I know you're watching what I have today, and watching it in a way that makes sense, I also know that when I bring more things in the future, you're going to watch those for me. So for us, this was not really an option.
And yes, we enjoy taking on these problems because they're super relevant and hard. That's what Revefi is all about.
Vijay: A great analogy to level 4 automation in self-driving vehicles, which is an incredibly hard problem. Let's talk about data warehouses. One of the trends that has been happening for several years, and continues to happen, is this movement toward centralizing everything on cloud data warehouses like Snowflake, BigQuery, and so on. What does that unlock for data observability?
Sanjay: One thing again: there's lots of data in the system. [00:21:00] People are collecting data and doing many things with it. But we started with a very simple conjecture: where does it matter the most?
Yes, it's a long journey for us; project a decade out, and we want to cover it all. But let's talk about today, let's talk about this year. The question is, what matters most to the organization? And one thing we have seen as a trend is that cloud data warehouses have become the de facto place: the Snowflakes of the world, as you said, the Redshifts, BigQueries, Databricks, and also Azure down the road. They have become the de facto place where businesses pull their data out and use it for a business purpose.
So for us at Revefi, it was a no-brainer. You say, I need to have data; in fact, I want to reason out whether this data has a purpose. We started by asking, does it have a business purpose? And our claim is that the cloud data warehouse is the natural place to gravitate to [00:22:00] for finding such data. That is what led us to start with observing and monitoring, observability on the dimensions of quality and many other things, for the data that resides there.
And our goal is to reason out how the data even got here. Once you know there's a problem, you want to know where it started, and you want to reason out where it is going, because that tells you how important the problem is. If it's not going anywhere, your problem is not data quality.
Your problem is that the data is absolutely inconsequential. And the first thing you should be doing is not fixing it, but actually removing the data from your pipelines. So yes, I think cloud data warehouses are the place to go. In fact, Vijay, as you said, we second that: the more compute you push onto the cloud data warehouse, the closer it stays to the ecosystem, and the easier it is for anyone [00:23:00] to even consume such a system.
Vijay: When you talk about pushing compute into data warehouses, which is the model we subscribe to, you want to keep data in one place and push down compute as much as you can. But one of the emerging challenges with cloud data warehouses is cost, right? Particularly runaway costs: somebody runs something, and before you know it, the cost is spiraling. So when you look at a data observability solution, you must be looking at cost too. It's not just about quality. Is cost part of your observability solution?
Sanjay: Absolutely, because it follows the same thought process, right? If you have a quality issue, you don't want to just say, I'm going to fix it. You want to ask, should I even be fixing it? Likewise, if you have data here, you should perhaps be asking: is this data serving a purpose? And if it is going to cost me, say, 10,000 dollars to [00:24:00] get the data in, there had better be a multiplier of that on the other side using it for a purpose. Otherwise, it's not really the right thing to do.
So for us, cost and understanding the cost rationale is super critical, because it helps us establish the viewpoint of whether this data should even be here at all, and if it is, how it should come in. To give some examples of what we saw, and these are staggering examples, not to call out any specific warehouse: there was one company we talked to, and they mentioned that over a weekend they had some four runaway queries, and it cost them 70,000 pounds.
It doesn't even matter, pounds or dollars at any conversion; that number is just staggering, and this is really [00:25:00] something you have to watch out for all the time. And a Gmail kind of model, where things just continue to accumulate and you figure out later what to do with them, doesn't work for cloud data warehouses, in my opinion. Anything that sits there will cost something, someday, somewhere, to somebody. The idea is to continuously eliminate as much as possible and keep it lean.
Now, there's another example we saw, from a much smaller company, a startup. The operation was simply pushing data into a table; it was the way they did it. And what they found was that it was an 800-dollar-a-day decision. Which means that every day they waited after the issue began was close to another thousand dollars, and by the time they found it at the end of the billing cycle, it was about 10,000 dollars later. It was a lot.
So in a cloud data warehouse, we say: be [00:26:00] super watchful. Understand what the model is, what kind of compute you're doing, how this data is important and relevant, and stay on top of it. That's why at Revefi we said there's no way we are just data observability; it's really much, much bigger than that, because it brings into the picture the quality, the cost aspects, and also performance and usage characteristics, all in one place.
Vijay: I can totally relate to the cost examples you're talking about. You find out at the end of the billing cycle that there was this thing that had been running for 20 days that shouldn't have been running, and it's blown up your bill. Even a small startup like ours struggles with keeping track of that. And it goes back to what you said about the need for zero touch: nobody has the time to be looking at every single cluster, every configuration, every data warehouse.
No one has the time, right? [00:27:00] Even in a large company with well-staffed teams, there just isn't time to keep watching everything. And you want people to be able to spin things up and spin things down; that's the beauty of elasticity and the cloud. But then costs can get out of control very quickly.
So that zero touch, where I just tell a system, hey, watch on my behalf and let me know when there's a problem, is absolutely important.
Sanjay: I think, Vijay, you brought it out very well. Your goal is to build an amazing product analytics company. You shouldn't be thinking about how much warehouse you're spinning up; this shouldn't be at the top of your mind.
So for us it's extremely natural. If you look at the data team's persona and their mission, it's right data, right time, right cost. We want to take care of it all and empower them to do that, and you have to do all of it to do it justice. Cost is a huge thing. And to be very fair, simple things can make a huge difference in cost. [00:28:00]
Tom: Yeah, what you and Vijay were just reiterating is that level 4 automation matters for any company regardless of size, particularly the small ones like us who don't have the time to identify this cost leakage in the system. So I'm actually fascinated by the platform you're building, and its ability, in the zero-touch model, to identify these data quality issues.
Are there specific AI and ML technologies, maybe even generative AI, which is all the buzz these days, that you've built into the platform to help your users identify cost issues in their cloud data warehouse?
Sanjay: Yeah, so, Thomas, let me walk through a customer issue and come back to the cost piece.
Take that Revefi customer with the [00:29:00] 800-dollar-a-day query: they had to do nothing. All they needed to do was set up Revefi, which took them two minutes, and the system was doing it all on their behalf. If they had had to go and say, watch out for cost here, look for these tables, look for these patterns, any of those things, it wouldn't have been possible.
Because even if they take the effort to do it on day 1, it's not going to happen on day 2, and it's not going to happen on day 30. So automation is critical to getting it right. And what matters is making sure it comes at the right time, as soon as possible, without them spending any effort.
So for us, everything about level four automation screams AI and ML, because the whole point is that you can't build a zero-touch system and take it off the ground unless you have a lot of AI and ML running behind the covers. The Revefi system is completely, I would say very strongly, driven by this mantra: [00:30:00] how much can we automate?
How much can we learn? What do we know of the domain? That's where our own experience comes into play. I didn't call it out earlier, but my co-founder, Sashank, who was also part of the founding team at ThoughtSpot, took a detour and spent four years at Meta, where he worked on this whole area of automated data quality.
And when you are working at that scale, where there are petabytes and exabytes of data, millions of tables, and tens of thousands of data stakeholders versus small data teams, there is a lot of learning you take away. So with the Revefi platform we built, that is why we hit the ground running: we knew what to build, we knew what to learn, and we knew how to push it out.
So general ML and AI techniques are extremely pervasive in everything you do, and don't do, in our platform; we learn [00:31:00] and we figure out how to improve. Our intent is to give you the minimum false alerts and maximize the signal, and that's really super critical for us.
Vijay: You talk about this learning system where you have to constantly be looking at the data, understanding patterns and so on. But that also poses a challenge, in the sense that you want the system to be non-invasive, right? You don't want to get in the way of the workloads the warehouse is serving, for example.
So if this observability system is putting more load on the warehouse, or slowing down other applications that are in production on the warehouse, that's a challenge. How do you make it non-invasive?
Sanjay: I think you're going a little bit into some of our secret sauce. Let me throw out some numbers, because they put in perspective what is super critical for us, and it comes back to architecture decisions. When we said we're going to save you time [00:32:00] and money, we talked about time with zero touch. We also talk about money. I can't save you money if I build a system that has unpredictable costs of its own, its own TCO.
So when we started Revefi, we said: how much can we do for, let's say, less than the price of a coffee a day? How much value can we get to you? And the system is built on a progressive model, which means it operates at super low TCO, which means it is extremely non-invasive, not just in terms of the compute it puts on your warehouse, but also in how much additional cost it incurs.
This is very close to our heart, and it's completely baked into our architecture. We have a general philosophy: don't ask for the same thing twice, and don't ask for anything unless it serves a purpose. Those really go into defining our system. [00:33:00] The simple thing is being extremely, I would say, stingy when it comes to how much compute you incur on behalf of anybody else.
So yes, we take very good care of that. And the best part is that, as part of observability, whatever Revefi incurs is out there in the product for people to see for themselves. They don't need to guess. How many credits we use, for example, in Snowflake, how much time we spend; everything is out there. That's the fun part here, Vijay.
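As an illustration of the self-metering transparency Sanjay describes, here is a hypothetical sketch over Snowflake's SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view that totals the credits a dedicated observability warehouse consumed, so the number could be surfaced in-product. The warehouse name is made up, and this is only a sketch of the idea, not Revefi's implementation.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Daily Snowflake credits consumed by the (hypothetical) warehouse the
# observability tool runs on, so its own cost can be shown to users.
SELF_COST_SQL = """
SELECT DATE_TRUNC('day', start_time) AS day,
       SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE warehouse_name = 'OBSERVABILITY_WH'   -- hypothetical warehouse name
  AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
"""

def self_cost_by_day(conn):
    """Return (day, credits) rows for the tool's own warehouse, last 30 days."""
    cur = conn.cursor()
    try:
        cur.execute(SELF_COST_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```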
Tom: Yeah, so through this discussion you've mentioned many things that are pushing the boundaries of classic data observability, right? We talked about AI and machine learning, and cost being a first-class driver and benefit within your platform. Maybe you can take a look into the future of what Revefi is building: how the landscape for data observability is changing, and what additional pieces should become [00:34:00] foundational capabilities of a data observability platform.
Sanjay: A few things come to mind, and Thomas, you had mentioned the GenAI piece, so I'm going to bring it up. Naturally, we are a startup; we are having a great time working with GenAI technologies, and to be very clear, there is a lot of value we can derive from them, and in the process we'll bring that value to our customers.
Some things we'll keep a little bit under wraps for now; let's just imagine. But the way we look at it, it directly impacts the space we are in, with a very strong connection to the value prop. There are a few things I would say are foundational, irrespective of what happens in that space down the road. What has happened with the evolution of AI is that it's really democratizing the models, right?
They're out there now. The [00:35:00] model is more and more considered something everybody has, and your data is really the differentiator, thanks to the amazing ChatGPT and others in the space. Which means the onus now is on ensuring that the right data goes in to feed these models.
We definitely don't want a garbage-in, garbage-out kind of model here. When you're building with these OpenAI calls and the like, you need to know how to build the prompts. And we need to understand, as we always did in the classic ML and AI world, data drift.
All those problems are still there, and even more so, because it's all getting democratized very quickly. So that's an area that continues to play a big role, even today: understanding what data goes in, in such cases. That's why we talked about the consumption of the data.
Who is using it? Once you have a [00:36:00] handle on that, how do you make sure the right things are coming in? Overall there's a lot of fun stuff, but Thomas, you already talked about what we would want from such a platform, which is not just data quality. It's really having things come to you when they matter the most.
And that's how we summarize our philosophy: it's all about right data, right time, right cost. Go to different personas, and expand as you move forward in the journey.
Tom: Well, thanks so much for joining us today, Sanjay. It was wonderful hearing from a tech leader and data leader like yourself who's innovating in the data observability space. Our listeners are going to be fascinated by many of the insights you shared with us today.
Sanjay: Yeah, absolutely. Thanks, Thomas. And thanks, Vijay. It was a pleasure to be here.
Vijay: Thank you, Sanjay. Always a pleasure talking with you. You're a deep thinker and a great thought leader in this space. It's been a [00:37:00] fantastic episode.
Thank you so much. Wish you all the very best with Revefi.
Sanjay: Thank you, and same to you as well.
Tom: That was a great conversation; it's like old buddies getting back together here live on The Analytics Edge. Why don't we take a second and collect our thoughts? A lot was discussed today.
What were your key takeaways from today's discussion?
Vijay: Yeah, Tom, it's always fun talking to Sanjay. He's a deep thinker with great perspectives. A couple of things. One is the sheer magnitude of the data quality problem: he's talking about an average enterprise spending 10 million a year, with 50 percent of their resources gone just on data quality challenges. That's staggering.
We know it's a large problem, but to really [00:38:00] put a number like that on it puts it in perspective. Then there's the question we asked him about why it's such a big challenge. Obviously the number of data sources has exploded, along with the volume, velocity, and variety of the data.
But one interesting thing that I think is important to remember is that the shelf life of data is very short these days. That's very key. It's not just about the volume of data and how you collect, store, manage, and make sure it's accurate; it's also how you figure out what data is useless and deserves to be thrown out, because it's not relevant after a certain period of time. How do you get rid of data? That's equally a big problem.
Tom: Yeah, for me as well. He talked about the purpose of data being where you focus your data quality efforts, and that ties directly to the business and the trust in the system. And I really liked the answer he gave to my question about perfect being the enemy of good.
These are [00:39:00] ongoing, simultaneous projects in data quality and data observability that need to happen, but at the end of the day, business still goes on. If we're going to be a data-driven organization, we need to be running the analytics and getting actual insights out of that data.
And where there are data quality issues, they're known, and directional guidance is oftentimes the best answer. There really is no optimal solution; the data underneath you and your business is changing so fast. But you need to ensure that you are moving in the right direction at all times.
Vijay: Yeah, he talked about zero-touch data observability and drew an analogy to level four automation in autonomous vehicles, self-driving cars. That analogy was very good, but it got me thinking. It makes sense for data observability: nobody has time to be watching every query, every configuration, every cluster; you want a system that's humming along, watching things for you.
But the same type of thing is [00:40:00] probably relevant for analytics too, if you take it one level higher, to some of the things we're doing. It's a challenge for somebody to go in and dig through the data, slice and dice it, build reports, and finally get to an insight.
With emerging technologies like generative AI and LLMs, there's this idea that you could have zero-touch analytics, if you will, where the system is automatically looking at things, understanding the kinds of things you're interested in, and automatically pushing insights to you.
So that is probably interesting higher up the stack too, for the kinds of things we're doing in analytics.
Tom: Yeah, absolutely. While the applications of AI and machine learning are all the buzz, as we've discussed on previous episodes, many of these technologies have been around for decades already. But it's amazing to see the commercialization, the widespread adoption, and the mainstream awareness of it.
It's really going [00:41:00] to innovate every level of the data and analytics stack.
Vijay: And the final perspective for me is on cost. It's emerging as a big problem in enterprises: as more and more cloud resources are spun up around data, managing costs effectively is a huge challenge for companies.
So his idea of broadening the umbrella of observability, not just to quality but also to cost, the right data at the right cost, I think that's very important. They're touching on something that's going to become increasingly important.
Tom: Yeah, it's also a very interesting dilemma that has reared its head here: the cloud was supposed to help us save cost, and yet new opportunities have arisen as companies now want to actually control and better manage their cloud data warehouse costs. That concludes today's show. Thank you for [00:42:00] joining us, and feel free to reach out to Vijay or me on LinkedIn or Twitter with any questions.
Until next time, goodbye.