Start with Story, End with Data

Ashish Thusoo, Founder of Qubole and Creator of Apache Hive

Ashish Thusoo

Ashish Thusoo is the GM of AI/ML at Amazon Web Services, where he owns 5 product lines including SageMaker. He also co-founded and was the CEO of Qubole, a data lake platform for ML, streaming, and ad-hoc analytics, and previously created the Facebook Data Infrastructure team that built one of the world’s largest data processing and analytics platforms.

Founder of Qubole and Creator of Apache Hive

Satyen Sangani

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”

CEO & Co-Founder

Alation

Producer 1: (00:00) Hello and welcome to Data Radicals. In today's episode, Satyen interviews Ashish Thusoo. When the cloud was still a novel idea in 2011, Ashish co-founded Qubole, a cloud data lake platform, and later served as CEO. Built on AWS, Microsoft, and Google Cloud, Qubole delivers a self-service platform for big data analytics. Earlier, during his time at Facebook, Ashish co-created Apache Hive to democratize data access and analytics. Today, Ashish is the general manager of AI and ML at AWS and owns several machine learning products. In this episode, Satyen and Ashish discuss the accelerated push to cloud, building a data culture, and how the economic climate is impacting customers.

Producer 2: (00:49) This podcast is brought to you by Alation. Meet us at Snowflake Summit this June. We'll uncover how Alation cuts through the complexity to help you find valuable insights in the Data Cloud. Learn how leading enterprises in every industry are using cloud migration to drive innovation and efficiencies. Snowflake Summit runs from June 26th to the 29th. Attend virtually or in person in Las Vegas. We can't wait to connect. Learn more at snowflake.com/summit.

Satyen Sangani: (01:20) Today, on Data Radicals, we have Ashish Thusoo. Ashish is the general manager of AI and ML at AWS, where he owns five product lines including SageMaker, low-code/no-code Edge ML, Health AI, marketing intelligence, and fraud detection. Prior to that, Ashish was the co-founder and CEO of Qubole, a self-service platform for big data. Before Qubole, Ashish led Facebook's data infrastructure team. In that role, he built one of the largest data processing and analytical platforms in the world at the time, spearheading the big data revolution. Ashish, welcome to Data Radicals.

Ashish Thusoo: (01:54) Thanks Satyen, and glad to be here. And looking forward to the conversation.

Satyen Sangani: (01:57) It's funny how in some ways your journey has come full circle, starting in a massive scale technology company building the biggest technologies in the world at the forefront of big data, and now you're sort of at the forefront of another trend in artificial intelligence and machine learning. But before we get to sort of the end, I kind of wanna start at the beginning and talk a little bit about your time at Facebook.

So, there you were, engineering manager of data infrastructure and built Apache Hive. Can you tell us a little bit about that story and the process of getting to building Hive, because that was at the time, quite a transformational technology?


The SQL buzz that drove Hive

Ashish Thusoo: (02:33) Yeah, happy to, Satyen. So this was back in 2007 and at the time, big data was just getting started and the primary catalyst for that big data revolution was a tremendous amount of data that was getting collected, primarily because of web applications. With the advent of the web and with the advent of mobile, essentially a lot of data was getting collected and there were no good systems to actually process that data.

So at the time, Google came up with this very innovative system called MapReduce, which was able to process data at very, very large scale using commodity hardware. And it was a transformational system at that time. Before that, you had to buy a very specific hardware for processing data, like data warehouses and so on and so forth. And Google came out with a system which could put just commodity machines together and create large-scale processing units.

Ashish Thusoo: (03:25) Now this system was — then a team from Yahoo! produced something called Hadoop at the time, which was the open source implementation of this idea that got very popular and that got picked up by a lot of companies, including Facebook. But the primary application for that was predominantly building indexes, search indexes.

However, we at Facebook looked at that system and thought that there were great possibilities of taking that system to introduce it to the analytics users, people who were trying to get sense out of that data, people who were trying to query the data. So we married this system, which had large processing power, with the lingua franca of data, with SQL, and we said, “Let's bring that together to see if we can bring the power of this compute infrastructure to the SQL users.” And that's how the genesis of Hive came about.

Ashish Thusoo: (04:14) And Hive was basically the first implementation of SQL on MapReduce and Hadoop, and it just opened up a lot of possibilities immediately. And we built the first prototype within weeks, actually, it got picked up by a lot of internal users at Facebook, a lot of analysts, a lot of data scientists at Facebook who started interacting with their data with SQL. Before that, it was all locked up. People had to write MapReduce, Java programs, which is insane. It was in the realm of a developer and every analyst had to go back to a developer to say, "Hey, I want to do this analysis on data. Can you write this Java program for me or MapReduce for me?" And that was just not scalable.

But with Hive, what we said was, “Let's bring this immense processing power, marry it with a SQL interface on top,” and voilà, it became self-service data for most of our analysts. And that really uncorked a lot of insights that the analysts could get out of that data. That started that revolution of SQL and Hadoop, and then it led to multiple of those ideas coming out in open source and so on and so forth.
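As an illustration of the idea described above, here is a minimal sketch of how a simple SQL query with a filter and an aggregation could be expressed as map, shuffle, and reduce steps. It is plain Python written for this transcript, not Hive's actual implementation; the `events` rows, the field names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the table and field names are illustrative only.
events = [
    {"country": "US", "action": "click"},
    {"country": "IN", "action": "view"},
    {"country": "US", "action": "click"},
    {"country": "IN", "action": "click"},
    {"country": "BR", "action": "view"},
]

# Query being illustrated:
#   SELECT country, COUNT(*) FROM events
#   WHERE action = 'click' GROUP BY country


def map_phase(rows):
    """Apply the WHERE filter and emit a (key, 1) pair per matching row."""
    for row in rows:
        if row["action"] == "click":
            yield row["country"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Compute COUNT(*) for each group by summing the emitted ones."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases, mimicking one MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(events))
```

The point of the sketch is the contrast Ashish draws: the analyst only has to write the two-line SQL query in the comment, while the hand-written map and reduce functions are the kind of work a developer previously had to do for each analysis.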

Satyen Sangani: (05:16) You mentioned that it started in 2007. How long did it take from sort of the ideation where you said, “Hey look, I think Hadoop needs a SQL interface,” to actual delivery of the first version of this platform?

Ashish Thusoo: (05:28) It was done very iteratively. So we first actually wrote a simple prototype within, I would say, a couple of months, of very simple SQL, which was just filters and simple aggregations. Very simple. We just tried to test out whether there would be demand for this, and that got picked up within weeks. People started asking for more and that's when we said, you know what? Let's graduate out from this prototype to build a full-blown system with all the bells and whistles that SQL provides. All kinds of various JOINs, ability to do subqueries, and so on and so forth.

So we then just worked on it. This was initially a team of about 7, 8 people who then started working on it. And I think it took about, I would say, 6 to 8 months to open source the first version. And then we started to build our community around it. And then a lot of development started to happen within the community as well as within Facebook around Hive. And that went on and on and on for a couple of years and more. But the initial version, the initial test of whether this idea would actually fly was done within a few weeks. So within a couple of months we essentially put out the prototype, we saw the demand, the pull was direct, and that gave us the conviction that this is a great idea, we should go forward with it. And that's how we ended up building up the entire system, over the years, after that.

Satyen Sangani: (06:49) So a couple of weeks to MVP, 8 months to the first open source release. And did the open source community just take off? It ended up ultimately being one of the most popular open source projects that I can remember. And certainly at the time it was one of the most popular.

Ashish Thusoo: (07:03) That's right. Yeah. The open source community did take off. We had people contributing from companies like Yahoo and then it got picked up by a bunch of other systems as well. It inspired a lot of other systems as well. And in 2020, the paper that we wrote on this got the test of time award from, I believe it was, IEEE. And it all started with that initial foray. A lot of open source work, a lot of community work, a lot of recognition in the research, and then of course, it sparked off all the SQL and MapReduce and a lot of companies came out of it and a lot of open source projects also came out of it.


Simplifying data delivers its power

Satyen Sangani: (07:39) And I would argue at least some of the roots of the two biggest companies in data today that are standalone, Databricks and Snowflake, certainly trace back to a lot of the work that you pioneered. So, an incredible accomplishment.

One of the interesting things that I remember was that when we were starting Alation, we were aware of Hive and obviously knew about it and knew about it as a tool within Facebook. What was interesting to us is that there was this shared query environment, and I forget what it was called, there was a specific name for it, but I remember we had gone into speaking to some users at Facebook and they were super excited to show us this shared query environment. And it became one of the sort of levers or the platforms off of which we would then build what became known as Alation Compose, which was basically a SQL editor that had auto suggestion and previews. And I guess tell us a little bit about that tool, because that UI seems like it was almost as transformational as the ability to provide SQL on the backend.

Ashish Thusoo: (08:34) It was. It was. And that tool was called HiPal. So it was like...

Satyen Sangani: (08:37) Yes.

Ashish Thusoo: (08:38) Hive and a friend of Hive. So just put… [laughter]

Ashish Thusoo: (08:41) Brought that together. So the great thing about Facebook was that at the time a lot of innovation was bottoms-up. Nobody asked us to build Hive, we built Hive. Similarly, nobody asked people to build HiPal, but there were a few analysts and engineers in our teams who said, “SQL is great, but we still need interfaces where people need to look at metadata, people need to get help in authoring the SQL, people want to share their analysis with other folks within the company.” And, one option was to go get the BI tool of the day and try to install it and do all that kind of stuff.

But many of those things didn't really work with Hive at the time. And many of those things didn't really work with Hadoop at the time. They were all on the previous generation tech stack, which is primarily data warehousing. So this part, I remember there was a hackathon that we did and that tool actually got built, the first version got built in a hackathon, believe it or not.

Ashish Thusoo: (09:35) It was an overnight project which people built out with the primary intention: “Hey, let's have a centralized place where we can have the metadata, we can have the descriptions of these tables, we can have a mechanism where people can author queries easily and share those analyses, and so on and so forth.” And when that was done, that was even more... Hive was transformational, that actually even simplified how people could use that interface, and that became extremely powerful in Facebook.

All this started when Facebook was a 300-people company. When I left Facebook in 2011 to found Qubole, at that time it was a 5,000-people company and more than 30% of the company would use these interfaces to actually work with data. So both HiPal and Hive, they were so transformational and HiPal was a tool that you mentioned, Satyen. I remember talking to you and your co-founders. I think this was back when you were just starting Alation.

Satyen Sangani: (10:30) Yeah, it was 2012. I remember we first met in 2012.

Ashish Thusoo: (10:34) That's right. That's right. And that was the tool and that was very transformational. It was the right place, right time.

Satyen Sangani: (10:38) It was funny because in the history of Alation, I had always conceived it as a catalog, kind of this LinkedIn for data where you could just go look up a data set and statically understand it. But then in our travels, one of the things that we found was nobody actually felt like they needed a catalog, but then we found two companies. One example was Facebook and the other one was Disney Interactive, which I believe was basically the product of an acquisition. And in both of those cases I saw these two companies that had invented these query tools, which to me at the time seemed so counterintuitive because, gosh, there were four bajillion SQL query tools that were out there. And yet that became the transformational capability.

When you saw that HiPal capability, did you understand its power? Was it immediately obvious out of that hackathon that this was gonna be transformational?

Ashish Thusoo: (11:26) Yes. The pull was very, very strong. And by pull, I mean the number of users who started using that tool immediately. Of course in a hackathon you can just build a very, very simple thing. But the amount of requests that that team started getting in terms of feature requests told us that we had hit a nerve, that there was something there, some unmet need that was there that this tool could fulfill. So it was very clear that there was pull here and then we started to essentially go further down and invest more in that. But yeah, it was very clear very early on for both these tools, whether it was HiPal or for Hive that there would be a pull. They were both trying to address a certain unmet need.


SQL knowledge sparks the data engineering revolution

Satyen Sangani: (12:08) And arguably kicked off what we now know today as this kind of space of data engineering because that wasn't really a thing back then, but today, this idea that somebody that knows SQL can be quite powerful and independent was really a function of a lot of the work that you...

Ashish Thusoo: (12:22) Yeah.

Satyen Sangani: (12:23) You did.

Ashish Thusoo: (12:23) So yeah. You're absolutely right. A lot of these systems started data engineering as we know it today. And before these systems came into being, there was no discipline like data engineering as such. There were, in data warehouses, people who used to write data pipelines and those types of things. But the modern discipline of data engineering where it's a confluence of both understanding SQL, both understanding the power of some of these systems and all these new tools that came in around catalogs and all that stuff was nonexistent before. So there you're right, many of these systems started that discipline.

And data engineering before these systems came into being, it was not of course known as data engineering, but business intelligence was mostly in the IT space. Business intelligence was brought into the product space by many of these innovations. Before that, the R&D teams would not think about business intelligence but because a lot of web companies came about and a lot of people wanted to understand what their products are doing, that's where the first application of these systems came into being and that's where the data engineering discipline as we know it today started to emerge. And now obviously, it's gone further out from just R&D teams to back to its roots in BI. But that's generally the evolution. And these systems had a big play in starting that.


Founding a tech company that serves a trio of constituents

Satyen Sangani: (13:42) Yeah, and I think that mindset of iteration is really the essential thing. Obviously combined with the technology that allowed for this change in how people did their work. That analytics weren't a purely waterfall oriented, every six-month-updated thing, but that one actually had to constantly update the models, constantly update the experiments, constantly update what they wanted to learn about, which I thought was an incredibly powerful thing and only could have been done at a place like Facebook that at the time was one of the very rare institutions, maybe there were a few, but at the time it was one of the very rare institutions where that kind of experimentation could take place.

All of which led you, I think, to found Qubole, I believe in 2011. When you left Facebook to found Qubole, how did you think about the founding hypothesis and what was it then and how did it evolve over time?

Ashish Thusoo: (14:29) At the time it was a very small set of companies like Facebook who believed in and invested in these systems where you could create self-service tools for analysts and engineers and scientists to go about working with data within the company. So we saw the success of that in Facebook and we thought that this paradigm of enabling self-service and enabling a place where analysts and engineers and data scientists...

Data science was not known as a data science discipline at that time, but it started to emerge. It was still early. We saw that such a platform could do wonders for uncorking the insights that the data could provide for companies like Facebook. And we thought that there was no reason why this could not be brought to the mainstream and it would just be a matter of time, because the data was growing, generally speaking. There were a lot of sources of data and people wanted to combine data sets together in order to get insights.

Ashish Thusoo: (15:26) We thought that this phenomenon that we were seeing at Facebook would then go mainstream, and that's why we started Qubole. So Qubole’s early genesis was, “Hey, let's build a platform that would enable all these three constituents, whether it's an analyst or a data engineer or a data scientist.” Primarily at that time it was analysts and data engineers. They could work on a single platform, it would take care of all the infrastructure needs. We did a lot of work in building out auto scaling and so on and so forth, and would give them very friendly interfaces like SQL. We had built simple interfaces to that as well. It would give them those things to be able to play with data and to collaborate with each other. And that's how Qubole came into being. Now, the other thing around Qubole was, at the time, cloud was just starting.

Ashish Thusoo: (16:12) [Cloud] had started a few years back, but it was still very much a place where folks would experiment with and would not run production systems apart from digital-native businesses or startups.

We thought, “Hey, if you want to build a self-service platform, why don't we actually leverage the cloud to build it?” Because we could completely automate infrastructure and we could just focus the users on the interface where we took care of everything else. And that's how Qubole came into being. So it was a cloud-based platform. It was only available in the cloud. There was no shipping software or anything like that. It was a platform which enabled self-service for these three constituents. And it was a platform where data engineers could go and use, if they wanted to use MapReduce, they could use MapReduce, if they wanted to use SQL, they could use SQL at the time.

Ashish Thusoo: (17:00) And later when we brought in Spark, as well, they could use Spark to build pipelines to create these data sets and the analysts could then go in and query those data sets and work with those data sets. And later when data science came into being, those tools could work together with it, as well. So that was how it all started and, full circle now, we see a lot of companies, both hyperscalers like Amazon, talking about that vision and building to that vision, and companies like Databricks and Snowflake having built to that vision and built out those platforms as well. So that is how the genesis of Qubole and the general idea of Qubole came into being.


The impact of cloud — and COVID

Satyen Sangani: (17:35) Yeah, and I remember at the time there were very few standalone data companies that were trying to do it. Certainly, Google was starting to become a thing. Lots of limitations in the early days around the types of SQL that it could handle. I guess the closest would've been maybe Redshift, which on the other side was a very different thing, but also that was like a re-implementation of a technology that they had OEMed. And so it feels like, and I can't remember too many others where you had the data storage fully delivered, storage and compute fully delivered through the cloud. That was also I think pretty innovative at the time. And I can't remember much else.

And a couple of years later, of course Snowflake would come to be, but that was pretty transformational. I guess this idea, this moving to the cloud, it seems like COVID really sort of accelerated that and you were obviously still at Qubole during the time this all happened. So much changed to me because one could’ve seen a line where Qubole became what is now today Snowflake or Databricks. What changed during the period of the last, I guess, call it decade? What were the subtle differences? Because to those that are not inside baseball, it'd be hard to understand why did some things do really well and other things scaled less? What were the big transformations from your perspective?

Ashish Thusoo: (18:46) There are a few things that changed. First, the cloud. Yes, cloud did accelerate during COVID. Most industries accelerated, and in certain industries COVID did take a toll, especially transportation-related or travel-and-tourism-related ones. But before we go into that, let's talk a little bit about the cloud. And we saw this journey in Qubole: from 2011 to 2014, cloud was still considered to be something which was a toy and people would just use it for experimentation. Nobody actually thought that it would become something that enterprises would adopt.

2014 is when, if I remember correctly, I think it was that year, Amazon started publishing the cloud numbers separated out from Amazon.com's own revenue. And that, I remember, was transformational, where people started seeing that this is a real business. People also started seeing enterprises (initially, I think, it was one of the banks; Capital One was one of the early financial institutions) which said, "We are gonna go into AWS." And they went up at re:Invent and talked about it. Those things started opening up people's eyes that this is not just a toy, this is actually a safe system. People can run production things, and the amount of agility and flexibility they would get in terms of infrastructure was 10X more, maybe even 100X more, than they could get by deploying their own data centers.

Ashish Thusoo: (20:11) When COVID came in 2020, everything became remote. Most companies were forced to have people working remotely, and then that is when cloud became the central point. Okay, if you had to collaborate, then it's much easier to do that in the cloud. Also, because people wanted flexibility. There was so much uncertainty in the business, people wanted to go toward deployment infrastructure where they could turn it on or turn it off at will. And that really accelerated a lot of cloud adoption.

That is true for most companies. Qubole's business was a little bit more heavily weighted toward companies in travel and transport. While we were doing pretty well at that time, we felt that we would probably get more negatively impacted by COVID than most other companies. But at the same time, we felt that there was a lot of potential that the business had and therefore we wanted to find a partner where Qubole and that company could actually exist and do better, and that's how we ended up selling.

Ashish Thusoo: (21:08) And if you remember the first half of 2020, I think March is when things shut down and the first six months were a lot of uncertainty. It's only in late 2020 and 2021, it became clear that the cloud is actually going to gain a lot from COVID. But before that it was not very very clear. So that's how it turned around. But there were also subtle differences between Qubole's strategy and, say, Snowflake and Databricks' strategy. And the subtle difference was our vision was spot on, but I think we went too broad, too quickly, whereas Snowflake focused a lot on data warehousing and Databricks focused a lot on, at that time, data science and then later much more on data engineering. So we became a very top-heavy platform. Some of the biggest names in the industry who wanted to use data were Qubole users.

Ashish Thusoo: (21:55) They were very, very happy with the platform at that time. But the bottom of the market where there were single-use cases is where companies like Snowflake and Databricks started to dominate. And that also helped protect a lot of their business from any downturns, any impact that COVID would've had in certain specific industries.

So that's how it sort of transpired, but I think the vision was spot on. The place where all these three companies were running towards was also spot on. We also got to a pretty decent scale. We went to about 50-plus-million in revenue run rate, but of course, as we know, Databricks and Snowflake did way way much better. And that was, I think if I was to boil down to two things, it was initial focus on certain use cases, which led to the breadth of the market and then a little bit of, of course, fortune, which is around what use cases got accelerated by COVID and what didn't.


Satyen Sangani: (22:48) Yeah. It's a super instructive case study, in the sense that, I remember I was meeting with a venture capitalist, this guy at Sequoia named Aaref Hilaly, and I remember he said to me something like (this is very early at Alation), "When you're early, a little bit of this and a little bit of this, the companies may look the same, but a slight tilt may make such a difference in outcomes." And this didn't quite have a lot of meaning to me at the time, but I think at this point in time it does feel like a very astute comment.

So all of this that's now happening, here you are sitting inside of Amazon and, of course, Amazon, the business still seems like it's this continued juggernaut, massive growth, even at this scale, massive gross margins true for almost all of the cloud mega scalers, but obviously particularly true for AWS and Microsoft. How is this environment, how is this particular economy changing things? What is most top-of-mind for your customers, your users? Is it just acceleration of most trends or slight deceleration, or how are your customers reacting?

Ashish Thusoo: (23:50) Generally speaking, we feel that there are two things which are happening and AI/ML and cloud are a little different. So I'll talk a little bit about both.

In the cloud, so right now there's a flight to ROI because I think two years back everybody wanted growth at all costs. Now people want to grow with efficiency and I think cloud is a great platform to do that. So I think we are essentially seeing a lot of our customers though of course some of the customers say, “Yeah, everybody's being a little cautious.” But generally speaking, we feel that cloud is a great platform, which allows you to turn on and off things, especially even when there's uncertainty, especially when you are trying to move your strategy from grow to ROI. It just gives you a lot of flexibility in terms of infrastructure and I think that's what we are seeing as far as cloud is concerned.

Ashish Thusoo: (24:39) Now in the AI/ML business, things are in a very different space. In AI/ML post-ChatGPT, it seems to me ChatGPT is the same watershed moment that happened with the cloud in 2014 with Amazon announcing their numbers. AI has now sort of moved into the mainstream consciousness, whereas previously it was a lot of research, and cutting-edge companies like Amazon, OpenAI, and a bunch of others were the ones pushing the envelope. But now it's mainstream. So not a day goes by where we are not asked by our customers, "Hey, how do we use this stuff? Especially generative AI stuff. This seems to be so transformational. What do we do?"

There are ISVs and product people who talk to us and say, "Okay, how do we make our products better by adopting this particular technology?" There are people in companies who talk about, "Okay, how do we automate functions or automate or assist in functions like customer service, functions in marketing and sales?" I think that is what has happened here and we see a tremendous pull from the market around AI/ML and specifically around generative AI.

Ashish Thusoo: (25:53) Something that I've not seen before — I think the only place where I've seen this before was maybe when the iPhone came out; I think that was probably the most transformational and similar. So we are very excited about that trend. I'm at the crossroads of that trend in SageMaker. We have a bunch of things and Amazon announced their own models. Just last week they announced a new service, first-party service, which brings in a bunch of our first-party models as well as our partners into providing those services to our clients.

But I still feel that it's day one, and I think this next 10 years is going to be just defined by how this plays out. There's a tremendous, tremendous amount of pull from the market as far as AI/ML and generative AI are concerned, because this is so transformational. So in some ways the changing economic conditions have not affected this particular pull for GenAI.

Ashish Thusoo: (26:46) As far as the overall cloud is concerned, of course it is a flight to ROI. I think specifically for the cloud though, people are a little cautious in terms of where they want to invest and stuff, but specifically just the flexibility that cloud provides actually makes it better for folks to adopt the platform now than looking at trying to run their infrastructure, building out data centers and stuff like that. So that's how things are panning out. So there's still a lot of opportunity, a lot of growth in front.


Big data passes the transformation torch to AI/ML

Satyen Sangani: (27:18) And I think in some sense I feel like these two trends are related because I think what it feels like is happening now is that where AI and ML were more speculative events, even seven to eight months ago before ChatGPT was sort of even widely recognized and known. Now, I think to your point, people's imaginations have allowed them to realize that, “Wow, this thing can be useful to me right now, this second, whether it's developer productivity or customer support productivity or assisting analytics.” I mean, there's a million use cases where people could become far more productive, far faster with this capability.

And I guess, did you see this happening as fast as it has happened when you left Idera for Amazon? I mean, was this very obvious to you? Is this why you ended up where you ended up?

Ashish Thusoo: (28:04) Well, I joined Amazon because I wanted to be close to AI/ML. I felt it's gonna be transformational, and all of that has proven to be true. When I joined Amazon, I felt that, yeah, in two, three years time, a lot of great things will happen. Still, the inflection of ChatGPT has astounded me. I think in four months time the whole world has changed and it's phenomenal. It's a phenomenal amount of work that has gone into it from OpenAI and a bunch of other companies, including our partners and so on and so forth, that it's been quite surprising how the uptick is.

But even though it might seem that things have changed quite a lot and the buildup for that has been in years, that's the reason why I joined Amazon AI/ML, was primarily because I felt that this technology is going to be there, defining how we do work, defining how we operate various things for the next 10 years, much in the same way that big data did it for the last 10 years as far as data is concerned, or the cloud did for the last 10 years. That's the reason I joined and it's proven out to be spot on, but the rate of innovation, I think is much faster than I've ever seen before.

Satyen Sangani: (29:08) Yeah, and what's incredible for me is, I mean this trend is so fast. I had a friend who stopped a company that he was working on in December — serial entrepreneur — then literally in 30 days puts together a pitch deck around GenAI, raises $50 million as a seed, by the way, no revenue, it's just pitch deck. And where the primary argument is that these compute models are basically the LLM and the development of the LLM in a domain-specific case is extraordinarily expensive.

So on some level, if the rich are getting richer — Amazon, in this case being the rich — I mean you now even have the generative AI trend pushing even by an order of magnitude, far more compute. Is that starting to show up in the numbers? Like are you seeing that, like is there a big uptick in those workloads as well?

Ashish Thusoo: (29:54) I can't talk about the numbers obviously, but there is certainly a lot of pull and these models, both for training as well as for inferencing, they use a lot of horsepower, a lot of compute. The fact that there is a lot of pull in putting these models into mainstream applications just tells us that it's already big, but it's going to be bigger than anything else as time progresses. Now, how that transpires out, will each company build their own models or will there be few companies that will build those models or it's gonna be a combination of both? I think the bet is, it's going to be a combination of both. That's what we believe, that there'll be companies that will build generic models, which people will use. There'll be companies who shall build specific models. There'll be a lot of innovation that is already happening in open source, like the models that Facebook released, much smaller models but trained on larger datasets like LLaMA and Alpaca.

Ashish Thusoo: (30:47) So, we just feel that this whole ecosystem is just starting and it's going to be very, very interesting and very innovative for the next many years. And in that interim we want to be the place where people do all those types of workloads. AWS wants to be the place where they'll do all those types of workloads, not just because it drives compute and all, but also because we feel that this is where the future is, this is where the applications are going, and we are very excited about those possibilities.

Looking forward to how all this transpires. But I think it's safe to say that there's room for generic models. There's room for specific models. I'm talking about foundation models. There's room for domain adaptation, there's room for innovations in all the tools that have come out like LangChain and LlamaIndex and so on and so forth. This ecosystem is just taking off and we want to be the place where it all happens.


Where is AI/ML investment headed?

Satyen Sangani: (31:41) And so what is your strategy, therefore, for the AI/ML business? I mean even Amazon, blinding scale, massive access. Every business leader has to focus and Amazon is famous for its sort of 6-pagers. For whatever you can tell us, where are you investing and where do you feel that you can have differentiated pull in this market?

Ashish Thusoo: (32:01) We have been actually building many of these models. Just Amazon.com itself, even before AWS started, even before AI/ML started, Amazon.com has been using science, machine learning, and AI in a lot of its areas. So we've been essentially doing all of that for a long, long time and have developed a lot of expertise. We have a lot of science teams that work on these areas.

Alexa was the first very early one, which did a lot of conversational stuff, and that started a lot of these things. We have a lot of science teams that are focused on this. We have services around this area. SageMaker, of course, is a great place for machine learning and AI. And at the same time, we have also invested in building our own foundation models, like the Titan models that were announced with the Bedrock service. So our strategy here is we want to be the place where a lot of this development happens. And we want to be the place where we are providing these building blocks for people to take these models, to put these models into production. And we feel that there's not just going to be one single model. There's got to be selection available, whether it's for application developers or data scientists, there's got to be a lot of selection available on a platform where they can experiment with these, see what model fits their use case, and apply those.

Ashish Thusoo: (33:21) Use cases might be different. Some models are good in certain kinds of conversational things, certain other models are better in summarization, certain models are better in different modalities, like speech models are a little different, image-to-text models are different, and so on and so forth. So that's basically the reason why we launched this new service called Bedrock, which brings forth our own models as well as some select 3rd parties — like Anthropic, AI21, and Stability — to bring those to the developer community, to build applications around it.

So that's basically our strategy. At the same time, we provide all the open source models which have been developed in open source on our platforms as well. So strategy is selection, give easy tools for people to actually play with these models, have our own models there as well, and then they can choose and bring in those models to build applications. And I think there's a lot of experimentation that is going to happen in this area, while people figure out which one is good for what. And that's basically the area that we are essentially going after.

Satyen Sangani: (34:22) Yeah, and I think can uniquely go after, which is obviously the essence of it.

Ashish Thusoo: (34:27) That's right.


Facebook’s data culture vs. Amazon’s

Satyen Sangani: (34:28) You've worked at these kind of incredible organizations, Meta when it was going through this massive scale, obviously back then known as Facebook, and then of course now Amazon today, and both massive organizations of scale. Obviously Facebook then and Amazon now are two very, very different beasts, but both arguably have this thing that we talk about in this podcast called "data culture." Lots of bottoms-up innovation, strong personalities, people who have views and can test them. What are the commonalities between those two cultures that you have found and what are the differences? And walk us through that because I think it'd be really instructive for people to know the bounds of what data culture might look like.

Ashish Thusoo: (35:06) So you're right. Both these organizations are extremely data-driven. I of course don't know how Facebook is operating today, but Facebook circa 2011 was very data-driven. Data was actually the center of their business. It still is in many ways. Amazon is known for being data-driven. Some of our practices, internal practices have become like case studies around what it means to be a data-driven culture. So that part is true. Both these organizations focused a lot on innovation. Amazon in the cloud has innovated and has really created the market there. Facebook obviously created the market as far as social networks are concerned, both from the product side in terms of what innovations were brought out on the product side, as well as creating a business out of it. Both of these organizations think big. There's no incrementality, but a lot of “think big” is encouraged as to where you try to even question the current assumptions and current beliefs and see if that's going to change, and that's how the innovation wheel sort of gets spun in these organizations.

Ashish Thusoo: (36:12) So a lot of similarities that way. Of course, leadership is extremely strong in all of these organizations. So a lot of similarities that way, and they've come to dominate their respective industries and to define their respective industries.

At the same time, there are differences. Every single company has their own unique DNA, so to speak. And Amazon's DNA is driven by: “Our business came out from retail, then we obviously created AWS.” Two things which are common to those businesses. First, the retail business is known for being run with high efficiency. So very, very small margins. So you can see that culture dominate; frugality is one of our core leadership principles. Apart from many other leadership principles, that is something that we really, really believe in. And that sort of comes into play as far as Amazon is concerned.

Ashish Thusoo: (37:03) AWS is a business which has got a very broad surface area as compared to Meta's businesses. And it caters to a lot of B2B clients, much more so than in the case of Facebook. So that shows up in terms of how the company is operated and how it is operationalized and so on and so forth.

And I think those things are a little different from Facebook. Facebook back in 2011 was much more of a product-driven company. Amazon I feel is a lot more of... It's a machine. It's just an amazing, amazing piece of... How you construct an organization to keep delivering innovations after innovations and markets after markets. It's just amazing how this thing has been constructed. So those are some differences there.

The DNAs of the teams are a little different. The operational ways of the teams are very different, but the commonality is that they're all innovation-driven. They're both data-driven. They've both been constructed in such a way where there's a lot of bottoms-up innovation that happens. So those are the commonalities, but the DNAs are also quite different based on where both these companies have come from.

Satyen Sangani: (38:12) Yeah, it definitely feels that Facebook at the time, and maybe through its history, has always had more of a largesse mentality, although maybe today's version, like the “Year of Efficiency” Mark Zuckerberg version of Facebook this very moment may not be experiencing that, but that doesn't feel like it's been the history of Facebook. Where Amazon, it just feels like there's this, to your point, this incredible innovation of the machine.

As you've come in from the outside, because I think this is just a fascinating thing, like what are the cogs of that machine? If you were to identify the 1, 2, 3, 4 things that are so unique that Amazon does that is different, what are they? I mean, one thing I can think of is the 6-pager, but I've only been outside the organization. What does it look like on the inside?

Ashish Thusoo: (38:52) First of all, in Amazon, the amount of document writing, document reading that we do is just very different from any other company. In most other companies, people put out PowerPoints.

We don't believe in PowerPoints, we actually write docs. All the doc writing has to fit in either a 1-pager or 2-pager or a 6-pager, depending upon what is being discussed. So there's constraints there, but the doc writing enables us to communicate at a much deeper level and a much more detailed level than what a simple PowerPoint might do. That is the core of how the company operates.

Now around this thing are constructed a lot of mechanisms which enable us to create new products. First they are ideated around a PR/FAQ, then funded based upon that ideation and the projections around what this could do in the market, and then once they're funded, operated. We have products which are operated in a certain way before they get to product market fit, and then products which are operated at scale. And then there are a lot of operation practices that happen behind the scenes in trying to make sure that more and more innovation happens in these product areas.

Ashish Thusoo: (40:01) At the same time, a lot of focus on operational excellence of how these products are built because our services have got to be up all the time. People run their businesses on our services, so they've got to be up all the time. So there's a lot of focus on that, a lot of focus on security.

So when you bring all these mechanisms together around the seed of doc writing, a lot of things are built around that. And these mechanisms are essentially what enables us to keep the wheel of innovation spinning, and that keeps feeding a lot more trust that we get from customers that allows us to innovate more and so on and so forth. And then that flywheel takes its effect and that's how we've constructed it.


Amazon’s “federated startup” structure

Satyen Sangani: (40:41) But at this scale, I mean, most organizations go through a life cycle and there's the inevitable bureaucracy and slow decision making and fiefdoms that occur. I mean, I'm sure some of that exists there. I can only imagine, not some utopia, but what is it about Amazon that sort of allows for those things to not take effect?

Ashish Thusoo: (41:03) It's a very federated structure. So one way to look at Amazon is it's a group of a lot of startups internally and I have not seen this structure in a lot of places. We have GMs that own and start and build businesses. There's a lot of leeway given to these GMs to run those businesses as they deem fit, and of course there's a support structure which goes around it, but this federated structure is something that I've not seen in many other companies where things roll up and then the decision-making is not as distributed. Now there are of course decisions that happen that are top-down in all companies and same at Amazon, but when it comes to building your product strategy or your business strategy and trying to grow your businesses, that's federated out. And that allows us to create so many innovative things at the same time. And I think that is very deliberate on how this organization has been structured to deliver that way.


People are the foundation of the data culture

Satyen Sangani: (42:01) It's an incredible set of transitions having been in the shoes of an entrepreneur and certainly now that you're talking particularly in this podcast to lots of executives: If you were to take away one or two things from your experience here at Amazon and maybe even since Qubole, what bits of advice would you give people about how to build a data culture or building culture? Because I mean, you've obviously done it, been a part of it, seen it. What would you recommend people focus on and where would you recommend that they spend their time and efforts?

Ashish Thusoo: (42:31) So it all starts with people. Especially if you're constructing this in the first place, it all starts with first people. So you gotta bring the like-minded people on board. You can't build a data culture if somebody doesn't believe in a data culture, if the leaders don't believe in the data culture. So the leaders have to first believe in culture.

So assuming that that is done, after that, it boils down to investing in systems and processes: data around data. For example, in Amazon there's not a single 6-pager which doesn't talk about data. It's just core in terms of how we converse. You've got to quantify things as opposed to just talking about anecdotal things. And once you bring that focus in, the processes and mechanisms are the ones that carry this thing further. It's very easy to fall into a place where...

Ashish Thusoo: (43:26) I've seen a lot of business presentations where people start talking anecdotes. “Hey, we heard this from this customer or we heard that from that customer.” But then they're not able to bring those anecdotes down to, “Here are certain metrics or here are certain measurable ways in which we can talk about the impact of those anecdotes.” And that is the bridge which data culture essentially does. And once you set that expectation in your processes and mechanisms, it starts to flow. But it all starts with people, though.

If the leaders don't believe in that, and there have been plenty of leaders who say, “You know what? My gut is the right way to do things and I'll just go with my gut,” well, then you can't build a data culture. But if a leader says, “You know what? There's my gut. Yes, of course, but let's actually quantify it and figure out whether it's in the right direction or not,” then you start seeing underpinnings of data cultures start to emerge and then that spreads around in the organization, then everybody follows that paradigm and that's how it is built. So it all starts with people and then processes and mechanisms follow.

Satyen Sangani: (44:21) Yeah, and it's ironic because of the use of data and the prevalence of data. And so a lot of the people who are chief data officers often themselves have a struggle to quantify, but it still is possible. And I think they can be the models or at least ought to be the models for the rest of their organization and have that opportunity to do that.

Ashish Thusoo: (44:38) It is totally possible. Now you have to remember, human beings are trained from the get-go to talk about stories, not data. That's how we learn. So it takes special discipline to bring the conversation back to data saying that, “Okay, fine, you have this anecdote somewhere. Get me the data that proves or disproves it.” That specific mindset has got to be inserted into the organization and that's how it becomes data-driven. It's a very fine line, but if you cross that line, essentially you become a data-driven organization. But if you stay on the side of anecdotes and stories, then you can't bridge that, and it takes some discipline to actually do that. And once that is done, that's when you get to a data-driven culture.

Satyen Sangani: (45:19) We're gonna stop right there. I don't think there's much more than we... Well, there probably is a lot more you could say to beat that, but I'm not gonna try because I might screw it up. So we've known each other for years and yeah, this was a real fun conversation for me because I think there's a lot of questions that I would've loved to have asked you, but just never get the opportunity to do so. Thank you for generously sharing your time, and I think all of us are looking forward to seeing what Amazon enables and what we all enable in this new world that you are a huge part of. So thank you for taking the time.

Ashish Thusoo: (45:48) Same here. It was a pleasure talking to you, Satyen, and really fun conversation. Really liked it.

[music]

Satyen Sangani: (45:57) After speaking with Ashish, it's so clear to me that where you start matters.

Companies like Qubole, Alation, Snowflake, and Databricks all had similar visions, but our starting points were quite distinct. We tried to please different users with different technologies and all came from a slightly different context.

In the case of Qubole, Ashish took his experience from Facebook and started a large-scale cloud-based computing capability to simplify the hardest parts of Hadoop. In the case of Alation, we took inspiration from HiPal, another capability that Ashish built, to build the data intelligence platform.

Today, Ashish is starting with a much broader canvas at Amazon and it'll be really intriguing to see where he and the entire team go in a world where LLMs are dramatically lowering the cost and friction of discovering information and driving innovation.

But to his point, none of this would've been possible if it weren't for leaders believing in data. He's chosen to define and work in organizations where data is the priority for driving decisions and strategy.

Thank you for listening to this episode and thank you Ashish for joining. I'm your host Satyen Sangani, CEO of Alation — and data radicals, stay the course, keep learning and sharing. Until next time.

[music]

Producer 2: (47:10) This podcast is brought to you by Alation. Subscribe to our Radicals Rundown newsletter. You'll get monthly updates on hot jobs worth exploring, news we're following, and books we love. Connect with past guests and the wider data radicals community. Go to alation.com/podcast and enter your email to join the list. We can't wait to connect.