Raza Habib is the co-founder and CEO of Humanloop and a Forbes 30-Under-30 honoree, dedicated to revolutionizing AI by making it as intuitive as guiding a colleague. With a PhD in Machine Learning from UCL and experience at Google and Monolith AI, Raza leads Humanloop in empowering companies to harness AI efficiently, democratizing access to cutting-edge technology for all.
As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”
0:00:03.7 Satyen Sangani: Welcome back to Data Radicals. Today we're diving into how large language models are reshaping business with Raza Habib, Co-founder and CEO of Humanloop, a company making AI smarter and more accessible for everyone. Raza shares why prompt engineering often beats fine-tuning, how teaching AI what it doesn't know improves outcomes, and why AI tools are shifting from specialist teams to every corner of an organization. This episode is packed with actionable insights and big ideas that will change how you think about AI in your business. Let's jump in.
0:00:36.6 Producer: This podcast is brought to you by Alation, a platform that delivers trusted data. AI creators know you can't have trusted AI without trusted data. Today our customers use Alation to build game-changing AI solutions that streamline productivity and improve the customer experience. Learn more about Alation at Alation.com.
0:00:58.9 Satyen Sangani: Today on Data Radicals, I'm joined by Raza Habib, Co-founder and CEO of Humanloop, an LLM evaluation platform for enterprises. Before joining and launching Humanloop, Raza was the founding engineer of Monolith AI, which applies AI to mechanical engineering, and built speech systems at Google. Raza has a PhD in machine learning from University College London. And Raza, welcome to Data Radicals.
0:01:23.5 Raza Habib: Thanks for having me, Satyen. It's a pleasure to be here.
0:01:26.1 Satyen Sangani: So you have a really impressive career journey and I guess I'd love to hear you tell the listeners about it today because I think it informs so much of how you've come to the problem of LLM evaluation and model creation and management. And will give our users a little bit of a sense for how and why you approach the problem as you do.
0:01:45.2 Raza Habib: Yeah, so I was always a pretty nerdy kid. Read a lot of physics books and maths books growing up. My mom always was remarking about the kinds of things she'd find on my bedside table. And when I went to university I ended up studying natural sciences at Cambridge, specializing mostly in physics. But it's a really interesting degree in the UK where you end up specializing in one hard science but you get exposed to a bunch of other things. And one of the things I got to learn a little bit about was a first sort of glimpse of machine learning at the time, but it was still super early.
And when I graduated I had looked into finance as an undergrad because in the UK anyone who's kind of in a top university who's doing a STEM degree gets funneled towards finance or consulting. It's a little sad. And I had very quickly become disillusioned with that. It felt zero-sum, it didn't feel like you're creating anything. And I wanted to be part of a company that made something that was useful to people. And so I went and joined a small venture backed startup.
0:02:35.6 Raza Habib: I did that for a little while and whilst I was working there, I was doing more traditional startup things. It was a very small team, so I was doing sales and business development, data science, and all this stuff. I started to miss some of the technical work I'd done as a physicist. And so I read this textbook on the tube to work by Professor David MacKay, who's a bit of a hero of mine. Information Theory, Inference, and Learning Algorithms is what the book is called. And reading that book changed the arc of my career. After I read that book I got fascinated by machine learning and I could see that it was all the same maths that I knew as a physicist, but being applied to this absolutely fascinating problem of how can you get computers to learn and do stuff with that.
0:03:18.4 Raza Habib: And so I was at this intersection of being very interested in entrepreneurship, very interested in running startups, and I was working at one and got excited about machine learning. So I had this plan that I was going to go back to school, do a master's in machine learning, get a little bit deeper in the space, and then start a company.
0:03:32.5 Raza Habib: And the plan kind of almost worked out in that I did eventually do that, but I got sidetracked for about four or five years doing a PhD. So I did a PhD at UCL in probabilistic deep learning. But I never stopped being interested in entrepreneurship and startups. And I ended up helping a close friend of mine start his first company called Monolith AI. And so during my PhD, I was working a day or two a week at Monolith, helping hire the digital team and being the founding engineer there. And that was really exciting. Our first customer was McLaren Automotive. And so a couple of days a week I was going into their offices, which is also where they sell the cars from. And so it was an incredible place to get to spend time. I was working literally touching distance from a Formula One supercar and helping them use machine learning to accelerate how quickly they can design and build these cars. And so those two experiences, both being in research and also seeing startups in industry, got me very excited. And something that I felt wasn't possible in machine learning academia was to have these very large interdisciplinary teams.
0:04:32.6 Raza Habib: So in academia you have to be the engineer, the researcher, the mathematician. Like everything all in one person typically. And actually machine learning has become something that requires a team of interdisciplinary skills to do successfully. And so that's why I wanted to go and be in industry rather than academia. And I was very excited about starting something of my own. And so when I got back from a year-long internship at Google, I started trying to find the smartest people that I knew. I got really lucky in finding Peter Hayes and Jordan. So these are my two co-founders and on paper we all look really similar. Peter was in my PhD group. Jordan was part of a startup, Bloomsbury AI, that was acquired by Meta, and he worked at Amazon Alexa. So we're all deeply technical, have backgrounds in machine learning research. But actually we ended up being surprisingly complementary and as a result, it turned out to be a great founding team. And from there we went on to found Humanloop, which I'm sure we'll get into the details of, but that's a very long version of the backstory.
0:05:24.7 Satyen Sangani: There's two bits of detail that I'd love to dig into. The first of which is you mentioned David MacKay's book. What was it about that reading? Because the average person would stop at finishing the title, let alone making their way through the first chapter. So what was it about that book that inspired you to go do a PhD? And what were the principles that seeded this interest that sort of changed the direction of your life?
0:05:50.8 Raza Habib: So I recommend for anyone who is technical to check out the book. So I'll say the name again. Information Theory, Inference and Learning Algorithms by David MacKay. It's a very unusual book in that it's not a typical dry textbook. It both teaches you about the underlying ideas and concepts, but I also think it teaches you how to think in a particular way. He has a way of explaining the ideas and talking about stuff that I found very inspiring. For me, what was inspiring about it was realizing that I could take a lot of what I already knew as a physicist and a lot of the same maths and the same machinery and apply it to these very practical problems of how do we get machines that can do perception, that can understand language, that can understand images and text and take actions and do useful things. And that wasn't something I'd seen before in my academic career. And so being able to connect these dots between the maths and the techniques that I already knew with very practical real world applications and also very interesting questions about intelligence and how do you build thinking machines. That to me was somewhat irresistible.
0:06:47.6 Satyen Sangani: Do you need to be a physicist to have read the book or is it...
0:06:51.2 Raza Habib: No, I don't think you need to be a physicist to read the book. I think it helps to have at least some background in calculus and linear algebra, like it does assume some prerequisite. But you could read the first few chapters about probability theory and about how to think about probability and I think get a lot of value anyway.
0:07:09.0 Satyen Sangani: So you realize that your background could help you both develop these models, but also usher in a new age of applications that I think were at the time probably still far less real than they are right now. Then maybe moving forward, you mentioned the idea, and I think this is a really interesting one, that an academic, a person in the academy, has to sort of have lots of skills to be able to even get to the point of focusing on their ultimate research aim. Because pulling together a PhD thesis is going to require you to be an expert in machine learning, data collection, statistics, as well as everything else that you otherwise need to be.
You mentioned that your current team was from people in the same program, but you're all complementary. Tell us a little bit about that. What are the skills that they have that you don't and that you have that they don't?
0:08:00.2 Raza Habib: Yeah. So when we started the company, the three of us naturally matured into the roles of CEO, CTO and Head of Product. So I ended up being the CEO, Peter CTO, and the reason for that is that despite the fact that we look similar on paper, it turned out that I really enjoy speaking to customers. I really love learning about their problems, asking them about how they do things, going deep on that. And that is, I think, a really important skill for both the CEO and the product leaders to have. And then Jordan has this amazing taste, which is something that I think is very hard to learn, in product design and UX, and as a result is able to build and steer the design of things that I think lead to delightful developer experiences and delightful UX. And Peter had an engineering background before doing his PhD, and so he naturally kind of gravitated towards designing the software architecture and leading the team. And he also has, I think, a surprisingly strong ability to just lead teams of people and is an excellent engineering leader as well as an engineer himself.
0:09:00.7 Raza Habib: And we couldn't have known these things ahead of time. The three of us had social relationships for a decade, we'd worked on projects, but until you're in the thick of trying to build a company from scratch, you don't really know how this will shake out. And I just feel very lucky that we ended up being a team with both complementary strengths, but also natural interests or inclinations in these different directions. That meant that it was easy for us to move forward together.
0:09:24.0 Satyen Sangani: And did you know when you were sort of all coming together exactly the problem that you were trying to solve, or did it evolve after coming together? So what was the story of the founding of the company?
0:09:34.0 Raza Habib: So we had this core thesis which was that the ability of computers to understand and process language had just gotten a lot better and was going to continue getting a lot better. And that was based on research that we were seeing as we were PhD students. So there had been the release of these models, ULMFiT and BERT and T5, which were the first large pre-trained language models. So this predates GPT-3 and some of the OpenAI work, but was of that ilk. And it was the first time where transfer learning, which is this thing where you take a machine learning model, train it on a large corpus of unlabeled data, and then can specialize it quickly to lots of different tasks, had started to work for language. A moment like that had happened a few years earlier for images and vision models, but it hadn't happened for language yet. So we'd seen that and we felt that was going to lead to a very big inflection and unlock a lot of interesting use cases for companies, because most companies don't have computer vision problems, but they have a ton of problems that are reading documents, summarizing them, extracting information.
0:10:33.3 Raza Habib: And so when we launched the company, the thesis was very much that, hey, we see this very big gap between what companies seem to be capable of, in terms of building products and what's happening in research and what we've seen and we know is possible. And so we went out and tried to speak to a lot of people to understand what is the cause for that gap. And it quickly became clear to us that there was a lot of tooling missing and a lot of infrastructure missing at the time. And when we started the company, that was actually related to being able to get enough data to train these models, whereas today that's very different. So post LLMs, post ChatGPT, a lot more companies are now able to build AI models. It used to be something that was only possible for machine learning engineers to do, you required a lot of technical expertise, but the problems have changed. So as the problems have changed, the company has evolved as well. But that core thesis of we want to enable companies to be able to build products and services with language understanding, AI and these pre-trained language models has been a thread throughout.
0:11:30.8 Satyen Sangani: So maybe to make that more tangible for our listeners. Can you give us an example of the early-day problems that people were facing and what that gap looked like at the time of founding?
0:11:40.6 Raza Habib: Yeah, so when we founded the company, this was in 2020. The biggest barriers to being able to take an off-the-shelf model and adapt it to a specific use case were that one, you needed a lot of annotated data. You needed a data set of input-output pairs where the input might be the text you wanted summarized and the output would be the summary, and you needed to collect a lot of these. And that often required expensive domain experts to do the annotation. So unlike computer vision, which was easy to outsource, this required actual domain expertise to do, so it was very expensive. On top of that, you needed machine learning expertise. So you needed someone who was able to actually take that data set and grab a model and fine-tune it. And machine learning expertise is just scarce relative to how many companies want to build AI services or could build AI services. And so the V-naught version of the product, before we pivoted, was a tool that allowed a domain expert to build this annotated data set. And as they annotated it, we would automatically train a model and use that model to make sure you only labeled the highest-value examples.
0:12:39.0 Raza Habib: And so we could get you to a trained model without a machine learning engineer and with something like 80% less annotated data than you might otherwise need.
0:12:46.9 Satyen Sangani: How did you determine at the time what the highest value data sets were?
0:12:52.4 Raza Habib: Okay, yeah, so that's a really interesting question. And it's a problem in the academic literature that's referred to as active learning. So, if you were to teach a human how to do something, you wouldn't just show them problems at random. If you're teaching someone maths or something like that, you would start off with problems that were easy for them and then move towards problems that were progressively harder, and you would focus new examples on things they don't already understand well. And it's a very similar idea when training a machine learning model. Instead of randomly labeling data and then using it to show a model, you take some estimate of what the model currently does and doesn't understand. So the simplest version of this would be looking at the entropy of the predictive distribution of the model, although you need to do slightly more sophisticated things to get it to work well. And you basically say, which are the data points that I don't have a label for yet, where the model is unsure and is unsure due to having a lack of knowledge? Those are the ones that, if I label them, will help the model learn quickly, versus if I label something the model already understands really well.
0:13:50.8 Raza Habib: If I give an example for a concept it's grokked already, there's little value in that. It's not worth it. And so that's how we would do the data selection.
0:13:58.3 Satyen Sangani: So maybe if I can say it back to you to make sure that I understand, you have a model, the model has certain set of inputs and an output and things that are furthest away or data samples that are furthest away from sort of giving you the output that the model already expects are the things that are the most likely to be useful to train the model on. Is that a fair summary?
0:14:20.9 Raza Habib: That's close. It's not that they're furthest away from what the model expects, it's that when you show them to the model and ask for the model's predictions, the model usually returns a probability distribution rather than a single prediction. And it's where that probability distribution has high uncertainty. So say it's multiple choice, and the model very confidently said, hey, the answer is A out of A, B, C, D, and it put 90% probability on A. That's a label that's probably not worth acquiring unless the model is wrong on it. The subtlety is being able to distinguish where the model has uncertainty because the question is just genuinely uncertain, it's hard to answer, versus the places where the model has uncertainty because it lacks knowledge. And so that's the subtle part that's a bit difficult to explain. But the rough principle is find places where the model is uncertain and gather more data there.
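To make the entropy-based selection Raza describes concrete, here is a minimal sketch of uncertainty sampling in Python. The `model.predict_proba` interface and the function names are illustrative assumptions, not Humanloop's actual implementation, and as Raza notes, production systems use more refined uncertainty estimates than raw entropy.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Entropy of one predictive distribution (higher means more uncertain)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(probs * np.log(probs)))

def select_for_labeling(model, unlabeled_examples, batch_size=10):
    """Rank unlabeled examples by predictive entropy and return the most
    uncertain ones for a domain expert to annotate next.

    `model.predict_proba` is a stand-in for whatever classifier is being
    actively trained; it is assumed to return one probability distribution
    per input example.
    """
    scores = [predictive_entropy(p) for p in model.predict_proba(unlabeled_examples)]
    most_uncertain_first = np.argsort(scores)[::-1]
    return [unlabeled_examples[i] for i in most_uncertain_first[:batch_size]]
```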
0:15:09.3 Satyen Sangani: Makes sense, I think. So then you build this V1 of the product. This is where I presume the Humanloop name comes from.
0:15:18.0 Raza Habib: Absolutely.
0:15:19.0 Satyen Sangani: And you then mentioned that now the world changes. And I presume the world changing is really the advent of a lot of these models, the more recent GPT models. Is that a fair statement or was there another change that...
0:15:32.0 Raza Habib: No, that's roughly right. I mean, we were lucky because we were very early to this because we'd already been working on language models and trying to help people customize them. We were keeping a very close eye on what was happening with the large language model providers. So post GPT-3, that became something that we were monitoring closely. And when GPT-3 came out, we felt it wasn't quite ready yet.
Because with a modern large language model, there's three stages to training. There's pre-training, there's fine-tuning, and then there's reinforcement learning from human feedback. And those second two stages are really vital to making the model actually useful. And they hadn't been done yet. And so the models at the time had a lot of this raw potential, but they weren't yet that easy to use or customize to specific applications. And when the first instruction tuning paper came out from OpenAI, where they demonstrated that you could actually get these models to follow human instructions, that for us was a bit of a light bulb moment where we said, hey, if the model improvement continues at this rate, then actually there's not going to be a need to annotate data anymore.
0:16:30.0 Raza Habib: But there are going to be this whole host of new problems. And we were sort of contemplating the pivot and we decided to go and just speak to some of the earliest builders on the API and run a sales experiment. And so the challenge we set ourselves was, hey, if we can get 10 people to be significant paying customers in two weeks for our new product, then we'll shift focus to that. And we ran the experiment and we actually got 10 paying customers in two days. So it was much faster than anything we'd done before. And that was a really strong signal that there was a real unmet need for people who were building with large language models. And so we sunsetted the original product and we shifted. And about four months later, ChatGPT came out. And so we turned out to be just very fortunately placed in terms of timing.
0:17:10.1 Satyen Sangani: Got it. So you moved from this pre-training annotation world into the later stages of fine-tuning and prompt management and context management. Is that broadly the…
0:17:20.8 Raza Habib: That's right, yeah. So today, we describe Humanloop as an LLM evaluation platform. And the way that we think about it is that we think that when ChatGPT came out and most companies started building with LLMs, that LLMs break a lot of traditional software processes in three key ways. And we try to give people the tools to solve them.
And so the problems that come up is one, evaluation is a lot harder. So for a lot of software engineers, LLMs are the first time that they're dealing with stochastic software where, when they run it, they get different answers every time. And as a result, the traditional unit testing and integration testing frameworks that they're used to don't work anymore. And even if they come from a machine learning background, a lot of the metrics that people are used to calculating in machine learning, precision, accuracy, F1, recall, those kinds of things don't work when the output is so subjective. So if I'm summarizing a meeting, there isn't a ground-truth summary anymore. It depends on who the person is. And there's a degree of subjectivity involved in evaluation of LLM outputs that didn't exist before. So that makes it hard.
0:18:19.7 Raza Habib: The second big change is that before, as I mentioned, people used to customize these models by labeling annotated data, and now they write prompts. Prompt engineering has become a big part of AI development, and prompts blur this line between what is data and what is code. And if you try to iterate on a prompt in your IDE, then you have this problem that if you make a change to the prompt, you don't really know what impact it's having on your overall application. You have to go into an interactive environment where you can actually get some feedback: as I change the prompt, I can run it over things and see what the impact of my changes is. And then the third really big difference, and I think this is the most important and most overlooked, is the role of subject matter experts. So for prompt engineering and evaluation for LLMs, the companies that we've seen get the best results with AI products.
0:19:06.0 Raza Habib: Deeply involve the subject matter experts and they look at the data a lot. And so a concrete example would be, one of our customers is a company called Filevine, and they're a contract lifecycle management thing for lawyers, and they have actual lawyers who are involved in writing the prompts, in looking at the outputs, in helping investigate that.
0:19:23.3 Raza Habib: And traditional software workflows don't have this very high degree of collaboration with a product manager or a domain expert during the development phase. They're there when they're speccing the project, they're defining the criteria, but they're not implementing it. They're not writing the code themselves. But with LLMs, they're actually involved often in prompt engineering, actually involved in creating the evaluations. And so that is something that's very hard to do with a traditional software workflow where you're writing your code in an IDE and you're versioning things in git and you're doing evaluation via unit tests and integration tests.
And so what Humanloop is, is an evaluation platform. It's one place where we allow you to score the models themselves, but also a place where the non-technical people can go in and tweak prompts, run evaluation reports, see the outputs, and collaborate with the engineers to improve the system. So it's a combination of the tools needed for evaluation and the tools needed to iterate on the prompts and other parts of an LLM application.
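One common pattern behind the evaluation workflow Raza describes, scoring subjective outputs with another model rather than with exact-match metrics, can be sketched roughly as follows. This is a minimal illustration using the OpenAI Python client; the model name, rubric, and function are assumptions for the example, not Humanloop's API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a meeting summary against its transcript. "
    "Score it from 1 to 5 for faithfulness and usefulness. "
    "Reply with a single integer only."
)

def judge_summary(transcript: str, summary: str, model: str = "gpt-4o-mini") -> int:
    """Score one model's subjective output with another model (LLM-as-judge)."""
    response = client.chat.completions.create(
        model=model,  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Transcript:\n{transcript}\n\nSummary:\n{summary}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# A prompt change "passes" if its average judge score over a fixed evaluation
# set does not regress relative to the previous prompt version.
```

In practice the rubric itself is something the subject matter experts iterate on, which is part of why a shared environment for this work, rather than an IDE, matters.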
0:20:17.1 Satyen Sangani: One thing that you didn't mention in the context of all of this is the idea of fine-tuning. Can you speak about that and whether or not that factors into the platform and whether or not that's something you are helping companies do?
0:20:29.1 Raza Habib: With large language models, there's essentially two axes or two vectors by which people try to customize them for specific use cases. And prompt engineering is one where you write a natural language instruction to the model. But the other way is what's called fine-tuning, which is what we used to do before LLMs got good, which is that you gather a labeled data set of input-output pairs and you actually change the weights of the model.
So prompt engineering is only changing the input to the model, whereas fine-tuning is changing the weights themselves to make the model better at those specific examples that you've given it. And in our experience, fine-tuning is very useful as an optimization step, but it's not where we recommend people to start. So when people are trying to customize these models, we encourage them as much as possible to push the limits of prompt engineering with the most powerful model they can, before they consider fine-tuning. And the reason that we suggest that is that it's much faster to change a prompt and see what the impact is, and it's often sufficient to customize the models and it's less destructive.
0:21:28.8 Raza Habib: So if you fine-tune a model and you want to update it later, you kind of have to start from scratch. You have to go back to the base model with your labeled dataset and re-fine-tune from the beginning. Whereas if you're customizing the model via prompts and you want to make a change, you just go change the text and you can see the difference. So there's a much faster iteration cycle and you can get most of the benefit.
There are some places where fine-tuning gives you an advantage. So one is if you're trying to get the model to do something that is hard to describe. If it's a show, don't tell type of problem, you want the model to mimic a particular tone of voice, then fine-tuning can be better because it's hard to actually provide that as an instruction. The other place is if you're trying to take a big model and reduce its cost or latency. So maybe you have a very large language model that is working well, but doesn't have the latency that you need, you can often take the outputs of that model and use it as a data set to fine-tune a smaller model that is faster and lower latency.
0:22:23.1 Raza Habib: And that's something that I think OpenAI actually recently shipped as a part of their platform that people have been doing for a long time organically themselves. Insofar as Humanloop helps with fine-tuning, it's because we're helping you with evaluation. So the hardest part of fine-tuning really is actually figuring out what data should I fine-tune on, which are the examples that encapsulate the behavior that I want? And we help you curate that data and to work out which data points are the right examples. So our tooling helps with that, but we don't help with the literal fine-tuning itself. We integrate with other companies like OpenAI or Fireworks or others to do the fine-tuning process. Or people fine-tune models themselves.
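The distillation pattern Raza mentions, using a large model's outputs as training data for a smaller one, looks roughly like this in practice. This is a hedged sketch: the prompt, model names, and file are placeholders, and the upload and job-creation calls follow the OpenAI fine-tuning API, which is just one of several services that can run this step.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "Summarize the contract clause in plain English."  # illustrative

def build_distillation_file(inputs: list[str], path: str = "distill.jsonl",
                            teacher: str = "gpt-4o") -> str:
    """Collect outputs from a large 'teacher' model and write them in the
    chat fine-tuning JSONL format expected by the OpenAI fine-tuning API."""
    with open(path, "w") as f:
        for text in inputs:
            completion = client.chat.completions.create(
                model=teacher,
                messages=[{"role": "system", "content": SYSTEM_PROMPT},
                          {"role": "user", "content": text}],
            )
            answer = completion.choices[0].message.content
            f.write(json.dumps({"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
    return path

def fine_tune_student(path: str, student: str = "gpt-4o-mini"):
    """Upload the distilled data set and kick off a fine-tuning job on a
    smaller, cheaper, lower-latency model."""
    training_file = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    return client.fine_tuning.jobs.create(training_file=training_file.id, model=student)
```

As Raza says, the hard part is not these calls but curating which examples go into the file in the first place.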
0:23:00.2 Satyen Sangani: Yeah, and unlike the prior two steps, what you really need to understand in order to be able to manage appropriate prompts and understand and evaluate a model is just the observation of the inputs and observation of the outputs. And so you don't have to necessarily be a crackerjack ML engineer or an expert in AI. On the other hand, in the fine-tuning case, you really do have to understand how the model fundamentally operates 'cause you are in some sense playing with fire and if you feed it the wrong data inadvertently, you might have some fairly significant impacts.
0:23:29.9 Raza Habib: I think there's some truth to that. I think where I would agree with you is that certainly to be building an application with an LLM, especially if you're going prompt engineering first, you don't need to be an expert machine learning person. In fact, it helps to be an expert product person and a domain expert more than it helps, I think, to be a machine learning person. For fine-tuning there is a little bit more knowledge needed, but there are a lot of services now that provide good APIs for fine-tuning. So even to fine-tune a large language model, you don't necessarily need to be a machine learning expert.
I think where machine learning expertise starts to be more useful is if you are training your own models or you're fine-tuning locally or something like that, then it helps somewhat. But the broader concepts I think are very useful. So for people from a machine learning background, I don't think it's helpful to know the details of the transformer architecture, but I do think it's helpful to know about the workflow that machine learning engineers have typically gone about for years, which is having a held-out data set, a validation data set and a test set, and to be creating evaluation suites and running evaluations when you make changes and using that to drive improvement.
0:24:31.2 Raza Habib: That's a new workflow for most software engineers because they're not used to dealing with stochastic software. So those lessons, and what kind of metrics to have, and how to avoid bias, and how do I avoid data leakage, how do I avoid leakage from my test set or overfitting – those kinds of concepts are all still relevant, but the literal details of the machine learning architectures or things like that I think are less relevant than they used to be.
0:24:53.0 Satyen Sangani: You mentioned the term data leakage, just describe what that is.
0:24:56.2 Raza Habib: Oh, so leakage would just be where there are examples from the training set that kind of also show up in the test set in some way. The correct way to do this is you have some kind of training set which is used to train the model. In the case of LLMs, it's already been pre-trained for you, so you don't need that bit. You then have a validation set that's used to tune the hyperparameters. So in the LLM case, when you're choosing your prompts, this is the set that you're iterating against. Every time you make a change to the system, you're looking at the scores on this set.
0:25:23.9 Raza Habib: But you also want a held-out test set that isn't used to tweak the parameters that you can then run on and get a final prediction of how is this going to work when I release it into the wild? And you want that test set not to be looked at during the development phase because if it's informed you about how to make changes, then you're going to get a misleading confidence in your accuracy. And if any examples from the validation or the training set show up in the test set, you'll have an overestimate of how good the model is.
0:25:49.9 Raza Habib: And that's called data leakage. And so a standard example of how this might happen is if someone has time series data and they segment it at random. So there are some things from the past in both the training set and the test set, whereas you would want to split it, say, so that you only have the past in the training set and the future in the test set. The random split would be an example of data leakage.
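A minimal sketch of the time-series case Raza uses as an example: split on time rather than at random, so nothing from the "future" leaks into training. The column names, cutoff date, and input file are placeholders.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, timestamp_col: str, cutoff: str):
    """Keep only the past in the training set and only the future in the test
    set, avoiding the random-split leakage described above."""
    cutoff_ts = pd.Timestamp(cutoff)
    df = df.sort_values(timestamp_col)
    train = df[df[timestamp_col] < cutoff_ts]
    test = df[df[timestamp_col] >= cutoff_ts]
    return train, test

# Leaky alternative to avoid: df.sample(frac=0.8) mixes past and future rows
# into both sets, so the evaluation overestimates how well the model will
# generalize to genuinely new data.
events = pd.read_csv("events.csv", parse_dates=["ts"])  # hypothetical file
train_df, test_df = temporal_split(events, "ts", "2024-01-01")
```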
0:26:08.6 Satyen Sangani: Excellent. So one thing you haven't mentioned, and I think just to round out sort of all of the things that people do or are doing, you haven't mentioned when it's appropriate. And I think Humanloop starts from the point of somebody having selected a model and taken a use case, but of course there's always these questions of, well, I hear about people developing their own language models and their small language models. When do people make that decision? And tell us a little bit about how that is working and how it will change, because the larger models themselves are, of course, getting quite sophisticated and certainly cheaper by the day.
0:26:45.2 Raza Habib: So the companies that we see get the most success typically start with the biggest, baddest, most powerful model. They can validate the use case, make sure that it's possible, and then treat going to smaller models or private models as an optimization step, either because they have privacy concerns and they need to have local data, or they have price or latency concerns.
But the reason why you want to do it in that order, starting with a more powerful model and then moving towards something smaller, is that you don't want to prematurely conclude that the models aren't capable of doing the task that you want, and you want to sort of get a ceiling on what the performance can be. So the larger models allow you to very quickly establish a baseline of, kind of, how good can we get? And then you might say, okay, this is the capability threshold that we reached, but we want to try and reduce the cost or the latency. Let's see if we can go to a smaller model. And so people do end up training their own models or fine-tuning models, but typically not as the starting point.
0:27:43.1 Raza Habib: And one of the things that we're helping companies do as well is actually just choose between models. So most of our customers are not using a single model for everything. They're using a fine-tuned custom small model here. They're using OpenAI for some applications, using Anthropic for others. The breakdown is actually largely, I would say the majority are using OpenAI models. A smaller but large minority are using Anthropic, and then a small minority are doing custom fine-tuning.
0:28:08.3 Satyen Sangani: And then there's of course the last question of, okay, there are multi-model architectures where the models themselves are highly specialized and you start getting worker models and sort of master models. And that itself is also a design decision that people have to make. Is that something that you are sort of helping orchestrate, or helping people determine when that problem exists? Or is there an implicit assumption that people are going to know how and when to factor out their models?
0:28:33.2 Raza Habib: Yeah, so it's still a design choice for our customers as to whether they're using an agent or building a RAG system or doing something simpler. But we make it easy for them to be able to quantitatively understand which of these are actually performing better. And the advice that I would give to anyone is in general, start with the simplest system you can and add complexity only as you need it. And so usually we get people to set up a simple baseline, measure that, and then add in more complexity if the simple baseline isn't performing well. So usually you don't want to start with a complicated agent.
0:29:09.0 Satyen Sangani: So tell us about the actual practice now. So you have customers, can you share a story with us of how a customer has started with you, grown with you, evolved with you and sort of the problems that you're helping them solve?
0:29:20.1 Raza Habib: Yeah, absolutely. So our customers tend to be sort of mid-market enterprise technology companies who are already building things with AI. So examples would be companies like Gusto, Duolingo, Vanta.
Maybe we can talk about Gusto as an example. So they started off very early, just after ChatGPT. Their CTO was very keen to take advantage of the possibilities of GenAI and they've been on a bit of a journey since then that I think is interesting and others can learn from as well.
So early on they got Humanloop in to help them with building and customizing prompts. And their initial strategy was to democratize access to this. So their belief, and their belief even now, is that eventually every engineer is going to be building AI features and products and that's a skill set that should be distributed through a company. But early on they found that that resulted in good new features being shipped, but incremental changes; it wasn't realizing the full potential of AI that it could be. And so they then regrouped and centralized around a central AI team that could identify the places, both in internal operations and also in customer-facing products,
0:30:24.1 Raza Habib: Where, if we make a change using GenAI, it'll either save our customers a huge amount of time or it'll improve operational efficiency. And that centralization was really important to them. So that's an interesting lesson that we saw alongside them.
But the way they use Humanloop specifically: so they have a couple of products. One of them is an internal customer support tool. So they have a very large suite of customer support people, and they want to augment those people with assistance and help them answer questions more quickly. That's kind of one internal product, and the other product that they have, amongst many but one of the most important, is a report builder. So Gusto is a payroll company; accountants will go into their software and want to be able to build reports. It used to be something that took someone maybe an hour or two to build a complicated report based on the data they have. And they've built an AI copilot that can guide people through doing that process. And in building those things, they had to do a lot of prompt engineering. They had to get feedback and review from the customer support people to understand how well things are working.
0:31:20.9 Raza Habib: And they needed to be able to iterate on all of these pieces. And so with that, what they do is they wire up the application end to end, they add Humanloop in as an observability step so we can trace the data through, and they get the prompts into Humanloop and then they can start making changes and getting feedback with quantitative evaluation. They create different evaluator scores within the platform so they can define different ways of measuring the performance, either with other models scoring the outputs, with human feedback coming from their internal CX team, or with code-based feedback. And then they use that score that they get back from the system to decide whether or not the changes they're making actually improve things. And so they've been able to systematically improve the system over time.
Another concrete example is we have quite a few customers in the legal tech space 'cause I think that's a place where AI has had a huge impact and is continuing to generate significant revenues. And so I think I mentioned Filevine as a customer earlier. I think they're a particularly interesting one because for them to build their product, they have to inject a lot of specialist domain expertise.
0:32:22.7 Raza Habib: So they had these very long complicated prompts where they're explaining to the model exactly what they want it to do. And the people writing those prompts are actually the lawyers or they're kind of legal professionals collaborating with the engineers. And so they go into Humanloop, they open up inside the editor environment that we have that is an interactive place to actually edit prompts. And they're spending hours in there tweaking these prompts, running them over data sets, seeing the impact and iteratively improving the performance of those systems. And then the engineers are able to call those prompts and have them in production. And so it's enabling for them a collaboration between these non-technical domain experts, the lawyers, and the engineers to improve the system.
0:32:58.7 Satyen Sangani: Yeah, and you can imagine that use case being pretty interesting because you could have a lawyer feed a term sheet, or effectively a set of terms, to an LLM and automatically produce a contract that is 90% of the way there, obviating the need to write this hundred-page document that otherwise would have been written.
One of the things that you mentioned that I found pretty interesting as a parallel to the world of data and analytics, where Alation comes from, is this idea of centralized teams versus distributed teams. And there's this sort of age-old debate about, well, do we have a centralized data team? What's the advantage of having a centralized data team? And then you have the distributed teams, but then the distributed teams don't see across functions for use cases, nor do they necessarily know the latest and greatest tools. And so there's one problem which centralization solves, which is sort of aperture; there's another, which is sort of centralized resources and understanding. But on the flip side, you lose the agility. How do you see that trade evolving over time in the world of AI, and what factors are at play that might make it different from the data circumstance that I just described?
0:34:03.3 Raza Habib: So I think it's a staging thing. So I think that in the fullness of time, in the not-too-distant future, I think it will be decentralized. I don't think that AI engineering as a skill set will be siloed to a small group of people with specialist expertise. I think that it'll be distributed around almost every product team will have some knowledge and ability to do these things.
But at the present moment, we're still very early and there's an education piece that needs to happen. I think software engineers who haven't come from a machine learning background are slowly coming to terms with the different workflows and techniques needed to be building with stochastic software and a more iterative software development process. And there's also what I said before about incremental versus high-value changes. So individual product teams maybe don't quite have the remit or the authority to be able to go and make the size of change that is sometimes needed to get the full benefits of GenAI. And so at least early on I think that it makes sense for there to be centralized teams that maybe are actually staffed or led by someone very senior within the company who has the authority to make changes and to push things through processes that might otherwise get stuck.
0:35:11.0 Raza Habib: But over time, I think that it'll become the case that this just becomes a normal part of software engineering, that most teams will have this skill set and that every team will be building AI features and products.
We're just not quite there yet. And I think if you do it too early, if you distribute access to these tools too early, it's not that it doesn't work, it's just that I think you get more incremental improvements rather than the size of change that is possible. Klarna is a good example of a company that's talked about this publicly, where they've been able to achieve, I think, something like a 70% reduction in how long it takes them to process customer support queries, or in the staff number required to process them. Those are not small changes to the bottom lines of a business. They're very, very large. But they have to come from a centralized team. It's not something that's within any individual PM's remit.
0:35:56.8 Satyen Sangani: You mentioned this idea of sort of a software engineer having to understand the difference between sort of deterministic applications where one has test inputs and test outputs. And this answer is the same every single time. And this world of stochastic applications. If I were a software engineer and I said, hey Raza, give me some advice, what do I need to do, what do I need to learn in order to be able to move from this old model that I'm familiar with to this new model and when it's appropriate, what would you tell me to read and what would you tell me to do?
0:36:26.4 Raza Habib: So I think the good news is that I don't think it's rocket science or that complicated and I think people will be able to pick up the skills relatively quickly. The bad news is that I don't think that the playbook has been written fully yet or kind of the knowledge is all out there. I think it's early enough that a lot of companies and teams are still learning as they're going and figuring things out and the workflow is to a certain extent still being invented. So I would say that the way to learn it right now is, I mean, one, get your hands dirty. So it's not hard to go and grab some models and build prototypes and build things yourselves. There's very active Discord forums. There's a lot of people in online open-source communities who are building things and sharing knowledge. There's podcasts like this one. I run a podcast as well called High Agency that has exactly that purpose. Like the reason it exists is to try and help people learn these skills. But I think the reality is that right now we're still at the stage where you learn by doing.
0:37:21.4 Raza Habib: But the good news is that I think if you get stuck in now, you will kind of be learning as the practices get established and you will be developing your skill set alongside the industry and there isn't years and years and years of established practice that you need to go and learn and lots of established tools and frameworks. It's actually pretty early.
0:37:41.0 Satyen Sangani: So just go do it?
0:37:43.2 Raza Habib: Go do it, get involved in online communities, go to meetups, build things, listen to the content that's being created by others who are doing it. But yeah, I don't think there's a textbook you can go buy or there's a couple of courses. Like I think people are starting to produce stuff, but it's very early.
0:37:58.0 Satyen Sangani: So people talk a lot and we talk a lot about this idea of AI governance and there it's a cousin to this thing called data governance. And you even mentioned that sort of prompts kind of live in this weird world of being simultaneously code and simultaneously data. How do you think about this world of AI governance? What does it mean to you? I mean a lot of what you're doing could be broadly described as AI governance. Where is this field and where is it going?
0:38:21.0 Raza Habib: Yeah, so I think it's a great question. I think about this a lot as well because of the upcoming EU AI Act. So the EU AI Act is actually going to force people, especially those who are working on high-risk applications, to be able to show that they used data sets that had been checked for quality and bias, that they had good record keeping of their decisions, that they have a risk management system in place. And so it's something that's becoming non-optional for a lot of companies soon as well. But for me, it's that we want to be able to make it easy for teams to track the history of what they did, the decisions they made, the data they used, to make everything repeatable. So if you are trying to then go back and audit a system, it's easy to understand why did we change that prompt? What was the evaluation that was running? Who did it? What data did we actually train it on? I think it's not just about safety and bias and fairness and the things that the compliance people are forcing onto people, where sometimes people feel like, oh, I have to do these things.
0:39:15.5 Raza Habib: I also think they're best practices that actually just help you build better products. So if you have a repeatable pipeline for evaluation, then you can answer the question of, compared to three months ago, did we actually make the system better? And so to me, having good governance around LLM Ops is being able to track the data sets you use, being able to version the prompts and track the changes to them, tracking the history of evaluation, making the system repeatable, having good observability and auditing of your system in production. So if something goes wrong, you can go in and understand why it went wrong. And I think that those are things that you're going to have to do for regulatory reasons and compliance reasons. But I think a lot of them are also just good best practices if you want to build great features...
0:39:54.6 Satyen Sangani: Yeah, it's very different though from the world of data governance. In particular, there's this idea within data governance of lineage, and people want to be able to track the steps of a program, and so they will literally take the instructions that are in the program and go from input to output to input to output, and they'll sort of really understand the flow of the actual data. In the case of AI, because the base model exists and is in some sense the starting point, and it is both real and inscrutable, you're really starting from the base model as your base premise and you're saying here are all the ways in which I modified or instructed the model to do different things. And it's that process of kind of training that is what you're dealing with.
0:40:37.2 Raza Habib: I think that's right. The only thing I'd sort of "yes, and" with that is that most of the applications people are building today are no longer just a simple prompt on the base model. They're more compound systems, they're retrieval-augmented generation systems or more complicated kinds of action-taking agents, where the LLM and the templates around it are just one piece of a wider system. So it's also important to be able to version and track that whole thing, not just the model part.
0:41:04.0 Satyen Sangani: The whole system. Yeah. So with that in mind, it actually bridges to sort of the next question, which is: where is this all going? I mean, certainly OpenAI and the other model providers are deeply interested in having people use their models and build trust in their models. OpenAI is building their own GPTs that you can sort of launch and run. Do you see a lot of this observability being replicated by these folks? Where do you see the space that you're participating in going, and how do you see it evolving?
0:41:33.6 Raza Habib: Yeah, I think that whilst there is some incentive for the model providers themselves to provide some of this tooling, and you see some of that from OpenAI, they've added support for fine-tuning and some support for doing basic evaluation, the majority of customers want to be able to use multiple models. So most of the customers we speak to don't want to just use OpenAI, they want to be using open-source models and OpenAI and Anthropic and others. And as a result they're reluctant to commit themselves to a platform that ties them to one model provider, which I think makes it hard for the foundation model providers themselves to really build good tooling here, because people want to use model-agnostic tooling. And so whilst I think there will be some support built by the foundation model providers themselves, I think that they're also encouraging and supporting an ecosystem of tools around their models. I think it benefits them for people like Humanloop and others, the vector database companies, et cetera, to exist and make it easier for companies to adopt their technology. And so I think it's more complementary than it is competitive. And I think we'll see a foundation model layer with companies like OpenAI and Anthropic and Cohere and others providing the base models.
0:42:40.9 Raza Habib: There's obviously the application layer and then there's various infrastructure that lives in between. And I think that there will be quite a rich family of tools that are available to people to support building with AI. And I think something like Humanloop, having an LLM evals platform, is essentially now a non-optional part of the stack. What we see with people who are building is either they build something like this themselves or they acquire a tool. But it's very hard to build a reliable AI product without that piece in the stack.
0:43:07.1 Satyen Sangani: Take us a little bit into the future. The models are themselves evolving quite fast. There's the most recent Strawberry model; it's got this thing called deep reinforcement learning within it, which is a new sort of thing that people don't quite understand yet. Where are these models going? How quickly do you see them evolving? What are you excited about and what do you think people should be aware of? Because I think people are now only getting an understanding, or at least starting to get an understanding, of how the existing models work. I mean, obviously there's a range, but certainly the average person is starting to understand how the models work, and the models themselves are changing shape over time. What should people know?
0:43:40.3 Raza Habib: So I think there are some easy predictions to make that I can say with high confidence and feel like I'm not going to end up with egg on my face. So some things that I think are almost certain, it's just a question of when: the models will become increasingly multimodal. So they used to be just text in, text out. Now we have models that can do images in and models that sort of do images out, but they're not necessarily all in the same model. Increasingly you'll find the same model that works across all modalities. So audio, video, text, and we see some of this already. Gemini can do some of these things, and OpenAI, some of their models do images in, but they don't do images out, et cetera. So multimodality just becoming a part of the models, I think, is somewhat inevitable. OpenAI and others are investing very heavily in reasoning. And that's, I think, what you mentioned with O1, right. Increased ability to solve complicated problems and math problems and computing problems, the ability to start taking actions in the world. So computer use is something that has been talked about for a while, but is starting to be released from the providers themselves.
0:44:38.5 Raza Habib: So the models are going from things that... they're still being called language models, but realistically they're a lot more than language models now. They're multimodal models that can take actions in the world and do things, and their capabilities are going to keep improving. I think that the rate of improvement of the models, I would personally assume that it stays very high.
0:44:57.7 Raza Habib: So this is where I start to move out of the territory of things I can say with absolute confidence and start to move into territory where you have to actually make some guesses and predictions. The reason why the models have been able to improve so much over the last few years, and the explanation for the success of GPT-3 and GPT-4, has mostly been scale. So the story so far has been: by taking things that already worked with language models and transformers and making them a lot bigger, we've been able to get reliable and consistent and predictable improvements. And then there's a question as to how long scale alone can continue giving us those improvements. And it's a little subtle, because the loss of the model, the score that the model is being trained to optimize, has very diminishing returns with scale.
0:45:39.5 Raza Habib: So you have to put in more inputs to get a similar bang for your buck on the perplexity or the log likelihood of the model. But smaller changes in loss later on correspond to potentially bigger improvements in capability. So predicting model capability as a function of scale is non-trivial. And also we've now opened up other vectors of improvement. So it used to be that scale was driving all the improvement. The Strawberry models that you mentioned are using test-time inference as a new vector for trying to make the models better. So there are more ways in which people can keep improving the models. And so my default prediction would be that we should expect a high gradient of improvement to continue for quite a long time. There's increased investment in scaling things. There's also increased investment in actual research to explore other avenues for improvement. So if I was building an application today, I would build it under the assumption that the models will get a lot smarter, that they'll get multimodal, that they'll get better at being agentic, they'll be able to do computer use and take more actions in the world.
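For readers who want the rough shape of that "diminishing returns in loss" claim, the scaling-law literature typically reports that pre-training loss falls as a power law in compute, something like

L(C) \approx L_{\infty} + a \, C^{-\alpha}

where C is training compute and a, \alpha, and the irreducible term L_{\infty} are empirically fitted constants. Each doubling of compute buys a smaller absolute drop in loss, even though, as Raza notes, small late-stage drops in loss can still correspond to large jumps in downstream capability. This is an illustrative functional form from that literature, not a figure from the conversation.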
0:46:40.5 Satyen Sangani: Brilliant. Well, I think we'll stop there. I mean, I think, there's an incredible amount of sort of wisdom and information that you've given and given people context for how to think about these models. Where to start, where to end. Check out Humanloop. We'll post the reference readings in the show notes. And Raza, thank you for the time. This is a fun conversation.
0:47:00.6 Raza Habib: Yeah, thanks so much for having me.
0:47:04.9 Satyen Sangani: What an inspiring conversation with Raza. If there's one big takeaway, it's the future of AI isn't just for specialists, it's for everyone. Raza and Humanloop are showing how businesses can make AI tools easier to use, faster to adapt, and more impactful across teams. He reminded us that collaboration, experimentation, and continuous learning are the keys to unlocking AI's full potential. This is your call to action. Embrace the tools, ask the hard questions, and start building smarter systems today. I'm Satyen Sangani, CEO of Alation. Thanks for tuning in to Data Radicals. Keep learning and keep sharing, and I'll see you next time.
0:47:46.8 Producer: This podcast is brought to you by Alation. Your boss may be AI ready, but is your data? Learn how to prepare your data for a range of AI use cases. This white paper will show you how to build an AI success strategy and avoid common pitfalls. Visit alation.com/ai-ready. That's alation.com/ai-ready.