Vector Databases 101

with Edo Liberty, CEO and founder of Pinecone


Edo Liberty is the Co-Founder and CEO of Pinecone, the managed vector database helping build knowledgeable AI systems. Edo is the former Head of Amazon AI Labs and former Research Director at Yahoo! He created Amazon SageMaker and has contributed over 100 academic papers, propelling AI research forward. His company, Pinecone, founded in 2019, is now valued at $750M.

Satyen Sangani, Co-founder & CEO of Alation


As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”

Producer 1: (00:01)
Hello and welcome to Data Radicals. In this episode, Satyen sits down with Edo Liberty, CEO of Pinecone, the managed database for large-scale vector search. Previously, Edo led Amazon AI Labs and created platforms like SageMaker and Rekognition. During his time at Yahoo, he built machine learning platforms to improve applications for search, security, and media recommendation. Edo and Satyen give a crash course on vector databases: what they are, who needs them, how they will evolve, and what role AI plays.

Producer 2: (00:36)
This podcast is brought to you by Alation. We bring relief to a world of garbage in, garbage out with enterprise data solutions that deliver intelligence in, intelligence out. Learn how we fuel success in self-service analytics, data governance, and cloud data migration at alation.com. That's A-L-A-T-I-O-N.com.

Satyen Sangani: (01:00)
Today on Data Radicals we're joined by Edo Liberty, founder and CEO of Pinecone. Prior to Pinecone, Edo was a director of research at AWS and head of Amazon AI Labs, where he built groundbreaking ML algorithms, systems, and services. He also served as Yahoo's Senior Research Director, leading the research lab building horizontal ML platforms and improving applications. Edo received his BSc in physics and computer science from Tel Aviv University and his PhD in computer science from Yale. After that, he was a postdoctoral fellow at Yale in the Program in Applied Mathematics. He's the author of more than 75 academic papers and patents about machine learning, systems, and optimization. That's all super impressive, and thank you for joining Data Radicals. Welcome.

Edo Liberty: (01:43)
Thank you so much, man.


What is a vector database?

Satyen Sangani: (01:45)
So I'd like to just start with the very basics. Pinecone describes itself as delivering a fully managed vector database. Can we start like super elementary? What is a vector?

Edo Liberty: (01:54)
A great question. So a vector is an array of numbers, right? If you've studied physics or math, you think about a vector as something with a direction and a magnitude. The easiest way to represent that in, like, 2D is just two numbers, right? It's just the tip of the arrow. If you've done geometry, this is like an x-y axis. A vector can be represented by two numbers in two dimensions. In a thousand dimensions, a vector is a thousand numbers, right? But the numbers have an order: x, y, z, and so on. Now, that sounds very abstract: why would I care about vectors in a thousand dimensions? It so happens that large language models and all foundational models represent basically all data in this numerical form. It's an array of numbers that somehow captures the essence and the meaning of objects, whether they're text or images or audio. And that representation seems to be a lot more actionable for those foundational models to represent data, retrieve data, and work with it. And of course, with the rise of foundational models, this format and this data type became extremely common.
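
To make that concrete, here's a minimal sketch in Python using NumPy. The values are made up; in practice, an embedding model produces them:

```python
import numpy as np

# A 2-D vector: two ordered numbers (x, y) encoding a direction and a magnitude.
v2 = np.array([3.0, 4.0])
print(np.linalg.norm(v2))  # magnitude of the arrow: 5.0

# An embedding is the same object in many more dimensions: a thousand ordered
# numbers. Here they're random; a real model chooses them to capture meaning.
v1000 = np.random.rand(1000)
print(v1000.shape)  # (1000,)
```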

How vector databases revolutionize AI

Satyen Sangani: (03:13)
Is there a reason why that's the case? Why are vectors more naturally suited to representing language in a format that these models can understand?

Edo Liberty: (03:21)
Yeah. So we humans, and AI in general, don't really work with exact matches, filters, SQL queries, and other modes of accessing data, because those don't apply to soft, natural data. When you see somebody you know, you've never seen them exactly like that before. You've never seen them in that lighting. They might be wearing a new hat. You might see them against a different backdrop. The image itself is very different, but somehow your brain is able to go and fetch that person's name, right? You've searched your database of faces to find your friend's face, right? And so it's not an exact match, it's a similarity search: what is like this? The same thing happens in language. When I ask you a question about how a catalytic converter works, right? You know it's about cars.

Edo Liberty: (04:19)
You start to search, like, whatever, I think it's about cars. [laughter] I don't know anything about catalytic converters, by the way, but you might know that you don't know anything about it, because you know the topic and you know how to search in the vicinity of knowledge of that type, right? And so this kind of search and access by analogy, by similarity, by alignment, by connotation, is what vector databases are really, really good at, and what foundational models need to be able to operate well. Rather than exact matches on filters or SQL queries or fetches of specific records, which is how pretty much every other database works.

Satyen Sangani: (04:58)
And so then would it be fair to presume, or extend, that a vector database is a database that's optimized for the storage and retrieval of vectors?

Edo Liberty: (05:08)
100%. And the question is not only storage and retrieval; the question is retrieval by what? It's not just give me vector number 726. It's really give me all the vectors or all the data points that are aligned with this data point, that are close to it, that resemble it, that are similar to it, right? And that kind of query is extremely highly optimized in vector databases. In fact, the whole database is designed to be able to do that efficiently, so that in real time you can, say, do question answering by reading a question and saying, "Hey, wait a second. First, let's fetch all the documents that contain information that is relevant to the answer," and then try to answer the question, instead of trying to answer immediately.
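
That give-me-the-nearest-vectors query is the core operation. Here's a toy in-memory version of the idea; a real vector database answers the same query over billions of vectors with approximate nearest-neighbor indexes rather than this brute-force scan:

```python
import numpy as np

def top_k_similar(index: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the row indices of the k vectors most similar to the query,
    ranked by cosine similarity (brute force, for illustration only)."""
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

# A toy "database" of five 4-dimensional vectors with made-up values.
db = np.random.rand(5, 4)
q = np.random.rand(4)
print(top_k_similar(db, q, k=2))  # e.g. [3 0]
```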


Building the bridge: from raw data to vector databases

Satyen Sangani: (05:56)
How does a vector database get produced? How does it get populated in the first place? What does one have to do to put information into it?

Edo Liberty: (06:03)
So the vector database itself is a very foundational layer. It doesn't speak the language of text and audio and video and so on. Foundational models today exist as API calls, really; there are managed services and great companies that support them, and they already take data and vectorize it, okay? So that means I'll take a text document, or I'll take a sentence, and I'll ask for an embedding, right? Companies like AI21 and Cohere and Hugging Face and OpenAI and others all have great models for embedding text or embedding images or embedding composite items. That puts the items in what's called vector space. You give it text, you get back a vector, and that vector is put into the database. Same thing at query time: you take a query, you vectorize it, and you search in Pinecone for the relevant information.
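
A minimal sketch of that ingest-and-query flow, assuming the OpenAI embeddings API and the Pinecone Python client. The key and index name are placeholders, and client method names vary across versions, so treat this as illustrative rather than canonical:

```python
from openai import OpenAI      # assumes the openai package, v1+
from pinecone import Pinecone  # assumes a recent pinecone client

oai = OpenAI()                         # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder credentials
index = pc.Index("my-index")           # hypothetical pre-created index

def embed(text: str) -> list[float]:
    # Ask the model provider to put the text into vector space.
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

# Ingest: vectorize a document and store the vector plus the raw text.
doc = "A catalytic converter reduces toxic gases in car exhaust."
index.upsert(vectors=[{"id": "doc-1", "values": embed(doc), "metadata": {"text": doc}}])

# Query time: vectorize the question the same way and search for nearby vectors.
results = index.query(vector=embed("How does a catalytic converter work?"),
                      top_k=3, include_metadata=True)
for match in results.matches:
    print(match.id, match.score)
```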

Satyen Sangani: (06:53)
And so is all of the information or all of the, I guess, purported knowledge that exists in a foundational model essentially vectorized into a vector database? I mean, is that essentially... Is it a container for that information?

Edo Liberty: (07:04)
Yes. That's a great way to think about it. And not only that, it's the way to do it inside companies, inside corporations, where your data is very sensitive. You don't want to retrain the models. You don't want to make the models very dynamic. You just put the knowledge that the model uses next to it, in a database that you control. Then you get both the benefit of the foundational model being very knowledgeable and full control over your data, its governance, and so on.


Foundational models define the integrity of vector databases

Satyen Sangani: (07:33)
And are the underlying vector databases hard bound to the foundational model, which effectively populated them or represented the information in the first place? Is there any sort of swappability of the knowledge?

Edo Liberty: (07:45)
No, I think it's a... not fairly tight. It is a tight mapping. If you use a specific model to embed your data, then that is the model you should keep using for embedding data into that specific index. Of course, you might have different indexes with different models, but yes.

Satyen Sangani: (08:03)
Because on some level, the model has produced a representation that is accessible to it, but on some level, it only understands its own language.

Edo Liberty: (08:16)
In many ways, the embedding model is like the ETL into the vector database. If you swap the ETL, yeah, of course, the whole schema changes. In a regular database, you can't just swap the ETL for another ETL and expect things to keep working, right? It's the same thing.

Satyen Sangani: (08:31)
Yeah. Although in a relational database, of course, the knowledge representation is discrete. The catalog effectively gives you some understanding, or the schemas really give you some fundamental understanding, of what data is inside it. So "first name" tells you, "Oh, this is a list of first names." In some sense, here the index, as it were, the way to understand and look up the information, is actually contained in the model, which is outside or external to the underlying database. Or is that wrong?

Edo Liberty: (09:00)
No, I think it still kind of matches. The schema here is just the dimension: you need some number of floats for a vector to fit in the index, but the meaning could be different, right? I mean, I can have a table with a name and an age. In one case it's about humans, and in the other it's about cheeses, and the data would just become complete garbage, even though the schema is respected, right? You can't just replace the source and the meaning of the data, even though you conform to the structure. It's the same thing here. You can conform to the structure and replace the model, and then you get some mix of two data sources that wouldn't play nice together.

Satyen Sangani: (09:41)
Yeah. I guess on some level, the metadata, in a relational database, would tell you that, "Oh, this table is about the age of cheeses versus this one being about the age of people." There is no such sort of inherent self-describability that comes with the underlying vector database, if I'm understanding your description correctly.

Edo Liberty: (10:00)
Yeah. I mean, look, I've just been talking about a collection of vectors; vector databases are much more complex. You have metadata, and you have other fields and so on; it's a much more complicated object. So you can encode all that. All I'm saying is, if you swap the model for a different model mid-stream, you might garble the index; it might become inoperable.

Satyen Sangani: (10:18)
Yeah, for sure.

Edo Liberty: (10:18)
Not might. You're sure. You are just sure it would break.

Satyen Sangani: (10:18)
If you just arbitrarily took any random ETL and wrote it into any arbitrary set of tables...

Edo Liberty: (10:25)
Correct.


The AI-driven search engines redefining data discovery and application

Satyen Sangani: (10:26)
You would come up with garbage. In that sense, it would be garbage in, garbage out. I think we're making two different points. I think what you're saying is, "Look, one needs to understand what it's talking to in order to talk to it," which is obviously correct. The point I'm observing is slightly different: if you had the appropriate metadata, you could understand what's inside of a relational database by simply knowing the metadata. That same property doesn't exist for a vector database, I guess. You don't know what those vectors represent.

Edo Liberty: (10:55)
Not easily, no. Not easily.

Satyen Sangani: (10:57)
Which is, I think, part of what's so interesting. People call it vector search, or refer to it as vector search. Is it more helpful for people to think about vector databases as a form of search engine, or how do you describe it?

Edo Liberty: (11:13)
Yeah. It's very often used as a search engine. Thinking about it as a search engine, a very AI-native search engine, is not a bad way to think about it.

Satyen Sangani: (11:23)
Yeah. And so who uses a vector database? I presume it's somebody who wants to augment the information from a foundational model, essentially, or augment what a foundational model knows. Is that broadly the case?

Edo Liberty: (11:37)
Not only. In fact, we started building Pinecone well before the explosion of foundational models, so the use cases are much broader. As a whole, you indicated it yourself: search and semantic search, the ability to really search for things by meaning and by analogy and by correlation, is a very powerful thing. It's used for shopping recommendation, and legal discovery, and anomaly detection in financial data, and fraud detection, and spam filtering, and a million different platforms that require this foundational layer, this component, right?

Edo Liberty: (12:17)
So it happens that to do what's called RAG, retrieval-augmented generation, this became sort of a necessary component. Now when you have a foundational model answering questions, hopefully factual questions, you care so much about the organization's data in a workplace being accessible and being data-driven, right? If you as a company have a lot of data internally, whether it be legal documents, or reports, or historical accounts, and so on, and you want to converse intelligently inside your company, the ability to go and fetch relevant information in real time, and for the foundational model to actually respond intelligently with data, is a necessary component. So you have to put all your data somewhere, search through it, get the text or the charts or the images, right, and use them as what's called context for the answer in the LLM.
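
A minimal sketch of that retrieve-then-answer loop, reusing the hypothetical embed() helper, index, and oai client from the earlier ingestion sketch. The chat call assumes the OpenAI client; any model provider works the same way:

```python
def answer(question: str) -> str:
    # Assumes `oai`, `index`, and embed() from the earlier ingestion sketch.
    # 1. Retrieval: vectorize the question and fetch the most relevant documents.
    hits = index.query(vector=embed(question), top_k=5, include_metadata=True)
    context = "\n".join(m.metadata["text"] for m in hits.matches)

    # 2. Generation: hand the retrieved text to the model as context, instead of
    #    letting it answer purely from whatever it absorbed during training.
    chat = oai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return chat.choices[0].message.content
```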


Pinecone’s meteoric rise: from stealth to spotlight

Satyen Sangani: (13:16)
I mean, you've been around for five years, is that correct?

Edo Liberty: (13:18)
Right.

Satyen Sangani: (13:19)
Pinecone has been around for five years. But the growth, as I've understood it, and this may be incorrect, you can tell me this is wrong, over the last two years has been sort of stratospheric.

Edo Liberty: (13:27)
Correct. I mean, we only launched our product two years ago, so before that we had no growth. But yeah, the beginning of '22 is when we launched our paid product for the first time. And you're right: '22 was a high-growth year, and '23 was just an anomaly, just a complete explosion. Basically, what people experienced with ChatGPT and agents and copilots and so on, we've been powering a lot of that wave. And so yeah, we went from being a fairly well-known secret to being a well-known brand and a company that's just in the eye of the storm.

Satyen Sangani: (14:08)
Yeah. Which is the proverbial tornado, in Geoffrey Moore terms.

Edo Liberty: (14:12)
Yes.


Transforming the digital landscape with semantic search and LLM integration

Satyen Sangani: (14:13)
And so where do you find the use cases are? Which use cases have taken off most? You mentioned a variety; which ones have been most transformational? How often are you being used in conjunction with LLMs, and how often are you otherwise doing more traditional semantic search?

Edo Liberty: (14:31)
No, even semantic search is being done with LLMs. We're very tightly coupled with the model providers, and that's why we partner with pretty much all of them. Semantic search was, in some sense, a holy grail for 30 years, right? There are whole academic conferences with thousands of publications on information retrieval, starting literally from the '70s and '80s. The ability to actually process text and get it to a form on which semantic search actually works, that's the new thing, right? The ability to take models and really convert text into these vectors, and those vectors being extremely good semantic representations of the content, that's new.

Edo Liberty: (15:14)
And the other thing that is new is the existence of tools like Pinecone, and specifically our recent launch of the serverless architecture, which allows you to really ingest billions and billions of these vectors, each one of them maybe every sentence or paragraph embedded multiple times, and so on, right? The ability to have huge amounts of data and to search it very effectively and on a budget now makes this extremely powerful. In the same way that foundational models graduated and vector databases graduated, now we've got really strong foundations on both sides. Now you can put these systems on steroids, and companies like Notion, Gong, and many others have built question answering and analogy discovery and so on, sometimes for their own tens of thousands of customers, right? That's possible today, which was not possible even two or three years ago.


Anticipation and evolution of AI

Satyen Sangani: (16:11)
Yeah. I mean, but you founded the company, obviously, five years ago. And prior to all of what we've seen over the last two years, did you imagine the sophistication of these models? Or did you have a sense that this was coming, or?

Edo Liberty: (16:26)
The answer is yes. I mean, look, the timing and the intensity, I think nobody could foresee, okay? The direction and the capability, I think the writing was on the wall probably from 2017-ish. There was a huge sea change in how we do natural language processing. It became a lot more deep-learning heavy; transformer models became significantly better; GANs, the adversarial networks, and generation started being a thing. You could start seeing cracks in the armor: "Okay, this thing is starting to break. We're starting to really see some transformation here." In 2018, I think, BERT came out. Vector search started becoming a lot more talked about by normal engineers and not just AI enthusiasts. So you could see the sea change starting to happen. Of course, back then very few people knew about it, but the change was happening.

Satyen Sangani: (17:29)
And today, as you see the space evolving, as you see vector databases moving forward, these models are fed so much information. They are so large in terms of their requirements of compute and the amount of data over which they are built. How do you see the evolution of the space moving forward? Is it simply a matter of scale? I mean, scale is not simple, but is that where the giant sucking sound lives, or are there other problems that are of interest?

Edo Liberty: (17:54)
No, there are many, many problems of interest. Let me abstract away from models versus vector databases and so on; I want to pop out one level above and say, we as a community need to learn how to reason and think, right? We need to teach our machines how to reason and think and talk and read. [chuckle] So this is the intelligence. And we need to teach them how to know and remember and recall relevant stuff, right, which is the capacity of knowing and remembering. We are much more focused on the latter, right? We work with a lot of legal discovery companies. They say, "Hey, we have millions of contracts," right? When somebody asks a question, they need to take into account having read hundreds of thousands of contracts. You're not going to be able to train an LLM to know all that, and that knowledge changes all the time, right?

Edo Liberty: (18:52)
The same thing happens for support and Jira tickets and other support issues and medical histories and company wikis and, you name it, everything in the company. You want your foundational models or your agents, your code and so on, to know it, right? So the question is, what does it mean to know something? To know something is to be able to digest it somehow, to make the connections, and when I ask you something about it, to figure out, "Oh, what's relevant?" and bring the right information to bear so that I can reason about it. So this ping-pong between reasoning and retrieving the right knowledge, right, is what we need to get good at. If you ask what's the next step, it's those three things: how do we get knowledge to be extremely good, how do we get reasoning to be extremely good, and how do we connect them in the optimal way. And we're making big strides on all three.

Satyen Sangani: (19:50)
Yeah. And to your point, the point that you've made, these models are extraordinarily good at, broadly, resemblance and some level of pattern recognition over vast amounts of information. And obviously those vast amounts of information are stored largely inside of the models, or in whatever information they've been trained on.

Edo Liberty: (20:07)
Inside the vector database to me.


The role of vector databases in multi-domain knowledge integration

Satyen Sangani: (20:09)
Inside the vector database. That's right. But what humans do really well, I mean, you mentioned that facial recognition example: there's almost another class of abstraction that humans seem to be capable of that models may not be good at; they may not always have that capability of abstraction. And so it seems like there's this interplay of teaching a model to reason, or teaching the model to be better, as it were: in some sense training it to know when to reference which stores of knowledge, and how to figure out what the appropriate answer is. Is that what people mean in some sense by... I mean, there's prompt engineering, which is obviously feeding the model good responses and potentially good answers. But do you see a world evolving where maybe a language model is taught to talk to multiple stores, is taught to know when to go to one specific domain-specific model versus another? How do you see that evolving?

Edo Liberty: (21:05)
We're talking about only the most basic interaction, right? Kind of an atomic unit of a model and one knowledge store. Agents are becoming a thing, where you take one prompt or one mission or one task and you break it into a sequence of 10 different things, and you go and execute one after the other, and the steps might rely on one another, and you might access multiple stores and multiple tools and so on. All of this is happening. I mean, there's an infinite complexity that you can go and build. And this is why we build foundational tools, right? We build a very un-opinionated tool, because remembering images and code and text and audio snippets, for a vector database, it's the same, right? Our job is to make it bigger, cheaper, easier, faster, and more secure, right? That's it. We try to very consciously stay in our swim lane, because even that is extremely hard, right? I can tell you how much we improved. Okay, let's just put things in perspective.

Edo Liberty: (22:01)
We had a customer two years ago with an annual contract of $200,000 a year. After a while, our system got so good that I had to go back to them and say, "I don't know how to tell you this, but your workload now fits on our free tier." We would still love for you to stick around, [laughter] but you don't have to keep paying us if you really don't want to, right? That is how much the system has improved. That's how dramatic the cost reduction has been, right? And so, yeah, we try to improve that part, but the whole system could get, of course, infinitely complex. You can have multiple vector databases, multiple knowledge stores, multiple foundation models, and the logic above it that orchestrates everything as an agent to do, I don't know, 20 different tasks, one after the other, while you've asked one prompt.
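
That orchestration layer, one prompt fanned out into a sequence of steps across multiple stores and tools, can be sketched roughly as follows. Every name here is hypothetical (llm.plan, llm.summarize, the tool registry); real agent frameworks add planning loops, tool schemas, and error handling:

```python
from typing import Callable

def run_agent(task: str, llm, tools: dict[str, Callable]) -> str:
    """Break one task into a sequence of steps and execute them in order,
    letting each step see the results of the steps before it.
    Illustrative only: `llm.plan` and `llm.summarize` are hypothetical."""
    steps = llm.plan(task)  # hypothetical: ask the model to decompose the task
    results = []
    for step in steps:
        tool = tools[step.tool_name]  # e.g. a vector index, a SQL store, web search
        results.append(tool(step.input, history=results))
    return llm.summarize(task, results)  # hypothetical final synthesis call
```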


From AI foundations to vector database innovation

Satyen Sangani: (22:50)
Yeah. And your background, interestingly, of course, is in AI, which is very much a mathematical discipline, at least at its fundamental core. But here we have you developing a form of a database, which is obviously highly correlated with systems development and computer engineering. So in some sense, you're building something for the people that you've been trained as. How has that transition been for you? I mean, you're operating at low levels of the stack. Has that been a fun transition? Do you find yourself migrating back to some of the problem spaces that are above what has been represented and processed in your systems?

Edo Liberty: (23:28)
So it actually hasn't been a very sharp transition for me, because I'm older than I look. I started my PhD in 2005, roughly; I might be off by a year. Back then, you couldn't even build on anything like TensorFlow or PyTorch, right? There was nothing. You had to build everything from scratch all the time. None of the tools that we know today existed; even using GPUs, which today is very well understood, back then was a novelty, right? So you basically had to build everything you use, build all the tools you need, to actually go build something. Tool building was an essential part of doing machine learning. You've experienced it yourself, right? Today, data catalogs and the stores and tooling behind them are well-understood tools, but back then, if you wanted one, you had to build it yourself. And so for me, again, this is not a big change. I just really enjoy the tool building. I really enjoy the platform building. That's where I stay, because I love that. There's something very clean about it, something very measurable, accurate, and disciplined about it, and I enjoy that part of the stack.

Satyen Sangani: (24:39)
Yeah. I mean, it's like the early days of the internet. Back then, what you describe, essentially, an Application Service Provider, would have to host its own servers because there was no AWS. What are the class of people... I mean, this is obviously a massively sexy space, massive market potential. Lots of folks are coming after you. I remember early on, when we were founding Alation, I'd get a lot of, "Ah, is this a feature of something else? Is this something that is going to be encapsulated by something bigger?" How do you think about that? Do you think about this as being mainstreamed into something larger? How do you think about the space evolving, and how do you think about differentiation unfolding over time?

Edo Liberty: (25:20)
Yeah. So, look, one of the most exciting moments for me before starting Pinecone was the realization that this is truly a new kind of database. This is not some tweak or some flavor of something else; this is not something you can bolt onto a document store or a NoSQL engine or a SQL engine or an OLAP system or whatever, a warehouse or a data lake. These things are extremely specialized, okay, in their data access, in the data layout, in the query execution logic, and so on. And when I understood what you need to be able to actually build a vector database that's very efficient, very scalable, and actually operates with the performance and cost-scale trade-offs that people need to run big applications, you needed something that's foundationally new.

Edo Liberty: (26:21)
That was one of the most exciting parts and one of the crystallizing moments where I said, "Okay, we have to do this. This is a privilege." The ability to actually build a new kind of database doesn't come up very often. It's happened in history, I don't know, like 10 times. And so the privilege to carry that torch, for me, it was unavoidable, right? And so that is my answer in some sense. We believe, and we have all the odds to show, this is fundamentally a new thing. Just to give you an example: as an experiment, we loaded the internet into Pinecone in one index. We took Common Crawl, vectorized it, put it into Pinecone, and basically used the internet as a base for RAG, for factual questions with models. Interestingly enough, by the way, it cut hallucinations on GPT-4 by half and improved all foundational models, including Mistral and Llama. The interesting thing is, to be able to do that efficiently, for this to take only a handful of days and actually not even be that expensive, the only way to do that is to build something that's actually extremely good at this, right?

Satyen Sangani: (27:39)
Hmm.

Edo Liberty: (27:39)
And if we tried to do this with some NoSQL engine that sprinkles vector search over everything like salt and pepper, or some data lake or whatever, it would just either not work at all or be nauseatingly expensive. Which is what we see. I mean, people come to us and say, "Hey, we tried to do this on X, Y, Z as advertised by them, and the whole thing broke down," or, "We figured out we're going to spend $3 million this year on this. Okay, we need to find something else."


Exploring AI’s black box: The challenge of understanding complex systems

Satyen Sangani: (28:10)
Yeah, because on some level, these models are operating at a level of abstraction that the other databases are simply incapable of representing in any reasonably scalable way. You mentioned this idea that the performance of the underlying retrieval improved. Can you give us a little bit more of a sense for why that happened? What was the innovation there?

Edo Liberty: (28:29)
So if you remember, we talked about both retrieval and reasoning, okay? In the retrieval itself, there's a lot of logic: what embedding you choose, how you search through the vectors accurately, how many results you bring back, how you then re-rank them, what you do with the results that come back, how you figure out what's relevant or not. There are many permutations and variants of how you do retrieval and how you bring back the best results, right? Interestingly enough, significantly improving over all the public foundational models on factual questions was surprisingly not hard, right? That kind of shows you the power of this paradigm. Of course, we can do a lot better; I'm not saying we're nearly done, right? But the fact that running this on Pinecone took a few days and, whatever, a few thousand dollars' worth of compute is nothing short of a miracle, because training those LLMs is a massive multi-year, multi-million-dollar effort.
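
One of those knobs, re-ranking, is easy to sketch: retrieve a generous candidate set cheaply, then re-score the candidates with a finer-grained relevance function and keep the best few. This reuses the hypothetical index and embed() from the earlier sketches, and rerank_score stands in for whatever scorer you'd use, such as a cross-encoder model:

```python
def retrieve_and_rerank(question: str, rerank_score, k: int = 5):
    """Fetch a wide candidate set from the index, then re-rank it with a
    more careful scorer. Illustrative only; assumes `index` and embed()."""
    candidates = index.query(vector=embed(question), top_k=50, include_metadata=True)
    scored = [(rerank_score(question, m.metadata["text"]), m)
              for m in candidates.matches]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```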

Satyen Sangani: (29:31)
One of the things that's certainly hard for me to understand is we don't know, I guess, how these models represent information, what the numbers actually represent and what they mean. And when a model is retrieving information, how there's a mapping between humanly understood concepts and the numbers. Do you think that there will be a time when we'll be able to explain what's going on?

Edo Liberty: (29:53)
No. [laughter] We can get philosophical here, but the short answer is, as a trend, we're moving further away from understanding it deeply, not towards understanding.

Satyen Sangani: (30:07)
Say more.

Edo Liberty: (30:08)
Networks are getting deeper and bigger, connections become more complex, we train on more data. And frankly, we've started understanding these systems as complex systems: we measure them in aggregate without understanding the individual behavior, right? You can measure the angle of the sand in a dune without knowing where each grain of sand is, right? That is an aggregate phenomenon. You can say something about the behavior of an ant colony without understanding what each ant wants or does, right? These complex systems have emergent behaviors that you can study as a topic or as a system without truly being able to characterize each one of the components, right?

Edo Liberty: (31:04)
And that's how we study neural nets now, right? In fact, we study them on two ends. You have old-timers like me who truly care about how you propagate gradients correctly and how you compute; this is the very low level of literally how you should make one step in learning from one layer to the next, right? And we obsess about doing that exactly right. But then you put three layers together and we have no idea what's happening, right? It's already way too complex; there's no way to actually characterize what's happening. But suddenly, at hundreds of layers and billions of parameters, the emergent behavior is already starting to be predictable enough that you can start measuring it again and start reasoning about what this thing should be doing, even though we don't really understand the components.

Satyen Sangani: (31:54)
Yeah, so you test the behaviors from the outside as opposed to sort of really understanding what's going on inside the black box, as it were.

Edo Liberty: (32:02)
Yeah.


How to navigate AI sensitivity

Satyen Sangani: (32:03)
But these models are hypersensitive. I mean, very small changes in inputs can yield dramatically different outputs, right?

Edo Liberty: (32:09)
It has to be that way. It mathematically has to. I mean, this is something that I keep telling people, and I think they don't want to listen. But it's reality, I mean...

Satyen Sangani: (32:19)
Because they want to know that there's an immutable response. Is that why they don't want to listen?

Edo Liberty: (32:24)
Yeah. So let's kind of break it down to the math. I mean, this is where being a mathematician really helps, right? Although mathematicians might be offended that I said that; mathematicians wouldn't count me in their club, but.

Satyen Sangani: (32:39)
That's okay. It looks like you wouldn't count me in your club. So there are many clubs. [laughter]

Edo Liberty: (32:44)
Yeah, fine. But look, the input for those models is oftentimes, let's say, a thousand-dimensional vector, which is a thousand numbers, right? And let's say the output is one number, say a classification; you're just trying to output one item. Forget about a whole sentence or an essay or writing a job description. Output just one number between, like, yes and no, okay? And anything in between, right? So now, it doesn't matter how you slice and dice the network. It could be one neuron, and it could be a billion neurons, right? This thing is a function. It takes a thousand numbers and it outputs one number, right?

Edo Liberty: (33:23)
So now you can think about this thing as one big, complex function, and you can ask yourself: this multidimensional function takes a thousand numbers and outputs one number, it takes a vector and outputs a number, right? How smooth can it be? And the answer is, if it's not effectively constant, basically, if it doesn't always output roughly the same number, say 0.5, right? If it needs to consistently say yes and no about a bunch of stuff and be somehow informative, then it needs to be extremely sharp, sharply changing, right? It has to be, right? And again, I don't want to jump into the math of why that happens; these are deep theorems about isoperimetric inequalities. But believe me, it's mathematically unavoidable. The function has to be sharp, in the sense that there will be many places where you vary the input a little bit and the output varies dramatically, right?
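
The standard result behind this intuition can be stated roughly as follows; this is one common form of Lévy's concentration inequality on the high-dimensional sphere (constants vary by formulation, so take it as a sketch):

```latex
% For an L-Lipschitz function f on the unit sphere S^{d-1}, with median M_f,
% there is a universal constant c > 0 such that
\Pr\big[\, |f(x) - M_f| \ge t \,\big] \;\le\; 2 \exp\!\left( - \frac{c\, d\, t^2}{L^2} \right).
% In high dimension d, a function with small Lipschitz constant L is therefore
% nearly constant almost everywhere. A classifier that outputs "yes" and "no"
% each with substantial probability cannot be nearly constant, so its Lipschitz
% constant must be large: somewhere, a tiny change in input flips the output.
```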

Satyen Sangani: (34:24)
Yeah.

Edo Liberty: (34:25)
And we have to get used to that, because that's not going to change. I mean, this is mathematically impossible to fix.


Unlocking AI transparency

Satyen Sangani: (34:31)
Which is a fundamental characteristic of complexity, that small changes have potentially massively different implications for the output.

Edo Liberty: (34:42)
And you can think about that as being brittle or insecure or unsafe, and I see why people are uncomfortable with it. But in some sense, that's an inherent behavior of these large networks.

Satyen Sangani: (34:55)
I know at Anthropic there are researchers working on AI explainability who are maybe a little bit more optimistic than you. Do you have a sense for the case they would make in terms of how far we'll be able to get? Or do you think it's fundamentally a question of simply describing the library of external tests well enough to know how consistent the output would be?

Edo Liberty: (35:16)
So I can tell you what I think about explainability. Again, we're trying to represent knowledge and provide some level of sanity [laughter] for a lot of companies using LLMs, and maybe what I said before would just cause alarm and despair: "Oh, we can never fix this." What I think about explainability is giving enough evidence to support your claims, right? You don't have to be always right; you're bound to sometimes be wrong, right? We're never going to make AI perfect. We're not going to make natural intelligence perfect, let alone artificial intelligence, right? But if you're generating text in an answer, you can say, "Hey, this is the answer to your question. And by the way, I got this information from these five documents that are in your company's stash of data," whatever that might be, right?

Edo Liberty: (36:12)
And you can go and verify that what I said here matches that, okay? And if you can, great. And if you can't, then that's a problem. Today, we can do that. In fact, if you look at Notion, one of the leading companies in organizing company data and being this collaborative platform, they've now built a Q&A feature on top of Pinecone that allows you to do exactly that, right? So when you query your own data, you get an answer, and it's not just an answer. It's an answer with the citations and the actual evidence that this is the right thing. Not that it's never wrong, but it's significant. So for me, that's a practical, reasonable way to get comfort and confidence in the answers, even though, you know, is anybody perfect?
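
Mechanically, that's a small extension of the earlier retrieve-then-answer sketch: keep the IDs of the documents you retrieved and hand them back alongside the generated answer, so a reader can check the claim against the sources (again assuming the hypothetical oai, index, and embed() from before):

```python
def answer_with_citations(question: str) -> tuple[str, list[str]]:
    # Assumes `oai`, `index`, and embed() from the earlier sketches.
    hits = index.query(vector=embed(question), top_k=5, include_metadata=True)
    context = "\n".join(m.metadata["text"] for m in hits.matches)
    chat = oai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    # Return the answer together with the IDs of its source documents,
    # so the claim can be verified against the originals.
    return chat.choices[0].message.content, [m.id for m in hits.matches]
```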


Striking a balance between AI innovation and thoughtful regulation

Satyen Sangani: (37:02)
Even the most bulletproof or branded studies, in the most bulletproof journals, essentially rely upon a similar method. I mean, you have reasoning that is outlined in a paper, but ultimately it's referencing prior work, data, and insights. And so on some level, that's what you expect of a human. You're not asking the human to tell you which neuron touched or fired another neuron in order to get to an answer or cause words to be spoken.

Edo Liberty: (37:30)
Correct. Correct. Exactly. If you have a paralegal writing a document, you don't ask, "How did your brain work when you wrote this?" You ask them, "What are your sources?"

Satyen Sangani: (37:40)
And yet this is all very scary to many people, and people are thinking of regulating this stuff; there are all sorts of initiatives and proposed regulations. How do you see that unfolding? I mean, people must ask you this question all the time, given that you're one of the foremost experts and at the forefront of all of this. How do you think the world ought to be contending with this? Is there a place for regulation? Is there a place for more oversight?

Edo Liberty: (38:04)
For sure there is a place for regulation. We know that this technology can be abused in many ways, and I know for a fact it already is being abused in all sorts of ways. That said, I am concerned when regulation specifies how a technology can be used rather than what it can be used for, right? And that part scares me, because we don't know how half of these things work. Every quarter the technology changes and we understand new things. Literally every quarter, a few months; that's not enough time to pass any legislation.

Edo Liberty: (38:40)
So whatever regulation we have is going to be two, three, four years behind what's actually happening, and at the pace of innovation in AI, that's just completely obsolete, right? Even if we do something brilliant today, right? Even if we trust our politicians and legislators to be absolutely brilliant and knowledgeable and scrupulous and honest and well-intended. Even when all those are in place, in three years, what they set in motion today would be just completely crippling in all sorts of unreasonable and unintended ways, right? I'd rather see regulation that says, "Okay, what kinds of things do we not want to see?" Then we'll just make sure, as technologists, not to step outside of those bounds.

Satyen Sangani: (39:25)
Yeah, which I mean, obviously applies to good actors. And then another question is like, how do the bad actors deal with this?

Edo Liberty: (39:31)
Like everything illegal. I mean, doing illegal stuff is illegal. [laughter] You go to jail for it. Regulation never stops anybody who doesn't want to follow regulations.

Satyen Sangani: (39:40)
That's right. That's right. This has been a phenomenal conversation and I appreciate your patience with what are probably some very basic questions. This has been a lot of fun and I'm sure our listeners will really appreciate it. Thank you for taking the time.

Edo Liberty: (39:55)
My pleasure, man. Thank you.

[music]

Satyen Sangani: (40:01)
The rapid progress in AI technology has fueled the evolution of new tools and platforms. One such tool is vector search. If the function of AI is to reason and think, the key to achieving this is not just in processing data, but also in understanding the relationships among data. Vector databases provide AI systems with the ability to explore these relationships, draw similarities, and make logical conclusions. Edo remains optimistic about the future where knowledge can be accessed at any time. He is certain that models will become increasingly complex, but also more efficient and adept at managing intricate computations. And, it's clear that understanding and harnessing the power of vector databases will have a transformative impact on the future of AI. Thanks for listening, and thank you, Edo for joining today. Our next episode will be the last of this season, featuring none other than the inspiration for the theme and title of this podcast: Saul Alinsky. Thanks to ChatGPT, we’re able to end the season with a special, GPT-infused interview that explores Saul’s most famous work, "Rules for Radicals." Published in 1971, the book provides guidelines for social activism and focuses on strategies for bringing people together and inspiring them to strive for a shared objective. I'm Satyen Sangani, CEO of Alation. Data radicals, keep learning and sharing. Until next time!

Producer 1: (41:19)
This podcast is brought to you by Alation. Subscribe to our Radicals Rundown newsletter: you'll get monthly updates on hot jobs worth exploring, news we're following, and books we love. Connect with past guests and the wider Data Radicals community. Go to alation.com/podcast and enter your email to join the list. We can't wait to connect.
