From Statecraft to Codebreaking: The Big Data Origin Story

Chris Wiggins, Chief Data Scientist, The New York Times

Chris Wiggins

Chris Wiggins is the Chief Data Scientist at The New York Times, where he leads a machine learning team solving newsroom challenges. Chris is also an Associate Professor of Applied Mathematics at Columbia University, co-author of How Data Happened: A History from the Age of Reason to the Age of Algorithms, and co-founder of HackNY, a nonprofit connecting students with NYC startups.

Chris Wiggins

Chief Data Scientist

The New York Times

Satyen Sangani

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”

Satyen Sangani

CEO & Co-Founder

Alation

Satyen Sangani (00:03):

Welcome back to Data Radicals. In this episode, we have the privilege to hear from Chris Wiggins, Chief Data Scientist at The New York Times, drawing on his career experience and lessons from his book How Data Happened. Chris walks us through an exhilarating journey of how we arrived at our data-empowered reality. We explore the origins of statistics and how states used data as a path to power. Chris reveals how World War II set the stage for the birth of digital computing and AI as we know it. Have you ever wondered how data shaped societies, or been curious about what the future holds? This episode is for you.

Producer (00:37):

This podcast is brought to you by Alation, a platform that delivers trusted data. AI creators know you can't have trusted AI without trusted data. Today our customers use Alation to build game-changing AI solutions that streamline productivity and improve the customer experience. Learn more about Alation at A-L-A-T-I-O-N dot com.

Satyen Sangani (01:01):

Today on Data Radicals, we're excited to have Chris Wiggins with us. Chris is the Chief Data Scientist at The New York Times, where he leads a machine learning team to solve business and newsroom challenges. He's also the co-author of How Data Happened and Data Science in Context. Chris is an associate professor of Applied Mathematics at Columbia University and a founding member of the Data Science Institute executive committee. Chris co-founded HackNY, a nonprofit that hosts student hackathons and runs the HackNY Fellows program, a summer internship at NYC startups. Chris, welcome to Data Radicals.

Chris Wiggins (01:36):

Thanks for having me.

How did you come to write How Data Happened?

Satyen Sangani (01:36):

So I'm super excited to have you on for a million reasons, but one of them is because you wrote this book on the history of data, which talks about how subjective data is. In my life, so much of what I struggle with is how to get that subjectivity across and how to deal with it in software. So to me it's a really exciting history. It tells a story that I think is counterintuitive to how people typically think about the topic. And then besides all of that, of course, you're a Columbia professor, and as a former Columbia College grad, it's just great to see Columbia's doing great things. But I want to start with the book. I think it's central to a lot of what we think about in the podcast. Tell me about the story of writing it. How did you decide to go do it? Where did the idea come from, and what motivated you to do the work?

Chris Wiggins (02:22):

The book came out of a class, and the class came out of a dinner conversation. I co-authored the book with a real historian. I'm merely a fan of history; my co-author is a proper history professor. And it started out from a dinner that I had at his house with a bunch of students. He was a residential faculty member, meaning he lived in East Campus, if you remember East Campus from your undergraduate days, and he used to have faculty over as part of that role of being faculty in residence. So one evening we got together with a bunch of students, and they were sort of half engineers, half Columbia College, meaning they were coming from the two sides of the two cultures, the techies and the fuzzies. And we had this conversation about data that was pretty wide ranging, between the history of data, the technology behind data science and data-enabled products, and of course the impact of data on society and their futures.

(03:12):

So at the end of the dinner, the students were very encouraging. They said this could be a class. And as we went away and thought about it more, we had, I think, very limited ambitions in retrospect: we would teach a class on the history of data science, which was kind of exciting to him as a historian and to me as a practicing data scientist. And as we started actually teaching the class, starting in 2017, I would say the students really pushed us to think more broadly: not just how did data science come to be, but how did our data-empowered reality come to be, and what are the forces at play, which are much bigger than just the lives of the technologists who developed the technology. It also involves statecraft and political interests as well as politics. By politics here I just mean "of or relating to" the dynamics of power. So the class itself we taught together from 2017 to 2023, again to a student body that was a mixture of techies and fuzzies. Again, we felt like there was material that was important that was not being taught to future statisticians, future computer scientists, nor to future senators and CEOs: important ways to understand how data is shaping society.

Satyen Sangani (04:17):

So many different directions. We can go just based on that one answer, but maybe just stay in the class. So you start the class, how much of the actual history did you discover in the process of creating the curriculum and how much of it did you and Matt know as you sort of built the course?

Chris Wiggins (04:35):

I think for both of us there was a lot to discover, and for me as a non-historian, I really wasn't well versed, and I certainly wasn't trained in the history. In retrospect, I would say it's a missed opportunity for technologists to teach history as part of the way we teach a subject. Technological subjects usually build on themselves, which means there's an opportunity to teach any technological subject historically, but most of the context, most of the motivations for the science or who benefited from the science, that's considered outside the lines. And so as technologists, we rarely go there, and I certainly wasn't trained in it. So for me, there was a lot to discover about the role of state power and the requisite infrastructure that had to be created, from World War II until the present day, in order for it to even be conceivable that data would have so much impact. So for me, there was a lot of history to learn.

Statistics as the science of statecraft

Satyen Sangani (05:23):

So tell us a little bit about that history. You mentioned this idea of statecraft, and that data started as a mechanism or device of statecraft. First of all, what is statecraft? And tell us how that story started.

Chris Wiggins (05:36):

So we had to choose somewhere to start a history. Obviously when you teach a history class, you have to choose how far back you're going to go. A useful place to start was around the word statistics, which enters the English language in 1770. And in some ways I thought that was a useful mile marker, when the word statistics enters the language, in part because the word has nothing to do with data or models or mathematics. The word enters the English language as a word about the science of statecraft, how to run a country, a state; it's right there in the name, statistics, but most people forget about that fact. The other thing that's useful about that period of time is there's a real transition in the way people decided what was true and argued for what is true. So it's part of the age of reason, as we say in the subtitle of the book, and part of something being in transition is that there are lots of fights.

(06:26):

So for example, when statistics enters the English language, almost immediately you get these fights about whether or not there should be numbers in statistics, which sounds crazy to us today because we use the word statistics and numbers interchangeably. But there's these great fights from the early 19th century about people saying, these fools think that the greatness of a country can be captured in a table of the population and the number of animals and things like that. That sort of gets to the other part of your question, which is why did people start enumerating things? And there's a strong motivation to count things when you have a lot of things and you want to quantify and understand the extent of your power, particularly if you're in some sort of rival contest with somebody else. And you might like to enumerate who has more armies, who has more land, who has more resources.

(07:08):

And so a lot of the history, which we don't go into, before the 18th century is about the creation of statecraft. So there's a lot of great prehistory before the 18th century, but that's essentially where we start: when people started changing how they decided what is true, and where these fights start to enter society as to whether or not people think data should have a seat at the table. That I think is useful because that fight is constantly happening to the present day, where people have some craft or they have some way of understanding what's true, and then suddenly they find themselves awash in data, and everyone has to ask themselves: to what extent could we use these data to change the way we understand this thing that we study, or this way that we live our lives?

Satyen Sangani (07:46):

Let me ask one more question on the application of statecraft. So this idea of statecraft, this idea of describing the state and furthering the ends of the state, when you think about the idea of data gathering, what was the ultimate end? Were people just simply trying to prove that their country was better than another country? Or what were the reasons why people did this work and counted things or described things?

Chris Wiggins (08:08):

Usually in histories of this, people point to two material ends: taxation and building armies. So that's another interesting thing about that period in history, particularly in Europe: there were a large number of military conflicts as Europe sort of settled out political power. And so it's very useful to know how many people there are in order to do financial planning and also to gather resources and also to execute wars, right? To execute a war, you need to be able to conscript people and place those armies in the appropriate places with the appropriate numbers. So it informs strategy directly to be able to quantify. So a lot of the history of quantifying the state, right before the material that we pick up in our book, is largely about taxation and executing military campaigns.

Satyen Sangani (08:50):

Yeah, it's super interesting how often this idea of the interrelationship between data and war has come up in this podcast. We've had a general on the podcast who was talking about how data was transformational to the prosecution of the anti-terrorism work in Afghanistan. We had, just literally the last podcast recording, a conversation about AI and the application of AI in producing total war, and how it furthers game theory, essentially getting you to the end much faster than you otherwise might. So this idea of power and data, and force and data, seems like a very unexpected and interesting topic. You mentioned that this idea of how to count things, what to count, and how to structure data itself was sort of at the origin of it. Talk a little bit about that. People deciding what to count is, I think, on some level the fundamental politics of data, and yet that often gets overlooked. People just assume that these things are immutable and well described.

Chris Wiggins (09:47):

So some of the history we talk about involves when people try to apply this directly to society, not just full states, but when people, particularly in Victorian England, are interested in trying to use data to make policy arguments. So they're very interested in counting crime or counting suicides or other things that they're concerned about, the direction that Victorian society was going at that point. There was an empire that was perceived to be in decline, and there was a lot of interest in how they could make the empire great again, so to speak. And the tool that people were very interested in was taking scientific thinking, which had shown itself in the first half of the 19th century to be extremely good at a variety of problems, celestial mechanics and predicting the location of planets. So there was an interest in the late 19th and early 20th century in applying that technology to make society great.

(10:33):

I would say the story about war and data, which I certainly hadn't appreciated when we started the class, was really about World War II.

World War II as the springboard for data science and digital computing

And so the breakpoint between part one and part two of the book and the class is World War II and the story of the creation of digital computation. And a story that I think is not as well celebrated, or certainly wasn't well celebrated in the first, I would say, half century of digital computation, was the role of codebreaking, and therefore World War II, in the development of computation itself. So we spent some time in the book talking about Bletchley Park, which was the center of codebreaking in the United Kingdom, and the innovations there about special purpose engineering in order to solve a very applied data science problem, in which streams of messy real world data needed to be processed very quickly in order to make sense of decryption as quickly as possible, by whatever means were necessary.

(11:24):

It wasn't a place where mathematical statistics, which had been percolating in the academy for two or three decades at that point, was used. I mean, the tools that were used were engineering and heuristics, and the people who developed those methods went on after World War II to create something that looks a lot like data science, meaning using computers, which were being created for that purpose, in order to make sense of streams of messy real world data. So that was sort of a surprise to me in researching the book: the importance of World War II in the creation of computation, but also of what would become data science. I grew up sort of hearing all the time how important World War II was, and I was kind of tired of hearing about it, but researching the history of data and computation, you realize, wow, World War II really was transformative.

Satyen Sangani (12:07):

Yeah, it's kind of incredible, because probably the most popular depiction of it, at least that I've seen, is the movie The Imitation Game, where you watch the work of Alan Turing and the decoding of the Enigma machine. But talk a little bit about what the specific innovations were that they were able to uncover, that led to this transformation from counting things for the purposes of taxes and power to now actually using data as a mechanism for influence and understanding.

Chris Wiggins (12:37):

So it was quite transformative. We talk a little bit about The Imitation Game in class to try to explain the ways in which it was and was not supported by the historical record. One of the things we talk about in terms of the historical record is, well, Alan Turing enters the book quite a bit, both in terms of codebreaking and the creation of artificial intelligence in subsequent chapters. But with codebreaking in particular, we talk about the sort of forgotten story of the three Polish mathematicians who actually had built special purpose hardware for reproducing and decrypting the Enigma machine, which was the primary machine used by the Germans during World War II. It was a commercially available machine, and they had realized that they could build their own special purpose hardware to loop through possible settings of the rotors in order to find decryptions. And decryptions here really means patterns that you would get that look statistically like German language.

(13:23):

So there's a statistical observation, which is that you want to use some heuristics about the statistics of German language in order to rapidly do the search through all possible rotor settings. But it was actually a meeting between these three Polish mathematicians and the Allies, right before Poland was occupied by the Germans, which was crucial for the Allies learning how you could build special purpose hardware in order to solve this problem. So how did that lead to the present day? Well, what they really did at Bletchley Park was scale it up. They already knew that you could build special purpose hardware to loop through rotor configurations, but they really built whole rooms full of machines in order to do this. So part of what happened there that survives to the present day was the realization that making sense of data involves engineering, digitization (it wasn't being done with analog computers; it was being done in a digital format), and labor.
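To make that statistical heuristic concrete, here is a minimal sketch, in Python, of the kind of search Chris describes: loop through all rotor settings and keep the candidate whose letter statistics look most like German. The `decrypt` routine and the truncated frequency table are hypothetical placeholders for illustration, not the actual Bletchley Park machinery.

```python
import math
from itertools import product

# Approximate German letter frequencies (illustrative values, top letters only).
GERMAN_FREQ = {
    'E': 0.174, 'N': 0.098, 'I': 0.076, 'S': 0.073, 'R': 0.070,
    'A': 0.065, 'T': 0.062, 'D': 0.051, 'H': 0.048, 'U': 0.044,
}

def german_score(text):
    """Chi-squared distance between observed letter counts and German frequencies.

    Lower means the candidate plaintext looks statistically more like German.
    """
    letters = [c for c in text.upper() if c.isalpha()]
    n = max(len(letters), 1)
    score = 0.0
    for letter, freq in GERMAN_FREQ.items():
        observed = letters.count(letter)
        expected = freq * n
        score += (observed - expected) ** 2 / expected
    return score

def search_rotor_settings(ciphertext, decrypt):
    """Brute-force all rotor settings (26^3 for three rotors) and keep the best.

    `decrypt(ciphertext, setting)` is a hypothetical stand-in for the hardware.
    """
    best_setting, best_score = None, math.inf
    for setting in product(range(26), repeat=3):
        score = german_score(decrypt(ciphertext, setting))
        if score < best_score:
            best_setting, best_score = setting, score
    return best_setting
```

The special purpose hardware did this loop in electromechanical form; the point of the sketch is only that the inner test is statistical, not logical.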

(14:09):

And so right away you see in Bletchley Park how people realized, okay, this is going to be a huge project; let's decide who's going to do what. Another thing that survives to the present day, which we talk about quite a bit, is how immediately the labor was gendered. Immediately people said, okay, well, this is going to be women's work and this is going to be men's work, and the men are going to be cryptographers, and the women are going to be in the military and have to do drills and things like that, irrespective of their mathematical or computational talent. But mostly it's about the labor and the work, and the realization that processing information digitally is going to require a lot of engineering work. And the story really scaled up from there. You can look then at Alan Turing's writings as early as 1948, certainly his 1950 publication, "Computing Machinery and Intelligence," which opens by asking whether machines can think, and you can see the future right there in 1950.

(14:52):

He describes how, if you want to make a machine process information in a way that reminds us of how we think, you can do it if you just have enough memory and if you have big enough computers. So that story is certainly playing out in the present day, where the story now is not an amazing number of intellectual transformations day by day. It's really just bigger. There have been some great intellectual transformations over the last two decades, but ultimately the thing that people are seeing now under the name of scaling laws is basically: more is more. If you throw more compute and more data at the problem, you will see more eye-popping performance emerge from computation.

Satyen Sangani (15:25):

Yeah, I mean, that's so incredible. In some ways these machines are so brilliant and yet at the same time so kind of trivial, because it all comes down to predicting the next token.

Chris Wiggins (15:36):

Correct.

Satyen Sangani (15:36):

And in predicting the next token, essentially what you're doing is, to your point, using statistics to come up with, in German or English as the case may be, what the computer is to produce next.

The post-WWII rise of digitization

And so that's super interesting. What's different, I guess, between then and now besides the computational force? What has happened between the fifties and today, besides of course the evolution of GPUs and computational capacity, that has allowed us to evolve in what we do now?

Chris Wiggins (16:06):

I think the vast majority of things that have changed since then are not so much the realization of certain ways of architecting the model, because some of the innovations really just have been about what the model is shaped like. The main innovations, though, I'd say are about labor, data, capital, and our norms. Data means you have to have a data infrastructure in order to make large sets of data available to transmit and to ingest. And if you think about the history of artificial intelligence: for the first half of the life of artificial intelligence, people thought that data was a bad idea. People thought that machine learning was a bad idea and that you should get artificial intelligence from logic and rules and from programming things. In retrospect, it's not clear that they were wrong, because at the time there was not the infrastructure that we have now for gathering, transmitting, and making sense of data.

(16:54):

No World Wide Web, no internet for most of the life of artificial intelligence. Only now do we have such incredibly large corpora of text that you can easily ingest and transmit. So data is one thing. Compute is huge now, I mean, just the scale of memory and computation, and again, this gets back to Alan Turing's 1950 essay. He writes that you could imagine a computer doing these things, but it would take much more memory than they had access to in 1950. Capital goes along with all of that; there's been just tremendous investment. And there was a $6 billion raise this week at a valuation of, if I remember correctly, $157 billion. So these are extremely highly valued, capitalized companies that are taking in large investments.

Satyen Sangani (17:31):

And burning like $5 billion as well at the same time.

Chris Wiggins (17:35):

And burning through cash, and also burning the environment in ways that have been illustrated by computer scientists and other colleagues for a couple of years now. Before some of these companies were created, computer scientists were publishing papers about the environmental costs of what were called foundation models at the time.

(17:52):

But in any event, in terms of the transitions that have happened, I would say there's been amazing investment of capital, the creation of the infrastructure necessary to possess and transform and transmit data, a bit of advancement in the architecture of the models, and the realization that a certain way of shaping the prediction function is going to advance things. But a lot of it has also been our norms. So normatively, we all have certain expectations of what a computer can and cannot do. We all have norms about what information we're going to put online and make publicly available, which has made things possible for image data as well as text data. So I would say those are all transformational, but it's been kind of humbling to see how much of it is really just "more is more": more compute and more data make more eye-popping results possible.

The tension between objectivity and subjectivity in data today

Satyen Sangani (18:37):

Switching gears back to the micro a little bit. You talk a lot about data gathering, the politics of data gathering, deciding what to count, and a lot of people talk about this idea of structured and semi-structured and unstructured data. And the interesting thing that I find is that on some level, the structuring of data is itself a political decision, or at least there's a decision-making process in it that optimizes for some objective function, whether it's political or not. And what's interesting about these models, of course, is that they literally take the unstructured data as it were and, on some level, through heuristics, kind of remove any of the human-generated bias in the data. You don't select features anymore; you just feed the data to the model and it does whatever it does. Now you have to select weights, and that weight selection process and the model construction is theoretically where the politics of it lies, but it's limited to a very small number of people.

(19:28):

And I mean, this idea of choosing what to count, and now choosing what to weight, is on some level where all of the power lies, and that's becoming the province of a much smaller number of people. The number of people who could decide what features exist in a model was a small count; the number of people who can decide what weights exist in an LLM is even smaller. First of all, do you think that's a fair characterization of power centralization? How do you think about that, how do you reflect on that, and how do the students that you're teaching grapple with that material? There's a lot of questions in there, but maybe you can riff off of the ideas that I put forward.

Chris Wiggins (20:02):

Yeah. The tension that I try to get students to wrestle with is the tension between claims of objectivity and the reality of subjective design choices. So the nature of the subjective design choices will be very different for constructing and publishing a table versus publishing unstructured text. If you're just publishing all of Shakespeare's work and putting it on GitHub, say, there's not so much work that goes into the structuring of the data, that's true. But there are other places with all sorts of subjective design choices that go into the creation of what will eventually become a product. So people who do data, people who work with data professionally, know that they're constantly making subjective design choices. In fact, arguably that's why they have jobs: because these people are constantly making wise decisions about how to make sense of data. And yet we have centuries of rhetoric in which once something is reduced to a number, it has become objective.

(20:51):

We conflate the objectivity of, let's say, one plus one equals two, or other statements of logic. We conflate that objectivity with the objectivity of a narrative in which a number appears, or a product which relies on a data-empowered algorithm. And because there's data empowering it, we somehow imbue it with claims of objectivity, forgetting that there are innumerable subjective design choices. And again, anyone who does data professionally knows that those subjective design choices are being made; arguably that's why we remain employed. You mentioned something about the politics of gathering data. So yes, all those subjective design choices have politics even before you start doing any mathematics. Just when you've made the decision of what data to keep and what data to throw away, it's well known that these things have politics. And again, by politics, I don't mean "of or relating to voting"; I mean "of or relating to power."

(21:41):

When we choose different quantifications of things that we see in the world and wish to turn into mathematics, those things themselves have politics behind them. So I would say that's one of the things that we want students and readers to internalize: how claims of objectivity (and we want things that are true to be objectively true) are in tension with the fact that making sense of data involves innumerable subjective design choices, from the choice of what data to gather and what data to throw away, to the choice about how to label things, particularly when you take social constructs about people and make them entries in a table, and then eventually product choices: am I going to deploy this product using this data set or that data set, or with this set of affordances or that set of affordances? Even more so in the last two years, one of the major techniques for advancing the most eye-popping products has been RLHF, Reinforcement Learning from Human Feedback.

(22:29):

And it's right there in the name, right? There's human feedback in there. There's somebody who's making subjective design choices about which of these two models is the better one, and that means a human being somewhere, who's been given usually a code book (which doesn't mean programming code; it means a list of rules), is looking at some content, usually human-generated content, but it could be algorithmic content, and deciding which one is better. There are innumerable subjective design choices happening there, which eventually become encoded in a product, but the presentation of it as though it's somehow unbiased and free from any subjective design choices is illusory.
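To make the human-feedback step concrete, here is a minimal sketch of how those pairwise judgments are typically turned into a reward model, assuming PyTorch and a hypothetical `reward_model` that scores a prompt-response pair. This is the standard Bradley-Terry formulation from the RLHF literature, not any particular company's pipeline.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Loss over one human judgment: a rater, following a code book,
    marked `chosen` as better than `rejected`.

    `reward_model(prompt, response)` is a hypothetical function returning
    a scalar score as a tensor; training on this loss pushes the model's
    scores to agree with the rater's choice.
    """
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # Maximize the probability that the human-preferred response scores higher.
    return -F.logsigmoid(r_chosen - r_rejected)
```

The rater's subjective choices, shaped by the code book they were handed, end up baked into the reward model's weights, which is exactly the point Chris is making.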

Satyen Sangani (23:02):

In your teaching of the course, I'm sure this is material that the average Columbia student is going to love and get into, because it's kind of a place where these social issues are entwined and pretty well understood. But how well understood do you think this is in technology and the tech community more broadly? Do people just leap right over this, or do you, in your travels, find that there's a clear awareness, particularly at the practitioner level and among the people running these businesses, that this much subjectivity exists?

Chris Wiggins (23:34):

It requires some honesty and some sophistication to have that conversation. I mean honesty, for example, in that the person who's the data practitioner has to be willing to be honest about where the subjective design choices are in the analysis that's presented. And I think it's different in different communities and therefore different companies. Not every data practitioner wishes to foreground the subjectivities in an eventual analysis; plus, it's a complexifier. Often we would like to believe that the data speak for themselves, right? This phrase you often hear. Or we would like it to be that the data are raw rather than cooked. And in fact, it takes more complexity to say to somebody: these data absolutely were cooked, and I'm reflexive about the way that I cooked the data; let me share that with you so that you understand that nuance. Often that's a complexity that people are not so interested in. So it requires some honesty on the part of the person who's making a rhetorical statement with the data, and it requires some honesty and some curiosity on the part of the critical listener. And by critical, I don't necessarily mean that's bad. I mean critical in the sense of somebody who understands the complexity and the subjectivity behind something that's presented with numbers.

Satyen Sangani (24:38):

In our work, we do a lot of metadata gathering in order to understand where data came from, and often it comes from some source system. The source system is produced from some computer program whose programmer modeled the world in whatever way they thought was expedient in order to complete the task, and they probably weren't thinking about the idea that somebody would consume this information, in many cases, for the purposes of counting or understanding or doing anything with it. They were just simply producing data in order to get to an end, or to help the user of the system get to an end. So there's also just a lot of, I guess, on some level, ignorance too, or people were just moving to get to an outcome and this data is just exhaust. So it's a super complicated and fun thing to think about, because I think as we get into ideas like data definitions and business glossaries and all of these things that are seemingly somewhat boring, it's there that a lot of the science exists.

(25:28):

What is this thing? How is it gathered? What does it mean? What's the intention? How is it counted? All of that is super important and interesting. You talked about deep reinforcement learning.

What is deep reinforcement learning?

I'd love to talk a little bit about that, because it's a part of AI that I think is perhaps less high profile. Obviously there's robots and the like, and people see them, but most of the consciousness is around all of these generative models. This deep reinforcement learning area seems like it has the opportunity for a lot of transformation. Can you explain what it is, how it differs from or complements the things that we see in the news, and what the opportunities are there?

Chris Wiggins (26:00):

Yeah, so there are two very good ideas in deep reinforcement learning, and they're both great ideas. Reinforcement learning as an idea is decades old. It's the idea that predicting an outcome in the absence of an intervention is fundamentally a different type of analysis from trying to learn what is the optimal intervention in order to get some sort of outcome. The former is supervised learning, right? It's like you take an image and it has a cat face in it, and you want to build an algorithm that says: does this image have a cat face or a dog face in it? There's a whole high profile branch of machine learning over the last 20 years around that sort of supervised learning problem. I like to call it cat face science. That, though, if you think about interacting with the world, is of limited value to somebody who is a doctor or a robot or somebody who runs a company. Those are all examples where decisions have to be made and you interact with the world.

(26:52):

So the problem of simply predicting what's going to happen in the absence of intervention is useless to, say, a doctor. If you have a hospital that never gives anybody any medicine, or it gives everybody the same medicine, then you could hire some statisticians and predict who's going to get better and who's not. But as a doctor, you really want to know which people should get which medicine. So that's a problem not of prediction but of prescription; they're trying to prescribe the right drug. And the branch of machine learning for making decisions in a world where you are trying to figure out how the world works at the same time as you're trying to make the right decisions in that world is called reinforcement learning. So reinforcement learning is a good idea. It's multiple decades old. It had some flourishing in the robotics community earlier, but it's also the right abstraction for running a digital company when those decisions are ultimately decisions being made by software.

(27:40):

That's paired with deep learning, which is essentially the realization that the idea from 1943, that an artificial neural network might be useful for processing information, is a very good idea with sufficiently big artificial neural networks. The realization that neural networks, which had been posited as a potential information processing mechanism since 1943, could be really, really effective at function approximation came 10 or 15 years ago, when people found it worked really, really well if you made many layers of artificial neural networks. At that point, people stopped calling their field neural networks and started calling their field deep learning. So deep reinforcement learning is really combining these two ideas, in which the thing that you're trying to predict is how much value you will get, given the current state, if you take a certain action, and you try to solve that problem using function approximation while you are at the same time interacting with the world and trying to choose the next action in a way that will get you the best outcome.
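As a rough illustration of "predict the value of taking a certain action in the current state," here is a minimal tabular Q-learning sketch in Python. Deep reinforcement learning replaces the lookup table with a neural network approximating Q(state, action); the `env` object here, with its `reset`, `step`, and `actions` members, is a hypothetical environment interface, not any particular library's API.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn Q[(state, action)]: the estimated value of taking `action` in `state`."""
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore occasionally; otherwise exploit the best-known action.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Update toward observed reward plus discounted best future value.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Both of the ideas Chris names are visible here: the "reinforcement" part is the update rule that learns from interaction, and the "deep" part would swap the `Q` table for a many-layered function approximator.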

Satyen Sangani (28:38):

And the difference between this and our generative world is that in the world of "predict the next token," what you're simply trying to do is predict the next thing that might come up. Here, you're dealing with a multivariate universe where it's not about predicting a singular thing; it's about evaluating, among a series of decisions, which one might be the optimal one.

Chris Wiggins (28:59):

With anything prescriptive, you have an action space, like a set of drugs on the counter, for example, and you have to choose what is the right drug to prescribe to which person. So the goal is to choose the action that maximizes the outcome, which is different from the goal of predicting the probability of any possible next token and then sampling from that probability distribution to generate the next token.
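That contrast fits in a few lines of Python; the logits and values below are made-up numbers purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative: softmax over vocabulary logits, then *sample* the next token.
logits = np.array([2.0, 1.0, 0.5, 0.1])        # scores for 4 candidate tokens
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> probability distribution
next_token = rng.choice(len(probs), p=probs)   # sampled, so output can vary

# Prescriptive: *choose* the action with the highest estimated value.
q_values = {"drug_a": 0.62, "drug_b": 0.71, "no_treatment": 0.40}
best_action = max(q_values, key=q_values.get)  # argmax over the action space

print(next_token, best_action)
```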

Satyen Sangani (29:19):

But it's the combination of these two ideas that makes people talk about AGI and ASI, sort of this completely autonomous singularity where there's an all-knowing being. It's the combination of these two things that makes that idea, in some people's eyes, fairly proximate. What's your take on when we'll see the singularity?

Chris Wiggins (29:46):

Yeah, I'm not bullish on the singularity arriving anytime soon, but I am an observer that the thing we call artificial intelligence is quite jagged, in that there are things you can ask a computer to do where it does much better than you expect, and then there are other things you ask a computer to do where it does much worse than you expect. And that's been true for decades. I mean, part of what leads to this longstanding feeling of elation and despair about what computers can and can't do is this jaggedness: there are some things that we don't think a computer can do, and it does them, and we're all either elated or terrified, and then there are still other things that you think a computer should be able to do, and it just does not do them. And that front is quite jagged and evolving all the time.

(30:26):

So there's a good paper about this by the philosopher Dreyfus from 1965, I believe, called Alchemy and Artificial Intelligence. Dreyfus went on to write a book called What Computers Can't Do, and then later a book called What Computers Still Can't Do. So this frustration with the fact that computers can do some things just so, so well, and we're all excited about it, and then there are other things they absolutely cannot do well, I think that's longstanding. I'm not taking any particular bets about the singularity, but I'm simply an observer over the last couple of decades that it is a constant dialogue: people being surprised and delighted by things they didn't think a computer could do, and then the computer could do it. And then with the passage of time, people think that that's trivial, and they dismiss it as not really artificial intelligence, it's some other thing, now that computers can do it.

Satyen Sangani (31:08):

And if it can do this completely surprising thing, then of course there's a whole bunch of trivial things that it ought to be able to do. And so a lot of people underestimate those somewhat trivial tasks and therefore get to a place where they're like, of course we're going to be able to get to this end state faster.

Chris Wiggins (31:20):

That is the jaggedness of artificial intelligence. And an underappreciated aspect of the sentence you just said is surprise, which is that we're always constantly doing this gap analysis between what we think should be possible and what is possible, but it involves our own subjective sense about what should be possible. So the surprise is really a function of our own norms and what we expected. This has been true for a long time. You can read Joseph Weizenbaum (I actually have his book in front of me), who built the chatbot Eliza in the 1960s, and he was sort of surprised and a bit disappointed about how much people enjoyed talking to Eliza and anthropomorphized it and felt like they were having an emotional connection with Eliza, in the sixties, from a computer program that just had a little bit of randomness and a couple of rules about regular expressions.

Is data scientist still the sexiest job of the 21st century?

Satyen Sangani (32:01):

Super interesting. I want to switch gears a little bit, though, to the world of data scientists. You had the opportunity to train and teach data scientists from some of the earliest moments. I mean, Columbia was out front in building the Data Science Institute, and you've obviously had the job for a long time. Twelve years ago, I think, folks described it as the sexiest job of the 21st century. Tell us: how has the job evolved, and what's changed about it? Is that still true, or do you feel like the sheen's worn off a little bit? And what's happened that has been unexpected?

Chris Wiggins (32:31):

I'm pleased with how successful the field of making sense of data on computers has been over the last 12 years. So one thing that's changed is just the names of things. What people meant by machine learning in the early eighties was sort of an outgrowth of artificial intelligence and cognitive science, sprinkled with a little bit of pattern recognition from the computer vision community, and by 2011, machine learning was something not particularly concerned with artificial intelligence; it was definitely a data-driven task. Similarly with data science. In 2001, Bill Cleveland writes this essay about the way they made sense of data on computers at Bell Labs and says: I'm proposing a new field called data science, and academic statistics departments, you should change what you're doing, because academic statistics in 2001 looked very different from what was happening at Bell Labs at the time.

(33:20):

Similarly, you can go read what data science meant in 2009. It was this long list of things which mixes data engineering work, data analytics work, and data strategy work. So one thing that's happened is that things have gotten much more stratified and specialized, so that we now have separate job functions in industry: data science, data analytics, data governance, data engineering, analytics engineering. These are all different functions now, in part because the field has matured and people have specialized in different ways. But I think the other thing that's changed is just people's expectations. People have, I think, a much more rationally optimistic sense about what data is capable of. With any new technology, you go through this inefficient hype cycle as people encounter the technology: they sort of overinflate what they think it can do, either because people are selling it or just because we're hopeful, and then at some point there's a trough of despair as you realize it couldn't do everything that you thought it could do, and then you get to this more efficient point of rational exuberance rather than irrational exuberance. And I've been pleased that people are there now. There are just more and more companies that realize all the different ways that you can use data to make sense of the world, use data to make smarter decisions. So I would say I've been pleasantly surprised with the way things have worked out over the last decade.

Satyen Sangani (34:37):

What are the key skills? What do you teach? What does the curriculum look like?

Chris Wiggins (34:41):

Yeah, it depends on the different employing companies, and I would say the nature of the academia-industry relationship is such that, in large part, industry has set what data science means. Many fields that are in academia but are very applied are certainly influenced by industry; data science is no different. But these days it does still depend on which company you go to. So at The New York Times, data science means developing and deploying machine learning. The fundamental skills that you need for that are some amount of Python and SQL, dealing with relational databases, and some literacy in what's happening today with embeddings as a way of representing complex data sets. Those are the technical skill sets. But the collaboration skill sets that I think are relevant are the ability to understand what people want, and the people in this sentence might be marketing, editors, product people, or a business partner, whose brain is shaped very differently from your brain. After all, in industry the whole point is to make cross-functional teams of people whose skills are varied, and the magic happens where people play complementary roles. And then in academia, what are you teaching people? Well, it's a mix of statistics, computer science, databases, and some amount of building a project. So the senior class that I teach to applied math majors is absolutely about doing projects in small groups, in order that they learn to do something original, to communicate with other people who have different skill sets, and to present that to everybody else. Those, I think, are extremely useful skills in industry.

How has Gen AI impacted data science?

Satyen Sangani (36:03):

How has generative AI changed the role? Is the expectation now that data science expands to understand and incorporate these models and the experience?

Chris Wiggins (36:12):

So there are three ways that I think generative AI has impacted things. One is it's made working with data an option for a set of people who don't know machine learning or statistics. Because many times when we say generative AI, what we really mean is a large machine learning problem that has been solved by somebody else. It means that you can call, sort of like a subroutine or a function call, some other application or an API, and address that application with an input, which is often in natural language and often in text, and get back a result often posed in natural language. Remember, again, it's a set of tokens, which you can incorporate into a larger program. So for many people, you can build very simple things that are mostly front end applications, say, but for the interesting bit, the part that makes sense of data, you don't necessarily need to know a lot of statistics and machine learning; you simply have a function call to the machine learning that has been done by somebody else in the interim.
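Here is a minimal sketch of what that function call looks like from the developer's side. The endpoint, key, model name, and response shape are deliberately hypothetical placeholders, not any specific vendor's API.

```python
import requests

def ask_model(prompt: str) -> str:
    """Send natural language to a hosted model; get tokens back as a string."""
    response = requests.post(
        "https://api.example.com/v1/generate",   # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"model": "some-llm", "prompt": prompt, "max_tokens": 200},
    )
    response.raise_for_status()
    return response.json()["text"]               # assumed response shape

# The caller needs no statistics or machine learning knowledge:
# the model is just a subroutine inside a larger program.
summary = ask_model("Summarize this paragraph in one sentence: ...")
```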

(37:08):

So it's opened up a lot of opportunities for software engineers and developers to sort of leverage somebody else's machine learning. That's one thing. A second thing, slightly more technical, is embeddings. So the idea of embeddings is to take arbitrary unstructured data, like text or documents or images or audio, and use a large language model to render it as a long list of numbers. And once something is a long list of numbers, then we're in a very happy place for mathematics, because we've been making sense of long lists of numbers for centuries, and doing mathematics with numbers for even longer than we've been dealing with data. So embeddings have been really useful for making sense of unstructured data, again, images or documents or something like that. It's made it possible for you to get a statistical result very quickly, because you don't have to spend a lot of time thinking about features or representations, schema, what have you.
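A minimal sketch of that idea, assuming the open-source sentence-transformers library and one of its pretrained models: render each document as a long list of numbers, then do ordinary mathematics (here, cosine similarity) on those numbers.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A pretrained model that maps text to vectors of a few hundred numbers.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["The queen addressed parliament.",
        "A monarch spoke to lawmakers.",
        "The recipe calls for two eggs."]
vectors = model.encode(docs)  # one long list of numbers per document

def cosine(u, v):
    """Ordinary vector math: how aligned are two embeddings?"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vectors[0], vectors[1]))  # similar meaning -> higher score
print(cosine(vectors[0], vectors[2]))  # unrelated topic -> lower score
```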

(37:55):

The flip side of that is you have no interpretability whatsoever. So if you build a machine learning model where the primitive features are embeddings representing some unstructured data, you don't really know how the thing works. You may get a result that's predictive, but you don't really have a lot of ways to drill down and interpret how that machine learning model works. The third way is totally sociological, and that is that because of the excitement about generative AI, many more people are also excited about good old-fashioned machine learning. So there are a lot of people asking, oh, what could data do to solve this problem, who probably wouldn't have been asking that question four years ago. But because everybody is very excited about AI, we who are data practitioners have the opportunity to say to people that it's a great idea to use data. And the right answer might be a classifier, might be logistic regression, it might be a histogram. There are all sorts of ways that making sense of data from well before 10 years ago might be the right tool for the job. But because people are very curious about how data can make sense of the world, there's an opportunity for brand new collaborations and brand new product development.

Embeddings for data practitioners

Satyen Sangani (38:51):

Makes tons of sense. Let's actually go back to problem set number two; I think it's a super interesting question. In the traditional machine learning world, you select features, and those features are essentially then fed to a computer in order to predict or determine something. Now you've of course got these models where the feature selection process is, to your point, in some sense abstracted from you. Tell us a little bit, from a practitioner's perspective: how has that changed the game, and what sorts of problems are more accessible? What decision points do people have to make? I think just understanding a little bit of that would be really helpful to the audience, to understand the differences, where to apply what, and what's new.

Chris Wiggins (39:30):

So for most of the life of machine learning, how to represent the problem on a computer was one of the primary challenges. What is the right set of features? If you take some domain where there's lots of expertise, like biology, for example: in computational biology, a lot of the work, and often the thing that made the difference between a fine paper and a really good paper, was a good choice of how to represent complex structured data in such a way that it became a statistical problem. So for example, in computer vision, you might have a picture and you have to figure out: do I represent it based on the edges or the segments or some other representation of the image data? It was sort of transformative, maybe 10 years ago, for people to start dealing with image data where the features were simply the raw pixels themselves, and you did not need to think about how to represent the image data on a computer.

(40:21):

Again, the flip side of that is you end up with an uninterpretable model. With a model that you represent in terms of understandable features, you can in the end go look at the coefficients and make some argument about why the coefficients should be what they are, and it gives you some sense of how the world works; less so when the features that you're using are simply embeddings. Two historical points that might be useful here. One is, you mentioned deep reinforcement learning. One of the things that made people very excited about deep reinforcement learning was this work by DeepMind back in 2014 or so, I think before it became Google DeepMind, where they showed that you could use deep reinforcement learning to play Atari games. And there the features that were being given to the deep reinforcement learning network were not things like: here's where the character is, or here's where the puck is, or here's the score.

(41:05):

The features were just the raw pixels. So most of the attention was: wow, you can play Atari on a computer; deep reinforcement learning must be great. For me, having been working in computational biology, one of the things that was transformative was: I don't need to think about the features. I could just give it the raw pixels, and a model will figure out what the right features are. So that is really transformative. It's wonderful, because you don't need to think about how do I represent the Atari game; I just give it all the pixels. It's a downside, because you don't necessarily know at the end why it works or what sort of conclusions about the universe you can draw.

Satyen Sangani (41:35):

Yeah, there's just zero. There were some glimpses of explainability in the world of feature selection, and zero in this world of embeddings. So really good and really interesting, and I think really exemplary of why you're a teacher, and clearly a great one.

Advice for future leaders of technology

It's just amazing to hear the explanation. Well, I guess maybe on that last point, the world's changing so fast, everybody's worried about the future of work, where jobs go, how one thinks about labor in this world where the computers and the robots are doing so much more. If you were advising a student who was maybe slightly mathematically inclined, but perhaps even a little scared of the future, where would you advise them to dig? What skills are most critical? What things do you think are going to be most important in this modern world?

Chris Wiggins (42:20):

Yeah, technology means change. So I would say I would advise any students to keep their eyes on where the puck is going rather than where the puck is. And in order to do that, you really need to engage with the technology and know where that jagged line is of what computers can do, what computers still can't do, and what are the ways in which the subjective design choices that people make all the time are still valuable. So I mean, one piece of advice to any student is simply to keep a close eye on where that jagged edge is between what computers can and cannot do. Another thing for anybody in the workforce is to see that disruptions often aren't from technology replacing people's work, but from people who run companies changing where they invest. So the threat that a computer can make a particular economic sector no longer economically viable is enough to make that economic sector no longer economically viable if nobody's going to invest in that anymore, if nobody's going to go into that field, that can be self-fulfilling in a way that the technology itself did not drive, but rather the way that managers and investors responded to that threat can have an order one impact.

(43:26):

For example, it may be that companies are reticent to hire in a particular area like software engineering, not because software engineering is going away or because computers are replacing software engineers, but simply because managers and investors don't exactly know what's going to happen, and they don't want to make the wrong investment or hire a bunch of people that they no longer think they will need. So some of these dynamics really are only moderately coupled to what computers are capable of. They're much more coupled to people's hopes, dreams, and fears about what computers are capable of. And hiring has its own timescale, different from the timescale of technology. So I do think it's difficult for students. It's also difficult for students who are in the research field; just keeping up with the research is extremely difficult, and the volume of papers being produced is extremely large. I have empathy for students who are building a career and building a future by being able to keep up with all of the developments, because it's quite a volume of research being published on these subjects.

Satyen Sangani (44:23):

Yeah, it makes a ton of sense. And so obviously being aware (and I think we have a podcast name in "jagged edges") clearly outlines the desire to have understanding, but also the idea that you have to be aware of these things and curious about them. So Chris, thank you for taking the time. This has been as amazing as I expected it would be, and it's just awesome to hear you explain the world that I think all of us sense but don't quite understand the edges around. So really appreciate your time.

Chris Wiggins (44:44):

Thanks for having me.

(44:53):

The conversation with Chris Wiggins was a vivid journey through the evolution of data science. From data's roots in statecraft to wartime breakthroughs in computing, Chris traced the path to today's advanced technologies. According to Chris, the current wave of generative AI opens a thrilling frontier, one where raw pixels can be transformed into intricate and mysterious models. For those navigating this tech landscape, take note from Chris: stay sharp and adaptable amid rapid changes. It's a brand new world where data practitioners are more than mere analysts; they're pioneers charting unknown territories. I'm Satyen Sangani, CEO of Alation. Data Radicals, keep learning and sharing. Until next time.

Producer (45:29):

This podcast is brought to you by Alation. Your boss may be AI ready, but is your data? Learn how to prepare your data for a range of AI use cases. This white paper will show you how to build an AI success strategy and avoid common pitfalls. Visit alation.com/ai-ready. That's alation.com/ai-ready.