By Venky Ganti, Ph.D.
Published on February 20, 2020
In a prior blog post on challenges beyond the 3V’s of working with data, I discussed some issues that hinder the efficiency of data analysts and drastically raise the bar on the motivation they need to begin working with new data. Here, I want to drill down into a couple of those issues and my past experience with them.
Let’s consider the scenario where an engineer or a data analyst inside Google wants to find relevant data, say, a table in Dremel or an SSTable on GFS. She still has to remember the name of the table and which of Google’s myriad data stores contains it. Further, unlike documents, which are self-describing, it is not easy to “understand” what is inside a dataset and how to use it. The user needs to understand the data by talking to people who know about it, or through some alternative means. Contrast the effort an engineer within Google spends finding and understanding data with the effort an external user spends using Google to find and understand information on the web.
Let me recall one of my own frustrating experiences with a similar scenario. When I worked on the AdWords team at Google, I needed to find information about search queries that led to similar user behavior on Google’s products, specifically Search and Ads. I felt that there must be several such datasets in the Search and Ads teams. I found two in the Ads teams because I knew someone who worked on those projects. But after further investigation, it turned out that I could not use either because of differences in target applications. And I had little luck finding similar information from the Search teams. I tried building my own, spent months, and didn’t succeed. Recently, after I left Google, an ex-colleague told me he had chanced upon a pointer to the right data and successfully used it!
Of course, these problems around finding and understanding data are not peculiar to Google; they exist at any organization that leverages data to enhance its decision-making and its products. In general, an engineer at Google has a better chance of overcoming these problems thanks to awesome internal tools (e.g., code search).
Much of the technology related to data has focused on enabling the processing of massive amounts of data and on visualizing results better. But there is no focus on empowering users to find and understand data within these databases so that they can prepare queries and programs more reliably and efficiently.
The primary reason for the lack of focus on these issues, in my opinion, is that it is much more concrete to measure and show progress on query processing efficiency and visualization capabilities. On the other hand, it is hard today to articulate the benefits of helping data users find and understand data. By the way, wasn’t this true for Search over the web until Google came along and illustrated the economic and productivity gains across a wide spectrum of users? I believe that we are on the cusp of a similar revolution in data consumption.
After an analyst finds a dataset, she needs to understand how other analysts and applications use it. Often, it is very hard to find such knowledgeable users. There were many times when I found it quite hard, even at Google, to identify the people I needed to talk to about such questions; when I did find them, I felt the pain of distracting engineers with run-of-the-mill questions that they must have answered many times over.
As an example, I was responsible for migrating an application from reading data in one engine to reading it in a newer, more robust engine. A big part of the migration involved rewriting queries to read from the new schema. I was among the last few to do this migration, so similar questions must already have been answered. But the wiki that I was pointed to didn’t have all the information I needed. So I had to drag myself, very reluctantly, to a very busy principal engineer, the only one I knew directly, to get help. I would have greatly appreciated being able to quickly find someone else who had gone through a similar migration.
On the flip side, I would answer the same set of questions over and over about data that I produced and maintained. I tried creating a wiki page, but was still asked lots of questions. As we all know, this approach comes with its own challenge: keeping the wiki updated and reliable over time. In retrospect, I wouldn’t be surprised if my colleagues or I missed a few updates.
So, how much time do analysts actually spend on these activities of finding and understanding data? I haven’t tried measuring this yet. We just don’t have the methodology and the tools to do it. But depending on whom you ask and which data they need to use, the answer varies widely. Users new to a particular dataset will spend upwards of 80% of their time on these tasks, while experts spend much less. However, experts spend time answering other users’ questions over and over.