By David Crawford
Published on February 20, 2020
For years we’ve been talking about agile analytics, but the implementations to date have little to recommend them. And our problems seem to revolve around the data lake.
The data lake is the answer to reducing cycle time in analytics—once you have one, you no longer have to wait months for engineering to create your data set before you get started on analysis. In the old waterfall model, analysts spec’d out the schemas needed to answer their questions, and then waited for engineers to ETL data into a warehouse to support them. In the agile model, engineers capture data in the data lake ahead of time, and analysts come along later asking questions from the data that’s already there. So the data lake is our solution to removing engineers from the cycle—the analysts’ cycle of asking and answering questions—so analysts don’t have to rely on another team’s cooperation to get analytics projects done.
While this sounds like just a shift in when data is captured (earlier rather than later), the implications are larger: it creates a dramatically different data environment for analysts.
Before, data was organized to drive analytics; now it's organized for maximum detail. Before, schemas were documented when they were designed; now they're created on the fly without any documentation. Before, data was refined for use; now it's captured raw. On the spectrum from capture to insight, the engineers have taken a step back, creating a gap between the engineers dumping data into a lake and the analysts fishing it out. In the course of removing the engineers from the process, we've shifted a lot of their work to the analysts, in a way that sets them up to fail: analysts often don't have the programming skills and can't possibly have the context to succeed.
What’s perverse is that the more you invest in data capture and agile analytics, the messier it gets, and the harder it is for your analysts to find anything. And the challenge is right up front, before analysts can even get started figuring out the question they’re going to ask.
To bridge this gap, analysts must do two things:
Access (i.e. find, understand, and use) the data they need, whether it sits in the source system, a data lake, or even in Hive
Transform it into the format that will drive insights — generally, but not always, a relational format (a rough sketch of both steps follows this list)
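To make the two steps concrete, here is a minimal sketch of what they might look like in practice. It assumes PySpark as the engine; the lake path, event fields, and output table name are hypothetical stand-ins for whatever raw data your engineers have dumped into the lake:

```python
# A minimal sketch, assuming PySpark and a raw JSON event dump in the lake.
# The path, field names, and table names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("lake-to-relational-sketch")
    .enableHiveSupport()  # lets us reach tables already registered in Hive
    .getOrCreate()
)

# Step 1: access the raw data -- files sitting in the lake (or a Hive table).
raw_events = spark.read.json("s3://example-lake/raw/events/")  # hypothetical path

# Step 2: transform it into a relational shape that can drive analysis.
orders = (
    raw_events
    .filter(F.col("event_type") == "order_placed")
    .select(
        F.col("payload.order_id").alias("order_id"),
        F.col("payload.user_id").alias("user_id"),
        F.to_date("event_time").alias("order_date"),
        F.col("payload.amount").cast("double").alias("amount"),
    )
)

# Persist the refined table where analysts can query it with plain SQL.
orders.write.mode("overwrite").saveAsTable("analytics.orders")  # hypothetical schema
```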
Each of these things requires new tools. The folks at Trifacta, Paxata, and Alteryx (and even IBM and Informatica) are doing great work on transformation and preparation, making it easy for analysts to refine and manipulate data at scale. We here at Alation are working hard on #1. If you want to read about that, check out Satyen’s post.