One of the most significant trends in big data and analytics is the concept of the big data lake. In fact, according to EMC, data lakes are gaining momentum as scalable repositories for critical data to be used for predictive Big Data analytics.
In this video from the recently concluded 2014 Strata + Hadoop World conference in New York City, Edd Dumbill, vice president of strategy at Silicon Valley Data Science, talks about the concept of the “data lake,” its potential, and what it will take for it to become a large-scale reality.
Here’s a partial transcript of the interview:
What is a data lake?
Essentially, the data lake is the idea of putting all your data, whether it’s raw or processed, into one repository that can be distributed and scalable, and that makes data much more accessible to applications and people inside an organization.
If the data lake becomes sort of the default model, what types of opportunities do you think that will create?
Well, I think primarily we’re looking at organizational agility being a big one of them. But, you know, the data lake as a model is part of a larger platform, really, where you have the data stored in a repository where it can easily be accessed and used to create new applications. It is sitting on top of the foundation of cloud infrastructure, of DevOps, and, you know, a lot of the things that we do with open source. That means organizations will be able to set up infrastructure for new applications much more quickly. You know, at the high level we’re looking at developing in an agile sense; we’re not looking at three-year projects anymore.
If we’re working with data, we need to be ready for the fact that data is gonna come up with new opportunities and present new ways, new products. We may discover things and find that we want to take things a different way. So it’s not just the technology; it’s this combination of working in an agile sense and also this idea that having data isn’t the end of the story. It’s not just a big hole that we put things into. We get the raw data and get the benefit of being able to revisit assumptions, but we also put our processed data in there and expose that, as APIs essentially, back into the company so it can be a building block of new things.
You know, the old way, you built application A to do thing A and application B to do thing B, and they all had assumptions about the data and threw different things away. But now, if you can espouse the model that Amazon famously has, you know, they don’t build any functionality without exposing it as a service. Well, think about that in the data sense as well: don’t create any data without making it available, in a controlled and useful way, back to the organization. So I think, done right, this is a powerhouse for agility, a powerhouse for more invention, and for really creating value from data rather than seeing it as a cost center.
What are the most important changes you’ve seen in the data space?
This is one I’d like to be able to say, you know: I do think people are starting to realize that it’s not a one-tool-fits-all solution. You know, at the very beginning we had lots of talk about NoSQL, and it was like, throw everything out the window. I think one of the most important realizations that the industry has had is that you pick the right tool for the job, because data is subtle: some data moves fast, some data moves slow, some requires complex processing. And we understand that actually there’s never gonna be one database, one thing you can get shrink-wrapped and expect everything to work.
Your data is the image of your company, your business, and as such it’s unique; it has quirks and special requirements, and therefore you should choose the tools and the things you do to match that. You know, the old way of thinking about data is as faster paper: it’s just automation of existing processes, regular, very predictable. In the big data era we’re using data to create value. It’s very different; you pick the right tool for the job. It’s about your business and whether you’re gonna be competitive economically. So I think there’s a maturity of understanding: there is no panacea, and you need to be intelligent about what you do.