“By 2020, IDC expects the annual data-creation total to reach 40 ZB, which would amount to a 50-fold increase from where things stood at the start of 2010.”
We have entered the era of “big data” where an unprecedented amount of digital information is being generated every day. If this information can be captured, managed, processed and used effectively, it holds great potential value for a variety of purposes.
For a quick overview of big data, here’s a summary of my interview with one of our big data architects here at Exist.
What is big data?
The simplest way to describe the concept of big data is to put it in the context of Gartner's "3Vs" model (volume, velocity, variety).
Gartner defines big data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Volume refers to the amount of data, velocity refers to the speed of data flowing in and out of the systems and variety refers to the range of data types (structured and unstructured) and sources.
“When you go for big data, your comparison would be like one large machine vs multiple machines, which is the typical vertical vs horizontal scaling,” described Adrian Co, architect at Exist.
“If the machine can no longer handle your data effectively, then I think that’s where big data comes in.”
When do you use big data?
The key to determining when to use a big data approach is to assess your needs. Evaluate the 3Vs of your data, and look at its projection: the expected data size and its growth rate. If, say, your expected data is on a terabyte scale and growing at gigabytes per day, you should seriously consider leveraging a big data approach.
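To make that projection concrete, here is a quick back-of-envelope sketch. All the numbers are made-up examples, not figures from the interview:

```python
# Back-of-envelope projection: how long until one machine is outgrown?
# All numbers below are hypothetical, for illustration only.
current_tb = 2.0            # current data set, in terabytes
daily_growth_gb = 50.0      # new data arriving per day, in gigabytes
machine_capacity_tb = 10.0  # what a single large machine can hold

remaining_gb = (machine_capacity_tb - current_tb) * 1024
days_to_capacity = remaining_gb / daily_growth_gb
print(f"Single machine outgrown in ~{days_to_capacity:.0f} days")
# ~164 days -- under half a year before vertical scaling hits its limit
```

If the answer comes out in months rather than years, horizontal scaling (and the complexity that comes with it) starts to look worthwhile.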
Big data is an investment. It is something you need to be sure about, since its added complexities (e.g. managing, debugging, troubleshooting) might kill your schedule.
From a business and development perspective, you should figure out what you need to accomplish and decide whether it makes sense to invest in it immediately.
Adrian shared several factors to consider such as:
- What is the projection of your data?
- Can your schedule accommodate the complexity?
- Do you have the resources?
The assessment should be done on a per-domain basis. The 3Vs can serve as a guideline, but of course it will still depend on your business challenges and resources.
"I can recommend it for enterprises that handle high volumes of non-transactional data," added Adrian.
“And as enterprises advance in their adoption of big data + the maturity of the technology, I think this is a good time to do big data.”
What are the challenges in big data?
The main challenge in big data is complexity — in management, in scaling, and in the initial setup.
Adrian revealed that there are some things big data has a hard time addressing, particularly maintaining consistency of data across clusters.
"It's not impossible, but from a technological perspective, it's difficult," added Adrian.
Other challenges include privacy, skill shortages, and securing access and deployment.
What are the opportunities in big data?
The challenges of big data remain, but the opportunities are even greater. McKinsey calls big data “the next frontier for innovation, competition and productivity.”
"From a business perspective, insights definitely. More data means more information that can translate into value in terms of gaining customer insights and identifying trends," shared Adrian.
The advent of big data enables a low-cost way of storing, managing, and even processing vast amounts of disparate data — giving any company information it can use to gain competitive advantage, enhance productivity, and deliver business value.
Big data also enables easy and timely retrieval and analysis of related and unrelated information. This is especially crucial for sectors such as government, healthcare, and retail.
What are the tools used in big data?
“When you say tools, it can be the platform such as Hadoop, NoSQL databases such as Cassandra, HBase and MongoDB, and massively parallel processing (MPP) databases which are used for data storage, processing and analytics,” explained Adrian.
Hadoop is one of the most popular Open Source platforms for working with big data. It is a Java-based programming framework that enables the processing of large data sets in a distributed computing environment. Leading providers of Hadoop include Cloudera, Hortonworks, MapR and Greenplum.
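Hadoop's programming model is MapReduce: a map phase emits key-value pairs in parallel across the cluster, and a reduce phase aggregates them by key. The following is a toy word-count sketch in plain Python — not actual Hadoop code, just an illustration of the map/reduce idea that Hadoop distributes across many machines:

```python
from collections import defaultdict

def map_phase(lines):
    # Like a Hadoop mapper: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Like a Hadoop reducer: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big value", "data everywhere"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'value': 1, 'everywhere': 1}
```

In a real Hadoop job, the framework handles splitting the input across nodes, shuffling the mapper output to reducers, and recovering from machine failures — which is precisely the complexity a single-machine script never has to face.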
Proprietary vendors such as Oracle, IBM and HP also offer stacks of big data tools.
But so far, Open Source big data tools have become the preferred choice among enterprises, since they are more cost-effective and flexible.
NoSQL (Not Only SQL), on the other hand, is loosely defined as a set of database management systems that follow a more non-relational approach to data representation and do not require a fixed table schema.
According to an article on Wikibon, NoSQL and Hadoop technologies usually work in conjunction. It cited HBase, a popular NoSQL database modeled after Google BigTable, as an example: it is often deployed on top of the Hadoop Distributed File System to provide low-latency, quick lookups in Hadoop.
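To see what "no fixed table schema" means in practice, consider two records stored in the same collection of a document store such as MongoDB. The field names and values below are invented for illustration:

```python
import json

# Two documents in the same "collection" with different fields.
# A NoSQL document store accepts both as-is; a relational table
# would require every column to be declared up front.
users = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "phones": ["555-0100", "555-0199"], "vip": True},
]

for doc in users:
    print(json.dumps(doc))
```

Each document carries its own structure, which is what makes NoSQL a natural fit for the "variety" dimension — mixed, evolving, semi-structured data.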
Now, since there are many big data tools available on the market, the real challenge is choosing the right one.
“It is a balancing act from a business perspective — from choosing the right tool to finding the right resources and delivering on schedule,” said Adrian.