Data has always been the key factor in business computing. However, the role that it plays has evolved throughout the years. These evolutionary epochs are generally termed the three waves of data management.
Wave 1: The Rise of Relational
In the first wave, we see the emergence of the relational model and relational database management systems as an improvement upon the flat file data store. The advantage of a structured query language (SQL) for extracting data from the database enabled businesses to derive value from their data more easily.
Data in this era was used to support specific business processes and applications.
Data served the application.
Wave 2: Eyeing the Enterprise
In the second wave, data is used in a more enterprise-wide fashion. Here we see the emergence of unstructured data in the form of documents, web content, images, audio, and video in Enterprise Content Management (ECM) systems. Other applications include Enterprise Resource Planning (ERP), supply chain management, and the like.
Data served the enterprise.
Wave 3: The Tsunami of Data
We are currently in the third wave. Vast improvements in cost efficiency across storage, network speed and reliability, memory, and overall computing capability have paved the way for the emergence of Big Data.
Simply put, Big Data is the ability to gather very large amounts of all kinds of available data (structured, semi-structured, unstructured) at various latencies (even real-time), profile the data, catalog the data, and parse/prepare the data for analysis, all done in a distributed file and processing architecture.
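The intake steps named above (gather mixed data, profile it, catalog it, prepare it for analysis) can be sketched in a few lines. This is a minimal illustration of the concepts only; the function and field names are invented for this sketch and do not come from any product's API.

```python
# Minimal sketch: profile and catalog a batch of mixed records
# (structured, semi-structured, unstructured), as described above.
from collections import Counter


def infer_kind(record):
    """Classify a record by its shape (an illustrative heuristic)."""
    if isinstance(record, dict):
        return "structured"
    if isinstance(record, (list, tuple)):
        return "semi-structured"
    return "unstructured"


def profile(records):
    """Count how many records of each kind arrived in this batch."""
    return dict(Counter(infer_kind(r) for r in records))


def catalog(records):
    """Build a tiny catalog entry per record: an id plus its inferred kind."""
    return [{"id": i, "kind": infer_kind(r)} for i, r in enumerate(records)]


batch = [
    {"order": 1, "amount": 9.99},      # structured
    ["2024-01-01", "click", "home"],   # semi-structured
    "free-text log line",              # unstructured
]
print(profile(batch))  # {'structured': 1, 'semi-structured': 1, 'unstructured': 1}
```

In a real distributed architecture these steps run across many nodes; the sketch only shows what "profile" and "catalog" mean at the record level.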
Data in the third wave is front and center. It now transforms business processes (see Wave 1) and creates new business models (see Wave 2).
Data powers digital transformation.
Wipeout Points with Big Data
The following are some pain (wipeout) points with Big Data:
1. Functionality and performance gaps of processing engines on Hadoop – These frameworks (such as MapReduce, Hive on Tez, and Spark) are good for certain use cases but fall short of the core functional and performance requirements of big data integration.
2. The need for faster, more flexible development – a big data journey should be lean and agile, focusing on automation, reusability, and data flow optimization.
3. Difficulty searching data assets in Hadoop and the enterprise – a solution that enables easy search and discovery of relevant data sets is not readily available, yet the question must be answered: how do I find my data and understand its relationships?
Ride the Wave with Informatica
Informatica has been the leader in data management through Wave 1 and Wave 2.
In Wave 1, Informatica pioneered and defined the ETL and data integration categories, and it remains the market leader in these areas.
In Wave 2, as data became enterprise-wide, Informatica added data quality, master data management, cloud integration, data masking, and data archiving to its solution portfolio, and it is the market leader in each of these categories.
Hanging Ten with Informatica Big Data Management
With the arrival of YARN, the capability to build custom application frameworks on top of Hadoop to support multiple processing models was realized. What Informatica Big Data Management (BDM) did was combine the best of open source (i.e., YARN) and 23 years of data management experience to build out Informatica Blaze.
So what is Blaze? You can look at Blaze as a cluster-aware, data integration engine for Hadoop—built using in-memory algorithms, all in C++—for Big Data batch processing. It’s integrated with YARN, so you can expect it to be a very scalable and very fast, high-performance distributed processing engine for Hadoop.
But does Blaze replace the other Big Data processing engine frameworks? Does it replace MapReduce, Tez, or Spark? The answer is no. Blaze complements the capabilities of the other processing engines, because no single solution covers all Big Data batch processing use cases.
To overcome the functional gaps of the other processing engines, Informatica exposed its transformation libraries, built over 23 years, to the Hadoop ecosystem's distributed processing platform through the Blaze engine. This opened the floodgates to much of Informatica's functionality: not just core transformations such as joiners, aggregators, and lookups, but also the complex data integration, data quality, data profiling, and data masking transformations. The result is that complex ETL processing becomes much easier to implement in a Hadoop ecosystem. On the performance side, Informatica made Blaze an in-memory processing engine built purely in C++.
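To make the core transformations named above concrete, here is a hedged sketch of a joiner, an aggregator, and a lookup expressed as plain Python functions. These mirror the concepts only; Informatica's actual transformations are configured in its tooling and executed by Blaze, not written like this.

```python
# Illustrative joiner / aggregator / lookup transformations
# over lists of dicts (standing in for rows).
from collections import defaultdict


def joiner(left, right, key):
    """Inner-join two row sets on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]


def aggregator(rows, group_key, agg_field):
    """Sum agg_field per distinct group_key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[agg_field]
    return dict(totals)


def lookup(rows, table, key, field):
    """Enrich each row with a field looked up from a reference table."""
    ref = {t[key]: t[field] for t in table}
    return [{**row, field: ref.get(row[key])} for row in rows]


orders = [{"cust": "a", "amount": 10.0},
          {"cust": "a", "amount": 5.0},
          {"cust": "b", "amount": 7.5}]
customers = [{"cust": "a", "region": "EMEA"},
             {"cust": "b", "region": "APAC"}]

joined = joiner(orders, customers, "cust")
print(aggregator(joined, "region", "amount"))  # {'EMEA': 15.0, 'APAC': 7.5}
```

In a distributed engine the same operations run partitioned across nodes; the single-process sketch is only meant to show what each transformation computes.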
If you execute a mapping on the Hadoop cluster, will it automatically default to the Blaze engine? Not necessarily. Informatica BDM includes a key innovation for the Hadoop ecosystem called the Smart Executor. It is a polyglot engine: it understands multiple execution frameworks, reflecting the fact that no single technology solves all Big Data integration use cases. The Smart Executor automatically, dynamically, and intelligently selects the best execution engine for the data based on parameters such as the mapping, the workload type, and the infrastructure configuration. It optimizes the mapping and, given the cluster configuration, determines which engine will run it fastest.
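The "pick the best engine" idea can be sketched as a simple rule-based selector. The rules below are invented purely for illustration; the real Smart Executor weighs the mapping, workload type, and cluster configuration internally, and its actual heuristics are not public in this form.

```python
# Hedged sketch of engine selection, with made-up rules.
def choose_engine(mapping):
    """Return a processing engine name based on illustrative rules."""
    if mapping.get("uses_complex_transformations"):
        return "Blaze"        # assumption: transformations Blaze exposes
    if mapping.get("workload") == "iterative":
        return "Spark"        # assumption: in-memory iteration suits Spark
    if mapping.get("data_size_gb", 0) > 1000:
        return "Hive on Tez"  # assumption: very large batch scans to Tez
    return "MapReduce"        # conservative default

print(choose_engine({"uses_complex_transformations": True}))  # Blaze
print(choose_engine({"workload": "iterative"}))               # Spark
```

The point of the sketch is the shape of the decision, not the specific rules: one submission interface, several engines behind it, and a policy that routes each mapping to the engine expected to run it best.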
As the graph above indicates, Informatica Blaze is faster than Spark and Hive on MapReduce. But why?
With its multi-tenant architecture, Blaze runs concurrent jobs on a single Blaze instance, which translates to optimized resource utilization and sharing among jobs. Even if you submit a thousand mappings for execution, Blaze launches only one YARN application to serve them. And, as mentioned earlier, Blaze is written in C++, giving it better memory management than Java-based frameworks.
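The multi-tenancy pattern (many jobs sharing one long-lived application instance, rather than each job launching its own) can be sketched with a shared worker pool. This is purely illustrative; Blaze's actual grid manager on YARN is far richer than this, and the class and field names here are invented.

```python
# Sketch: one shared "application instance" serving many concurrent jobs.
from concurrent.futures import ThreadPoolExecutor


class SharedEngineInstance:
    """A single long-lived instance that many mappings submit to."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.launch_count = 1  # one application launch, reused for every job

    def submit(self, mapping_fn, *args):
        """Queue a mapping onto the shared instance's workers."""
        return self.pool.submit(mapping_fn, *args)


engine = SharedEngineInstance()
futures = [engine.submit(pow, n, 2) for n in range(1000)]
results = [f.result() for f in futures]
print(engine.launch_count, len(results))  # 1 1000
```

A thousand submissions, one launch: that amortized startup cost and shared resource pool is the benefit the multi-tenant design is after.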
Blaze also uses the Data Exchange Framework (DEF) for the shuffle phase: an in-memory framework that shuffles data among the nodes without losing recoverability, a key capability for Big Data processing engines.
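A shuffle phase like the one DEF performs can be sketched as hash-partitioning keyed records across nodes so that all records with the same key land on the same node. This is a toy, single-process illustration only; DEF's in-memory transport and recovery mechanics are not modeled here.

```python
# Sketch of a shuffle: route each (key, value) record to a
# "node" (partition) by hashing its key.
def shuffle(records, num_nodes):
    """Hash-partition keyed records across num_nodes partitions."""
    partitions = [[] for _ in range(num_nodes)]
    for key, value in records:
        partitions[hash(key) % num_nodes].append((key, value))
    return partitions


data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = shuffle(data, 3)

# Every record with key "a" ends up on exactly one node:
nodes_for_a = {i for i, p in enumerate(parts) for k, _ in p if k == "a"}
print(len(nodes_for_a))  # 1
```

Co-locating equal keys is what makes downstream joins and aggregations possible; doing the exchange in memory, with recovery if a node fails mid-shuffle, is the part DEF adds that this sketch omits.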
Safely Back to Shore
What your business does with data will determine whether it will wipe out and sink to the bottom or ride the wave all the way back to shore.
With Informatica and Informatica Big Data Management, you can be assured that your data will drive the digital transformation needed to keep your business empowered rather than floundering.