How to Begin Transforming into a Data-Driven Organization Webinar Highlights

By Reizl Jade Ramos

Speaker: Mr. Warren Cruz | Exist Data Solutions Architect

It has been said that data is the new oil and the new gold, and this speaks to how data is now the differentiator between mediocrity and success in the business landscape. We are now in the 4th Industrial Revolution, wherein data-driven digital technology is the drivetrain of every sort of business operation and customer engagement. But for data to have any business impact, it must become meaningful. The meaning, or value, that an organization is able to derive from its data depends on many factors, and the aim of this talk is to help you turn your data into something meaningful: into actionable, enterprise-transforming insights. The goal is to help you transform not so much into a data-driven organization as into an insights-driven one.

What is Data-Driven Transformation and Can Your Organization Afford to Stay the Same?

What does it mean to be data-driven?

To be data-driven means processes or activities are motivated or spurred on by actual facts as opposed to being driven by mere intuition or personal experience. Decisions are made based on hard evidence and not mere speculation or gut feeling.

But when and how does data become transformational?

Data becomes transformational when the value of being data-driven is embraced by every aspect of the organization, from the top all the way to the bottom. It requires that data be readily accessible, interpretable, and actionable at the point and time of need. It requires this democratization of data to be part of the business process and culture of the organization. That means every part of the organization values data, and key users and stakeholders have easy access to data whenever they need it to make quick and agile business decisions.

What then is a Data-Driven Organization?

A data-driven organization is an enterprise that realizes the value of data and bases its decisions, actions, and processes on hard facts. There is top-level executive buy-in to the data-driven transformation initiative. This organization has invested time and resources in building a data analytics platform that is able to source data from both within and outside the organization.

After ensuring the quality of the data, it is then stored in a central data hub that will function as the organization’s single source of truth. Self-service access to this data is enabled with data security implemented through data governance policies. Business intelligence tools, along with advanced AI/Machine Learning techniques, then turn data into actionable, enterprise-transforming business insights.

What are the 4 Key Value-Drivers of Data-Driven Transformation?

These are the reasons an organization would want to undergo a transformation process toward becoming data-driven:

    • Data helps the business understand and respond to the customer better. It gives insight into when the customer buys, how they buy, and what they buy.
    • Data also helps the organization reimagine and improve business processes. These are decisions such as whether to put up an online retail presence, whether to open a digital payment channel, or whether to enable your business applications to run on mobile devices. Data that is gathered and subsequently analyzed will aid in making these critical decisions.
    • Data enables the organization to identify new opportunities for revenue. For example, your data can show that a certain product line sells more in certain times of the year (like school supplies in May or June, for example). With this insight, you can then augment production and even upsell certain new product lines based on demographic data.
    • Data helps the organization to balance risk and reward. This will come in decisions pertaining to when and how money and resources are spent, most notably in acquisitions and investments. It also involves predictions and forecasting based on AI and Machine Learning models.

Your Organization’s Data Maturity Level

Level 1

Your business is at the 1st maturity level if it cannot answer the question of “where’s the data?” with sufficient conviction. At this level, there is no system for collecting and gathering data into a central repository for ease of operational reporting in aid of decision-making. If decision-support reports are already in place, creating them is usually a very tedious process of sourcing report data directly from the online transactional and departmental systems. Sourcing directly from online systems often degrades their performance, thereby having a negative impact on business.

An organization that does not have at least a data warehouse implementation in place is at this level.

Level 2

Your business is at the 2nd maturity level if it can answer the question of “what happened?” For this question to be answered, your organization must already have a mechanism in place for collecting data from various data sources. This is commonly called data integration or data ingestion and is usually part of a traditional data warehouse strategy. With this, your organization must also be turning out regular operational reports in aid of executive decision-making. These could be weekly or monthly reports on sales performance, customer turnout, customer churn, and the like. Another capability is the ability to serve ad-hoc reports whenever they are requested. These are reports that usually comprise unusual views of the data and often require I.T. involvement for their fulfillment.

If your organization already has a data warehouse in place and you derive not just any report but decision-support reports from it, then your organization is at this level.

Level 3

Your business is at the 3rd maturity level if it can answer the question of “why did it happen?” At this stage, your organization understands not just what happened but what is behind those numbers. Data is usually presented graphically in bar charts, pie charts, and the like, through dashboards and scorecards that enable you to interact with the data and digest it quickly. The ability to slice, dice, and drill into the data is made possible by the many business intelligence and data visualization tools available. Traditional statistical analysis methods are also already employed at this level.

If your organization already uses dashboards and data visualizations and traditional statistical analysis methods, then it is at this level.

Level 4

Your business is at the 4th maturity level if it can answer the question of “what’s likely to happen next?” At this level, your organization is already quite proficient at gathering, storing, and analyzing many data sources, from structured and semi-structured to some unstructured data. The amount of data that you are able to collect and feed into traditional statistical models has increased, enabling results that are more accurate and useful. The ability to make forecasts and engage in predictive analytics is characteristic of this maturity level.

If your organization can already produce effective forecasts and base business strategy on them, then it is at this level.

Level 5

Your business is at the fifth and final maturity level if it can answer the question of “what’s the best possible thing that could happen?” This is what we are aiming for, and if your organization is at this maturity level already, kudos to you! At this stage, all possible data sources are tapped, from structured to semi-structured to unstructured. An organization at this maturity level can collect and store large volumes of data. Advanced analytics techniques involving text analytics, graph analytics, and geospatial analytics, along with other AI/Machine Learning technologies, are employed. Real-time decisions can be made, which can come in the form of:

    1. Rapid deployments of new apps based on analytics outcomes
    2. Preempting industrial accidents through real-time insight into hardware conditions, and
    3. A better customer experience, achieved by offering products the customer did not know he or she actually wanted, or by influencing customer behavior

If your organization can react to customer actions in real-time, then you are at this level.

The Key Components of a Data-Driven Transformational Journey

What are the steps to becoming a more data-driven organization, and how do we execute the action items that will take us in this direction?

Data Sources

There are 3 kinds of Data Sources: structured, semi-structured, and unstructured.

    1. Structured Data – Data that is usually organized into rows and columns and can easily be mapped into predefined fields. It usually comes in the form of data stored in Relational Database Management Systems (RDBMS) like Postgres, SQL Server, and Oracle, and accessed through SQL. Another example of structured data is an Excel spreadsheet. With the structure inherent in its design, relationships between data entities are more easily built, which makes for easier storage, searching, and analysis. Until recently, this was the only type of data available to businesses, but now it accounts for only 20% of all data.
    2. Semi-Structured Data – This type of data has some defining and consistent characteristics without conforming to as rigid a structure as relational database data. The organizational properties of semi-structured data usually come in the form of semantic tags and metadata. Examples of this type of data are XML, JSON, and CSV files. NoSQL databases like MongoDB or CouchDB also store semi-structured data.
    3. Unstructured Data – The majority of data in the world is unstructured. Experts estimate that 80 to 90 percent of the data in any organization is unstructured. It is data that cannot be contained in a row-column database and doesn’t have an associated data model. Examples of unstructured data are email text, photos, video files, audio files, text files, social media content, presentations, PDFs, websites, and more.

An organization matures and becomes more data-driven as its ability to collect, store, and analyze all these types of data increases.
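To make the distinction concrete, here is a minimal sketch (with hypothetical sample records) of turning semi-structured JSON, whose fields are self-describing and optional, into structured rows with a fixed set of columns, as a relational table would require:

```python
import json

# Hypothetical JSON order records (semi-structured): each record carries
# its own field names, and not every record has the same fields.
raw = '''
[{"order_id": 1, "customer": "Ana", "total": 150.0},
 {"order_id": 2, "customer": "Ben", "total": 99.5, "coupon": "MAY10"}]
'''

records = json.loads(raw)

# Flatten into structured rows with a predefined column list;
# fields missing from a record become None (NULL in a database).
columns = ["order_id", "customer", "total", "coupon"]
rows = [tuple(rec.get(col) for col in columns) for rec in records]

for row in rows:
    print(row)
```

The reverse direction is harder: once data is forced into fixed columns, the optional, self-describing structure of the original records is lost, which is why semi-structured sources are often kept in their native form as well.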

Data Ingestion/ Data Integration

Data ingestion, or data integration, is the process of collecting data from many different data sources and combining it in a unified data repository, making it more actionable and valuable to those accessing it.

There are 2 basic ways of implementing data integration:

    1. ETL (Extract, Transform, Load) Batch method – through the use of ETL tools, data is extracted from source systems, transformed, and stored in destination data repositories. The jobs created from these ETL tools are run periodically, usually during off-peak hours, and data is loaded in bulk or in batches and not on a per-record basis.
    2. Streaming – through the use of message-based streaming technologies, data can be loaded into the data repositories in real time, as it is created, on a per-transaction basis.
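
The batch method can be sketched in a few lines. This is an illustrative extract-transform-load pass using in-memory SQLite databases as stand-ins for the source system and the destination repository (in a real deployment these would be, say, an operational database and a data warehouse; table and column names are hypothetical):

```python
import sqlite3

# Stand-in source system with some raw, inconsistently formatted rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [(1, 120.0, " north "), (2, 80.5, "SOUTH")])

# Stand-in destination repository.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales_clean (id INTEGER, amount REAL, region TEXT)")

# Extract: pull the rows accumulated since the last batch run (here, all).
rows = source.execute("SELECT id, amount, region FROM sales").fetchall()

# Transform: normalize the region field.
clean = [(i, amt, region.strip().lower()) for i, amt, region in rows]

# Load: bulk-insert the whole batch, not record by record.
target.executemany("INSERT INTO sales_clean VALUES (?, ?, ?)", clean)
target.commit()

regions = target.execute("SELECT region FROM sales_clean ORDER BY id").fetchall()
print(regions)
```

A scheduler (cron, or the ETL tool's own) would run a job like this periodically during off-peak hours.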

A classic example is fraud detection. A bank with a high data maturity level would have a process in place where a withdrawal made at an ATM provides a real-time update to the central data warehouse, which can then analyze the transaction for fraud patterns and provide immediate feedback for action if needed. The same process can also be applied to real-time targeted marketing, where a customer who makes a withdrawal can be alerted to sales or promotions near the area based on his or her personal profile.
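A toy version of that streaming fraud check might look like the following. Each withdrawal event is analyzed the moment it arrives, and an alert fires when one card makes more than two withdrawals within a 60-second window (the rule and thresholds are hypothetical; a real system would consume events from a platform like Kafka):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (hypothetical)
MAX_WITHDRAWALS = 2   # allowed withdrawals per window (hypothetical)

# card_id -> timestamps of that card's recent withdrawals
recent = defaultdict(deque)

def process_event(card_id, timestamp):
    """Analyze one withdrawal as it arrives; return True if suspicious."""
    window = recent[card_id]
    window.append(timestamp)
    # Drop events that have aged out of the sliding window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_WITHDRAWALS

# Simulated event stream: (card_id, seconds since start)
events = [("card-1", 0), ("card-1", 10), ("card-2", 15), ("card-1", 20)]
alerts = [card for card, ts in events if process_event(card, ts)]
print(alerts)  # the third rapid withdrawal on card-1 triggers an alert
```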

Licensed and Cloud Services

Pros:

    • Support
    • Features (future proof)

Cons:

    • Cost

Licensed Tools:

ETL Batch

    • Talend
    • TIBCO
    • Informatica
    • Pentaho
    • SAP

Streaming

    • Confluent – No.1 provider of enterprise Apache Kafka

Cloud Service Tools:

    • Azure Data Factory
    • AWS Glue

Open-Source:

Pros:

    • Free/Low cost
    • Can be used right away in your data integration initiatives
    • Might be enough for your need

Cons:

    • Support
    • Features

Open-Source Tools:

ETL Batch

    • Talend
    • Apache NiFi
    • JasperSoft
    • Pentaho

Streaming

    • Kafka
    • RabbitMQ
    • Apache ActiveMQ

Data Quality

The next thing to ensure is that the data which we will be collecting is accurate, complete, consistent, timely, unique, and valid in order for them to be of true value to the business. To do this, we need to have a Data Quality process in place.

Data quality is the process of conditioning data to meet the specific needs of business users.

The main characteristics of a Data Quality initiative are:

    • Data integration must be in place. You cannot effectively have quality data to be used in business analytics without integrating it from various sources, and the data quality process must be ingrained in the data integration process.
    • Your Data Quality process can be comprehensive wherein it is thoroughly embedded into every business process of the enterprise with data stewards tasked with curating the data, or it can be minimalist, figuring only in the data integration process.
    • There are distinct Data Quality software tools for the comprehensive approach, while the minimalist approach can make use of custom apps or even just the features built into the data integration tool.
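A minimalist data-quality step embedded in ingestion might look like this sketch, which applies row-level completeness, uniqueness, and validity checks before data reaches the central repository (the field names and rules are hypothetical):

```python
# Hypothetical incoming rows, some of which fail a quality rule.
rows = [
    {"id": 1, "email": "ana@example.com", "amount": 120.0},
    {"id": 2, "email": "", "amount": 80.5},                  # incomplete
    {"id": 2, "email": "ben@example.com", "amount": 99.0},   # reuses id 2
    {"id": 3, "email": "cho@example.com", "amount": -5.0},   # invalid amount
]

seen_ids = set()
clean, rejected = [], []
for row in rows:
    complete = bool(row["email"])          # completeness: no empty email
    unique = row["id"] not in seen_ids     # uniqueness: id not yet accepted
    valid = row["amount"] >= 0             # validity: non-negative amount
    if complete and unique and valid:
        seen_ids.add(row["id"])
        clean.append(row)
    else:
        rejected.append(row)

print(len(clean), "accepted,", len(rejected), "rejected")
```

Rejected rows would typically be routed to a quarantine table for a data steward to review rather than silently dropped.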

Licensed and Cloud Services

Pros:

    • Support
    • Features

Cons:

    • Cost

Licensed Tools:

    • Talend Data Quality
    • TIBCO Data Quality
    • Informatica Data Quality
    • SAP
    • SAS
    • Oracle

Cloud Service Tools:

    • Azure Data Factory

Open Source

Pros:

    • Free/Low cost
    • Can be used right away in your data quality initiatives
    • Might be enough for your need

Cons:

    • Support
    • Features

Open-Source Tools:

    • Talend Open Studio for Data Quality
    • Pentaho 
    • Match
    • Datamartist
    • WinPure

Datahub/Warehouse

The Data Hub is the heart of a modern data analytics platform. While it is true that both a database and a Data Hub store data, the Data Hub is so much more:

  • It creates visibility into all data available in the organization – It provides views into data that would otherwise have been inaccessible to many different business users. A database implementation is usually specific to an aspect of the enterprise, not the whole thing.
  • It has centralized data governance – Data ownership, usage, and sharing are made easier in a Data Hub. Based on business-defined rules, every stakeholder who needs access to data in order to do his or her job well will have access to pre-curated data views.
  • Metadata management – The ability to create many forms of metadata on your data, with the use of search indices, domain glossaries, and data catalogs, makes data visibility and access easier.
  • Read & Load at Scale – The Data Hub usually comprises a cluster of compute resources and storage working together, enabling users to read and load data at speeds otherwise unachievable in non-MPP (massively parallel processing) systems.
  • Advanced Analytics – The Data Hub must inherently possess, or be easily integrated with, an advanced analytics engine. The ability to easily create AI and Machine Learning models enables the Data Hub to deliver the prescriptive and predictive analytics that give the organization an edge.

The key difference is that a traditional data warehouse deals only with structured, relational, SQL data, and the analytics that can be done on it are limited to the business intelligence reporting common to descriptive analytics. A Data Hub, on the other hand, deals with all kinds of data, be it structured, semi-structured, or unstructured, and is naturally suited to all types of analytics, be it descriptive, prescriptive, or predictive.

Licensed and Cloud Services

Pros:

    • Support
    • Features
    • Critical Component

Cons:

    • Cost

Licensed Tools:

    • VMWare Tanzu Greenplum
    • Oracle Exadata
    • MS Parallel Data Warehouse
    • Teradata
    • Cloudera
    • SAP

Cloud Service Tools:

    • Azure Data Service
    • Snowflake

Open Source

Pros:

    • Free/Low cost
    • Can be used right away in your Data Hub initiatives
    • Might be enough for your need

Cons:

    • Support
    • Features

Tools:

Datahub

    • Community Greenplum
    • Hadoop

Data Warehouse

    • Postgres
    • MariaDB

Data Consumer

These are the various ways by which your organization can access, analyze, and derive insights from the data in the Data Hub. This is the stage at which data becomes insight, and there are two ways of approaching it:

1. Business Intelligence – a discipline that combines business analytics, data visualization, data mining, data tools, and infrastructure to enable organizations to make data-driven decisions that drive change, eliminate inefficiencies, and quickly adapt to market dynamics.

There are 2 types of Business Intelligence:

    • Traditional Business Intelligence – This is the approach where business intelligence is driven by I.T. and most, if not all, analytics questions are answered through static reports.
    • Agile (Self-Service) Business Intelligence – This is the modern way of doing business intelligence, characterized by the ability of users to access data to create their own reports and dashboards. The role of I.T. then shifts toward data governance, managing data security, and access. With the proper software, users are empowered to visualize data and answer their own questions. Most of the software in this category has some AI/ML built in to facilitate the creation of advanced analytical models that make for better prescriptive and predictive analysis.

2. Applications – Custom applications built in Java or any other programming platform can also access the data in our Data Hub, as in the traditional client-server model, among other architectures.

Licensed Tools:

    • TIBCO Spotfire
    • Tableau
    • PowerBI
    • Pentaho

Open-Source:

    • Jaspersoft Reports
    • Saiku
    • Pentaho

Putting It All Together: An Open-Source Data Analytics Platform in the Cloud

Identify your data source

Determine the types of data sources available in your organization, the kinds you think are needed for business analytics and decision-making. These are often relational, SQL online transaction processing systems. Organizations with a more advanced data maturity level can even leverage data from social media and other external systems.

You can try migrating a subset of your data into a test database and ingesting data from there. You can also source directly from CSV files, flat files, spreadsheets, or even Hadoop.
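Sourcing from a CSV file into a test database can be as simple as the following sketch (SQLite stands in for the staging database, and the file contents are hypothetical sample data):

```python
import csv
import io
import sqlite3

# Hypothetical CSV extract; in practice you would open() a real file.
csv_text = "id,product,qty\n1,notebook,10\n2,pencil,25\n"

# Stand-in test/staging database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging_sales (id INTEGER, product TEXT, qty INTEGER)")

# DictReader yields one dict per CSV row, keyed by the header line,
# which maps directly onto named insert parameters.
reader = csv.DictReader(io.StringIO(csv_text))
db.executemany(
    "INSERT INTO staging_sales VALUES (:id, :product, :qty)",
    list(reader),
)
db.commit()

count = db.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(count, "rows staged")
```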

Setup your ETL

You can either provision a Linux bare-metal machine or a cloud compute instance to function as your ETL Server. A 4-core, 8GB RAM machine should suffice.

Setup your Datahub

You can also provision 3 instances of either Linux bare-metal machines or cloud compute instances to act as your Data Hub. We will be using Greenplum as our Data Hub platform, which will consist of 1 Master Host and 2 Segment Hosts. The Master Host can have 8 cores with 16GB RAM, while each Segment Host can have 8 cores with 24GB RAM.

Connect your BI tools/Apps

You can make use of Jaspersoft Reports as your BI tool.

There are three (3) steps that we can go through to integrate the various components into a simple, open-source base of a modern data analytics platform.

    • Create ETL jobs using your Talend Open Studio for Data Integration software. An ETL job is a process that you will run periodically which extracts data from data sources and bulk loads them into destination data repositories. 
    • Make sure that the tables that will receive data have already been defined in your Greenplum database. You will usually have a staging area for storing cleaned-up raw data from the data sources, a data warehouse area with fact and dimension tables, and sometimes data marts, which are small subsets of the data warehouse catering to specific users.
    • After populating the data warehouse and/or the data marts, you can configure your Jaspersoft Reports to get data from the data marts and display it in reports and visualizations.
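
The staging-to-warehouse step in the list above can be sketched as follows. This is an illustrative run using SQLite in place of Greenplum (Greenplum is Postgres-based, so the SQL is similar in spirit); the staging, dimension, and fact table schemas are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Staging area: cleaned-up raw rows from the data sources.
CREATE TABLE staging_sales (sale_id INTEGER, product TEXT, amount REAL);
-- Warehouse area: one dimension and one fact table.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT UNIQUE);
CREATE TABLE fact_sales (sale_id INTEGER, product_key INTEGER, amount REAL);

INSERT INTO staging_sales VALUES (1, 'notebook', 50.0), (2, 'pencil', 12.5),
                                 (3, 'notebook', 75.0);

-- Populate the dimension with the distinct products seen in staging.
INSERT INTO dim_product (product) SELECT DISTINCT product FROM staging_sales;

-- Populate the fact table, resolving each row to its surrogate key.
INSERT INTO fact_sales
SELECT s.sale_id, d.product_key, s.amount
FROM staging_sales s JOIN dim_product d ON d.product = s.product;
""")

# The kind of query a data mart or BI report would then run.
totals = db.execute(
    "SELECT d.product, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_product d ON d.product_key = f.product_key "
    "GROUP BY d.product ORDER BY d.product"
).fetchall()
print(totals)
```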