“No, you can’t always get what you want. You can’t always get what you want. You can’t always get what you want.”
Thus echoed the famous Rolling Stones tune of 1969. The principle may be true in many cases, but when it comes to the Data Integration needed for turning up actionable business insights, getting at all the Data Sources that you want can actually be an “always” kinda thing. Let me explain.
What is Data Integration?
To put it simply, you want access to all the data that are relevant to your organization—data that will enable you to do the analytics that drive business-building decisions. Having access to the data, or Data Integration, can mean two things:
- You haul the data off from various data sources into a central data hub, or…
- You directly interface with the data in-place (where they reside) from a data access platform
The first involves your classic ETL (Extract-Transform-Load) paradigm, and may also include ELT (Extract-Load-Transform), mostly used in dealing with unstructured data, and real-time, event-driven streaming. The second is what’s become a buzzword nowadays called, Data Federation or Data Virtualization.
What if I told you that you don’t have to buy separate Data Integration tools to accomplish these two ways of getting at data? What if I told you that all you need is the X-Factor…the Pivotal X-Factor!
What is Greenplum PXF?
A primary feature of the Greenplum Modern Data Analytics Platform, one that makes it a cut above the rest, is the Greenplum Platform Extension Framework (PXF). Greenplum PXF provides data connectors to all relational data sources supported by JDBC. It also has HDFS, Hive, and HBase connectors to all major commercial Hadoop distributions like Cloudera, Hortonworks Data Platform, and MapR, along with generic Apache Hadoop distributions. Is your data up on Cloud object stores? PXF provides connectors to Text, Avro, JSON, Parquet, AvroSequenceFile, and SequenceFile data on Azure, Google Cloud, Minio, and S3 object stores as well.
By defining External Tables that point to these data sources, you are able to retrieve and modify external data from the comforts of your own Greenplum home, as it were, by merely issuing SQL statements that behave as if the tables were actually local. And given Greenplum’s MPP architecture, access to these outside-the-fence data is lightning fast! How does inserting 71 million+ records from SQL Server into a 3-segment Greenplum cluster in 13 to 18 minutes sound?!
A Simple Example
Let’s say we have data that we want to do analytics on residing in two different kinds of databases, one SQL Server and the other Postgres. And suppose some important information also resides in CSV files provided by some department.
The traditional way of acquiring the data from these three different data sources is to use a separate ETL tool to define jobs that would physically transfer the data to destination tables. While this is OK, it leaves you committed to the data that you have already ingested and any mixing and matching of relationships can only be done after the fact and at the cost of physically transferring data.
What if you wanted the flexibility of first querying the data in these distinct data sources individually, and then together, establishing relationships between your SQL Server, Postgres, and CSV data without needing to load them physically first into your data repository?
With Greenplum PXF (the Pivotal X-Factor!), you can do just that!
So these are your three different data sources:
After configuring PXF JDBC settings under the hood, you define External Tables to the particular tables in SQL Server and Postgres that you want to access:
SQL Server External Table (PXF)
CREATE EXTERNAL TABLE pxf_sqlserver_TestTable
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Postgres External Table (PXF)
CREATE EXTERNAL TABLE pxf_postgres_dimairline
FORMAT ‘CUSTOM' (FORMATTER='pxfwritable_import');
For accessing the CSV file, we also define an External Table but with the use of another awesome Greenplum innovation called gpfdist (which will be the subject of another blog):
CSVExternal Table (gpfdist)
CREATE EXTERNAL TABLE ext_csv_address
FORMAT 'CSV' ( DELIMITER ',' );
And with just one query, you can retrieve the data from all three data sources, relating them to each other as you please:
SELECT * FROM
pxf_sqlserver_TestTable A → SQL Server
RIGHT JOIN pxf_postgres_dimairline B → Postgres
ON A.id = B.airlineid
LEFT JOIN ext_csv_address C → CSV file
ON B.airlineid = C.id
ORDER BY A.id LIMIT 10;
This is the Data Federation or Data Virtualization use case. You can also use the External Tables as a simple way to implement ETL by inserting to your pre-defined Greenplum tables from them, like so:
INSERT INTO gp_sqlserver_TestTable
SELECT * FROM pxf_sqlserver_TestTable;
Take note that this insertion to Greenplum is done in parallel across all segment hosts, thereby giving it lightning-fast load performance (71+M recs in 13 to 18 minutes!)
If transformations on the data are required, then creating user-defined functions in Greenplum should do the trick.
So there you have it. With Greenplum PXF, you can always get at the data that you want, enabling your organization to transform data into actionable insights fast and quick. You do want your business to stick around as long as the Rolling Stones, don’t you? 😉