So I like to think I know the data virtualization space fairly well at this point. The competitive landscape for Data Virtualization was part of my beat for a couple of years at Cisco, and I paid attention to anything that could really be interpreted as a threat, which is to say anything from other vendors that we actually saw in deals.
So it was with some surprise that I ran across IBM Fluid Query, a data virtualization product I’d never encountered before. In my experience, IBM’s data virtualization offering had been IBM InfoSphere Federation Server, which is incorporated into DB2 and IBM Big SQL. It lets you create views on external data within your DB2 database and run federated queries, pushing some of your warehouse processing down into the source, depending on what the source is capable of.
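If the pushdown idea is unfamiliar, here's a toy sketch of it in Python. To be clear, this is not how Federation Server works internally, and none of these names come from an IBM API: `Source`, `supports_filter`, and `federated_select` are all made up for illustration. The only point is that when a source can apply a filter itself, far fewer rows have to travel to the federating engine.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict
Predicate = Callable[[Row], bool]


@dataclass
class Source:
    """A toy external data source (hypothetical; not an IBM API)."""
    name: str
    rows: list
    supports_filter: bool  # assumed capability flag, for illustration only


def federated_select(source: Source, predicate: Predicate):
    """Return (matching_rows, rows_transferred) for a single source."""
    if source.supports_filter:
        # Pushdown: the source applies the filter, so only matches travel.
        shipped = [r for r in source.rows if predicate(r)]
        return shipped, len(shipped)
    # No pushdown: ship every row, then filter in the federating engine.
    shipped = list(source.rows)
    return [r for r in shipped if predicate(r)], len(shipped)


# Made-up data: 1000 orders spread evenly over the years 2010 through 2017.
orders = [{"id": i, "year": 2010 + i % 8} for i in range(1000)]


def is_recent(row: Row) -> bool:
    return row["year"] >= 2016


smart = Source("sql_source", orders, supports_filter=True)
dumb = Source("flat_file", orders, supports_filter=False)

smart_rows, smart_shipped = federated_select(smart, is_recent)
dumb_rows, dumb_shipped = federated_select(dumb, is_recent)

print(smart_shipped, dumb_shipped)
```

Both calls return the same rows; the difference is how many rows had to move to get there. A real federation engine makes this decision per operation (filters, joins, aggregates) based on what each source can do.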
IBM Fluid Query, unlike the other “DV-inside-the-database” offering, implements a “data warehouse extension” use case, bridging your conventional data warehouse with Hadoop and offloading processing on older data to the Hadoop cluster. The pattern was familiar once I saw what it was: all kinds of vendors have implemented something of the sort.
But IMHO it’s not data virtualization, at least not in much of a meaningful or interesting sense. A federated query capability that can basically only federate one main database and one SQL-on-Hadoop layer just isn’t that interesting, at least from a data virtualization standpoint.
The Wikipedia definition of Data Virtualization reads in part:
Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located,[1] and can provide a single customer view (or single view of any other entity) of the overall data.[2]
Unlike the traditional extract, transform, load (“ETL”) process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors, of the workload moving data around that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems.
This isn’t the most interesting or extensive definition, though. Cisco’s Data Virtualization product has an online summary of how analysts like Forrester and Gartner (as well as TDWI) have looked at the term “data virtualization”, and invariably the definition gets into pulling together many sources into a “logical data warehouse”-type view over your primary data sources for the purposes of analytics. Bolting a data warehouse extender based on Hadoop onto your database and calling it “data virtualization” hardly qualifies.
The fact that IBM would do this, though, suggests that the concept of data virtualization has gotten enough traction in the minds of enterprise decision-makers that DV is territory worth contesting.
Which is pretty cool, if you like DV.
Which I still do.