Data: Adapt or Die!

Sometimes you read something and you just want to get drunk with the person who wrote it and laugh into the deep night. This is one of those times. “Data, Adapt or Die!”


I feel really bad for the Big Data gurus who painstakingly schlepped their love of data around digital venues such as LinkedIn, Twitter, Forbes and other professional comedic stages, only to be slapped right up the side of the head by data’s callous disrespect for prosaic, opportunistic and populist characters.

So, why hasn’t data adapted in the way we wanted it to?

First, because it is arrogant, ignorant and amoral. That’s why. And secondly, because there is so much of it. We’re surrounded by it. It’s become the digital enemy within, without and betwixt.

But it goes far deeper than that. We have been massively permissive with data; too obsequious; too impetuous; and, too rash.

We have allowed data to become uncontrollably Marxist, communalist and collectivist. So, be afraid, very afraid and hold these powerful thought-snippets in your head: “Enterprise Data Factory”, “Das Data Kapital” and “data workers of the world integrate”.

I might come back to these in just a moment.

But to loosely paraphrase Marx “It is not the consciousness of data that determines their being, but, on the contrary, their being that determines their mindfulness; sort of.”

And this is what we have let data do. We have let data live their own social lives, which has inevitably led to bad things, unintended consequences and undesired patterns of behaviour.

Do you see where I am coming from with this?

We thought data integration would help create a great and good data melting-pot, a great cauldron full of nourishing, life-giving data-soup of which we all could avail ourselves.

It didn’t work, we were wrong. So sad.

Google hit with $57M Penalty by French Regulators (or, “Be Here So You Don’t Get Fined”)

So… This happened.

Google fined $57m by French regulator for breaching GDPR

No surprise, really. It takes a while for a Beast to discover the full extent of its superpowers and then to bring them to bear on Evil (guys, you said you wouldn’t be evil, what happened?). But now that several months have passed and at least companies that are paying attention and are of good will are GDPR-friendly, if not GDPR-ready, the regulators are starting to flex their muscles in a serious way. Now we can wait for the screaming…

Seriously. Companies, you gotta do this. Otherwise you will start getting fined in a more serious way. Even if your primary motive is summed up in the Mantra of Marshawn Lynch, get serious, or get fined.

I’m waiting for the first time the “4% of global annual gross revenues” penalty is applied to a company that’s truly playing fast and loose with privacy. (My money’s on Uber.) Regulators have the equivalent of the death penalty at their disposal, and they’re gonna use it, you know they will, to make an example for the others.

Be here… or get fined. To death, maybe.

GDPR: There’s a Right Way and a Wrong Way (and a Really Wrong Way)…

So GDPR is one of the hottest trending terms on Google right now, and it’s of immediate interest to those of us doing things with data that might in any way touch on Europeans. Assuming, that is, that we intend to keep doing it after 25 May 2018.

As with any regulatory initiative, there’s just no two ways about it– you have to figure out how your business is going to comply. For GDPR this is true of just about any company that has customers or other stakeholders who are “natural persons” (i.e. not corporations). It’s doubly true for technology vendors who sell products into the data racket, especially products used to work with customer data.

Companies are mostly coming at this in one of three ways:


The Wrong Way — panic. A popular posture, among those who are now racing to learn to spell GDPR.

Don’t panic. Panic is not a good look for you. And besides, if you do you might be tempted to kid yourself that somehow this doesn’t apply to you, or that it’s somebody else’s problem.

It applies to you, in principle (which is what matters in the law), if anyone from Europe ever comes to your website or uses your app. Even if that person from Europe is not in Europe when they interact with you.

You can stick your head in the sand and take your chances, but violations are a big deal– 20M Euro fine big. And your customers will look at you in horror.

So that’s probably not how you want to do this.


That leads us to the really wrong way: passive-aggressive, minimal, even malicious compliance.

Marshawn Lynch (an American footballer, if you’re someone who doesn’t follow) may have embodied the credo “I’m just here so I don’t get fined” in NFL-mandated pre-Super Bowl press “availabilities” a few years ago. And he rubbed a lot of people the wrong way doing it. But he did it from a place of deep authenticity, which made it admirable, and from a position of actually delivering Beast Mode for nine years on the field.

“Beast Mode” for GDPR would be having all this worked out by January 2018, so if you’re reading this now, you’re not Marshawn Lynch. More importantly, taking this attitude to customers’ data privacy internally will set you up to do things that a) undermine your interests and your relationship with your customers, and b) can get you fined.

So that leaves us the right way.

The Right Way: Take the attitude that GDPR exists for a reason. Your customers have right and reason to be distrustful of you, and of many companies like you. Now, this is going to sound like so many platitudes, but… it’s true, and it’ll work. Or at least it will play better than the other two alternatives. There have been a lot of bad actors out there who have broken trust with customers by making distasteful, dangerous or even criminal use of private data:

  • Tracking your web activities, your mobile device usage (including location), your social media presences, your purchase habits, your medical history, all manner of things
  • Profiling them based on that data, and then acting on any resulting predictions in ways that materially affect someone’s finances, employment or legal status
  • Engaging in discriminatory practices (such as only showing job ads to people in a certain age range, or baking our own impulses into a data engine)
  • Or just being massively negligent in the processing and storage of such data, to the point where most of the adult population of the US has had their private data stolen (I’m looking at YOU, Equifax), and the company can’t even figure out exactly whose data was stolen or how much detail the thieves got.

And given that, realize that trust with personal data must be earned.

The rollout of GDPR is a crisis, but also an opportunity, for all brands big and small: a chance to acknowledge that in this moment, your customers have legitimate worries about their personal data, and to distinguish yourself in their eyes by driving home the message that your data handling practices reflect a core commitment to your customer, rather than your own fear of liability.

So while GDPR will be a disruption, and in the near term the goal is to get as compliant as possible, ask yourself as you formulate and implement your response, “how can I use this disruption to my normal processes to make sure my people are honest and treat customers honestly and fairly?”

  • Familiarize yourself with the rights of the data subject (your customers), and really think through what those mean in your industry, and in all the ways that you use customer data to market your service and to deliver it. (Lawyers can advise here, but it really takes the stakeholders who understand, control and operate your products and your marketing practices to do this review.)
  • Think about what it can mean from a product offering or service offering standpoint. Think about an experience that helps drive home the idea that your customers’ data, regardless of who controls or processes it at various stages, is being processed only in line with their consent and only to help you meet their needs better. (And hold yourself to that standard.)
  • Show them what you have collected, from where, and for what purpose. Make it easy to discover those disclosures, and to access features like reviewing all their data, in a form that they can understand.
  • Show them that they can withdraw their consent at any time– you will forget them, correct their info, or stop using the data, at their say so. Make this attitude normal.
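None of this requires exotic technology. As a purely illustrative sketch (the store and field names here are hypothetical, not from any particular product), the core data-subject capabilities, "show me my data, correct it, forget me", can be modeled as a thin service over whatever stores you already have:

```python
# Illustrative sketch of GDPR data-subject rights as plain operations.
# All names here (UserDataStore, fields) are hypothetical examples.

class UserDataStore:
    def __init__(self):
        # stand-in for your real stores (CRM, analytics, logs, ...)
        self._records = {}

    def collect(self, user_id, source, purpose, data):
        # Record *why* and *from where* data was collected, not just the
        # data: that provenance is what makes the access right answerable.
        self._records.setdefault(user_id, []).append(
            {"source": source, "purpose": purpose, "data": data}
        )

    def export(self, user_id):
        # Right of access: everything we hold, with source and purpose.
        return list(self._records.get(user_id, []))

    def rectify(self, user_id, key, value):
        # Right to rectification: correct the data wherever it appears.
        for rec in self._records.get(user_id, []):
            if key in rec["data"]:
                rec["data"][key] = value

    def erase(self, user_id):
        # Right to erasure ("right to be forgotten").
        self._records.pop(user_id, None)


store = UserDataStore()
store.collect("u1", source="signup form", purpose="account creation",
              data={"email": "pat@example.com"})
store.collect("u1", source="web tracker", purpose="ad targeting",
              data={"pages": ["/pricing"]})

disclosure = store.export("u1")   # show them what you have, and why
store.erase("u1")                 # honor a withdrawal of consent
```

The point of the sketch is the shape, not the code: every collection event carries its source and purpose, so access, rectification and erasure become queries rather than archaeology.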

In the short term it takes a lot of effort, and some of it may seem counter-intuitive. You may feel like you’re not exploiting your data to the fullest. But isn’t that word “exploit” part of the problem?  Just have some empathy for your customers, and ask yourself what will really make them come back. Make the move to a new model of privacy protection for your customers a win-win, and over time your relationship with customers will be stronger and the wins will multiply.

Now get at it.

Presto Starbursts to the Enterprise

So it looks like Teradata is spinning out their work on Presto as an open source project into a new company, Starburst.

I think this is for the best, as far as Presto adoption and growth go. While Presto surely received a lot of investment at Teradata, and that investment did it a lot of good, Presto couldn’t freely pursue its own evolution and mass adoption as long as there was the tension of a parent company with its own agenda for Presto as part of its QueryGrid data virtualization strategy.

Now Teradata will be free to improve support for their existing data warehouse within Presto, and Presto can evolve freely to support more data sources like Oracle, Redshift, NoSQL databases and so on.

My ask: will they improve their support for other SQL platforms to handle more complex queries, including pushdown? As it stands now, Presto, while quite powerful on most sources it supports, has big gaps around its support for mainstream SQL databases. Facebook can’t have cared much, since most of their data is presumably in Parquet; Teradata had every reason to not invest in this area since it would only bring other databases up as peers to Teradata.
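For readers new to the term: “pushdown” means the federation engine translates as much of the query as possible into the source database’s own SQL, so filtering happens where the data lives instead of after hauling every row across the wire. A toy sketch of the idea follows; nothing here is Presto’s actual planner, just an illustration of the rewrite a connector with pushdown support can perform:

```python
# Toy illustration of predicate pushdown in a federated query engine.
# This is NOT Presto's planner; it only contrasts the SQL a connector
# sends to the remote source with and without pushdown support.

def plan_scan(table, columns, predicate=None, can_push_down=False):
    """Return the SQL sent to the remote source for a scan."""
    if can_push_down and predicate:
        # Connector supports pushdown: filter remotely, ship few rows.
        return f"SELECT {', '.join(columns)} FROM {table} WHERE {predicate}"
    # No pushdown: ship the whole table and filter inside the engine.
    return f"SELECT {', '.join(columns)} FROM {table}"

naive = plan_scan("orders", ["id", "total"], "total > 1000")
pushed = plan_scan("orders", ["id", "total"], "total > 1000",
                   can_push_down=True)
```

Without pushdown, the engine asks the source for the entire `orders` table and applies `total > 1000` itself; with pushdown, the source does the filtering and only matching rows cross the network.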

This development could make Presto more of a rival to Drill and Dremio.

Anyway, the full press release is here:



Teradata Partners with Starburst, a New Company Focused on Continuing the Success of the Presto Open Source Project

Starburst will continue to grow and develop Presto for the enterprise with both companies partnering to provide support for the Presto user community
Teradata (NYSE: TDC), the leading data and analytics company, and Starburst, today announced a strategic relationship to keep the Presto community vibrant, growing and supported. The partnership builds on Teradata’s commitment and success with the Presto open source project, leveraging several former Teradata employees – key Presto contributors – who have formed Starburst. The new company will be focused exclusively on accelerating the development of Presto while providing enterprise-grade support to the rapidly expanding Presto user base. Teradata’s partnership with Starburst demonstrates a continued commitment to Presto and open source as part of its Teradata Everywhere strategy.

Originally created at Facebook as a successor to the Apache Hive project, Presto is a SQL engine that provides fast, interactive query performance across a wide variety of data sources including HDFS, S3, MySQL, SQL Server, PostgreSQL, Cassandra, MongoDB, Kafka, Teradata, and many others. Teradata continues its commitment to Presto as part of Teradata QueryGrid which brings together diverse environments into an orchestrated analytical ecosystem.

“At Teradata, we embrace open source as part of our analytical ecosystem,” said Oliver Ratzesberger, Executive Vice President & Chief Product Officer at Teradata. “In addition to our extensive work with Presto, Teradata is also driving the open source development of Kylo for data lake ingest and management, and Covalent for user interfaces. With Starburst, we further support the growing adoption of Presto as a foundation for enterprise deployments.”

Over the past few years, Teradata has engineered numerous scalability, performance, security and manageability improvements into the core Presto engine. Starburst will continue investing heavily in query engine development while also certifying releases of Presto optimized for QueryGrid to provide Teradata users with the fastest path to query data wherever it may be.

“We are thrilled by the opportunity to pursue our ambitious goals for Presto as a new independent company,” said Justin Borgman, Co-founder at Starburst. “Whether you’ve been using Presto in production for years or you’re trying out Presto for the first time, we look forward to working with you to achieve your goals with the best open source SQL engine on the planet.”

Apache Arrow: MapD adopts portable, standard in-memory analytics data format from Dremio

Some time back I had the pleasure of meeting some of the good folks at MapD. Among the topics we wound up talking about was Apache Arrow, an emerging in-memory columnar data representation with multiple language bindings which, to make a long story short, can let different analytics solutions pass data from one to another by reference, rather than by value. Very cool stuff, which dovetails with Remote DMA technology to allow a cluster to process big in-memory datasets with vastly more efficient communications for data interchange. So much pain and processing eliminated by just choosing a common (and well-optimized) representation and going with it.

I had previously encountered Jacques Nadeau through work, at MapR when I met him; he is now a founder of Dremio, where the Arrow work was a byproduct of the development of Apache Drill and Dremio’s eponymous offering. And I said to Jacques when I crossed paths with him at Strata Hadoop World: “There will soon be two classes of analytics software: those that can interchange data with Arrow, and those that are obsolete.”

This blog post at kdnuggets on Arrow provides some good background.

Streamlining the interface between systems

One of the funny things about computer science is that while there is a common set of resources – RAM, CPU, storage, network – each language has an entirely different way of interacting with those resources. When different programs need to interact – within and across languages – there are inefficiencies in the handoffs that can dominate the overall cost. This is a little bit like traveling in Europe before the Euro where you needed a different currency for each country, and by the end of the trip you could be sure you had lost a lot of money with all the exchanges!


We viewed these handoffs as the next obvious bottleneck for in-memory processing, and set out to work across a wide range of projects to develop a common set of interfaces that would remove unnecessary serialization and deserialization when marshalling data. Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. Today it includes first class bindings in over 13 projects, including Spark, Hadoop, R, Python/Pandas, and my company, Dremio.

Now I see MapD is taking up the use of Arrow and even driving the GPU Open Analytics Initiative, devoted to building out support for Arrow as part of making GPU analytics more performant and standard. From their post:

…While Spark has a Python interface, the data interchange within PySpark between the JVM-based dataframe implementation in the engine and the Python data structures was a known source of sub-optimal performance and resource consumption. Here is a great write-up by Brian Cutler on how Arrow made a significant jump in efficiency within PySpark.

MapD and Arrow

At MapD, we realize the value of Arrow on multiple fronts, and we are working to integrate it deeply within our own product. First, we are finding our place in data science workflows as a modern open-source SQL engine. Arrow solves precisely the problems we expect to encounter related to data interchange. Second, a natural outcome of being a GPU-native engine is that there is great interest in integrating MapD into machine learning workflows, where Arrow forms the foundation of the GPU dataframe, which provides a highly performant, low-overhead data interchange mechanism with tools like TensorFlow and others.

It’s so cool to see a good idea spreading and taking root. Hope I get to engage with Arrow (and with some of the cool people I’ve met over the years) soon.

Cisco Data Virtualization → TIBCO Data Virtualization

So this happened. Should be a more natural home for the products than Cisco.

TIBCO Acquires Data Virtualization Business from Cisco

Analytics Users to Benefit from Improved Data Agility, Enhanced Scalability, and Better Business Insights
Palo Alto, Calif.


05 October, 2017

TIBCO Software Inc., a global leader in integration, API management, and analytics, today announced it has entered into an agreement to acquire Cisco’s Data Virtualization business (formerly Composite Software), specifically Cisco Information Server, a market-leading solution that powers enterprise-scale data virtualization, and associated consulting and support services. This strategic move strengthens TIBCO’s portfolio of analytics products, allowing businesses to get analytic solutions into production faster than alternatives, while continuing to adapt as data sources change from traditional databases and big data sources to cloud and IoT. The transaction remains subject to customary conditions and is expected to close in the coming weeks.

Data Virtualization helps knowledge workers to quickly discover and access their own views of corporate data in an automated, intelligent way. The Cisco technology can access a large, diverse, and complex set of enterprise data stores and create a “virtual” data layer for analytics without disturbing the source data. All this is done without extracting data via ETL in a separate data warehouse.

“Data Virtualization helps our customers find and analyze the data they need in hours or days, rather than months, so that they can quickly discover insights and take insight-driven action,” said Mark Palmer, senior vice president of analytics, TIBCO. “The next generation of business intelligence depends on doing more with analytics than just putting data on a graph. Data Virtualization is a key component of getting the right data at the right time to business analysts, data scientists, and automated applications using streaming analytics.”

The addition of the Data Virtualization business will enable TIBCO analytics users, including TIBCO Spotfire® customers, to improve data agility for faster responses to ever-changing analytics and business intelligence needs, reduce data complexity for enhanced scalability, and drive better business insights. Spotfire® is a smart, secure, enterprise-class analytics platform with built-in data wrangling that delivers AI-driven, visual, and streaming analytics. The Spotfire product line includes seamless enhancement of Cisco’s Data Virtualization into TIBCO’s broader Connected Intelligence platform, while also delivering enhanced end-to-end data discovery and governance.

Learn more about TIBCO’s solutions here, as well as the acquisition of Cisco Information Server here.

Follow us @TIBCO on Twitter, and on our Facebook and LinkedIn pages to hear the latest news and updates from our team.


TIBCO fuels digital business by enabling better decisions and faster, smarter actions through the TIBCO Connected Intelligence Cloud. From APIs and systems to devices and people, we interconnect everything, capture data in real time wherever it is, and augment the intelligence of your business through analytical insights. Thousands of customers around the globe rely on us to make better decisions, build compelling experiences, energize operations, and propel innovation. Learn how TIBCO makes digital smarter at


The Morning Paper: Keeping up with Computer Science Research

So The Morning Paper gets my vote for cool resource of the month… it dives into a computer science research paper each day– usually something new, sometimes looking backwards for background on one of the other papers covered. Always good to skim, sometimes to dive in if the subject is one you are engaged with.

For example, right now there are articles in there at least tangentially relevant to my main gig, or to my other topics of interest in computing:

So there’s a lot in there if you don’t have bandwidth to keep up with everything happening in tech right now. And who does?


Origins of Data Virtualization: Composite Software Veteran Gives First-Hand Account

This blog isn’t going to be all-data virtualization, all the time, I promise– but I did want to give a shout-out to a good review of the origins of the product space of “data virtualization” as well as the term, from a true class act and industry veteran, Bob Eve, currently at Cisco (which acquired pioneer Composite Software back in 2013). Bob literally wrote the book on data virtualization and had a hand in carving out a market that will only grow in importance in years to come.

Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, the first book ever written on the topic of data virtualization, introduces the technology that enables data virtualization and presents ten real-world case studies that demonstrate the significant value and tangible business agility benefits that can be achieved through the implementation of data virtualization solutions.

This first chapter describes the book and introduces the relationship between data virtualization and business agility. The second chapter is a more thorough exploration of data virtualization technology. Topics include what data virtualization is, why to use it, how it works, and how enterprises typically adopt it. The third chapter addresses the many ways that data virtualization improves business agility, with particular focus on the three elements of business agility: business decision agility, time-to-solution agility and resource agility.

The core of the book is a rich set of in-depth data virtualization case studies that describe how ten enterprises across a wide range of industries and domains have successfully adopted data virtualization to increase their business agility. The ten enterprises profiled are customers of Composite Software, Inc., a data virtualization software vendor.

The Composite team accomplished a lot and continue to innovate within Cisco. As more players enter the space, they continue to be the team to beat.

IBM Fluid Query: Data Virtualization? Not Exactly.

So I like to think I know the data virtualization space fairly well at this point. The competitive landscape for Data Virtualization was part of my beat for a couple of years at Cisco, and I paid attention to anything that could really be interpreted as a threat, which is to say anything from other vendors that we actually saw in deals.

So it was with some surprise that I ran across IBM Fluid Query, a data virtualization product I’d never encountered before. IBM’s data virtualization offering in my experience had been IBM Infosphere Federation Server, which is incorporated into DB2 and IBM Big SQL and which lets you create views on external data within your DB2 database, and run federated queries, pushing down a certain amount of your warehouse processing into the source based on the capabilities of the source.

IBM Fluid Query, unlike the other “DV-inside-the-database” offering, implements a “data warehouse extension” use case bridging your conventional data warehouse with Hadoop and offloading processing on older data to the Hadoop cluster. The pattern was familiar, once I saw what it was– all kinds of vendors have implemented something of the sort.

But IMHO it’s not data virtualization, at least not in much of a meaningful or interesting sense. A federated query capability that can basically only federate one main database and one SQL-on-Hadoop layer just isn’t that interesting– at least from a data virtualization standpoint.

The Wikipedia definition of Data Virtualization reads in part:

Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located,[1] and can provide a single customer view (or single view of any other entity) of the overall data.[2]

Unlike the traditional extract, transform, load (“ETL”) process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors, of the workload moving data around that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems.

This isn’t the most interesting or extensive definition, though. Cisco’s Data Virtualization product has an online summary of how analysts like Forrester and Gartner (as well as TDWI) have looked at the term “data virtualization”, and invariably the definition gets into pulling together many sources into a “logical data warehouse”-type view over your primary data sources for the purposes of analytics. Bolting a data warehouse extender based on Hadoop onto your database and calling it “data virtualization” hardly qualifies.

The fact that IBM would do this, though, suggests that the concept of data virtualization has gotten enough traction in the minds of enterprise decision-makers that DV is territory worth contesting.

Which is pretty cool, if you like DV.

Which I still do.

Data is The New Racket

So… after months of staring at various blank pages, thinking about some magnum opus on data management and data virtualization, I get tired of stalling, and begin.

Specifically, I’ll begin with a definition, lifted from


2 (informal) An illegal or dishonest scheme for obtaining money. (let’s not dwell too much on this one 😉)

   2.1 A person’s line of business or way of life.

   Example Sentences:
  • ‘I’m in the insurance racket’
  • ‘You had better have a darn good reason for any involvement in the casualty insurance racket.’
  • ‘It’s a strange business, this journalism racket.’
  • ‘Initial conversation gives you the impression that this kid’s just too nice to make it in the music business, this racket will chew him up and spit him out.’

So… the Data Racket, then, is the world of doing things with data, one way or another, as a business or a way of life.  And, consistent with the ongoing datafication of everything, even the example sentences hold up pretty well if you add “data” to them:

  • ‘I’m in the insurance data racket’
  • ‘You had better have a darn good reason for any involvement in the casualty insurance data racket.’
  • ‘It’s a strange business, this data journalism racket.’
  • ‘Initial conversation gives you the impression that this kid’s just too nice to make it in the digital music business, this racket will ingest him and spit him out.’

I kind of stumbled into the data racket myself, about ten years ago, thanks to a series of lucky breaks that got me into product management (which was a longtime goal) in the domain of integration and data management (which just kind of happened). Classical data warehousing, ETL and data quality on Oracle were the beginning, but then someone stampeded a herd of yellow elephants through Kimball’s cathedral, and there came a bunch of NoSQL data stores, and clouds rolled in, and… one thing led to another. And now I’ve been places, seen things, and learned, I hope, more than a few useful lessons along the way. If they help me make sense of everything happening with Big Data, I will let you know.

This blog will blend lessons distilled from across the years with my take on what’s coming next, and, perhaps, guest posts from gracious people I’ve met along the way. My hope is that some of it will be worth a read, and even occasionally novel.

And now, as they say, “allons-y.”