5 things you need to know about data exhaust

Image: Ruben de Rijcke

Big data is now a familiar term in most of the business world, and companies large and small are scrambling to take advantage of it. Data exhaust, on the other hand, is less widely known, and in some ways it’s big data’s evil twin. Here are five things you should understand about data exhaust’s pros and cons.

1. It’s essentially all the big data that isn’t core to your business.

The term “data exhaust” has been around for more than a decade, and it arose with the new streams of data coming from smartphones, said Tye Rattenbury, director of data science and solutions engineering at Trifacta, which makes software for data preparation. Today, more accessible data tools are bringing exhaust to the fore.

If big data is “primary” data that relates to the core function of your business, data exhaust is secondary data, or everything else that’s created along the way, Rattenbury explained.

For instance, a bank would treat all the data about debits and credits to its customers’ accounts as primary. Secondary data might include information like what percentage of customers’ transactions are done at an ATM instead of a physical branch.
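To make the bank example concrete, here is a minimal Python sketch of how such a secondary metric could be derived from transaction records. The records, the field names (`account`, `amount`, `channel`), and the `channel_share` helper are all hypothetical, invented for illustration; a real bank's data would look different.

```python
from collections import Counter

# Hypothetical transaction records. The amounts are the "primary" data;
# the "channel" field is the kind of by-product treated here as exhaust.
transactions = [
    {"account": "A1", "amount": -40.00, "channel": "atm"},
    {"account": "A1", "amount": 1200.00, "channel": "branch"},
    {"account": "B2", "amount": -60.00, "channel": "atm"},
    {"account": "C3", "amount": -20.00, "channel": "atm"},
]


def channel_share(records):
    """Return the percentage of transactions made through each channel."""
    counts = Counter(r["channel"] for r in records)
    total = sum(counts.values())
    return {channel: 100.0 * n / total for channel, n in counts.items()}


shares = channel_share(transactions)
print(shares)
```

With this sample data, three of the four transactions happen at an ATM, so the ATM share comes out at 75 percent.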

There are no standard definitions or schemas for data exhaust, which tends to be raw and unstructured, but in many ways, it’s equivalent to the byproducts associated with a company’s machines and core online activities. It can include streams coming in from Web browsers, plug-ins, log files, Internet of Things (IoT) devices, and more.

2. It’s typically bigger than ‘big.’

“Big data” is itself a relative term, boiling down essentially to “anything that’s so large that you couldn’t manually inspect or work with it record by record,” Rattenbury said. In general, data exhaust tends to be even bigger, primarily because there are few limits on what a company can collect.

“Google is the leader here,” he said. “They literally collect everything, even before they know what they will do with it.”

That brings up another interesting feature of data exhaust: It can become primary data once a use for it is found.

3. It has great potential.

Data exhaust can be enormously useful. In that bank example, for instance, knowing where consumers conduct most of their transactions can help the bank do a better job.

“It’s not core to the transaction, but it can still be hugely relevant to servicing customers at a better level,” Rattenbury said. “It provides a level of understanding and contextualization to that primary transaction or service that’s increasingly desired by customers.”

Data exhaust can contain important elements of information that you may not be looking for today but that could prove useful in the future, noted Mary Shacklett, president of research firm Transworld Data.

“A lot of exhaust data isn’t immediately valuable,” agreed Nik Rouda, senior analyst with Enterprise Strategy Group. “The trick is figuring what is or could be.”

4. Beware the ‘swamp’ — and the legal baggage.

There can be risks associated with data exhaust.

“This is generally stuff customers may or may not be willing to have given you,” Rattenbury explained. “So there are potential legal, marketing, and public-relations risks around leveraging that data. You could end up alienating your customer base or partners by knowing stuff about them that they didn’t want you to know.”

The implications can be subtle. If an insurance company were to make use of the fact that it can see the GPS location of everywhere you’ve recently parked your car, for instance, it could raise rates for customers who routinely park in higher-crime areas. Without intending to do so, it might build an algorithm that ends up discriminating racially, he pointed out.

Another potential risk is saving data that will never be useful.

“CIOs need to balance the value of data exhaust against the waste of keeping tons of useless data forever,” Shacklett said. “This is very difficult to do right now.”

The goal is to save data exhaust that can go beyond adding incremental insights and color and actually transform business activities, Rouda said. “If there isn’t any business reason, this is where data lakes get a bad rap” and become data swamps.

5. You need to make some decisions.

The bottom line is that it’s critical to be selective about what data exhaust gets saved.

“It is important to start making some executive decisions on what you are going to throw out,” Shacklett said.

For instance, when it comes to smartphones and other devices, it’s well-known that much of the associated streaming data is “overhead” from device handshaking and extraneous “log data gibberish,” she pointed out. “It is doubtful that this type of data will ever be useful.”

Companies should also consult with lawyers, Rattenbury said.

In addition, they should get their employees closest to the core business in touch with the data. “They’ll have immediate questions they can ask that will show the relevance right away,” he explained.

From a technical perspective, companies need scalable storage technologies as well as tools for self-service data access.

One of the hardest pieces of working with exhaust data is getting a single coherent view around it, Rattenbury said. Cleaning up and unifying that data can be a challenge.

“I might have signed up for service at one place and entered credit-card information at another,” he explained. “You’ve recorded the same piece of data on me from a few different places.”
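The unification problem Rattenbury describes can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two record sets, their field names, and the choice of a normalized email address as the matching key are all hypothetical, and real-world matching is usually far messier than a single shared key.

```python
from collections import defaultdict

# Hypothetical records about the same person, captured in two places.
signup_records = [
    {"email": " Pat@Example.com ", "name": "Pat Lee", "source": "signup"},
]
payment_records = [
    {"email": "pat@example.com", "card_last4": "4242", "source": "payment"},
]


def normalize_email(raw):
    """Trim whitespace and lowercase so trivially different spellings match."""
    return raw.strip().lower()


def unify(*sources):
    """Merge records that share a normalized email into one profile each."""
    merged = defaultdict(dict)
    for records in sources:
        for rec in records:
            profile = merged[normalize_email(rec["email"])]
            for field, value in rec.items():
                if field in ("email", "source"):
                    continue
                # Keep the first value seen for a field; later ones are ignored.
                profile.setdefault(field, value)
            # Track where each piece of the profile came from.
            profile.setdefault("sources", []).append(rec["source"])
    return dict(merged)


profiles = unify(signup_records, payment_records)
print(profiles)
```

Here the two entries collapse into a single profile keyed by `pat@example.com`. Keeping the first value seen for each field is a simplification; deciding which source to trust when values conflict is part of the cleanup work.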

With secondary data, companies don’t typically worry at the time of collection about cleaning it up, Rattenbury added. So “you have to realize that it’s not just a matter of saying, ‘here’s this great pile of data — let’s do something with it.’”