Toxic (Log) Containment
Long ago, our logs were remarkably well behaved and lived by themselves in quaint little servers. There were a handful of sources generating logs. Our logs changed only occasionally along with our periodic software updates. We stored them on carefully maintained servers we gave names to. We rotated our logs because storage was expensive and expanding it took effort. Logs were stored on intentionally isolated systems, used by a handful of people for a fixed number of purposes, such as producing data for audits, investigations, and troubleshooting. If sensitive data was found in a log, the problem was naturally contained and mostly shrugged off.
Exactly none of this is true any longer.
The number of log sources has exploded, and so has the variety of the logs themselves. A perspective from Graylog: “Not long ago, 500MB per day was considered a normal volume of logs for a small shop. Today, 5GB per day is not unusual for a small environment. A large environment can produce a thousand times more than that.”
Log management has understandably shifted to the cloud for a host of reasons. A primary one is to remove the burden of scaling infrastructure to keep pace with the torrent of data. There’s no need to rotate logs any longer when storage is cheap and expands automatically as needed (as does your bill, but I digress). Route the logs to your data lake and move on. The data lakes and warehouses that store our logs are also no longer isolated. Like the move from an idyllic village to a bustling metropolis, the number of people using logs and the things they want to do with them have grown dramatically. Data scientists, product teams, compliance officers, and others are all handling log data now.
Finding personal data, developer credentials, or health data (to name a few) inside logs is a much bigger problem now, given all the additional exposure. But other factors further heighten the risk. We log differently than we used to, dumping detailed information for troubleshooting everything from failed authentication to mobile app crashes and Internet connectivity issues for snack machines. This drives up both the volume of logs and the likelihood of inadvertently dropping toxic data into them.
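To make that concrete, here’s a hypothetical sketch (in Python, with invented field names and a stand-in payment call) of how a perfectly reasonable “log everything on failure” habit drops a card number and an email straight into the log stream:

```python
import json
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("checkout")


def charge_card(payload: dict) -> None:
    # Stand-in for a downstream call that fails for some reason.
    raise RuntimeError("card declined")


def handle_checkout(payload: dict) -> None:
    try:
        charge_card(payload)
    except Exception:
        # The well-intentioned "dump everything so we can troubleshoot later"
        # pattern: the full payload, card number and all, lands verbatim in
        # whatever system is collecting these logs.
        logger.debug("checkout failed, payload=%s", json.dumps(payload))
        raise


if __name__ == "__main__":
    try:
        handle_checkout({
            "email": "jane@example.com",
            "card_number": "4111 1111 1111 1111",
            "amount": 49.99,
        })
    except RuntimeError:
        pass
```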
The trends above are reasonably well established, if not obvious. But there’s another, less visible shift underway that makes the problem of toxic log data even bigger. The SaaS applications we use, especially for sales and marketing, all work better if we feed them a steady diet of data. Unsurprisingly, modern data architectures make it increasingly straightforward to connect our SaaS applications directly to our data lakes.
So what does this mean for our toxic log scenario? The sensitive data that mistakenly made its way into our logs is now not only exposed to many more people doing many more things, it has also been accessed, consumed, and stored by every connected SaaS application that pulls from the same log storage location.
This takes finding and sanitizing toxic log data to another level. Formerly it was a headache: you had to plow through massive amounts of JSON, plaintext, protobuf, and so on to make sure you scrubbed out all the offending data. Now you also have to determine which SaaS applications accessed the toxic data, find out where they store it, and remove it there as well. This turns what was already a multi-hour task, if not a multi-day one, into a long and painful project that can stretch across weeks and involve people from many parts of an organization.
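For a sense of what that first scrubbing pass involves, here is a minimal sketch that assumes newline-delimited JSON logs and uses a couple of rough, invented detection patterns; a real pass needs many more detectors and separate handling for plaintext and protobuf:

```python
import json
import re
from pathlib import Path

# Rough patterns for two common sensitive values; real scanners use far
# more detectors, and far more precise ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scrub_line(line: str) -> str:
    """Redact pattern matches in one newline-delimited JSON log record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return line  # plaintext or protobuf dumps need their own handling
    for key, value in record.items():
        if isinstance(value, str):  # top-level string fields only, for brevity
            for name, pattern in PATTERNS.items():
                value = pattern.sub(f"[REDACTED-{name}]", value)
            record[key] = value
    return json.dumps(record)


def scrub_file(src: Path, dst: Path) -> None:
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            fout.write(scrub_line(line.rstrip("\n")) + "\n")
```

Even this toy version hints at the pain: nested fields, mixed formats, and sheer volume all multiply the effort, and none of it touches the copies that downstream systems have already pulled.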
No one has time for this type of cleanup. It is the wrecker of schedules and the devourer of best-laid plans. The potential regulatory penalties and the impact of a public incident might be the least painful of the implications.
So what’s the answer here? The classic “ounce of prevention”.
I won’t get into the specifics of how Open Raven works, but our answer to preventing sensitive data in logs from going toxic is to find and remove it at the point where it is stored. Let us worry about scaling to meet the challenge of your massive piles of logs and parsing all that JSON for personal data. By inventorying, classifying, and acting on data at scale automatically, we can turn what would otherwise have been a schedule-destroying incident into a quick fix.
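That’s not a description of how Open Raven is implemented, but purely to illustrate the “find it where it’s stored” idea, here is a minimal sketch that uses boto3 to walk a hypothetical log bucket and flag objects containing email-like strings, a stand-in for real classification:

```python
import re

import boto3  # AWS SDK for Python; bucket and prefix names are hypothetical

BUCKET = "example-log-archive"
PREFIX = "app-logs/2024/"
EMAIL = re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+")


def scan_log_bucket() -> list[str]:
    """Walk the log prefix and return keys of objects that appear to
    contain email addresses, before downstream systems pull them."""
    s3 = boto3.client("s3")
    flagged = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            if EMAIL.search(body):
                flagged.append(obj["Key"])
    return flagged


if __name__ == "__main__":
    for key in scan_log_bucket():
        print("needs review:", key)
```

Flagged objects can then be scrubbed or quarantined in place, before any connected application reads them, which is the whole point of acting at the storage layer.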
Curious to learn how to prevent data leaks through logs using Open Raven? Read our latest solution guide, Finding & Eliminating Sensitive Data in Logs.