$10,000 Tylenol: Fixing Cloud Data Security Costs
Imagine you have a splitting headache. You go to the drug store to grab some Tylenol and a few other items. You’re checking out and the bill comes to… $10,000. Your jaw drops. You ask the cashier how the hell you ended up with a bill that looks more like a mortgage payment than a trip to the local CVS. You no longer want Tylenol or to shop there ever again. And your head still hurts.
Now swap out CVS for the cloud service provider (CSP, e.g., Amazon, Google) or security vendor of your choosing, and Tylenol for data security, or any of a number of cloud security or monitoring services. What you have is an exaggerated but far too often true portrait of where we find ourselves today: we have many options for observability and security, but few that fit or respect our budgets. As I write this, we’re removing a solution that works perfectly fine but whose cost has grown to 5x that of a working alternative due to absurd licensing fees. And while many companies are currently tightening their belts, the same story has been true for years now, and in many ways it began with the move to the cloud.
Cloud migration moved expenses that were previously fixed and negotiated annually to more developer-friendly, consumption-based models. While the ability to try new tech without a commitment is great, the trade-off is often not knowing what your actual cost will be at scale and over time. Cloud service adoption, especially of services from CSPs, also eliminated the proof-of-concept project, where a solution is assessed for fit at the expense of a modest time investment. This is great, right? We can now move fast and try things in a real environment without a long-term commitment.
The downside is that the related costs are often opaque and hard to project. They can vary by region, data type, underlying processor, and so on. And while you can turn these services on without any hassle, you can rack up eye-popping bills just as quickly. We’ve heard a number of times about massive AWS Macie expenses, with one surprise bill reaching into the seven figures after the service was inadvertently turned on. It’s now common to hear: “I’d like to do data security, but it’s too expensive for my environment.”
This is true not only for the direct use or licensing of a product, but also for the services it uses as part of its operations. This is primarily the case for third-party vendors, like us and all of our xSPM brethren. The APIs we call, the compute we use, and so on all end up on a CSP bill somewhere. And far too often this has not been transparent enough to avoid an unpleasant surprise. The customer pain can run deep, from the investigation to figure out exactly where the unexpected cost came from (lost time), to the “Finance walk of shame” to explain why the budget was exceeded (lost trust), to the work required to make sure it doesn’t happen again (unexpected effort, lost time). If this situation lingers long enough, it can irreparably stain a company’s brand and even an entire category of products.
The solution? It’s not smarter Sales teams doing an amazing job of setting customer expectations. It’s better-designed products that put the customer in control. It’s thoughtful licensing that recognizes pricing is part of the customer experience. It’s solving for predictability in both design and licensing. And yes, transparency matters too.
The second half of this blog explains how we met the challenge of delivering cloud data security at petabyte scale in a way that’s affordable, predictable, and a great experience for everyone, from the security team to DevOps all the way down the (entirely virtual) hallway to Finance.
Affordable at scale
Nothing matters if the actual cost of analyzing large amounts of data is excessive. This has been the primary blocker for many: it’s simply too expensive to discover and classify a mature data environment (e.g., hundreds of TBs or PBs). Most vendors “solve” this problem via sampling: analyzing a small percentage of the data and inferring from it the presence of sensitive data. This works in some cases; in many it does not. We’ve posted about this in detail previously, so I won’t elaborate here, but suffice it to say that the largest amounts of data are typically unstructured and represent the most risk. Unfortunately, unstructured data is also where sampling is least effective, no matter what AI/ML magic a vendor claims. A vendor might be well capitalized, but look at the billions Microsoft has put into OpenAI to get a sense of what that kind of magic would take at serious scale.
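To make the sampling problem concrete, here’s a back-of-the-envelope sketch (the object counts and rates below are made up for illustration, not drawn from any real environment): when only a handful of objects in a large unstructured store contain sensitive data, a low sampling rate is likely to miss all of them.

```python
# Back-of-the-envelope illustration (not product code): when sensitive data is
# sparse inside a large unstructured store, a sampling-based scan is likely to
# miss it entirely. All numbers below are hypothetical.

def p_detect_at_least_one(sensitive_objects: int, sample_rate: float) -> float:
    """Approximate probability that a uniform sample touches at least one sensitive object.

    With a large object count, each sensitive object is skipped with probability
    roughly (1 - sample_rate), so the chance of missing all of them is about
    (1 - sample_rate) ** sensitive_objects.
    """
    return 1.0 - (1.0 - sample_rate) ** sensitive_objects

# A bucket with millions of objects, only 10 of which hold sensitive data:
print(f"{p_detect_at_least_one(10, 0.01):.2f}")  # ~0.10 at a 1% sample
print(f"{p_detect_at_least_one(10, 0.10):.2f}")  # ~0.65 even at a 10% sample
```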
So what’s the solution? Highly optimized, flexible, serverless-based analysis. Our approach requires zero dedicated compute and uses exactly the amount of ephemeral processing power necessary to do the job at hand. The analysis is done in the same account as the data, requiring no copying or transfer (both of which cost $$$). Both enumeration of objects and classification have been tuned expressly to run at scale over nearly arbitrary data types. After the initial baseline, all further analysis is incremental only. And all of this has been tested in the fires of very large enterprise environments going back to 2020.
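As a rough sketch of the pattern described above (not our actual implementation; the bucket name, checkpoint handling, and classify() helper are placeholders), an ephemeral function can enumerate only the objects changed since the last baseline and classify them in place, inside the same account as the data:

```python
# Minimal sketch, assuming AWS S3 and boto3; this is illustrative only, not
# Open Raven's actual scanning code.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def classify(bucket: str, key: str) -> None:
    """Placeholder: stream the object and classify it in place.

    The object is read inside the data owner's account, so nothing is copied
    or transferred out (no egress or duplicate-storage charges).
    """
    ...


def incremental_scan(bucket: str, last_scan_time: datetime) -> datetime:
    """Enumerate and classify only objects modified since the previous scan."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_scan_time:
                classify(bucket, obj["Key"])
    return datetime.now(timezone.utc)  # checkpoint for the next incremental run
```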
Cost predictability
Another benefit of doing the hard work of data discovery and classification with serverless functions is that it puts the customer in complete control of how the platform works. If you want a scan to stay within a defined budget, no problem. Subsequent scans will pick up where the last one left off, allowing you to complete a full analysis in phases if you’re tightly managing costs. Let’s say instead that time is of the essence. In that case, you can constrain the scan to a set number of days, and if you need to move with extreme speed (e.g., for an investigation) we can increase the number of functions used for more parallel horsepower. Similarly, you can set sampling rates anywhere from 1-100% based on the desired level of completeness. In all instances, the scan itself is run by tagged serverless functions in the account where the analysis is performed, and its exact cost is plainly displayed alongside the results.
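To make those knobs concrete, here’s a hypothetical configuration sketch. The field names are illustrative, not our product’s actual API, but they map to the controls described above: a spend cap, a time box, extra parallelism, a sampling rate, and checkpoint-based resumption.

```python
# Illustrative sketch only -- these fields are not Open Raven's actual API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanConfig:
    budget_usd: Optional[float] = None   # stop once estimated serverless spend hits this cap
    max_days: Optional[int] = None       # time-box the scan instead of (or as well as) capping spend
    parallel_functions: int = 10         # raise for investigations that need speed
    sampling_rate: float = 1.0           # 0.01-1.0, i.e. 1%-100% of objects
    resume_from_checkpoint: bool = True  # pick up where the last scan left off


# A tightly budgeted first pass, completed in phases across several runs:
phase_one = ScanConfig(budget_usd=500.0, sampling_rate=1.0)

# An urgent investigation: no budget cap, more parallel horsepower, a two-day window:
incident_scan = ScanConfig(parallel_functions=100, max_days=2)
```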
Eliminating hidden costs
Continuing with the theme of transparency, our SaaS model with no dedicated compute resources (i.e., no agents, no scanners, etc.) means that not only are the costs of analysis predictable and fully within your control, there are also no additional surprise costs or overhead. In the name of privacy, a number of vendors offer a “modern on-premises” model where the platform is self-hosted by the customer. The result is often unexpected cost and effort, since the customer bears the operational expense and workload of running the vendor’s platform. We have built fine-grained privacy controls and “touchless” data scanning into our platform so that we can provide the same privacy as an on-premises solution without the associated overhead.
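A nice consequence of running the analysis as tagged serverless functions inside your own account is that the spend is independently verifiable. As a hedged illustration (the “scan-id” tag key and value below are hypothetical, not a product convention), a customer could pull the cost attributed to a given scan straight from AWS Cost Explorer:

```python
# Hedged illustration: verifying scan spend yourself via AWS Cost Explorer.
# The tag key "scan-id" and its value are hypothetical examples.
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Tags": {
            "Key": "scan-id",
            "Values": ["baseline-2024-06"],
            "MatchOptions": ["EQUALS"],
        }
    },
)
print(response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
```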
Straightforward licensing
Licensing and pricing are hard. About every other week, I get asked to weigh in on a pricing decision for a young company. At some point in the conversation, I usually end up saying: “If you’re spending a lot of time on this and it feels difficult, don’t worry, you’re doing it the right way.” Pricing is a key part of the customer experience. It touches not only those who use the product but also Finance, procurement, and potentially others. At Open Raven, we designed our pricing to be easily understood and to encourage frequent use of the platform, factors we deemed important beyond the predictability and affordability already mentioned above.
How did we do this? By metering off the number of data stores and the amount of data. The number of data stores is generally stable and grows at a predictable rate. Often the same is true for the amount of data. Compare this to alternate approaches that price off a percentage of your data bill (how do you even count this accurately?), off the number of people in your org, or bill you on a per-scan basis. To accommodate unexpected growth, we use a true-up model with no hard stop for the duration of the (typically annual) contract.
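As a simple illustration of how a true-up works (the rates and volumes below are hypothetical, not our price list): at renewal, only the overage against the contracted amount is billed, and there is no hard stop mid-contract.

```python
# Hypothetical numbers only -- a sketch of a true-up under volume-based metering,
# not Open Raven's actual price list.

def true_up(contracted_tb: float, actual_tb: float, rate_per_tb: float) -> float:
    """At renewal, bill only the overage; usage is never cut off mid-contract."""
    overage_tb = max(0.0, actual_tb - contracted_tb)
    return overage_tb * rate_per_tb

# Contracted for 500 TB, grew to 620 TB during the year, at an illustrative $10/TB:
print(true_up(contracted_tb=500, actual_tb=620, rate_per_tb=10.0))  # 1200.0
```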
Wrap-up
The dynamic nature of both the cloud and the data economy makes it difficult to solve key cost challenges at scale, from making analysis affordable to keeping costs within budget as an environment shifts and grows. And the stakes are high: getting it wrong means lost time, unexpected costs, and damaged trust, both within an organization and with the vendor partner. Thankfully, the flexibility and increasing sophistication of cloud services also give us the tools, from serverless functions to detailed telemetry, to thoughtfully solve the very problems created by the migration of massive amounts of data to the cloud.