Platform Enhancements for Structured Data Scanning
With our latest platform release, we announced multiple enhancements to our structured data scanning UI and workflow. In this blog, we’ll go in-depth on how we approach analyzing structured data stores and review some of the new features.
Secure and Private by Design
We know that securing your data is hard enough - it’s our job to make sure it’s as straightforward as possible, starting with setup. Open Raven’s platform was designed to be secure and private, beginning with requiring as few permissions as possible. Further, when we touch or scan data, it is always done in your account, so Open Raven never sees raw data or secrets. You also control how much of a data preview is sent back to the platform.
Scanning structured data stores presents some unique challenges: most databases use a username plus a secret (password, certificate, etc.) to construct a “connection string” that is used to access the database. There are alternative approaches, such as copying the data store, that arguably create more risk than they reduce. Even with a safe and private approach, accessing databases via secrets introduces additional complexity in managing those secrets. Wherever possible, we do not want your secrets to be stored within Open Raven.
In this vein, Open Raven’s data scanner (a serverless function that runs in the customer’s cloud environment) knows where and how to fetch the secrets needed to access the targeted database, with the actual password stored in that cloud provider’s secrets manager. In other words, the database secret never leaves your environment: Open Raven fetches it only when needed, using a function that runs entirely in your account, and the secret itself is never sent back to Open Raven’s environment.
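To make this concrete, here is a minimal sketch of what such an in-account scanner function could look like, assuming an AWS Lambda handler, a PostgreSQL target accessed via psycopg2, and a secret stored in AWS Secrets Manager. The event fields and the scan_tables helper are illustrative placeholders, not Open Raven’s actual implementation.

```python
# Minimal, illustrative sketch only; not Open Raven's actual scanner code.
import json

import boto3
import psycopg2  # assuming a PostgreSQL target for this example


def scan_tables(conn):
    # Placeholder for the in-account classification logic.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public'"
        )
        return [row[0] for row in cur.fetchall()]


def handler(event, context):
    # The secret lives in the customer's own AWS Secrets Manager; the scanner
    # only needs permission to read it at scan time.
    secrets = boto3.client("secretsmanager")
    secret = json.loads(
        secrets.get_secret_value(SecretId=event["secret_id"])["SecretString"]
    )

    # The connection is established entirely inside this function's execution
    # environment; the password is never logged or returned.
    conn = psycopg2.connect(
        host=event["db_host"],
        dbname=event["db_name"],
        user=secret["username"],
        password=secret["password"],
    )
    findings = scan_tables(conn)
    conn.close()

    # Only findings and metadata leave the environment; the secret never does.
    return {"findings": findings}
```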
For Open Raven users, setting up credentials is easy using the new Credential Manager feature, found in the Settings section of the platform. The Credential Manager displays the secret’s location, its format, and whether that secret has previously been used to authenticate successfully to the data store in question.
After we’ve attempted to authenticate to a database, the Asset List also differentiates between databases we have successfully authenticated with and ones we have not yet been able to authenticate, so that users can identify and resolve any authentication issues before their next scan.
Snapshot-less Scanning
A database snapshot is another copy, and every copy increases risk by leaving additional data hanging around. Creating snapshots also adds cost and may be prohibitively expensive or time consuming for large databases. Relying only on snapshots would also preclude scanning certain types of data stores where snapshots cannot be created, such as databases hosted on dedicated cloud compute instances rather than managed by the cloud provider.
Without creating snapshots, any scanning involves queries against a live database. To minimize performance impact, we make only one query per table during scanning, along the lines of the sketch below; in the next section, we go into further detail about our sampling approach.
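For illustration only, a single bounded sampling query per table might look like the following, assuming a PostgreSQL-style engine accessed via psycopg2; the sample_table helper and the sampling clause are placeholders rather than the actual query we issue.

```python
# Illustrative only: one bounded query per table against the live database.
# The sampling clause itself is engine-specific (see the next section).
SAMPLE_CLAUSE = "TABLESAMPLE BERNOULLI (1)"  # placeholder; varies by engine


def sample_table(conn, table: str, max_rows: int = 1000):
    with conn.cursor() as cur:
        # A single bounded query: no snapshot, no full-table export.
        cur.execute(
            f'SELECT * FROM "{table}" {SAMPLE_CLAUSE} LIMIT %s', (max_rows,)
        )
        return cur.fetchall()
```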
Why Sample Structured Data
Scanning data at scale is one of the key advantages of the Open Raven platform. If you look at our release notes, you’ll see that much of our recent effort has been focused on achieving unstructured data scanning at massive scale: billions of objects and petabytes of data. Speed and accuracy are the key ingredients of successful data scanning: faster scans mean lower computational costs and quicker results, and accurate results mean you arrive at high-quality information you can act on.
In unstructured data stores, it is difficult to get accurate results without completely scanning the data store, because there is little guarantee that the contents of one object are similar to the next, or that the beginning of a file resembles its middle or end. Just because you didn’t find sensitive data in one log file does not mean the next log file in the same store is also free of it. While sampling can provide directional information, the accuracy of your findings is only as good as the completeness of your scan.
In contrast, with structured data it is straightforward to reach a high level of confidence about the contents of a database, because we can assume that the data in a given column are self-similar: only a representative sample of a single column is needed to arrive at an accurate assessment of its contents. However, to accurately assess a column’s contents with a relatively small number of samples, those samples must be representative of the entire set of data; if the sampling method is not random enough, or the sample is too small, the results will be biased, inaccurate, and difficult to base decisions or actions on.
Ensuring performant, random sampling took careful design and a bit of hard work. Approaches with stronger randomness guarantees tend to be slower and more computationally expensive, while the faster built-in sampling techniques do not produce representative samples, and the available options vary by database engine. Our engineering team was up to the challenge, using novel techniques to ensure sufficient sampling randomness so that we could accurately determine the contents of a structured database with fewer samples and lower resource utilization.
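To make the trade-off concrete, here is a simplified look at the built-in options a PostgreSQL-style engine offers; these are common techniques shown for context, not the method our team actually uses, and the table name is hypothetical.

```python
# Simplified illustration of common sampling options and their trade-offs,
# assuming a PostgreSQL-style engine. None of these is Open Raven's method.
TABLE = "customers"  # hypothetical table name
N = 1000

# 1) Uniformly random, but forces a full table scan and sort: accurate, costly.
order_by_random = f"SELECT * FROM {TABLE} ORDER BY RANDOM() LIMIT {N}"

# 2) Page-level sampling: fast, but returns clustered rows from a few pages,
#    which can bias results if data is physically ordered (e.g. by insert time).
tablesample_system = f"SELECT * FROM {TABLE} TABLESAMPLE SYSTEM (1) LIMIT {N}"

# 3) Row-level sampling: more representative than SYSTEM, but still visits
#    every page, so it is slower on large tables.
tablesample_bernoulli = f"SELECT * FROM {TABLE} TABLESAMPLE BERNOULLI (1) LIMIT {N}"
```

The goal of a smarter approach is to get close to the representativeness of the first or third option at something nearer the cost of the second.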
Keeping Things Simple
While adding structured data scanning has meant a few changes and new features, we aspired to keep changes to the user experience across the rest of the platform to a minimum.
For example, customers who already scan their unstructured data with data class collections like “Personal Data”, or with custom data collections, can continue to use the same collections and data classes to scan their structured databases, with zero changes to their data definitions. Viewing and triaging the results of those scans, in the data catalog and in policy violations, has not changed either: structured and unstructured findings appear in the same place, so teams have a unified view of their data risk posture.
In Maps and Asset Lists, we’ve integrated structured data assets seamlessly. Aside from a few additions, such as the authentication status shown for structured data assets, the functionality remains the same - for example, you can quickly see whether a structured data asset has a backup policy configured, just as you can for an object data store.
Conclusion
Across our careers, our team has been part of building hundreds of products, from Microsoft Cortana to CrowdStrike EDR and far, far beyond. Creating the Open Raven platform has been one of the most fascinating efforts we’ve been a part of, as we set tight parameters for data safety and privacy while taking on the massive scale and variety of modern data. The latest work on structured data described in this post is a signal of much more to come as we take Open Raven into new frontiers in the year ahead.