Achieving Complete and Accurate Data Scans Using Validator Functions
Introduction
One of our core principles is that when we look for unsecured sensitive information in cloud storage, we offer the ability to fully scan unstructured and semi-structured data. Why not use sampling methods like other data security solutions, you ask? Because when you are dealing with sensitive information, a small blind spot can lead to a big data leak. Security teams must be able to scan files completely, so they know every location of sensitive data in a budget-safe way, without sacrificing scale or time and without risking false positives.
Sampling Doesn't Provide Complete Visibility
Scans that rely on sampling examine only a subset of the data. Sampling works when different types of data are mixed together evenly, but it often misses sensitive information that appears in only a few isolated spots. Think about a log file. The data at the beginning of a log file isn't necessarily the same as the data at the end. By scanning the full file, Open Raven can find sensitive information at massive scale in locations that would often slip through the cracks of sampling-based methods.
Validator Functions Ensure Accuracy
When performing a complete scan of terabytes or petabytes of data, even a very low false positive rate can produce large numbers of spurious findings. This is why it is crucial to ensure a high degree of accuracy. False positives can occur when something matches the pattern of a specific data class by coincidence. For instance, sometimes a random token just happens to look like a driver’s license number. Therefore, we put a great deal of emphasis on making our methods as precise as possible.
One way we do this is through the use of validator functions. In our data scanner, a validator function is a piece of code that runs after pattern matching to verify the results. When you create a custom data class, you have the option of adding a default function or using our validator function editor to create a custom function.
Not all data classes use validator functions, but in many cases, they provide an important way of filtering out false positives when pattern matching alone cannot provide certainty.
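To make the idea concrete, here is a minimal sketch of what a validator function can look like. The function name, signature, and the nine-digit data class are hypothetical and are not Open Raven's actual validator API; the point is simply that the scanner hands each pattern match to a function that returns true or false.

```python
# Minimal sketch of the validator-function idea, not Open Raven's actual API.
# Assumption: the scanner calls validate() with each string that matched a
# data-class pattern and keeps only the matches for which it returns True.

def validate(match: str) -> bool:
    """Hypothetical validator for a nine-digit account-number data class."""
    digits = [c for c in match if c.isdigit()]
    # Reject matches with the wrong number of digits.
    if len(digits) != 9:
        return False
    # Reject trivially repeated values (e.g. "000000000"), a common false positive.
    if len(set(digits)) == 1:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["123-45-6789", "000-00-0000", "12-3456"]:
        print(candidate, validate(candidate))
```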
Types of Validator Functions
A common type of validator function makes use of checksums that are built into some identifiers. You may have noticed that if you mistype a credit card number on a web form, you will sometimes get an error saying it is an invalid number. This is possible because the numbers include a check digit computed using a method called the Luhn Algorithm, which can then be verified against the other digits of the number.
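As an illustration, here is a minimal sketch of a Luhn check in Python. It assumes the matched text is a candidate card number that may contain spaces or dashes; the algorithm itself is the standard one, but the function name and signature are ours for illustration, not a particular product's API.

```python
# Minimal sketch of a Luhn check-digit validator for candidate card numbers.

def luhn_valid(candidate: str) -> bool:
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) < 13:  # typical card numbers are 13-19 digits long
        return False
    total = 0
    # Walk right to left, doubling every second digit and subtracting 9
    # from any product greater than 9 (equivalent to summing its digits).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test number
    print(luhn_valid("4111 1111 1111 1112"))  # False: check digit does not match
```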
The Luhn Algorithm is named after Hans P. Luhn, who worked as a researcher at IBM. The method, patented in 1960, is sometimes viewed as a precursor of the hashing methods now pervasive in computer science. The patent describes a mechanical device about the size of a wallet, which allows users to dial in a number and see the check digit through a small window. Its purpose, the patent explains, was to detect human error when transmitting numerical codes such as inventory numbers.
These check digits can also improve the precision of data scanning. Based on our experiments, a random 16-digit number will match the pattern of a card number about 50% of the time. Simply verifying the check digit with a validator function reduces this percentage by a factor of ten. When combined with other methods, such as keyword matching, the Luhn check enables us to find card numbers with a high level of confidence.
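For a rough sanity check of the factor-of-ten figure, the short simulation below (our own back-of-the-envelope sketch, not the original experiment) generates random 16-digit numbers and counts how many pass the Luhn check. About one in ten do, because only one of the ten possible check digits is consistent with the remaining digits.

```python
# Back-of-the-envelope simulation: fraction of random 16-digit numbers that
# pass the Luhn check. Roughly 10% do, since only one of the ten possible
# check digits is consistent with the rest of the number.
import random

def luhn_valid(number: str) -> bool:
    # Same Luhn check as sketched above, in compact form.
    digits = [int(c) for c in number]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

random.seed(0)
trials = 100_000
hits = sum(luhn_valid(f"{random.randrange(10**16):016d}") for _ in range(trials))
print(f"{hits / trials:.1%} of random 16-digit numbers pass the Luhn check")
```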
Conclusion
Checksums are only one of many ways we ensure our data classification is accurate when doing complete scans of large amounts of data. We also employ validator functions to filter out specific types of data known to create false positives—for instance, we use statistical methods to prevent our scanner from matching words when looking for alphanumeric identifiers. Validator functions can also classify data based on encoded information, such as grouping vehicle identification numbers by manufacturer or region. These methods ensure a high degree of accuracy and reduce false positives, so that security teams can identify all sensitive data locations at scale.
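To close with one concrete example of a validator that classifies as well as filters, here is a rough sketch of a VIN validator. The check-digit computation follows the standard VIN scheme, but the region table is a deliberately tiny, illustrative subset of the World Manufacturer Identifier codes, and none of this reflects how Open Raven's scanner is actually implemented.

```python
# Sketch of a validator that both verifies and classifies a match, using
# vehicle identification numbers (VINs) as the example. The region table is
# intentionally small and illustrative, not a complete WMI mapping.

# Transliteration values and positional weights from the standard VIN
# check-digit scheme (ISO 3779 / North American VINs).
TRANSLIT = {c: v for c, v in zip("ABCDEFGHJKLMNPRSTUVWXYZ",
                                 [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4,
                                  5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9])}
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

REGIONS = {  # first VIN character -> rough region of manufacture (partial table)
    "1": "United States", "4": "United States", "5": "United States",
    "2": "Canada", "3": "Mexico", "J": "Japan", "K": "South Korea",
    "W": "Germany",
}

def vin_value(ch: str) -> int:
    return int(ch) if ch.isdigit() else TRANSLIT[ch]

def validate_vin(vin: str):
    """Return the region if the VIN's check digit is valid, otherwise None."""
    vin = vin.upper()
    if len(vin) != 17 or any(c in "IOQ" for c in vin):
        return None
    total = sum(vin_value(c) * w for c, w in zip(vin, WEIGHTS))
    check = "X" if total % 11 == 10 else str(total % 11)
    if vin[8] != check:
        return None
    return REGIONS.get(vin[0], "Other")

if __name__ == "__main__":
    print(validate_vin("1HGCM82633A004352"))  # a commonly cited valid example VIN
```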