Operationalizing Data Security in the Cloud at Scale - Data Classification Explained Part 2
Last time, we discussed the core principles that guide Open Raven’s approach to determining what types of sensitive data reside in large cloud accounts. As we saw, automatically detecting things like Social Security numbers and names is a difficult problem, since many data types have overlapping formats that can only be distinguished by context. At first glance, the solution might seem obvious: train a machine learning model. In reality, things are not so simple. It is often difficult to get good training data for highly sensitive data types like national ID numbers; further, the contexts we encounter vary widely from customer to customer, making it difficult to develop a model that works reliably in all cases.
As a result, Open Raven takes a hybrid approach that combines machine learning with explicitly coded rules. We’ve developed ML models for dealing with the hardest data classification problems. At the same time, our R&D team does painstaking research into government and industry documents to find up-to-date information about data formats and other facts that can be used to classify data accurately. By bringing these two approaches together, our system can combine the advantages of both—ML’s ability to handle highly complex problems along with the dependability and efficiency of expert-written rules.
To understand why this hybrid approach is best, we have to consider some of the pitfalls that face ML in data classification.
The dangers of careless machine learning
Machine learning is a powerful tool, but no single method is a panacea. Getting ML to work well for a particular application requires care in defining the problem, assembling the training data, and tuning the model. In some cases, this process just provides a circuitous route to a simple solution that was staring you in the face all along. In others, if you are not careful enough, it can lead you astray.
A large subset of the data types we scan for have clearly defined formats that can naturally be written as regular expressions or simple conditional rules. One example is International Bank Account Numbers, or IBANs. Our scanner detects IBANs while also classifying them by country so as to help our users understand their data.
Classifying IBANs by country is fairly easy, at least if one takes a rule-based approach. IBANs begin with a standard, two-letter country code. To determine the country, all you have to do is extract those two characters and look them up in a table. IBANs also have country-specific rules about length that can be used to validate country code and account number combinations.
One example is the following (fake, I assure you) bank account number:
BR16 3974 1585 7194 5589 3390 477J V
“BR” is the standard country code for Brazil, which clearly indicates that this number, if it were real, would belong to a Brazilian account. We can also verify that the number has 29 non-space characters, which is the standard length for Brazilian IBANs.
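As a rough illustration of how little logic the rule-based approach needs, here is a minimal Python sketch of the two checks just described: look up the country code, then compare the length. The function name classify_iban and the small IBAN_LENGTHS table are hypothetical and cover only a handful of countries; this is not Open Raven’s scanner code, and a production validator would check more than this.

```python
from typing import Optional

# Registered IBAN lengths for a few countries (illustrative subset only;
# a real table would cover every country that issues IBANs).
IBAN_LENGTHS = {
    "BR": 29,  # Brazil
    "DE": 22,  # Germany
    "ES": 24,  # Spain
    "FR": 27,  # France
    "GB": 22,  # United Kingdom
}


def classify_iban(candidate: str) -> Optional[str]:
    """Return the country code if the candidate passes the two simple checks.

    Hypothetical helper: checks only the country prefix and the total length.
    """
    # Drop spaces and normalize case before inspecting the string.
    compact = "".join(candidate.split()).upper()

    # Rule 1: the first two characters must be a known country code.
    country = compact[:2]
    expected_length = IBAN_LENGTHS.get(country)
    if expected_length is None:
        return None

    # Rule 2: the length must match that country's standard.
    if len(compact) != expected_length:
        return None

    # IBANs contain only ASCII letters and digits.
    if not (compact.isascii() and compact.isalnum()):
        return None

    return country


print(classify_iban("BR16 3974 1585 7194 5589 3390 477J V"))
```

The trailing V plays no role in the decision here, which is exactly the point: the rules look only at the features that are known to matter.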
Although these rules are easy for a human to understand, an underperforming machine learning model could get them wrong. When trained on a small or unrepresentative data set, an ML-based classifier can easily pick up on spurious correlations that distract from the most important features. If, for instance, all the Spanish bank account numbers in the training data happen to end with V, a training algorithm might incorrectly conclude that a trailing V indicates an account number from Spain and misclassify the example above, even though it is obviously from Brazil.
Such problems are well-known in the field of machine learning, and there are techniques to avoid them. A properly tuned ML model should be able to pick up on the country code as the strongest signal while avoiding noise. But why train a model when you already know exactly what to look for? In such cases, ML only adds unnecessary complexity, increases the chance of error, and makes scanning slower and more expensive than simple pattern-matching methods.
Where ML shines is in handling data types with more open-ended formats. In the next section, I will discuss some of the ways we use ML to improve accuracy where human-coded rules alone won’t do.
Using ML where it matters
If pattern-matching methods are the best way to classify IBANs, the situation is much the reverse with regard to personal names. Since a person can be named almost anything, there are no hard-and-fast rules for whether a given string is a name or not. Detecting names thus depends in part on context: what can the surrounding text or data tell us about the meaning of a string? It is also a matter of probability: is the string a common name like “John,” a common word like “Database,” or something that could be either, such as “Will”?
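To make the probability question concrete, here is a toy Python sketch that compares how often a token appears as a name with how often it appears as an ordinary word. Everything in it is an assumption made for illustration: the counts are invented, the tables NAME_COUNTS and WORD_COUNTS are hypothetical, and the function name_likelihood is not how Open Raven’s name detector works. It also ignores context entirely.

```python
# Invented counts for illustration only; these are NOT real frequency data.
NAME_COUNTS = {"john": 9000, "will": 4000, "database": 0}
WORD_COUNTS = {"john": 50, "will": 6000, "database": 3000}


def name_likelihood(token: str) -> float:
    """Estimate the share of a token's occurrences that are personal names.

    Hypothetical helper: a simple ratio of name uses to all uses.
    """
    key = token.lower()
    as_name = NAME_COUNTS.get(key, 0)
    as_word = WORD_COUNTS.get(key, 0)

    total = as_name + as_word
    if total == 0:
        # No evidence either way for tokens we have never seen.
        return 0.5

    return as_name / total


for token in ["John", "Database", "Will"]:
    print(token, round(name_likelihood(token), 2))
```

A score near 1 suggests a name, a score near 0 suggests an ordinary word, and a middling score, as with “Will,” is where surrounding context has to break the tie.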
A standard approach to identifying names is Named Entity Recognition (NER), which involves combing through data to find names of certain types of entities, such as people and cities. Typical NER methods are designed to work with natural-language documents such as news articles or product reviews. However, most of the data we scan at Open Raven is not just paragraphs of English sentences; cloud storage instances often contain server logs, code, configuration files, and data in semi-structured formats like JSON and Parquet. The most common off-the-shelf NER solutions are unprepared to handle these data types. The formats also differ widely from company to company, which would make it difficult to train a new NER model that would work reliably with all the data we encounter.
As a result, we employ a name detection method that works differently from standard NER. Our approach draws on statistics about the frequencies of certain first and last names, large text datasets, and human knowledge about how names are typically formatted. We’ve managed to reduce the resulting model to a compressed form that can run with extremely low memory and compute requirements. Part of this compressed model consists of an enormous, ghastly machine-generated regular expression that no human could ever write; the rest is implemented in the data class’s validator code.
We also employ ML methods as a way of checking matches to prevent false positives in a variety of data classes. Some important types of sensitive data are difficult to classify because they have very general formats. For instance, passport numbers for some countries can consist of eight alphanumeric characters. That’s easy enough to search for with a regular expression: [A-Za-z0-9]{8}. Yet this simple pattern would also match many things that are not passport numbers, including the word “PASSPORT,” which, for obvious reasons, is highly likely to turn up as a false positive.
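The false positive is easy to reproduce. The short Python sketch below runs the pattern over a made-up snippet; the sample value X4R7K29Q is invented, the function name find_passport_candidates is hypothetical, and the \b word boundaries are an addition of this sketch (without them, the pattern would also match eight-character slices of longer strings).

```python
import re

# The pattern from the text, wrapped in word boundaries (an assumption of
# this sketch) so that it only matches standalone eight-character tokens.
PASSPORT_PATTERN = re.compile(r"\b[A-Za-z0-9]{8}\b")


def find_passport_candidates(text: str) -> list:
    """Return every eight-character alphanumeric token in the text."""
    return PASSPORT_PATTERN.findall(text)


# Made-up sample: a label followed by an invented ID value.
sample = "PASSPORT number: X4R7K29Q"
print(find_passport_candidates(sample))
```

Both the label “PASSPORT” and the invented value come back as candidates, so the pattern alone cannot tell the column header from the data.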
As a way of improving accuracy in such cases, we’ve developed a specialized ML method for distinguishing ID numbers and security tokens from strings that probably serve other functions, such as words and code identifiers. The model we use is a cousin of large language models like GPT-4, albeit much smaller and faster. Combined with standard pattern-matching and keyword methods, this method allows us to avoid false positives while maintaining a high rate of recall for some of the toughest classification tasks.
Our hybrid approach is highly efficient in terms of speed and cost, because the most complex logic only runs after a prospective match has been found. This means that we aren’t passing all your data through a costly machine learning pipeline—instead, our most complex models are applied selectively as a way of increasing accuracy.
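The shape of that two-stage design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names scan, looks_like_identifier, and CANDIDATE_PATTERN are hypothetical, and the validator is a deliberately crude stand-in (it just requires a mix of letters and digits) for the much more capable model described above. It is not Open Raven’s implementation.

```python
import re
from typing import Callable, List, Tuple

# Stage 1: a cheap pattern that proposes candidates.
CANDIDATE_PATTERN = re.compile(r"\b[A-Za-z0-9]{8}\b")


def looks_like_identifier(candidate: str) -> bool:
    """Crude stand-in for an expensive model, NOT the real classifier.

    Accepts only strings that mix letters and digits.
    """
    has_digit = any(ch.isdigit() for ch in candidate)
    has_letter = any(ch.isalpha() for ch in candidate)
    return has_digit and has_letter


def scan(
    text: str,
    validator: Callable[[str], bool] = looks_like_identifier,
) -> Tuple[List[str], int]:
    """Return accepted findings and the number of validator calls made."""
    findings = []
    validator_calls = 0

    for match in CANDIDATE_PATTERN.finditer(text):
        # Stage 2: the costly check runs only on pattern matches,
        # never on the rest of the text.
        validator_calls += 1
        if validator(match.group()):
            findings.append(match.group())

    return findings, validator_calls


text = "The PASSPORT field holds X4R7K29Q for this traveler record"
findings, calls = scan(text)
print(findings, calls)
```

In this made-up sentence only three of the nine words are eight characters long, so the validator runs three times and rejects “PASSPORT” and “traveler,” leaving the invented ID as the single finding.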
Next Up
We’ve explained our overall approach to data classification and how we navigate the benefits and drawbacks of machine learning. Next time, we’ll talk about how Open Raven’s platform enables you to operationalize data findings to address security problems. Our software can look for risky combinations of data types, such as name and SSN; generate secure data previews to provide visibility into data without revealing sensitive information; scan metadata to provide further information about resources; and produce alerts when sensitive data types are open to attack. Taken together, these features provide a robust platform for preventing data breaches before they happen.