
Scanning for Developer Secrets in Git Repositories Using Open Raven

Michael Ness
Security Researcher
June 1, 2022

Introduction

Open Raven’s initial focus was scanning unstructured and semi-structured data at scale in AWS S3 buckets. Since just about anything can find its way into an S3 bucket, our platform’s analysis extends to a number of places, including source code repositories. Using infrastructure provisioned by a CloudFormation template, customers can sync any of their Git repositories into an S3 bucket in their own cloud environment using AWS-native services. Syncing happens any time the main branch of a repository changes, so everything in the bucket stays up to date. The bucket containing the repos can then be scanned on a schedule to detect sensitive personal information and developer secrets. Once set up, the process is fully automated, letting customers continuously evaluate their codebases for sensitive information before attackers have the chance to find it.
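
Before wiring up scans, it can be worth confirming that archives are actually landing in the sync bucket. Here is a minimal sketch using boto3; the bucket name and "repos/" prefix are hypothetical placeholders, so substitute whatever your stack actually created.

import boto3

s3 = boto3.client("s3")

def list_synced_repos(bucket: str, prefix: str = "repos/") -> None:
    # Page through the bucket and print each zip archive with its
    # last-modified time, which should track the latest main-branch sync.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".zip"):
                print(obj["Key"], obj["LastModified"])

list_synced_repos("example-repo-sync-bucket")  # hypothetical bucket name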

Developer secrets fall under many categories, but the general concept is that they are string-based credentials that grant access to a resource. An attacker who finds one of these keys can access the corresponding service and extract information as if they were the victim company. Open Raven currently supports 22 developer secret data classes, with plans to add more throughout this quarter. Supported examples include AWS access keys, GitHub and GitLab API keys, and PayPal and Stripe API keys. These strings should never be hardcoded into repositories, and Open Raven can identify those that are for immediate remediation.
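
To make the idea concrete, below is a deliberately simplified sketch of pattern-based secret detection. It only checks for the well-known AWS access key ID shape (AKIA followed by 16 uppercase alphanumerics); Open Raven’s actual data classes use more refined rules and validation, so treat this as an illustration rather than the product’s method.

import re
from pathlib import Path

# The classic AWS access key ID shape. Real scanners pair patterns like
# this with validation logic to cut down false positives.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def scan_tree(root: str) -> None:
    # Walk a checked-out repository and flag candidate AWS key IDs.
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if AWS_ACCESS_KEY_ID.search(line):
                print(f"{path}:{lineno}: possible AWS access key ID")

scan_tree(".")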

Git Repository Syncing

The first stage is syncing an organization’s Git repositories into an S3 bucket within their AWS account. The sync is automated: any new change to a repository is pushed through, so the bucket always contains the most up-to-date copies. The process combines several pieces of tooling: webhooks, Amazon API Gateway, AWS Lambda, and AWS CodeBuild, all provisioned by a CloudFormation template provided by AWS. The infrastructure the template creates is shown in Figure 1 below.

[Figure: linear architecture diagram. Git users push to a third-party Git repository; a Git webhook calls Amazon API Gateway, which invokes a Lambda function in a private subnet; the Lambda triggers AWS CodeBuild, which clones the repository and writes to an S3 SSH key bucket, an AWS KMS key, and an S3 output bucket.]
Figure 1: Depiction of the infrastructure provisioned in AWS from the provided CloudFormation template.
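
If you prefer to provision the stack programmatically rather than through the console, a minimal boto3 sketch might look like the following. The stack name, template URL, and required capabilities here are placeholders; consult the AWS-provided template for the real values.

import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="git-to-s3-sync",  # placeholder name
    # Placeholder URL; point this at the AWS-provided template.
    TemplateURL="https://example-bucket.s3.amazonaws.com/git-to-s3.template",
    # The template creates IAM roles for Lambda and CodeBuild, so the
    # stack likely needs IAM capabilities acknowledged.
    Capabilities=["CAPABILITY_IAM"],
)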

The principle is that a webhook is set on specific repositories, or at the organization level. When its conditions are met (e.g., a merge to the main branch), the hook fires and sends an HTTP request to the API Gateway endpoint it is configured with. The gateway triggers a Lambda function that processes and validates the request. If the Lambda determines the request is legitimate, it starts AWS CodeBuild, which clones the Git repository, packages it into a zip archive, and uploads it to the S3 bucket. This happens every time a repo is updated, replacing the old zip file with the most recent version of that repository.
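
As an illustration of the Lambda stage, the sketch below validates a GitHub-style webhook signature and then starts the CodeBuild project. The environment variable names and project wiring are assumptions for the example, not details taken from the AWS template.

import hashlib
import hmac
import json
import os

import boto3

codebuild = boto3.client("codebuild")

def handler(event, context):
    # Shared secret configured on the webhook; env var name is hypothetical.
    secret = os.environ["WEBHOOK_SECRET"].encode()
    # Assumes the payload arrives un-encoded; a real handler may need to
    # check isBase64Encoded and normalize header casing for its gateway type.
    body = (event.get("body") or "").encode()
    signature = event.get("headers", {}).get("X-Hub-Signature-256", "")

    # GitHub sends an HMAC-SHA256 of the payload in this header.
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return {"statusCode": 401, "body": "invalid signature"}

    # Signature checks out: trigger the build that clones the repo,
    # zips it, and uploads the archive to the output bucket.
    codebuild.start_build(projectName=os.environ["CODEBUILD_PROJECT"])
    return {"statusCode": 200, "body": json.dumps({"status": "build started"})}

Comparing signatures with hmac.compare_digest rather than == avoids leaking information through timing differences during validation.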

Scanning Your S3 Bucket

Once your repositories are synced into the S3 bucket, scanning them within the product is straightforward: simply configure a scan in the UI, as shown in Figure 2.

[Figure: screenshot of Open Raven's Create New Scan Job panel, showing job name, status, and description; a repeat schedule; the data collections and classes to find; and the assets to include, along with scan limits by file count and total size.]
Figure 2: Configuration of a scan in Open Raven’s Product UI.

Clicking Continue submits the scan with the selected options, and it will run at the specified intervals, meaning the whole process, syncing and scanning, can be fully automated.

The results of the scan are presented in the platform under the asset you scanned, where you can review all of the findings, as seen in Figure 3.

[Figure: screenshot of Open Raven's scan results panel, with an asset overview (type, resource name, region, account, tags, and a data summary of data classes) and a Data Violations tab showing two open violations of low severity.]
Figure 3: Scan findings for the S3 bucket containing the code, which includes AWS keys.

Summary

Syncing GitHub repositories is extremely useful for Open Raven’s customers, as it lets them safeguard another aspect of their organization within the same product. Continuous syncing and automated scan scheduling for the repository bucket mean the process runs end to end on its own, leaving customers to focus only on the findings that come through. The downside to this approach is that customers must copy their repositories into a bucket. This minor risk is mitigated by the fact that the bucket and scanning live within their own cloud environment rather than ours, keeping the security of the bucket in their own hands.
