How to prevent sensitive data leakages through code repositories

Version control software (VCS) is essential for most modern software development practices. Among other benefits, software like Git, Mercurial, Bazaar, Perforce, CVS, and Subversion allows developers to save snapshots of their project history to enable better collaboration, revert to previous states, recover from unintended code changes, and manage multiple versions of the same codebase. These tools allow multiple developers to safely work on the same project and provide significant benefits even if you do not plan to share your work with others.

Although it is important to save your code in source control, it is equally important for some project assets to be kept out of your repository. Certain data like binary blobs and configuration files are best left out of source control for performance and usability reasons. But more importantly, sensitive data like passwords, secrets, and private keys should never be checked into a repository unprotected for security reasons.

Checking your Git Repository for Sensitive Data

First of all, once you started managing your secret security you need to check the repository for certain data. If you know an exact string that you want to search for, you can try using your VCS tool’s native search function to check whether the provided value is present in any commits. For example, with git, a command like this can search for a specific password:

git grep my_secret $(git rev-list --all)

Setting the security

Once you have removed sensitive data from the repository you should consider setting some internal tools to ensure you did not commit those files.

Ignoring Sensitive Files

The most basic way to keep files with sensitive data out of your repository is to leverage your VCS’s ignore functionality from the very beginning. VCS “ignore” files (like .gitignore) define patterns, directories, or files that should be excluded from the repository. These are a good first line of defense against accidentally exposing data. This strategy is useful because it does not rely on external tooling, the list of excluded items is automatically configured for collaborators, and it is easy to set up.

While VCS ignore functionality is useful as a baseline, it relies on keeping the ignore definitions up-to-date. It is easy to commit sensitive data accidentally prior to updating or implementing the ignore file. Ignore patterns that only have file-level granularity, so you may have to refactor some parts of your project if secrets are mixed in with code or other data that should be committed.

Using VCS Hooks to Check Files Prior to Committing

Most modern VCS implementations include a system called “hooks” for executing scripts before or after certain actions are taken within the repository. This functionality can be used to execute a script to check the contents of pending changes for sensitive material. The previously mentioned git-secrets tool has the ability to install pre-commit hooks that implement automatic checking for the type of content it evaluates. You can add your own custom scripts to check for whatever patterns you’d like to guard against.

Repository hooks provide a much more flexible mechanism for searching for and guarding against the addition of sensitive data at the time of commit. This increased flexibility comes at the cost of having to script all of the behavior you’d like to implement, which can potentially be a difficult process depending on the type of data you want to check. An additional consideration is that hooks are not shared as easily as ignore files, as they are not part of the repository that other developers copy. Each contributor will need to set up the hooks on their own machine, which makes enforcement a more difficult problem.

Adding Files to the Staging Area Explicitly

While more localized in scope, one simple strategy that may help you to be more mindful of your commits is to only add items to the VCS staging area explicitly by name. While adding files by wildcard or expansion can save some time, being intentional about each file you want to add can help prevent accidental additions that might otherwise be included. A beneficial side effect of this is that it generally allows you to create more focused and consistent commits, which helps with many other aspects of collaborative work.

Rules that you need to consider:

  • Never store unencrypted secrets in .git repositories. 
    A secret in a private repo is like a password written on a $20 bill, you might trust the person you gave it to, but that bill can end up in hundreds of peoples hands as a part of multiple transactions and within multiple cash registers.
  • Avoid git add * commands on git. 
    Using wildcard commands like git add *or git add . can easily capture files that should not enter a git repository, this includes generated files, config files and temporary source code. Add each file by name when making a commit and use git status to list tracked and untracked files.
  • Don’t rely on code reviews to discover secrets.
    It is extremely important to understand that code reviews will not always detect secrets, especially if they are hidden in previous versions of code. The reason code reviews are not adequate protection is because reviewers are only concerned with the difference between current and proposed states of the code, they do not consider the entire history of the project.
  • Use local environment variables, when feasible.
    An environment variable is a dynamic object whose value is set outside of the application. This makes them easier to rotate without having to make changes within the application itself. It also removes the need to have these written within source code, making them more appropriate to handle sensitive data.
  • Use automated secrets scanning on repositories.
    Implement real time alerting on repositories and gain visibility over where your secrets are with tools like GitGuardian