Scrubbing your Django database

This is the second in a series of posts, focusing on issues around open sourcing your Django site and data privacy in Django.

You’ll end up with production data in your Django database, and it will likely contain different kinds of data: configuration data, required basic data (categories, for example), collected data and personal user data. There are a couple of reasons for taking that production data and copying it off your production servers:

  • for developers and contributors, you want a sample copy of the app with some key data in it.
  • for testing or staging servers, you might want to copy down the database from the production server so you can test certain scenarios or load.

Extracting parts of your database

For the first case, it’s nice to prepare a minimal copy of the database that contains key data. For example, for those wanting to develop or contribute to addons.mozilla.org we have Landfill by Wil Clouser.

Django comes with a nice facility for moving data around: fixture dumping and loading. This can be used to pull data out of your database and then reload it. However, the built-in Django dumpdata dumps all the records for your model (depending upon your default object manager). That might not be what you want for this scenario. A useful utility for dumping just the records you want is provided by Django fixture magic, written by Dave Dash.

A standard dumpdata looks like this:

manage.py dumpdata users.UserProfile

This will dump every UserProfile. By contrast:

manage.py dump_object users.UserProfile 1

This will dump just the UserProfile with a primary key of 1. Django fixture magic also has a few other useful things, such as merging and reordering fixtures.

This lets you quickly trim your live database down to a small set of fixtures. Developers or contributors can then load the key parts of the database that they need from those fixtures.
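To make the idea of trimming concrete, here is a minimal sketch in plain Python that filters an existing JSON fixture down to chosen primary keys. It is not how Django fixture magic works internally; it only relies on the standard shape of dumpdata's JSON output (a list of objects with "model", "pk" and "fields" keys). The function name select_objects and the sample records are made up for illustration.

```python
import json


def select_objects(fixture, model, pks):
    """Return only the fixture entries for `model` whose pk is in `pks`.

    `fixture` is a list of dicts in the shape produced by dumpdata's JSON
    serializer. `select_objects` is a hypothetical helper, not a Django API.
    """
    wanted = set(pks)
    label = model.lower()  # dumpdata writes model labels in lowercase
    return [
        obj for obj in fixture
        if obj["model"].lower() == label and obj["pk"] in wanted
    ]


# A tiny, invented stand-in for the output of
# `manage.py dumpdata users.UserProfile`.
full_dump = json.loads("""
[
  {"model": "users.userprofile", "pk": 1, "fields": {"email": "a@example.com"}},
  {"model": "users.userprofile", "pk": 2, "fields": {"email": "b@example.com"}}
]
""")

# Keep only the UserProfile with primary key 1.
trimmed = select_objects(full_dump, "users.UserProfile", [1])
print(json.dumps(trimmed, indent=2))
```

Unlike a real tool working against the database, this only filters what is already in the file, so it won't follow foreign keys to pull in related records.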

Anonymising the database

Sending the production database downstream to internal developers or internal test sites is a pretty common use case. This does not require a complete clean of the database, but it does require some cleaning. If you stored credit card data, for example, you’d never want to copy that off your production database.

At Mozilla we use an anonymising script, written by Dave Dash again. There are a few options: truncate, nullify, randomize or selectively delete. The format is a simple YAML file, for example:

   tables:
        users:
            random_email: email
            nullify:
                - firstname

This is a snippet from the config file for addons.mozilla.org.

When IT copies the databases down from production, this script is run against them. This ensures that when we developers access the backups to investigate certain issues, we get the bits we want and not the bits that might expose user data.
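As a rough illustration of what rules like the ones in the YAML snippet could mean, here is a small Python sketch that applies nullify and random_email to rows held in memory. This is not Dave Dash's script, which runs against the database itself. The config is written as a plain dict mirroring the YAML, and the names anonymise and random_email, the sample row, and the reading of each rule are all assumptions made for this example.

```python
import copy
import uuid

# The YAML snippet from earlier, written out as a plain dict so the
# example needs no YAML parser.
CONFIG = {
    "tables": {
        "users": {
            "random_email": "email",
            "nullify": ["firstname"],
        }
    }
}


def random_email():
    """Build a throwaway address on the reserved example.com domain."""
    return f"{uuid.uuid4().hex}@example.com"


def anonymise(tables, config):
    """Return a scrubbed copy of `tables` ({table name: [row dicts]}).

    Assumed semantics: `nullify` sets the listed columns to None and
    `random_email` replaces the named column with a random address.
    """
    scrubbed = copy.deepcopy(tables)  # leave the input untouched
    for table, rules in config["tables"].items():
        for row in scrubbed.get(table, []):
            for column in rules.get("nullify", []):
                row[column] = None
            email_column = rules.get("random_email")
            if email_column:
                row[email_column] = random_email()
    return scrubbed


# Invented sample data standing in for a production table.
production = {
    "users": [{"id": 1, "firstname": "Ada", "email": "ada@real.example"}]
}
staging = anonymise(production, CONFIG)
print(staging)
```

Columns that have no rule (id here) pass through unchanged, so a config like this has to name every sensitive column explicitly.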

In the next blog post we’ll look at logs and tracebacks.

1 response

  1. Luke Plant wrote on :

    For anonymizing, there is also this package: http://pypi.python.org/pypi/django-anonymizer

    Instead of a YAML config file it uses some Python classes, but it will create these for you, introspecting the Django models and guessing what ‘randomizer’ to use. Appropriate subclassing can be used for more customization of the behaviour.

    It is only suitable if you’ve already trimmed down your DB to a suitable size, however, as it doesn’t do any trimming of the DB at the moment.