MongoDB and the Dangers of Schema Rot

A large news website, like many web companies over the past few years, has been undergoing a slow but steady transition from a single, monolithic code base to a service oriented architecture. Our reasons are pretty standard – the team is growing, everyone wants to be able to develop without worrying about stepping on other developers’ toes – but it have one obstacle that makes its situation unique: its primary data store is a single MongoDB database that contains six years of production data, and all services are expected to affect the public-facing website by directly modifying said database.

Broken Encapsulation

If that last sentence made you cringe, you probably already know where this is going: using a single, shared database breaks service encapsulation in a dangerous way, and it gets worse the older a database gets. Services have to maintain a deep knowledge of their companion service's internal mechanics not only as they exist at any given moment, but as they have existed over time.

Normally, there's nothing wrong with sharing a database: for many companies, a traditional database is the very first service they deploy: it provides a standardized, widely understood, and extensively documented way to access and modify data in a controlled way, especially if the data is in the third normal form. Unless someone makes a radical change to the database schema, access patterns and security protocols are defined in the service responsible for storing the data. And even when the schema changes, the data access API changes for all data. Not so with Mongo: change the schema a particular application uses, and it will affect documents inserted and modified after the changes were effected. Data stored prior to that must be retrieved and processed using the old schema, leading to services littered with ugly hasOwnProperty or isset checks, or if you want to formalize changes, something along the lines of if (document.version < 2.3) ....

This is schema rot.

Even alternative document stores like CouchDB and Solr/ElasticSearch, with their predefined views and schemas, respectively, provide a much stronger public API than Mongo can. Mongo's support for arbitrary queries and lack of an internal mechanism for schema migrations mean that if you use it, you will eventually find yourself supporting multiple schemas in the same collection. If you're using an ODM like Hibernate or Doctrine, this will be a source of considerable pain, especially once multiple services all have to independently implement defensive mechanisms for protecting themselves from pervasive schema rot.

So what can you do? A common solution seems to be implementing an internal REST API that is the only service allowed to talk to Mongo. But now you have a defined schema and centralized access control, meaning that all you gained from using Mongo instead of Postgres was … global write locks.