Should I use SQL or NoSQL?

Every application needs persistent storage — data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application's important data is one of first and most important architectural decisions to be made.

Most programmers use some kind of standalone database to provide persistent storage. Databases can be broadly classified as SQL or NoSQL, according to whether they use the SQL programming language and a schema. But which should you use? Let's take a look at the options, and unashamedly explore the benefits of SQL over NoSQL.

SQL vs. NoSQL

What exactly are we talking about when we say "SQL database" and "NoSQL database"? (Feel free to skip this section if you are already familiar with these terms.)

SQL

A SQL database, also known as a relational database or an RDBMS, requires the user to create an explicit schema that specifies exactly what sort of data may be stored in the database. Any data that does not conform to this predefined structure will be rejected by the database. This ensures that all data in the database has the same shape — no important fields can ever be accidentally omitted. Popular SQL databases include PostgreSQL, MySQL, Oracle, and SQL Server.

SQL databases organize their data into tables, which are something like spreadsheets in that each has a single set of columns for different attributes, and then a large collection of rows that fill in specific values for each column. You might also think of it as similar to the concept of classes in object-oriented programming. Each table models a single entity that is meaningful to the application, such as "users" or "accounts".

Within each table, there are defined columns that specify what attributes we wish to record. Once that's done, we can add any number of rows that store actual entries about specific entities.

For example, a Users table might have columns for username, password, first_name, last_name, and signup_date. When each new user is added, a new row is inserted into the table, with specific values given for each column.

Putting the "relational" in relational database

Tables can specify that certain columns record fixed relationships to other tables, linking them together. For example, suppose we're writing a blogging app. We record each user's sign-up details as rows in the Users table. When a user writes a comment, we record the comment as a row in the Comments table. One of the required columns in Comments records which user wrote the comment, a so called foreign key that points to a row in the Users table. In this way, we can ensure that every comment has its author recorded as it's created.

The schema allows the database engine to run ad hoc queries across the entire database in an efficient way. Querying the relationship between two tables is called joining. This means we're cross-referencing data from multiple tables together. For example, if we have a specific user from the Users table, we can easily get a list of all Comments created by that user by "joining" them together.

The magic of relational databases is that the database engine itself can create highly optimized query plans for how to do this join process. That might not sound so impressive when thinking of only two tables such as Users and Comments, but imagine a much more complex table structure that models, say, a large e-commerce website — joining in an efficient manner quickly becomes a non-trivial task when there are many complex joins going on at once.

This is only possible since the database engine already knows the structure of the data in detail. For example, if you have one table for Users and another table for Comments made by users, the engine can easily join them — that is, give you a list of comments created by a particular users.

SQL databases have historically been the most popular type in industry. In the 2000s, a new type of database called "NoSQL" became popular, too.

NoSQL

NoSQL databases, by contrast, have no rigidly pre-defined internal structure for the data that they store. Applications can write data in into the database in any format they like, with no constraints imposed by the database itself. Any structure used is entirely optional and left to the application layer to define and enforce. Popular NoSQL databases include Mongo, CouchDB, and Cassandra.

Proponents of the NoSQL model argue that this more flexible system is better as it frees the programmer from having to explicitly model the application's data in advance. They claim it supports a more agile or dynamic style of development in which unanticipated changes in requirements can more easily be accomodated by the database layer. This lack of internal structure comes at a price, however: the data in a NoSQL does have some kind of structure — otherwise programs could not make sense of it. It's just an ambiguous, poorly defined, and unenforced structure, which leads to maintenance issues and bugs.

Such databases also typically claim to be more scalable than SQL databases. This is an orthogonal issue — that is, their scalability properties are not directly related to their schemalessness, it results from their consistency model. The supposed ease of scaling NoSQL databases horizontally derives from their use of an "eventually consistent" consistency model. This means that the nodes in a cluster allow different user applications to see different versions of the data at the same time. The database deliberately presents an inconsistent view of the data from different nodes, leaving the problem of how to reconcile them to the application.This is acceptable in some specific applications, but rarely acceptable in any case involving money or other external side effects.

Why bother making a schema? That's just the man keeping us down.

Cases where an application really needs to store truly amorphous data with no fixed structure at all are exceedingly rare. In almost all cases, the data storage problem space can be ordered into a relational database schema with a little thought and planning. This is a valuable exercise in itself.

Not having to explicitly think about the structure of what's your database doesn't mean that it doesn't have any structure; it just means that you haven't thought much about what structure it has. This leads to redundancies and difficulty in sorting and querying the data, because the structure it's stored in doesn't correspond to the structure of the data.

Querying and especially joining data is often difficult in a NoSQL database of any complexity. It frequently requires the user to hand-code the join procedure themself, at the app level, whereas queries and joins are automatic and efficient in a SQL database. Doing that automatically is a solved problem, and the Postgres query optimizer is usually way better at joining and sorting data than you are.

Data drift

Applications and support programs such as analytics systems that use the application database all need to know about all present and historical versions of all data that has ever been in the database. This is a maintenance nightmare.

It's not that NoSQL data doesn't have any structure; it's just that no particular structure is enforced by the database itself. The result is data drift: as time goes on, the same logical entity winds up having several slightly different versions of itself present in the database. For example, suppose we collect users' last names in a field called "surname." Later, we decide to change this to the more common "lastname" — but records that are already in the database don't have "lastname", they have "surname"!

This may sound trivial when we're talking about just one field, but you can imagine far more complex data model changes that touch multiple entities without much effort.

How do NoSQL users deal with this data drift? There are only two options: either mass-update all the existing data — an error-prone and potentially time consuming chore — or make sure all programs that ever touch the database, now and in the future, know about all historic names for all fields used in all versions of all programs that have ever touched the database. Neither is particularly attractive. If we had a schema, we'd just update the field name in the schema and change our programs to use the new name.

Bad data passing as good

What happens if a programmer misspells a field name due to typo? What happens if nobody notices this for six months?

Since there is no enforced structure in a NoSQL database, over time, multiple similar but slightly different structures can evolve, unless coders are utterly meticulous, all the time, and completely avoid bugs and typos. (Guess how often that happens?) Programming bugs or user error frequently introduce data that is in some sense invalid into the database. This is common as application requirements change and new features are introduced.

If a field name is misspelled somewhere in the application code, the NoSQL database will happily accept this data anyway, whereas the SQL database will reject it immediately for having an invalid column name.

What happens when your data structure changes?

Programs evolve over time. Eventually they need to store new sorts of data that were not envisioned when V1 was written. The idea that you can easily slip new data into the database whenever you like is a main selling point of NoSQL databases.

Suppose in 6 months we decide to have all of our users tell us their favorite color, which we have never asked for before. Let's see how this new requirement would work in both the SQL and NoSQL models.

Updating the schema makes you think deliberately about what you're doing.

Since a SQL database rejects any data that doesn't conform to its schema, you have to deliberately update the schema before you can write new types of data to your database. This schema update is called a migration. Migrations are a source code file that explicitly instructs the database to alter its schema in some way.

In this case, we could alter the Users table to add a new optional column called favorite_color. For our existing user rows, this will be NULL by default, which is SQL's way of saying "there is no value in this slot." When users log in, our app asks them their favorite color, and writes it into this column.

We might have a mobile app, which hits an internal API server, and a user-facing web app run by entirely different teams. No matter how many different applications we have writing to the database, the slot for the user's favorite color will always be in the favorite_color column of the Users table. Any attempt to write to some other column will be rejected immediately by the database, ensuring that the data is uniform and always accessible by the same conventional names everywhere.

The perils of schemalessness

Imagine the same scenario with a NoSQL database. We tell the mobile app team and the web app team about our new requirement to ask users what their favorite color is.

Schemaless databases sound appealing because they're entirely flexible. You don't have to know in advance what sort of data your application will ultimately wind up storing. When the time comes to store a new data attribute, you can just start storing it in the database, with changes required at the database layer.

The mobile app team decides to store the user's favorite color in the favorite-color key. The web app decides to store the user's favorite color in the favorite_color key. Can you spot the problem?

It's subtle but common: one team used an underscore, the other team used a dash. We now have two entirely different slots for the same data. We either have to manually pick a winning key variant and then write a special one-off program to scan the whole database and rewrite any entries with the losing key to use the winning variant; or accept that there are two variants in the database, and make sure that all future versions of every program that touches the database know about all the historic variants.

Sound like a complicated pain in the ass? It is. It's deceptively simple and straightforward upfront, with deferred pain later.

You can easily imagine many variant cases. Suppose our mobile app team is in the UK, while our web app team is in the US. The mobile app team could easily use the British spelling and call the slot favourite_colour with Us, while the American team will call it favorite_color. Or maybe someone will call it preferred_color in their code. You get the idea.

Again, this may sound rather trivial in this contrived one-field example, but you can easily imagine how it quickly spirals out of control with more complex requirements.

"But," you say, "this can easily be solved: just pick a name to use in advance, and make sure everyone knows what it is." Yep, absolutely — that's called "designing a schema".

Your schemaless database isn't really schemaless

NoSQL teams wind up using schemas anyway — they just don't realize it. Programs written by different engineers couldn't interact otherwise. You can't stuff just any data into a database and expect your programs to magically understand it. There must be some conventionalized names for certain pieces of data, or programs won't know how to find it. That set of "same names for the same data" is called a "schema". The question is only whether you make the schema a first-class artifact that everyone knows about, can reference, can work on, and can propose changes to; or whether you pretend it doesn't exist and furtively modify it on the fly without noticing.

The schema is still there when you use a NoSQL database, it's just not explicit. It's usually not written down anywhere, and if it is that document is rarely up to date. Updates pass by word of mouth, in an unspecified and ad hoc fashion. This is slow and extremely error prone.

The schema is also "suggested" rather than "required," in the sense that there is by definition no technical mechanism for its enforcement in a NoSQL database. Since the database itself has no notion of what the current schema is, it can't reject attempts to add data that don't match the schema's requirements. It can only accept any data it gets, no matter what it contains.

The SQL database, by contrast, has a single explicit schema, which is by definition the central and universal source of truth on how the database is structured. Any developer who wants to know how to reference any data in the database can simply refer to the schema to see what's there and what it's called. There can never be any question of the same data winding up in multiple slots with different names, whether due to a bug or to lack of coordination between developers.

If you want to modify the schema, you have to write an explicit migration. You can't just wing it and have your program start writing new values into the database. This is a good thing. It makes you think about what you're doing, and it makes you make it explicit to the other developers on your team.

When our SQL-using team alters its database with a migration to add the favorite_color column to the Users table, that's it. The slot is called favorite_color by definition, and attempts to write favourite_color, favorite-color, or preferred_color will simply be rejected by the database. Every developer on the team then knows exactly where to find a user's favorite color from then on. A whole class of errors has been rendered impossible.

The Scalability Red Herring: You're Not Google

NoSQL advocates tout the fact that Google and Facebook hit the scaling limits of traditional SQL databases some time ago, and developed their own NoSQL databases as a result, to work around this scaling limit.

Google has in fact recently decided that these tradeoffs are not worth it, and switched back to a relational database model) with its proprietary Spanner / F1 database system.

Yet they trade on this halo effect: people assume Google and Facebook are using the very best tools for everything, and think they want to be like Google and Facebook. They don't mention the expensive and complex tradeoff involved in using a NoSQL database: you basically have to re-invent your own domain-specific query engine and schema enforcement system at the application layer. You get that for free with a relational database.

Google and Facebook are prepared to take that on because they are serving millions of requests per second, a level far beyond what almost all companies ever have to deal with, and they have the vast budgets necessary to do so. That level of scaling requires expensive and complex workarounds in any case.

Frankly, you're not Google or Facebook. Your scaling needs are not remotely similar to theirs, and almost certainly never will be, despite success beyond your wildest dreams. If you do actually hit Google-level scale in the future, you'll have to rewrite your whole system from scratch anyway; using a relational database will be the least of your worries.

So what is NoSQL good for?

If you are slurping in a lot of unstructured JSON data from a third party, then using a document store for its initial landing spot can be a good decision. This is a reasonable first step before further processing takes place, or when you only care about a small portion of the data.

But don't use a NoSQL database for that! Use a Postgres database with a JSONB column. This is essentially a mini-NoSQL database that lives within your relational database. It's a better NoSQL than NoSQL is! This allows you the convenience of slurping in that unstructured data in a designated spot within a structured relational database.

Conclusion

It seems like a lot less upfront work to skip the schema and dive right in. That's why most people choose it. They want to flow, they want to be freeform, they want to be creative and expressive and move where the wind takes them. I get it.

The problem is that computers are rather rigid in how they can operate. Attempting to apply a free-spirited attitude to computer programming and skip the boring old "planning" part of what you're doing. is asking for pain.

If you're going to devote weeks, months, or years of your life to building a computer program — an artifact that can only ever obey the rules of rigid logic &dmash; does it not behoove you to take an hour or a day to think through a couple of plans on how to structure that logic?

We think so. SQL is the way to go. If your project doesn't seem to fit into a SQL schema... think about it some more. :) It is possible and will save you a world of pain down the road.

We're here to help. Want to chat about your schema? Write me: pjlegato@databaselabs.io

Paul Legato
Founder, Database Labs

February 5, 2016