Markus Winand's Blog

Choosing NoSQL For The Right Reason

In Performance, Reliability, Scalability on 2011-05-13 at 09:42

Observing the NoSQL hype through the eyes of an SQL performance consultant is an interesting experience. It is, however, very hard to write about NoSQL because there are so many forms of it. After all, NoSQL is nothing more than a marketing term. A marketing term that works pretty well because it goes to the heart of many developers that struggle with SQL every day.

My unrepresentative observation is that NoSQL is often taken for performance reasons. Probably because SQL performance problems are an everyday experience. NoSQL, on the other hand, is known to “scale well”. However, performance is often a bad reason to choose NoSQL—especially if the side effects, like eventual consistency, are poorly understood.

Most SQL performance problems result out of improper indexing. Again, my unrepresentative observation. But I believe it so strongly that I am writing a book about SQL indexing. But indexing is not only a SQL topic, it applies to NoSQL as well. MongoDB, for example, claims to support “Index[es] on any attribute, just like you’re used to“. Seems like there is no way around proper indexing—no matter if you use SQL or NoSQL. The latest release of my book, “Response Time, Throughput and Horizontal Scalability“, describes that in more detail.

Performance is—almost always—the wrong reason for NoSQL. Still there are cases where NoSQL is a better fit than SQL. As an example, I’ll describe a NoSQL system that I use almost every day. It is the distributed revision control system Git. Wait! Git is not NoSQL? Well, let’s have a closer look.

Git doesn’t have an SQL front end

Git has specialized interfaces to interact with the repository. Either on the command line or integrated into an IDE. There isn’t anything that remotely compares to SQL or a relational model. I never missed it.

Git doesn’t use an SQL back-end

Honestly, if I would have to develop a revision control system, I wouldn’t take an SQL database as back-end. There is no benefit in putting BLOBs into a relational model and handling BLOBs all the time is just too awkward.

Git is distributed

That’s my favourite Git feature. Working offline is exactly what is meant by ‘partition tolerance’ in Brewer’s CAP Theorem. I can use all Git features without Internet connection. Others can, of course, still use the server if they can connect to it. Full functionality on either end. It is partition tolerant.

Conflicts happen anyway

If there is one thing we learned in the 25 years since Larry Wall introduced patch, it is that conflicts happen. No matter what. Software development has a very long “transaction time” and we are mostly using optimistic locking—conflicts are inevitable. But here comes the famous CAP Theorem again. If we cannot have consistency anyway, let’s focus on the other two CAP properties: availability and partition tolerance.

Acknowledging inconsistencies means to take care of methods and tools to find and resolve them. That involves the software (e.g., Git) as well as the user. But here comes one last unrepresentative observation from my side: most NoSQL users just ignore that. They assume that the system magically resolves contradicting writes automatically. It’s like using a CVS work flow with Git—it works for a while, but you’ll end up in trouble soon.

I’m not aware of a minimum feature set for NoSQL datastores—it’s therefore hard to tell if Git fulfils them or not. However, Git feels to me like using NoSQL for the right reason.

It’s about choosing the right tool for the job. But I can’t get rid of the feeling that NoSQL is too often taken for the wrong reasons—query response time, in particular. No doubt, NoSQL is a better fit for some applications. However, an index review would often solve the performance problems within a few days. SQL is no better than NoSQL, nor vice-versa. Because the question is not what’s better. The question is what is a better fit for a particular problem.

Do not use offset for pagination. Learn why.

  1. “Most SQL performance problems result out of improper indexing.”

    I’m sorry, but that is simply not true. The main reason you run into performance problems with relational databases is because the data model has problems. Indices are more like “inherent opportunities” to speed things up here and there, but the main performance comes from understanding the data, understanding the access patterns and understanding how the SQL engine will calculate a plan. And then designing a relational model that will balance these things out.

    Designing a relational model that performs well is hard. Which is why there are so few database professionals available who seem able to do this well. It requires deep understanding of both the domain that needs to be modeled as well as the technology it will run on.

    Just tweaking indices is going to work for a small subset of problems, but the real performance gains are made during the design of the relational model you will use. And unfortunately, they do not teach this in school.

    Also, the whole NoSQL vs SQL debate is artificial. There is no real debate: they address entirely different classes of problems. It can be boiled down to this: SQL is about consistency and flexibility during querying at the expense of scalability and performance.

    NoSQL is about “performance at scale”and flexibility when writing data at the expense of consistency and querying flexibility. Note that I said “performance at scale”, because many of the NoSQL databases do not have particularly impressive performance for small datasets.

    Sacrificing absolute consistency is hard, but as it turns out, for a lot of “new” problems lack of consistency is less of a problem than long response times. The “new” problems here are online systems with massive numbers of users. For some classes of companies you can feel this directly. Most online banks are still pretty slow. Sites like Amazon, on the other hand are quite snappy given that they have a lot more web traffic than any bank. (And when it comes to online commerce, every millisecond counts).

    A system can be said to be scalable when the cost of increasing its size is sublinear with respect to the dimension you need to scale. Since the relational model is inherently expensive to apply in a distributed manner, it is relatively easy to show that you cannot get sublinear cost for arbitrary scale along any dimension.

    However, with certain sacrifices you can get sublinear cost. For instance by breaking the relational model somewhat and partitioning the data into independent instances that have no dependencies on other instances.

    Note that when we say “cost” we mostly talk in terms of latency and processing power. Not dollars. Although it will end up costing dollars.

    SQL has its place and NoSQL has its place, but it is important to understand that they address different types of problems. I have worked at companies that have naively used SQL databases for NoSQL type problems and vice versa. It is unhelpful that people keep comparing them directly instead of trying to develop and disseminate the kind of knowledge needed to reason about this.

    Also it doesn’t help that Stonebreaker et al, to draw attention to themselves, muddy the waters and confuse the issues by planting the idea that NoSQL is somehow the antithesis of SQL. In fact the label “NoSQL” has been incredibly unhelpful because it suggests that there is a problem with SQL and that NoSQL is the magic solution. This is, at best, naive. And unfortunately leads people to get hung up on the wrong ideas.

    • “The main reason you run into performance problems with relational databases is because the data model has problems.”

      Well, I made another observation. In fact, everybody knows that database design is important and must be done carefully. There are many books covering that in more or less detail. I do not say that database design is not important, but I say it’s usually done carefully anyway because everybody knows it’s important.

      What I find at client sites is that developers are not aware how to index properly and how to write queries that can benefit from indexing. The DBAs, on the other hand, know about indexing but don’t have the deep domain knowledge to know how the data is queried.

      Nobody ignores schema design but indexing is almost always ignored until it’s too late. Adding some more or less random indexes might improve the situation, but it is exactly what I refer to as “improper” indexing. Indexing without a plan. In fact, my position is that indexing must be designed with the same care as the schema.

      I pretty much agree with your statements about NoSQL.

    • “No doubt, NoSQL is a better fit for some applications.”

      Other than performance, could you provide examples of applications that are more suitable to NoSQL semantics? SQL has all other advantages, like persistence frameworks, consistency, and other tools. NoSQL solutions like cassandra are target for scaling writes. Other than sharding how do you think to scale writes on a RDMS?

      • Well, in lack of a definition for “NoSQL semantics” I like to see the CAP Theorem as the central star that NoSQL systems orbit around.

        That said, I believe that any application where partition tolerance is more important than consistency is a good fit for NoSQL. Partition tolerance seems to be poorly understood in the field. I took the Git example because many developers know what “distributed” means in context of Git. I could have taken any other distributed, partition tolerant revision control system for that purpose. Source code repositories are a particularly good fit because they hardly every reach consistency anyway.

        The question—what is more important, partition tolerance or consistency—depends on the data. Huge social networks have to cope with tons of data that has very little value. Strict consistency doesn’t pay off for that. The damage caused by conflicts is little compared to the costs to establish strict consistency. That argument is, however, nonexistent for small sites because consistency is easy to achieve there.

        I also mentioned BLOBs in the article. As a software architect, I have been involved in many discussions where to store low value binary data like user uploads. I remember a meeting where the DBA smashed the proposal to use BLOBs for vast amounts of user data by proclaiming that “BLOBs have no business in my database”. BLOBs have, quite often, little value—even if connected to high value relational data. From that angle, I believe that some NoSQL systems make a great distributed BLOB store which can coexist with a relational database.

        Scaling writes is subject to Brewer’s CAP Theorem—take two out of three. Sharding and similar methods bypass it by not distributing the data at large—that is, only a small subset of the nodes is responsible to maintain a particular data sub-set.

        I feel, however, that the need to scale out is constantly decreasing for most applications. I have observed a multi-national banking system over the past decade. It was initially running on a huge two node active-active cluster to distribute load. A few years later, it was moved to a hot-standby cluster—just one node active. Today, it’s being migrated to a virtualized server running other databases on the same hardware. “Scale-out” is not the trend—virtualization is, at least in enterprise environments. Huge social sites being the obvious exception.

  2. Nice post, Git being close to a NoSQL system shows how little the word “NoSQL” actually means.

  3. […] Winand made the case earlier this year that the version control system Git is actually a NoSQL datastore. The blog […]