Should You Migrate Your Reporting Queries From a “General Purpose” DB (MongoDB) to a Data Warehouse (ClickHouse)? (Performance Overview)

MongoDB and “General Purpose” Databases

I’ve been using MongoDB for quite a few years now, and that has been an enjoyable experience overall.

Mongo covers the vast majority of usage scenarios you might encounter quite well. The flexible schema allows you to evolve your applications in a way that’s quite challenging with relational databases like MySQL, SQL Server, Postgres, etc. It also supports a rich feature set: ACID transactions, advanced indexing, replication, sharding, and complex aggregation queries, just to name a few.

Together with the great community and the tooling around it, Mongo stands as one of the best general-purpose databases.

Still, this wide variety of supported features can be considered both Mongo’s biggest strength and its biggest weakness.

Why is that?

Useful Resources

This article is heavily influenced by the book Designing Data-Intensive Applications. I strongly recommend it if you want to learn more about the nuts and bolts of data-driven systems.

Another excellent learning resource is the Introduction to Database Engineering course on Udemy.

“One Size Fits All” Doesn’t Cut it Anymore

Simply put, when a single system has to handle significantly different usage patterns, it is doomed to perform suboptimally for at least some of them at scale.

Depending on the context, modern systems have very different requirements in terms of data access patterns.

You should be open to embracing this variety and using the right database solution for the specific use case at hand, following the Polyglot Persistence architecture mindset.

Let’s describe some different database usage scenarios.

Online Transaction Processing (OLTP) / Transactional Databases

OLTP queries typically look something like this:

“Get the orders for a user with ID 123.”

Usually, you request a single record or a few records from the database by some sort of identifier. This means the filtering criteria (userId = 123) is quite selective, enabling you to use a proper indexing strategy to boost performance. You can achieve pretty good execution times even with a large dataset containing billions of records.

Also, generally with OLTP-style queries, you are interested in most of the fields of the database entity. In the orders example above, you may want to display the full order details on a web page for the customer to review.
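
To make this concrete, here’s a minimal sketch in SQL, assuming a hypothetical orders table and index: a selective, index-backed lookup of exactly the kind OLTP databases are built for.

-- Hypothetical orders table; an index on user_id makes the filter selective.
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Fetch the handful of rows for one user, reading most of their fields;
-- the index lets the database avoid a full table scan.
SELECT *
FROM orders
WHERE user_id = 123;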

Most of the popular transactional databases, like MySQL and MongoDB, fall into the category of OLTP Database Systems.

Online Analytical Processing (OLAP) / Data Warehouses

OLAP queries have a very different structure and purpose.

Say you have some data with sales transactions, and you want to understand the total revenue for every city for the last three years.

What would the database query look like logically?

You need a filter for the last three years, group by city, and sum the paid amount within every group.

In SQL terms, it would look something like this:

SELECT city, SUM(amount) FROM sales WHERE year IN (2019, 2020, 2021) GROUP BY city

How does that compare to OLTP?

How is OLAP Different From OLTP?

First, OLAP queries might need to scan over a vast volume of data after the filters are applied. With OLTP, we only have a few records returned after the filtering (e.g. all the orders for the user), while in the OLAP case, we are traversing all the sales records from the last three years, which can be millions (or even billions) of entities. In other words, the filtering doesn’t help that much. You still have a large number of records to aggregate across.

Second, for OLAP, you tend to need only a tiny fraction of the fields/columns from the DB entity. In our example, it’s only three fields – the year, the city, and the amount of money the customer paid for the purchase. The goal is not to retrieve all the information per sale (as with OLTP) but rather to do some analytics (sum of amount) across a large number of sales.

OLAP systems are often referred to as data warehouses.

What About Writes?

Transactional databases, in contrast to data warehouses, are heavily used for write operations (inserts/updates/deletes) and for ACID transactions in general.

On the other hand, data warehouses are primarily about generating insights from the data by serving analytical queries. Most often, the data arrives in the data warehouse from a transactional database through batch or stream processing.
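
Just as an illustration, here’s a hedged sketch of what a batch load from Mongo into ClickHouse could look like, using ClickHouse’s MongoDB table engine (the host, table, field, and credential names are made up for the example):

-- A read-only "window" over a Mongo collection via the MongoDB table engine.
-- Host, database, collection, and credentials below are hypothetical.
CREATE TABLE mongo_sales
(
    city   String,
    amount Float64,
    year   UInt16
)
ENGINE = MongoDB('mongo:27017', 'shop', 'sales', 'etl_user', 'secret');

-- A periodic batch job then copies the documents into a native
-- ClickHouse table (warehouse.sales, assumed to already exist).
INSERT INTO warehouse.sales
SELECT city, amount, year
FROM mongo_sales;

In a real pipeline, you’d also need to track which documents have already been copied, or replace the batch step with stream processing.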

The focus of this article is exploring the query (read) performance, so I won’t spend too much time on the writing side of things.

Still, just for completeness, let’s discuss yet another usage pattern seen more and more often in modern systems.

High Volume Write Throughput

There are cases when your system needs to handle a massive amount of writes.

A typical example is some sensor data coming from IoT devices leading to thousands or more writes per second. Another case might be recording users’ interaction events on a high-traffic website.

There are databases like Cassandra that are optimized precisely for such scenarios.

In essence, their storage engines most often use append-only logs to store new data, which is hard to beat in terms of write throughput, mainly due to the strictly sequential disk writes.

Such engines use a data structure called a Log-Structured Merge Tree (LSM tree). LSM trees are also supported in WiredTiger, the storage engine used by MongoDB.

The Right Tool For the Job

I hope the last few sections convinced you that it’s challenging to use a single database for all the use cases in your system.

Notice that I didn’t even mention scenarios like caching, searching, time series, etc., where database systems like Redis, Elasticsearch, and InfluxDB/Prometheus would excel.

Now, it’s time to turn to the main focus of the article – ClickHouse.

Meet ClickHouse

ClickHouse is a columnar OLAP database used for high-performance real-time data analytics.

It claims to be “at least 100 times faster for OLAP scenarios compared to traditional databases”.

I was a little skeptical when I initially read this statement, so I decided to give it a try and do some actual benchmarks. I imported the same dataset to ClickHouse and Mongo and ran some queries against both.

I found the outcomes quite impressive, so I decided to document them in this blog post.

To learn more about ClickHouse, see the Altinity videos on YouTube. Also, for an intro, you can listen to the following episode of the Software Engineering Daily podcast.

The Demo Dataset

Yandex.Metrica is a free web analytics service that tracks and reports website traffic similarly to Google Analytics. Yandex has provided the Anonymized Metrica Data for public access. I will be using the “hits” dataset that contains around 9 million entities.

I found this dataset quite useful because it’s big enough to demonstrate the main points of interest, especially in terms of performance, but at the same time, it’s convenient for a local Docker setup.

For our needs, it’s certainly not essential to understand the meaning of every field in the dataset (I certainly don’t). The main point is that the DB entities are quite realistic in terms of size and structure.

This makes the dataset quite handy for exploring some real-world use cases.

Here’s just a tiny fraction of a document, so you get a sense of what it looks like:

{ 
    "_id" : ObjectId("61975333685b7fae52e6911c"), 
    "WatchID" : 7043438415214026105, 
    "JavaEnable" : 1, 
    "Title" : "", 
    "GoodEvent" : 1, 
    "EventTime" : ISODate("2014-03-17T13:29:13.000+0000"), 
    "EventDate" : ISODate("2014-03-16T22:00:00.000+0000"), 
    "CounterID" : 57, 
    "ClientIP" : 706706104, 
    "RegionID" : 14050, 
    "UserID" : 8585742290196126178, 
    "CounterClass" : 0, 
    "OS" : 56, 
    "UserAgent" : 4, 
    "URL" : "http://hurpasspx?Entitle,pos,p5,source=wiz&fp=2&pos=42&wp=1&pos=7&lang=ru&lr=10275961,55.84517842040/?fromServic-last/moto.am/iadebnye_uchasti-shedshipmeethodID.asp&is_vative=0&search-tag=2&rpt=simages.yandex.ru/rp/semennoe/zarplatjie-babe-podpisyvat-otvety.ru/tour=170000009886%26ref%3Dhttp://kryma-parca.ru/chat&users/lexus-70@mail.yandex.ru/yarportwo/11868/parta.com/load/483/?from]=200837&tease", 
    "Referer" : "http://kelebekgaleri.hurriyet", 
    "URLDomain" : "hurpass.uaprod.hu", 
    "RefererDomain" : "sprashivai.ru", 
    "Refresh" : 0, 
    "IsRobot" : 0,
    ... more fields
}

Again, the dataset is available publicly if you want to get a deeper insight.

Test Environment

For this demo, I am running both MongoDB and ClickHouse under Docker on a host machine with moderate specifications:

  • Processor: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  • RAM: 16.0 GB
  • Drive: Intel® SSD 760p Series (256GB, M.2 80mm, PCIe 3.1 x4, 3D2, TLC)
  • OS: Windows 10

There Are No Perfect Benchmarks

Before moving on, let me clarify that the performance measurements you’ll see below are far from being a scientific benchmark. And they are not meant to be.

It’s close to impossible to perform universally accurate and unbiased benchmarks.

In this article, I’m just trying to showcase the performance boost you might gain for your OLAP workload if you consider using a data warehouse (ClickHouse specifically).

If you decide to explore integrating ClickHouse into your production system, I advise you to perform your own testing specific to your use case and dataset.
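
On that note, ClickHouse records executed queries in its system.query_log table, which is handy for collecting timings during such tests. A quick sketch (query logging is enabled by default):

-- Inspect the timings ClickHouse recorded for recently finished queries.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;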

For a performance comparison of ClickHouse with other data warehouses, you can visit this page.

The Queries

Let’s start running some queries against ClickHouse and Mongo and compare their execution time.

I’ll summarize the results in a tabular format in a later section. I’ve also provided screenshots with the results from the queries to make sure the semantics stay the same, even though the syntax differs (Mongo does not use SQL as its query language).

I did my best to avoid caching that would bring misleading results. All of the numbers below are taken after a cold start of both ClickHouse and Mongo. On a side note – ClickHouse seems to benefit a lot from caching, e.g. once a query has been executed, pretty often a second run takes significantly less time.
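
If you want to reduce the cache effects yourself, ClickHouse gives you some control over its internal caches. A small sketch (note that the OS page cache still plays a role, so this is not a perfect cold start):

-- Drop ClickHouse's internal caches between benchmark runs.
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;

-- Or disable the uncompressed cache for a single query.
SELECT count()
FROM datasets.hits_v1
SETTINGS use_uncompressed_cache = 0;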

Query 1 – Most Popular Search Phrase by User

ClickHouse Query:

SELECT UserID, SearchPhrase, count() FROM datasets.hits_v1 WHERE SearchPhrase != '' GROUP BY UserID, SearchPhrase ORDER BY count() DESC LIMIT 3

ClickHouse Execution Time: 0.57 seconds

Mongo Query:

db.getCollection("hits_v1").aggregate([
    {
        "$match": {
            "SearchPhrase": {$ne: ""}
        }
    },
    {
        $group: {
            _id: {
                "userId": "$UserID",
                "searchPhrase": "$SearchPhrase"
            },
            "count": {
                $sum:1
            }
        }
    },
    {
        "$sort": {
            "count": -1
        }
    },
    {
        "$limit": 3
    }
]);

Mongo Execution Time: 33 seconds

Query 2 – Most Popular Mobile Phone Models

ClickHouse Query:

SELECT MobilePhoneModel, count() AS c FROM datasets.hits_v1 WHERE MobilePhoneModel != '' GROUP BY MobilePhoneModel ORDER BY c DESC LIMIT 3

ClickHouse Execution Time: 0.13 seconds

Mongo Query:

db.getCollection("hits_v1").aggregate([
    {
        "$match": {
            "MobilePhoneModel": {$ne: ""}
        }
    },
    {
        $group: {
            _id: {
                "device": "$MobilePhoneModel"
            },
            "count": {
                $sum: 1
            }
        }
    },
    {
        "$sort": {
            "count": -1
        }
    },
    {
        "$limit": 3
    }
]);

Mongo Execution Time: 31 seconds

Query 3 – Most Popular URLs

ClickHouse Query:

SELECT URL, count() AS c FROM datasets.hits_v1 GROUP BY URL ORDER BY c DESC LIMIT 3

ClickHouse Execution Time: 4.408 seconds

Mongo Query:

db.getCollection("hits_v1").aggregate([
    {
        $group: {
            _id: {
                "url": "$URL"
            },
            "count": {
                $sum: 1
            }
        }
    },
    {
        "$sort": {
            "count": -1
        }
    },
    {
        "$limit": 3
    }
],{ 
        allowDiskUse: true
});

Mongo Execution Time: 46 seconds

Query 4 – Find a Single Document

ClickHouse Query:

SELECT * FROM datasets.hits_v1 WHERE WatchID = 5950807165367044006

ClickHouse Execution Time: 1.959 seconds

Mongo Query:

db.getCollection("hits_v1").find({
    "WatchID": NumberLong("7043438415214026105")
})

Mongo Execution Time: 25 seconds

(*) The result here is a little more surprising to me, since “select *” queries are generally a better fit for OLTP databases like Mongo. Still, ClickHouse seems to perform much better here as well.

Query 5 – Count of Distinct Mobile Phone Models

ClickHouse Query:

SELECT COUNT(DISTINCT(MobilePhoneModel)) FROM datasets.hits_v1

ClickHouse Execution Time: 0.266 seconds

Mongo Query:

db.getCollection("hits_v1").distinct("MobilePhoneModel").length

Mongo Execution Time: 34 seconds (*)

(*) It’s a little harder to get the exact timing for that Mongo query, so I manually measured the execution time.

Query 6 – Substring Search

ClickHouse Query:

SELECT count() FROM datasets.hits_v1 WHERE URL LIKE '%rutube.ru/inworld%'

ClickHouse Execution Time: 1.610 seconds

Mongo Query:

db.getCollection("hits_v1").count({
    "URL": /rutube.ru\/inworld/
})

Mongo Execution Time: 33 seconds (*)

(*) It’s a little harder to get the exact timing for that Mongo query, so I manually measured the execution time.

Execution Time Comparison

Here’s a table summarizing the query performance (execution times in seconds):

Query ID | Mongo | ClickHouse | Mongo/CH Ratio
1        | 33    | 0.57       | 58
2        | 31    | 0.13       | 238
3        | 46    | 4.4        | 10
4        | 25    | 2          | 12
5        | 34    | 0.3        | 113
6        | 33    | 1.6        | 20

The execution times for ClickHouse range from a fraction of a second to a few seconds, while Mongo’s results are mostly in the 30-40 second range.

That’s quite a difference, isn’t it?

Why is ClickHouse so Fast?

There are a number of factors explaining the impressive performance of ClickHouse. I thought this article would be incomplete without presenting at least a few of the core components that make it so fast.

Remember that ClickHouse is built for performance. This means that except for the fundamental design decisions that are common across most data warehouses, there are also plenty of low-level optimizations.

I recommend reading Chapter 3 – “Storage and Retrieval,” section – “Transaction Processing or Analytics?” from Designing Data-Intensive Applications and visiting the following page on the ClickHouse website.

Columnar Storage

Traditional databases arrange data on disk row by row, i.e. all the fields of a specific row “sit” next to each other. This makes it easy(*) to load a whole row, or a set of adjacent rows, into memory.

(*) “easy” means fewer disk IO calls.

The rows are usually organized in pages, so strictly speaking, the DB would load a full page in memory instead of an individual row. The exact mechanics here are not that important for the current discussion.

However, as we’ve already mentioned, OLAP queries are typically interested only in a small subset of columns.

In this sense, imagine what a row-based database needs to do when executing an OLAP-style query. It would load all of the fields for every row of interest (or the pages that contain them), just to extract a few columns. This significantly increases the IO bandwidth requirements.

On the flip side, columnar storage simply stores all the values from a specific column together. Each column is kept in a separate file, so a query only needs the files with the columns of interest, which is a vast improvement, especially for documents with many fields.

The figure below graphically compares columnar and row-oriented storage.

Fig. 1 – Row-Oriented vs. Column-Oriented Databases

Storing data by columns opens a new optimization opportunity – compression.

Column Compression

Notice that the values in a single column are homogeneous and can be pretty repetitive.

Take, for example, the Bonus column from Fig. 1. The FirstName and LastName columns can also have a lot of duplicates. The more repeated values we have, the better compression level we can achieve.

Compare this to a row-based storage engine, where columns of different types and their values are stored together. The same compression algorithms are just not applicable in this case.

Depending on the data in the column, different compression techniques can be used. This discussion is, however, out of scope for this article.
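
Still, to give you a feel for how this surfaces in practice, here’s a minimal ClickHouse sketch with a hypothetical sales table: each column is stored in its own file(s), and you can pick a compression codec per column.

-- Hypothetical table; per-column codecs exploit the homogeneous,
-- repetitive nature of columnar data.
CREATE TABLE sales
(
    year   UInt16 CODEC(Delta, ZSTD),   -- small deltas compress very well
    city   LowCardinality(String),      -- dictionary-encodes repeated city names
    amount Decimal(18, 2) CODEC(ZSTD)
)
ENGINE = MergeTree
ORDER BY (year, city);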

Memory Bandwidth and Vectorized Processing

Big aggregation queries require scanning over millions of rows. Getting the data from disk into memory, and from memory into the CPU cache, becomes the bottleneck.

Data warehouses are optimized to use the memory bandwidth and the CPU cache efficiently.

This allows for:

  • Avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline.
  • Designing different data operations to run on chunks of compressed data directly – this technique is known as vectorized processing.

From Experiment to Production

This article focused on comparing the query performance characteristics of a data warehouse (ClickHouse) and a transactional database (MongoDB).

One of the main takeaways is that modern large-scale systems need specialized tools to deal with specific query patterns.

In particular, we’ve explored how OLAP queries can be hard for a traditional general-purpose OLTP database like MongoDB to handle.

ClickHouse is built for performance, so it’s worth considering using it for your high-volume aggregation queries.

Still, it’s one thing to run a few examples in a local environment and get some promising results. It’s an entirely different story to make a particular tool part of your production system, especially when we’re talking about databases.

There are so many different aspects you need to consider further.

For example, if you’re switching your OLAP workload from Mongo to ClickHouse, here’s an (incomplete) list of items you may need to explore:

  1. Data replication from Mongo into ClickHouse. Delivery guarantees.
  2. Data modeling – support for nested objects and arrays.
  3. ACID transactions.
  4. Complex filtering and indexing support in ClickHouse.
  5. Client SDKs/drivers.
  6. CI/CD.
  7. Tooling.
  8. Community.
  9. Others.

I plan to dig deeper into some of these topics in future articles.

Stick around if that sounds exciting!

Resources

  1. Designing Data-Intensive Applications
  2. ClickHouse Documentation
  3. Introduction to Database Engineering, Udemy
