NoSQL databases have become very popular. Big companies rely on them to store hundreds of petabytes of data and run millions of queries per second. But what is a NoSQL database? How does it work and why does it scale so much better than traditional relational databases?
Let's start by quickly explaining the problem with relational databases, like MySQL, MariaDB, SQL Server, and the like. These are built to store relational data as efficiently as possible. You can have a table for customers, orders, and products, linking them together logically. Customers place orders, and orders contain products. This tight organization is great for managing your data and keeping it consistent, but it comes at a cost: relational databases have a hard time scaling. They have to maintain these relationships, and that's an intensive process, requiring a lot of memory and compute power.
So, for a while, you can keep upgrading your database server, but at some point, it won't be able to handle the load. In technical terms, we say that relational databases scale well vertically but are hard to scale horizontally, whereas NoSQL databases are designed to scale both vertically and horizontally. You can compare this to a building. Vertical scaling means adding more floors to an existing building, while horizontal scaling means adding more buildings. You intuitively understand that vertical scaling is only possible to a certain extent, while horizontal scaling is much more powerful.
Now, why do NoSQL databases scale so well? First of all, they do away with these costly relationships. In NoSQL, every item in the database stands on its own. This simple modification means that they are essentially key-value stores. Each item in the database only has two fields: a unique key and a value. For instance, when you want to store product information, you can use the product's barcode as the key and the product name as the value. This seems restrictive, but the value can be something like a JSON document containing more data, like the price and description.
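To make the key-value idea concrete, here is a minimal sketch in Python. It assumes a plain dictionary standing in for the database, and the barcode and product details are made up for illustration. The `put` and `get` helpers are hypothetical names, not the API of any real NoSQL product.

```python
import json

# A plain dict stands in for the database: one unique key, one value per item.
store = {}

def put(key, value):
    """Store the value as a JSON document (a string) under the given key."""
    store[key] = json.dumps(value)

def get(key):
    """Return the decoded JSON document, or None if the key is unknown."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

# Hypothetical barcode and product data, for illustration only.
put("4006381333931", {
    "name": "Ballpoint pen",
    "price": 1.99,
    "description": "Blue ink",
})

print(get("4006381333931")["name"])  # Ballpoint pen
```

The database itself only ever sees a key and an opaque value. Everything else, like the price and description, lives inside the JSON document.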
This simpler design is why NoSQL databases scale better. If a single database server is not enough to store all your data or handle all the queries, you can split the workload across two or more servers. Each server will then be responsible for only a part of your database. To give an example, Apple runs a NoSQL database that consists of over 75,000 servers.
In NoSQL terms, these parts of your database are called partitions, and this brings up a question: if your database is split across potentially thousands of partitions, how do you know where an item is stored? That's where the primary key comes in. Remember, NoSQL databases are key-value stores, and the key determines on which partition an item will be stored. Behind the scenes, NoSQL databases use a hash function to convert each item's primary key into a number that falls into a fixed range, say, between 0 and 100. This hash value and the range are then used to determine where to store an item.
If your database is small enough or doesn't get many requests, you can put everything on a single server. This one will then be responsible for the entire range. If that server becomes overloaded, you can add a second server, which means that the range will be split in half. Server 1 will be responsible for all items with a hash from 0 up to, but not including, 50, while Server 2 will store everything from 50 to 100. Theoretically, you've now doubled your database capacity, both in terms of storage and in the number of queries you can execute.
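Here is a rough sketch of that hashing step in Python. It assumes MD5 as the hash function purely because it ships with the standard library; real databases pick their own hash functions. The names `KEYSPACE_SIZE` and `key_hash` are made up for this example, and the range is taken as 0 through 99.

```python
import hashlib

KEYSPACE_SIZE = 100  # hash values land in 0..99, echoing the "0 to 100" range

def key_hash(primary_key: str) -> int:
    """Map a primary key to a number inside the fixed range."""
    # MD5 is an assumption for illustration; it gives the same result on
    # every run, unlike Python's built-in hash() for strings.
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    # Turn the first 8 bytes into an integer and fold it into the range.
    return int.from_bytes(digest[:8], "big") % KEYSPACE_SIZE

for key in ["order-1001", "order-1002", "customer-7"]:
    print(key, "->", key_hash(key))
```

The important property is that the same key always produces the same number, so the database can recompute it at any time instead of remembering where each item went.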
This range is also called a keyspace. It's a simple system that solves two problems: where to store new items and where to find existing ones. All you have to do is calculate the hash of an item's key and keep track of which server is responsible for which part of the keyspace.
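As a sketch of that bookkeeping, the Python snippet below keeps a small routing table that maps slices of the keyspace to servers. Everything here is an assumption for illustration: the `Router` class, the server names, the equal-sized slices, and the MD5-based `key_hash` helper. It also skips the hard part that real systems deal with, which is moving existing items when the ranges change.

```python
import bisect
import hashlib

KEYSPACE_SIZE = 100  # hash values land in 0..99

def key_hash(primary_key):
    """Hypothetical hash helper: fold an MD5 digest into the keyspace."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % KEYSPACE_SIZE

class Router:
    """Tracks which server owns which slice of the keyspace."""

    def __init__(self, servers):
        self.rebalance(servers)

    def rebalance(self, servers):
        # Split the keyspace into equal slices, one per server.
        # upper_bounds[i] is the exclusive upper end of server i's slice.
        count = len(servers)
        self.servers = list(servers)
        self.upper_bounds = [KEYSPACE_SIZE * (i + 1) // count for i in range(count)]

    def server_for_hash(self, hash_value):
        # bisect_right finds the first slice whose upper bound is above the hash.
        return self.servers[bisect.bisect_right(self.upper_bounds, hash_value)]

    def server_for_key(self, primary_key):
        return self.server_for_hash(key_hash(primary_key))

router = Router(["server-1"])
print(router.server_for_hash(75))  # server-1 (it owns the whole range)

router.rebalance(["server-1", "server-2"])
print(router.server_for_hash(49))  # server-1
print(router.server_for_hash(50))  # server-2
```

The same lookup answers both questions: where to write a new item and where to find an existing one.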
Besides great scalability, NoSQL is schemaless, which means that items in the database don't need to have the same structure. Each one can be completely different. In a relational database, you have to define your table structure, and then each item must conform to it. Changing this structure isn't straightforward and could even lead to data loss. Not having a schema can be a big advantage if your application and data structure are constantly evolving.
Now, at this point, it's clear that NoSQL databases have certain advantages over relational ones, but that's not to say that relational databases are obsolete. Far from it. NoSQL is more limited in the way you can retrieve your data, often only allowing you to retrieve items efficiently by their primary key. Finding orders by ID is no problem, but finding all orders above a certain amount would be very inefficient. Relational databases, on the other hand, have no trouble with this.
Another downside is that many NoSQL databases are eventually consistent. When you write a new item to the database and try to read it back straight away, it might not be returned. As I've explained, NoSQL splits your database into partitions, but each partition is mirrored across multiple servers. That way, a server can go down without much impact. When you write a new item to the database, one of these mirrors will store the new item and then copy it to the others in the background. This process might take a little bit of time, so when you read that item back, the NoSQL database might try to read it from a mirror that doesn't have it yet.
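Here is a small sketch of that difference in Python. It uses the standard library's `sqlite3` module as the relational side and a plain dictionary as a stand-in for a schemaless store. The table, keys, and fields are all made up for illustration.

```python
import sqlite3

# Relational side: the table structure is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (barcode TEXT PRIMARY KEY, name TEXT)")

try:
    # "tags" was never defined as a column, so this insert is rejected.
    conn.execute(
        "INSERT INTO products (barcode, name, tags) VALUES (?, ?, ?)",
        ("4006381333931", "Ballpoint pen", "office"),
    )
    relational_accepted = True
except sqlite3.OperationalError as error:
    relational_accepted = False
    print("Relational insert failed:", error)

# Schemaless side: a dict stands in for the store, and every item
# is free to carry its own set of fields.
items = {
    "product:4006381333931": {"name": "Ballpoint pen", "tags": ["office"]},
    "product:0000000000017": {"name": "Notebook", "pages": 80, "price": 3.5},
}

for key, item in items.items():
    print(key, sorted(item))
```

To accept the extra field, the relational table first needs a schema change, while the schemaless store takes the item as it comes.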
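The gap between those two kinds of queries can be sketched in a few lines of Python. This assumes a dictionary standing in for the key-value store, with 1,000 made-up orders; the function names are hypothetical.

```python
# 1,000 hypothetical orders: order-1 costs 10, order-2 costs 20, and so on.
orders = {f"order-{i}": {"amount": i * 10} for i in range(1, 1001)}

def get_order(order_id):
    """Lookup by primary key: goes straight to the item."""
    return orders.get(order_id)

def orders_above(min_amount):
    """No key to go on, so every single item has to be inspected."""
    return [
        order_id
        for order_id, order in orders.items()
        if order["amount"] > min_amount
    ]

print(get_order("order-42"))    # {'amount': 420}
print(len(orders_above(9950)))  # 5
```

In a partitioned database, the second query is worse than it looks here, because the hash of the key says nothing about the amount, so the scan has to visit every partition.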
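To see how a read can miss a fresh write, here is a toy simulation in Python. It assumes one partition with three mirrors, where writes land on the first mirror and an explicit `replicate()` call stands in for the background copy. The class and method names are made up; real databases handle this very differently under the hood.

```python
class ReplicatedPartition:
    """One partition mirrored across several in-memory replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # The write lands on replica 0 first; copying to the rest is deferred.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Stands in for the background copy to the other mirrors.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

    def read(self, key, replica_index):
        # Returns None when this replica has not received the item yet.
        return self.replicas[replica_index].get(key)

partition = ReplicatedPartition()
partition.write("order-1001", {"amount": 250})

print(partition.read("order-1001", replica_index=0))  # {'amount': 250}
print(partition.read("order-1001", replica_index=1))  # None (not copied yet)

partition.replicate()
print(partition.read("order-1001", replica_index=1))  # {'amount': 250}
```

Given enough time, every mirror ends up with the same data, which is where the "eventually" in eventually consistent comes from.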
In summary, both NoSQL and relational databases will be around for the foreseeable future, each with their own strengths and weaknesses.
So, now you know how NoSQL works. Let's look at a few examples. Cloud providers heavily promote NoSQL because they can scale it more easily. AWS has DynamoDB, Google Cloud has Bigtable, and Azure has Cosmos DB. During Amazon Prime Day in 2019, Amazon's NoSQL database peaked at 45 million requests per second. That's mind-boggling. But you can also run NoSQL databases yourself with software like Cassandra, Scylla, CouchDB, MongoDB, and many more.
Before ending this video, let's quickly talk about the name "NoSQL". It's a bit confusing as it can have two meanings. First up, NoSQL can mean "Not only SQL", pointing to the fact that some NoSQL databases partially understand the SQL query language on top of their own query capabilities. And secondly, it's often called NoSQL in the sense of "non-relational" because it can't easily store relational data.