Database Performance at Any Scale

Sponsored Feature The web has changed more than just the way we consume data; it has changed the way we treat and serve it. The legacy commercial database engines that have sustained us since the mid-1990s are showing signs of wear and tear. And the key question for businesses now is how best to scale data processing to meet regularly and rapidly fluctuating transaction volumes.

Amazon Web Services (AWS) released DynamoDB in 2012 to address this issue. Fully managed cloud-based database services had already removed the burden of installing, patching, and managing database instances, but they still required some level of capacity planning to support workload peaks. Scheduling and provisioning that capacity is complex, and it forces customers to pay for allocated resources that are not always needed. Amazon DynamoDB was one of the first database services designed as a serverless database, eliminating both the need to plan for peak capacity and the cost of paying for idle, pre-allocated resources.

A different database architecture

To meet growing demand for Internet-scale volumes, AWS decided to offer a different architecture. First and foremost, DynamoDB was designed as a serverless database that requires no administrative overhead: customers simply create tables and start using the database. Second, the serverless architecture allows tables to scale up in milliseconds when on-demand capacity is needed and scale back to zero when it isn't, so customers pay only for the resources they consume. Finally, AWS built DynamoDB on NoSQL concepts, using a key-value store that departs from the relational model.
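
In practice, creating a serverless table takes a single API call. Here is a minimal sketch using the boto3 SDK for Python; the table name and key schema are illustrative, and BillingMode="PAY_PER_REQUEST" selects the on-demand capacity mode described above.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create a table with only a primary key defined -- no schema for the
# other attributes, and no servers or capacity to provision.
dynamodb.create_table(
    TableName="Actors",                  # illustrative name
    AttributeDefinitions=[
        {"AttributeName": "ActorId", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "ActorId", "KeyType": "HASH"},
    ],
    BillingMode="PAY_PER_REQUEST",       # on-demand: pay per request
)
```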

“Early customer research showed that for many key relational database workloads, most API calls are single-key lookups on a primary key,” said Joe Idziorek, DynamoDB product manager at AWS. “We decided that if we designed for this access model, we could do things an order of magnitude better than existing technologies.”
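
That access pattern, a single-key lookup, maps to one call in the boto3 SDK for Python. The table and key names below are assumptions for illustration:

```python
import boto3

table = boto3.resource("dynamodb").Table("Actors")

# Fetch exactly one item by its primary key -- the access pattern
# Idziorek describes, with no query planner or joins involved.
response = table.get_item(Key={"ActorId": 101})
actor = response.get("Item")  # None if no item has that key
```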

Relational databases and DynamoDB both store data in tables with primary keys, but the similarity ends there. The storage structure in a relational system, known as a schema, is rigid. Each table stores multiple records, each in its own row. Every record contains the same set of predefined fields, starting with a primary key that uniquely identifies that record.

A database of movie actors, for example, might include each actor's name, date of birth, salary, and place of residence. If you don't know where an actor lives, you store a null value in that field.

But what if you decide, after entering 100 actors, that you want to store the name of actor 101’s pet? Then you’ll need to run a migration to update the schema, adding the field with a null value for any actors that don’t have that information. It’s tedious and time-consuming, and it’s not a decision any organization would take lightly.
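
To make that concrete, here is roughly what the migration looks like in a relational database, sketched with Python's built-in sqlite3 module; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The rigid schema: every record has the same predefined fields.
conn.execute(
    "CREATE TABLE actors (actor_id INTEGER PRIMARY KEY,"
    " name TEXT, birth_date TEXT, salary REAL, residence TEXT)"
)

# Storing a pet name for actor 101 means altering the schema first;
# the 100 existing rows all get a NULL in the new column.
conn.execute("ALTER TABLE actors ADD COLUMN pet_name TEXT")
```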

DynamoDB breaks the records-and-fields paradigm. Instead, each movie actor is an item in the table, carrying a list of attributes, and each item can have its own combination of attributes. A “pet name” attribute would appear only in the attribute list for actor 101. This key-value approach makes DynamoDB very flexible: you don't need to define a schema in advance. That suits modern applications, where DevOps teams push updates to cloud-based apps every few days or even faster.
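
In code, that flexibility means two items in the same table can carry different attribute sets with no schema change. A sketch with boto3, using invented names and values:

```python
import boto3

table = boto3.resource("dynamodb").Table("Actors")

# Actor 100 has no pet; actor 101 does. Neither write requires the
# table to know about a "PetName" attribute in advance.
table.put_item(Item={"ActorId": 100, "Name": "Jane Doe"})
table.put_item(Item={"ActorId": 101, "Name": "John Roe", "PetName": "Rex"})
```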

“Customers benefit from truly creative schema modeling in DynamoDB that matches their application and gives them the flexibility to iterate quickly,” says Idziorek. “The ability to do this without having to modify a table schema is one of the productivity features of a NoSQL database like DynamoDB that developers really appreciate.”

DynamoDB differs from a relational database in another important way: everything is stored in a single table. Suppose you want to store every movie an actor has starred in. In a relational system, you would define a separate table and store each movie's information as a record with its own primary key. Then you would create a join table linking actors to movies.

If actor 50 (Clint Eastwood) was in movie 12 (Dirty Harry), you would store their two primary keys (50 and 12) side by side in the join table. You would do the same for Harry Guardino, John Vernon, and John Mitchum, who were also in this movie. This means you only have to store the Dirty Harry record once. It's efficient from a storage perspective, and it also means that if you ever need to edit the information for Dirty Harry, you only have to do it in the movies table.
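
Sketched again with sqlite3, the three tables and the join look something like this; finding everyone in Dirty Harry means stitching all three together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actors (actor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE actor_movie (actor_id INTEGER, movie_id INTEGER);

    INSERT INTO actors VALUES (50, 'Clint Eastwood');
    INSERT INTO movies VALUES (12, 'Dirty Harry');
    INSERT INTO actor_movie VALUES (50, 12);  -- the side-by-side keys
""")

# Who was in Dirty Harry? The answer requires joining all three tables.
rows = conn.execute("""
    SELECT a.name FROM actors a
    JOIN actor_movie am ON am.actor_id = a.actor_id
    JOIN movies m ON m.movie_id = am.movie_id
    WHERE m.title = 'Dirty Harry'
""").fetchall()
print(rows)  # [('Clint Eastwood',)]
```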

A Scalability Trade-Off

DynamoDB works differently. It would store the movie information as a nested attribute inside the actor item. This means that Eastwood, Guardino, Vernon, and Mitchum would each carry the full Dirty Harry information, possibly including the release year, director, genre, and budget.
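
Under the same illustrative schema, the denormalized DynamoDB item might look like the following; the attribute names are assumptions:

```python
import boto3

table = boto3.resource("dynamodb").Table("Actors")

# The movie lives inside the actor item as a nested attribute, so the
# same details are duplicated into each co-star's item.
table.put_item(Item={
    "ActorId": 50,
    "Name": "Clint Eastwood",
    "Movies": [
        {"Title": "Dirty Harry", "Year": 1971,
         "Director": "Don Siegel", "Genre": "Thriller"},
    ],
})
```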

“It’s inefficient from a storage perspective, but there’s a trade-off,” Idziorek says. “It’s worth it when you want to find all the actors in Dirty Harry, or all the movies Clint Eastwood has been in. In a relational database you’d have to run a SQL JOIN query, which is expensive and computationally slow, especially when a studio reboots Dirty Harry and everyone starts querying the database at the same time.”

When the relational database was invented, storage was expensive; now it’s cheap. “It depends on what you’re optimizing for, because storage is no longer the limited resource,” Idziorek says. “In many cases, storage costs aren’t the primary concern with duplicating information. Instead, what we’re trying to do is optimize query performance.”

Partitioning for Performance

Partitioning is what allows DynamoDB to deliver the same performance for the millionth customer as it did for the first, Idziorek says. It also means that each customer gets the same throughput no matter how many queries they send to the database. Indeed, AWS designed DynamoDB from the ground up to partition workloads.

“Relational database architectures fetch data from storage and process it in a single in-memory instance. They were not designed to break queries into multiple parts and work on each at the same time. As a database designed for the cloud, DynamoDB changed that,” he adds. “We designed DynamoDB for horizontal scaling, which means the database is not constrained by a workload that can only fit on one machine; instead, it partitions workloads across multiple compute nodes.”

The partitioning is transparent to the user and the database manages it using a partition key. The key serves as input to a mathematical hash function which, in turn, determines which physical storage location will contain the data.
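
As a toy illustration of hash-based placement (not DynamoDB's actual internal function), hashing the partition key and reducing the digest modulo the partition count yields a deterministic storage location:

```python
import hashlib

def pick_partition(partition_key: str, num_partitions: int) -> int:
    """Toy stand-in for hash routing: same key, same partition."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

print(pick_partition("ActorId#50", 8))  # always the same value in [0, 8)
```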

DynamoDB uses this mechanism to physically store groups of items close together while distributing large data sets across many partitions. This means individual compute nodes can each work on different shards in parallel to complete a query faster. The database also supports an optional sort key that developers can combine with the partition key to create a composite key. The sort key keeps items within a partition in order, which speeds up range queries and further increases performance.
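
A boto3 sketch of a composite-key query: the partition key pins down one partition, and the sort key narrows the range of items within it. The table and key names are invented for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Filmography")

# One partition (ActorId = 50), a sort-key range within it; results
# come back already ordered by the sort key.
response = table.query(
    KeyConditionExpression=Key("ActorId").eq(50)
    & Key("MovieTitle").begins_with("Dirty")
)
items = response["Items"]
```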

“A shard can serve an entire key range,” says Idziorek, “but shards automatically split as the workload increases. AWS matches the number of shards to the workload, and customers can set their own threshold rules to limit their spending. They can use an on-demand mode that bills per request, removing any caps for high-value, critical workloads that might be unpredictable, such as retail. For more predictable workloads, where they can absorb some throttling at peak volumes, they can opt for provisioned capacity with a compute cap.”
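
Both capacity modes are set with a single call in boto3. The sketch below applies them to an existing table (DynamoDB allows one mode switch per 24 hours, so the two calls are alternatives, not a sequence); the table name and throughput figures are illustrative:

```python
import boto3

client = boto3.client("dynamodb")

# Option 1 -- on-demand: bill per request, no capacity to manage.
client.update_table(TableName="Actors", BillingMode="PAY_PER_REQUEST")

# Option 2 -- provisioned: a fixed throughput ceiling that caps spend.
# client.update_table(
#     TableName="Actors",
#     BillingMode="PROVISIONED",
#     ProvisionedThroughput={"ReadCapacityUnits": 100,
#                            "WriteCapacityUnits": 50},
# )
```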

Snap, creator of Snapchat, migrated to DynamoDB in 2018. The company pays for reserved capacity up front, which reduces its costs, while DynamoDB automatically scales compute power as needed. The move saved money and time, cutting the median latency of sending a Snapchat message by more than 20%.

AWS has extended DynamoDB’s performance globally across multiple regions, of which there are currently 24. Administrators can configure a replica of a table in each region and write to any of them from a local application, an arrangement known as “active-active” replication. Applications benefit from low-latency reads in their local region as well as fast writes. The managed database engine handles cross-region reconciliation after each write on its own schedule, using a “last writer wins” algorithm.
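
Adding a replica region is likewise a table update. This sketch assumes the current (2019.11.21) version of global tables, which requires streams to be enabled on the source table; table name and regions are illustrative:

```python
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Add a second region to turn the table into a global, active-active
# table. The table must already have streams enabled with
# StreamViewType=NEW_AND_OLD_IMAGES.
client.update_table(
    TableName="Actors",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)
```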

Serverless Operation

DynamoDB was the first serverless database service developed by AWS, which has since expanded the portfolio to include Amazon Aurora Serverless, Amazon Keyspaces, Amazon Quantum Ledger Database (QLDB), and Amazon Timestream. As mentioned earlier, one of the main attractions of a serverless database is that it reduces operating costs by calling up compute capacity only when it is needed, allowing customers to pay for the computing power they use rather than powering an idle virtual server.

Developers can also use AWS Lambda serverless functions in combination with DynamoDB Streams to trigger events on demand. Streams is an optional service that captures changes to DynamoDB tables in near real time, timestamps them, and retains them for 24 hours. Developers can use the events captured in Streams to trigger Lambda functions. For example, they could initiate an email notification when an order attribute nested in a customer item is updated to say it has shipped.
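
A Lambda function subscribed to a stream receives batches of change records. The sketch below follows the standard DynamoDB stream event shape; the shipped-order check and the notifier are hypothetical, standing in for whatever action the application needs:

```python
def handler(event, context):
    """Fires once per batch of DynamoDB stream records."""
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        # Stream images use DynamoDB's typed JSON, e.g. {"S": "shipped"}.
        new_image = record["dynamodb"].get("NewImage", {})
        if new_image.get("OrderStatus", {}).get("S") == "shipped":
            send_shipping_email(new_image)  # hypothetical notifier

def send_shipping_email(item):
    print("order shipped:", item)  # placeholder for a real email service
```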

As the global pace of digitization accelerates, transaction volumes will continue to increase. Additionally, the application modernization movement further amplifies the performance gap between traditional relational architectures and NoSQL. Therefore, Idziorek expects DynamoDB to become even more popular among companies trying to keep up.

“That’s what’s really unique about DynamoDB,” he concludes. “It allows customers to future-proof their applications. They don’t have to refactor them, no matter how successful they are.”

Sponsored by AWS.
