The fundamental principle of cloud systems is to build on many disposable, replaceable machines. This has direct consequences for implementation techniques, and therefore for the capabilities of database systems implemented in the cloud.
Traditional databases can be roughly categorized as parallel-first (e.g., MongoDB or Teradata) or single-system-first (e.g., PostgreSQL or MySQL), with scale-out often retrofitted later (e.g., Redshift, Greenplum). Each category has limitations inherent in its basic design. The extent of these limitations is partly a function of maturity, but some basic architectural decisions mean that particular features may never be effectively supported.
For example, Greenplum has sequences but Redshift does not, although both are derivatives of PostgreSQL. BigQuery does not have sequences; Teradata does (although they are not truly sequential in the traditional sense).
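That parenthetical matters in practice: parallel systems that do offer generated IDs typically hand each worker its own block of numbers so that no per-row coordination is needed, which yields unique but non-sequential values. The sketch below illustrates the general block-allocation pattern; the class and names are purely illustrative, not any vendor's API.

```python
import itertools

class BlockAllocator:
    """Hands out disjoint ID ranges so parallel workers never coordinate per row."""
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self._next_block = itertools.count()

    def worker(self):
        """Each worker draws a private block and numbers its rows within it."""
        start = next(self._next_block) * self.block_size
        return iter(range(start, start + self.block_size))

alloc = BlockAllocator(block_size=1000)
w1, w2 = alloc.worker(), alloc.worker()

# Interleaved inserts on two workers: IDs are unique but not sequential.
ids = [next(w1), next(w2), next(w1), next(w2)]
print(ids)  # [0, 1000, 1, 1001]
```

The gaps and interleaving are exactly why such IDs are "not really sequential in the traditional sense": uniqueness is cheap in parallel, strict ordering is not.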
Cloud databases fall into the same categories, with a clear tendency to favor parallel-first designs for new systems. The fundamental properties of cloud systems are parallelism for scalability and machine replaceability.
In the single-system-first category, cloud instantiations tend to focus on managed cost, upgrades, and reliability (RPO/RTO) for a traditional single-machine product, such as Heroku PostgreSQL, Amazon Aurora (PostgreSQL/MySQL), Google Cloud SQL (PostgreSQL/MySQL), and Azure SQL (SQL Server).
Within the parallel category, there are effectively two subcategories: the SQL/relational category (BigQuery, Snowflake, Redshift, Spark, Azure Synapse) and the DHT/NoSQL category (BigTable, Dynamo, Cassandra, Redis). This distinction has less to do with the presence or absence of a SQL-like language, and more to do with whether the physical layout of data within the system is tuned for single-row hash access for fast lookups on a key, or for bulk access using sort-merge and filter operations.
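The two physical layouts can be sketched side by side. This is an illustrative toy, not any product's storage engine: the same rows are laid out once as a hash map for single-key probes, and once as sorted runs that a linear sort-merge pass can join and filter in bulk.

```python
# The same rows laid out for the two access patterns described above.
rows = [("k3", "c"), ("k1", "a"), ("k2", "b")]

# DHT/NoSQL layout: hash-partitioned for O(1) single-key lookups.
hash_layout = {key: value for key, value in rows}
assert hash_layout["k2"] == "b"    # one probe, no scan

# Relational/analytic layout: sorted runs, built for bulk merge and filter.
sorted_run = sorted(rows)          # [("k1","a"), ("k2","b"), ("k3","c")]
other_run = [("k1", "x"), ("k3", "y")]

def sort_merge_join(left, right):
    """Join two key-sorted runs in a single linear pass over each."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1; j += 1
        elif left[i][0] < right[j][0]:
            i += 1
        else:
            j += 1
    return out

print(sort_merge_join(sorted_run, other_run))  # [('k1', 'a', 'x'), ('k3', 'c', 'y')]
```

Neither layout is better in the abstract; each is pathological for the other's workload, which is why the subcategories persist.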
Parallel-first relational databases often rely on one or more cloud-native storage systems. These storage systems are themselves built parallel-first and expose a very limited get-object/put-object API, which generally allows data partitioning but does not allow high-performance random access. This limits the database's ability to implement advanced persistent data structures such as indexes or, in many cases, mutable data.
Therefore, cloud implementations built on native storage tend to rely on sequential reads and writes of micropartitions instead of indexes. There is usually only one physical path to a storage-level object, based on the object's name. Indexes must be implemented outside the underlying storage, and even where this is done, the underlying cloud storage API can make it difficult in practice to use a byte address or offset within a storage-level object.
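The consequence of a whole-object API can be made concrete. The sketch below is an assumed, minimal model of such a store (the `put_object`/`get_object` names illustrate the pattern, they are not any SDK's signatures): with no byte-range or in-place update primitive, a table becomes a set of immutable micropartitions, and a point lookup without an external index degenerates to a sequential scan.

```python
import json

class ObjectStore:
    """Toy model of a cloud object store: whole objects only, no byte-level seeks."""
    def __init__(self):
        self._objects = {}

    def put_object(self, name, data: bytes):
        self._objects[name] = data      # objects are written or replaced wholesale

    def get_object(self, name) -> bytes:
        return self._objects[name]      # no offset or range parameter here

# A table becomes a set of immutable micropartitions, named by convention.
store = ObjectStore()
store.put_object("sales/part-0000", json.dumps([{"id": 1, "amt": 10}]).encode())
store.put_object("sales/part-0001", json.dumps([{"id": 2, "amt": 99}]).encode())

def scan(store, prefix, predicate):
    """Without indexes, even a point lookup is a sequential scan of partitions."""
    for name in sorted(store._objects):
        if name.startswith(prefix):
            for row in json.loads(store.get_object(name)):
                if predicate(row):
                    yield row

print(list(scan(store, "sales/", lambda r: r["amt"] > 50)))  # [{'id': 2, 'amt': 99}]
```

Real object stores do offer ranged reads, but the broader point stands: the single name-based path to an object, and the cost of small random reads, push cloud databases toward bulk sequential access.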
The strengths of the cloud

The infrastructure is managed for you. In the cloud, deployment, reliability, and administration are someone else's problem. All layers of the stack, from power, hardware, and software installation to operating system management and security (from hardening to intrusion detection), are handled by the cloud provider.
Cloud providers' free trial offerings let you run initial experiments and then scale up gracefully if needed, something that is difficult at best in traditional on-premises systems.
Another advantage is that cloud providers offer many standardized processes to integrate with third-party SaaS products. The result is that the cloud provider makes the infrastructure a problem for someone else so that you can focus on your core business.
Efficiency. The cloud lives by maximizing resource utilization. It is much more common for a cloud system to expose resource-usage controls to the database application than for a non-cloud system. Load can be smoothed or shifted to low-demand time slots, and interactive and critical tasks can be prioritized.
Of course, cloud providers can leverage the efficiency of large-scale purchasing, load sharing, and very high utilization rates. These economies of scale alone can justify the move to the cloud, not to mention the benefit of drawing on the vendor's expertise in hardening and intrusion detection.
Closely related to scalability is the ability of cloud providers to provision passive storage at lower cost. This makes it easier to retain longer historical windows of data, whether for experimental, analytical, backup, or audit purposes, and more cost-effective to implement features like time travel, where data can be inspected as it existed at a point in the past.
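One common way time travel is realized on cheap passive storage is to keep immutable snapshots and resolve reads "as of" a timestamp. The sketch below is a simplified, assumed model of that pattern (the `VersionedTable` class and its methods are illustrative, not any vendor's implementation):

```python
import bisect

class VersionedTable:
    """Toy time travel: immutable snapshots kept per commit, resolved by timestamp."""
    def __init__(self):
        self._versions = []   # list of (ts, snapshot) pairs, ts ascending

    def commit(self, ts, snapshot):
        self._versions.append((ts, dict(snapshot)))

    def as_of(self, ts):
        """Return the latest snapshot committed at or before ts."""
        i = bisect.bisect_right([t for t, _ in self._versions], ts) - 1
        if i < 0:
            raise KeyError("no snapshot at or before %r" % ts)
        return self._versions[i][1]

t = VersionedTable()
t.commit(100, {"balance": 50})
t.commit(200, {"balance": 75})
print(t.as_of(150))  # {'balance': 50} -- the table as it looked at ts=150
print(t.as_of(250))  # {'balance': 75}
```

Because old snapshots are never rewritten, they can live on the cheapest storage tier, which is exactly why low-cost passive storage makes this feature economical.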
And heavy data-processing loads can be handled by scaling up temporarily on the cloud provider's capacity (at a cost to the user, of course).
Economy. Beyond economies of scale and efficiency, cloud providers' accounting mechanisms tend to expose storage and processing cost data down to the individual query level. This allows the user to make a rational business decision about the cost-benefit ratio of a given piece of analysis, and to make optimization decisions accordingly. Indeed, the business may sometimes decide that it is cheaper to use the scale of the cloud to be bigger and "simplistic" in how an analysis is structured, rather than spend the time, money, and mental energy to sculpt a tighter analysis (one that is cheaper and perhaps more precise).
The weaknesses of the cloud
The infrastructure is managed for you. The cloud has a very different set of fault domains than, say, a Z-series mainframe. Distributed computing in the cloud, running on a shared substrate (compute, storage, networking), is subject to many more disruptions, and each of them can cause an interactivity failure or a temporary failure of work in flight. Even automated management by a cloud provider can, on rare occasions, negatively impact the customer experience by changing a system's properties or behavior.
Efficiency. Most cloud databases are still immature compared to traditional on-premises systems, and they lack the functionality of more mature products. Some features may never be introduced, because a fully distributed, fault-prone platform makes them impractical to implement.
Many cloud-based parallel relational systems have drastically reduced efficiency for a specific database mutation (DELETE), which can be problematic in some use cases.
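The DELETE inefficiency follows directly from the immutable-micropartition design described earlier. A toy sketch of the write amplification (the data layout and function here are illustrative, not any product's internals): removing a single row forces a rewrite of the entire partition object that contains it.

```python
# Toy table stored as immutable micropartitions, keyed by object name.
partitions = {
    "part-0000": [{"id": 1}, {"id": 2}, {"id": 3}],
    "part-0001": [{"id": 4}, {"id": 5}],
}

def delete_row(partitions, row_id):
    """Delete one row; return how many rows had to be rewritten to do it."""
    rewritten = 0
    for name, rows in list(partitions.items()):
        if any(r["id"] == row_id for r in rows):
            # The whole object must be rewritten to drop a single row.
            partitions[name] = [r for r in rows if r["id"] != row_id]
            rewritten += len(rows)
    return rewritten

print(delete_row(partitions, 2))  # 3 -- one deleted row cost a 3-row partition rewrite
```

With production micropartitions holding thousands to millions of rows, that ratio is far worse, which is why frequent single-row deletes are a poor fit for these systems.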
Of course, the additional latency between the cloud and on-premises systems, or systems hosted in other clouds, tends to force consolidation onto one cloud infrastructure. Users are effectively pushed to choose a geographic location and provider first, and are then largely limited to that provider's services.
Economy. The cost of the cloud follows a very different curve from on-premises deployment: it is very easy to expand capacity, so easy, in fact, that controlling costs becomes the harder problem. On the other hand, if costs are capped, interactive jobs submitted after the cap is reached may be rejected. This adds a layer of complexity that traditional DBAs will need to learn in order to run a successful deployment.
And, of course, vendor lock-in is just as prevalent in the cloud as anywhere else. Migrating between clouds is no easier than migrating between on-premises systems.
There are many offerings to choose from, and no single offering has all the features. The most important first step is to identify the fundamental properties and behaviors required by all of your workflows, and to confirm that the chosen cloud provider can deliver them all, even if each behavior comes from a different product in its suite, at least loosely integrated. Don't expect a single product like Oracle or Teradata that does "it all" for the price.
Shevek is CTO of CompilerWorks.
The New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected]