Storage in the face of the trillion-row database apocalypse – Blocks and Files
A new era of large-scale analytics and storage is upon us. Ocient, Imply, VAST Data and WEKA are four startups positioned to store and query datasets of hundreds of petabytes, or trillions of database rows, in seconds. They all use massively parallel access techniques in one way or another, and rely primarily on software rather than specialized hardware to achieve their performance levels.
Update. The SingleStore database can scan a billion rows per second using 448 Xeon Platinum 8180 Skylake cores, 28 per server, as this blog post describes. The data was stored on SSDs, although that wasn't really significant: the queries were run hot, so the data was already cached in memory. The network was 10 Gigabit Ethernet.
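A quick back-of-the-envelope check of those figures (my arithmetic, not from the blog post): a billion rows per second spread over 448 cores works out to roughly 2.2 million rows per second per core, a plausible rate for data already cached in memory.

```python
# Sanity-check the SingleStore benchmark figures cited above.
rows_per_second = 1_000_000_000   # billion-row/sec scan rate
cores = 448                       # Xeon Platinum 8180 Skylake cores
cores_per_server = 28

servers = cores // cores_per_server
per_core = rows_per_second / cores
per_server = rows_per_second / servers

print(f"{servers} servers")
print(f"{per_core:,.0f} rows/s per core")
print(f"{per_server:,.0f} rows/s per server")
```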
The need for such high-speed access to structured and unstructured data is not yet widespread. It is focused on a few markets, such as financial trading (where VAST and WEKA are doing well), online media display advertising technology (an Ocient objective), high performance computing (WEKA again) and training on AI/ML models.
The AI/ML driver
VAST co-founder Jeff Denworth believes the use of AI/ML technology will spread to the general enterprise market. Most companies will need to sift through their internal production logs and external customer interaction data to find patterns, analyze root causes, and make decisions to optimize internal and external operations. Each such decision may save or earn only pennies per transaction, but at scale that adds up to significant amounts of money.
ML models are used to aid healthcare device diagnostics, investment trading decisions, factory production operations, logistics delivery routes, product recommendations, process improvements, and staff efficiency. The complexity of ML models roughly doubles every year, according to Denworth, and the general rule is that the larger the model, the better the training and the subsequent inferences.
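To put Denworth's doubling claim in perspective (my illustration, not his): annual doubling is exponential growth, so model complexity would rise three orders of magnitude within a decade.

```python
# Illustrate the compounding effect of model complexity doubling yearly.
def complexity_after(years, base=1):
    """Relative model complexity after `years` of annual doubling."""
    return base * 2 ** years

print(complexity_after(5))   # 32x after five years
print(complexity_after(10))  # 1024x after a decade
```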
Pure Storage is jumping into this wider dataset market, and high-end array vendor Infinidat could argue it is already there.
All of these companies intend to respond to this shift from petabytes to exabytes. They see this affecting on-premises environments as well as those in the public cloud. VAST is an on-premises business, but will be cloud-connected – if not cloud-present – in some way in the future. Ocient is both on-premises and in the cloud, just like WEKA. Imply is pure software so it can run in the cloud, whereas Infinidat is an on-premises business.
Their acceptance and adoption of exabyte-level scale sets them apart from traditional storage vendors, which Denworth says will need to overcome significant architectural drawbacks if they are to compete.
Ocient Hyperscale Data Warehouse
Ocient has just launched its Hyperscale Data Warehouse product. This is a v19.0 release – earlier versions have been deployed successfully at large scale over the past year with a select group of enterprise customers. Ocient says the product is designed to deliver unparalleled price-performance for rapid, complex, and continuous analysis of massive structured and semi-structured datasets. Customers can run previously unfeasible workloads in interactive time, returning results in seconds or minutes instead of hours or days.
According to Ocient, the software features Compute Adjacent Storage Architecture (CASA), which places storage next to compute on industry-standard NVMe SSDs. This provides hundreds of millions of random read IOPS and enables massively parallelized processing for concurrent data loading, transformation, storage, and analysis of complex data types. The entire data path has been optimized for such performance.
For example, it has a high-speed custom NVMe SSD interface that issues highly parallel reads at high queue depths to saturate the drive hardware. There is also a massively parallel, lock-free SQL cost optimizer, which ensures that each query plan executes as efficiently as possible within its service class, without degrading the performance of other workloads or users.
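The high-queue-depth idea can be sketched in miniature. The following is an illustrative toy, not Ocient's code: it keeps many positioned reads in flight at once against a scratch file, approximating with a thread pool what a custom NVMe interface does with hardware submission queues.

```python
# Toy sketch of keeping many reads in flight to saturate a drive.
# A scratch file stands in for a table segment on an NVMe SSD;
# QUEUE_DEPTH is a hypothetical in-flight read target.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096        # typical NVMe read granularity
QUEUE_DEPTH = 64    # hypothetical number of concurrent reads
NUM_BLOCKS = 256

# Create a scratch file to read back in parallel.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(BLOCK * NUM_BLOCKS))
tmp.close()

fd = os.open(tmp.name, os.O_RDONLY)

def read_block(i):
    # os.pread reads at an explicit offset, so many threads can share
    # one file descriptor without seeking -- that is what keeps the
    # device's queue full.
    return os.pread(fd, BLOCK, i * BLOCK)

with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
    blocks = list(pool.map(read_block, range(NUM_BLOCKS)))

os.close(fd)
os.unlink(tmp.name)
print(sum(len(b) for b in blocks), "bytes read")
```

A real implementation would bypass the thread pool entirely and submit asynchronous I/O directly (e.g. via io_uring or a userspace NVMe driver), but the principle – many outstanding reads per drive – is the same.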
The Ocient Hyperscale data warehouse is generally available as a fully managed service hosted in OcientCloud, on-premises in the customer’s data center and on Google Cloud Marketplace.
VAST Data has a major software launch coming up. Denworth says that what VAST did for its hardware architecture – stateless controllers and single-tier QLC flash storage – it will now do for its software.
Incumbents will need to respond to match what the newcomers offer. Switching to all-flash and a single storage tier is not enough: they also have to change their software. That could mean software technology that takes years to develop from scratch, so we could see incumbents buying this technology rather than building it. We could even see processor chip developers, such as Nvidia, buying their way into keeping their GPUs fed with the data they need to crunch AI/ML training models.
Unless Dell EMC, IBM, HPE, NetApp, Qumulo, and object storage vendors can demonstrate that they can operate at the same scale, performance, resiliency, and cost as these new entrants, they may have to fight harder for the structured/unstructured dataset area of hundreds of petabytes and billions of rows – at least if what Imply, Ocient, VAST and WEKA see coming is correct.