Meta’s Velox means database performance is not subject to interpretation

A decade and a half ago, when Dennard scaling ran out of gas and many of us started thinking about what the end of Moore’s Law might look like if that day ever came, a group of us wondered what it might mean. People talked about 3D stacking, optical computing and interconnects, and all kinds of cool ideas. And then came the joke:

“It looks like we’ll all have to go back to coding in assembly,” and everyone had a good laugh. And then we added, “Well, at the very least, it looks like we’ll all have to take all this open source stuff written in interpreted languages like Java and PHP and rewrite it all in C++ and compile it very close to the hardware.”

No one laughed as hard at that one. And the reason is twofold. First, it would be very difficult to do, and it would mean giving up some programmability and portability. Second, it is a trick that works precisely once, much like boosting server utilization by adding a server virtualization hypervisor to a machine. There is no iterating on that improvement.

That day, at least according to Meta Platforms and its Facebook division, has come for database and data store runtimes, and so we see the launch of the Velox open source project, with Intel, Ahana, and ByteDance all joining the effort with enthusiasm.

Velox has been in development at Facebook for several years and was quietly open sourced last year, when Ahana, ByteDance, and Intel signed up to help the cause. ByteDance is famous for being the owner of TikTok, which is at the very least much more fun than Twitter or Facebook and has infrastructure requirements that rival those of the two social media platforms. Ahana was launched two years ago to commercialize Facebook’s Presto database federation overlay, and if there is a new runtime coming to Presto, it has to be involved. Not surprisingly, over the past year Ahana has been one of the main contributors to the Velox project. And Intel has to be in the middle of everything, which is why it is there.

At the Very Large Data Bases 2022 conference next week, Meta Platforms will formally present Velox to the world, open the project up for outside input, and generally make a lot of noise about what it does. The company is publishing the Velox paper and making the official launch today, ahead of the Labor Day holiday in the United States and public holidays in various European countries, to give us all something to think about while we eat, drink, and make merry.

Velox is first and foremost a fast runtime, which meant writing it in C++ rather than Java or another high-level interpreted language of the kind used for many of the open source data analytics stacks created by the hyperscalers and cloud builders. Google’s MapReduce and its Hadoop clone, for example, were largely written in Java, with bits of C++ here and there to speed things up. A number of data analytics startups have taken ideas implemented in Java and recast them in C++, so this idea of getting closer to the iron is nothing new. What is new is that Meta Platforms wants to create a unified runtime engine that underpins the Presto federated database, the Spark in-memory database, and even the PyTorch machine learning platform that does much of the heavy lifting for Facebook, Instagram, and its other Internet properties.

So in a sense, with Velox, the write-once-run-anywhere promise of Java is preserved, but in a slightly different and more computationally efficient way. An execution engine that runs on the distributed worker nodes of all of these analytics platforms (and that is indeed the goal here, which is an admirable one) provides a measure of portability along with the performance of compiled C++ code.

Moving from Java to C++ means that the databases and data stores using Velox will do their jobs faster, somewhere between 2X and 10X faster. And if that is the case, hyperscalers and cloud builders can get the same query work done 2X to 10X faster, or with 2X to 10X fewer servers, or somewhere in between using a mix of more speed and fewer servers. (More on that in a moment.)

As far as Meta Platforms can tell, there is a lot of reinventing of the wheel in the execution engines that sit between the database and data store front ends, with their distributed SQL query plans and optimizers, and the worker nodes that actually carry out the queries. As the Velox paper puts it:

“This evolution has created a siloed data ecosystem made up of dozens of specialized engines that are built using different frameworks and libraries, share little or nothing with each other, are written in different languages, and are maintained by different engineering teams. Additionally, scaling and optimizing these engines as hardware and use cases evolve is prohibitively expensive if done on a per-engine basis. For example, extending each engine to better take advantage of new hardware advancements, such as cache-coherent accelerators and NVRAM, to support features such as Tensor data types for ML workloads, and to leverage future innovations from the research community is impractical and invariably leads to engines with disparate sets of optimizations and features. Fragmentation ultimately impacts the productivity of data users, who typically must interact with multiple different engines to complete a particular task. The data types, functions, and aggregates available vary from system to system, and the behavior of these functions, null handling, and casting can be very inconsistent between engines.”

So Meta Platforms has had enough. Just as RocksDB has become a unifying storage engine for various databases, and just as Presto has become a unifying SQL query layer that runs across disparate and incompatible data stores and databases (and makes it possible to query the data in place in those data stores and databases), Velox is going to be a unifying execution layer for the workers in its databases and data stores. In fact, Meta Platforms not only uses the Velox engine in Presto, Spark, and PyTorch, but has also deployed it in its XStream stream processing service, its F3 feature engineering framework, its FBETL data ingestion tool, its XSQL distributed transaction processor, its Scribe message bus infrastructure, its Saber system that serves a high number of external requests per second, and several other data systems. What Meta Platforms wants now is help tuning Velox so it is ready for the next hardware platform when it arrives, and the database and data store vendors that commercialize open source software stacks created by Facebook or others, or that have proprietary systems of their own, all want the same thing and will presumably be willing to help.

In other words, if Velox works as intended, the execution engine of data analytics platforms will no longer be a differentiating feature for anyone, but rather a common feature for all, just as the InnoDB and RocksDB storage engines and the PostgreSQL front end have become standard parts of many relational databases.

The Velox-based runtime created for Presto is called Prestissimo, and the one for Spark is called Spruce. By adding Velox to the XStream streaming service, the PrestoSQL functions package used with Presto can now be exposed to XStream users, so they do not have to learn a new domain-specific query language to run a query against streaming data.
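To make that benefit concrete, here is a minimal sketch of the idea of a shared function registry, the pattern that lets one library of scalar functions serve several engines with identical semantics. All names here are hypothetical; this is not the actual Velox API, just an illustration of the concept:

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical shared function registry: scalar functions are registered
// once and can then be invoked by any engine (batch SQL, streaming, ML)
// that links against the same runtime library.
using ScalarFn = std::function<double(double)>;

class FunctionRegistry {
 public:
  void registerFunction(const std::string& name, ScalarFn fn) {
    functions_[name] = std::move(fn);
  }
  const ScalarFn& lookup(const std::string& name) const {
    return functions_.at(name);
  }

 private:
  std::map<std::string, ScalarFn> functions_;
};

int main() {
  FunctionRegistry registry;
  // Registered once, by the shared runtime...
  registry.registerFunction("square", [](double x) { return x * x; });

  // ...and then usable from a batch engine and a streaming engine alike,
  // with the same behavior in both places.
  const ScalarFn& fn = registry.lookup("square");
  std::vector<double> batch = {1.0, 2.0, 3.0};
  for (double v : batch) {
    std::cout << fn(v) << "\n";
  }
  return 0;
}
```

The design point is that the function package lives below the engines, so exposing it to a new engine like XStream is a matter of wiring, not reimplementation.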

In many cases, particularly with in-memory or streaming data, the Velox execution engine brings another key benefit, called vectorization, which Steven Mih, co-founder and chief executive officer of Ahana, explained to The Next Platform.

“Velox is designed to have state-of-the-art vectorization, and that’s a big deal,” says Mih, reminding us that Databricks has a native C++ vectorized runtime engine for Spark, called Photon, which it talked about earlier this year. “Vectors allow developers to represent data in a columnar format in main memory, and then you can use them for input and output as well as for computation. Apache Arrow does some of that, but it is more for data transfer. This is for the runtime itself. If you want to do a hash or a comparison, for example, these vectors are really good at that. And as the compute engines become more and more parallel, they can take advantage of that hardware as well, increasing the speed even more.”
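As a rough illustration of why columnar vectors help, consider a filter-and-aggregate over a column stored as a contiguous array. This is not Velox code, just a minimal sketch of the pattern Mih describes: a tight loop over columnar data that a compiler can turn into wide SIMD instructions, rather than chasing pointers through row objects:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Columnar layout: each column is a contiguous array, so a predicate plus
// an aggregate compiles down to a tight loop the compiler can auto-vectorize.
// (Illustrative only; a real engine adds nulls, batching, and dictionaries.)
double sumWhereBelow(const std::vector<double>& price,
                     const std::vector<double>& quantity,
                     double maxQuantity) {
  double total = 0.0;
  for (size_t i = 0; i < price.size(); ++i) {
    // The comparison yields 0.0 or 1.0 instead of a branch, which keeps
    // the loop friendly to SIMD code generation.
    total += price[i] * (quantity[i] < maxQuantity ? 1.0 : 0.0);
  }
  return total;
}

int main() {
  std::vector<double> price = {10.0, 20.0, 30.0, 40.0};
  std::vector<double> quantity = {5.0, 30.0, 10.0, 50.0};
  // Sums the prices of rows with quantity < 24: 10.0 + 30.0 = 40.0.
  std::cout << sumWhereBelow(price, quantity, 24.0) << "\n";
  return 0;
}
```

With a row-oriented layout, the same query would touch scattered heap objects, and the loop could not be vectorized nearly as easily.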

On that note, we asked Pedro Pedreira of Meta Platforms, a software engineer at the social network who is one of the paper’s authors and one of the creators of Velox, if this runtime could offload work to GPUs. “Not today, but it’s something in our plans for the future – offload execution to GPUs and potentially other hardware accelerators,” Pedreira replied via email. With so many beefy vector engines being added to server CPUs these days, the lack of GPU offloading may not matter much at the moment.

What matters is that Velox stops the reinvention of the wheel, and because it is written in C++ and runs much faster than a Java-based runtime, it is economically feasible to justify tearing apart a data analytics stack to get Velox inside of it.

To prove the point, Meta Platforms ran the TPC-H data warehousing benchmark on a pair of its Presto databases, one using the Presto Java engine and the other using the Prestissimo Velox C++ engine. Take a look:

The TPC-H benchmark was run on a pair of 80-node clusters with the same processors (which were not disclosed), 64GB of main memory, and two 2TB flash drives per node. It was a 3TB TPC-H dataset, which is not particularly large but is representative of a great number of datasets in the enterprise. Meta Platforms selected two CPU-bound queries (Q1 and Q6) and two shuffle-heavy queries with big I/O requirements (Q13 and Q19) to test the systems. The speedup was highest on the CPU-bound queries, where the coordinator feeding work to the execution engine became the new bottleneck limiting performance. For the shuffle-heavy queries, there are metadata, timing, and message size issues that need further optimization to improve performance.

In addition to running the TPC-H work, Meta Platforms took a real set of analytical queries running in its own shop and replayed them through two separate clusters, one using the Presto Java runtime and the other using the Prestissimo Velox C++ runtime. The chart below is a histogram showing the number of queries along the Y axis and their speedup factor along the X axis:

The average speedup is between 6X and 7X. There were no queries where Presto Java was faster (which would be the 0 column), and there were only a few where performance was the same (the 1 column) between Presto Java and Prestissimo Velox. About a fifth of the queries showed an order of magnitude better performance using the Velox runtime.

Maria H. Underwood