Rodrigo Aramburu built his first version of a GPU-accelerated SQL engine while working on a project for the Peruvian government.
His older brother Felipe and a few other people had started their own consulting company. They were working on a fraud-detection algorithm for Peru’s public pension authority, sort of like the Social Security Administration in the United States.
It involved 40 years of historical data in paper format — 15 million participants submitting biweekly payroll stubs for their entire history. This was being verified by a team of 100 analysts, who were digitizing those records into a new system.
“In Peruvian politics, every time a new administration comes in, everything the previous administration did was wiped. There were literally 14 different systems,” said Aramburu, adding, “It was a bit of a nightmare.”
They would hit bottlenecks just by trying to join tables together across all these different legacy systems because there was no way to uniquely identify a participant.
“If I were looking for me, and I lived in seven of those systems, I would have to correlate all sorts of different data points to find out if I’m actually the same exact person or not,” he said.
So this giant join was taking around 35 hours to process. It was a little over a terabyte of data.
“We had never worked with that type of scale, so we were not necessarily using the best-in-class tools at the time. So we tried playing around with GPUs (graphics processing units) for doing the join itself. That led us to how this query that was taking 35 hours — on GPUs, it was taking about 30 seconds,” he said. “That was when we said, ‘Holy crap! I think we just landed on something really cool.’”
After a year, their project was accepted into an accelerator program and the Aramburu brothers decided to go full time on it.
In 2017, they began working aggressively with NVIDIA and came to the realization that they wanted to create an ecosystem for GPU-accelerated data science, which is now called RAPIDS.ai. It’s a suite of software libraries and APIs for performing data science and analytics pipelines entirely on GPUs.
The company has since grown to 15 people, with headquarters in San Francisco and a development hub in Lima, Peru.
BlazingSQL was recently released as an open source project under the Apache 2.0 license.
Querying Raw Data
Originally a full database management system called BlazingDB, the product has been renamed BlazingSQL and now focuses solely on being a GPU-accelerated SQL engine.
It’s built on Apache Arrow, which specifies a standardized, language-independent columnar memory format for flat and hierarchical data. Arrow is a collaboration across 13 other open source projects, including Drill, Cassandra, Hadoop and Spark.
It’s basically a standard for how columnar data can live in memory. By adhering to this standard, BlazingSQL can receive anyone else’s columnar data in a very performant fashion.
“Pre-Apache Arrow, if Blazing came out with results and I wanted to hand it off to something else, that thing would have to read my results row by row to interpret it and parse it out and turn into something useable for that tool,” Aramburu said.
“With Apache Arrow, because I say this is an Apache Arrow data frame, it says I know what that is. It’s already in a format I understand. I don’t need to read it line by line. … Now I can go on and move with it.”
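To make the difference between a row-by-row handoff and a columnar handoff concrete, here is a minimal sketch in plain Python. It does not use Apache Arrow itself: the standard library's `array` and `memoryview` stand in for Arrow's typed buffers, and the field names and values are invented for illustration.

```python
from array import array

# Row-oriented handoff: the consumer has to walk and parse every record.
# (Hypothetical sample data, not taken from the project described above.)
rows = [
    {"participant_id": 1, "salary": 1200.0},
    {"participant_id": 2, "salary": 950.5},
    {"participant_id": 3, "salary": 1830.25},
]


def rows_to_columns(records):
    """Convert records one by one into contiguous, typed columns."""
    ids = array("q")       # 64-bit signed integers
    salaries = array("d")  # 64-bit floats
    for record in records:
        ids.append(record["participant_id"])
        salaries.append(record["salary"])
    return {"participant_id": ids, "salary": salaries}


# Columnar handoff: the producer passes typed buffers plus a tiny schema.
columns = rows_to_columns(rows)
schema = {name: col.typecode for name, col in columns.items()}

# A consumer that understands the layout can view the same memory
# without copying or re-parsing it.
salary_view = memoryview(columns["salary"])


def total_salary(table):
    """Aggregate a single column; no per-row parsing is needed."""
    return sum(table["salary"])
```

The conversion loop in `rows_to_columns` is the cost Aramburu describes tools paying before a shared format existed. Once both sides agree on the layout, the consumer only needs the schema and the buffers.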
BlazingSQL can query raw data from a data lake, with no schema required, loading the data directly into GPU memory using the GPU DataFrame (GDF) format.
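One way to picture "no schema required" is type inference at load time. The sketch below is an assumption-laden illustration in plain Python running on the CPU, not BlazingSQL's actual loader: it reads a raw CSV string, guesses each column's type, and stores the data column by column. The sample data and the helper names `infer_type` and `load_columns` are hypothetical.

```python
import csv
import io

# Hypothetical raw file contents; no schema is declared anywhere.
RAW_CSV = """participant_id,employer,amount
1,Acme,1200.50
2,Globex,950
3,Acme,1830.25
"""


def infer_type(values):
    """Pick the narrowest type that parses every value: int, float, else str."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def load_columns(text):
    """Read CSV text into a dict of typed columns keyed by header name."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    raw_columns = list(zip(*reader))  # transpose rows into columns
    table = {}
    for name, values in zip(header, raw_columns):
        caster = infer_type(values)
        table[name] = [caster(value) for value in values]
    return table


table = load_columns(RAW_CSV)
schema = {name: type(col[0]).__name__ for name, col in table.items()}

# Roughly: SELECT SUM(amount) FROM table WHERE employer = 'Acme'
acme_total = sum(
    amount
    for employer, amount in zip(table["employer"], table["amount"])
    if employer == "Acme"
)
```

A real engine would do this work in parallel and place the resulting columns in GPU memory. The point of the sketch is only that the schema is derived from the data rather than declared up front.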
BlazingSQL’s core is the GPU DataFrame (GDF) memory model and the data-processing functions in the C++ API of cuDF. cuDF is the RAPIDS library focused on data preparation, containing functions familiar to users of Pandas, NumPy or SQL. It integrates with Dask, which uses Python APIs and data structures to make it easy to switch from NumPy, Pandas and Scikit-learn to their Dask-powered equivalents.
“Create this data frame and the rest of the ecosystem can pick it up and do whatever it wants,” he said. “We can run SQL queries on top of these data frames in GPU memory and it makes them run incredibly fast.”
The company touts speeds more than 20x faster than Apache Spark in a comparison run on Google Cloud Platform using NVIDIA’s T4 GPUs.
“By leveraging Apache Arrow on GPUs and integrating with Dask, BlazingSQL will extend open source functionality, and drive the next wave of interoperability in the accelerated data science ecosystem,” Josh Patterson, general manager of data science at NVIDIA, said at the open source announcement.