From the very beginning of the big data era, large data volumes have been a source of fear, uncertainty, and doubt for those tasked with analytics: in order to work with big data, the argument goes, you must first make it small. Given the primitive processing tools and programming models of the time, it seemed obvious that no one could practically work with “high-definition” data.
For many important algorithms, such as recommendations, predictions, and behavior modeling, the need for a single source of truth has been supplanted by the highest-probability answer derived from the available datasets. Given the business upside of using machine learning to enable new revenue streams or larger sales pipelines, we’ve been willing to wait hours, days, or even weeks for these models to run.
But the rise of machine learning has not blunted our desire to work interactively with data — to probe and experiment with live, high-definition data. Not only is this now possible, it is also practical with the intelligent, systematic use of precomputation techniques developed as part of the open source project called Apache Kylin.
Apache Kylin is an open source project that implements the precomputation pattern in big data environments. The company behind Apache Kylin, Kyligence, was founded by the team that created the Kylin project. Kyligence provides the commercial version of Kylin, which can be deployed either in the cloud or on-premises. Some of the world’s largest financial institutions, e-commerce sites, telecom vendors, and consumer packaged goods companies have been using Kylin technology to solve their most challenging analytical problems.
In this article, we are going to take a look at the familiar notion of precomputation as a means of increasing analytics performance and achieving sub-second response times for queries on extremely large (hundreds of terabytes to petabytes) datasets in the cloud.
For Analytics in the Cloud, How Fast Is Fast Enough?
“My query is slow! It’s unusable.”
This is probably the most common complaint we hear from business users about their analytics experience in the cloud. While everybody would love to conduct their analysis at the speed of thought, the reality is that we spend too much time waiting for dashboards to refresh. Naturally, we start looking for the next shiny BI tool, hoping it will run faster than the one we are using. However, the performance challenge is likely systemic, and not easily remedied by switching to another BI client.
There are two contributing factors to the delay on a dashboard: the time to fetch and prepare the data, and the time to render and update the UI, as illustrated in Figure 1. End users often fail to realize that the majority of the delay, commonly referred to as query latency, actually comes from the first step. For complex queries and large datasets, query latency can range from tens of seconds to tens of minutes. In a situation like this, it doesn’t matter how fast your BI tool can draw the diagrams; the user experience is going to be slow.
Ideally, to truly support analytics at the speed of thought, we need the overall delay — the time from when a user clicks the mouse to the dashboard being refreshed — to be around one second, and no more than three seconds. That means our query latency should be less than one second for the majority of use cases, and one to two seconds for the most complex scenarios.
To feel that we are truly getting accurate analytics on high-definition data, we need this level of latency on petabyte-scale datasets. This has to apply to the most complex analysis scenarios where tens of attributes are analyzed by thousands of concurrent users. This is the type of performance modern analytics systems require of the backend data service layer.
Compute on the Fly vs. Precomputation
Data warehousing was created to maximize query performance by organizing data in a way that makes it easy to summarize for reporting and dashboarding. Data is grouped by subject area, organized in star schemas, and distributed based on key filter or grouping conditions. As a result, summarizing data in a data warehouse is much faster than doing it in a transactional database.
In the big data world, columnar storage arranges data by columns instead of rows. Since most analytical queries summarize data in a handful of columns, columnar storage dramatically improves their performance. New-generation query engines such as Presto, Apache Impala, and Kinetica are designed to process data in parallel across large numbers of servers. These new software technologies, together with advancements in hardware such as GPUs, faster networks, and faster disks, have dramatically improved query performance.
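To see why the columnar layout matters, here is a tiny, self-contained sketch with made-up data, contrasting a row layout with a columnar one for a simple SUM:

```python
# Row layout: every whole row must be touched just to read one field.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 45.0},
]
total_row_layout = sum(r["amount"] for r in rows)

# Columnar layout: the "amount" column is one contiguous array, so a SUM
# scans only the bytes it actually needs (and the column compresses well).
columns = {
    "order_id": [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [120.0, 80.0, 45.0],
}
total_columnar = sum(columns["amount"])

assert total_row_layout == total_columnar == 245.0
```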
In the last several years we’ve also seen the emergence of cloud data warehouses, with Snowflake being the best-known example. Storing data in a cloud data warehouse has become as simple as a couple of mouse clicks, and users don’t need to worry about many of the administration tasks that usually require a seasoned DBA to perform. The cloud data warehouse greatly reduces the effort needed to organize and analyze data in the cloud.
But all of these technologies run analytical processing after the data is read from the data warehouse or the file system. You want to know how many products were sold in Q3? Let the query engine read those records, calculate the total amount, and send the result back. How about all of 2019? Well, we will just read every record from 2019 and give you the total.
This is obviously a simplified scenario. Data warehouses have many optimization methods (e.g., bitmap indexes, SQL optimization) to improve query performance, but the principle holds, as illustrated in Figure 2: scan, compute, serve the result, rinse, and repeat.
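To make the compute-on-the-fly pattern concrete, here is a toy illustration using Python’s built-in sqlite3 module. The table and column names are invented for the example; the point is that every question triggers a fresh scan of the underlying records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2019-07-01", "widget", 10.0),
     ("2019-08-15", "widget", 25.0),
     ("2019-11-30", "gadget", 40.0)],
)

# "How many products were sold in Q3?" -> scan matching rows, aggregate, return.
q3_total, = conn.execute(
    "SELECT SUM(amount) FROM sales "
    "WHERE sale_date BETWEEN '2019-07-01' AND '2019-09-30'"
).fetchone()

# "How about all of 2019?" -> scan *all* 2019 rows again, from scratch.
year_total, = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE sale_date LIKE '2019-%'"
).fetchone()

print(q3_total, year_total)  # 35.0 75.0
```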
Move Computation to an Earlier Stage in the Process
Instead of reading the raw data and computing the summaries for every single query, what if we computed the summaries and persisted the results before the queries were even issued? That way, by the time users ask these questions, they get the answers instantaneously. This is the concept of precomputation (see Figure 3).
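Continuing the toy example, a minimal sketch of the precomputation pattern looks like this; the names are illustrative, not Kylin’s actual API:

```python
from collections import defaultdict

raw_rows = [  # (quarter, product, amount)
    ("2019-Q3", "widget", 10.0),
    ("2019-Q3", "widget", 25.0),
    ("2019-Q4", "gadget", 40.0),
]

# Precompute step: runs once, before any query is issued.
totals_by_quarter = defaultdict(float)
for quarter, _product, amount in raw_rows:
    totals_by_quarter[quarter] += amount

# Query step: runs per user click, a constant-time lookup instead of a scan.
print(totals_by_quarter["2019-Q3"])  # 35.0
```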
As many reading this article will know, precomputation is not a new concept in computing. People have been using various data indexing strategies and materialized views in RDBMS for years. And there are numerous examples in operating systems and microprocessors as well. For a bit more background on the subject, you can refer to a short blog I wrote on the Evolution of Precomputation.
Precomputation, a Paradigm Shift for Modern Analytics
In Apache Kylin environments, aggregations and other key measures are calculated upfront, or precomputed. The results of the precomputations are persisted as indexes stored in the storage layer. We can also define logical hierarchies for certain attributes such as time (day, week, month, quarter, year, etc.), location (city, state, country, etc.), product (product family, product category, product line, etc.), and so on to enable drill-down and roll-ups. Querying and fetching the results becomes a simple lookup operation, which is typically hundreds of times faster than brute force aggregation and calculation.
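The following sketch illustrates the idea with plain Python dictionaries standing in for the persisted indexes; the hierarchy helpers and data are assumptions made for the illustration, not Kylin internals:

```python
from collections import defaultdict

facts = [  # (day, city, amount)
    ("2019-07-01", "Paris", 10.0),
    ("2019-07-02", "Paris", 25.0),
    ("2019-10-05", "Lyon", 40.0),
]

# Two levels of a time hierarchy: day -> month -> quarter.
def month(day): return day[:7]                                   # "2019-07"
def quarter(day): return f"{day[:4]}-Q{(int(day[5:7]) - 1) // 3 + 1}"

# Precompute one aggregation index per (hierarchy level, city) combination.
by_month_city = defaultdict(float)
by_quarter_city = defaultdict(float)
for day, city, amount in facts:
    by_month_city[(month(day), city)] += amount
    by_quarter_city[(quarter(day), city)] += amount

# Roll-ups and drill-downs are now lookups against the right index.
print(by_quarter_city[("2019-Q3", "Paris")])  # 35.0 (roll-up to quarter)
print(by_month_city[("2019-07", "Paris")])    # 35.0 (drill-down to month)
```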
The benefits of this architecture are dramatic. First and foremost, the query latency doesn’t increase with growing data volumes (to be fair, the query latency will be slightly longer when working with a large index, but this increase is negligible compared to the typical MPP query engines’ performance).
The latency curves are illustrated in Figure 4.
Another benefit of the Apache Kylin precomputation layer is the ability to support a much larger number of concurrent users. Since most traditional query engines must compute data and assemble result sets in memory at run time, supporting hundreds or thousands of concurrent users not only produces unacceptable latencies, it also requires an enormous amount of memory. With Kylin’s precomputation architecture, each query is very lightweight, usually just a simple lookup, so we can easily scale out to handle thousands of concurrent users without sacrificing performance.
A further benefit of precomputation is significant cost savings on cloud infrastructure. Cloud providers, and especially cloud data warehouse vendors, charge for every CPU cycle consumed, which means users pay for every aggregation being calculated. So if someone writes a bad query, or if many users issue queries at once, you are going to see a spike in cloud costs at the end of the month. With precomputation, on the other hand, the cost of each query is minimal.
This architecture also works extremely well with a cloud data lake. Raw data can be ingested into the data lake using tools and services like Apache Hudi and Azure Data Factory. To run the precomputation jobs, we can start compute resources on demand, finish the jobs, and shut those nodes down. Depending on query demand, we can also dynamically add or remove query nodes to serve the queries.
Knowing What to Precompute
The first question that probably comes to mind is: how do I know which results should be precomputed? One method is to manually analyze your SQL or MDX queries to understand which datasets and results are used most often, and therefore which aggregate indexes should be created.
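As a rough sketch of that manual approach, the snippet below tallies which GROUP BY column sets appear most often in a query log; the log contents and the regex are assumptions made for illustration:

```python
import re
from collections import Counter

query_log = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product",
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
]

group_by_sets = Counter()
for sql in query_log:
    m = re.search(r"GROUP BY\s+(.+)", sql, re.IGNORECASE)
    if m:
        # Normalize the dimension list so column order doesn't matter.
        dims = tuple(sorted(col.strip() for col in m.group(1).split(",")))
        group_by_sets[dims] += 1

# The most frequent dimension combinations are the best precompute candidates.
print(group_by_sets.most_common())
# [(('region',), 2), (('product', 'region'), 1)]
```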
A more forward-looking approach treats this question as a clear opportunity to apply AI to your precompute strategy. The key is to leverage machine learning algorithms to predict the questions your analysts will ask. By learning from users’ query histories, analysis behaviors, metadata, and data access patterns, we can apply AI algorithms to predict which questions users will ask and, therefore, which will benefit from having precomputed answers. As with any good machine learning model, the more queries we receive, the more data we process, and the more users we serve, the more accurate these predictions become.
By knowing which questions users will ask, the software can automatically pick the best strategy to run the appropriate queries, get the correct answers, and store them in a distributed file system. Many people will probably ask: why a file system, and why not memory? The reason is that file systems such as cloud data lakes give us practically unlimited storage capacity at a very reasonable cost, while still maintaining excellent data access performance. Thanks to today’s distributed compute and storage systems, we can now process trillions of rows of data in a short period of time at low cost. That is the essence of achieving interactive analytics on high-definition data.
Summary
One of the challenges facing information workers is quickly operationalizing and analyzing the huge amounts of data available to them. Our initial impulse was to make the data smaller, but that comes at a cost in time, effort, and operational expense. With the precomputation architecture of Apache Kylin, analytics data pipelines are simplified, and users can interact with much larger datasets to get the clear, high-definition picture of their business that we have been waiting a long time to see.