The next version of Apache Spark will expand on the data processing platform’s real-time data analysis capabilities, offering users the ability to perform interactive queries against live data.
The new feature, called structured streaming, will “push Spark beyond streaming to a new class of application[s] that do other things in real time [rather than] just analyze a stream and output another stream,” explained Matei Zaharia, Spark founder and Databricks chief technology officer, at Spark Summit East, taking place this week in New York. “It’s a combination of streaming and interactive that isn’t really handled by current streaming engines.”
The feature came about in response to users who were clamoring for more sophisticated ways to process streaming data. Until now, Spark has offered the ability to create a stream of analytic data derived from the original stream of live data. What Spark users wanted, Zaharia explained, was the ability to combine live streaming with other types of data analysis.
In particular, they were looking to apply batch jobs and interactive querying to live data. Today, to get these capabilities, users must set up a separate database or data warehouse and feed it through an ETL (Extract, Transform and Load) process, a complex setup.
“With structured streaming, [there is] no need to put it in other weird storage and worry about consistency,” he said.
Structured streaming could provide a real boost in Web analytics, for instance. A user could run interactive queries against a Web visitor’s current session.
Structured streaming could also be used to apply machine learning algorithms to live data. The algorithms could be trained on old data, and then be redirected to incorporate, and even learn from, new data as it arrives in memory.
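The train-then-keep-learning idea can be illustrated with a deliberately tiny sketch. This is plain Python, not Spark or any Spark machine learning library: the "model" is only a running mean, and the class name `RunningMeanModel` and its methods are hypothetical names chosen for illustration.

```python
class RunningMeanModel:
    """Toy 'model' (hypothetical): tracks the mean of all values seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update: no need to keep earlier values around.
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def fit(self, values):
        # Works the same for a historical batch or a newly arrived batch.
        for value in values:
            self.update(value)
        return self


# Train on old (historical) data first.
model = RunningMeanModel().fit([1.0, 2.0, 3.0])
mean_after_history = model.mean

# Later, fold in a value that has just arrived on the live stream.
model.update(6.0)
mean_after_live = model.mean
```

The same update path serves both the historical batch and the live values, which is the property the paragraph above describes; a real learning algorithm would replace the running mean with its own incremental update rule.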
Structured streaming is accessible through a “declarative API,” Zaharia explained. Users can aggregate a stream of data, then expose the result through a JDBC (Java Database Connectivity) plug-in, so that queries always run against the data in its latest state. Queries can be changed at runtime.
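As a rough mental model of "aggregate a stream, then query its latest state," consider the following plain-Python sketch. It does not use Spark or JDBC, and the class `RunningCounts` and its method names are hypothetical; it only shows how a continuously updated aggregate can answer the same query differently as new records arrive.

```python
from collections import Counter


class RunningCounts:
    """Hypothetical stand-in for a continuously updated aggregate table."""

    def __init__(self):
        self._counts = Counter()

    def ingest(self, records):
        # Fold a newly arrived batch of records into the running aggregate.
        self._counts.update(records)

    def query(self, key):
        # Answer from whatever state the aggregate is in right now.
        return self._counts[key]

    def top(self, n):
        # Ad hoc query: the n most frequent keys so far.
        return self._counts.most_common(n)


table = RunningCounts()

table.ingest(["/home", "/pricing", "/home"])
first_answer = table.query("/home")

# More page views arrive; the same query now reflects them.
table.ingest(["/home", "/docs"])
second_answer = table.query("/home")
most_viewed = table.top(1)
```

Here the caller never re-runs a batch job or copies data elsewhere: each query reads the aggregate as it stands at that moment, which is the behavior the article attributes to the feature.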
Structured streaming is built on the Spark SQL engine and leverages many of the features of the core Spark Streaming functionality, including the ability to reassemble time-oriented data that arrived out of order and to create data windows of preset lengths. To boost performance, it takes advantage of work done by Project Tungsten, which looked for ways to optimize CPU and memory usage for handling structured data.
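To make the windowing idea concrete, here is a minimal plain-Python sketch, not Spark code. It assumes each event carries its own timestamp in seconds and that windows are fixed, non-overlapping 60-second intervals; the function names and the sample events are invented for illustration. Because each event is assigned to a window by its own timestamp rather than by arrival order, late events still land in the correct window.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed preset window length


def window_start(event_time, width=WINDOW_SECONDS):
    # Start of the fixed window that contains this event's timestamp.
    return event_time - (event_time % width)


def count_by_window(events, width=WINDOW_SECONDS):
    # Group events by the window their own timestamp falls in,
    # regardless of the order in which they arrived.
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time, width)] += 1
    return dict(sorted(counts.items()))


# (timestamp_seconds, payload) pairs in arrival order; the events
# stamped 20 and 65 arrive after later-stamped events.
events = [(5, "a"), (70, "b"), (130, "c"), (20, "d"), (65, "e")]
window_counts = count_by_window(events)
```

A production engine also has to decide how long to wait for stragglers before finalizing a window; this sketch sidesteps that by processing a finite list.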
“We don’t just want to do streaming, we want to do continuous applications, end-to-end applications that read a stream and serves queries off of it,” Zaharia said.
Version 2.0 of Apache Spark, which should be available around April or May, will include many of the structured streaming capabilities; additional libraries and operators will be added in future releases.
Apache Spark project manager Reynold Xin will explain structured streaming in greater detail Thursday at the Summit, in a presentation that will be live-streamed.