The Apache Flink open source stream processing software is going through a major overhaul, starting with next week’s release of version 1.4. This iteration includes major restructuring of the dependency structures and adds reverse class loading. And much of the work that went into this version of Flink prepares users for the upcoming version 1.5, which will aim to unify stream processing and batch jobs under a single API.
“We’re constantly working on [Flink] to make stream processing more powerful. We started out as a datastream API with an event streaming window model,” said Stephan Ewen, Chief Technology Officer of dataArtisans, which offers Flink training and commercial support. “In the last releases, we extended [Flink] with Stream-SQL and made it more low level with an API where you can get more fine-grained control over state over time.”
Ewen said that many Flink users are coming from the batch processing world, and use Flink to handle both batch and stream processing jobs. The capabilities of Flink, he said, make it amenable to treating batch jobs just like streaming work. Ewen said that Flink is becoming a key part of many enterprise data processing applications.
“We also see users that used [Flink] build stand-alone applications, and they’re starting to build these apps on stream processor, and their business logic on a stream processor. They need a more low-level interface to say, ‘Give me an event, I have logic to apply to it,’” said Ewen.
One of the big advantages of Apache Flink, said Ewen, is its ability to offer incremental checkpoints to keep state in the stream processor. In a world where the going wisdom is that state should be abstracted from scalable cloud-based applications, Flink takes the opposite view. Ewen feels the industry is in a bit of a state-of-state denial, so to speak.
“Maybe it works on some parts of the cloud. You’re pretending to be in a stateless world and saying state is handled by someone else, like cloud services. In the end,” said Ewen, “It’s still state. If you make it someone else’s problem, you have to worry about how they handle data, keeping it consistent. Applying an update just once is hard. Having the state in the stream is a very natural answer to that. If you have the right tools, that’s very possible. That’s very powerful and an important thing we’ve always pursued in Flink.”
The approach Flink uses is to deploy distributed snapshots, or checkpoints, which offers a complete decoupling of processing while maintaining fault tolerance, said Ewen. Ewen said this approach makes it, “Easy to integrate to add fault tolerance to data structures that don’t lend themselves very easily to logging changes. They might come from a legacy library. If you can say, ‘serialize me the state of that checkpoint,’ you can make this thing fault tolerant.”
Plan Now for the Future
Ewen said that version 1.4, and version 1.5 scheduled for next year, are twin releases. There are a bunch of big changes coming up to Flink, driven by two trends. One is the diversification of the uses that Flink and stream processing, in general, are facing, Ewen elaborated. The other one is the unification of batch and stream processing.
Ewen said the team at dataArtisans and Apache decided to spread this release across two versions so as not to overload the userbase with a massive change all at once. Version 1.4 “has a bunch of very important fixes, other tooling, and features in there. We didn’t want to push these bigger changes in there yet because bigger changes are hard to adopt.”
In preparation for 1.5, version 1.4 of Flink includes a lot of work in Stream-SQL, a query language that extends SQL with the ability to process real-time data streams. This release of Flink adds streaming joins, a connector API, integration with later versions of Kafka and their transactional exactly-once capabilities.
“We’ve done a big rework of the dependency structure,” said Ewen. This includes a major reworking of how, Flink does class loading to class resolution, to solve dependency version conflicts that users were seeing.
“Many systems, actually even when they unify the API, they still run two different engines underneath for batch and streaming. We think there is a neat potential in unifying these deeper” — Stephan Ewen
This fix means that teams can use one version of, say, the Akka distributed messaging platform in their core Flink installation, while using another version on top of it for application construction. This reworking has also extended to included libraries for Flink. Previously, Flink included a number of Hadoop libraries, for example, but these have now been removed as dependencies, and made entirely optional.
For the next release, version 1.5, the Apache Flink team has big plans. First is a project to rewrite the distributed process. Currently, Flink offers master, client, and worker processes. These will be supplemented with new types, such as the dispatcher process. These changes will help Flink run better in a Kubernetes-managed environment.
“The thing that asked for the biggest adoption in fundamental principles is container infrastructure, like Kubernetes. They introduce a few interesting things, like overlay networks, and different ways of coordinating, but they also require the framework to adopt that way of thinking,” said Ewen.
Previously, Flink had been able to use YARN or Mesos to schedule its work and manage its workers, but the move to Kubernetes is necessitating a major shift for the platform. “We didn’t believe bending YARN was the right thing to do, as this was kind of a different use case,” said Ewen. “From the Kubernetes side, this came from the side of users building initially stateless and later stateful application containers. The amount of resources an application gets is not controlled by the application framework, it’s controlled by Kubernetes. Kubernetes says, ‘I give you five containers, make it work with five containers,’ as opposed to before where Flink would say, ‘Give me eight containers from Mesos.’”
Another area of work for Flink 1.5 is around the unification of batch and streaming, said Ewen. “Think of it on two levels. For programming abstraction, we have done a lot of work there: the current data stream API handles both finite and infinite programs. Flink has a dedicated batch API still, which has some special constructs for running loopy programs which don’t translate as naturally to streaming. In the streaming API, we are working on more and more unifying APIs.”
“The next step is more of a unification into the runtime,” said Ewen. “Many systems, actually even when they unify the API, they still run two different engines underneath for batch and streaming. We think there is a neat potential in unifying these deeper. If you’re in catch-up mode in streaming, you can exploit the same advantages of batch processing. In order to do these things, you do need a deeper integration. These systems traditionally are different, and Flink has always tried to keep them close, but we need to unify them deeper.”
Feature image by Ludovic Fremondiere via Unsplash.