Confluent sponsored this post.

Ben Stopford
Ben is Lead Technologist in the Office of the CTO at Confluent, where he has worked on a wide range of projects, from implementing the latest version of Kafka’s replication protocol to developing strategies for streaming applications. He has more than two decades of experience in the field and has focused on distributed data infrastructure for most of that time. He is also the author of the book “Designing Event-Driven Systems,” O’Reilly, 2018. Find out more at benstopford.com.

Databases are something we take for granted. One of the most successful software tools ever written, they are ubiquitous and well understood beyond the technology industry. They have evolved over the years with data warehousing, NoSQL, big data, etc. Despite all these changes, databases still look and feel broadly similar to the databases of 50 years ago: static repositories of data to which you direct questions and expect answers to return. Their most recent evolution, however, is quite different.

Think for a moment about what it’s like to use a database. You send a question to some dataset held on a remote server. The dataset is large, far too large for your mind to reason about unaided. The database provides a near-instantaneous answer to your question — a wonderfully powerful tool for comprehending the incomprehensible.

But this design has become less and less applicable to the modern digital world, because it is a design that focuses on human-computer interaction. That is to say, it’s designed to help you. To answer your questions. To put data on your screen. To display the contents of your shopping basket or the balance of your checking account. To aid the crafting of software that makes it easier for you to do your job. The database is built to help people better manage and understand data.

Yet, today’s modern digital world isn’t about aiding human-computer interaction — it’s about solving whole business problems in software and taking humans off the critical path. This type of wholly software-defined process requires a type of database that isn’t simply an end point, somewhere where dataflows stop. Today, the movement of data plays as big a role in most business processes as accumulating it does, and that is where event streaming comes in.

Ride-sharing businesses like Uber or Lyft are a good example. When you order a ride, the whole process is automated. Streams of events track your location and the location of nearby drivers. Business events trigger many separate pieces of software that talk to many other pieces of software to hail, route, and pay for your ride. These are a far cry from the database-backed systems that help traditional taxi operators take and record phone calls, manage pickups, etc.

The trend toward event streaming also sees banks automating work typically performed by credit officers and supermarkets automating work typically performed by checkout operators, as well as enabling a plethora of up-and-coming use cases, from real-time analytics to self-driving cars.

Today’s most successful businesses don’t use software simply to make people better at their jobs; they use software to automate what the business does. This changes the business itself, but it also changes the architecture of the software that supports it. More importantly, going back to where we started, it’s the reason that databases themselves must also change.

How Event Streaming Relates

While popular stream processing technologies have tables and SQL, just like a database does, they have a different purpose. These technologies move data between different pieces of software and allow that data to be manipulated as the data flows. They use the same joins, filters, summarizations, etc., that are traditionally applied to database tables, but apply them to data-in-motion. But should technologies like these be a substitute for traditional databases?

If you’ve ever used a stream processor like Apache Flink or Kafka Streams, or the streaming elements of Spark or ksqlDB, you’re quite unlikely to think so. These technologies don’t feel much like traditional databases at all. If you run a query, you will find that an answer does not come back. Queries don’t return when done. They run continuously, and the output isn’t a response; it’s another event stream.

In fact, the shift to event streaming can be quite hard to understand, because it’s so un-database-like in nature. In a database, the data sits there passively. You have to send a query to it. If you’re off making a cup of tea, the database literally does nothing, springing into life only when you come back and ask it a question.

With event streaming technologies, it’s the other way around. The data is active. Nobody is clicking buttons and expecting things to happen. Instead, the appearance of new data is what triggers queries to be run. Like a chain of dominoes, one piece of software triggers the next, and together the chain carries out a whole business process.

So in event streaming, the trigger is not you. The trigger is the data. The trigger is the arrival of an event. The data doesn’t sit passively, as it does in a traditional database; it’s always on the run.
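
The contrast between passive and active data can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than the API of any real stream processor: the class names (PassiveTable, EventStream), the ride-sharing events, and the matching rule are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class PassiveTable:
    """Database style: rows sit still until someone sends a query."""

    def __init__(self) -> None:
        self.rows: List[Event] = []

    def insert(self, row: Event) -> None:
        self.rows.append(row)  # a write triggers nothing else

    def query(self, predicate: Callable[[Event], bool]) -> List[Event]:
        return [row for row in self.rows if predicate(row)]


class EventStream:
    """Streaming style: every new event runs the registered continuous queries."""

    def __init__(self) -> None:
        self.handlers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self.handlers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self.handlers:
            handler(event)


# Passive: nothing happens until the caller asks.
table = PassiveTable()
table.insert({"rider": "alice", "city": "London"})
london_rows = table.query(lambda row: row["city"] == "London")

# Active: the arrival of an event is the trigger, and the output of one
# continuous query is another stream that triggers the next piece of software.
ride_requests = EventStream()
matched_rides = EventStream()
receipts: List[Event] = []


def match_driver(event: Event) -> None:
    # Hypothetical rule: only London requests can be matched to a driver.
    if event["city"] == "London":
        matched_rides.publish({**event, "driver": "driver-1"})


def bill_rider(event: Event) -> None:
    receipts.append({"rider": event["rider"], "driver": event["driver"]})


ride_requests.subscribe(match_driver)
matched_rides.subscribe(bill_rider)

ride_requests.publish({"rider": "alice", "city": "London"})
ride_requests.publish({"rider": "bob", "city": "Paris"})

print(receipts)  # [{'rider': 'alice', 'driver': 'driver-1'}]
```

Nobody calls bill_rider directly. It runs because an upstream event arrived, which is the domino effect described above.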

Databases and Event Streams Converge

While event streaming works well for chains of business operations that depend on one another, it isn’t enough for many end-to-end use cases — particularly ones involving users. Buttons still need to be clicked, orders are still created, and shopping baskets are still displayed on screens in the exact same way that we have used databases for decades. Until recent times, stream processors couldn’t help with this, so implementers would end up bolting technologies together: messaging systems, stream processors, databases, etc.

Two technological shifts address this issue. In the first, stream processors become more database-like in nature, replacing the need for a separate database component to serve the results of streaming computations to users. In the second, databases become more stream-like, emitting data from tables as they are updated.

ksqlDB is an example of the first category: a stream processor that can materialize views that you can query just like a database table. In the image below a stream of payments is transformed into credit scores using a streaming computation and the result of that computation is stored in a credit scores table. The user can then send queries to this table that return specific values, or listen to changes via an event stream. Technologies like this that combine event streaming computations with tables that users can query are called event streaming databases.
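
The payments-to-credit-scores idea can be approximated with a minimal in-memory sketch. It is not ksqlDB and does not use its API: the CreditScoreView class, the starting score of 500, and the scoring rule (plus 10 for an on-time payment, minus 50 for a late one) are assumptions invented for illustration.

```python
from typing import Callable, Dict, List, Tuple


class CreditScoreView:
    """A table kept up to date by a streaming computation over payments."""

    def __init__(self) -> None:
        self.scores: Dict[str, int] = {}
        self.listeners: List[Callable[[str, int], None]] = []

    def on_payment(self, payment: dict) -> None:
        # Streaming computation: runs once per incoming payment event.
        customer = payment["customer"]
        score = self.scores.get(customer, 500)  # hypothetical starting score
        score += 10 if payment["on_time"] else -50  # hypothetical rule
        self.scores[customer] = score
        # Emit the change as an event for anyone listening.
        for listener in self.listeners:
            listener(customer, score)

    def lookup(self, customer: str) -> int:
        # Query the table for a specific value, as with a database table.
        return self.scores[customer]

    def listen(self, listener: Callable[[str, int], None]) -> None:
        # Subscribe to changes as an event stream.
        self.listeners.append(listener)


view = CreditScoreView()
changes: List[Tuple[str, int]] = []
view.listen(lambda customer, score: changes.append((customer, score)))

payments = [
    {"customer": "alice", "on_time": True},
    {"customer": "bob", "on_time": False},
    {"customer": "alice", "on_time": True},
]
for payment in payments:
    view.on_payment(payment)

print(view.lookup("alice"))  # 520
print(changes)  # [('alice', 510), ('bob', 450), ('alice', 520)]
```

The same table answers both styles of request: lookup returns the current value on demand, while listen delivers every change as it happens.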

Active databases approach the problem from the opposite side, allowing you to create triggers and materialized views that respond to changes made to a database table. These have been around for a couple of decades, but the functionality is limited to the confines of the database itself: no streams of events are emitted to the outside world. More recent database technologies — like MongoDB, Couchbase and RethinkDB — have incorporated the notion of tables that emit event streams when records change. Applications can then react to these streams.
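
A table that emits an event stream when records change can be sketched as follows. This is a simplified model under stated assumptions, not the interface of MongoDB, Couchbase, or RethinkDB: the ChangeEmittingTable class, its method names, and the shape of the change event are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

ChangeEvent = Dict[str, object]


class ChangeEmittingTable:
    """A key-value table that notifies watchers whenever a record changes."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.watchers: List[Callable[[ChangeEvent], None]] = []

    def watch(self, callback: Callable[[ChangeEvent], None]) -> None:
        self.watchers.append(callback)

    def upsert(self, key: str, value: int) -> None:
        old: Optional[int] = self.rows.get(key)
        self.rows[key] = value
        if old != value:  # emit only when the record actually changed
            event: ChangeEvent = {"key": key, "old": old, "new": value}
            for callback in self.watchers:
                callback(event)

    def get(self, key: str) -> Optional[int]:
        return self.rows.get(key)


# An application reacts to the stream instead of polling the table.
basket_sizes = ChangeEmittingTable()
seen: List[ChangeEvent] = []
basket_sizes.watch(seen.append)

basket_sizes.upsert("alice", 1)
basket_sizes.upsert("alice", 3)
basket_sizes.upsert("alice", 3)  # no change, so no event is emitted

print(basket_sizes.get("alice"))  # 3
print(len(seen))  # 2
```

Here the table remains the primary abstraction and the stream is a by-product of writes, the reverse of the event streaming database, where the stream comes first and the table is derived from it.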

The result is a convergence toward a middle ground that incorporates streams and tables as first-class citizens, and this is in the process of being formalized in an extension to the SQL standard.

While the two approaches — active databases and event streaming databases — may appear to be similar, they are used in different ways. Both build on the same three building blocks: queries, tables, and event streams. Active databases are better at queries over tables, but they can’t query event streams. Event streaming databases can query both event streams and tables, but the table-based queries are less advanced than their active database counterparts.

What this means in practice is that active databases suit simple applications or individual microservices that can benefit from an event-based interface to the data held. By contrast, event streaming databases typically use Apache Kafka to move data around, making them better suited to microservices (plural) or data pipelines. They serve as systems in which event data physically moves from one place to another, or creates chains of business logic tied together by these data flows — quite literally turning the database inside out. These are also the exact types of applications we see taking human taxi operators, credit officers, checkout clerks, etc., off the critical path.

After 50 years, the database is morphing to the needs of a new master. This metamorphosis is not over. In fact, it is likely only beginning. But one thing seems clear: the common answer to the question “What is a database?” is likely to change, as the software world becomes less about software that helps people like you or me do our work and more about the chains of software that fully automate our world.

Feature image via Pixabay.