Streaming infrastructure

Introduction

Streaming infrastructure is table stakes for modern applications. The use cases range from analyzing clickstream data on a website to detecting fraud in real time across large volumes of data.

In this post, I look at the streaming infrastructure landscape and share some thoughts about what’s next.

From batch to streaming

We can think of data produced by applications as having two key dimensions: 1) volume: how much data is produced, and 2) velocity: how fast it is produced.

Both of these dimensions have exploded in the last decade as applications produce massive amounts of data at ever-higher velocities.

Deriving value from a real-time data stream under these conditions is non-trivial. The traditional approach, batch processing, incurs latency on the order of hours or days, which is unacceptable for real-time workflows.

This is where streaming infrastructure comes in.

Double-clicking into streaming architecture

What exactly counts as streaming architecture? Streaming architecture refers to systems that have one or more of these components:

  • Message brokers: a pub/sub message queue that takes data from a source, converts it into a standard format, and continuously delivers it to a destination with very low latency (a minimal sketch follows this list).
  • Stream processors: take the output of a message broker and apply arbitrary transformations. For example, streams from different sources can be aggregated and structured for analysis.
  • Storage layers: where streaming data is persisted, temporarily or permanently.
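
To make the first two components concrete, here is a minimal Python sketch of the pub/sub pattern using the open-source kafka-python client. The broker address, topic name, and event fields are placeholders I've invented for illustration; any Kafka-compatible broker would behave the same way.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: an application publishes events to a topic on the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "page": "/pricing"})  # hypothetical topic and schema
producer.flush()

# Consumer side: a downstream service subscribes and processes events as they arrive.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # e.g. hand off to a stream processor here
```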

What's next for the streaming landscape?

Real-time streaming for everyone

The first time I encountered Kafka as an engineer, I was overwhelmed by its complexity. I simply wanted the benefits of an event stream without being a JVM expert or dealing with the nuances of configs and schema changes. Today, several startups have emerged to fill this gap.

Redpanda abstracts away Kafka’s complexity with a C++ rewrite that removes the need for the JVM and ZooKeeper. Lenses offers a low-code platform for setting up Kafka streams. Upstash takes a serverless approach that eliminates the need to set up and manage Kafka clusters. Finally, Decodable combines a serverless approach with an interface that lets developers interact with streams via SQL.

I expect to see more startups competing on ease of use and developer experience. The net effect is that streaming infrastructure will be accessible to more developers. This is a TAM expansion I am particularly excited about.

Vertical streaming infrastructure

Streaming infrastructure today is a patchwork of cobbled-together brokers, stream processors, and storage layers. A set of streaming use cases lends itself well to being verticalized: this class of tools abstracts away the details of low-level streaming infrastructure and lets developers focus on the complexities of their use case.

One area I am particularly interested in is streaming infrastructure for machine learning. As an engineer, I saw firsthand how challenging it is to build real-time machine learning infrastructure. A classic inference pipeline looks something like this: run inference ahead of time → store results → query for a prediction when needed.

However, incorporating any sort of real-time session data into this pipeline requires architecting both a real-time event stream and an inference pipeline that updates predictions in real time and serves them to a client. This is operationally complex: you’re not just managing an event stream but also a real-time inference schedule and a feed to the client. Companies like ClayPot AI are tackling this by building vertical streaming infrastructure for real-time online prediction and continual learning.
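
To illustrate the gap, here is a toy Python sketch contrasting the two serving styles. It is not ClayPot's architecture or anyone's production design; every function and field name is hypothetical, and the "model" is a stand-in.

```python
from collections import defaultdict

# Batch-style serving: predictions are computed ahead of time (e.g. by a
# nightly job), stored, and merely looked up at request time.
precomputed_scores = {"user_42": 0.31}

def predict_batch(user_id: str) -> float:
    return precomputed_scores.get(user_id, 0.0)

# Streaming-style serving: session events arrive continuously, features are
# updated per event, and the model is re-scored on demand so the prediction
# reflects what the user did seconds ago.
session_features = defaultdict(lambda: {"clicks": 0, "carts": 0})

def on_session_event(user_id: str, event_type: str) -> None:
    if event_type == "click":
        session_features[user_id]["clicks"] += 1
    elif event_type == "add_to_cart":
        session_features[user_id]["carts"] += 1

def score_model(features: dict) -> float:
    # Stand-in for a real model: more engagement -> higher score.
    return min(1.0, 0.1 * features["clicks"] + 0.3 * features["carts"])

def predict_realtime(user_id: str) -> float:
    return score_model(session_features[user_id])

# The operational burden described above comes from running the event stream,
# the feature updates, and the serving path together, reliably and at scale.
on_session_event("user_42", "click")
on_session_event("user_42", "add_to_cart")
print(predict_batch("user_42"), predict_realtime("user_42"))
```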

Stream stores for real-time queries

Where should you store stream data? The answer depends on how you intend to query it. Traditionally, you would write streams to a database and query a specific result with an analytics or BI tool. However, if you're investing in real-time infrastructure, it follows that you care about computing results in real time. Running a query on a stream should therefore not return a static response but one that updates as new data arrives; in some sense, it should return a new event stream.

Traditional transactional databases don't have this capability, and neither do legacy OLAP databases. A new wave of analytical databases like Imply (built on Apache Druid) fills this gap, and companies like Materialize and Rockset address the same need while offering a familiar SQL interface for developers. Finally, companies like DeepHaven provide a query engine for streaming data that lets you combine static and real-time sources.
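
To see why "a query that keeps updating" differs from a one-shot query, here is a toy Python sketch that maintains a running result incrementally as events arrive, which is roughly what a materialized view over a stream gives you. It is purely illustrative: systems like Materialize express this as SQL and handle the consistency, joins, and scale that this sketch ignores.

```python
from collections import Counter

# A "materialized view" in miniature: the query result is updated incrementally
# on every incoming event instead of being recomputed from scratch on each read.
clicks_per_page = Counter()

def on_event(event: dict) -> None:
    clicks_per_page[event["page"]] += 1  # O(1) incremental maintenance

def current_result(top_n: int = 3) -> list:
    # Reading the view is cheap because the work happened at ingest time.
    return clicks_per_page.most_common(top_n)

# Simulated stream: the "answer" to the query changes as new data arrives.
stream = [{"page": "/pricing"}, {"page": "/docs"}, {"page": "/pricing"}]
for event in stream:
    on_event(event)
    print(current_result())
```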

Conclusion

While the core ideas behind streaming infrastructure have been around for decades, developers are still in the early days of the adoption curve. If you're building in this space or skeptical about any of these trends — especially if you are skeptical — I would love to hear from you.