
Continuum: A Pragmatic Solution to High-Cardinality Time Series Analysis

// Apr 20, 2016

I'm excited to share a project I've been working on that solves a problem many engineers at growing startups face: you have massive amounts of streaming data to analyze, but the "right" solutions are prohibitively expensive. So I built something different.


Video Streaming at Scale

I work at a video streaming company (a CDN accelerator) that delivers HLS chunks to viewers across the globe. Every single chunk we deliver gives us an opportunity to measure Quality of Experience (QoE) on the server side. This is incredibly valuable data: metrics like bitrate and buffer events, all segmented by network, device, location, and so on.

But it comes with a challenge: volume.


We need to answer two fundamentally different types of questions:

  1. Traditional time-series queries: What's our average bitrate across all users in the last hour? How many buffer events are occurring in North America right now?

  2. Session-level analysis: What is this specific user's experience during their particular viewing session? Are they experiencing buffering? What's their bitrate progression over time?

The second type of query is what I call "time-key-value" data - high cardinality time series where each unique session ID creates its own timeline of metrics. This was the thing no existing solution solved well.


The Expensive Solution We Can't Afford

Yes, I know Flink can handle this. Kafka and Spark can definitely solve this problem. But here's the thing: we're an early-stage startup. Setting up and maintaining distributed stream processing infrastructure would be massive overkill. We need something that works now, that one person can manage, and that won't eat our entire budget.


Enter Continuum

I've created Continuum as a JVM library that handles both traditional time series and time-key-value data efficiently. The core insight is simple: use the right tool for the job, and don't reinvent what already works.


Check it out:

// open continuum
Continuum continuum = continuum().open()

// create an atom (measurement)
Atom atom = continuum.atom()
                .name('temp')
                .particles(city:'lax', state:'ca', country:'us')
                .value(99.5)
                .build()

continuum.write(atom)

// scan continuum and get a slice of the atoms
Slice slice =
    continuum.slice(
        scan('temp')                            // temperature series
            .function(Function.AVG)             // average temperature
            .particles(country: 'us')           // where country = us
            .group('state', 'city')             // group by state, city
            .end(Interval.valueOf('10d'))       // last 10 days of atoms
            .interval(Interval.valueOf('1d'))   // in 1 day intervals
            .build())

Values values = slice.values()                  // {min,max,count,sum,value}
List groups = slice.slices()                    // {ca:values,lax:values}

continuum.close()

The Foundation: RocksDB

I want to emphasize how incredible RocksDB is. Seriously. It's a log-structured merge-tree (LSM) database that handles massive write throughput beautifully. Facebook built it as a fork of Google's LevelDB, and it's battle-tested at enormous scale. Rather than building my own storage engine, I leverage RocksDB (with LevelDB and BerkeleyDB as alternative backends).

This is a pattern I wish more engineers embraced: reuse existing technology. RocksDB has already solved the hard problems of durable, high-performance key-value storage. I just needed the right abstraction on top of it, and the correct key design for each of the two use cases.
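To make that concrete, here is a minimal sketch of the kind of abstraction I mean (the interface and class names are hypothetical, not Continuum's actual API): the only capabilities this style of library needs from its backend are ordered writes and range scans, which RocksDB, LevelDB, and BerkeleyDB all provide.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: the thin contract a time-series layer needs
// from its storage engine is ordered puts plus range scans.
interface SortedKvStore {
    void put(String key, double value);
    NavigableMap<String, Double> range(String fromKey, String toKey);
}

// In-memory stand-in, handy for tests; in production the same interface
// would wrap an embedded engine such as RocksDB (put + iterator).
final class InMemoryStore implements SortedKvStore {
    private final TreeMap<String, Double> map = new TreeMap<>();

    public void put(String key, double value) {
        map.put(key, value);
    }

    // Half-open range [fromKey, toKey), matching typical LSM iterator semantics.
    public NavigableMap<String, Double> range(String fromKey, String toKey) {
        return map.subMap(fromKey, true, toKey, false);
    }
}
```

Coding against a contract this small is what makes swapping RocksDB for LevelDB or BerkeleyDB a configuration choice rather than a rewrite.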


Two Schema Designs for Two Different Problems

Continuum employs two specialized approaches:


1. Time Series Data

For traditional metrics, there are a small number of unique series but massive data volume per series (think millions to trillions of data points), so the schema is built around scanning one series across time.
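As an illustration of the general LSM time-series pattern (a sketch, not necessarily Continuum's exact byte layout), a row key of the form metric + tags + big-endian timestamp makes every point of one series adjacent and time-sorted, so a time-range query becomes a single sequential scan:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of a time-series row key: [metric][0x00][tags][0x00][big-endian timestamp].
// Because the timestamp is big-endian, lexicographic byte order equals time order,
// so an LSM engine hands back one series' points as a contiguous, sorted run.
final class SeriesKey {
    static byte[] encode(String metric, String tags, long epochMillis) {
        byte[] m = metric.getBytes(StandardCharsets.UTF_8);
        byte[] t = tags.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(m.length + 1 + t.length + 1 + Long.BYTES);
        buf.put(m).put((byte) 0).put(t).put((byte) 0).putLong(epochMillis);
        return buf.array();
    }

    // Unsigned lexicographic comparison, matching how RocksDB orders keys by default.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```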


2. Time-Key-Value Data

For high-cardinality scenarios, there are a huge number of unique keys (one per session) but only a small amount of data per key, so the schema is built around pulling one key's entire timeline in a single read.
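A sketch of that inversion (again illustrative, not Continuum's exact layout): put the high-cardinality key first, so all of one session's atoms share a common prefix and "show me this session" becomes a cheap prefix scan instead of a full-table filter:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a time-key-value layout: [sessionId][separator][timestamp][metric].
// With the session id leading, one viewer's whole timeline shares a key prefix.
final class SessionTimeline {
    private final TreeMap<String, Double> store = new TreeMap<>();

    void write(String sessionId, long ts, String metric, double value) {
        // Zero-padded timestamps keep lexicographic order equal to time order.
        store.put(String.format("%s|%019d|%s", sessionId, ts, metric), value);
    }

    // All atoms for one session, in time order: a prefix scan over
    // [sessionId + "|", sessionId + "}") since '}' sorts just after '|'.
    NavigableMap<String, Double> session(String sessionId) {
        return store.subMap(sessionId + "|", true, sessionId + "}", false);
    }
}
```

In production the same key scheme sits on top of the storage engine's iterator rather than a `TreeMap`, but the access pattern is identical.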

This second pattern is the real innovation here. It's what lets us analyze individual user sessions without building a massive distributed system.


The Architecture


Continuum is designed to be pragmatic and scalable:


Data Tiers:

Scaling Options:

Features:

Getting Started

The library is available via Maven:

<dependency>
    <groupId>continuum</groupId>
    <artifactId>core</artifactId>
    <version>0.+</version>
</dependency>
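If you build with Gradle instead (the project ships a `gradlew` wrapper), the same coordinates can be pulled in with a dynamic version; this assumes your repository resolves the `continuum` group shown above:

```
compile 'continuum:core:0.+'
```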

Or build from source:

make
# or
./gradlew install

Why This Matters

Not every problem needs Kafka. Not every startup needs Flink on day one. Sometimes the best solution is the one you can ship this week, maintain yourself, and that solves 95% of your needs at 5% of the cost.


Continuum represents a philosophy: leverage proven technologies, build focused abstractions, and solve the problem you actually have - not the problem you might have if you were Netflix.


If you're dealing with high-cardinality time series data and don't want to stand up a distributed stream processing cluster, give Continuum a look.


And if you're curious about the pitch that convinced my company to let me do it, check out the presentation - I promise it's... unique.



Continuum is open source and available on GitHub. Contributions welcome!
