Realtime Aggregation of Streaming Data

Problem Gist:

How to aggregate multiple simultaneous streams of data which streams in at different rate and at different speed in realtime?

By aggregating, I mean

  1. Finding the minimum value for a field in a stream of data
  2. Finding the maximum value for a field in a stream of data
  3. Finding the nth largest value for a field in a stream of data
  4. Grouping data together based on a common field
A Naive Solution
Constantly Re aggregating the data as the data arrive. The algorithm complexity would be be very great and the demand for CPU resources would also be great.  For example if you need to find the minimum value of a stream of data.  
You would need to resort the data every time a new piece of data arrive. That would be at least be a complexity of O(nlogn). 
Here are the considerations:
  1. Memory consideration: Simultaneous streams of data would mean huge volume of data flow. For example, with 100 simultaneous streams of data with each stream of data having 50 updates per second
  2. Performance consideration: In short, aggregation has to be fast.
  3. Horizontal Scalability:  With such huge volume of data, it would be better if we could scale horizontally so that we could leverage multiple machines to handle the aggregation of streaming data
  4. Reusability: Clients requesting for the aggregation of same stream of data should be served by the same aggregator. You would want to spawn multiple aggregator for the same data. It would entail wasting of CPU and memory resources.
  5. Extensibility:  We should be able to extend the solution to do all form of aggregation
  6. Easy to configure
  7. Fault Tolerant
Current solution which I know of which can solve the problem
Both of them are quite cumbersome to setup and configure.  A light weight solution is more preferable to me

This was how I solve the problem

All the components communicate through a data bus which in my case I use rabbitmq and Redis Pub Sub for the communication between components. You can use anything here, Netty, ZeroMq 

Upon the receipt of the request for the aggregation of data, 

I started an FSM Akka actor which will listen for data which are meant for the request. This is adapted from the concept of EventSourcing. Actor are very light weight and theoretically you can spawn up to millions of actors on a Single Heap.

Upon arrival of the data, an Event is sent to the FSM actor will then pick up the data and do the aggregation.  Algorithmic complexity is lesser in this case, for example insertion into a sorted list using a Red Black Tree is only O(log n) compared to resorting a whole stream O(nlogn). In addition, an actor is highly customizable so you could theoretically do anyform of aggregation.

A client could then query from the actor at any instant of time and get a snapshot of the aggregation data at the time.

After no receipt of any event of any data or request from the client for x amount of time, the actor can then be configured to shutdown, thereby conserving system memory. You could also arrange to persist the data at this moment of time.

Here is my simplistic FSM Actor:


To allow for horizontal scalability of the system, the request are load balanced between the aggregator, residing on different machine, which are responsible for spawning the actors.  By having multiple aggregator, fault tolerance is also achieved and also load is spread across different machine thereby leveraging on multiple machine resources and also the Actor architecture allows you to easily leverage on multiple cores on a single machine.

Performance:

Aggregation result, sorting and grouping operation, returns in average 1 second for a production system after discounting some heavy serialization stuffs which are domain specific

Limitation:

Each stream of data are independent of each other.  So you can instantiate an actor to manage each stream of data

Inspiration from:

FSM Actor: http://doc.akka.io/docs/akka/snapshot/scala/fsm.html

Event Sourcing: http://www.slideshare.net/debasishg/functional-and-event-driven-another-approach-to-domain-modeling

I am looking forward to Akka Cluster and the possibilities which it might bring.