Realtime Aggregation of Streaming Data
Problem Gist:
How to aggregate multiple simultaneous streams of data which streams in at different rate and at different speed in realtime?
By aggregating, I mean
- Finding the minimum value for a field in a stream of data
- Finding the maximum value for a field in a stream of data
- Finding the nth largest value for a field in a stream of data
- Grouping data together based on a common field
- Memory consideration: Simultaneous streams of data would mean huge volume of data flow. For example, with 100 simultaneous streams of data with each stream of data having 50 updates per second
- Performance consideration: In short, aggregation has to be fast.
- Horizontal Scalability: With such huge volume of data, it would be better if we could scale horizontally so that we could leverage multiple machines to handle the aggregation of streaming data
- Reusability: Clients requesting for the aggregation of same stream of data should be served by the same aggregator. You would want to spawn multiple aggregator for the same data. It would entail wasting of CPU and memory resources.
- Extensibility: We should be able to extend the solution to do all form of aggregation
- Easy to configure
- Fault Tolerant
- Twitter Storm. https://github.com/nathanmarz/storm
- Spark on Apache Meso. http://www.spark-project.org/

This was how I solve the problem
All the components communicate through a data bus which in my case I use rabbitmq and Redis Pub Sub for the communication between components. You can use anything here, Netty, ZeroMq
Upon the receipt of the request for the aggregation of data,
I started an FSM Akka actor which will listen for data which are meant for the request. This is adapted from the concept of EventSourcing. Actor are very light weight and theoretically you can spawn up to millions of actors on a Single Heap.
Upon arrival of the data, an Event is sent to the FSM actor will then pick up the data and do the aggregation. Algorithmic complexity is lesser in this case, for example insertion into a sorted list using a Red Black Tree is only O(log n) compared to resorting a whole stream O(nlogn). In addition, an actor is highly customizable so you could theoretically do anyform of aggregation.
A client could then query from the actor at any instant of time and get a snapshot of the aggregation data at the time.
After no receipt of any event of any data or request from the client for x amount of time, the actor can then be configured to shutdown, thereby conserving system memory. You could also arrange to persist the data at this moment of time.
Here is my simplistic FSM Actor:
To allow for horizontal scalability of the system, the request are load balanced between the aggregator, residing on different machine, which are responsible for spawning the actors. By having multiple aggregator, fault tolerance is also achieved and also load is spread across different machine thereby leveraging on multiple machine resources and also the Actor architecture allows you to easily leverage on multiple cores on a single machine.
Performance:
Aggregation result, sorting and grouping operation, returns in average 1 second for a production system after discounting some heavy serialization stuffs which are domain specific
Limitation:
Each stream of data are independent of each other. So you can instantiate an actor to manage each stream of data
Inspiration from:
FSM Actor: http://doc.akka.io/docs/akka/snapshot/scala/fsm.html
Event Sourcing: http://www.slideshare.net/debasishg/functional-and-event-driven-another-approach-to-domain-modeling
I am looking forward to Akka Cluster and the possibilities which it might bring.