Designing Event Driven Systems for SAAS applications

Posted 17 Oct 2021

Designing Event Driven Systems for SAAS applications

While designing systems for SAAS applications, some common patterns are followed in stateful systems around databases. For compute as well, some aspects of scaling are to be considered. For a SAAS application, what are the aspects to be considered for event-driven architecture that does the job of creating and processing events?

Main aspects to consider:

Noisy neighbour problem. We need to ensure one client does not have too many events or messages to be processed thereby introducing latency for other clients or bringing down the entire system. It is important to ensure other clients are not impacted and the trade-off of client isolation strategies are well understood
Scaling approaches for the provider and consumer of the systems
Ordering guarantees when it comes to the processing of the events and messages
Parallel processing to reduce the overall latency
Batch processing, retries and handling failures
Ability to control backpressure to downstream systems
Managing the state and status of the events
Kind of traffic patterns - Fixed vs Bursty event workloads
Cost considerations and the corresponding trade-offs
Whether the data is maintained in the event payload or can be queried from a different system.

Based on the different use-cases, some or all of the above aspects need to be considered when designing event-driven systems. There are two main eventing services in AWS that we can consider and each one comes with its pros, cons and trade-off.

Kinesis Data Streams and SQS are used for different kinds of situations. It may not be an apple to apple comparison. Kinesis Data Streams has the event and corresponding messages persisted and can be used by future consumers as well. This can help to build any functionality by replaying the events at a later point in time. While SQS is more for ephemeral situations where the messages are available for a max of 1 week, after which if unprocessed, they are expired. Also once the messages are read, it is deleted.

But having considered this crucial aspect, we can try to understand which can be a better option considering the use-cases for SAAS applications and the kind/ volume of the event generated and processing required for these events.

Option 1: Kinesis Data Streams

Identifying the required number of shards - One shard per client vs single shard for all clients. This is a critical decision that has a direct impact on the cost consideration.
Throughput of the shards. In case if we need more throughput, enhanced fan-out can be considered.
Auto Scaling of Kinesis Data streams shards is not very easy. Horizontal scaling is possible by adding more shards and consumers can be made to listen to the new shards. Although this process is not seamless. There is a bit of manual intervention required. An additional shard can incur costs and cannot be a good solution for the bursty workload.
Irrespective of the usage, costs for Kinesis shards are always fixed. It is both an advantage and disadvantage. Advantage because there is a predictable pattern to the cost incurred on a monthly basis. Disadvantage because if there are no messages in the shard, we still end up paying for it. And in case if there is a bursty workload, there is additional latency as the limits are exceeded. We may have to wait for retries.
Ordering of events can be guaranteed here and the kinesis partition key can be handled at the client level. Events for specific clients can always be routed to a specific shard to ensure the order is maintained.
Since the throughput is fixed, the consumers have a limit on how much they connect to the Kinesis Data Streams. This means there is less stress on the downstream for bursty workload situations. Even if the millions of events are triggered, Kinesis Data Streams consumers poll for events and process them at a normal pace thereby ensuring the downstream systems are not receiving extreme activity in a limited period of time.
Kinesis Data Streams stores the event for as long as we require. This means the same event can be processed by different consumers, again and again, using playback. If the use-case requires this as the main requirement, Kinesis may be a better option.
In the case of batch processing, if one of the messages in a batch is failed, the entire batch will get re-processed again and again. If there is one ‘bad apple’ event/message that leads to undesirable long processing that is killing resources, it is going to impact other messages as well. To avoid this issue, it is better to maintain the status of each event processing and ensure the max retries are set so that it does not overwhelm the system by processing the same event again and again in a limitless loop.

2. SQS

Unlike Kinesis Data Streams, SQS does not store messages for a long time. It is ephemeral. So the use cases for SQS are slightly different from Kinesis Data Streams. Also, multiple consumers cannot use the same queue, unlike Kinesis Data Streams.
A dedicated queue per client vs a single queue for all clients is a key decision to be taken depending on the use case. Having a single queue can help to better manage the scaling, while a dedicated queue can help to address the noisy neighbour problem better, even though the scaling has to be considered on a per-client basis.
SQS has the advantage of pay as you go, model. As many queues can be created and there are no charges to it. Only when the messages are placed into the queue and read from the queue, we will be charged. This is the main advantage compared to Kinesis Data Streams when it comes to bursty workload situations.
Scaling of SQS can be done based on different types of triggers. For example, one option would be based on the number of messages pending to the processed, the consumers can be scaled accordingly. This helps to ensure the queues and in turn, the messages are drained quickly. And SQS has some integrations with Lambda to handle the Lambda invocations upon receiving new events even better. This follows the push mechanism instead of the general pull polling mechanism.
One disadvantage of the scaling options is that they can overwhelm the downstream system. If there are too many messages in the queue, scaling can kick in and concurrency can overwhelm the downstream systems. To avoid this, it is better to have a configurable limit for concurrency so that the downstream systems are not impacted during bursty workload situations.
To ensure there are ordering guarantees, we can consider between Standard and FIFO. Although FIFO has relatively lesser throughput when compared to Standard SQS queue. For a multi-tenancy situation, group id can be used in FIFO. This will ensure the ordering is maintained. And even for parallel processing of events within the same client, this can be helpful as well.
The concept of visibility time out helps to ensure the same message is not processed again and again. Also, SQS has a redrive policy inbuilt into the service. If a message cannot be processed for some reason, after max configurable retries, the messages can be pushed to a dead letter queue and the rest of the messages can be processed. This ensures that even if one of the messages is a bad apple, it does not overwhelm the other messages and services by getting processed repeatedly.

I hope this gives more clarity on when to go for Kinesis Data Streams vs SQS - pros and cons for each service and the use-cases that will help to determine the right option when it comes to designing event-driven systems for SAAS applications.

References

« Learnings from Running Amazon Aurora MySQL in Production Resiliency and Discovery Recovery in AWS »

Master Of None

Designing Event Driven Systems for SAAS applications