A Year With Redis Event Sourcing

In the very first post on this blog in December 2017 I wrote about how we at Learning.com are using Redis to implement an event-driven microservices architecture using the Event Sourcing pattern. At the time, the architecture, patterns, and implementation with Redis were relatively new to us. Over the past year we’ve had great success with event-driven services and event sourcing, but we’ve also had to face and overcome some challenges. In this post I am going to discuss how the architecture is working out for us a year on, some of the lessons we’ve learned in that time, and how we’ve overcome some of the challenges we’ve faced.

Event Sourcing is not always the right persistence pattern

One of the first lessons we learned is that event sourcing is not always the right persistence pattern for all our data. Event Sourcing can be thought of as both a way of communicating between services and a way of storing data. First, communication is done in an asynchronous, event-driven fashion using some sort of messaging system, in our case Redis. Second, data about each event and the effect it has on the state of an aggregate is stored in a persistent data store so that it can be replayed to get to the current state of an aggregate or to create new projections. In short, communication and persistence are intertwined.
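
As a rough sketch of the idea (the types here are made up for illustration and are not the actual Learning.EventStore API), the current state of an aggregate is not stored directly; it is rebuilt by replaying the aggregate’s event stream:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative types only -- not the actual Learning.EventStore API.
public record BalanceChanged(Guid AggregateId, decimal AmountDelta, DateTime OccurredAt);

public class Account
{
    public decimal Balance { get; private set; }

    // Rebuild the aggregate's current state by replaying every stored
    // event for it in chronological order.
    public static Account Replay(IEnumerable<BalanceChanged> events)
    {
        var account = new Account();
        foreach (var e in events.OrderBy(e => e.OccurredAt))
        {
            account.Balance += e.AmountDelta;
        }
        return account;
    }
}
```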

When we started implementing our new microservices architecture using event sourcing, we knew we wanted to make communication between services asynchronous and event-driven in order to maintain performance and allow for greater scalability than an HTTP communication model would have given us. What we quickly realized, however, was that in many cases we really just wanted a messaging system. We weren’t particularly interested in storing the data indefinitely for many of the events we were publishing: many were one-off messages whose data didn’t really describe a stream of changes to an aggregate. In those cases, all we needed was a way to do asynchronous messaging between our services.

In the first versions of our Learning.EventStore library, there was no way to publish a message without also persisting the data contained in that message to the event store. As we started building new services, we noticed that our event store was filling up with message data that wasn’t particularly valuable. Since the data was stored in memory in Redis, this was also quite expensive. As a result, the Learning.EventStore library was modified to separate messaging from event persistence, so that it can also be used as a straight messaging system for communication between services without necessarily persisting all of the event data.
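
Conceptually, the split looks something like the sketch below. The interfaces and names are hypothetical, not the real Learning.EventStore API; the point is simply that publishing a message and persisting an event become separate, opt-in operations:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical interfaces for illustration; the real Learning.EventStore API differs.
public interface IMessagePublisher
{
    // Fire-and-forget messaging over Redis, with no durable event record.
    Task PublishAsync<T>(T message);
}

public interface IEventStore
{
    // Durable storage of an aggregate's event stream.
    Task SaveAsync<T>(Guid aggregateId, T @event);
}

public class ExampleService
{
    private readonly IMessagePublisher _publisher;
    private readonly IEventStore _eventStore;

    public ExampleService(IMessagePublisher publisher, IEventStore eventStore)
    {
        _publisher = publisher;
        _eventStore = eventStore;
    }

    // One-off notification: publish only, nothing is written to the event store.
    public Task NotifyAsync(object message) => _publisher.PublishAsync(message);

    // Genuine domain event: persist it first, then publish it to other services.
    public async Task RecordAsync(Guid aggregateId, object domainEvent)
    {
        await _eventStore.SaveAsync(aggregateId, domainEvent);
        await _publisher.PublishAsync(domainEvent);
    }
}
```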

Event sourcing is very good for data where you are interested in storing changes to an aggregate over time. An example from our domain is the score data that we store when students complete Learning.com lessons. It is extremely important and valuable for us to store how students’ scores change over time, both for reporting and for analyzing the efficacy of our lessons. Another example, from the finance domain, might be bank account information, where it’s important to know how account balances change over time. The lesson learned here is that when using event sourcing it is very important to think about the data being stored and whether the overhead and cost of storing it are worth it. You may very well just need a messaging system, not a full event-sourced model.

So much data and so hard to query

Event sourcing can produce a lot of data in a short amount of time in a production system used by thousands of users. After all, we are storing data about every change to an aggregate, not just the current state as in more traditional systems. If you are using Redis for storage, this can also get quite expensive, since everything is stored in memory. Once our event-sourced services were deployed to production, we started seeing memory usage on our Redis cluster instances grow on the order of a GB or more per week. That was unsustainable, so we needed to take steps to stem the tide.

The first thing we did was to start gzip-compressing the JSON containing the event data before putting it into Redis. This resulted in approximately a 50% reduction in memory usage. The change addressed the issue in the short term, but we realized we still had an unsustainable situation in the long term. While compressing the data reduced memory usage, it also exacerbated the next issue I will talk about: the queryability of the event data in Redis. Since the data was gzipped, it had to be decompressed before it was human-readable, making any kind of querying more complicated.
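
The compression step itself is straightforward; a simplified sketch (not the exact Learning.EventStore code) using the standard GZipStream looks like this:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class EventCompression
{
    // Gzip-compress the serialized event JSON before writing it to Redis.
    public static byte[] Compress(string json)
    {
        var bytes = Encoding.UTF8.GetBytes(json);
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            gzip.Write(bytes, 0, bytes.Length);
        }
        return output.ToArray();
    }

    // Decompress when reading the event back for replay or inspection.
    public static string Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```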

Once we started getting some data into the Redis event store, it also became quite apparent that the data was relatively difficult to query. In the Redis event store implementation in Learning.EventStore, data is stored as JSON strings in a series of Redis hashes. While it is simple enough to retrieve all of the events for a particular aggregate, more complex queries are difficult.
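
Fetching a single aggregate’s events is the easy case. A minimal sketch with StackExchange.Redis might look like the following, where the key naming is an assumption for illustration rather than Learning.EventStore’s actual scheme:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public class RedisEventReader
{
    private readonly IDatabase _db;

    public RedisEventReader(IConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // Easy case: pull every stored event for one aggregate out of its hash.
    // The "events:{id}" key layout is assumed for illustration.
    public async Task<string[]> GetEventsAsync(Guid aggregateId)
    {
        HashEntry[] entries = await _db.HashGetAllAsync($"events:{aggregateId}");
        return entries.Select(e => (string)e.Value).ToArray();
    }

    // Hard case: a question like "which aggregates have a score below 70?" has no
    // direct equivalent -- you would have to scan and deserialize every hash.
}
```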

While something like the ReJSON module for Redis 4 would probably alleviate the queryability problem, we’re still left with the relative expense of storing large amounts of data in memory in Redis. The approach we ultimately took was to move event storage into PostgreSQL on Amazon Aurora. Storage in PostgreSQL is much cheaper, and its JSON data types make complex queries on JSON data a breeze. Redis is still heavily used in our implementation for message queuing, pub/sub, and snapshot storage; we are just no longer storing event data there. As a result, the Learning.EventStore library now supports three event storage providers: Redis, PostgreSQL, and SQL Server.
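
For example, once events live in a jsonb column, a query like the following becomes trivial (the table, column, and event names here are assumed for illustration, not the Learning.EventStore schema), whereas the same question against Redis hashes would mean scanning and deserializing everything:

```csharp
using System;
using Npgsql;

// Hypothetical schema: events(aggregate_id uuid, event_data jsonb).
const string sql = @"
    SELECT aggregate_id, event_data
    FROM events
    WHERE event_data->>'EventType' = 'ScoreRecorded'
      AND (event_data->>'Score')::int < 70;";

await using var conn = new NpgsqlConnection("Host=localhost;Database=eventstore"); // placeholder connection string
await conn.OpenAsync();
await using var cmd = new NpgsqlCommand(sql, conn);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine($"{reader.GetGuid(0)}: {reader.GetString(1)}");
}
```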

Latency and performance

We also ran into some latency and performance issues with Redis that needed to be overcome. As we started scaling up our use of our persistent Redis cluster, we began seeing errors like this from the StackExchange.Redis client library:

```
Timeout performing RPOPLPUSH Learning.ScoringService.Web-Prod:{production:ItemScoreCreated}:PublishedEvents, inst: 1, queue: 3, qu: 0, qs: 3, qc: 0, wr: 0, wq: 0, in: 296, ar: 0
```

All of the troubleshooting steps we went through to diagnose this could be a blog post in themselves. Suffice it to say that we narrowed it down to slow disk I/O on the Amazon EBS volumes we use for storage. The Redis documentation has some excellent troubleshooting steps for latency issues that were very valuable in helping us diagnose this. The Latency Monitor and Latency Doctor features built into Redis can help you narrow down the problem and, in the case of the Latency Doctor, even give suggestions for how to address it. The three things we ended up doing to address this were:

  1. Increasing the syncTimeout setting in the StackExchange.Redis configuration from the default of 1 second to 5 seconds (see the sketch after this list).
  2. Increasing the provisioned IOPS on the EBS volumes.
  3. Setting the no-appendfsync-on-rewrite Redis configuration option to “yes” so that Redis does not fsync the AOF file while a BGSAVE or BGREWRITEAOF is in progress. This keeps the blocking fsync call off the disk during times of high I/O, thereby reducing latency.
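
The client-side change from item 1 is a one-line configuration tweak; a sketch (with a placeholder endpoint) looks like this, and item 3 is the corresponding server-side setting in redis.conf:

```csharp
using StackExchange.Redis;

// Raise the synchronous operation timeout from the (then) default of 1 second to 5 seconds.
var options = new ConfigurationOptions
{
    EndPoints = { "redis.example.internal:6379" }, // placeholder endpoint
    SyncTimeout = 5000                             // milliseconds
};
var redis = ConnectionMultiplexer.Connect(options);

// Item 3 lives in redis.conf rather than in client code:
//   no-appendfsync-on-rewrite yes
// This skips the blocking fsync on the AOF while a BGSAVE or BGREWRITEAOF is running.
```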

These measures have greatly reduced the latency issues we’ve seen with persistent Redis, but have not completely eliminated them. Our next step is to re-architect our persistent Redis cluster so that we only do persistence on the slave nodes. This means the masters are not touching the slow EBS volumes at all and therefore should not have any latency due to slow disk I/O. This model (and some potential pitfalls) is discussed further here.

Conclusion

Over the past year, our event-driven architecture and event sourcing implementation have worked out very well. They have allowed us to build high-performance, scalable services. As with any new system, they came with their share of challenges as well. We have learned how to identify which use cases fit event sourcing well and which call for a more basic messaging approach. We’ve addressed the issue of expensive event data storage in Redis by moving that storage into PostgreSQL, which has also made the data much more queryable and accessible. Finally, we’ve addressed the latency issues with Redis by making some targeted configuration changes and, ultimately, by re-architecting our persistent Redis cluster to use slave-only persistence.

Originally posted on the Learning.com tech blog
