Post-Mortem: Service Degradation on March 14, 2024

Dear Datacake community,

We would like to share some insights into the service degradation we experienced on Thursday, March 14, 2024, which started around noon CET and lasted until roughly 11pm the same day. Our primary objective is to maintain open communication with our users and to share our analysis, our findings, and the steps we're taking to avoid similar interruptions in the future.

What Happened?

While there was no data loss, data queued up and was not processed as quickly as it should have been. The degradation was triggered by a change to the data retention policy of one of our enterprise customers. To accommodate this change, we had to migrate device data to a different TimescaleDB hypertable. Although we had executed similar migrations in the past without any impact on performance, this one led to slower queue processing.

Migration Procedures and Problem Detection

Our standard migration query, shown below, started the data migration around noon.

INSERT INTO {new_timescale_table_name} SELECT * FROM {old_timescale_table_name} WHERE device=%s AND time > %s;
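
For illustration, here is a minimal sketch of how a query like this might be issued from a Python worker using psycopg2 (which matches the %s parameter style above). The function and every name in it are hypothetical rather than our actual migration code:

import psycopg2

def migrate_device(conn, old_table, new_table, device_id, since):
    # Table names are internally generated identifiers and are interpolated
    # directly; the device id and time cutoff are bound as query parameters.
    sql = (
        f"INSERT INTO {new_table} "
        f"SELECT * FROM {old_table} "
        "WHERE device = %s AND time > %s;"
    )
    with conn.cursor() as cur:
        cur.execute(sql, (device_id, since))
    conn.commit()

# Hypothetical usage: one call per device whose data has to be moved.
# conn = psycopg2.connect("dbname=datacake")  # placeholder connection string
# migrate_device(conn, "old_hypertable", "new_hypertable", "device-123", "2024-01-01")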

Approximately thirty minutes later, we noticed that several of our queues were not processing at their usual speed. Investigating the issue revealed increased load on our TimescaleDB instance and a significant increase in average query duration.

In response, we scaled up our Timescale server instance, which is a quick operation on Timescale Cloud, but this did not noticeably improve the metrics.

We traced the root of the issue, the unexpectedly long running time of the migration queries, with the following SQL:

SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND query LIKE 'INSERT INTO {new_timescale_table_name} SELECT %'
ORDER BY query_start DESC;

Further analysis showed that storage IO utilization was peaking at 100%, confirming that storage throughput, not CPU or memory, was the limit on how quickly we could move the data.

At its core, the issue arose because this particular customer operates a large fleet of devices, each generating a substantial amount of data.

Despite our best efforts to optimize and speed up queue processing, it was not until around 11pm CET that the system returned to real-time processing.

Lessons Learned and Future Measures

From this event, we've gleaned important insights that we'll use to prevent similar occurrences in the future:

  • Impact Estimation: Before any migration, we will thoroughly assess its potential impact, taking into account the number of devices involved and the volume of data per device (see the first sketch after this list).
  • Task Decoupling: We'll move migration tasks to a dedicated queue (see the second sketch after this list). This change will serve a dual purpose:
    1. It'll keep other tasks, which previously had to wait for the migration to complete, from slowing down.
    2. It'll let us estimate more accurately the processing capacity that migrations require.
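
To make the impact estimation concrete, the snippet below is a rough sketch of the kind of pre-flight check we have in mind, reusing the psycopg2 setup from the earlier sketch; the function and table names are illustrative only:

def estimate_migration_impact(conn, old_table, device_ids, since):
    """Roughly estimate how many rows a migration would move, per device."""
    counts = {}
    with conn.cursor() as cur:
        for device_id in device_ids:
            # Same parameter binding as the migration query itself.
            cur.execute(
                f"SELECT count(*) FROM {old_table} WHERE device = %s AND time > %s;",
                (device_id, since),
            )
            counts[device_id] = cur.fetchone()[0]
    return counts

Only once the estimated volume looks manageable for the current instance size would the migration be scheduled.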
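
For the task decoupling, the sketch below assumes a Celery-style worker setup purely for illustration; the broker URL, task name, and queue name are hypothetical, and this is not our production configuration. It shows how migration tasks could be routed to their own queue so they never block regular processing:

from celery import Celery

# Hypothetical broker URL, for illustration only.
app = Celery("workers", broker="redis://localhost:6379/0")

# Route migration tasks to a dedicated "migrations" queue so that regular
# ingestion and processing tasks never wait behind a long-running migration.
app.conf.task_routes = {
    "tasks.migrate_device": {"queue": "migrations"},
}

@app.task(name="tasks.migrate_device")
def migrate_device_task(old_table, new_table, device_id, since):
    # The body would reuse the INSERT INTO ... SELECT statement shown earlier.
    ...

Workers for the migrations queue can then be started and sized independently of the default workers (for example with "celery -A workers worker -Q migrations"), which also makes it easier to measure how much processing capacity migrations actually need.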

We deeply appreciate your patience during this time. We will continue to learn and adapt, maintaining transparency along the way. Our commitment to delivering reliable services remains the same.

Thank you for your understanding and support.

The Datacake Team.