System Outage

Postmortem Feb 20, 2024 9:38 AM EST

Incident Summary
A service interruption with the third-party provider of our primary and secondary Redis database clusters resulted in a period of system unavailability from approximately 4:24 pm to 5:10 pm ET on Friday, February 16, 2024. The affected services were migrated to a new provider, and at approximately 5:10 pm user access to the system was restored, with the exception of certain account-level customizations. At approximately 6:20 pm all aspects of the system were restored.

Incident Details
Initial Detection and Response
The disruption was first identified at 4:24 pm when our production application lost connection with the Redis nodes, which are required for managing account flags and overall system caching. The engineering team was immediately notified and began efforts to diagnose and resolve the issue.

Root Cause Analysis
Further investigation indicated that the entire Redis cluster had become inaccessible. Efforts to establish new clusters with the existing service provider were unsuccessful. Given the critical nature of the situation and the inability to resolve it quickly with the existing service provider, we decided to migrate the Redis cluster to our primary AWS environment. This move was the fastest path to resolving the active incident, and it also gives us a more robust and reliable infrastructure going forward by further centralizing our stack within AWS.

Recovery and Restoration
System access and all primary functionality were restored at approximately 5:10 pm with the new AWS Redis cluster, with the exception of certain account-level customizations. At approximately 6:20 pm all aspects of the system were restored.

Impact
The incident resulted in a system-wide outage for approximately 46 minutes from 4:24 pm to 5:10 pm, followed by a period of partially degraded service for approximately 70 minutes from 5:10 pm to 6:20 pm. The incident did not result in any loss or compromise of customer data.
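The durations above follow directly from the reported timestamps. As a quick check, the arithmetic can be sketched in Python; the variable and function names below are illustrative only and are not taken from our systems.

```python
from datetime import datetime

# Timestamps as reported in this postmortem (all on Feb 16, 2024, Eastern Time).
FMT = "%Y-%m-%d %I:%M %p"
outage_start = datetime.strptime("2024-02-16 4:24 PM", FMT)
access_restored = datetime.strptime("2024-02-16 5:10 PM", FMT)
fully_restored = datetime.strptime("2024-02-16 6:20 PM", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes between two timestamps."""
    return int((end - start).total_seconds() // 60)


outage_minutes = minutes_between(outage_start, access_restored)
degraded_minutes = minutes_between(access_restored, fully_restored)
total_minutes = minutes_between(outage_start, fully_restored)

print(f"Full outage: {outage_minutes} min")
print(f"Partially degraded service: {degraded_minutes} min")
print(f"Total time to full restoration: {total_minutes} min")
```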

Corrective and Preventive Measures
In response to this outage, we have taken the following steps to enhance system resilience:
-Migration to AWS ElastiCache: We have streamlined our infrastructure by moving Redis into our existing AWS-based stack. AWS ElastiCache offers better integration with the rest of our stack, along with greater resilience and scalability, than the previous service provider.
-Multi-AZ Configuration and Cross-Region Redundancy: The Redis cluster is now configured across multiple Availability Zones, and we have established cross-region redundancy similar to that of our existing AWS infrastructure, to ensure higher availability and fault tolerance.
-Redis Persistence to S3: Redis backups are now persisted to S3. This gives us quicker and more reliable access to backups and facilitates faster recovery times.
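
For readers curious what a Multi-AZ setup of this kind can look like, the following is a minimal sketch and not our actual configuration. It builds a parameter set that could be passed to the boto3 ElastiCache call create_replication_group. The function name, group ID, node type, replica count, and snapshot retention period are all hypothetical values chosen for illustration. Cross-region redundancy and the export of backups to S3 are separate steps that this sketch does not cover.

```python
def build_replication_group_params(
    group_id: str,
    node_type: str = "cache.t3.medium",  # hypothetical node size
    replicas: int = 1,                   # hypothetical replica count
    snapshot_retention_days: int = 7,    # hypothetical retention period
) -> dict:
    """Build keyword arguments for boto3's ElastiCache create_replication_group.

    This only assembles the parameters; it does not call AWS.
    """
    if replicas < 1:
        # Multi-AZ with automatic failover needs at least one replica
        # to promote if the primary node becomes unavailable.
        raise ValueError("Multi-AZ requires at least one replica")
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Redis for account flags and caching",
        "Engine": "redis",
        "CacheNodeType": node_type,
        "NumCacheClusters": 1 + replicas,   # one primary plus replicas
        "AutomaticFailoverEnabled": True,   # promote a replica on failure
        "MultiAZEnabled": True,             # spread nodes across Availability Zones
        "SnapshotRetentionLimit": snapshot_retention_days,  # keep daily backups
    }


params = build_replication_group_params("example-redis")  # hypothetical ID
print(params)
```

With valid AWS credentials, a dictionary like this could be applied with boto3.client("elasticache").create_replication_group(**params).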

Resolved Feb 16, 2024 6:33 PM EST

The issue is resolved, and all feature customizations have been restored. If you need assistance with your account, please reach out to help@kaymbu.com.

Problem Identified Feb 16, 2024 5:00 PM EST

The system is back online. Some feature customizations have not yet been restored. If you need assistance with your account, please reach out to help@kaymbu.com.

Problem Identified Feb 16, 2024 4:40 PM EST

The system should be back up and running in about 20 minutes. Some features may take longer to be restored.

Problem Identified Feb 16, 2024 4:40 PM EST

Kaymbu is experiencing a system outage. Our team is currently working to resolve the issue.