Increased failure rate in configuration changes of hot-warm deployments

Incident Report for ESS (Public)

Resolved

We have tested extensively over the weekend and monitored our failure rates. The rate of failures has dropped to our usual level and we are now resolving this incident. Thank you for your patience!

Posted Jun 10, 2019 - 15:02 UTC

Monitoring

We have completed the rollout to all production versions and confirmed that hot-warm deployment changes are now succeeding. We will monitor for the next day.

Posted Jun 08, 2019 - 00:28 UTC

Update

We have tested the fix in our staging environment and are working on getting it rolled out to production. We have confirmed that the rate of errors on deployment changes for hot-warm has decreased to normal levels. The production rollout is ongoing, and we will update you when it is completed.

Posted Jun 07, 2019 - 22:26 UTC

Identified

Identified:

This incident is impacting all hot-warm cluster deployment changes in AWS regions. Connectivity to clusters is not impacted.

The preliminary mitigation that we rolled out last evening was a timeout increase, which allowed more hot-warm plans to succeed. This gave us a little bit of breathing room to dig in deep into the code base and spend some time profiling thread contention, and we've had a breakthrough. We discovered that a networking method was writing a state file to disk with every run, rather than when it changed, causing an increased load on disk and additional thread contention. We have merged a fix and have begun the testing and rollout process.

We will update you in six hours with our progress.

Posted Jun 07, 2019 - 16:48 UTC

Update

Based on preliminary findings the team was able to conclude recent configuration change has improved the success rate of new deployment creations. We are still in process of determining the root cause of increased failure rate in plan configuration changes of hot-warm deployments.

We will provide an update within the next 6 hours.

Posted Jun 07, 2019 - 09:50 UTC

Update

We've deployed our interim mitigations which increase the thresholds causing hot-warm plan changes to fail. We're awaiting data on the effectiveness of these changes, but expect these to reduce the failure rates of hot-warm deployment changes.

We'll update this issue within the next 4 hours as data comes to hand.

Posted Jun 07, 2019 - 06:24 UTC

Investigating

We are observing an elevated error rate in Deployment changes for Elasticsearch clusters in a hot-warm setup.

Customers making changes to or creating new hot-warm deployments may be experiencing errors, shown in the Deployment Activity tab.

We have identified that containers holding the Elasticsearch node processes are taking longer than expected to start, exceeding a threshold that leads to marking deployment changes as failed.

A team of cloud engineers are working on an interim mitigation that addresses the effects of this issue, and we expect a decrease in failure rates soon.

A new update will be posted within 4 hours.

Posted Jun 06, 2019 - 19:23 UTC