After digging into the weeds of Linux file systems (IO schedulers, xfsslower tracing, and friends), we suspect that the performance degradation is due to excessive fsyncs. In order to support Cross-Cluster Replication, Elasticsearch retains historical operations on certain indices. The number of historical operations retained in an index is controlled by a new mechanism called retention leases. The leases are maintained by the primary copy of each shard and synchronized to the replicas. With every synchronization, we issue an fsync to the file system to persist the file where the leases are stored. For simplicity, we currently sync the leases every 30 seconds. Sadly, on clusters that have many shards and run on spinning disks (hello, warm nodes!), this produces a large number of fsyncs. These fsyncs appear to cause heavy IO load on the machines and delay the persistence of cluster state updates to disk. The delays can be so large that the new cluster coordination subsystem
deems the nodes unstable and removes them from the cluster. These fsyncs only arise on indices created since 6.5 with a special index setting
; to support future features, that setting is the default for indices created since 7.0. The Elasticsearch team has already opened a pull request to fix this issue, and we are currently working on confirming the fix in our staging Cloud environment. Once we have confirmed that the pull request fixes the issue, we will take the necessary next steps to roll it out to all impacted Cloud users (and other users of Elasticsearch). We will update you again when we have confirmation of the fix (ETA 6 hours).
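For readers who want to check whether one of their own indices carries the setting described above, the index settings API can show it. This is a minimal sketch, not part of the fix: `my-index` is a placeholder name, and it assumes you can reach your cluster (for example via the Kibana Dev Tools console):

```
GET /my-index/_settings?include_defaults=true&filter_path=**.soft_deletes*
```

Indices where `index.soft_deletes.enabled` reports `true` (the default for indices created on 7.0 and later) are the ones that maintain retention leases and therefore issue the periodic fsyncs discussed here.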