AWS Network connectivity issues within ap-northeast-1

Incident Report for ESS (Public)

Resolved

AWS has resolved the incident for the ap-northeast-1 region with the following status update:

"Aug 23, 4:18 AM PDT Beginning at 8:36 PM PDT a small percentage of EC2 servers in a single Availability Zone in the AP-NORTHEAST-1 Region shutdown due to overheating. This resulted in impaired EC2 instances and degraded EBS volume performance for resources in the affected area of the Availability Zone. The overheating was caused by a control system failure that caused multiple, redundant cooling systems to fail in parts of the affected Availability Zone. The chillers were restored at 11:21 PM PDT and temperatures in the affected areas began to return to normal. As temperatures returned to normal, power was restored to the affected instances. By 2:30 AM PDT, the vast majority of instances and volumes had recovered. We have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible. Some of the affected instances may require action from customers and we will be reaching out to those customers with next steps."

We have been closely monitoring the situation for the past couple of hours. Since AWS resolved the incident, we have not observed any host issues in the AWS ap-northeast-1 region.

Posted Aug 23, 2019 - 13:42 UTC

Update

We're continuing to monitor platform health within AWS ap-northeast-1b. We don't currently see any host issues in the region; however, there is still an active AWS incident for the region.

Customers may still be experiencing issues connecting to Elasticsearch, Kibana and APM deployments within ap-northeast-1. Our monitoring does not indicate high rates of connection failures, and retrying failed requests will likely succeed.
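
For customers who want to automate retries, the following is a minimal sketch of retrying with exponential backoff and jitter. It assumes a Python client whose calls raise the built-in ConnectionError or TimeoutError on failure; the function name retry_with_backoff and its parameters are hypothetical and not part of any Elastic client library. Real clients may raise their own exception types and may already offer built-in retry settings, so adapt accordingly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    Hypothetical helper: only ConnectionError and TimeoutError are treated
    as transient; any other exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable only so the helper can be exercised without real waiting; in normal use the default time.sleep applies. Only idempotent requests (for example, searches) should be retried blindly.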

We have no ETA on when this incident will be resolved. We will update the incident in an hour or as the situation changes.

Posted Aug 23, 2019 - 06:00 UTC

Update

We're no longer seeing host failures within AWS ap-northeast-1b; however, AWS still has an active incident within the region.

Customers may still be experiencing issues connecting to deployments, which may result in timeouts or connection errors when accessing Kibana, Elasticsearch or APM instances within ap-northeast-1.

We will continue to monitor the incident and repair any impacted services when possible. We will update this incident within the next 40 minutes or as the situation changes.

Posted Aug 23, 2019 - 05:22 UTC

Monitoring

We're monitoring an AWS incident related to network connectivity within ap-northeast-1. This incident is causing instability within underlying cloud coordination components, and has caused some customer deployment nodes to become unresponsive.

This incident is likely causing connectivity issues with customer deployments. We're currently investigating the impact on cluster connectivity.

We are monitoring host health and actively replacing any impacted instances. However, the AWS APIs are also impacted by the incident, affecting our ability to create replacement capacity.

We have no ETA on when this incident will be resolved; however, we will update this incident within the next hour.

Posted Aug 23, 2019 - 05:04 UTC

This incident affected: AWS Tokyo (ap-northeast-1) (Elasticsearch connectivity: AWS ap-northeast-1, Kibana connectivity: AWS ap-northeast-1, APM connectivity: AWS ap-northeast-1).