All Systems Operational
Cluster Management Operational
Cluster Management Console Service Operational
Cluster Management API Operational
Cluster Orchestration Operational
Cluster Metrics Operational
Cluster Snapshots Operational
AWS Marketplace Operational
GCP us-central1 Operational
Cluster Connectivity: GCP us-central1 Operational
Kibana Connectivity: GCP us-central1 Operational
APM Connectivity: GCP us-central1 Operational
GCP us-west1 Operational
Cluster Connectivity: GCP us-west1 Operational
Kibana Connectivity: GCP us-west1 Operational
APM Connectivity: GCP us-west1 Operational
GCP europe-west1 Operational
Cluster Connectivity: GCP europe-west1 Operational
Kibana Connectivity: GCP europe-west1 Operational
APM Connectivity: GCP europe-west1 Operational
GCP europe-west3 Operational
Cluster Connectivity: GCP europe-west3 Operational
Kibana Connectivity: GCP europe-west3 Operational
APM Connectivity: GCP europe-west3 Operational
Google Cloud Platform Operational
Google Compute Engine Operational
Google Cloud Storage Operational
AWS N. Virginia (us-east-1) Operational
Cluster Connectivity: AWS us-east-1 Operational
AWS EC2 Health: us-east-1 Operational
Snapshot Storage Infrastructure (S3): us-east-1 Operational
Kibana Connectivity: AWS us-east-1 Operational
APM Connectivity: AWS us-east-1 Operational
AWS N. California (us-west-1) Operational
Cluster Connectivity: AWS us-west-1 Operational
AWS EC2 Health: us-west-1 Operational
Snapshot Storage Infrastructure (S3): us-west-1 Operational
Kibana Connectivity: AWS us-west-1 Operational
APM Connectivity: AWS us-west-1 Operational
AWS Ireland (eu-west-1) Operational
Cluster Connectivity: AWS eu-west-1 Operational
AWS EC2 Health: eu-west-1 Operational
Snapshot Storage Infrastructure (S3): eu-west-1 Operational
Kibana Connectivity: AWS eu-west-1 Operational
APM Connectivity: AWS eu-west-1 Operational
AWS Frankfurt (eu-central-1) Operational
Cluster Connectivity: AWS eu-central-1 Operational
AWS EC2 Health: eu-central-1 Operational
Snapshot Storage Infrastructure (S3): eu-central-1 Operational
Kibana Connectivity: AWS eu-central-1 Operational
APM Connectivity: AWS eu-central-1 Operational
AWS Oregon (us-west-2) Operational
Cluster Connectivity: AWS us-west-2 Operational
AWS EC2 Health: us-west-2 Operational
Snapshot Storage Infrastructure (S3): us-west-2 Operational
Kibana Connectivity: AWS us-west-2 Operational
APM Connectivity: AWS us-west-2 Operational
AWS São Paulo (sa-east-1) Operational
Cluster Connectivity: AWS sa-east-1 Operational
AWS EC2 Health: sa-east-1 Operational
Snapshot Storage Infrastructure (S3): sa-east-1 Operational
Kibana Connectivity: AWS sa-east-1 Operational
APM Connectivity: AWS sa-east-1 Operational
AWS Singapore (ap-southeast-1) Operational
Cluster Connectivity: AWS ap-southeast-1 Operational
AWS EC2 Health: ap-southeast-1 Operational
Snapshot Storage Infrastructure (S3): ap-southeast-1 Operational
Kibana Connectivity: AWS ap-southeast-1 Operational
APM Connectivity: AWS ap-southeast-1 Operational
AWS Sydney (ap-southeast-2) Operational
Cluster Connectivity: AWS ap-southeast-2 Operational
AWS EC2 Health: ap-southeast-2 Operational
Snapshot Storage Infrastructure (S3): ap-southeast-2 Operational
Kibana Connectivity: AWS ap-southeast-2 Operational
APM Connectivity: AWS ap-southeast-2 Operational
AWS Tokyo (ap-northeast-1) Operational
Cluster Connectivity: AWS ap-northeast-1 Operational
AWS EC2 Health: ap-northeast-1 Operational
Snapshot Storage Infrastructure (S3): ap-northeast-1 Operational
Kibana Connectivity: AWS ap-northeast-1 Operational
APM Connectivity: AWS ap-northeast-1 Operational
Heroku Operational
Elastic Maps Service Operational
Past Incidents
Jun 24, 2019
Resolved - We have been monitoring the cluster throughout the day and the queues have remained at normal levels.
Jun 24, 22:22 UTC
Update - We have caught up processing our logging and metrics queues. All cluster logs should now be processed as expected.

Unfortunately, we were unable to save all the backlog queue files and have therefore lost some data for the period between 15:00 UTC on June 23rd, 2019 and 10:00 UTC on June 24th, 2019. At this stage we are assessing the impact of the data ingestion issue and will provide an update in a few hours.
Jun 24, 13:39 UTC
Update - Queues for logging and metrics data in us-east-1 are now draining at a more acceptable rate. We believe we have identified all the logging and metrics clusters suffering contention, and increased their number of shards to improve ingest.

We will update this incident when we can confirm improvements, but we expect customer-facing logs will still be delayed for several hours.
Jun 24, 10:25 UTC
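For readers following along, the shard increase mentioned above is one common way to spread ingest load across more nodes. Below is a minimal, hypothetical sketch of how an operator might raise the shard count for future logging indices via an index template; the endpoint, template name, and index pattern are placeholders rather than Elastic's actual configuration, and shard counts on already-created indices cannot be changed in place (they would need a split or rollover).

import requests

ES = "http://localhost:9200"       # placeholder monitoring-cluster endpoint
TEMPLATE = "logging-template"      # placeholder template name

body = {
    "index_patterns": ["logs-*"],                                  # placeholder index pattern
    "settings": {"number_of_shards": 4, "number_of_replicas": 1},  # more primaries for new indices
}
resp = requests.put(f"{ES}/_template/{TEMPLATE}", json=body)       # legacy index template API
resp.raise_for_status()
print(resp.json())                                                 # {"acknowledged": true} on success
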
Update - Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We have increased the number of shards on the active indices in the delayed clusters, and we are monitoring for improvements in our ingestion rates.

We'll update this incident when we can confirm improvements, but we expect logs will still be delayed for several hours.
Jun 24, 07:43 UTC
Update - Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We're working on increasing the number of shards on the active indices within the delayed logging cluster, which should improve our ingestion rates.

We'll update this incident when these changes have been made; however, we expect logs will still be delayed for several hours.
Jun 24, 06:19 UTC
Update - Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We're continuing remediation work to speed up ingestion; however, it will still take some time to clear these queues.

We'll continue to monitor the log and metrics pipelines and will update this incident with any new information as it comes to light.
Jun 24, 03:32 UTC
Update - The logging and metrics queues in us-east-1 are still high. We're continuing remediation work to speed up ingestion, but it will take some time to reduce the queues. We are continuing to monitor the situation.
Jun 24, 00:01 UTC
Monitoring - The logging delay in us-east-1 has improved, although the queues have not fully drained. This may cause logs and metrics from earlier today to not appear yet. We are continuing to monitor the situation.
Jun 23, 20:27 UTC
Update - The proxy layer in this region has been stable for the past 30 minutes. We are still working on the logging delay; logging in the region is currently about two hours behind.
Jun 23, 17:06 UTC
Update - The rate of 5xx errors on our proxies in AWS us-east-1 has returned to normal levels, and the proxy layer is more stable. We are still seeing a delay in logging in the region.

The backend ZooKeeper ensemble is still under increased load, and we are continuing to investigate. We'll have another update for you in one hour.
Jun 23, 16:47 UTC
Identified - A failure in a backend ZooKeeper node at 14:44 UTC caused increased proxy 5xx error rates starting at 15:11 UTC, which also disrupted customer intra-cluster connectivity and caused a logging delay in the us-east-1 region. The initial ZooKeeper failure has been corrected, and engineers are currently working to remediate the impact on our regional logging and metrics clusters.
Jun 23, 16:08 UTC
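As an aside for readers unfamiliar with ZooKeeper operations, the ensemble health described above can be probed with ZooKeeper's standard four-letter commands. A minimal, illustrative sketch follows; the hostnames are placeholders, this is not Elastic's internal tooling, and on ZooKeeper 3.5+ the command must be enabled via 4lw.commands.whitelist.

import socket

def zk_ruok(host, port=2181, timeout=2.0):
    # Send the standard "ruok" probe; a healthy ZooKeeper server replies "imok".
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ruok")
        return sock.recv(16).decode() == "imok"

# Placeholder ensemble member hostnames.
for node in ["zk1.example.internal", "zk2.example.internal", "zk3.example.internal"]:
    try:
        print(node, "ok" if zk_ruok(node) else "NOT ok")
    except OSError as exc:
        print(node, "unreachable:", exc)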
Jun 22, 2019

No incidents reported.

Jun 21, 2019

No incidents reported.

Jun 20, 2019
Resolved - This incident has been resolved.
Jun 20, 03:54 UTC
Update - Cluster logging and metric data should now be up to date for all deployments. We'll continue to watch our ingestion pipelines and monitoring clusters as part of our standard monitoring.

There should be no ongoing customer impact as part of this incident.
Jun 20, 03:28 UTC
Monitoring - Cluster logging and metric data for most clusters should now be up to date. Deployments in the AWS eu-central-1, GCP us-central1, and GCP europe-west1 regions will still be delayed; however, those regions are processing their queues and will be up to date within the hour.

We will update this issue within the next 30 minutes.
Jun 20, 03:01 UTC
Identified - Customer logs and metrics appear to be flowing again, with queues trending downwards. We expect cluster logging and metric data to be up to date within the next hour.

One of our backend monitoring clusters had entered a degraded state, which was impacting indexing rates within our Logstash consumers. We're still investigating what caused this cluster to get into this state, but we believe two nodes were impacted by our recent hot-warm incident, leaving them unable to keep up with cluster state changes. We've restored this cluster to an operational state, which in turn restored our log ingestion rates.

We'll post another update in 30 minutes, or as new information comes to hand.
Jun 20, 02:43 UTC
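For context, a degraded backend cluster of the kind described above typically shows up in its health status and in a backlog of pending cluster-state tasks. A minimal, illustrative sketch of such a check follows; the endpoint is a placeholder, and this is not Elastic's internal tooling.

import requests

ES = "http://localhost:9200"   # placeholder monitoring-cluster endpoint

health = requests.get(f"{ES}/_cluster/health").json()
print("status:", health["status"], "unassigned shards:", health["unassigned_shards"])

pending = requests.get(f"{ES}/_cluster/pending_tasks").json()["tasks"]
print("pending cluster-state tasks:", len(pending))
# A persistently long pending-tasks queue is one sign that nodes cannot keep up
# with cluster state changes, matching the behaviour described above.
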
Investigating - Currently, cluster logs and metrics across all regions are delayed. At this stage you won't be able to see up-to-date logs or metrics for your clusters. You can still access historical logs and metrics. We're investigating the root cause and will post an update shortly.
Jun 20, 02:03 UTC
Jun 19, 2019

No incidents reported.

Jun 18, 2019

No incidents reported.

Jun 17, 2019

No incidents reported.

Jun 16, 2019

No incidents reported.

Jun 15, 2019

No incidents reported.

Jun 14, 2019

No incidents reported.

Jun 13, 2019

No incidents reported.

Jun 12, 2019

No incidents reported.

Jun 11, 2019

No incidents reported.

Jun 10, 2019
Resolved - We have tested extensively over the weekend and monitored our failure rates. The rate of failures has dropped to our usual level and we are now resolving this incident. Thank you for your patience!
Jun 10, 15:02 UTC
Monitoring - We have completed the rollout to all production versions and confirmed that hot-warm deployment changes are now succeeding. We will monitor for the next day.
Jun 8, 00:28 UTC
Update - We have tested the fix in our staging environment and are working on getting it rolled out to production. We have confirmed that the rate of errors on deployment changes for hot-warm has decreased to normal levels. The production rollout is ongoing, and we will update you when it is completed.
Jun 7, 22:26 UTC
Identified - This incident is impacting all hot-warm cluster deployment changes in AWS regions. Connectivity to clusters is not impacted.

The preliminary mitigation that we rolled out last evening was a timeout increase, which allowed more hot-warm plans to succeed. This gave us some breathing room to dig deeper into the code base and spend time profiling thread contention, and we've had a breakthrough. We discovered that a networking method was writing a state file to disk on every run, rather than only when it changed, causing increased load on disk and additional thread contention. We have merged a fix and have begun the testing and rollout process.

We will update you in six hours with our progress.
Jun 7, 16:48 UTC
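The fix described above boils down to a write-only-on-change pattern: persist the state file only when its contents actually differ from what is already on disk. A minimal, hypothetical sketch of that pattern follows; it is not the code that was merged.

import json, os, tempfile

def write_state_if_changed(path, state):
    new_blob = json.dumps(state, sort_keys=True)
    if os.path.exists(path):
        with open(path) as fh:
            if fh.read() == new_blob:
                return False                       # unchanged: skip the disk write entirely
    # Write atomically via a temp file and rename so readers never see a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as fh:
        fh.write(new_blob)
    os.replace(tmp, path)
    return True
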
Update - Based on preliminary findings, the team was able to conclude that the recent configuration change has improved the success rate of new deployment creations. We are still in the process of determining the root cause of the increased failure rate in plan configuration changes for hot-warm deployments.

We will provide an update within the next 6 hours.
Jun 7, 09:50 UTC
Update - We've deployed our interim mitigations, which increase the thresholds that were causing hot-warm plan changes to fail. We're awaiting data on the effectiveness of these changes, but we expect them to reduce the failure rates of hot-warm deployment changes.

We'll update this issue within the next 4 hours as data comes to hand.
Jun 7, 06:24 UTC
Investigating - We are observing an elevated error rate in Deployment changes for Elasticsearch clusters in a hot-warm setup.

Customers making changes to or creating new hot-warm deployments may be experiencing errors, shown in the Deployment Activity tab.

We have identified that containers holding the Elasticsearch node processes are taking longer than expected to start, exceeding a threshold that leads to marking deployment changes as failed.

A team of cloud engineers is working on an interim mitigation that addresses the effects of this issue, and we expect a decrease in failure rates soon.

A new update will be posted within 4 hours.
Jun 6, 19:23 UTC
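The startup threshold mentioned in this incident is essentially a readiness poll with a deadline: if a container does not answer its health probe before the deadline, the deployment change is marked as failed. A minimal, hypothetical sketch of that kind of check follows; the probe and timeout values are placeholders, not Elastic's orchestration code.

import time

def wait_for_ready(probe, startup_timeout=300.0, interval=5.0):
    # Poll until the node answers its health probe or the startup threshold is exceeded.
    deadline = time.monotonic() + startup_timeout
    while time.monotonic() < deadline:
        if probe():                 # e.g. an HTTP health check against the new container
            return True
        time.sleep(interval)
    return False                    # threshold exceeded: the plan change would be marked failed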