The system has been stable for over 12 hours. We are continuing work on adding additional performance and stability enhancements, but the main cause of the downtime yesterday has been resolved. If you have any questions, please feel free to contact us at firstname.lastname@example.org.
Posted over 1 year ago. May 11, 2018 - 06:58 PDT
The effects of losing connectivity on such a large portion of the cluster and the ensuing restoration have caused some delays in processing tests or starting new runs as queue processing is saturated in different areas of the test run pipeline. We're doing what we can to work to resolve the bottlenecks as they arise, but you may continue to see occasional issues while we work on a full resolution.
Posted over 1 year ago. May 10, 2018 - 17:04 PDT
Two separate security updates applied to our instances caused them to fail to boot under certain conditions (we are still determining the exact set of circumstances that caused the failed boots). We have reverted or reconfigured the hosts to work around this issue and service levels are currently returning to normal. We don't expect any regressions, but we are not yet out of the woods. We will keep this incident open while we finalize a resolution.
Posted over 1 year ago. May 10, 2018 - 11:26 PDT
We have narrowed down the issue to a few possible causes and are rolling back the associated changes while deploying new instances. As we restore instances you may see some service levels restored to normal, but this process will take some time due to the number of instances affected.
Posted over 1 year ago. May 10, 2018 - 10:35 PDT
We are continuing to work with AWS support to pinpoint the source of the issue causing our instances to have connectivity failures after being launched. We are working through different remedies to eliminate possible causes and restore service levels to normal.
Posted over 1 year ago. May 10, 2018 - 09:27 PDT
We are continuing to experience ongoing issues across our EC2 cluster. Newly-created and existing instances are failing connectivity checks. We're working as fast as we can to try to restore service.
Posted over 1 year ago. May 10, 2018 - 08:15 PDT
We are continuing to restore affected services through new instances and failing over to secondary systems. You may see intermittent availability while services are brought online, but expect continued issues until the instances are fully restored.
Posted over 1 year ago. May 10, 2018 - 07:44 PDT
We experienced a sudden loss of multiple instances in our AWS cluster. We are continuing to investigate the cause while restoring instances and failing over to secondary systems.
Posted over 1 year ago. May 10, 2018 - 07:27 PDT
The issue is also causing intermittent errors loading the dashboard. We are continuing to investigate.
Posted over 1 year ago. May 10, 2018 - 07:17 PDT
We are investigating an issue causing test runs not to run. Updates to follow as we know more.
Posted over 1 year ago. May 10, 2018 - 07:14 PDT
This incident affected: Dashboard - runscope.com and API Tests.