Koyeb - Some services are currently unavailable – Incident details


Some services are currently unavailable

Resolved
Major outage
Started 3 days ago · Lasted about 4 hours

Affected

North America

Major outage from 12:10 PM to 4:06 PM

Washington, D.C. - WAS

Major outage from 12:10 PM to 4:06 PM

San Francisco - SFO

Major outage from 12:10 PM to 4:06 PM

Europe

Major outage from 12:10 PM to 4:06 PM

Frankfurt - FRA

Major outage from 12:10 PM to 4:06 PM

Paris - PAR

Major outage from 12:10 PM to 4:06 PM

Updates
  • Postmortem
    Postmortem

    Incident Summary

    On December 20, 2025 at 12:00 UTC, we experienced a service disruption that impacted workloads running on Koyeb. During this incident, affected services were inaccessible for between 20 minutes and 3 hours 30 minutes, with hobbyist workloads experiencing the longest downtime.

    The incident was triggered by a human error that altered one of our system databases and unintentionally paused a large number of services.

    What happened

    A change intended for an internal environment was mistakenly applied to production, causing nearly all production services to transition into a paused state.

    Pausing a service is an action we expose to Koyeb users to temporarily suspend a service. When a service is paused, its configuration is preserved, it can be resumed at any time, and no charges are incurred while it remains paused. This state is typically used intentionally by users, not as part of platform operations.

    At Koyeb, most of the resources follow a defined lifecycle and are continuously reconciled to match their desired state. This reconciliation loop is responsible for ensuring services converge toward the expected configuration.

    This incident began when an unintended change set the desired state of many services to paused. When the reconciliation loop ran, it applied that state and began pausing the affected services.
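
    To make the mechanism concrete, here is a minimal sketch of a desired-state reconciliation loop. The Service type, state names, and pause/resume steps are assumptions for illustration, not Koyeb's actual control plane:

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical service model; the type, field, and state names are
    // illustrative, not Koyeb's actual data model.
    type State string

    const (
        StateRunning State = "running"
        StatePaused  State = "paused"
    )

    type Service struct {
        Name    string
        Desired State // what the control plane records as the desired state
        Actual  State // what is currently deployed
    }

    // reconcile drives every service toward its desired state. If a faulty
    // write flips the desired state of many services to "paused", the loop
    // will faithfully pause them on its next pass, which is how the
    // unintended change propagated during this incident.
    func reconcile(services []*Service) {
        for _, svc := range services {
            if svc.Actual == svc.Desired {
                continue // already converged, nothing to do
            }
            switch svc.Desired {
            case StatePaused:
                fmt.Printf("pausing %s\n", svc.Name)
                svc.Actual = StatePaused
            case StateRunning:
                fmt.Printf("resuming %s\n", svc.Name)
                svc.Actual = StateRunning
            }
        }
    }

    func main() {
        services := []*Service{
            {Name: "api", Desired: StateRunning, Actual: StateRunning},
            {Name: "worker", Desired: StatePaused, Actual: StateRunning}, // unintended write
        }
        for i := 0; i < 2; i++ {
            reconcile(services)
            time.Sleep(time.Second) // real loops run continuously or on a schedule
        }
    }
    ```

    In this model, a single bad write to the desired state is enough for the next reconciliation pass to pause services at scale, which is how the faulty change turned into a platform-wide outage.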

    Immediately after identifying the issue, we reverted the change. About 15 minutes after the incident began, services started to recover in three ways:

    1. Users could resume their services themselves and recover immediately.

    2. We intervened and prioritized restoring critical customers' workloads ourselves.

    3. In parallel, we worked to gradually recover the remaining services as fast as possible.

    Approximately 30 minutes after the incident began, critical and business customer workloads were back online.

    About 1 hour 50 minutes after the incident began, we applied additional recovery steps to allow the remaining affected services to fully recover.

    Impact on Postgres services

    During recovery, we uncovered a secondary issue affecting a small subset of users.

    A bug in our database recovery logic caused database roles to be re-provisioned while services were transitioning through paused and resuming states. As a result, database user passwords for affected Postgres services were changed, making existing credentials invalid.

    Because we don’t store database passwords, the original credentials couldn’t be restored. We directly supported affected users and sent targeted communications with instructions to update their configuration.
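
    To illustrate the property the fixed recovery logic has to guarantee, the sketch below only generates credentials when a role is first created and leaves existing roles untouched. The package, function names, and query shown are assumptions for illustration, not Koyeb's actual recovery code:

    ```go
    package provisioning

    import (
        "database/sql"
        "fmt"
    )

    // ensureRole sketches the idempotency property the recovery path needs:
    // credentials are generated only when the role is first created, never
    // when an existing role is re-provisioned during a pause/resume
    // transition. Resetting the password of an existing role is what
    // invalidated user credentials during this incident.
    func ensureRole(db *sql.DB, role string, newPassword func() string) error {
        var exists bool
        err := db.QueryRow(
            `SELECT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = $1)`, role,
        ).Scan(&exists)
        if err != nil {
            return fmt.Errorf("checking role %q: %w", role, err)
        }
        if exists {
            // Re-provisioning must be a no-op for credentials; the platform
            // does not store the original password, so a reset here cannot
            // be undone.
            return nil
        }
        // First-time creation: generate credentials exactly once.
        // DDL statements cannot use bind parameters, hence the string build.
        _, err = db.Exec(fmt.Sprintf(`CREATE ROLE %q LOGIN PASSWORD '%s'`, role, newPassword()))
        return err
    }
    ```

    Because the original password is never stored, any code path that re-runs credential generation for an existing role can only rotate the password, which is why re-provisioning has to be strictly idempotent.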

    Incident Timeline

    • 2025-12-20 12:00 - A faulty manual operation was executed during an internal change.

    • 2025-12-20 12:00 - Services began transitioning into a pausing state. Traffic to affected services started returning HTTP 404.

    • 2025-12-20 12:15 - The rate of 404 responses started to rise significantly as our systems moved more services to the Paused state.

    • 2025-12-20 12:29 - Impact reached its peak. At that point, the majority of Instances on the platform were stopped.

    • 2025-12-20 12:30 - We started to recover Services, giving priority to critical and business customer workloads.

    • 2025-12-20 12:46 - The majority of business workloads had recovered.

    • 2025-12-20 13:31 - While manually restoring critical workloads, we worked on a mechanism to enable batch restores while skipping builds, allowing faster recovery for affected services.

    • 2025-12-20 14:10 - We started to batch recover all remaining services.

    • 2025-12-20 14:25 - By then, more than 50% of remaining services had recovered.

    • 2025-12-20 16:00 - All services were fully recovered.

    Corrective and Preventative Measures

    We’re already changing the system and our processes to prevent this kind of outage from occurring again. Below are the first measures we’re taking; some are already implemented, while others are in progress:

    • Rate limiting for reconciliations: a mechanism to throttle reconciliations so that mass operations are spread over a longer period of time, giving us time to react and revert in case of unexpected changes (see the sketch after this list).

    • Audited database services recovery paths: recovery logic has been reviewed and fixed to prevent unintended service state changes.

    • Mandatory approval by a peer for a set of admin operations, including batch changes. On-call engineers can escalate incidents to a peer to get approval during off-hours.

    • More robust service state evaluation: simplified state handling by focusing on the intended service state rather than individual intermediate transitions, reducing the risk of unexpected or unhandled state changes.
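
    As a rough illustration of the first measure, the sketch below throttles reconciliation operations with Go's golang.org/x/time/rate token-bucket limiter; the limits, service names, and loop structure are placeholders rather than Koyeb's actual implementation:

    ```go
    package main

    import (
        "context"
        "fmt"
        "time"

        "golang.org/x/time/rate"
    )

    // Illustrative throttle for a reconciliation pass: each state change has
    // to acquire a token, so a faulty mass update rolls out slowly enough to
    // be noticed and reverted. Services and limits are made-up values.
    func main() {
        // Allow at most 2 state changes per second, with a burst of 1.
        limiter := rate.NewLimiter(rate.Limit(2), 1)

        pending := []string{"svc-a", "svc-b", "svc-c", "svc-d", "svc-e"}
        ctx := context.Background()

        start := time.Now()
        for _, svc := range pending {
            // Wait blocks until the limiter grants a token (or ctx is cancelled).
            if err := limiter.Wait(ctx); err != nil {
                fmt.Println("reconciliation aborted:", err)
                return
            }
            fmt.Printf("%5.2fs applying state change to %s\n", time.Since(start).Seconds(), svc)
        }
    }
    ```

    Spreading a mass operation over minutes instead of seconds leaves a window to notice an abnormal spike in pause operations and revert the change before most services are affected.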

    Conclusion

    This incident should not have happened and the impact is totally unacceptable. We should not have been able to affect production at this scale, and our existing safeguards failed to contain the incident.

    We take full responsibility for this failure. The root cause was not an individual action, but gaps in our systems, tooling, and processes that allowed a small error to escalate into a broad outage.

    We are already acting on multiple fronts to prevent any form of recurrence, with some corrective measures already in place and others actively in progress (see above). These changes focus on preventing similar mistakes from reaching production, limiting how far mistakes can propagate, and improving recovery behavior to be safer and more predictable.

    As always, feel free to contact us in case you need anything.

    We’re committed to doing better.

  • Resolved
    Resolved
    This incident has been resolved.
  • Monitoring
    Monitoring

    We implemented a fix and are currently monitoring the result.

    The majority of the services have been recovered; we are still recovering some of the free services.

  • Identified
    Identified

    We are working on a fix. In the meantime, you may resume your services yourself to recover faster.

    We are continuing to work on a fix for this incident.

  • Investigating
    Investigating
    We are currently investigating this incident.