We have identified the root cause of the issue: we lost some of the physical servers because of high load. It appears that rescheduling the impacted workloads triggered a cascading failure on other nodes.
We are actively trying to recover access to the impacted servers.
Some workloads are still impacted and should recover soon.
New builds and container starts are also impacted.