At 11:19 AM on Thursday 16th June, an unexpected call event occurred on one of the PBX instances in our cluster. The event caused a severe spike in CPU and memory usage on the instance's host node, which within a few minutes impaired its ability to route SIP traffic. This in turn caused a cascade of issues on our Registrar and several other cluster instances, which were already under heavy load at the time and were unable to offload calls quickly enough.
Our automated monitoring detected the issue @ 11:30 AM as it progressed and attempted to self-heal services on 50% of the PBX cluster to clear it. This completed @ 11:35 AM but, in the interim, put the other 50% of the cluster under further pressure, eventually triggering the same self-healing process on the remaining instances. These completed @ 11:46 AM, but retransmissions on top of additional failed call attempts increased traffic further still.
As the Registrar struggled to cope with the increased load and was unable to respond within normal timeframes to a number of device registrations, these devices (over 50% of normal handset levels) attempted and successfully registered to our backup POP in AWS. As our engineers investigated the alerts received and became aware of the extent of the issue, we triggered a switchover of all inbound DDIs to route through our backup POP @ 11:45 AM, allowing the newly registered devices to receive incoming calls. Voicemail and external numbers included within ring groups and followme would have operated normally through the operational instances for the majority of the incident, as would outbound calls. Devices that did not fail over within an acceptable time period may need further investigation; please follow the steps outlined in the section of our wiki below, which were identified after our failover test on 31/05/22.
https://byphone.atlassian.net/wiki/spaces/BYP/pages/426023/Handset+Checks#No-Failover-between-POPs
Finding that the PBX cluster had already self-healed but that the primary Registrar had been impacted, our engineers restarted the processes on this server @ 11:50 AM to clear the backlog. Once this was completed and devices were confirmed to be registering and operating normally on our primary POP again, we reverted inbound DDIs to route to the primary POP @ 12:05 PM. Devices continued to return to the primary POP as their registrations expired. Engineers continued to monitor over the following hour and marked the incident resolved @ 12:52 PM.
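For context on the fail-back timing: SIP devices only re-register once their current registration expires, so the pace at which handsets returned to the primary POP was bounded by each device's configured registration expiry. A minimal sketch of that relationship (the device timestamp and the 30-minute expiry below are illustrative assumptions, not our actual handset configuration):

```python
from datetime import datetime, timedelta

def next_reregistration(last_registered: datetime, expires_s: int) -> datetime:
    """A device will not attempt to re-register (and thus cannot return
    to the primary POP) until its current registration expires."""
    return last_registered + timedelta(seconds=expires_s)

# Illustrative only: a device that registered to the backup POP at 11:40
# with a 30-minute expiry would not return to the primary POP until 12:10,
# even though the primary POP was confirmed healthy from ~12:05.
registered_at = datetime(2022, 6, 16, 11, 40)
print(next_reregistration(registered_at, 1800))  # 2022-06-16 12:10:00
```

This is why the return to the primary POP was gradual rather than immediate after the 12:05 PM revert.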
Once the root cause was identified, steps were taken on all PBX instances to prevent a recurrence; these were applied on Monday evening (19/06/22) with no impact to other services. In addition, a further review has been carried out to improve the efficiency of all PBX instances, both to reduce load and to spread it more evenly across the cluster.