At 11:19 AM on Thursday 16th June, an unexpected call event occurred on one of the PBX instances in our cluster. The event caused a severe spike in CPU and memory usage on the instance's host node, which within a few minutes impaired its ability to route SIP traffic. This in turn caused a cascade of issues on our Registrar and several other cluster instances, which were already under heavy load at the time and were unable to offload calls quickly enough.
Our automated monitoring detected the issue @ 11:30 AM as it progressed and attempted to self-heal services on 50% of the PBX cluster to clear it. This completed @ 11:35 AM but, in the interim, put the other 50% of the cluster under further pressure, eventually triggering the same self-healing process on the remaining instances. These completed @ 11:46 AM, but retransmissions on top of additional failed call attempts increased traffic further still.
As the Registrar struggled to cope with the increased load and was unable to respond within normal timeframes to a number of device registrations, these devices (over 50% of normal handset levels) attempted and successfully registered to our backup POP in AWS. As our engineers investigated the alerts received and became aware of the extent of the issue, we triggered a switchover of all inbound DDIs to route through our backup POP @ 11:45 AM, allowing the newly registered devices to receive incoming calls. Voicemail and external numbers included within ring groups and followme would have operated normally through the operational instances for the majority of the incident, as would outbound calls. Devices that did not fail over within an acceptable time period may need further investigation; please follow the steps outlined in the section of our wiki below, which were identified after our failover test on 31/05/22.
https://byphone.atlassian.net/wiki/spaces/BYP/pages/426023/Handset+Checks#No-Failover-between-POPs
Finding that the PBX cluster had already self-healed but that the primary Registrar had been impacted, our engineers restarted the processes on this server @ 11:50 AM to clear the backlog. Once this was completed and devices were confirmed to be registering and operating normally on our primary POP again, we reverted inbound DDIs to route to the primary POP @ 12:05 PM. Devices continued to return to the primary POP as their registrations expired. Engineers continued to monitor over the following hour and marked the incident resolved @ 12:52 PM.
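For context on the fail-back timing: SIP devices only re-register once their current registration expires, so the pace at which handsets returned to the primary POP was bounded by each device's configured registration expiry. A minimal sketch of that relationship (the device timestamp and the 30-minute expiry below are illustrative assumptions, not our actual handset configuration):

```python
from datetime import datetime, timedelta

def next_reregistration(last_registered: datetime, expires_s: int) -> datetime:
    """A device will not attempt to re-register (and thus cannot return
    to the primary POP) until its current registration expires."""
    return last_registered + timedelta(seconds=expires_s)

# Illustrative only: a device that registered to the backup POP at 11:40
# with a 30-minute expiry would not return to the primary POP until 12:10,
# even though the primary POP was confirmed healthy from ~12:05.
registered_at = datetime(2022, 6, 16, 11, 40)
print(next_reregistration(registered_at, 1800))  # 2022-06-16 12:10:00
```

This is why the return to the primary POP was gradual rather than immediate after the 12:05 PM revert.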
Once the root cause was identified, steps were taken on all PBX instances to prevent a recurrence; these were applied on Monday evening (19/06/22) with no impact to other services. In addition, a further review has been carried out to improve the efficiency of all PBX instances, both to reduce load and to spread it more evenly across the cluster.