Reports of issues making and receiving calls
Incident Report for Byphone
Postmortem

Incident Report: Service-affecting issue, 11:32-12:35 on Monday 8th May.

Summary of incident:

====================

  • Between 11:32 and 12:35 on Monday 8th May we had a degraded voice service. Most inbound calls failed, and many outbound calls also failed.

  • Some AWS instances, including one of our "non-critical" presence servers, had an availability issue on Saturday morning. Our presence server did not come back up cleanly and then locked up on Saturday night. It was restarted manually on Monday morning. A backlog of voicemail presence messages then flooded our network, causing intermittent database lookup failures, DNS lookup failures, and TCP socket overload.

  • Some calls (about 20%) did continue to connect throughout the incident; most of the affected calls were inbound. The average call duration of the calls that did connect was normal.

  • Engineering reacted promptly but were overwhelmed by the failure of our diagnostics and monitoring, and by the variety of intermittent failures we were seeing across many different components of the system.

  • The voice system recovered by itself after an hour.

  • Engineers have since spent a lot of time poring over our logs, metrics and traces to understand what happened, recreate parts of the failure in a sandbox, and implement mitigations to prevent a recurrence.

Analysis and Corrective Actions during the issue:

=================================================

Throughout the incident we could see that some calls were being processed correctly, so we were looking for a common feature of the failing calls.

Because it was a UK Bank Holiday there were fewer calls on the platform than usual, so it was hard to ascertain what "normal" should look like.

We could see that we were dropping a high percentage of inbound calls and a smaller percentage of outbound calls, accompanied by a large number of SIP timeout messages.

The normal process in an intermittent or partial failure is to use our SIP diagnostic platform to trace calls and find a distinguishing feature between the working and failing calls. Unfortunately the SIP diagnostic platform was not working during the incident, as it had been overwhelmed by the volume of data thrown at it that morning.

Timeline of Incident:

=====================

On the morning of Saturday 6th May there was an AWS incident affecting some EC2 instances, including a Presence server; the incident appeared to be resolved automatically.

The Presence server logs, however, indicate that it locked up at 18:21 on Saturday. It was still running, but not able to process any requests. No alarms were triggered, and we were unaware of the problem throughout the weekend.

On Monday morning we noticed, and received customer tickets informing us, that many BLF keys were misbehaving.

This was quickly traced to the affected Presence server, and that server was restarted at 11:22 on May 8th.

At 11:33 we saw a drop-off in the volume of calls on the platform.

At 11:47 we began receiving customer tickets about call problems.

We attempted to trace the issue, but every component of the system appeared to be working while receiving timeout messages from its next hop.

Attempting to trace on the SIP monitoring platform was impossible, as it rejected our username and password when we tried to log in.

At this stage we suspected that we had been hacked.

We then divided the workload: one team worked through a systematic analysis of each component of the voice system, while another worked on bringing monitoring back online and investigating the security and integrity of the platform.

At 12:10 we brought the SIP monitoring back online with additional capacity and could trace and diagnose calls properly.

At 12:30 normal calls began flowing, and some "ghost" calls cleared through the system.

At 12:35 inbound and outbound calls were working normally again.

Root Cause Analysis:

====================

Following the AWS issue on Saturday morning, a Presence server did not come back up cleanly, and it locked up in an indeterminate state on Saturday evening.

Our fleet of voice registrars and PBX instances continued to send messages to that presence server throughout the weekend, causing a large volume of network retransmissions and a consequently very high volume of messages sent to our monitoring platform. Our SIP monitoring and diagnostic platform succumbed to the pressure at midnight on Sunday night, and that was not noticed until we tried to use it to diagnose the issue on Monday.

When we tried to log in to the diagnostics platform it would not let us in, due to a failure to connect to its own database. The authentication failure threw us off for a while, leading us to suspect a hack and compromised system integrity.

Over the course of the weekend a small batch job "vm-presence" continued to run every half hour.  The job is intended to check each customer's mailbox and generate SIP messages to control the BLF (message waiting indication). It sends these messages to the presence server.

When the presence server didn't respond, "vm-presence" kept re-attempting each transmission. This slowed the job down so much that, instead of taking 5 minutes, it was taking many hours to complete.

This led to an unexpected situation where many instances of "vm-presence" were running concurrently, but consuming almost no resources as they could not connect to anything.

When the presence server was restarted, all the running "vm-presence" instances were suddenly able to connect. Resource consumption spiked to its upper limits, and the presence server was trying to handle orders of magnitude more messages than its design limit.
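
The following is a minimal sketch of that failure mode, not the actual "vm-presence" code: a blocking per-message retry loop. The host name, timings, and the placeholder SIP message are assumptions for illustration.

    import socket
    import time

    RETRY_DELAY_S = 10      # assumed pause between re-attempts
    MAX_ATTEMPTS = 60       # assumed per-message retry budget

    def send_with_retries(addr, payload):
        for _ in range(MAX_ATTEMPTS):
            try:
                with socket.create_connection(addr, timeout=5) as s:
                    s.sendall(payload)
                    return True
            except OSError:
                time.sleep(RETRY_DELAY_S)   # with the server down, each mailbox blocks here for many minutes
        return False

    def run_once(mailboxes, addr=("presence.internal", 5060)):
        # With the presence server locked up, every mailbox burns its full retry budget,
        # stretching a ~5 minute run into hours, while cron keeps starting overlapping runs.
        # When the server finally restarts, all of those runs deliver their backlog at once.
        for mbox in mailboxes:
            payload = f"PUBLISH sip:{mbox} SIP/2.0\r\n\r\n".encode()   # stand-in for the real SIP PUBLISH
            send_with_retries(addr, payload)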

Each of the NOTIFY messages being sent required database access from the presence and registrar proxy services in order to route the message back to the customer premises equipment.

Each of the PUBLISH messages sent by "vm-presence" required multiple DNS lookups in order to find the presence server.

This led to an unexpected and previously unseen scenario. DNS requests from the registrar to Amazon Route53 were timing out, so outbound calls were all routing over a failover route (which threw our diagnostics off as well). The "vm-presence" job was running on the database servers, so most of the CPU resource there was consumed trying to handle the SIP presence messages from "vm-presence". This caused random timeouts across all our PBX instances and proxies as they tried to route SIP traffic.

The registrar proxies were trying to keep so many connections open that they ran out of network sockets. This affected inbound calls trying to reach customer premises equipment more than outbound calls, hence the bias in the drop-off of normal traffic.
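
As an illustration of the kind of check that would have flagged this earlier, here is a minimal sketch assuming the psutil library on a Linux host (the threshold and process handling are assumptions); it compares each process's open network sockets against its file-descriptor limit.

    import collections
    import psutil

    WARN_RATIO = 0.8   # assumed alert threshold

    def check_socket_headroom():
        # Needs root to see sockets belonging to other users' processes.
        per_pid = collections.Counter(
            c.pid for c in psutil.net_connections(kind="inet") if c.pid)
        for pid, open_sockets in per_pid.items():
            try:
                proc = psutil.Process(pid)
                soft_limit, _hard = proc.rlimit(psutil.RLIMIT_NOFILE)   # Linux only
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
            if open_sockets > WARN_RATIO * soft_limit:
                print(f"WARNING: {proc.name()} (pid {pid}) has {open_sockets} open sockets "
                      f"against a limit of {soft_limit} file descriptors")

    if __name__ == "__main__":
        check_socket_headroom()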

At 12:30 the "vm-presence" batch jobs had caught up, the servers had plenty of free sockets, database load was back to normal, DNS and everything else had ample resources, and the diagnostics were all working properly.

Mitigating Actions Taken:

=========================

The capacity of our SIP monitoring and diagnostic platform was increased on Monday during the incident.

The "vm-presence" job has been moved to a dedicated instance.

The "vm-presence" job has been modified so that it aborts if the DNS lookup or presence server connection fails.

The "vm-presence" job has had a semaphore added so that only one instance can run at any given time.

The "vm-presence" job has been modified to reduce the ratio of DNS lookups to mailbox messages sent.
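
A minimal sketch of how those three code changes might look together, not the production "vm-presence" job: fail fast when DNS or the presence server is unreachable, hold an exclusive lock so only one instance runs, and resolve the presence server once per run instead of per message. File paths, host names, and helpers are illustrative assumptions.

    import fcntl
    import functools
    import socket
    import sys

    LOCK_FILE = "/var/run/vm-presence.lock"    # assumed lock path
    PRESENCE_HOST = "presence.internal"        # assumed host name
    PRESENCE_PORT = 5060

    @functools.lru_cache(maxsize=1)
    def resolve_presence_server():
        # One cached DNS lookup per run instead of one (or more) per mailbox.
        return socket.gethostbyname(PRESENCE_HOST), PRESENCE_PORT

    def build_publish_for(mbox: str) -> bytes:
        # Placeholder for the real SIP PUBLISH construction.
        return f"PUBLISH sip:{mbox} SIP/2.0\r\n\r\n".encode()

    def main(mailboxes):
        lock = open(LOCK_FILE, "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)   # semaphore: only one instance may run
        except BlockingIOError:
            sys.exit("vm-presence is already running; exiting")

        try:
            addr = resolve_presence_server()                   # abort the whole run if DNS fails
        except OSError as exc:
            sys.exit(f"DNS lookup for presence server failed ({exc}); aborting")

        for mbox in mailboxes:
            try:
                with socket.create_connection(addr, timeout=5) as s:
                    s.sendall(build_publish_for(mbox))
            except OSError as exc:
                sys.exit(f"presence server unreachable ({exc}); aborting rather than retrying")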

Alerts on general instance status checks have been added for the instances directly affected during the incident.
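
As an illustration, a minimal sketch of the kind of status-check alarm involved, assuming boto3, CloudWatch, and an SNS topic for notifications (the region, naming, and thresholds are assumptions):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")   # assumed region

    def add_status_check_alarm(instance_id: str, sns_topic_arn: str) -> None:
        # Alarm when the EC2 status check fails for two consecutive minutes.
        cloudwatch.put_metric_alarm(
            AlarmName=f"status-check-failed-{instance_id}",
            Namespace="AWS/EC2",
            MetricName="StatusCheckFailed",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            Statistic="Maximum",
            Period=60,
            EvaluationPeriods=2,
            Threshold=1,
            ComparisonOperator="GreaterThanOrEqualToThreshold",
            TreatMissingData="breaching",   # a silent instance should also alert
            AlarmActions=[sns_topic_arn],
        )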

Additional dashboards have been added to provide an improved view for comparing key metrics across related instances.

Mitigating Actions Planned:

===========================

We should add a DNS cache on some more hosts, e.g. the "vm-presence" host.

We should add further monitoring and alerts on the "non-critical" services and on the monitoring services themselves.

We should add some more monitoring of performance metrics, and additional dashboards to give us a better way of finding the critical areas during an incident; we were overwhelmed by a huge volume of detail but couldn't see the big picture.

We should add some more intelligence to the SIP and PBX instances so that they don't send presence messages if the presence server has "gone away".
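
A minimal sketch of the kind of "gone away" guard we have in mind, written in Python purely for illustration (the real change would live in the SIP/PBX components; the thresholds and names are assumptions). After repeated delivery failures the breaker opens and presence traffic is dropped for a cool-off period instead of being retried.

    import time

    class PresenceCircuitBreaker:
        """Suppress presence traffic for a cool-off period after repeated failures."""

        def __init__(self, failure_threshold: int = 5, cooloff_s: float = 300.0):
            self.failure_threshold = failure_threshold
            self.cooloff_s = cooloff_s
            self.failures = 0
            self.open_until = 0.0

        def allow_send(self) -> bool:
            # While the breaker is open, callers should drop (not queue) presence messages.
            return time.monotonic() >= self.open_until

        def record_success(self) -> None:
            self.failures = 0

        def record_failure(self) -> None:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = time.monotonic() + self.cooloff_s
                self.failures = 0

Each registrar or PBX instance would consult allow_send() before emitting a PUBLISH or NOTIFY, so a locked-up presence server would no longer accumulate a weekend-sized backlog.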

We suspect that the hour the system took to recover is due in part to timeouts on stale SIP connections on the edge proxy. We have not yet replicated that. There may be scope for faster recovery by reducing socket timeouts, but that may reduce the reliability of SIP registrations to customer premises equipment. This area requires more research.

Posted May 12, 2023 - 17:45 BST

Resolved
This incident has been resolved. A post-mortem will be provided in due course.
Posted May 08, 2023 - 13:14 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 08, 2023 - 12:42 BST
Investigating
We are currently investigating this issue.
Posted May 08, 2023 - 11:55 BST
This incident affected: SIP Gateway.