Internal networking issues impacting our systems

Incident Report for Red Sift UK

Postmortem

At 08:35 UTC we detected a network connectivity issue on 2 nodes that are part of our DNS and traffic routing infrastructure, which caused a disruption.

Further investigation revealed that networking on these 2 nodes was unstable, with symptoms such as packet loss and unstable DNAT. The cause was identified as the overlay network (flannel), which had not correctly set the DNAT/SNAT rules for the pods running on those nodes.

The monitors triggered as expected, and the offending components were identified and restarted.

At 09:23 UTC all nodes were fully operational and the incident was closed.

Posted Jul 10, 2025 - 09:46 UTC

Resolved

There was an internal networking issue which briefly impacted our web tools and our API.

Posted Jul 10, 2025 - 08:30 UTC