At 08:35 UTC we detected a network connectivity issue on 2 nodes that were part of the DNS and traffic routing infrastructure, which caused a disruption.
Further investigation revealed that networking on these 2 nodes was unstable, with symptoms including packet loss and unstable DNAT. The cause was identified as the overlay network (flannel), which had not correctly set the DNAT/SNAT rules for the pods running on those nodes.
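As an illustration of how missing NAT rules of this kind could be spotted on a node, the following is a minimal sketch, not the check used during this incident. It assumes the NAT table has already been dumped to text in `iptables-save -t nat` format; the function names, the sample rules and the addresses in them are hypothetical.

```python
import re

# Targets that perform source NAT; MASQUERADE is treated as a form of SNAT here.
SOURCE_NAT_TARGETS = {"SNAT", "MASQUERADE"}


def count_nat_rules(dump: str) -> dict:
    """Count DNAT and source-NAT rules in iptables-save style text."""
    counts = {"dnat": 0, "snat": 0}
    for line in dump.splitlines():
        line = line.strip()
        # Only rule lines ("-A CHAIN ...") are of interest.
        if not line.startswith("-A "):
            continue
        match = re.search(r"-j\s+(\S+)", line)
        if not match:
            continue
        target = match.group(1)
        if target == "DNAT":
            counts["dnat"] += 1
        elif target in SOURCE_NAT_TARGETS:
            counts["snat"] += 1
    return counts


def nat_rules_present(dump: str) -> bool:
    """Return True only if at least one DNAT and one source-NAT rule exist."""
    counts = count_nat_rules(dump)
    return counts["dnat"] > 0 and counts["snat"] > 0


# Hypothetical dump from a healthy node (addresses are made up).
HEALTHY = """*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -p udp -m udp --dport 53 -j DNAT --to-destination 10.244.1.5:53
-A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
COMMIT
"""

# Hypothetical dump from a node where no NAT rules were installed.
BROKEN = """*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
COMMIT
"""

if __name__ == "__main__":
    print("healthy node:", nat_rules_present(HEALTHY))
    print("broken node:", nat_rules_present(BROKEN))
```

A check like this only confirms that some DNAT and source-NAT rules exist. It does not verify that they are the right rules for the pods scheduled on the node.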
The monitors triggered as expected, and the offending components were identified and restarted.
At 09:23 UTC all nodes were fully operational and the incident was closed.