Fixing Mystery Timeouts With a Simple Traffic Map

18 Jun 2026 | 6 min read | Networking

A B2B SaaS team saw API timeouts spike every few hours. Dashboards pointed at the database — connection pool saturation, slow queries, the usual suspects. They scheduled a week to tune indexes.

The assumption

Before changing the datastore, an SRE asked for a one-page traffic map: which services called the API, which security groups and subnets sat between them, and which health checks ran on each hop.

What the map revealed

East-west traffic from a background worker to the API crossed a security group that allowed port 443 but not the internal gRPC port the worker actually used. Failures were intermittent because the worker retried on a jittered schedule — classic "sometimes works" behavior that looks like database flakiness in aggregate metrics.
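A traffic map like this can also be kept as data and checked mechanically. The following is a minimal sketch, not the team's actual tooling: the service names, the `sg-api` hop, and port 50051 are hypothetical stand-ins, since the post does not name the real gRPC port or security group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One caller-to-callee edge on the traffic map."""
    source: str
    dest: str
    port: int


@dataclass(frozen=True)
class Hop:
    """A security group (or similar filter) that sits on a path."""
    name: str
    allowed_ports: frozenset


def blocked_flows(flows, hops_between):
    """Return (flow, hop name) pairs where a hop does not allow the flow's port."""
    blocked = []
    for flow in flows:
        for hop in hops_between.get((flow.source, flow.dest), []):
            if flow.port not in hop.allowed_ports:
                blocked.append((flow, hop.name))
    return blocked


# Hypothetical values for illustration only.
GRPC_PORT = 50051  # assumed internal gRPC port
flows = [
    Flow("web", "api", 443),
    Flow("worker", "api", GRPC_PORT),
]
hops_between = {
    ("web", "api"): [Hop("sg-api", frozenset({443}))],
    ("worker", "api"): [Hop("sg-api", frozenset({443}))],
}

for flow, hop_name in blocked_flows(flows, hops_between):
    print(f"{flow.source} -> {flow.dest}:{flow.port} is not allowed by {hop_name}")
```

With these sample values, only the worker-to-API flow is reported, which mirrors the mismatch the map exposed.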

The fix

One rule change on the worker subnet to allow the gRPC port, plus a synthetic check on the gRPC path. Timeouts dropped within a day. The index tuning sprint was cancelled.
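A synthetic check for this kind of path can start as a plain TCP connect probe. The sketch below assumes a hypothetical host name and port, and it only confirms that the port is reachable from where the probe runs. It does not speak gRPC, so a fuller check would also call the service's own health endpoint.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage, with a hypothetical host and port:
#   if not port_reachable("api.internal.example", 50051):
#       print("gRPC path from this subnet is blocked or the API is down")
```

Running the probe from the worker subnet matters more than the probe itself: a check run from anywhere else exercises a different set of rules.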

Takeaway: when symptoms move between layers, draw the wire first. Names on an architecture slide are not the same as allowed ports on the wire.