A B2B SaaS team saw API timeouts spike every few hours. Dashboards pointed at the database — connection pool saturation, slow queries, the usual suspects. They scheduled a week to tune indexes.
The assumption
The team had taken the dashboards at face value and assumed the database was at fault. Before anyone changed the datastore, an SRE asked for a one-page traffic map: which services called the API, which security groups and subnets sat between them, and which health checks ran on each hop.
What the map revealed
East-west traffic from a background worker to the API crossed a security group that allowed port 443 but not the internal gRPC port the worker actually used. Failures were intermittent because the worker retried on a jittered schedule — classic "sometimes works" behavior that looks like database flakiness in aggregate metrics.
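To make the comparison concrete, here is a minimal sketch of how a traffic map can be checked against firewall rules. Everything in it is hypothetical: the service names, the `Flow` type, the `ALLOWED_INGRESS` table, and especially port 50051, which stands in for the internal gRPC port the case study does not name. A real version would pull the rules from the cloud provider's API rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One edge on the traffic map: who calls whom, and on which port."""
    source: str
    dest: str
    port: int


# Hypothetical ingress rules: destination service -> ports its security group allows.
ALLOWED_INGRESS = {
    "api": {443},
}

# Hypothetical flows taken from the traffic map. 50051 is an assumed gRPC port.
FLOWS = [
    Flow("web-frontend", "api", 443),
    Flow("background-worker", "api", 50051),
]


def blocked_flows(flows, allowed_ingress):
    """Return the flows whose destination port is not in the allowed set."""
    return [
        flow
        for flow in flows
        if flow.port not in allowed_ingress.get(flow.dest, set())
    ]


for flow in blocked_flows(FLOWS, ALLOWED_INGRESS):
    print(f"BLOCKED: {flow.source} -> {flow.dest}:{flow.port}")
```

The point of the sketch is the shape of the check: list the ports callers actually use, list the ports each hop allows, and diff the two.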
The fix
One rule change on the worker subnet, plus a synthetic check on the gRPC path. Timeouts dropped within a day. The index tuning sprint was cancelled.
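The case study does not say how the synthetic check was built. As one possible shape, the sketch below probes whether a TCP connection to the gRPC port can be opened at all, using only the Python standard library. The function name `tcp_port_reachable` and the host and port in the usage comment are assumptions for illustration. A TCP connect confirms only that the network path is open; a fuller check would also issue a real gRPC call, such as a health-check RPC.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage, run on a schedule from inside the worker subnet:
#   if not tcp_port_reachable("api.internal.example", 50051):
#       raise_alert("worker -> api gRPC path is blocked")
```

Running the probe from the worker's own subnet matters: a check launched from anywhere else crosses different rules and would not have caught this failure.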
Takeaway: when symptoms move between layers, draw the wire first. Names on an architecture slide are not the same as allowed ports on the wire.