us-east — upstream Prometheus rate limitOne large customer’s Prometheus federation started rate-limiting our queries when their cluster hit a maintenance event. Pull latency for that customer briefly spiked. We added per-source backoff + jitter; no signal data was missed (we re-pull on backoff success).
GitHub webhook delivery to our control plane was backed up after a deploy. Customer YAML changes took 2–4 min to apply instead of the usual ~45s. Live signal evaluation was unaffected. Webhook ingest sharding was added in the next deploy.
Slack webhook endpoint experienced elevated latency on their side. PagerDuty + email routing unaffected; our retry queue absorbed the backlog and all messages eventually delivered. Confirmed with Slack incident report.
One email when an incident opens, one when it resolves. No marketing.