# Alert Production — Operational Quick Reference
## Overview
- **Service Group:** Alert Production
- **Service Tier:** Real Time — failures during observing directly block alert delivery
- **Operating Hours:** Nightly observing windows (Chilean night)
- **Primary vCluster:** `usdf-prompt-processing` (prod) / `usdf-prompt-processing-dev` (dev)
- **Primary Slack Channel:** #lsstcam-prompt-processing
- **Phalanx ArgoCD:** https://phalanx.lsst.io/applications/prompt.html
## Pod-Internal Processing Phases
Each KEDA Scaled Job Pod executes these phases in sequence per visit-detector.
Note: The phase boundaries are not perfectly sequential. In particular, alert packets are sent to Alert Distribution during the pipeline phase (before export), and Sasquatch metrics are dispatched after export. Treat the table as a close approximation of the actual ordering rather than a strict sequence.
| # | Phase | What Happens | Reads From | Writes To |
|---|---|---|---|---|
| 1 | Scaled Job Start | Pod starts, consumes a fanned-out event from a Redis Stream. | Prompt Redis | (nothing) |
| 2 | Preload | Loads a local Butler from a read-only replica of the Embargo Butler. Reads calibration data from embargo storage (Ceph) and templates from Weka. | Embargo Butler RO replica, embargo storage (Ceph — calibrations), Weka (templates), APDB (Cassandra), Presence microservice, S3 File Notifications Kafka | Local repo (pod-local) |
| 3 | Alert Production Pipeline | Runs difference imaging, source detection, and association. Queries Sattle to filter satellite trails from DIASources (per DMTN-199 policy: if Sattle is unavailable, no alerts are transmitted). Sends alert packets to the Alert Distribution Service during this phase. | APDB, Sattle | APDB (new DIASources), local repo, Alert Distribution Service (alert packets) |
| 4 | Export Outputs | Sends dataset records (processed data) as Kafka messages to Prompt Kafka. Writes outputs to embargo storage. Dispatches metrics to Sasquatch via the REST Proxy (after export). | Local repo | Prompt Kafka (dataset records), embargo storage, Sasquatch (metrics) |
| 5 | Loop / Terminate | Listens for another Redis Stream event. If no new event arrives before the timeout, the Scaled Job terminates. Looping avoids the churn of creating a new pod per event. | Prompt Redis | (nothing) |
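The listen/terminate behavior in phases 1 and 5 can be sketched as a consume loop over the Redis Stream. This is an illustrative sketch, not the production code: the stream and consumer-group names come from the Technical Reference section below, while `run_scaled_job`, its parameters, and the Redis-like `client` interface are assumptions for illustration.

```python
import time

# Names from the Technical Reference section of this page.
STREAM = "instrument:lsstcam"
GROUP = "lsstcam_consumer_group"


def run_scaled_job(client, consumer, process, idle_timeout_s=120.0, block_ms=5000):
    """Consume fanned-out events until none arrives within idle_timeout_s,
    then return so the pod can terminate (avoids per-event pod churn).

    `client` is any Redis-like object exposing xreadgroup/xack
    (e.g. redis-py's Redis client); `process` runs phases 2-4.
    """
    deadline = time.monotonic() + idle_timeout_s
    handled = 0
    while time.monotonic() < deadline:
        # Block up to block_ms waiting for one new entry on the stream.
        resp = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=1, block=block_ms)
        if not resp:
            continue  # no event yet; keep listening until the idle timeout
        for _stream, entries in resp:
            for entry_id, fields in entries:
                process(fields)  # preload, pipeline, export
                client.xack(STREAM, GROUP, entry_id)
                handled += 1
        deadline = time.monotonic() + idle_timeout_s  # reset the idle timer
    return handled
```

The key design point from the table is the reset of the idle timer after each event: a busy night keeps one pod alive across many visit-detectors, and only a quiet stream lets it expire.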
## Complete Service Inventory
The following table lists every service required for Alert Production to function, along with runtime details and monitoring status. Services are listed in data-flow order.
| Service | Tier | vCluster / Host | K8s Namespace | Slack Channel | Grafana Dashboards | Doc Gaps |
|---|---|---|---|---|---|---|
| Next Visit Fan Out | Real Time | usdf-prompt-processing | next-visit-fan-out | lsstcam-prompt-processing | ✅ Incorporated into PP Production dashboards. Alerts configured for Kafka and Schema Registry connectivity. | |
| Prompt Redis | Real Time | usdf-prompt-processing | prompt-redis | lsstcam-prompt-processing | | |
| Prompt KEDA | Real Time | usdf-prompt-processing | keda | lsstcam-prompt-processing | | |
| Prompt Processing (KEDA Scaled Jobs) | Real Time | usdf-prompt-processing | prompt-keda-lsstcam | lsstcam-prompt-processing | | |
| Prompt Kafka | Real Time | usdf-prompt-processing | prompt-kafka | lsstcam-prompt-processing | Dashboard now functional for dev. Verify prod once the PodMonitor is added. | |
| Butler Writer | Real Time | usdf-prompt-processing | butler-writer-service | (none) | ⚠️ None documented | No dashboard, no Slack channel |
| APDB (Cassandra) | Real Time | bare-metal: sdfk8sk001-006 (prod), sdfk8sk007-012 (dev) | N/A | ops-apdb-alerts | | |
| Sattle (Satellite Catalog) | Real Time | Embargo Rack (systemd, 2× HA) | N/A (not in K8s) | (none) | ⚠️ None — logs to local disk only | No dashboard, no Slack, no centralized logs, no K8s visibility |
| S3 File Notifications Kafka | Real Time | usdf-prompt-processing | s3-file-notifications | usdf-infra-support | ⚠️ None documented | Part of the Data Curation group but on the critical path for AP |
| Embargo Transfer / Presence (Auto-Ingest) | Real Time | usdf-embargo-dmz | summit-new (LSSTCam data) | usdf-data-curation | ⚠️ None — metrics in internal Redis, not published externally | PP only directly uses the Presence microservice (not the ingest workers) |
| USDF Sasquatch | Real Time | usdf-rsp | sasquatch | usdf-rsp-support, status-efd | ✅ (Sasquatch has its own dashboards) | PP publishes metrics via the REST Proxy; Schema Registry IP changes are a known issue |
| Alert Distribution Service | Real Time | usdf-alert-stream-broker-dev (prod vCluster not documented) | alert-stream-broker | (none) | ⚠️ None documented | Prod vCluster, dashboard, and Slack all missing. Note: “Alert Stream Broker” and “Alert Distribution” are the same service. |
| Alert Archive | Critical | usdf-alert-stream-broker-dev (prod vCluster not documented) | alert-stream-broker | (none) | ⚠️ None documented | Daytime ops only. Prod vCluster, dashboard, and Slack all missing. |
## Infrastructure Dependencies
Every dependency required for nightly alert production, rated by failure impact.
| Dependency | Consumed By | Severity | Impact Description |
|---|---|---|---|
| LHN (Long Haul Network) | Next Visit Fan Out | ❌ HALT | No nextVisit events reach USDF from the Summit. Total pipeline halt. |
| Summit Sasquatch Kafka | Next Visit Fan Out (via socat) | ❌ HALT | No nextVisit events produced by ScriptQueue. Total pipeline halt. |
| socat proxy | Next Visit Fan Out | ❌ HALT | DNS resolves Summit Kafka addresses to the socat services; without the proxy, nextVisit events cannot reach Fan Out. Total pipeline halt. |
| Kubernetes cluster + embargo nodes | All K8s-based AP services | ❌ HALT | Nothing runs. Total pipeline halt. PP runs on dedicated embargo nodes. |
| SLAC LDAP | vCluster authentication | ⚠️ DEGRADED | Processing continues normally, but the cluster can only be accessed via an S3DF person — no vCluster authentication for Rubin staff. New deployments/syncs may fail. |
| Ceph `rubin-summit` bucket | S3 File Notifications, Embargo Transfer | ❌ HALT | Raw images not stored or accessible. Processing cannot start. |
| Embargo Storage (Ceph) | Prompt Processing (preload — calibration data), export (output writes) | ❌ HALT | PP reads calibration data from embargo storage during preload and writes outputs during export. If Ceph is unavailable, both preload and export fail. |
| S3 File Notifications Kafka | Prompt Processing pods | ❌ HALT | The Presence microservice is only queried once after preload and is not consulted again. If S3 File Notifications Kafka is down, PP will not receive file-arrival notifications and pipelines never start. |
| Presence microservice (part of Embargo Transfer) | Prompt Processing | ⚠️ DEGRADED | PP queries Presence once after preload to check whether images have already arrived. If Presence is unavailable, PP still works via S3 File Notifications Kafka. Note: PP does not directly depend on the auto-ingest workers, only on Presence. |
| Embargo Butler DB (PostgreSQL) | Prompt Processing (preload), Butler Writer | ❌ HALT | Cannot load the Butler or write results. Preload fails. |
| Cassandra cluster (APDB) | Prompt Processing (preload + pipeline) | ❌ HALT | Cannot read previous detections or write new DIASources. 6 prod nodes, 12 total. Backups via cassandra-medusa to an S3 bucket. |
| Weka | Prompt Processing (preload) | ❌ HALT | Cannot load templates. Processing fails immediately. Note: calibration data is stored separately in embargo storage (Ceph), not on Weka. |
| Sattle (Embargo Rack) | Prompt Processing (pipeline) | ❌ HALT | Per DMTN-199 policy (Sec. 3.4), alerts are not transmitted if Sattle does not succeed. No Sattle = no alerts. |
| Sasquatch (USDF RSP) | Prompt Processing (pipeline metrics) | ⚠️ DEGRADED | Metrics publication fails but processing continues. Known issue: the Schema Registry IP changes during Sasquatch maintenance. |
| Vault | All services needing secrets | ⚠️ ON REDEPLOY | Secrets are cached at deploy time; running services are unaffected. New deploys/syncs will fail. |
| Internet connectivity | Alert Distribution Service (outbound to brokers), Sattle (space-track.org) | ❌ HALT | Alerts cannot reach community brokers. Sattle catalog updates from space-track.org fail — a stale catalog means Sattle may refuse to clear alerts, leading to a halt. |
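The Sattle rows above amount to a gating rule: no successful Sattle result, no alerts. A minimal sketch of that fail-safe policy (the function and parameter names are hypothetical; only the behavior — transmit nothing unless Sattle succeeds — comes from DMTN-199 Sec. 3.4):

```python
from typing import Callable, Optional, Sequence


def gate_alerts(dia_sources: Sequence[dict],
                query_sattle: Callable[[Sequence[dict]], Sequence[bool]]
                ) -> Optional[list]:
    """Return alert-worthy DIASources, or None if Sattle did not succeed.

    Returning None means transmit nothing: per DMTN-199 Sec. 3.4 this is
    a halt condition, not a graceful degradation.
    """
    try:
        # query_sattle returns one satellite-trail flag per DIASource;
        # an outage (network down, stale catalog refusal) raises here.
        is_satellite = query_sattle(dia_sources)
    except Exception:
        return None  # fail safe: no Sattle result => no alerts
    return [src for src, flagged in zip(dia_sources, is_satellite) if not flagged]
```

Note that the failure branch and the "all sources flagged" branch are distinguishable (`None` vs. an empty list), which matters operationally: one is a halt, the other is a quiet visit.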
## Kubernetes Observability
| What to Watch | Tool | Dashboard / Command | Notes |
|---|---|---|---|
| Pod status (all AP namespaces) | Grafana | Filter by the AP namespaces | |
| Scaled Job count & scaling | kubectl / Grafana | | Should scale up ~20 s before exposures. |
| KEDA scaler health | Prometheus | | High latency suggests the Redis Streams scaler is slow. Errors may indicate API connectivity issues. |
| Redis Stream lag | Grafana | Redis Streams KEDA Performance dashboard | |
| Processing throughput | Grafana | PP Production Overview + Output dashboards | Check per-visit processing time, error rate, and alert count per visit. |
| Prompt Kafka health | Grafana / kubectl | Strimzi Kafka Dashboard | Filter by the prompt-kafka namespace. |
| Butler Writer throughput | kubectl logs | | ⚠️ No dashboard. Monitor for batch insert errors and connection timeouts to the Embargo Butler. |
| Cassandra cluster health | Grafana | Cassandra System Metrics dashboard | 6 prod nodes (sdfk8sk001-006). Alerts are configured to post to ops-apdb-alerts. |
| S3 File Notifications | kubectl | | ⚠️ No dashboard. Three brokers for redundancy. No auth/SSL (Ceph limitation). |
| Embargo Transfer / Presence | kubectl | | ⚠️ Metrics in internal Redis, not published externally. PP only depends on the Presence microservice. |
| Sattle health | SSH / systemd | | ⚠️ No K8s or Grafana. Bare-metal only. Check both HA nodes. |
| socat proxy | kubectl | | Verify DNS entries resolve Summit Kafka addresses to the socat services. |
| CoreDNS (vCluster) | kubectl | | DNS failures break inter-service connectivity. Restart CoreDNS if needed. |
## Known Issues & Failure Modes
Documented failure modes from the troubleshooting runbooks:
| Issue | Symptom | Resolution |
|---|---|---|
| Missing raw data at USDF | PP workers time out waiting for images. No data in the embargo bucket. | Upstream issue (Summit transfer). PP auto-skips after the timeout. |
| Missing fanned-out events | No fanned-out events received in the Redis Stream. KEDA doesn’t scale. | Check the Next Visit Fan Out pods, the socat proxy, and Summit Sasquatch. |
| Schema Registry IP change | Metrics publication to Sasquatch fails after Sasquatch maintenance. | Get the new Schema Registry IP. |
| Missing file arrival notifications | Bucket notifications silently disabled. PP will not run pipelines — S3 File Notifications Kafka is essential (HALT). Presence is only queried once after preload and cannot substitute for ongoing S3 notifications. | Reconfigure Ceph bucket notifications if this is not a wider storage issue. This is a HALT-level failure. |
| Missing dataset types | New dataset types not registered. | Manual registration: see the playbook. |
| Stack version mismatch | Auto-ingest and PP running different stack versions. Write conflicts. | Auto-ingest may need a stack version update. |
| KEDA cannot connect to K8s API | KEDA log errors about API connectivity. | Restart the KEDA operator; if the problem persists, follow the troubleshooting runbook. |
| Redis NOGROUP error | PP logs show Redis NOGROUP errors when reading the stream. | The Redis cluster was recreated. Run the Creating Redis Streams procedure to recreate keys and consumer groups. |
| PP cannot connect to Redis | Connection errors to Prompt Redis. | Verify Prompt Redis is up and reachable. |
| Images rejected as unprocessable | Inconsistency between nextVisit and the actual exposure. Missing metadata. | Upstream issue. PP raises a RuntimeError and skips the image. |
| PipelineExecutionError | Errors from the pipeline payload (science code); the framework is working correctly. | Non-urgent. Report to the Pipelines team. |
| Prompt service bugs | Major regression in the stack or middleware. | Roll back to the previously working version. Report to the PP team. |
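For the NOGROUP case above, the consumer group must exist before PP pods can read from the stream. The canonical steps live in the Creating Redis Streams procedure; the helper below is only an illustrative sketch, using the stream and group names from the Technical Reference section and any client exposing Redis's `XGROUP CREATE` (for example, redis-py's `xgroup_create`).

```python
def ensure_consumer_group(client, stream="instrument:lsstcam",
                          group="lsstcam_consumer_group"):
    """Create the stream's consumer group if it is missing; idempotent.

    mkstream=True also creates the stream key itself when absent, which is
    exactly the post-cluster-recreation state that produces NOGROUP errors.
    """
    try:
        client.xgroup_create(stream, group, id="$", mkstream=True)
        return "created"
    except Exception as exc:
        # redis-py raises ResponseError("BUSYGROUP ...") if the group exists.
        if "BUSYGROUP" in str(exc):
            return "already-exists"
        raise
```

Treating BUSYGROUP as success makes the helper safe to run repeatedly, so it can be invoked at pod startup without coordinating with other consumers.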
## Observability Gaps — Priority Action Items
Services that are Real Time critical but lack adequate observability:
| Service | Priority | What’s Missing & Recommended Actions |
|---|---|---|
| Next Visit Fan Out | 🟡 MEDIUM | Now incorporated into PP Production dashboards with alerts for Kafka and Schema Registry connectivity. Kafka dashboard panels not yet populated — investigating with infra. Still need: event throughput per detector, socat proxy latency, reconnection events. |
| Prompt Kafka | 🟡 MEDIUM (improved) | Strimzi Kafka Dashboard now functional on dev after adding the Kafka resource PodMonitor. Being rolled out to prod. Additional metrics are still needed once prod is live. |
| Butler Writer | 🔴 HIGH | No Grafana dashboard, no Slack channel. Need: batch insert throughput, error rate, Embargo Butler DB connection pool health, queue depth. |
| Sattle | 🔴 HIGH | No centralized monitoring; bare-metal with local logs only. Need: request latency, error rate, satellite catalog freshness (time since last space-track.org sync), HA failover status, centralized log shipping. |
| S3 File Notifications Kafka | 🟡 MEDIUM | No dashboard. Need: broker health, notification throughput, consumer lag for PP workers. |
| Embargo Transfer / Presence | 🟡 MEDIUM | Internal Redis metrics not published. PP only directly depends on the Presence microservice. Need: Presence microservice latency and availability, ingestion throughput, worker error rate, queue depth. |
| Alert Distribution Service | 🟡 MEDIUM | No dashboard, no Slack channel, prod vCluster not documented. Need: alert throughput, broker connection count, consumer lag per community broker. |
| Alert Archive | 🟡 MEDIUM | No dashboard, no Slack channel. Daytime service, not on the nightly critical path. |
## Deployment Repos & Configuration
| Service | Deployment / Config Repo | Vault Secrets (Prod) |
|---|---|---|
| Prompt Processing | | |
| Next Visit Fan Out | | |
| Prompt Redis | | (no secrets) |
| KEDA | | |
| Prompt Kafka | | |
| Butler Writer | (not documented) | |
| S3 File Notifications | | (no secrets) |
| Embargo Transfer | | |
| Alert Distribution Service | (not documented) | |
| APDB (Cassandra) | | |
| Sattle | | (on-host, not in Vault) |
Source Code Repos:

- Prompt Processing: `lsst-dm/prompt_processing`
- Next Visit Fan Out: `lsst-dm/next_visit_fan_out`
- Butler Writer: `lsst-dm/prompt_processing_butler_writer`
- APDB: `lsst/dax_apdb`
- Sattle: `lsst-dm/sattle`
- Alert Archive: `lsst-dm/alert_database_server`
- Alert Distribution Service: `lsst-dm/alert_database_ingester`
## Slack Channels
| Channel | Purpose |
|---|---|
| #lsstcam-prompt-processing | Primary channel for AP/PP operations (Prompt Processing, NVFO, Redis, KEDA, Kafka) |
| #ops-apdb-alerts | Automated Cassandra alerts from Grafana |
| #usdf-data-curation | Embargo Transfer / Auto-Ingest support |
| #usdf-infra-support | Kubernetes / vCluster / DNS infrastructure escalation |
| #usdf-rsp-support | Sasquatch / Schema Registry support |
| #status-efd | EFD / Sasquatch status updates |
## Technical Reference
DMTN-310: Reducing Butler database contention in Prompt Processing — explains the Prompt Kafka / Butler Writer architecture.
DMTN-199: Rubin Observatory Data Security Standards Implementation — defines data security standards including the satellite-catalog scrub policy (Sec. 3.4) implemented by Sattle.
Key technical details:

- **KEDA scaling strategy:** `eager` with `pendingEntriesCount=1` — one pending Redis message triggers a new Scaled Job.
- **Redis Stream naming:** `instrument:lsstcam` with `lsstcam_consumer_group`.
- **Prompt Kafka topics:** `butler-writer` (dataset records), `butler-writer-ingestion-events` (ingestion events consumed by the Prompt Publication Service).
- **socat proxy DNS:** a ConfigMap in the `kube-system` namespace resolves Summit Sasquatch Kafka bootstrap/broker addresses to socat services in the `kafka-proxy` namespace.
- **Next Visit timing:** events are available ≥20 s before the first exposure, allowing pre-scaling.
- **Scaled Job lifecycle:** jobs listen in a while loop; after processing, they wait for the next event. Timeout → termination. Reduces pod churn.
- **Sattle policy:** two machines in HA on the Embargo Rack (source: `docs/usdf-applications/ap/sattle/info.rst`). Per DMTN-199 Sec. 3.4 and the Sattle `info.rst`, the service “fails safe” — alerts are not transmitted if the service does not succeed. This is a halt condition, not a graceful degradation.
- **APDB backups:** `cassandra-medusa` to an S3 bucket. Off-site backup not yet implemented. Recovery can take hours to a day.
- **Embargo Transfer recovery:** re-trigger ingest via `trigger_ingest.py` in `lsst-dm/data-curation-tools`, or `butler ingest-raws`.
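The KEDA scaling strategy and Redis Stream names above combine into a ScaledJob trigger roughly like the following. This is an illustrative shape, not the deployed manifest: the metadata name, namespace, and address wiring are placeholders; only the strategy, stream, consumer group, and `pendingEntriesCount` pairing come from this page.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: prompt-processing-lsstcam   # placeholder name
  namespace: prompt-keda-lsstcam
spec:
  scalingStrategy:
    strategy: eager                 # scale for every pending entry
  triggers:
    - type: redis-streams
      metadata:
        addressFromEnv: REDIS_ADDRESS     # placeholder; actual wiring differs
        stream: instrument:lsstcam
        consumerGroup: lsstcam_consumer_group
        pendingEntriesCount: "1"          # one pending message => one new Job
```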