Troubleshooting¶
Intended audience: Anyone who is administering APDB.
Known Issues¶
Issue |
Description |
Workaround |
|---|---|---|
Monitoring¶
Dashboard for production Cassandra cluster defines a number of graphs that reflect the state of the Cassandra and its host. It also includes logs for Cassandra and cassandra-medusa services with levels WARNING and ERROR.
Grafana dashboard defines a number of alerts for various conditions:
client connectivity failures, checked by a cron job every 10 minutes;
Cassandra monitoring updates, alarm raised when metrics stop to arrive;
Cassandra generates
OutOfMemoryexception;ZFS pool status is different from
ONLINE.
Alerts go to #ops-apdb-alerts Slack channel and a Pushover notification system of a responsible person.
Grafana can generate alerts for the dev and int clusters as well, but those are less critical and may happen due to those clusters being temporarily down.
A separate Grafana dashboard displays APDB-related metrics collected from Prompt Production clients. The data for that dashboard are extracted from log files in Loki by a cron job that runs every hour.
Metrics from Cassandra hosts are sent to two InfluxDB servers:
influxdb.slac.stanford.edufor general OS metrics,pp-influxdb.sdf.slac.stanford.edu- for Cassandra-specific metrics.
If one or both of those servers experiences a failure, Grafana will generate an alert due to a missing metrics data.
Logs from both Cassandra and Medusa services are sent to Loki, they can be retrieved with the query {compose_service="cassandra"} and {compose_service="medusa"} respectively.
Only a subset of Cassandra logging messages is saved in Loki to avoid overloading that service.
Recent Cassandra logs exist on each Cassandra node in folder /zfspool/cassandra/logs.
Sample Troubleshooting¶
Symptoms: “Clients connection check” alert is raised but other metrics on dashboard look healthy.
Cause: Occasionally cron jobs fail to run resulting in missing data.
Solution: Check cron host (sdfcron001), if unhealthy report it to USDF support.
Symptoms: “ZFS pool status” alert is raised but other metrics on dashboard look healthy.
Cause: One or more disks in ZFS pool have failed.
Solution: Logon to the node, verify pool status (zpool status), report to USDF support. There were cases when after a reboot two disks were in OFFLINE state, this was caused by a flaky controller card, it needs to be re-seated.
Symptoms: “Cassandra OutOfMemoryError” alert is raised.
Cause: Usually it is due to heavy load, e.g. too many concurrent jobs in a daily catch-up processing.
Solution: If possible reduce the number of concurrent clients.
Symptoms: Multiple types of alerts from a single host.
Cause: The host may be down or services fail to (re-)start.
Solution: Try to connect to the host, if it is down report to USDF support. If services are down (docker compose ps), restart the services. If Cassandra is in a bad state (nodetool status) then restart the service.
Symptoms: Alerts are generated for multiple hosts, but cluster looks healthy.
Cause: InfluxDB services may be down.
Solution: Check the state of influxdb.slac.stanford.edu and pp-influxdbpp-influxdb.sdf.slac.stanford.edu, report to USDF support if any service is not accessible.