Procedures¶

Intended audience: Anyone who is administering APDB.

The dax_apdb_deploy package implements deployment and management procedures based on Ansible. Details of operations are documented in its README file.

dax_apdb_deploy needs to be cloned to a user’s directory on any rubin-devl machine and setup:

cd somewhere
git clone git@github.com:lsst-dm/dax_apdb_deploy.git
cd dax_apdb_deploy
make setup
. ./setup.sh

Deployment¶

Deployment of a new Cassandra cluster consists of defining a set of parameters for the new instance. This is done by adding a new group to group_vars folder and a new inventory YAML file. Additionally new secrets need to be setup in the Vault for both a standard user account and superuser account. Once the group and inventory are defined one needs to run a series of Ansible playbooks to configure remote systems and bring up all services.

Maintenance¶

Maintenance operations that could impact APDB availability need to be announced on #dm-prompt-processing-dev Slack channel in advance.

Operations that involve CQL queries (e.g. altering keyspace or table properties) need to be executed on one of the cluster nodes. There is a cqlsh wrapper script installed in a Docker deployment directory on each cluster node.

ssh rubincas@sdfk8sk001
cd apdb_deploy/docker
# Some operations may need superuser access. Cqlsh will prompt for a password.
./cqlsh -u superuser
Password:
cqlsh> DESCRIBE KEYSPACES
cqlsh> ...

Tasks that use nodetool command can be run from dax_apdb_deploy using its ansible-pssh command, e.g.:

# -d means to change to docker deployment directory before running the command.
# -1 means execute command on a single cluster node, default is to run on all nodes.
ansible-pssh -i inventory/apdb_prod.yaml -d -1 "./nodetool status"

Periodic maintenance tasks are executed by cron jobs from user’s account (currently salnikov). Scripts used by the cron jobs are located in dax_apdb_deploy/tree/main/etc/cron directory. Cron jobs use pre-installed dax_apdb_deploy location in user’s home directory.

Presently there are a few periodic cron jobs:

backups of the prod Cassandra cluster (daily/weekly/monthly/yearly) (etc/cron/backup-job.sh)
backups of the dev Cassandra cluster (daily/weekly) (etc/cron/backup-job.sh)
cleanup of the old backups (etc/cron/backup-cleanup-job.sh)
cluster repair (twice a week) (etc/cron/repair-job.sh)
check of cluster connection status by checking Cassandra port on each node (every 10 minutes)

Backup¶

Backup operations are based on cassandra-medusa service that runs on every node on the cluster. Location of the backups on S3 is configured in the group_vars. The medusa-backup CLI is implemented to make and manage backups.

Backups of the production cluster happen daily at 7:00 USDF time. In addition to daily backups there are weekly, monthly, and yearly backups. We keep 10 latest daily backups, 8 weekly backups, and 12 monthly backups. Yearly backups will be kept forever in case we will need to access data that was removed.

Each backup created by cassandra-medusa is a full backup, but backups use deduplication.

Cold Startup¶

The services run in docker containers and will restart automatically after downtime. Recovery from a disaster involves additional steps to restore data files from S3 to local filesystem. Restore procedure is documented in dax_apdb_deploy.

Cold Shutdown¶

Shutdown consists of running down.yml playbook using a corresponding inventory.

Reproduce Service¶

dax_apdb_deploy already has inventory and configuration for production, development (used for Prompt Production development), and integration clusters. Creating another Cassandra cluster would require finding a separate set of nodes with sufficient local storage.