# Worker Management Guide

## Quick Reference

```bash
# All commands run from the repo on any calim node (e.g. calim10):
cd /opt/devel/nkosogor/nkosogor/distributed-pipeline
```

| Action | Command |
| --- | --- |
| Start all workers | `./deploy/manage-workers.sh start` |
| Stop all workers | `./deploy/manage-workers.sh stop` |
| Check status | `./deploy/manage-workers.sh status` |
| Code change → deploy | `./deploy/manage-workers.sh deploy` |
| Start/stop one node | `./deploy/manage-workers.sh start calim08` |
| Tail logs | `./deploy/manage-workers.sh logs calim08` |
| Clear log files | `./deploy/manage-workers.sh clean-logs` |

## After Code Changes

```bash
# 1. Push your changes:
git push

# 2. SSH to any calim node and run ONE command:
./deploy/manage-workers.sh deploy
```

This runs `git pull` and restarts the workers on all 7 nodes automatically.

## Adding/Removing Nodes

Edit `AVAILABLE_NODES` in `deploy/manage-workers.sh`:

```bash
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
```
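As a rough sketch of how a script like this typically consumes the array (the real logic lives in `deploy/manage-workers.sh`; the `for_each_node` helper here is hypothetical, and the `ssh` line is commented out so the sketch is safe to run anywhere):

```shell
# Hypothetical fan-out loop over AVAILABLE_NODES.
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)

for_each_node() {
  local action="$1"
  for node in "${AVAILABLE_NODES[@]}"; do
    # On a real deployment this would be something like:
    # ssh "$node" "cd <repo> && ./deploy/manage-workers.sh $action $node"
    echo "$action -> $node"
  done
}

for_each_node status
```

Adding or removing a node from the array is enough for every action (`start`, `stop`, `deploy`, ...) to pick it up, since they all iterate the same list.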

## Submitting Jobs

```bash
# Dry run (verify file discovery):
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 73MHz 78MHz \
  --peel_sky --peel_rfi --dry_run

# Real run (with NVMe cleanup after archiving):
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 73MHz 78MHz \
  --peel_sky --peel_rfi --cleanup_nvme

# Remap subbands to different nodes:
--remap 18MHz=calim01 23MHz=calim05
```
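For context, `--remap` is appended to an otherwise normal submission. A full invocation might look like the following (this reuses the dry-run example above; the cal-table paths are placeholders, and the subband/node pairing is illustrative):

```shell
# Illustrative only: route 18MHz to calim01 and 23MHz to calim05,
# using --dry_run so nothing is actually submitted.
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 18MHz 23MHz \
  --remap 18MHz=calim01 23MHz=calim05 \
  --peel_sky --peel_rfi --dry_run
```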

## Monitoring

- **Flower:** http://localhost:5555 (SSH tunnel: `ssh -L 5555:localhost:5555 lwacalim10`)
- **Worker logs:** `./deploy/manage-workers.sh logs calim08`
- **Log files on disk:** `deploy/logs/calim08.log`

## Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Worker never picks up tasks | Wrong queue name | Verify the worker listens on the correct queue (check the `-Q` flag) |
| `FileNotFoundError: orca-conf.yml` | Missing config on worker node | Copy `~/orca-conf.yml` to the worker's home dir |
| `No files for 73MHz in 14h` | No data for that date/hour | Check that `/lustre/pipeline/night-time/averaged/73MHz/<date>/<hour>/` exists |
| Calibration fails | Bad cal table path or SPW mismatch | Check paths; inspect `logs/casa_pipeline.log` on NVMe |
| TTCal / peeling fails | Conda env missing | Run `conda env list` on the worker — `julia060` and `ttcal_dev` are required |
| `wsclean: command not found` | WSClean not on `PATH` | Set `export WSCLEAN_BIN=/opt/bin/wsclean` or check the install |
| Worker OOM-killed | Too much concurrency | Reduce `-c` (e.g. `-c 2`), or check `mem` in the imaging config |
| Connection refused on broker | RabbitMQ down or wrong URI | Check `broker_uri` in `~/orca-conf.yml`; verify RabbitMQ is running |
| Task stuck in `PENDING` | Worker not running or queue mismatch | Start the worker; confirm the queue matches `get_queue_for_subband()` |
| Phase 2 never starts | A Phase 1 task failed all retries | Check Flower for failed tasks; fix and resubmit |
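Several of the fixes above are simple presence checks, so they can be bundled into a small pre-flight script to run on a suspect worker node. This is a hypothetical sketch (a `preflight` helper is not part of the repo); the config path, conda env names, and `WSCLEAN_BIN` variable are the ones this guide mentions:

```shell
# Hypothetical pre-flight check for a worker node, distilled from the table above.
preflight() {
  local problems=0
  # Config file expected in the worker's home dir
  [ -f "$HOME/orca-conf.yml" ] || { echo "missing ~/orca-conf.yml"; problems=$((problems + 1)); }
  # WSClean must be on PATH or pointed to via WSCLEAN_BIN
  if ! command -v wsclean >/dev/null 2>&1 && [ -z "${WSCLEAN_BIN:-}" ]; then
    echo "wsclean not on PATH (set WSCLEAN_BIN)"; problems=$((problems + 1))
  fi
  # Conda envs needed for TTCal / peeling
  for env in julia060 ttcal_dev; do
    conda env list 2>/dev/null | grep -q "$env" || { echo "conda env '$env' not found"; problems=$((problems + 1)); }
  done
  echo "$problems problem(s) found"
}
preflight
```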

### Useful debug commands

```bash
# Check RabbitMQ queues:
rabbitmqctl list_queues name messages consumers

# Check Celery cluster status:
celery -A orca.celery inspect active

# Check registered tasks:
celery -A orca.celery inspect registered

# Purge a queue (careful!):
celery -A orca.celery purge -Q calim08

# Check NVMe usage:
df -h /fast/
```
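The NVMe check above can be turned into a threshold alarm for cron or a pre-run check. The `check_usage` helper below is a hypothetical sketch, not part of the repo; on the calim nodes the NVMe scratch is `/fast/`, and `/` is used in the final line only so the sketch runs anywhere:

```shell
# Hypothetical helper: warn when a mount point crosses a usage threshold.
check_usage() {
  local mount="${1:-/}" limit="${2:-90}"
  local pct
  # df -P gives stable POSIX columns; field 5 is the Use% value
  pct=$(df -P "$mount" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $mount at ${pct}% (limit ${limit}%)"
  else
    echo "OK: $mount at ${pct}%"
  fi
}

check_usage /    # on a worker node: check_usage /fast/
```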