# Worker Management Guide

## Quick Reference

```bash
# All commands run from the repo on any calim node (e.g. calim10):
cd /opt/devel/nkosogor/nkosogor/distributed-pipeline
```

| Action | Command |
| --- | --- |
| Start all workers | `./deploy/manage-workers.sh start` |
| Stop all workers | `./deploy/manage-workers.sh stop` |
| Check status | `./deploy/manage-workers.sh status` |
| Code change → deploy | `./deploy/manage-workers.sh deploy` |
| Start/stop one node | `./deploy/manage-workers.sh start calim08` |
| Tail logs | `./deploy/manage-workers.sh logs calim08` |
| Clear log files | `./deploy/manage-workers.sh clean-logs` |

## After Code Changes

```bash
# 1. Push your changes:
git push

# 2. SSH to any calim node and run ONE command:
./deploy/manage-workers.sh deploy
```

This runs `git pull` and restarts the workers on all 7 nodes automatically.

## Adding/Removing Nodes

Edit `AVAILABLE_NODES` in `deploy/manage-workers.sh`:

```bash
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
```
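As a rough sketch of how a script like this typically consumes the array (the real logic lives in `deploy/manage-workers.sh`; the `for_each_node` helper here is hypothetical, and the `ssh` line is commented out so the sketch is safe to run anywhere):

```shell
# Hypothetical fan-out loop over AVAILABLE_NODES.
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)

for_each_node() {
  local action="$1"
  for node in "${AVAILABLE_NODES[@]}"; do
    # On a real deployment this would be something like:
    # ssh "$node" "cd <repo> && ./deploy/manage-workers.sh $action $node"
    echo "$action -> $node"
  done
}

for_each_node status
```

Adding or removing a node from the array is enough for every action (`start`, `stop`, `deploy`, ...) to pick it up, since they all iterate the same list.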

## Submitting Jobs

```bash
# Dry run (verify file discovery):
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 73MHz 78MHz \
  --peel_sky --peel_rfi --dry_run

# Real run (with NVMe cleanup after archiving):
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 73MHz 78MHz \
  --peel_sky --peel_rfi --cleanup_nvme

# Remap subbands to different nodes:
--remap 18MHz=calim01 23MHz=calim05
```
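For context, `--remap` is appended to an otherwise normal submission. A full invocation might look like the following (this reuses the dry-run example above; the cal-table paths are placeholders, and the subband/node pairing is illustrative):

```shell
# Illustrative only: route 18MHz to calim01 and 23MHz to calim05,
# using --dry_run so nothing is actually submitted.
python pipeline/subband_celery.py \
  --range 04-05 --date 2026-01-31 \
  --bp_table /path/to/bandpass.B.flagged \
  --xy_table /path/to/xyphase.Xf \
  --subbands 18MHz 23MHz \
  --remap 18MHz=calim01 23MHz=calim05 \
  --peel_sky --peel_rfi --dry_run
```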

## Monitoring

- **Flower:** http://localhost:5555 (SSH tunnel: `ssh -L 5555:localhost:5555 lwacalim10`)
- **Worker logs:** `./deploy/manage-workers.sh logs calim08`
- **Log files on disk:** `deploy/logs/calim08.log`

## Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Worker never picks up tasks | Wrong queue name | Verify the worker listens on the correct queue (check the `-Q` flag) |
| `FileNotFoundError: orca-conf.yml` | Missing config on worker node | Copy `~/orca-conf.yml` to the worker's home dir |
| `No files for 73MHz in 14h` | No data for that date/hour | Check that `/lustre/pipeline/night-time/averaged/73MHz/<date>/<hour>/` exists |
| Calibration fails | Bad cal table path or SPW mismatch | Check paths; inspect `logs/casa_pipeline.log` on NVMe |
| TTCal / peeling fails | Conda env missing | Run `conda env list` on the worker — `julia060` and `ttcal_dev` are required |
| `wsclean: command not found` | WSClean not on `PATH` | Set `export WSCLEAN_BIN=/opt/bin/wsclean` or check the install |
| Worker OOM-killed | Too much concurrency | Reduce `-c` (e.g. `-c 2`), or check `mem` in the imaging config |
| Connection refused on broker | RabbitMQ down or wrong URI | Check `broker_uri` in `~/orca-conf.yml`; verify RabbitMQ is running |
| Task stuck in `PENDING` | Worker not running or queue mismatch | Start the worker; confirm the queue matches `get_queue_for_subband()` |
| Phase 2 never starts | A Phase 1 task failed all retries | Check Flower for failed tasks; fix and resubmit |
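Several of the fixes above are simple presence checks, so they can be bundled into a small pre-flight script to run on a suspect worker node. This is a hypothetical sketch (a `preflight` helper is not part of the repo); the config path, conda env names, and `WSCLEAN_BIN` variable are the ones this guide mentions:

```shell
# Hypothetical pre-flight check for a worker node, distilled from the table above.
preflight() {
  local problems=0
  # Config file expected in the worker's home dir
  [ -f "$HOME/orca-conf.yml" ] || { echo "missing ~/orca-conf.yml"; problems=$((problems + 1)); }
  # WSClean must be on PATH or pointed to via WSCLEAN_BIN
  if ! command -v wsclean >/dev/null 2>&1 && [ -z "${WSCLEAN_BIN:-}" ]; then
    echo "wsclean not on PATH (set WSCLEAN_BIN)"; problems=$((problems + 1))
  fi
  # Conda envs needed for TTCal / peeling
  for env in julia060 ttcal_dev; do
    conda env list 2>/dev/null | grep -q "$env" || { echo "conda env '$env' not found"; problems=$((problems + 1)); }
  done
  echo "$problems problem(s) found"
}
preflight
```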

### Useful debug commands

```bash
# Check RabbitMQ queues:
rabbitmqctl list_queues name messages consumers

# Check Celery cluster status:
celery -A orca.celery inspect active

# Check registered tasks:
celery -A orca.celery inspect registered

# Purge a queue (careful!):
celery -A orca.celery purge -Q calim08

# Check NVMe usage:
df -h /fast/
```
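The NVMe check above can be turned into a threshold alarm for cron or a pre-run check. The `check_usage` helper below is a hypothetical sketch, not part of the repo; on the calim nodes the NVMe scratch is `/fast/`, and `/` is used in the final line only so the sketch runs anywhere:

```shell
# Hypothetical helper: warn when a mount point crosses a usage threshold.
check_usage() {
  local mount="${1:-/}" limit="${2:-90}"
  local pct
  # df -P gives stable POSIX columns; field 5 is the Use% value
  pct=$(df -P "$mount" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $mount at ${pct}% (limit ${limit}%)"
  else
    echo "OK: $mount at ${pct}%"
  fi
}

check_usage /    # on a worker node: check_usage /fast/
```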