# Worker Management Guide

## Quick Reference
```bash
# All commands run from the repo on any calim node (e.g. calim10):
cd /opt/devel/nkosogor/nkosogor/distributed-pipeline
```
| Action | Command |
|---|---|
| Start all workers | … |
| Stop all workers | … |
| Check status | … |
| Code change → deploy | `./deploy/manage-workers.sh deploy` |
| Start/stop one node | … |
| Tail logs | `./deploy/manage-workers.sh logs <node>` |
| Clear log files | … |
## After Code Changes
```bash
# 1. Push what you've developed
git push

# 2. SSH to any calim node and run ONE command:
./deploy/manage-workers.sh deploy
```
This runs `git pull` and restarts the workers on all 7 nodes automatically.
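Only the `deploy` subcommand is shown above, so here is a minimal sketch of what a per-node deploy loop could look like. The node list and repo path come from this guide, but the loop structure and the `restart` subcommand are assumptions — the real logic lives in `deploy/manage-workers.sh`:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a per-node deploy loop; the real implementation
# is deploy/manage-workers.sh. The "restart" subcommand is an assumption.
set -u

NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
REPO=/opt/devel/nkosogor/nkosogor/distributed-pipeline

deploy_all() {
  local runner=${1:-ssh}   # pass "echo" to print commands instead of running them
  local node
  for node in "${NODES[@]}"; do
    "$runner" "$node" "cd $REPO && git pull && ./deploy/manage-workers.sh restart $node"
  done
}

# Dry run: show what would be executed on each node
deploy_all echo
```

Passing `echo` as the runner prints the seven remote command lines without touching any node.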
## Adding/Removing Nodes

Edit `AVAILABLE_NODES` in `deploy/manage-workers.sh`:

```bash
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
```
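Scripts that target a single node can guard against typos by checking the name against this list first. A hedged sketch — the `is_available_node` helper is hypothetical, not part of `manage-workers.sh`:

```bash
# Hypothetical guard: check a node name against AVAILABLE_NODES before
# targeting it. Not part of manage-workers.sh; shown for illustration.
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)

is_available_node() {
  local candidate=$1 node
  for node in "${AVAILABLE_NODES[@]}"; do
    [ "$node" = "$candidate" ] && return 0
  done
  return 1
}

is_available_node calim05 && echo "calim05: ok"
is_available_node calim02 || echo "calim02: not in AVAILABLE_NODES"
```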
## Submitting Jobs

```bash
# Dry run (verify file discovery):
python pipeline/subband_celery.py \
    --range 04-05 --date 2026-01-31 \
    --bp_table /path/to/bandpass.B.flagged \
    --xy_table /path/to/xyphase.Xf \
    --subbands 73MHz 78MHz \
    --peel_sky --peel_rfi --dry_run

# Real run (with NVMe cleanup after archiving):
python pipeline/subband_celery.py \
    --range 04-05 --date 2026-01-31 \
    --bp_table /path/to/bandpass.B.flagged \
    --xy_table /path/to/xyphase.Xf \
    --subbands 73MHz 78MHz \
    --peel_sky --peel_rfi --cleanup_nvme

# Remap subbands to different nodes:
--remap 18MHz=calim01 23MHz=calim05
```
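Each `--remap` argument is a `SUBBAND=node` pair. A small illustrative parser for that format — the `parse_remap` helper is hypothetical, not part of `subband_celery.py`:

```bash
# Hypothetical helper documenting the SUBBAND=node pair format used by
# --remap; not part of the pipeline.
parse_remap() {
  local pair
  for pair in "$@"; do
    printf '%s -> %s\n' "${pair%%=*}" "${pair#*=}"
  done
}

parse_remap 18MHz=calim01 23MHz=calim05
# prints:
#   18MHz -> calim01
#   23MHz -> calim05
```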
## Monitoring

- Flower: http://localhost:5555 (SSH tunnel: `ssh -L 5555:localhost:5555 lwacalim10`)
- Worker logs: `./deploy/manage-workers.sh logs calim08`
- Log files on disk: `deploy/logs/calim08.log`
Troubleshooting
Symptom |
Likely Cause |
Fix |
|---|---|---|
Worker never picks up tasks |
Wrong queue name |
Verify worker listens on the correct queue (check |
|
Missing config on worker node |
Copy |
|
No data for that date/hour |
Check |
Calibration fails |
Bad cal table path or SPW mismatch |
Check paths; inspect |
TTCal / peeling fails |
Conda env missing |
Run |
|
WSClean not on PATH |
Set |
Worker OOM killed |
Too much concurrency |
Reduce |
|
RabbitMQ down or wrong URI |
Check |
Task stuck in PENDING |
Worker not running or queue mismatch |
Start worker; confirm queue matches |
Phase 2 never starts |
A Phase 1 task failed all retries |
Check Flower for failed tasks; fix and resubmit |
### Useful debug commands

```bash
# Check RabbitMQ queues:
rabbitmqctl list_queues name messages consumers

# Check Celery cluster status:
celery -A orca.celery inspect active

# Check registered tasks:
celery -A orca.celery inspect registered

# Purge a queue (careful!):
celery -A orca.celery purge -Q calim08

# Check NVMe usage:
df -h /fast/
```
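Since `--cleanup_nvme` exists to free space on `/fast/`, a quick usage check can be scripted around `df`. A hedged sketch — the 80% threshold and the `warn_nvme_usage` helper are assumptions, and the parsing assumes standard `df` column order:

```bash
# Hypothetical helper: flag df -h lines whose Use% (5th column) exceeds
# a threshold. Parsing assumes standard df output; not part of the repo.
warn_nvme_usage() {
  local threshold=${1:-80}
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 > t) printf "WARNING: %s at %s%% (>%s%%)\n", $6, use, t
    else             printf "OK: %s at %s%%\n", $6, use
  }'
}

# Real usage would be: df -h /fast/ | warn_nvme_usage 80
printf 'Filesystem Size Used Avail Use%% Mounted\n/dev/nvme0n1 7T 6.5T 500G 93%% /fast\n' \
  | warn_nvme_usage 80
# prints: WARNING: /fast at 93% (>80%)
```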