# Worker Management Guide

## Quick Reference

```bash
# All commands run from the repo on any calim node (e.g. calim10):
cd /opt/devel/nkosogor/nkosogor/distributed-pipeline
```

| Action | Command |
|---|---|
| Start all workers | `./deploy/manage-workers.sh start` |
| Stop all workers | `./deploy/manage-workers.sh stop` |
| Check status | `./deploy/manage-workers.sh status` |
| **Code change → deploy** | `./deploy/manage-workers.sh deploy` |
| Start/stop one node | `./deploy/manage-workers.sh start calim08` |
| Tail logs | `./deploy/manage-workers.sh logs calim08` |
| Clear log files | `./deploy/manage-workers.sh clean-logs` |

## After Code Changes

```bash
# 1. Push what you've developed
git push

# 2. SSH to any calim node and run ONE command:
./deploy/manage-workers.sh deploy
```

This runs `git pull` and restarts the workers on all 7 nodes automatically.

## Adding/Removing Nodes

Edit `AVAILABLE_NODES` in `deploy/manage-workers.sh`:

```bash
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
```

## Submitting Jobs

```bash
# Dry run (verify file discovery):
python pipeline/subband_celery.py \
    --range 04-05 --date 2026-01-31 \
    --bp_table /path/to/bandpass.B.flagged \
    --xy_table /path/to/xyphase.Xf \
    --subbands 73MHz 78MHz \
    --peel_sky --peel_rfi --dry_run

# Real run (with NVMe cleanup after archiving):
python pipeline/subband_celery.py \
    --range 04-05 --date 2026-01-31 \
    --bp_table /path/to/bandpass.B.flagged \
    --xy_table /path/to/xyphase.Xf \
    --subbands 73MHz 78MHz \
    --peel_sky --peel_rfi --cleanup_nvme

# Remap subbands to different nodes:
#   --remap 18MHz=calim01 23MHz=calim05
```

## Monitoring

- **Flower**: `http://localhost:5555` (SSH tunnel: `ssh -L 5555:localhost:5555 lwacalim10`)
- **Worker logs**: `./deploy/manage-workers.sh logs calim08`
- **Log files on disk**: `deploy/logs/calim08.log`

## Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| Worker never picks up tasks | Wrong queue name | Verify the worker listens on the correct queue (check the `-Q` flag) |
| `FileNotFoundError: orca-conf.yml` | Missing config on worker node | Copy `~/orca-conf.yml` to the worker's home directory |
| `No files for 73MHz in 14h` | No data for that date/hour | Check that `/lustre/pipeline/night-time/averaged/73MHz///` exists |
| Calibration fails | Bad cal table path or SPW mismatch | Check paths; inspect `logs/casa_pipeline.log` on NVMe |
| TTCal / peeling fails | Conda env missing | Run `conda env list` on the worker — both `julia060` and `ttcal_dev` are required |
| `wsclean: command not found` | WSClean not on PATH | Set `export WSCLEAN_BIN=/opt/bin/wsclean` or check the install |
| Worker OOM killed | Too much concurrency | Reduce `-c` (e.g. `-c 2`), or check `mem` in the imaging config |
| `Connection refused` on broker | RabbitMQ down or wrong URI | Check `broker_uri` in `~/orca-conf.yml`; verify RabbitMQ is running |
| Task stuck in PENDING | Worker not running or queue mismatch | Start the worker; confirm the queue matches `get_queue_for_subband()` |
| Phase 2 never starts | A Phase 1 task failed all retries | Check Flower for failed tasks; fix and resubmit |

### Useful debug commands

```bash
# Check RabbitMQ queues:
rabbitmqctl list_queues name messages consumers

# Check Celery cluster status:
celery -A orca.celery inspect active

# Check registered tasks:
celery -A orca.celery inspect registered

# Purge a queue (careful!):
celery -A orca.celery purge -Q calim08

# Check NVMe usage:
df -h /fast/
```
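The `deploy` action above (`git pull` + restart on every node) can be sketched as a loop over `AVAILABLE_NODES`. This is an illustrative sketch, not the real `manage-workers.sh`: `remote_deploy_cmd` and `deploy_all` are hypothetical helper names, and the per-node restart is approximated with the documented `stop`/`start` actions.

```bash
# Illustrative sketch only; the real logic lives in deploy/manage-workers.sh.
AVAILABLE_NODES=(calim01 calim05 calim06 calim07 calim08 calim09 calim10)
REPO_DIR=/opt/devel/nkosogor/nkosogor/distributed-pipeline

# Build the command to run on one remote node; kept as a function so it
# can be inspected without actually SSHing anywhere.
remote_deploy_cmd() {
    echo "cd $REPO_DIR && git pull && ./deploy/manage-workers.sh stop $1 && ./deploy/manage-workers.sh start $1"
}

# Run the deploy command on every node in turn.
deploy_all() {
    local node
    for node in "${AVAILABLE_NODES[@]}"; do
        ssh "$node" "$(remote_deploy_cmd "$node")"
    done
}
```

A sequential loop like this keeps failures easy to attribute to one node; the tradeoff is that a slow `git pull` on one node delays the rest.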
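The troubleshooting table references `get_queue_for_subband()`: a task stays PENDING if no worker listens on the queue that function returns for the task's subband. A minimal sketch of that routing idea — the subband→node mapping below is invented for illustration and does not come from the pipeline source:

```bash
# Hypothetical subband -> queue mapping; the real assignment lives in
# the pipeline's get_queue_for_subband(). Node choices here are made up.
get_queue_for_subband() {
    case "$1" in
        73MHz) echo calim08 ;;
        78MHz) echo calim09 ;;
        *)     echo "no queue configured for subband $1" >&2; return 1 ;;
    esac
}

# A worker started with -Q "$(get_queue_for_subband 73MHz)" would then
# match tasks submitted for that subband.
```

When debugging a queue mismatch, compare the worker's `-Q` value against the queue this routing produces for the stuck subband.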
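The `--remap 18MHz=calim01 23MHz=calim05` syntax shown under Submitting Jobs is a simple list of `SUBBAND=node` pairs. A sketch of how such pairs could be parsed — `parse_remap` is a hypothetical helper, and the snippet assumes bash 4+ for associative arrays:

```bash
# Hypothetical parser for --remap style pairs such as
# 18MHz=calim01 23MHz=calim05 (requires bash 4+ for declare -A).
declare -A REMAP

parse_remap() {
    local pair
    for pair in "$@"; do
        case "$pair" in
            # Split on the first '=': key before it, node after it.
            *=*) REMAP["${pair%%=*}"]="${pair#*=}" ;;
            *)   echo "bad --remap entry: $pair (expected SUBBAND=node)" >&2
                 return 1 ;;
        esac
    done
}
```

An explicit override table like this is what lets a remapped subband win over whatever default node the routing would otherwise pick.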