Grafana Alloy + Loki Deployment for Slurm Log Ingestion¶
This document describes all steps, configurations, troubleshooting notes, and validation procedures used to deploy Grafana Alloy on all compute nodes, forward Slurm logs into Loki on the controller, and expose Loki to Grafana on the monitoring node.
1. Pipeline Overview¶
                                +-----------------------------------------+
                                |             Monitoring Node             |
                                |-----------------------------------------|
                                |   Grafana (queries Loki @ controller)   |
                                +-----------------------------------------+
                                                     ^
                                                     |
                                                     |
                                           HTTP query and response
                                                     |
                                                     |
                                                     v
+-----------------------+             +-----------------------------+
|     Compute Nodes     |  Push logs  |         Controller          |
|-----------------------|------------>|-----------------------------|
| /var/log/slurm/       |             | /var/log/slurm/slurmctld.log|
|    slurmd.log         |             |                             |
|                       |             | Alloy (reads slurmctld.log) |
| Alloy (in Docker)     |             |       |                     |
|   - file_match        |             |       | push                |
|   - source.file       |             |       v                     |
|   - loki.write  ----->|             |  +-----------+              |
+-----------------------+             |  |   Loki    |              |
                                      |  | (Docker)  |              |
                                      |  +-----------+              |
                                      | Port: 3200 (exposed)        |
                                      +-----------------------------+
Alloy runs in Docker on all compute nodes and the controller.
Loki runs in Docker on the controller, exposed on port 3200.
The primary drain messages are recorded in /var/log/slurm/slurmctld.log on cais-controller.
2. Final Alloy Configurations¶
Version: grafana/alloy:1.12.0
2.1 Compute Nodes — /etc/alloy/config.alloy¶
// Discover the slurmd log
local.file_match "slurmd_logs" {
  path_targets = [
    {
      "__path__" = "/var/log/slurm/slurmd.log",
      job        = "slurmd",
      hostname   = constants.hostname,
    },
  ]
}

// Read log entries and forward to Loki
loki.source.file "slurmd" {
  targets       = local.file_match.slurmd_logs.targets
  tail_from_end = true
  forward_to    = [loki.write.cerberus_loki.receiver]
}

// Write logs to controller’s Loki
loki.write "cerberus_loki" {
  endpoint {
    url = "http://cais-controller:3200/loki/api/v1/push"
  }
}
2.2 Controller — /etc/alloy/config.alloy¶
// Discover the slurmctld log
local.file_match "slurmd_logs" {
  path_targets = [
    {
      "__path__" = "/var/log/slurm/slurmctld.log",
      job        = "slurmctld",
      hostname   = constants.hostname,
    },
  ]
}

loki.source.file "slurmctld" {
  targets       = local.file_match.slurmd_logs.targets
  tail_from_end = true
  forward_to    = [loki.write.cerberus_loki.receiver]
}

loki.write "cerberus_loki" {
  endpoint {
    url = "http://cais-controller:3200/loki/api/v1/push"
  }
}
3. Loki Configuration (Controller)¶
Version: grafana/loki:2.8.10
/etc/loki/loki.yml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /var/lib/loki
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
ingester:
  wal:
    dir: /var/lib/loki/wal
  lifecycler:
    address: 0.0.0.0
    ring:
      kvstore:
        store: inmemory
    num_tokens: 128
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_retain_period: 30s
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    shared_store: filesystem
  filesystem:
    directory: /var/lib/loki/chunks
limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
Run Loki:
sudo docker run -d \
--name slurm-log-loki \
-p 3200:3100 \
-v /etc/loki/loki.yml:/etc/loki/loki.yml \
-v /var/lib/loki:/var/lib/loki \
grafana/loki:2.8.10 \
-config.file=/etc/loki/loki.yml
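To confirm the container came up, a quick probe of Loki's readiness endpoint (a minimal sketch, run on the controller) is:
curl -is http://localhost:3200/ready | head -n 1
# expected: HTTP/1.1 200 OK once Loki has finished starting up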
4. Validate Alloy Configuration¶
# the "latest" image tag resolved to v1.12.0 at deployment time
sudo docker run --rm \
  -v /etc/alloy/config.alloy:/etc/alloy/config.alloy:ro \
  grafana/alloy:latest \
  validate /etc/alloy/config.alloy
5. Deploy Alloy to All Compute Nodes¶
Distribute config¶
pdcp does not elevate permissions, so it cannot write directly into /etc/alloy; stage the config in /tmp on every node and then move it into place with sudo:
pdcp -w ^/home/ubuntu/hosts /etc/alloy/config.alloy /tmp/config.alloy
pdsh -w ^/home/ubuntu/hosts "sudo cp /tmp/config.alloy /etc/alloy/config.alloy"
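To confirm every node received the same file, a checksum comparison across the same host list (a quick sketch) can help:
pdsh -w ^/home/ubuntu/hosts 'md5sum /etc/alloy/config.alloy'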
Restart Alloy (fresh container)¶
pdsh -w ^/home/ubuntu/hosts 'sudo docker rm -f alloy' # if a container with the same name exists
pdsh -w ^/home/ubuntu/hosts '
sudo docker run -d \
--name alloy \
-v /etc/alloy/config.alloy:/etc/alloy/config.alloy:ro \
-v /var/log/slurm:/var/log/slurm:ro \
grafana/alloy:latest \
run --storage.path=/var/lib/alloy/data \
/etc/alloy/config.alloy
'
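If a container starts but no logs reach Loki, inspecting the Alloy container log on the affected node (a sketch) usually surfaces mount or configuration errors:
sudo docker logs --tail 20 alloy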
Verify¶
pdsh -w ^/home/ubuntu/hosts 'echo -n "$(hostname): "; sudo docker ps --filter name=alloy --format "{{.Status}}"'
Example Output
compute-156: compute-156: Up 3 days
compute-178: compute-178: Up 3 days
compute-319: compute-319: Up 3 days
compute-447: compute-447: Up 3 days
...
6. Connectivity & Ingestion Testing¶
Compute → Loki check¶
Successful requests show HTTP/1.1 200 OK in the response headers.
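The exact command used is not reproduced here; a minimal probe from a compute node against Loki's readiness endpoint looks like this:
curl -is http://cais-controller:3200/ready | head -n 1   # readiness probe; run from any compute node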
Query last 5 minutes of logs in Loki¶
START_NS=$(date -d '5 minutes ago' +%s%N)
END_NS=$(date +%s%N)
curl -sG "http://cais-controller:3200/loki/api/v1/query_range" \
--data-urlencode 'query={job="slurmd"}' \
--data-urlencode "start=$START_NS" \
--data-urlencode "end=$END_NS" \
| jq
Force log write to ensure ingestion¶
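The write command itself is not preserved in this document; a sketch that appends a recognizable test line (the ALLOY_TEST marker queried below) to the slurmd log on a compute node:
echo "$(date '+%Y-%m-%dT%H:%M:%S') ALLOY_TEST manual ingestion check from $(hostname)" | sudo tee -a /var/log/slurm/slurmd.log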
Query for test line¶
curl -G 'http://cais-controller:3200/loki/api/v1/query' \
--data-urlencode 'query={job="slurmd"} |= "ALLOY_TEST"'
7. Debugging Lessons¶
Issue: Alloy produced entries but didn't forward¶
Cause: missing bind mount of /var/log/slurm.
Fix: include the bind mount -v /var/log/slurm:/var/log/slurm:ro in the docker run command.
Issue: slurmd logs are infrequent¶
Solution: manually append test lines for validation.
Issue: a few nodes not sending¶
Fix: fully reset Alloy:
pdsh -w compute-264,compute-522,compute-790 '
CID=$(sudo docker ps -q --filter name=alloy)
[ -n "$CID" ] && sudo docker stop "$CID" && sudo docker rm "$CID"
sudo rm -rf /var/lib/alloy && sudo mkdir -p /var/lib/alloy
sudo docker run -d --name alloy \
-v /etc/alloy/config.alloy:/etc/alloy/config.alloy:ro \
-v /var/log/slurm:/var/log/slurm:ro \
-v /var/lib/alloy:/var/lib/alloy \
grafana/alloy:latest \
run --storage.path=/var/lib/alloy/data \
/etc/alloy/config.alloy
'
Node → container hostname mapping (Loki label)¶
The hostname label currently seen in Loki is the Alloy container's hostname (inside Docker, constants.hostname resolves to the container ID). The mapping below changes every time the Alloy container is removed and recreated.
compute-447 → 3ad6473c6529
compute-156 → 031055a522f2
compute-178 → a1fd50deb5d7
compute-319 → d10d58266b88
compute-427 → f5363d77454e
compute-891 → a43db5354b8a
compute-556 → ff5882598ecc
compute-540 → f246b77b2429
compute-264 → 640951f0aa06
compute-317 → 8e3525999d39
compute-522 → 9c208ca97214
compute-837 → 971946d669de
compute-967 → c4ba2f228b56
compute-675 → ce38f2512eff
compute-790 → 42c2788cbf94
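If stable labels are preferred, one option (an assumption, not the deployed setup) is to pass the node's hostname into the container so constants.hostname matches the node name:
# sketch, not the deployed setup: pin the container hostname to the node hostname
sudo docker run -d --name alloy \
  --hostname "$(hostname)" \
  -v /etc/alloy/config.alloy:/etc/alloy/config.alloy:ro \
  -v /var/log/slurm:/var/log/slurm:ro \
  grafana/alloy:latest \
  run --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy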
8. Grafana Integration¶
Verify connectivity (monitoring node)¶
A successful connection shows HTTP/1.1 200 OK in the output.
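The original command is not reproduced here; a sketch from the monitoring node, using the same address as the Grafana data source below:
curl -is http://158.180.5.146:3200/ready | head -n 1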
Register Loki Data Source (UI)¶
- Connections → Data Sources → Add → Loki
- URL: http://158.180.5.146:3200
- Access: Server
- Save & Test → should return “Connection...left intact”.
Explore Queries¶
Top queries:
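The original query list was not captured here; as a sketch, LogQL along these lines matches the labels and patterns used elsewhere in this deployment (all slurmd logs, string-matched errors, and drain events from slurmctld):
{job="slurmd"}
{job="slurmd"} |= "error"
{job="slurmctld"} |= "reason set to:"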
Dashboard Panel: Drain Reason Summary¶
sum by (reason) (
  count_over_time(
    {job="slurmctld"}
      | regexp "\\[(?P<ts>[^]]+)\\].*reason set to:\\s*(?P<reason>[^ ].+)"
      | reason != ""
      [24h]
  )
)
Important: by default, Grafana runs the query over a range of time. To get a single total distribution without redundant counts, change the query Type from Range to Instant¶
9. Additional Loki Sanity-Checks¶
List hostnames Loki has seen¶
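The command for this check is not reproduced above; Loki's label-values API can list every value of the hostname label (a sketch):
curl -G 'http://cais-controller:3200/loki/api/v1/label/hostname/values'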
List streams¶
curl -G 'http://cais-controller:3200/loki/api/v1/series' \
--data-urlencode 'match[]={job="slurmd"}'
Useful Tips and Summary¶
- This pipeline forwards whole log streams without per-line filtering; filter in Grafana via LogQL string matching.
- All compute nodes must bind-mount /var/log/slurm into the Alloy container.
- Loki must be reachable on port 3200 from both compute and monitoring nodes.
- slurmd logs are low-volume; manual writes may be needed for ingestion tests.
- For drained log tests, you can manually drain compute nodes (see the sketch after this list).
- Grafana UI is the easiest datasource provisioning method.
- In addition to manual entries, monitor whether drain-related entries get ingested and displayed in Grafana.
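A sketch of a manual drain for testing (run where scontrol is available; the node name is illustrative):
sudo scontrol update NodeName=compute-156 State=DRAIN Reason="alloy ingestion test"
# resume the node once the drain entry has shown up in Grafana
sudo scontrol update NodeName=compute-156 State=RESUME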