General¶
Rebasing our codebase onto OCI-HPC Head¶
How to do it step-by-step
Create a backup
Make a new release for the Cerberus repo on GitHub with today's date (in case we need to revert). Download the zip of the Cerberus repo and attach it to the release.
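If you prefer the command line, the GitHub CLI can create the dated release and attach the zip in one step. This is only a sketch; the tag naming convention and the assumption that `gh` is installed and authenticated for the Cerberus repo are ours:
# Build a zip of main and attach it to a dated release (the tag name is just a convention)
git archive --format=zip -o cerberus-$(date +%F).zip main
gh release create backup-$(date +%F) cerberus-$(date +%F).zip --title "Pre-rebase backup $(date +%F)" --notes "Snapshot before rebasing onto upstream oci-hpc"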
Rebase the repo
Look through the changes manually. If they're harmless, bring them over by default, such as most of the Terraform that does not affect us (because we do not redeploy our cluster). If they touch common code, review the code and then decide which version we keep or whether to take the update.
# Add the remote (original repo that we forked) and call it “upstream”
git remote add upstream https://github.com/oracle-quickstart/oci-hpc.git
# Fetch all branches of remote upstream
git fetch upstream
# Rewrite your current branch with upstream’s master using git rebase
git rebase upstream/master
# Resolve all merge conflicts
git rebase --continue # You may need to rerun this command multiple times
# Push your updates to remote main branch. You may need to force the push with “--force-with-lease”. This option stops the force push if someone decided to modify our main branch.
git push origin main --force-with-lease
General Debugging¶
Port is in use¶
Figure out what process is using the port, then kill it if appropriate.
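A minimal sketch (port 8080 is just an example):
# Find the PID listening on the port
sudo lsof -i :8080
# or
sudo ss -tlnp | grep ':8080'
# Kill it if it is safe to do so
sudo kill <PID>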
Setting CPU Caps for User¶
Summarizes how we clamped login-node memory usage with cgroup v2 slices.
Configure slice overrides¶
- SSH to the login node with `ssh login` and work inside `/etc/systemd/system/`.
- Ensure the generic and ubuntu-specific slice drop-ins exist:
    - Create `/etc/systemd/system/user.slice.d/override.conf` so every non-ubuntu session inherits a 33.51 GiB ceiling.
    - Create `/etc/systemd/system/user-1001.slice.d/override.conf` so the ubuntu admin user (UID 1001) keeps unlimited RAM allocation.
- Reload the overrides and logind to make the new caps active (see the sketch after this list).
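A minimal sketch of the drop-ins and reload steps, assuming the cap is enforced with `MemoryMax` on the slices; the byte value is a placeholder for the 33.51 GiB ceiling and should match what is actually deployed:
sudo mkdir -p /etc/systemd/system/user.slice.d /etc/systemd/system/user-1001.slice.d
# ~33.51 GiB expressed in bytes (placeholder; use the production value)
sudo tee /etc/systemd/system/user.slice.d/override.conf > /dev/null <<'EOF'
[Slice]
MemoryMax=35981088522
EOF
# The ubuntu admin user (UID 1001) keeps unlimited RAM
sudo tee /etc/systemd/system/user-1001.slice.d/override.conf > /dev/null <<'EOF'
[Slice]
MemoryMax=infinity
EOF
# Reload the overrides and logind so new sessions pick up the caps
sudo systemctl daemon-reload
sudo systemctl restart systemd-logind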
Stress test and monitor¶
1. Log in as (or `su -l` to) the user you want to validate. `sudo su` keeps you inside root's slice, so it will not exercise the per-user limits.
2. Save the static allocator below as `~/static-ram-test.py`, then execute it with the target number of gigabytes (for example, `python3 static-ram-test.py 40`). Each gigabyte is faulted in so the kernel charges real pages to the user slice:

    ```python
    import mmap, sys, os

    GB = int(sys.argv[1]) if len(sys.argv) > 1 else 50
    CHUNK = 1024 * 1024 * 1024  # 1GB
    mmaps = []
    print("Allocating", GB, "GB...")
    page = 4096  # 4KB
    touch = b'\xAA' * page

    for i in range(GB):
        m = mmap.mmap(-1, CHUNK, access=mmap.ACCESS_WRITE)
        for offset in range(0, CHUNK, page):
            m.seek(offset)
            m.write(touch)
        mmaps.append(m)
        print(f"{i+1} GB allocated")
    ```

3. From another shell on the login node, confirm the user's cgroup inherits the desired limit and watch live usage:
UID=10447 # replace with the target UID
sudo cat /sys/fs/cgroup/user.slice/user-${UID}.slice/memory.max # ensure the max memory matches what we set in systemd override
watch -n1 "cat /sys/fs/cgroup/user.slice/user-${UID}.slice/memory.current"
How to add a node¶
Weka: After the node has been provisioned and added to Slurm we need to manually add the node to Weka. Follow the steps detailed in Adding backends and Mounting the filesystem using fstab.
Monitoring:
In order to monitor the new node, add it to our list of targets in /etc/prometheus/prometheus.yml, which is used by Victoria Metrics.
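A hedged sketch of the edit; the scrape job layout and node-exporter port are assumptions about our prometheus.yml, and the service that re-reads the file may differ in our setup:
# Add the new node under the relevant scrape job's targets, e.g. (hypothetical layout):
#   static_configs:
#     - targets: ['compute-0123:9100']
sudo vi /etc/prometheus/prometheus.yml
# Then restart/reload whatever consumes the file (vmagent / VictoriaMetrics) so the new target is scraped.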
Egress Usage Tracking:
Run the /opt/oci-hpc/playbooks/sync_iptables_for_usage_tracking.yml playbook.
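A hedged example of running it from the controller (the inventory and any extra flags are assumptions; use whatever we normally use for the oci-hpc playbooks):
cd /opt/oci-hpc/playbooks
ansible-playbook sync_iptables_for_usage_tracking.yml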
How to remove a node¶
# Node name can be found with
ssh compute-####
hostname -a
# Replace inst-#### with the real node name
/opt/oci-hpc/bin/resize.sh remove --nodes inst-####
# Multiple nodes can be removed at the same time
/opt/oci-hpc/bin/resize.sh remove --nodes inst-####-0 inst-####-1
# If node is unreachable
/opt/oci-hpc/bin/resize.sh remove_unreachable --nodes inst-####
Monitoring:
Remove the node from the target list in our /etc/prometheus/prometheus.yml file, as we no longer need to monitor this node.
Slurm debugging¶
Adjusting job priority¶
See the Change a running job's priority section below.
Fixing slurmd configless errors on login/compute nodes¶
Symptoms: When some nodes were missing /etc/systemd/system/slurmd.service.d/type.conf and /etc/default/slurmd, slurmd never switched into configless mode. Users saw:
scontrol: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
scontrol: error: fetch_config: DNS SRV lookup failed
(Tracked in centerforaisafety/cerberus-cluster#358.)
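To spot affected nodes quickly, you can check for the two files directly (a sketch; the node name is a placeholder, and the hosts file is built the same way as in the remediation below):
# Single node
ssh compute-#### 'ls -l /etc/default/slurmd /etc/systemd/system/slurmd.service.d/type.conf'
# All nodes at once; missing files show up as "No such file or directory"
pdsh -w ^hosts 'ls /etc/default/slurmd /etc/systemd/system/slurmd.service.d/type.conf'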
Manual remediation (if Ansible cannot run)¶
sudo sinfo -S "%n" -o "%n" | tail -n +2 > hosts
pdcp -w ^hosts ~/temp/type.conf /tmp
pdcp -w ^hosts ~/temp/slurmd /tmp
pdsh -w ^hosts 'sudo mv /tmp/slurmd /etc/default'
pdsh -w ^hosts 'sudo mv /tmp/type.conf /etc/systemd/system/slurmd.service.d'
pdsh -w ^hosts 'sudo chown root:root /etc/systemd/system/slurmd.service.d/type.conf'
pdsh -w ^hosts 'sudo chown root:root /etc/default/slurmd'
pdsh -w ^hosts 'sudo chmod 644 /etc/systemd/system/slurmd.service.d/type.conf'
pdsh -w ^hosts 'sudo chmod 644 /etc/default/slurmd'
pdsh -w ^hosts 'sudo systemctl stop slurmd'
pdsh -w ^hosts 'sudo systemctl daemon-reload'
pdsh -w ^hosts 'sudo systemctl restart slurmd'
Where:
Permanent fix (preferred)¶
PR centerforaisafety/cerberus-cluster#379 updated playbooks/site.yml to call a new type-notify role before the CAIS roles. Running the site.yml playbook pushes the drop-in and /etc/default/slurmd to every compute + login host, restarts slurmd, and keeps the configless controller names in sync with inventory overrides. Use the manual commands only when automation is unavailable or risky.
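A hedged invocation sketch (the working directory and any --limit/--tags flags are assumptions; run it however we normally run site.yml):
cd /opt/oci-hpc/playbooks
ansible-playbook site.yml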
What to do if a node is in drain¶
Immediate steps¶
Tell Slurm that the node is usable.
# This checks if a job messed something up but the node is otherwise useable.
sudo scontrol update nodename=compute-## state=resume
# Wait a few (5) minutes and check with
sinfo
# if the node goes back into draining
If that fails then take more serious measures.
# Copy the logs from the node to wherever as we might be removing the node soon.
sudo scp compute-###:/var/log/slurm/slurmd.log .
# Reboot the node
ssh compute-###
sudo reboot
# Wait for the node to reboot then run
sudo scontrol update nodename=compute-## state=resume
During weekdays¶
If it's still failing, create a support ticket and inform Oracle.
During weekends or holiday periods¶
If it's still failing, remove the node and inform Oracle. See the section about adding and removing nodes.
Investigation¶
These steps can be taken with more time to see what possibly went wrong software wise.
Get the job ID that caused the issue. Note that you may need to run the following because Slurm converts array jobs into new job IDs but only records them under the array version, not the new job ID.
# grep for the job id to possibly get the real job id
sudo less /var/log/slurm/slurmctld.log | grep #####
Continue the investigation, detective.
# Note this command usually sucks because it's only effective
# while the job is running or 5 min after it failed by default.
# We changed it to be 6 hours of stored logs.
# Inspect the job that failed
sudo scontrol show job -dd #####
ssh compute-###
# This is the same file we copied out of the compute node earlier in case we already removed the node.
sudo less /var/log/slurm/slurmd.log
# Other possibly useful places to explore for error messages:
/var/log/messages
journalctl
Set up Quality of Service (QOS)¶
This gives us more fine-grained control, for example allowing some users longer access.
Add in a new Quality of Service (QOS):
sudo sacctmgr add qos default_qos
sudo sacctmgr modify qos where name=default_qos set MaxWall=2-0
sudo sacctmgr modify qos where name=default_qos set GraceTime=300
sacctmgr modify qos gputest set flags=DenyOnLimit
sudo sacctmgr list qos
sudo sacctmgr show assoc format=cluster,user,qos
sudo sacctmgr delete qos normal
#Code to set the QOS for all users.
#Necessary because just using awk or grep on `show users` shortens the names
#!/bin/bash
# List all users
users=$(sudo sacctmgr show users format=User%25 -n|awk '{print $1}')
# Iterate through each user and modify the QOS
for user in $users; do
sudo sacctmgr modify -i user $user set qos=default_qos
done
Finally, add the secondary QOS whereby some users can run jobs for longer than 2 days.
sudo sacctmgr add qos restricted_qos
sudo sacctmgr modify qos where name=restricted_qos set MaxWall=5-0
sudo sacctmgr modify qos where name=restricted_qos set GraceTime=300
sudo sacctmgr modify user where name=steven_test set qos=default_qos,restricted_qos
# Make sure the top level is aware of and has both qos
sudo sacctmgr modify cluster name=cluster set DefaultQOS=default_qos
sudo sacctmgr modify cluster name=cluster set QOS=default_qos,restricted_qos
# Nice to have where if a user submits a bad job it'll just kill it immediately.
sacctmgr modify qos default_qos set flags=DenyOnLimit
sacctmgr modify qos restricted_qos set flags=DenyOnLimit
In slurm.conf you need to add `qos` to AccountingStorageEnforce.
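For example (a sketch; merge with whatever flags are already set in our slurm.conf rather than replacing them):
# In /etc/slurm/slurm.conf (example value; keep any existing flags)
AccountingStorageEnforce=associations,limits,qos
# Push the change to the controller
sudo scontrol reconfigure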
Adding or Removing a partition¶
Adding a partition¶
Changing the configurations for the nodes or gres¶
Example:
{% elif shape == "BM.GPU.A100-v2.8" %}
{% set gres = "Gres=gpu:A100:8 CoresPerSocket=255 ThreadsPerCore=1 CpuSpecList=2-25,192-207 RealMemory=1800000 MemSpecLimit=263976" %}
Andriy's suggested workflow¶
Rerun the site.yml playbook with the following changes
Add
{% for partition in queues %}
{% for instance in partition.instance_types %}
NodeSet=cais Feature=cais # NodeSet for CAIS nodes
NodeSet=schmidt_sciences Feature=schmidt_sciences # NodeSet for Schmidt Sciences nodes
NodeSet=tamper_resistance Feature=tamper_resistance # NodeSet for Tamper Resistance nodes
{% endfor %}
{% endfor %}
# Schmidt Sciences partition
PartitionName=schmidt_sciences_cpu Nodes=schmidt_sciences DefaultTime=02:00:00 MaxTime=2-0 State=UP PriorityTier=2 PreemptMode=OFF AllowAccounts=schmidt_sciences QOS=cpuonly
PartitionName=schmidt_sciences Nodes=schmidt_sciences DefaultTime=02:00:00 MaxTime=1-0 State=UP PriorityTier=1 PreemptMode=OFF AllowAccounts=schmidt_sciences
to /opt/oci-hpc/playbooks/roles/slurm/templates/slurm.conf.j2
Bug: when re-running the playbook, it will comment out the cronjobs.
Arnaud's suggested workflow¶
1. Edit slurm.conf. Say we need to add a new Slurm partition named "genomics". On the controller node, open /etc/slurm/slurm.conf and add a line toward the end of the file defining the NodeSet.
In the same file, add the next line defining the new PartitionName (a hedged sketch of both lines follows below).
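The two lines might look like this (a sketch; the feature name, default time, and any access restrictions are placeholders, not necessarily what was actually deployed):
NodeSet=genomics Feature=genomics
PartitionName=genomics Nodes=genomics Default=NO DefaultTime=02:00:00 State=UP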
2. Move an existing node into the new partition. First look at the current partition details with sinfo. Assuming you have an existing node gpu-176, run the following commands to move this node into the genomics partition, followed by reconfiguring Slurm to apply the changes:
sudo scontrol update NodeName=gpu-176 ActiveFeatures=
sudo scontrol update NodeName=gpu-176 AvailableFeatures=genomics
sudo scontrol reconfigure
To view new partition, issue the following command:
sinfo
The newly created Slurm partition will now appear in the cluster.
3. Update slurm.conf.j2
Finally, in order to survive Slurm reconfiguration, add the following line to /opt/oci-hpc/playbooks/roles/slurm/templates/slurm.conf.j2
How to tag nodes? How to put a node into a Node set?¶
Do it manually:
sudo scontrol update NodeName=gpu-176 ActiveFeatures=
sudo scontrol update NodeName=gpu-176 AvailableFeatures=genomics
sudo scontrol reconfigure
Remove a partition¶
Remove the partition's PartitionName and NodeSet lines.
Adding and removing nodes¶
Adding a node to the cluster¶
- Provision the hardware / node (use resize command)
- Add it to Weka
- Add node to nodeset
- Install docker
- Add monitoring
See the eng docs for adding a node.
TODO: Update docs a bit
Now make sure /home/ubuntu/weka/manager refers to a real backend (run `weka local ps` on it to double-check). That file points to an IP address; make sure the commands work.
Managing cluster partitions (old way) (Double check if it's safe to delete the instructions below)¶
Give the partition a name. The compute partition should be the only Default partition. Finally, edit the node lists to say which nodes go into which partitions.
PartitionName=compute Nodes=compute-hpc-node-[1-914],compute-[1-914] Default=YES DefaultTime=02:00:00 State=UP DenyAccounts=grads,seri_mat
# Note that the partitions can overlap in terms of which nodes get assigned to both.
# This allows for some groups like CAIS to have access to more resources. Note that cais also has 915-918.
PartitionName=cais Nodes=compute-hpc-node-[1-918],compute-[1-918] Default=NO DefaultTime=02:00:00 MaxTime=6-0 State=UP AllowGroups=cais
Here's an example of just giving a user a random node:
# This gives them node 605 but we then need to remove 605 from all other partitions
PartitionName=imagenet Nodes=compute-hpc-node-[605-606],compute-[605-606] Default=NO DefaultTime=02:00:00 State=UP AllowGroups=lab_name_here
# Remove that one node from the other partitions
PartitionName=compute Nodes=compute-hpc-node-[1-604],compute-[1-604],compute-hpc-node-[606-906],compute-[606-906] Default=YES DefaultTime=02:00:00 State=UP DenyAccounts=grads,seri_mats
Finally make your changes take effect.
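Typically that means reconfiguring Slurm, as in the Updating slurm.conf section below:
sudo scontrol reconfigure
# or, for emergencies: sudo systemctl restart slurmctld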
Change a running job's priority¶
Use this when a user complains that their job has not run in forever.
# sudo scontrol update job=<job-id> Priority=<any-integer>
# Example with multiple jobs
sudo scontrol update job=835006,835007,835009,835010 Priority=100000
Updating slurm.conf¶
Update the slurm.conf file.
Then reconfigure the Slurm controller (a full restart of slurmctld is only needed in emergencies).
sudo scontrol reconfigure
# The first command should be enough and second is more for emergencies.
# sudo systemctl restart slurmctld
Handy dandy slurm commands¶
Useful commands to help in debugging.
sinfo # show the partitions and the current node states.
scontrol show partitions # detailed information about the partitions.
sacctmgr show associations # Displays information about users and which accounts they belong to.
sshare -al # See fair share values.
scontrol show jobid -dd <jobid> # Show detailed job information.
scontrol update job=XXXXX TimeLimit=+03:00:00 # Add time to an existing job.
SLURM Notifications¶
We use goslmailer as our Slurm mail program. It is a fork of the original repository located at https://github.com/CLIP-HPC/goslmailer. The tool is written in Go and offers extendable connectors for integrating additional messaging platforms as needed. We currently have goslmailer configured to send Email and Slack notifications. It currently supports spooling for Slack messages (not email), but the source code can be modified to address this.
Each of the configs has annotated examples (in TOML form) at the following links: goslmailer and gobler.
Installation and Uninstall¶
Follow the instructions in the ansible README.
Note that by default slurm.conf has MailProg set; although this doesn't cause any fatal errors, it does log errors in slurmctld if a user tries to use mailing. This errors even if MailProg is not set, as it has a default value that cannot be unset.
Enabling logging¶
Logging is disabled by default because it is incredibly verbose and not easy to reduce (you have to edit the source code and manually remove the logging); instead, output goes to stderr. If you want to enable logging, edit the /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf files and add "logfile": "/data/spool/goslmailer.log", as the first JSON key-value pair. If you want this to be permanent, you will have to publish these changes as a new release on the goslmailer GitHub. Note that you will have to keep the structure of the zip the same (download the latest release, edit the files, and rezip them).
Enabling Spooling¶
Enabling in production¶
Spooling is disabled by default. It can be enabled by editing the /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf files and changing the "renderToFile" option to "spool". Then start the gobler service; a hedged example is shown below:
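# Assumption: the systemd unit is named "gobler" (matching the service referenced in goslmailer_install.yml); adjust if our unit name differs.
sudo systemctl start gobler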
Currently the timings are set to 20 seconds for spool and 1 second for picker. Think of it as loading messages every 20 seconds and then firing one of those messages once per second. This can be edited in the /etc/slurm/gobler.conf. See the toml for an explanation of the options.
Changing ansible to enable spooling¶
You can edit the ansible playbook to start the service by default by navigating to the disable gobler service task in goslmailer_install.yml. Change the enabled option to true and add the following task right after it:
To make the config changes to /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf you have two options: You can make a new release for the Goslmailer repo and upload the changed configs, or you can add the following lines to the Edit config files task of the ansible playbook goslmailer_install.yml:
- { path: '/etc/slurm/goslmailer.conf', regex: '"renderToFile": "no",', new: '"renderToFile": "spool"'}
- { path: '/etc/slurm/gobler.conf', regex: '"renderToFile": "no",', new: '"renderToFile": "spool"'}
Managing Admin Notifications¶
Admin notifications are all listed under /etc/slurm/triggers, with the script trigger.sh doing the majority of the work and each sub-script specifying the message and the events that trigger it. By default every script sends an @channel notification, but this can be adjusted by editing the relevant script, or a script can be turned off using the command: strigger --clear --id {id of strigger to disable}. You can view all striggers using this command: strigger --get. Specifics on what triggers each script can be found at this link
If you want to change which striggers are enabled by default that can be done in the notifications_install.yml playbook. Simply delete the lines that correspond to the striggers you don't want.