
General

Rebasing our codebase onto OCI-HPC Head

How to do it step-by-step

Create a backup

Make a new release for the Cerberus repo on GitHub with today's date, in case we need to revert. Download the zip of the Cerberus repo and attach it to the release.

Rebase the repo

Look through the changes manually. If they're harmless, bring them over by default; this covers most of the Terraform, which does not affect us because we do not redeploy our cluster. If they touch common code, review the changes and decide which version to keep and why.

# Add the remote (original repo that we forked) and call it "upstream"
git remote add upstream https://github.com/oracle-quickstart/oci-hpc.git
# Fetch all branches of remote upstream
git fetch upstream
# Rewrite your current branch with upstream's master using git rebase
git rebase upstream/master
# Resolve all merge conflicts
git rebase --continue # You may need to rerun this command multiple times
# Push your updates to the remote main branch. You may need to force the push with "--force-with-lease".
# This option stops the force push if someone has modified our main branch in the meantime.
git push origin main --force-with-lease
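
For reference, the conflict-resolution loop during the rebase typically looks like this (a generic git workflow, not specific to our repo):

git status                    # see which files conflict at the current rebase step
# Edit the conflicted files, then mark them resolved
git add <resolved-file>
git rebase --continue         # repeat for each commit that conflicts
git rebase --abort            # bail out and return to the pre-rebase state if needed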

General Debugging

Port is in use

Figure out what process is using the port, then kill it if appropriate.

sudo netstat -tunp |grep PORT

lsof -i :PORT

fuser PORT/tcp
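
Once you have the PID, kill the process if you are sure it is safe to stop (a generic example; PID and PORT are placeholders):

sudo kill PID           # polite SIGTERM first
sudo kill -9 PID        # if it ignores SIGTERM
sudo fuser -k PORT/tcp  # or kill whatever owns the port directly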

Setting Memory Caps for Users

This section summarizes how we cap per-user memory usage on the login node with cgroup v2 slices.

Configure slice overrides

  1. SSH to the login node with ssh login and work inside /etc/systemd/system/.
  2. Ensure the generic and ubuntu-specific slice drop-ins exist:
    sudo mkdir -p /etc/systemd/system/user.slice.d
    sudo mkdir -p /etc/systemd/system/user-1001.slice.d
    
  3. Create /etc/systemd/system/user.slice.d/override.conf so every non-ubuntu session inherits a 33.51 GiB ceiling:
    [Slice]
    MemoryMax=33.51G
    
  4. Create /etc/systemd/system/user-1001.slice.d/override.conf so the ubuntu admin user (UID 1001) keeps unlimited RAM allocation:
    [Slice]
    MemoryMax=infinity
    
  5. Reload the overrides and logind to make the new caps active:
    sudo systemctl daemon-reload
    sudo systemctl restart systemd-logind
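
To confirm systemd actually picked up the overrides, you can query the slice properties directly (a quick sanity check; UID 1001 is the ubuntu user as above):

systemctl show user.slice -p MemoryMax
# The per-user slice only exists while that user has an active session
systemctl show user-1001.slice -p MemoryMax   # should report MemoryMax=infinity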
    

Stress test and monitor

  1. Log in as (or su -l to) the user you want to validate. sudo su keeps you inside root’s slice, so it will not exercise the per-user limits.
  2. Save the static allocator below as ~/static-ram-test.py, then execute it with the target number of gigabytes (for example, python3 static-ram-test.py 40). Each gigabyte is faulted in so the kernel charges real pages to the user slice:
    import mmap, sys

    GB = int(sys.argv[1]) if len(sys.argv) > 1 else 50
    CHUNK = 1024 * 1024 * 1024  # 1GB

    mmaps = []
    print("Allocating", GB, "GB...")

    page = 4096  # 4KB
    touch = b'\xAA' * page

    for i in range(GB):
        m = mmap.mmap(-1, CHUNK, access=mmap.ACCESS_WRITE)
        for offset in range(0, CHUNK, page):
            m.seek(offset)
            m.write(touch)
        mmaps.append(m)
        print(f"{i+1} GB allocated")

  3. From another shell on the login node, confirm the user's cgroup inherits the desired limit and watch live usage:
    TARGET_UID=10447  # replace with the target UID
    # Ensure the max memory matches what we set in the systemd override
    sudo cat /sys/fs/cgroup/user.slice/user-${TARGET_UID}.slice/memory.max
    watch -n1 "cat /sys/fs/cgroup/user.slice/user-${TARGET_UID}.slice/memory.current"

  4. If the user is still stuck by stale processes and Ctrl+C doesn't work, list and hard-kill any runaway Python stress tests:
    pgrep -u <UID> -fl python
    sudo kill -9 <pid ...>
    sudo loginctl terminate-user <UID>
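
If the allocator gets killed before reaching the requested size, that is usually the cgroup limit doing its job; you can confirm from the kernel log (a quick check, not part of the original procedure):

sudo dmesg -T | grep -i "memory cgroup out of memory" | tail
sudo journalctl -k | grep -iE "oom-kill|memory cgroup" | tail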

How to add a node

/opt/oci-hpc/bin/resize.sh add 1
Weka: After the node has been provisioned and added to Slurm, we need to manually add the node to Weka. Follow the steps detailed in the Adding backends and Mounting the filesystem using fstab sections of the Weka documentation.

Monitoring: To monitor the new node, add it to our list of targets in /etc/prometheous/prometheous.yml, which is used by VictoriaMetrics.

Egress Usage Tracking: Run the /opt/oci-hpc/playbooks/sync_iptables_for_usage_tracking.yml playbook.
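
For example (a sketch, assuming it is run from the controller with the default inventory like the other playbooks):

ansible-playbook /opt/oci-hpc/playbooks/sync_iptables_for_usage_tracking.yml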

How to remove a node

# Node name can be found with
ssh compute-####
hostname -a

# Replace inst-#### with the real node name
/opt/oci-hpc/bin/resize.sh remove --nodes inst-####

# Multiple nodes can be removed at the same time
/opt/oci-hpc/bin/resize.sh remove --nodes inst-####-0 inst-####-1

# If node is unreachable
/opt/oci-hpc/bin/resize.sh remove_unreachable --nodes inst-####
Weka: After a node is destroyed and removed from Slurm we need to manually remove the node from Weka. Follow the steps detailed in the Removing a backend section of the Weka documentation.

Monitoring: Remove the node from the target list in our /etc/prometheous/prometheous.yml file, as we no longer need to monitor this node.

Slurm debugging

Adjusting job priority

# set the number at the end to be a big number.
sudo scontrol update job=<job_id> Priority=100000

Fixing slurmd configless errors on login/compute nodes

Symptoms: When some nodes were missing /etc/systemd/system/slurmd.service.d/type.conf and /etc/default/slurmd, slurmd never switched into configless mode. Users saw:

scontrol: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
scontrol: error: fetch_config: DNS SRV lookup failed

(Tracked in centerforaisafety/cerberus-cluster#358.)

Manual remediation (if Ansible cannot run)

sudo sinfo -S "%n" -o "%n" | tail -n +2 > hosts
pdcp -w ^hosts ~/temp/type.conf /tmp
pdcp -w ^hosts ~/temp/slurmd /tmp
pdsh -w ^hosts 'sudo mv /tmp/slurmd /etc/default'
pdsh -w ^hosts 'sudo mv /tmp/type.conf /etc/systemd/system/slurmd.service.d'
pdsh -w ^hosts 'sudo chown root:root /etc/systemd/system/slurmd.service.d/type.conf'
pdsh -w ^hosts 'sudo chown root:root /etc/default/slurmd'
pdsh -w ^hosts 'sudo chmod 644 /etc/systemd/system/slurmd.service.d/type.conf'
pdsh -w ^hosts 'sudo chmod 644 /etc/default/slurmd'
pdsh -w ^hosts 'sudo systemctl stop slurmd'
pdsh -w ^hosts 'sudo systemctl daemon-reload'
pdsh -w ^hosts 'sudo systemctl restart slurmd'

Where:

# /tmp/type.conf contents
[Service]
Type=notify
# /tmp/slurmd contents
SLURMD_OPTIONS="--conf-server=cais-controller,cais-backup"

Permanent fix (preferred)

PR centerforaisafety/cerberus-cluster#379 updated playbooks/site.yml to call a new type-notify role before the CAIS roles. Running:

ansible-playbook playbooks/site.yml --tags type-notify

pushes the drop-in and /etc/default/slurmd to every compute + login host, restarts slurmd, and keeps the configless controller names in sync with inventory overrides. Use the manual commands only when automation is unavailable or risky.
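
To spot-check that the fix landed (a hedged verification, using the same pdsh host file as the manual steps above):

pdsh -w ^hosts 'cat /etc/default/slurmd'        # should contain the --conf-server options
pdsh -w ^hosts 'systemctl is-active slurmd'     # should report active everywhere
pdsh -w ^hosts 'ps -o args= -C slurmd'          # confirm the configless flags on the running daemon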

What to do if a node is in drain

Immediate steps

Tell Slurm that the node is usable.

# This checks whether a job messed something up but the node is otherwise usable.
sudo scontrol update nodename=compute-## state=resume

# Wait about 5 minutes, then check whether the node goes back into drain with
sinfo

If that fails then take more serious measures.

# Copy the logs from the node to wherever as we might be removing the node soon.
sudo scp compute-###:/var/log/slurm/slurmd.log .

# Reboot the node
ssh compute-###
sudo reboot

# Wait for the node to reboot then run
sudo scontrol update nodename=compute-## state=resume

During weekdays

If it's still failing, create a support ticket and inform Oracle.

During weekends or holiday periods

If it's still failing, remove the node and inform Oracle. See the section about adding and removing nodes.

Investigation

These steps can be taken when there is more time, to see what may have gone wrong software-wise.

# View the slurm logs
sudo less /var/log/slurm/slurmctld.log | grep compute-###

Get the job ID that caused the issue. Note that you may need to run the following, because Slurm converts array jobs into new job IDs but doesn't record them under the new job ID, only the array version.

# grep for the job id to possibly get the real job id
sudo less /var/log/slurm/slurmctld.log | grep #####

Continue on with the investigation, detective.

# Note: this command usually isn't very useful because by default it's only effective
# while the job is running or for 5 minutes after it failed.
# We changed this so the information is kept for 6 hours.

# Inspect the job that failed
sudo scontrol show job -dd #####

ssh compute-###
# This is the same file we copied out of the compute node earlier in case we already removed the node.
sudo less /var/log/slurm/slurmd.log

# Other possibly useful places to explore for error messages:
/var/log/messages

journalctl

Set up Quality of Service (QOS)

This gives us more fine-grained control, for example allowing some users longer maximum run times.

Add a new Quality of Service (QOS):

sudo sacctmgr add qos default_qos
sudo sacctmgr modify qos where  name=default_qos set MaxWall=2-0
sudo sacctmgr modify qos where  name=default_qos set GraceTime=300
sacctmgr modify qos gputest set flags=DenyOnLimit

sudo sacctmgr list qos

sudo sacctmgr show assoc format=cluster,user,qos

sudo sacctmgr delete qos normal

#!/bin/bash
# Set the default QOS for all users.
# Necessary because plain awk or grep on `sacctmgr show users` truncates the usernames
# (hence format=User%25 below).

# List all users
users=$(sudo sacctmgr show users format=User%25 -n | awk '{print $1}')

# Iterate through each user and modify the QOS
for user in $users; do
    sudo sacctmgr modify -i user $user set qos=default_qos
done

Finally, add the secondary QOS that allows some users to run jobs for longer than 2 days.

sudo sacctmgr add qos restricted_qos
sudo sacctmgr modify qos where name=restricted_qos set MaxWall=5-0
sudo sacctmgr modify qos where  name=restricted_qos set GraceTime=300
sudo sacctmgr modify user where name=steven_test set qos=default_qos,restricted_qos

# Make sure the top level is aware of and has both qos
sudo sacctmgr modify cluster name=cluster set DefaultQOS=default_qos
sudo sacctmgr modify cluster name=cluster set QOS=default_qos,restricted_qos

# Nice to have where if a user submits a bad job it'll just kill it immediately.
sacctmgr modify qos default_qos set flags=DenyOnLimit
sacctmgr modify qos restricted_qos set flags=DenyOnLimit

In slurm.conf, `qos` needs to be added to AccountingStorageEnforce for these QOS limits to be enforced.
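
For example, the relevant slurm.conf line would look something like this (the exact set of flags depends on what is already being enforced; keep any existing values and just append qos):

AccountingStorageEnforce=associations,limits,qos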

Adding or Removing a partition

Adding a partition

Changing the configurations for the nodes or gres

vim /opt/oci-hpc/playbooks/roles/slurm/templates/systemd/slurmd.service.d/unit.conf.j2

Example:

{% elif shape == "BM.GPU.A100-v2.8" %}
{% set gres = "Gres=gpu:A100:8 CoresPerSocket=255 ThreadsPerCore=1 CpuSpecList=2-25,192-207 RealMemory=1800000 MemSpecLimit=263976" %}

Andriy's suggested workflow

Rerun the site.yml playbook with the following changes

Add

{% for partition in queues %}
{% for instance in partition.instance_types %}
Nodeset=cais Feature=cais # NodeSet for CAIS nodes
NodeSet=schmidt_sciences Feature=schmidt_sciences # NodeSet for Schmidt Sciences nodes
NodeSet=tamper_resistance Feature=tamper_resistance # NodeSet for Tamper Resistance nodes
{% endfor %}
{% endfor %}


# Schmidt Sciences partition
PartitionName=schmidt_sciences_cpu Nodes=schmidt_sciences DefaultTime=02:00:00 MaxTime=2-0 State=UP PriorityTier=2 PreemptMode=OFF AllowAccounts=schmidt_sciences QOS=cpuonly
PartitionName=schmidt_sciences Nodes=schmidt_sciences DefaultTime=02:00:00 MaxTime=1-0 State=UP PriorityTier=1 PreemptMode=OFF AllowAccounts=schmidt_sciences

to /opt/oci-hpc/playbooks/roles/slurm/templates/slurm.conf.j2

Known bug: when re-running the playbook, it will comment out the cronjobs.
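
After re-running the playbook it is worth confirming the cronjobs are still intact, e.g. (which crontab they live in is an assumption, adjust as appropriate):

sudo crontab -l   # look for entries that were newly commented out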

Arnaud's suggested workflow

1. Edit slurm.conf. Let's say we need to add a new Slurm partition named "genomics". On the controller node, open /etc/slurm/slurm.conf and add the following line, defining the Nodeset, toward the end of the file.

Nodeset=genomics Feature=genomics

In the same file, add the following line defining the new PartitionName:

PartitionName=genomics Nodes=genomics Default=NO

2. Move an existing node into the new partition. First, look at the current partition details with sinfo. Assuming you have an existing node gpu-176, run the following commands to move it into the genomics partition, then reconfigure Slurm to apply the changes.

sudo scontrol update NodeName=gpu-176 ActiveFeatures=
sudo scontrol update NodeName=gpu-176 AvailableFeatures=genomics
sudo scontrol reconfigure

To view the new partition, run sinfo again; the newly created Slurm partition will now appear in the cluster.

3. Update slurm.conf.j2. Finally, in order for the change to survive Slurm reconfiguration, add the following line to /opt/oci-hpc/playbooks/roles/slurm/templates/slurm.conf.j2

PartitionName=genomics Nodes=gpu-176 Default=NO

How to tag nodes? How to put a node into a Node set?

Do it manually:

sudo scontrol update NodeName=gpu-176 ActiveFeatures=
sudo scontrol update NodeName=gpu-176 AvailableFeatures=genomics
sudo scontrol reconfigure

Remove a partition

Remove the partition's PartitionName and NodeSet lines, then reconfigure Slurm.
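
A sketch of the manual steps (the partition name is a placeholder):

# Remove (or comment out) the PartitionName=<partition> and NodeSet=<partition> lines
sudo vim /etc/slurm/slurm.conf
# Remove them from the template too so the change survives a playbook re-run
vim /opt/oci-hpc/playbooks/roles/slurm/templates/slurm.conf.j2
# Apply the change
sudo scontrol reconfigure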

Adding and removing nodes

Adding a node to the cluster

  1. Provision the hardware / node (use resize command)
  2. Add it to Weka
  3. Add node to nodeset
  4. Install docker
  5. Add monitoring

Look on the eng docs for adding a node.

TODO: Update docs a bit

Now make sure /home/ubuntu/weka/manager points at a real backend (run weka local ps on that host to double check). The file contains an IP address; make sure the commands work against it.
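
A quick way to check (a sketch, assuming the file just holds the manager's IP as described):

cat /home/ubuntu/weka/manager     # should print the backend's IP address
ssh <manager-ip> 'weka local ps'  # confirm Weka containers are running on that host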

Managing cluster partitions (old way) (Double check if it's safe to delete the instructions below)

sudo vim /etc/slurm/slurm.conf

Give a partition a name. The compute partition should be the only Default partition. Finally edit the node lists to say which nodes should go into what partitions.

PartitionName=compute Nodes=compute-hpc-node-[1-914],compute-[1-914] Default=YES DefaultTime=02:00:00 State=UP DenyAccounts=grads,seri_mat

# Note that the partitions can overlap in terms of which nodes get assigned to both.  
# This allows for some groups like CAIS to have access to more resources. Note that cais also has 915-918.
PartitionName=cais Nodes=compute-hpc-node-[1-918],compute-[1-918] Default=NO DefaultTime=02:00:00 MaxTime=6-0 State=UP AllowGroups=cais

Here's an example of giving a group its own dedicated node:

# This gives them node 605 but we then need to remove 605 from all other partitions
PartitionName=imagenet Nodes=compute-hpc-node-[605-606],compute-[605-606] Default=NO DefaultTime=02:00:00 State=UP AllowGroups=lab_name_here

# Remove that one node from the other partitions
PartitionName=compute Nodes=compute-hpc-node-[1-604],compute-[1-604],compute-hpc-node-[606-906],compute-[606-906] Default=YES DefaultTime=02:00:00 State=UP DenyAccounts=grads,seri_mats

Finally make your changes take effect.

sudo scontrol reconfigure

Change a running job's priority

Use this when a user complains that their job has not run in forever.

# sudo scontrol update job=<job-id> Priority=<any-integer>

# Example with multiple jobs
sudo scontrol update job=835006,835007,835009,835010 Priority=100000

Updating slurm.conf

Update the slurm.conf file

sudo vim /etc/slurm/slurm.conf

Tell the Slurm controller to re-read the configuration.

sudo scontrol reconfigure
# The first command should be enough; the second is more for emergencies.
# sudo systemctl restart slurmctld

Handy dandy slurm commands

Useful commands to help in debugging.

sinfo                                            # show the partitions and the current node states.
scontrol show partitions                         # detailed information about the partitions.
sacctmgr show associations                       # Displays information about users and which accounts they belong to.
sshare -al                                       # See fair share values.
scontrol show jobid -dd <jobid>                  # Show detailed job information.
scontrol update job=XXXXX TimeLimit=+03:00:00    # Add time to an existing job.

SLURM Notifications

We use goslmailer as our Slurm mail program. It is a fork of the original repository at https://github.com/CLIP-HPC/goslmailer. The tool is written in Go and offers extendable connectors for integrating additional messaging platforms as needed. We currently have goslmailer configured to send Email and Slack notifications. It currently supports spooling for Slack messages only (not email), though the source code can be modified to address this.

Each of the configs has an annotated example (in TOML form) at the following links: goslmailer and gobler.

Installation and Uninstall

Follow the instructions in the ansible README.

Note that slurm.conf has MailProg set by default; although this doesn't cause any fatal errors, slurmctld does throw errors if a user tries to use mailing. This happens even if MailProg is not set explicitly, since it has a default value that cannot be unset.

Enabling logging

Logging is disabled by default because it is incredibly verbose and not easy to reduce (you would have to edit the source code and manually remove the logging), so everything is dumped to stderr instead. If you want to enable logging, edit the /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf files and add "logfile": "/data/spool/goslmailer.log", as the first JSON key-value pair. If you want this to be permanent you will have to publish the change as a new release on the goslmailer GitHub. Note that you would have to ensure the structure of the zip stays the same (download the latest release, edit the files, and rezip them).

Enabling Spooling

Enabling in production

Spooling is disabled by default. It can be enabled by editing the /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf files and changing the "renderToFile" option to "spool". Then start the gobler service with the following commands:

sudo systemctl enable gobler
sudo systemctl start gobler

Currently the timings are set to 20 seconds for the spool and 1 second for the picker. Think of it as loading messages every 20 seconds and then firing one of those messages once per second. This can be edited in /etc/slurm/gobler.conf; see the TOML example for an explanation of the options.

Changing ansible to enable spooling

You can edit the ansible playbook to start the service by default by navigating to the disable gobler service task in goslmailer_install.yml. Change the enabled option to true and add the following task right after it:

- name: Start gobler service
  systemd:
    name: gobler
    state: started

To make the config changes to /etc/slurm/goslmailer.conf and /etc/slurm/gobler.conf you have two options: You can make a new release for the Goslmailer repo and upload the changed configs, or you can add the following lines to the Edit config files task of the ansible playbook goslmailer_install.yml:

- { path: '/etc/slurm/goslmailer.conf', regex: '"renderToFile": "no",', new: '"renderToFile": "spool"'}
- { path: '/etc/slurm/gobler.conf', regex: '"renderToFile": "no",', new: '"renderToFile": "spool"'}

Managing Admin Notifications

Admin notifications are all listed under /etc/slurm/triggers, with the script trigger.sh doing the majority of the work and each sub-script specifying the message and the events that trigger it. By default every script sends an @channel notification, but this can be adjusted by editing the relevant script, and a trigger can be turned off with strigger --clear --id {id of strigger to disable}. You can view all striggers with strigger --get. Specifics on what triggers each script can be found at this link.
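
For reference, the strigger commands mentioned above:

strigger --get               # list all registered triggers
strigger --clear --id <id>   # disable a specific trigger by ID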

If you want to change which striggers are enabled by default, that can be done in the notifications_install.yml playbook. Simply delete the lines that correspond to the striggers you don't want.