Backup and Restore procedure of SLURM

Several things need to be backed up and restored to preserve business continuity (BC):

  • Slurm config files
  • sacctmgr data
  • MySQL database(s)

Prereqs

  • A compartment with instance-principal access to the slurm-d3 bucket has been added, so that backup files can be copied and downloaded between VCNs.

Boot disk imaging

A catch-all is to clone the boot disks of the bastion, login and backup nodes.

  • Snapshot the bastion, login and backup boot disks and restore them to the dev env.
  • Snapshot the FSS exports and restore them to the dev env in the same way.
  • Validate the restored env, e.g. make sure this command works both prior to imaging and on the restored image in the dev env.

Here is an example of creating a boot volume backup - reference.

Using the Console

Open the navigation menu and click Storage. Under Block Storage, click Block Volumes. In the Block Storage menu on the sidebar, click Boot Volumes.
Click the boot volume that you want to create a backup for.
Click Create Manual Backup.

Enter a name for the backup. Avoid entering confidential information.

Select the backup type, either incremental or full. See Boot Volume Backup Types for information about backup types.
If you have permissions to create a resource, then you also have permissions to apply free-form tags to that resource. To apply a defined tag, you must have permissions to use the tag namespace. For more information about tagging, see Resource Tags. 
If you are not sure whether to apply tags, skip this option (you can apply tags later) or ask your administrator.

Click Create Backup.

The backup is completed when its icon no longer lists it as CREATING in the Boot Volume Backup list.
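
For scripted runs, the same backup can be requested from the OCI CLI instead of the Console. The sketch below is only an illustration: it assumes instance-principal auth as used elsewhere on this page, BOOT_VOLUME_OCID is a placeholder you must supply, and the display-name pattern is a hypothetical naming convention. It prints the command by default; set DRY_RUN=0 to execute it.

```bash
#!/usr/bin/env bash
# Sketch: request a manual boot volume backup from the OCI CLI.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = execute it
# Placeholder OCID (assumption): replace with the real boot volume OCID.
BOOT_VOLUME_OCID="${BOOT_VOLUME_OCID:-ocid1.bootvolume.oc1..example}"
BACKUP_TYPE="${BACKUP_TYPE:-INCREMENTAL}"   # INCREMENTAL or FULL

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

backup_boot_volume() {
  # $1 = label for the backup name, e.g. bastion, login or backup
  run oci bv boot-volume-backup create \
    --boot-volume-id "$BOOT_VOLUME_OCID" \
    --display-name "$1-boot-$(date +%Y-%m-%d)" \
    --type "$BACKUP_TYPE" \
    --auth instance_principal
}

backup_boot_volume bastion
```

Calling `backup_boot_volume` once per node (bastion, login, backup) with the matching OCID covers the three boot disks mentioned above.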

Slurm files and configs

SlurmConfigs

cd /etc/slurm
echo y | oci os object put --bucket-name backups --file slurm.conf --name slurm-$(date +%Y-%m-%d).conf --auth instance_principal
echo y | oci os object put --bucket-name backups --file slurmdbd.conf --name slurmdbd-$(date +%Y-%m-%d).conf --auth instance_principal
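
The two uploads above can be folded into a loop so that the date stamp and object naming stay consistent. This is a sketch, assuming the `backups` bucket from the commands above; `dated_name` and `run` are hypothetical helpers, and `--force` is used in place of `echo y |` to skip the overwrite prompt. Commands are printed by default; set DRY_RUN=0 to execute them.

```bash
#!/usr/bin/env bash
# Sketch: upload both Slurm config files under dated object names.
set -eu

DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands, 0 = execute
BUCKET="${BUCKET:-backups}"
CONF_DIR="${CONF_DIR:-/etc/slurm}"

# dated_name slurm.conf 2023-07-12  ->  slurm-2023-07-12.conf
dated_name() {
  printf '%s-%s.%s\n' "${1%.*}" "$2" "${1##*.}"
}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

stamp="$(date +%Y-%m-%d)"
for conf in slurm.conf slurmdbd.conf; do
  run oci os object put --bucket-name "$BUCKET" \
    --file "$CONF_DIR/$conf" \
    --name "$(dated_name "$conf" "$stamp")" \
    --force --auth instance_principal
done
```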

SlurmAcctMgr

sacctmgr dump cluster file=slurm_cluster-$(date +%Y-%m-%d).cfg
(example)
echo y | oci os object put --bucket-name backups --file slurm_cluster-2023-07-12.cfg --name slurm_cluster-2023-07-12.cfg --auth instance_principal
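
The upload example above hard-codes a date, which is easy to get out of step with the file that `sacctmgr dump` actually wrote. A possible way around that is to build the file name once and reuse it, as sketched below. The cluster name `cluster` and bucket `backups` are taken from the commands above; `dump_and_upload` and `run` are hypothetical helpers, and commands are only printed unless DRY_RUN=0.

```bash
#!/usr/bin/env bash
# Sketch: dump sacctmgr data and upload it under the same dated name.
set -eu

DRY_RUN="${DRY_RUN:-1}"          # 1 = only print commands, 0 = execute
CLUSTER="${CLUSTER:-cluster}"    # cluster name passed to 'sacctmgr dump'
BUCKET="${BUCKET:-backups}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

dump_and_upload() {
  # $1 = date stamp, e.g. 2023-07-12
  dump_file="slurm_cluster-$1.cfg"
  run sacctmgr dump "$CLUSTER" file="$dump_file"
  run oci os object put --bucket-name "$BUCKET" \
    --file "$dump_file" --name "$dump_file" \
    --force --auth instance_principal
}

dump_and_upload "$(date +%Y-%m-%d)"
```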

Mysqldb

Log in to the bastion and back up the database.

mysqldump -u root -p slurm_accounting > slurm_accounting_backup_$(date +%Y-%m-%d).sql
Note: the root password is stored in /etc/opt/oci-hpc/passwords/mysql/root.txt

Upload the dump to Object Storage

(example)
echo y | oci os object put --bucket-name backups --file slurm_accounting_backup_2023-07-12.sql --name slurm_accounting_backup_2023-07-12.sql --auth instance_principal
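
Before uploading, it can be worth confirming that the dump actually finished. mysqldump normally ends its output with a `-- Dump completed` comment (unless comments have been disabled), so a truncated file is easy to spot. The helper below is a hypothetical sketch of that check, not part of the original procedure.

```bash
#!/usr/bin/env bash
# Sketch: detect a truncated mysqldump file before uploading it.
set -eu

# Succeeds only if the file is non-empty and its last line is
# mysqldump's completion marker.
dump_is_complete() {
  [ -s "$1" ] && tail -n 1 "$1" | grep -q '^-- Dump completed'
}

# Usage:
#   dump_is_complete /path/to/dump.sql && echo "dump looks complete"
```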

LDAP

Connect to the server: Log in to the server where OpenLDAP is installed using SSH or any other remote access method.

Stop the LDAP service: Depending on your system configuration, you may need to stop the LDAP service to ensure data consistency during the backup process. Use the appropriate command for your system. For example, on a Linux system, you can use the following command:

sudo systemctl stop slapd

Locate the LDAP data directory: Find the directory where the LDAP data is stored. The default directory is often /var/lib/ldap or /var/lib/openldap-data. If you have a different configuration, refer to your OpenLDAP installation documentation to determine the location.

Create a backup directory: Create a directory where you will store the backup files. For example:

sudo mkdir -p /backup/ldap

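Perform the backup: Use slapcat to export the LDAP database to an LDIF file. The -n option selects the database by number; on a cn=config installation, database 0 is the configuration database and the directory entries live in a higher-numbered database (commonly 1 or 2), so repeat the export with that number if you need the directory data as well. Execute the following command: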

sudo slapcat -n 0 -l /backup/ldap/ldap_backup.ldif

Verify the backup: Once the backup process completes, check the backup file in the specified directory to ensure it was created successfully.

Start the LDAP service: Restart the LDAP service to resume normal operations. Use the appropriate command for your system. For example, on a Linux system:

sudo systemctl start slapd
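
The export and verification steps above can be sketched as a small script. Since `-n 0` exports only the configuration database on a cn=config setup, this version also exports a data database and counts the entries in an LDIF file as a sanity check. DATA_DB, the output file names, and the `count_entries` and `run` helpers are assumptions for illustration; check your own slapd configuration for the real database number. Commands are printed unless DRY_RUN=0.

```bash
#!/usr/bin/env bash
# Sketch: export LDAP config and data, then count exported entries.
set -eu

DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print commands
BACKUP_DIR="${BACKUP_DIR:-/backup/ldap}"
DATA_DB="${DATA_DB:-1}"                  # assumption: data database number

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Every LDIF entry starts with a "dn:" line, so counting those lines
# gives the number of exported entries (0 means an empty export).
count_entries() {
  grep -c '^dn:' "$1" || true
}

run sudo slapcat -n 0 -l "$BACKUP_DIR/ldap_config.ldif"
run sudo slapcat -n "$DATA_DB" -l "$BACKUP_DIR/ldap_data.ldif"

# Usage after a real run:
#   count_entries "$BACKUP_DIR/ldap_data.ldif"
```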

Restore

Download backups

DB

oci os object get --bucket-name backups --file slurm_accounting_backup.sql --name slurm_accounting_backup_2023-07-12.sql --auth instance_principal

SlurmConfigs

oci os object get --bucket-name backups --file slurm.conf --name slurm-2023-07-12.conf --auth instance_principal
oci os object get --bucket-name backups --file slurmdbd.conf --name slurmdbd-2023-07-12.conf --auth instance_principal

SlurmAcctMgr

oci os object get --bucket-name backups --file slurm_cluster.cfg --name slurm_cluster-2023-07-12.cfg --auth instance_principal
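
A failed or partial download can leave a missing or zero-byte file behind, so a quick check before starting the restore may save some confusion later. The helper below is a hypothetical sketch; the file names in the usage line are the local names chosen in the download commands above.

```bash
#!/usr/bin/env bash
# Sketch: make sure every downloaded backup file exists and is non-empty.
set -eu

# Returns non-zero and names the first file that is missing or empty.
require_nonempty() {
  local f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
}

# Usage:
#   require_nonempty slurm_accounting_backup.sql slurm.conf \
#     slurmdbd.conf slurm_cluster.cfg && echo "all backups present"
```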

Start Restoring

Slurm

Update the SlurmctldHost variables in the slurm.conf file to correspond to the new bastion and backup nodes.
To restore the slurm.conf file:

  • Replace the default slurm.conf with the backup:
sudo mv {path-to-slurm-conf-backup}/slurm.conf /etc/slurm/slurm.conf
  • Update these variables in slurm.conf:
      SlurmctldHost={cluster-name}-bastion
      SlurmctldHost={cluster-name}-backup
      AccountingStorageHost={cluster-name}-bastion
      NodeName={cluster-name}-login
```bash
sudo vim /etc/slurm/slurm.conf
```
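
The host-name edits listed above can also be scripted rather than made by hand. The sketch below assumes that each key starts at the beginning of a line and that the first SlurmctldHost line is the bastion and the second is the backup node. `retarget_slurm_conf` is a hypothetical helper that writes the result to stdout for review; the NodeName line is left for manual editing because it usually carries extra node attributes.

```bash
#!/usr/bin/env bash
# Sketch: rewrite controller and accounting hosts for a new cluster name.
set -eu

retarget_slurm_conf() {
  # $1 = path to slurm.conf, $2 = new cluster name
  awk -v c="$2" '
    /^SlurmctldHost=/ {
      n++
      suffix = (n == 1) ? "-bastion" : "-backup"
      print "SlurmctldHost=" c suffix
      next
    }
    /^AccountingStorageHost=/ {
      print "AccountingStorageHost=" c "-bastion"
      next
    }
    { print }
  ' "$1"
}

# Usage (review the output before replacing the live file):
#   retarget_slurm_conf /etc/slurm/slurm.conf mycluster > /tmp/slurm.conf.new
#   diff /etc/slurm/slurm.conf /tmp/slurm.conf.new
```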

DB

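Update the StoragePass in the slurmdbd.conf file to the password found in the /etc/opt/oci-hpc/passwords/mysql/root.txt file of the new bastion node.

After changing the slurm.conf and slurmdbd.conf files, apply the changes. Note that scontrol reconfigure only makes the Slurm daemons re-read slurm.conf; restarting slurmdbd (for example with sudo systemctl restart slurmdbd) is the reliable way to pick up changes to slurmdbd.conf.

sudo scontrol reconfigure

Then import the accounting database dump:

mysql -u root -p slurm_accounting < slurm_accounting_backup.sql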
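
Editing StoragePass by hand is error-prone when the password contains special characters. A scripted alternative is sketched below, assuming the password file holds the password on its first line and that slurmdbd.conf has a `StoragePass=` line at the start of a line. `set_storage_pass` is a hypothetical helper; it writes the updated file to stdout and passes the password through the environment so that awk does not interpret backslashes or ampersands in it.

```bash
#!/usr/bin/env bash
# Sketch: print slurmdbd.conf with StoragePass replaced from a password file.
set -eu

set_storage_pass() {
  # $1 = path to slurmdbd.conf, $2 = file containing the password
  NEW_PASS="$(head -n 1 "$2")" awk '
    /^StoragePass=/ { print "StoragePass=" ENVIRON["NEW_PASS"]; next }
    { print }
  ' "$1"
}

# Usage (reading the password file may require root):
#   set_storage_pass /etc/slurm/slurmdbd.conf \
#     /etc/opt/oci-hpc/passwords/mysql/root.txt > /tmp/slurmdbd.conf.new
```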

Sacctmgr

sacctmgr load file=slurm_cluster.cfg

LDAP

Connect to the server: Log in to the server where OpenLDAP is installed using SSH or any other remote access method.

Stop the LDAP service: Stop the LDAP service to prevent any conflicts during the restore process. Use the appropriate command for your system. For example, on a Linux system:

sudo systemctl stop slapd

Locate the LDAP data directory: Identify the directory where the LDAP data is stored. If you have a non-default configuration, refer to your OpenLDAP installation documentation.

Move the existing data directory (optional): If you want to replace the existing LDAP database with the restored backup, move the current LDAP data directory aside rather than removing it, since removal permanently deletes the existing data. Recreate an empty data directory afterwards, because slapd expects it to exist. For example:

sudo mv /var/lib/ldap /var/lib/ldap_old
sudo mkdir /var/lib/ldap

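Restore the backup: Use the slapadd command to load the backup LDIF file. The -n value must match the database number used for the export, and -F points at the slapd configuration directory, which is /etc/ldap/slapd.d on Debian/Ubuntu and /etc/openldap/slapd.d on RHEL-family systems. Execute the following command: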

sudo slapadd -n 0 -F /etc/ldap/slapd.d/ -l /backup/ldap/ldap_backup.ldif

Set appropriate ownership and permissions: Ensure that the restored LDAP data directory has the correct ownership and permissions. Typically, the directory should be owned by the LDAP user and group with restricted permissions.

Start the LDAP service: Start the LDAP service to make the restored database accessible. Use the appropriate command for your system. For example, on a Linux system:

sudo systemctl start slapd
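
For the ownership step above: because slapadd runs under sudo, the files it creates belong to root, and slapd may then fail to open them. The service account name differs by distribution, commonly `openldap` on Debian/Ubuntu and `ldap` on RHEL-family systems. The sketch below picks whichever exists; `first_existing_user` and `run` are hypothetical helpers, the data directory is the default mentioned above, and mode 700 is only an assumption about what "restricted" should mean here. Commands are printed unless DRY_RUN=0, and the ownership fix should be applied before slapd is started.

```bash
#!/usr/bin/env bash
# Sketch: hand the restored LDAP data directory back to the LDAP account.
set -eu

DRY_RUN="${DRY_RUN:-1}"              # 1 = only print commands, 0 = execute
LDAP_DIR="${LDAP_DIR:-/var/lib/ldap}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Print the first account name from the argument list that exists locally.
first_existing_user() {
  local u
  for u in "$@"; do
    if id -u "$u" >/dev/null 2>&1; then
      echo "$u"
      return 0
    fi
  done
  return 1
}

if owner="$(first_existing_user openldap ldap)"; then
  run sudo chown -R "$owner:$owner" "$LDAP_DIR"
  run sudo chmod 700 "$LDAP_DIR"     # assumption: owner-only access
else
  echo "no LDAP service account found; set ownership manually" >&2
fi
```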

Verify the restore

Check that slurm is running as expected.
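
One way to make "running as expected" concrete is a short list of checks run on the bastion. The sketch below is an assumption about what is worth checking, not an exhaustive test: the service names (slurmctld, slurmdbd, slapd) are the usual unit names and may differ on your system, and `check` and `verify_restore` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: basic health checks after a restore.

# Run a command quietly and report whether it succeeded.
check() {
  local desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK: $desc"
  else
    echo "FAIL: $desc"
    return 1
  fi
}

verify_restore() {
  local failed=0
  check "slurmctld service is active" systemctl is-active --quiet slurmctld || failed=1
  check "slurmdbd service is active"  systemctl is-active --quiet slurmdbd  || failed=1
  check "slapd service is active"     systemctl is-active --quiet slapd     || failed=1
  check "controller answers sinfo"    sinfo                                 || failed=1
  check "accounting data is readable" sacctmgr -n show cluster              || failed=1
  return "$failed"
}

# Usage:
#   verify_restore && echo "restore looks healthy"
```

Running every check instead of stopping at the first failure gives a fuller picture of which component still needs attention.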