Backup and Restore procedure of SLURM¶
Several parts of a Slurm deployment need to be backed up and restored to preserve business continuity (BC), namely:
- Slurm config files
- sacctmgr data
- MySQL database(s)
Prereqs¶
- The compartment has been granted instance-principal access to the slurm-d3 bucket, so backup files can be uploaded and downloaded between VCNs.
Boot disk imaging¶
A catch-all is to clone the boot disks of the bastion, login and backup nodes.
- Snapshot the bastion, login and backup boot disks and restore them.
- Snapshot the FSS exports and restore them to the dev env in the same way.
- Validate the restored env, e.g. make sure this command works prior to imaging and on the restored image in the dev env.
Here is an example of creating a boot volume backup (see the reference documentation).
Using the Console
Open the navigation menu and click Storage. Under Block Storage, click Block Volumes. In the Block Storage menu on the sidebar, click Boot Volumes.
Click the boot volume that you want to create a backup for.
Click Create Manual Backup.
Enter a name for the backup. Avoid entering confidential information.
Select the backup type, either incremental or full. See Boot Volume Backup Types for information about backup types.
If you have permissions to create a resource, then you also have permissions to apply free-form tags to that resource. To apply a defined tag, you must have permissions to use the tag namespace. For more information about tagging, see Resource Tags.
If you are not sure whether to apply tags, skip this option (you can apply tags later) or ask your administrator.
Click Create Backup.
The backup is completed when its icon no longer lists it as CREATING in the Boot Volume Backup list.
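The Console steps above can also be scripted with the OCI CLI. The following is a minimal sketch, not the procedure used on this cluster: the function name `create_boot_volume_backup`, the display-name pattern and the OCID passed in are hypothetical, and it assumes instance-principal auth is allowed to create boot volume backups.

```shell
# Hypothetical helper: create a manual backup of one boot volume.
#   $1 = boot volume OCID (placeholder, supply your own)
#   $2 = backup type, INCREMENTAL (default) or FULL
create_boot_volume_backup() {
    boot_volume_id="$1"
    backup_type="${2:-INCREMENTAL}"
    # The display name is an assumed naming scheme, not one from this runbook.
    oci bv boot-volume-backup create \
        --boot-volume-id "$boot_volume_id" \
        --display-name "bootvol-backup-$(date +%Y-%m-%d)" \
        --type "$backup_type" \
        --auth instance_principal
}
```

Called as `create_boot_volume_backup <boot-volume-ocid> FULL`, once per node (bastion, login, backup).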
Slurm files and configs¶
SlurmConfigs¶
cd /etc/slurm
echo y | oci os object put --bucket-name backups --file slurm.conf --name slurm-$(date +%Y-%m-%d).conf --auth instance_principal
echo y | oci os object put --bucket-name backups --file slurmdbd.conf --name slurmdbd-$(date +%Y-%m-%d).conf --auth instance_principal
SlurmAcctMgr¶
sacctmgr dump cluster file=slurm_cluster-$(date +%Y-%m-%d).cfg
(example)
echo y | oci os object put --bucket-name backups --file slurm_cluster-2023-07-12.cfg --name slurm_cluster-2023-07-12.cfg --auth instance_principal
Mysqldb¶
Log in to the bastion and back up the db¶
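The dump command itself is not shown on this page. A minimal sketch of what it could look like, assuming the accounting database uses the Slurm default name `slurm_acct_db` (check `StorageLoc` in slurmdbd.conf) and that the function name `dump_slurm_db` is purely illustrative:

```shell
# Hypothetical helper: dump the Slurm accounting database to a dated file.
#   $1 = database name (assumed default: slurm_acct_db)
dump_slurm_db() {
    db_name="${1:-slurm_acct_db}"
    out_file="slurm_accounting_backup_$(date +%Y-%m-%d).sql"
    # -p prompts for the MySQL root password.
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them for the duration of the dump.
    mysqldump -u root -p --single-transaction "$db_name" > "$out_file"
    # Print the file name so it can be passed to the upload step.
    echo "$out_file"
}
```

Running `dump_slurm_db` in the current directory produces a file named like the one uploaded in the example that follows.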
Note: the root password is stored in '/etc/opt/oci-hpc/passwords/mysql/root.txt'.
Upload the dump to s3¶
(example)
echo y | oci os object put --bucket-name backups --file slurm_accounting_backup_2023-07-12.sql --name slurm_accounting_backup_2023-07-12.sql --auth instance_principal
LDAP¶
Connect to the server: Log in to the server where OpenLDAP is installed using SSH or any other remote access method.
Stop the LDAP service: Depending on your system configuration, you may need to stop the LDAP service to ensure data consistency during the backup process. Use the appropriate command for your system. For example, on a Linux system, you can use the following command:
Locate the LDAP data directory: Find the directory where the LDAP data is stored. The default directory is often /var/lib/ldap or /var/lib/openldap-data. If you have a different configuration, refer to your OpenLDAP installation documentation to determine the location.
Create a backup directory: Create a directory where you will store the backup files. For example:
Perform the backup: Use a tool like slapcat to generate a backup of the LDAP database. The slapcat command reads the LDAP database and writes it to an LDIF file. Execute the following command:
Verify the backup: Once the backup process completes, check the backup file in the specified directory to ensure it was created successfully.
Start the LDAP service: Restart the LDAP service to resume normal operations. Use the appropriate command for your system. For example, on a Linux system:
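The LDAP backup steps above can be sketched as one shell function. This is an illustration under stated assumptions, not the exact commands used here: the systemd unit is assumed to be `slapd`, the function name `backup_ldap` and the LDIF file-name pattern are hypothetical, and the backup directory is whatever path you pass in.

```shell
# Hypothetical helper: export the LDAP database to a dated LDIF file.
#   $1 = backup directory of your choosing (e.g. /backup/ldap)
backup_ldap() {
    backup_dir="$1"
    ldif_file="$backup_dir/ldap_backup_$(date +%Y-%m-%d).ldif"
    # Stop the service so the export is consistent (unit name is assumed).
    sudo systemctl stop slapd || return 1
    sudo mkdir -p "$backup_dir"
    # slapcat reads the database directly and writes it out as LDIF.
    sudo slapcat -l "$ldif_file"
    status=$?
    # Bring the service back even if the export failed.
    sudo systemctl start slapd
    # Succeed only if slapcat succeeded and the file is non-empty.
    [ "$status" -eq 0 ] && sudo test -s "$ldif_file"
}
```

For example, `backup_ldap /backup/ldap` leaves `ldap_backup_<date>.ldif` in that directory, which can then be uploaded to the bucket like the other backup files.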
Restore¶
Download backups
DB
oci os object get --bucket-name backups --file slurm_accounting_backup.sql --name slurm_accounting_backup_2023-07-12.sql --auth instance_principal
SlurmConfigs
oci os object get --bucket-name backups --file slurm.conf --name slurm-2023-07-12.conf --auth instance_principal
oci os object get --bucket-name backups --file slurmdbd.conf --name slurmdbd-2023-07-12.conf --auth instance_principal
SlurmAcctMgr
oci os object get --bucket-name backups --file slurm_cluster.cfg --name slurm_cluster-2023-07-12.cfg --auth instance_principal
Start Restoring
Slurm
Update the SlurmctldHost variables in the slurm.conf file to correspond to the new bastion and backup nodes.
To restore the slurm.conf file:
* replace the default slurm.conf with the backup
sudo mv {path-to-slurm-conf-backup}/slurm.conf /etc/slurm/slurm.conf
* update variables in slurm.conf
    * SlurmctldHost={cluster-name}-bastion
    * SlurmctldHost={cluster-name}-backup
    * AccountingStorageHost={cluster-name}-bastion
    * NodeName={cluster-name}-login
```bash
sudo vim /etc/slurm/slurm.conf
```
DB
Update StoragePass in the slurmdbd.conf file to the password found in the /etc/opt/oci-hpc/passwords/mysql/root.txt file of the new bastion node.
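The step that loads the downloaded dump into MySQL is not shown on this page. A minimal sketch, assuming the target database is the Slurm default `slurm_acct_db` and using the hypothetical function name `restore_slurm_db`:

```shell
# Hypothetical helper: load the downloaded dump into MySQL on the new bastion.
#   $1 = dump file (default: slurm_accounting_backup.sql, as downloaded above)
#   $2 = database name (assumed default: slurm_acct_db)
restore_slurm_db() {
    dump_file="${1:-slurm_accounting_backup.sql}"
    db_name="${2:-slurm_acct_db}"
    [ -r "$dump_file" ] || { echo "cannot read $dump_file" >&2; return 1; }
    # A single-database dump carries no CREATE DATABASE statement,
    # so make sure the database exists first. -p prompts for the password.
    mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS \`$db_name\`" || return 1
    mysql -u root -p "$db_name" < "$dump_file"
}
```

It is safer to run this while slurmdbd is stopped, so nothing writes to the database during the import.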
After making changes to the slurm.conf and slurmdbd.conf files we need to apply the changes:
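One way to apply the changes is sketched below. It assumes the systemd units carry the stock names `slurmdbd` and `slurmctld`; the function name `apply_slurm_config` is hypothetical.

```shell
# Hypothetical helper: restart the Slurm daemons after editing the configs.
apply_slurm_config() {
    # Restart the accounting daemon first so slurmctld can reach it.
    sudo systemctl restart slurmdbd || return 1
    sudo systemctl restart slurmctld || return 1
    # Ask the running daemons, including slurmd on the compute nodes,
    # to re-read slurm.conf.
    sudo scontrol reconfigure
}
```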
Sacctmgr
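The sacctmgr dump downloaded earlier can be loaded back with `sacctmgr load`. A minimal sketch, assuming slurmdbd is already running on the new bastion; the function name `load_sacctmgr_dump` is hypothetical.

```shell
# Hypothetical helper: load the cluster/account/user associations from a dump.
#   $1 = dump file (default: slurm_cluster.cfg, as downloaded above)
load_sacctmgr_dump() {
    cfg_file="${1:-slurm_cluster.cfg}"
    [ -r "$cfg_file" ] || { echo "cannot read $cfg_file" >&2; return 1; }
    # -i commits the changes without asking for confirmation.
    sudo sacctmgr -i load file="$cfg_file"
}
```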
LDAP
Connect to the server: Log in to the server where OpenLDAP is installed using SSH or any other remote access method.
Stop the LDAP service: Stop the LDAP service to prevent any conflicts during the restore process. Use the appropriate command for your system. For example, on a Linux system:
Locate the LDAP data directory: Identify the directory where the LDAP data is stored. If you have a non-default configuration, refer to your OpenLDAP installation documentation.
Move the existing data directory (optional): If you want to replace the existing LDAP database with the restored backup, you can move or remove the current LDAP data directory. Be cautious: removing the directory permanently deletes the existing data, whereas moving it keeps a copy you can fall back on. For example, you can use the following command to move the directory:
Restore the backup: Use the slapadd command to restore the LDAP database from the backup LDIF file. Execute the following command:
Set appropriate ownership and permissions: Ensure that the restored LDAP data directory has the correct ownership and permissions. Typically, the directory should be owned by the LDAP user and group with restricted permissions.
Start the LDAP service: Start the LDAP service to make the restored database accessible. Use the appropriate command for your system. For example, on a Linux system:
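The LDAP restore steps above can be sketched as follows. The assumptions are loud ones: the unit name `slapd`, the data directory `/var/lib/ldap`, and the owner `ldap:ldap` (typical on RHEL-family systems; Debian-family systems usually use `openldap:openldap`) may all differ on your nodes, and the function name `restore_ldap` is hypothetical.

```shell
# Hypothetical helper: restore the LDAP database from an LDIF backup.
#   $1 = LDIF backup file
#   $2 = LDAP data directory (assumed default: /var/lib/ldap)
#   $3 = owner of the data directory (assumed default: ldap:ldap)
restore_ldap() {
    ldif_file="$1"
    data_dir="${2:-/var/lib/ldap}"
    owner="${3:-ldap:ldap}"
    [ -r "$ldif_file" ] || { echo "cannot read $ldif_file" >&2; return 1; }
    sudo systemctl stop slapd || return 1
    # Move the old data aside instead of deleting it, so it can be recovered.
    sudo mv "$data_dir" "$data_dir.old-$(date +%Y-%m-%d)" || return 1
    sudo mkdir -p "$data_dir"
    # slapadd writes the LDIF entries directly into the database.
    sudo slapadd -l "$ldif_file" || return 1
    # Restrict the restored data to the LDAP service account.
    sudo chown -R "$owner" "$data_dir"
    sudo chmod 700 "$data_dir"
    sudo systemctl start slapd
}
```

If `slapadd` fails, the service is left stopped on purpose so the moved-aside directory can be put back before anything starts against a half-restored database.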
Verify the restore¶
Check that Slurm is running as expected.
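A few checks that could serve as a starting point are sketched below. The unit names `slurmdbd` and `slurmctld` are assumed, the function name `verify_slurm` is hypothetical, and what counts as "expected" output depends on your cluster.

```shell
# Hypothetical helper: basic health checks after a restore.
verify_slurm() {
    systemctl is-active --quiet slurmdbd \
        || { echo "slurmdbd is not active" >&2; return 1; }
    systemctl is-active --quiet slurmctld \
        || { echo "slurmctld is not active" >&2; return 1; }
    # Reports whether the primary and backup controllers respond.
    scontrol ping || return 1
    # Lists partitions and node states as seen by the controller.
    sinfo || return 1
    # Confirms the accounting database is reachable and knows the cluster.
    sacctmgr -n show cluster
}
```

Run `verify_slurm` on the restored bastion; a non-zero exit status means one of the checks failed.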