Weka

It is strongly encouraged to read the WEKA System Overview before working on the WEKA filesystem. Briefly, it is a container-based distributed filesystem. It stripes data in an N+2 or N+4 redundancy scheme across a series of backend servers that each have their own drives. Losing any number of backends up to the redundancy number is recoverable and imperceptible to users (besides a slight performance impact proportional to the number of servers that went down). Additionally, a temporary failure of any number of backends is recoverable so long as every backend recovers (it is recommended to run stop-io before intentionally doing this to many nodes for a restart or the like). The system requires a certain number of cores to be dedicated to the containers (~20 for 4 drives is about ideal). These cores are unavailable to any other processes such as Slurm (enforced with cgroups).
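For example, a hedged sketch of pausing IO around planned maintenance on several backends at once:

# Pause cluster IO before intentionally taking several backends down
weka cluster stop-io
# ...restart / maintain the backends and wait for all of them to rejoin...
weka cluster start-io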

Admin Password

The current WEKA admin password:

Mqxs&gYrtpN.RGb2V9U!7d
The password can be changed with the following command:
weka user passwd
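For example, a minimal sketch assuming you log in as admin first:

# Log in with the current admin credentials, then set a new password interactively
weka user login admin
weka user passwd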

Initial Installation on the Cluster

Create a weka directory in the ubuntu user's home directory.

mkdir /home/ubuntu/weka
Clone the WEKA tools repo:
cd /home/ubuntu/weka && git clone http://github.com/weka/tools
Now the install script. This can be run start to end by copy-pasting it. Make sure you are SSHed into the lowest-numbered compute node on the cluster. Take note of the comments before running and check whether any of the numbers need to change (e.g. number of drives, number of nodes, etc.). You may also need to change the WEKA tar name wherever it is mentioned, depending on the version downloaded from WEKA.
# Create list of compute nodes. Recommended to double check that this list contains all the compute nodes before continuing.
sudo sinfo -S "%n" -o "%n" | tail -n +2 > /home/ubuntu/weka/weka_hosts
# Configure mellanox devices
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo mst start'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo ibdev2netdev' 
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo mlxconfig -y -d /dev/mst/mt4125_pciconf0 s PCI_WR_ORDERING=1 ADVANCED_PCI_SETTINGS=1'
wget --auth-no-challenge https://VqWHzWpKWBeaVCHU:@get.weka.io/dist/v1/pkg/weka-4.2.18.14.tar --directory-prefix /home/ubuntu/weka/
pdcp -w ^/home/ubuntu/weka/weka_hosts /home/ubuntu/weka/weka-4.2.18.14.tar /tmp
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo tar xf /tmp/weka-4.2.18.14.tar -C /tmp'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'cd /tmp/weka-4.2.18.14/ && sudo ./install.sh'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'hostname && sudo weka local ps'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo sed -i "s/^cgroups_mode=.*/cgroups_mode=force_v2/" /etc/wekaio/service.conf'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo sed -i "s/^isolate_cpusets=.*/isolate_cpusets=false/" /etc/wekaio/service.conf'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo weka local stop'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo weka local rm --all -f'
# Set cores to be equal to number of drives per node
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo weka local setup container --name drives0 --cores 4 --core-ids 96,97,98,99 --failure-domain FD-$(hostname | sed 's/compute-//') --only-drives-cores --net eth0'
# Get the node that will be the join for the cluster. After install it doesn't matter, so just picked lowest numbered node
head -1 /home/ubuntu/weka/weka_hosts | xargs -I {} -P 1 ssh {} "sudo ip a | grep inet | grep eth0 | awk '{ print \$2 }' | cut -d '/' -f 1 | grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+'" > /home/ubuntu/weka/manager
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo weka local setup container --name compute0 --cores 12 --core-ids 1,2,3,4,5,6,7,8,9,10,11,12 --failure-domain FD-$(hostname | sed 's/compute-//') --only-compute-cores --memory 128GB --base-port 14200 --net eth0 --join-ips $(cat /home/ubuntu/weka/manager)'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo weka local setup container --name frontends0 --cores 4 --core-ids 100,101,102,103 --failure-domain FD-$(hostname | sed 's/compute-//') --only-frontend-cores --base-port 14300 --net eth0 --join-ips $(cat /home/ubuntu/weka/manager)'
# This command creates the cluster; it gets hostnames and IP addresses from sinfo
weka cluster create $(sudo sinfo -S "%n" -o "%n" | tail -n +2 | paste -sd ' ' -) --host-ips $(cat /home/ubuntu/weka/weka_hosts | xargs -I {} -P 1 ssh {} "sudo ip a | grep inet | grep eth0 | awk '{ print \$2 }' | cut -d '/' -f 1 | grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+'" | paste -sd ',' -)
# The seq arguments are the cluster container IDs of each node's drives container (0-9 here; 0-31 for prod). nvme0-3 are the per-node drive device numbers.
seq 0 9 | xargs -I {} -P 0 weka cluster drive add {} /dev/nvme{0..3}n1 --force
weka cluster hot-spare 1
weka cluster start-io 
# Name this whatever you want; it is currently named default. Users never see it as it is only used for management.
weka fs group create default
weka fs create default default $(weka status -J | jq .capacity.unprovisioned_bytes)
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo mkdir /data'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'sudo mount -t wekafs default /data'
pdsh -w ^/home/ubuntu/weka/weka_hosts 'echo "default /data wekafs x-systemd.requires=weka-agent.service,x-systemd.mount-timeout=infinity,_netdev 0 0" | sudo tee -a /etc/fstab'
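Before moving on, a hedged sanity check from any compute node (using commands that appear elsewhere in this doc):

# Cluster should report IO started and all drives ACTIVE
weka status
weka cluster drive
# The filesystem should be mounted at /data
df -h /data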

WEKA is now up and running on all the compute nodes. You can run the WEKA benchmarking tool to check baseline performance by running the following from one of the compute nodes: sudo /home/ubuntu/weka/tools/wekatester/wekatester. Note that you will have to log in first (sudo weka user login admin) and have passwordless SSH set up for the root user.

Slurm Compatibility

Cgroups ensure Slurm and WEKA don't interfere with each other's cores. A Slurm job will fail if any of the cores it tries to allocate are part of the WEKA cgroups. To explicitly isolate Slurm and WEKA core usage, set TaskPluginParam=SlurmdOffSpec in /etc/slurm/slurm.conf, then change the compute node specification to include a CoreSpecList parameter containing all the core IDs allocated to WEKA. This must be done in the Slurm daemon unit file template. By default, if the cluster is deployed with A100s the unit file will already be deployed with the necessary cores isolated. See the Slurm documentation for specialized cores.
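A minimal sketch of the relevant slurm.conf lines, assuming the core IDs given to the WEKA containers in this guide (1-12 for compute0, 96-99 for drives0, 100-103 for frontends0) and a hypothetical node range; match both to the deployed unit file template:

# /etc/slurm/slurm.conf (sketch only; core IDs and node names are assumptions)
TaskPluginParam=SlurmdOffSpec
# Reserve every core ID handed to WEKA so Slurm never allocates them to jobs
NodeName=compute-[01-31] CoreSpecList=1,2,3,4,5,6,7,8,9,10,11,12,96,97,98,99,100,101,102,103   # keep the rest of the existing node definition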

Adding clients (controller, backup, login, etc)

Configure /home/ubuntu/weka/weka_clients to contain the hostnames of the clients you want to add. Then run the following commands.

# Configure weka_clients 
touch /home/ubuntu/weka/weka_clients
echo 'login' > /home/ubuntu/weka/weka_clients
echo 'controller' >> /home/ubuntu/weka/weka_clients
echo 'backup' >> /home/ubuntu/weka/weka_clients

# Add weka_clients
pdcp -w ^/home/ubuntu/weka/weka_clients /home/ubuntu/weka/weka-4.2.18.14.tar /tmp
pdsh -w ^/home/ubuntu/weka/weka_clients 'tar xf /tmp/weka-4.2.18.14.tar -C /tmp'
pdsh -w ^/home/ubuntu/weka/weka_clients 'cd /tmp/weka-4.2.18.14/ && sudo ./install.sh'
pdsh -w ^/home/ubuntu/weka/weka_clients 'hostname && sudo weka local ps'
# Create /data directory
pdsh -w ^/home/ubuntu/weka/weka_clients 'sudo mkdir /data'  
# This grabs the lowest-numbered compute node from sinfo and uses it as the join point
pdsh -w ^/home/ubuntu/weka/weka_clients 'sudo mount -t wekafs -o net=udp $(sudo sinfo -S "%n" -o "%n" | head -2 | tail -1)/default /data'

We now have WEKA clients using a single core each. To configure clients with more than a single CPU core we need to use DPDK; see the "Mounting the filesystem using fstab" section below for details.

Adding backends

The replacement scripts will fail if the scratch_nfs setting is true in /etc/ansible/hosts. Be sure it is false before running the scripts. Once you have gone through the node replacement steps described in the admin docs, proceed with the following: change new_weka_hosts to contain just the hostnames of the new compute nodes, then run the following commands on controller.

# Create new_weka_hosts file
rm /home/ubuntu/weka/new_weka_hosts
touch /home/ubuntu/weka/new_weka_hosts
# Add hostnames
# Example:
#echo 'compute-##' > /home/ubuntu/weka/new_weka_hosts
#echo 'compute-##' >> /home/ubuntu/weka/new_weka_hosts
#echo 'compute-##' >> /home/ubuntu/weka/new_weka_hosts
# Configure mellanox devices
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo mst start'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo ibdev2netdev' 
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo mlxconfig -y -d /dev/mst/mt4125_pciconf0 s PCI_WR_ORDERING=1 ADVANCED_PCI_SETTINGS=1'
# Reboot for changes to take effect (~20 minutes) 
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo reboot now'
Now for the install commands. Make sure you have correctly changed new_weka_hosts to contain only the new compute nodes, and that /home/ubuntu/weka/manager points at a live backend (run weka local ps on that node to double check). Then copy and paste the following block of commands.
# Download tarball if needed. If missing make sure to download the same version as the cluster.
# wget --auth-no-challenge https://VqWHzWpKWBeaVCHU:@get.weka.io/dist/v1/pkg/weka-4.2.18.14.tar --directory-prefix /home/ubuntu/weka/
pdcp -w ^/home/ubuntu/weka/new_weka_hosts /home/ubuntu/weka/weka-4.2.18.14.tar /tmp
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'tar xf /tmp/weka-4.2.18.14.tar -C /tmp'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'cd /tmp/weka-4.2.18.14/ && sudo ./install.sh'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'hostname && sudo weka local ps'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo sed -i "s/^cgroups_mode=.*/cgroups_mode=force_v2/" /etc/wekaio/service.conf'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo sed -i "s/^isolate_cpusets=.*/isolate_cpusets=false/" /etc/wekaio/service.conf'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo systemctl restart weka-agent'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo weka local stop'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo weka local rm --all -f'
# Set cores to be equal to the number of drives per node (4 for the prod cluster). Be sure to specify core IDs that align with those set in slurm.conf
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo weka local setup container --name drives0 --cores 4 --core-ids 96,97,98,99 --failure-domain FD-$(hostname | sed 's/compute-//') --only-drives-cores --net eth0 --join-ips $(cat /home/ubuntu/weka/manager)'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo weka local setup container --name compute0 --cores 12 --core-ids 1,2,3,4,5,6,7,8,9,10,11,12 --failure-domain FD-$(hostname | sed 's/compute-//') --only-compute-cores --memory 128GB --base-port 14200 --net eth0 --join-ips $(cat /home/ubuntu/weka/manager)'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo weka local setup container --name frontends0 --cores 4 --core-ids 100,101,102,103 --only-frontend-cores --failure-domain FD-$(hostname | sed 's/compute-//') --base-port 14300 --net eth0 --join-ips $(cat /home/ubuntu/weka/manager)'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo mkdir /data'
pdsh -w ^/home/ubuntu/weka/new_weka_hosts 'sudo mount -t wekafs default /data'
Now run weka cluster container from any node in the cluster to find the container ID of the newly created drives0 container (it should be at the bottom of the output). Then run weka cluster drive add ## /dev/nvme{0..3}n1 --force, replacing ## with the container ID. Now wait for the cluster to rebuild. This can take a while (30+ minutes), so run watch weka status and wait for it to report fully protected again.
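A sketch of those steps, assuming (hypothetically) that the new drives0 container received ID 32:

# Find the container ID of the newly added drives0 container
weka cluster container | grep drives0
# Add that node's four NVMe drives, replacing 32 with the real container ID
weka cluster drive add 32 /dev/nvme{0..3}n1 --force
# Wait for the rebuild to finish and the cluster to report fully protected
watch weka status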

You can check that WEKA was correctly mounted with a cd /data && ll. You should see the home directories of our researchers.

Finally, the node(s) can be added to the cluster.

# replace #### with the node number
sudo scontrol update nodename=compute-#### state=resume
Troubleshooting

When deactivating drives, if you encounter an error similar to: error: Total capacity left will be 133.58 TiB, less than the total budget for filesystems which is 163.26 TiB, you will need to temporarily decrease the number of hot spares. To view the current number of hot spares use weka status. You can change the number with weka cluster hot-spare #, where # is the new number of hot spares. Incrementally decrease the hot spares until you are able to deactivate the drives. Once the drives are deactivated, finish adding the new node. When the new node and drives have been added to the cluster, increase the number of hot spares back to the maximum. This can be done incrementally, as the cluster won't accept a hot-spare count greater than its capacity allows.
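A hedged sketch of that sequence (the hot-spare counts here are examples):

# Check the current hot spare count and overall capacity
weka status
# Temporarily drop the hot spare count (example: from 2 to 1)
weka cluster hot-spare 1
# Retry the drive deactivation that previously failed
weka cluster drive deactivate <drive_uuid_1> <drive_uuid_2> ...
# After the new node and drives are in, restore the hot spares
weka cluster hot-spare 2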

Removing a Backend

Follow these steps when retiring a node or removing its drives:

Confirm Full Protection
Run weka status to verify the cluster is fully protected (no active rebuilds).

Deactivate and Remove Drives
List inactive or failed drives:

weka cluster drive -F status=INACTIVE -o uuid --no-header | paste -sd ' ' -
Deactivate:
weka cluster drive deactivate <drive_uuid_1> <drive_uuid_2> ...
Remove:
weka cluster drive remove <drive_uuid_1> <drive_uuid_2> ...

Deactivate and Remove Containers
List containers that are down:

weka cluster container -F status=DOWN
Deactivate, then remove:
weka cluster container deactivate <container_id_1> <container_id_2> ...
weka cluster container remove <container_id_1> <container_id_2> ...

Verify Cluster Health
Check that weka status and weka alerts show no errors, and the cluster is Fully Protected again.

Mounting the filesystem using fstab

We have set up our clients to automatically mount the WEKA filesystem on boot using fstab. The filesystem is mounted in DPDK mode for best performance. Before we can mount the WEKA filesystem in DPDK mode we need to create a VNIC for each core that we want to assign to the WEKA client. This can be done from the OCI console by navigating to an instance's details page and then to the attached VNICs in the left-hand options column.

The general structure of the entry is as follows:

# Depending on the number of cores specified you will need to add multiple `net=` options, one for each core. Each core needs its own VNIC.
{backend_ip_addresses_comma_separated}:/default /data wekafs num_cores={num_cores},net={interface},x-systemd.after=weka-agent.service,x-systemd.mount-timeout=infinity,_netdev 0 0

Login node:

# The login node has higher performance requirements. We dedicated 4 CPU cores on the login node to the WEKA client. Increase this number if you need better performance.
172.16.4.34,172.16.4.47,172.16.4.75:/default /data wekafs num_cores=4,net=ens4,net=ens5,net=ens6,net=ens7,x-systemd.after=weka-agent.service,x-systemd.mount-timeout=infinity,_netdev 0 0

Controller node:

172.16.4.34,172.16.4.47,172.16.4.75:/default /data wekafs num_cores=1,net=ens4,x-systemd.after=weka-agent.service,x-systemd.mount-timeout=infinity,_netdev 0 0

Backup node:

172.16.4.34,172.16.4.47,172.16.4.75:/default /data wekafs num_cores=1,net=ens4,x-systemd.after=weka-agent.service,x-systemd.mount-timeout=infinity,_netdev 0 0

Backends:

sudo weka umount /data
sudo rm /etc/init.d/weka-agent
sudo touch /etc/systemd/system/weka-agent.service
sudo tee /etc/systemd/system/weka-agent.service > /dev/null <<EOF
[Unit]
Description=WEKA Agent Service
Wants=network.target network-online.target
After=network.target network-online.target rpcbind.service
Documentation=http://docs.weka.io
Before=remote-fs-pre.target remote-fs.target

[Service]
Type=simple
ExecStart=/usr/bin/weka --agent
Restart=always
WorkingDirectory=/
EnvironmentFile=/etc/environment
# Increase the default a bit in order to allow many simultaneous
# files to be monitored, we might need a lot of fds.
LimitNOFILE=65535

[Install]
RequiredBy=remote-fs-pre.target remote-fs.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now weka-agent.service
echo "default /data wekafs noauto,x-systemd.automount,_netdev 0 0" | sudo tee -a /etc/fstab
sudo reboot now
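After the node comes back, a hedged check that the new unit and the automount are healthy:

# The replacement systemd unit should be active
systemctl status weka-agent.service --no-pager
# Accessing /data triggers the automount; it should then show as a wekafs mount
ls /data > /dev/null
findmnt /data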

Setting Quotas

We can use WEKA to set soft and hard data limits per directory, with configurable grace periods for the soft limits. Weka Documentation

The commands accept all standard storage formats (TiB, TB, GiB, GB, etc.). You can set defaults for all subfolders of a directory:

sudo weka fs quota set-default /path/to/folder --hard 10GB --grace 1d

Alternatively, you can set a specific quota on a particular folder:

sudo weka fs quota set /path/to/folder --hard 10GB --grace 1d

View all the quotas in the file system and their usage:

weka fs quota list --all

View all users breaking their soft quotas:

weka fs quota list

View the default quotas set on the file system:

weka fs quota list-default

Rebooting a node

Remount WEKA; it should already be in fstab.

cd /
sudo mount /data

Double check that Slurm is still working with squeue or another Slurm command; otherwise restart slurmd with sudo systemctl restart slurmd.

The WEKA filesystem should automatically mount on reboot. See section titled "Mounting the filesystem using fstab" for more details.

How to change WekaFS total capacity

# Increase total Weka FS capacity to 270TB
weka fs update default --total-capacity 270TB
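It can help to check the current capacity before resizing; a hedged sketch using commands seen earlier in this doc:

# List filesystems with their provisioned capacity
weka fs
# Show overall cluster capacity, including unprovisioned bytes
weka status -J | jq .capacity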

Snapshots

How to setup automatic snapshots

# create snaptool user in weka with clusteradmin role
# password: toswyN-tupsy4-pyrtod
weka user add snaptool clusteradmin
# generate auth token for user
weka user login snaptool --path /tmp/snaptool-authtoken.json
# create dir for auth token 
sudo mkdir /root/.weka
# move auth token to new dir
sudo mv /tmp/snaptool-authtoken.json /root/.weka/auth-token.json
# set permissions
sudo chown root:root /root/.weka/auth-token.json
sudo chmod 400 /root/.weka/auth-token.json
# pull snaptool docker image
sudo docker pull wekasolutions/snaptool:latest
# download latest snaptool release
wget --directory-prefix=/tmp/ https://github.com/weka/snaptool/releases/download/v1.6.2/snaptool.tar
# unarchive
cd /opt/weka/ && sudo tar -xvf /tmp/snaptool.tar 
Set up config files
# update /opt/weka/snaptool/snaptool.yml config file
sudo vim /opt/weka/snaptool/snaptool.yml
# use the following config
    cluster:
        auth_token_file: auth-token.json
        hosts: 10.0.6.157,10.0.6.84,10.0.5.151
        force_https: True   # only 3.10+ clusters support https
        verify_cert: False  # default cert cannot be verified

    snaptool:
        port: 8090         # use 0 to disable the status web ui.  This overrides the command line port argument

    filesystems:
        default: backup

    schedules:
        backup:
            hourly:
                every: day
                interval: 60
                retain: 24
                upload: remote
                # day: 1   (this is default)
                # at: 0000 (this is default)
            daily:
                every: day
                retain: 7
                upload: remote
                # at: 0000 (this is default)

# update auth dir and time zone in docker_run.sh
# auth dir should be /opt/weka/snaptool/.weka/
# time zone should be US/Pacific
sudo vim /opt/weka/snaptool/docker_run.sh
Start snaptool service
# start the snaptool service/container
cd /opt/weka/snaptool && sudo ./docker_run.sh
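A hedged check that snaptool came up correctly (the port comes from the snaptool.yml above):

# The snaptool container should be running
sudo docker ps --filter name=snaptool
# The status web UI should answer on the configured port
curl -s http://localhost:8090 > /dev/null && echo "snaptool UI reachable"
# Snapshots should start appearing on the configured schedule
weka fs snapshot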

Object Storage Snapshot Backups

# Start by creating a bucket in OCI. If we don't already have a bucket we won't be able to add it.
# NOTE: snapshots aren't removed from the bucket automatically. The bucket should be purged regularly so that old data that no longer exists on the filesystem or in a snapshot doesn't pile up.
# At the moment we replace the bucket at the beginning of every month.
# This can be automated!

# View existing OBS
# After each step you can run these commands to view the OBS that you added and attached.
weka fs tier obs
weka fs tier s3

# Add OBS
# <bucket-namespace> - This can be found by navigating to the configuration page for the bucket we created. This can be found in the same place as the OCID.
# <bucket-name> - This is the name of the bucket we created in OCI.
weka fs tier s3 add OCI \
--site remote \
--hostname <bucket-namespace>.compat.objectstorage.ap-sydney-1.oraclecloud.com \
--bucket <bucket-name> \
--auth-method AWSSignature4 \
--region ap-sydney-1 \
--access-key-id 52e5048e2668f8447e7521fbf900b862b8f4983b \
--secret-key HDWEo2ScfP2KMCh7ino5Foj+a691WVA2dja2iGwp2sk= \
--protocol HTTPS

# Attach OBS
# After the OBS is added to the cluster we need to attach it to our filesystem.
weka fs tier s3 attach default OCI --mode remote

# Upload snapshot to OBS manually
# We can view snapshots with `weka fs snapshot`. Pick a name and upload that snapshot to the attached OBS.
weka fs snapshot upload default <snapshot-name> --site remote
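# (Hedged example) If no suitable snapshot exists yet, one can be created first; the name is just an example
# weka fs snapshot create default monthly-backup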

# Detach OBS
weka fs tier s3 detach default OCI

# Delete OBS
weka fs tier s3 delete OCI

How to restore filesystem from remote snapshot backup

  1. Add a new local object-store, using weka fs tier obs add

    # Example:
    weka fs tier obs add local-OCI-april \
      --site local \
      --obs-type AWS \
      --hostname axvscsfozusv.compat.objectstorage.ap-sydney-1.oraclecloud.com \
      --protocol HTTPS \
      --auth-method AWSSignature4 \
      --region ap-sydney-1 \
      --access-key-id 52e5048e2668f8447e7521fbf900b862b8f4983b \
      --secret-key "HDWEo2ScfP2KMCh7ino5Foj+a691WVA2dja2iGwp2sk="
    

  2. Add a local object-store bucket, referring to the bucket containing the snapshot to recover, using weka fs tier s3 add

    # Example
    weka fs tier s3 add local-OCI-april-bucket \
      --site local \
      --obs-name local-OCI-april \
      --bucket bucket-weka-april \
      --region ap-sydney-1 \
      --protocol HTTPS \
      --auth-method AWSSignature4
    

  3. Download the filesystem, using weka fs download

    # Example
    # The LOCATOR_STRING for a snapshot can be found with `weka fs snapshot`
    weka fs download myNewFS myGroup 10TB 1TB local-OCI-april-bucket <LOCATOR_STRING> --snapshot-name mySnapshotName
    

  4. (Optional) If the recovered filesystem should also be tiered, add a local object store bucket for tiering using weka fs tier s3 add

    # Add another bucket (usually local) for tiering. 
    weka fs tier s3 add local-tier-bucket \
      --site local \
      --obs-name local-OCI-april \
      --bucket bucket-weka-tiering
    
# Then attach that bucket to the new filesystem to use it for tiering
    weka fs tier s3 attach myNewFS local-tier-bucket
    

  5. Detach the initial object store bucket from the filesystem

    weka fs tier s3 detach myNewFS local-OCI-april-bucket
    

  6. Remove the local object store bucket and local object store created for this procedure.

    # Remove the local bucket object
    weka fs tier s3 rm local-OCI-april-bucket
    
    # Remove the local object store
    weka fs tier obs rm local-OCI-april
    

Uninstalling WEKA

The following should uninstall WEKA from all nodes in weka_hosts (run it from one of those nodes). It currently causes a segmentation fault, but it does successfully uninstall WEKA.

sudo sinfo -S "%n" -o "%n" | tail -n +2 > weka_hosts
pdsh -w ^weka_hosts sudo weka umount /data
pdsh -w ^weka_hosts sudo weka agent uninstall --force
pdsh -w ^weka_hosts sudo reboot

# Wait for the nodes to come up then run:
pdsh -w ^weka_hosts sudo rm -rf /opt/weka
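A hedged check that the uninstall actually completed on every node:

# /opt/weka should be gone and the weka CLI should no longer resolve
pdsh -w ^weka_hosts 'test -d /opt/weka && echo "weka remnants present" || echo "clean"'
pdsh -w ^weka_hosts 'command -v weka || echo "weka CLI removed"'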

Useful Commands

All the WEKA commands are decently self-documented (see the -h option), but some useful ones are listed below.

# Gives overall status of the cluster and file systems
weka status

# gives info and stats on all the processes in the cluster. Has more tools for filtering etc with -h
weka cluster process
weka cluster process --filter status=down
weka cluster process --filter role=DRIVES
weka cluster process --filter hostname=compute-13

# Useful per node performance stats
watch weka stats realtime -s -cpu -o node,hostname,role,writeps,writebps,wlatency,readps,readbps,rlatency,ops,cpu,l6recv,l6send

# Viewing all the drives and their status. Can filter, sort etc (see -h)
weka cluster drive

# Shows basic stats about all the servers
weka cluster servers list

# Can then see specifics using a uid from the list command
weka cluster servers show SERVER_UID

# Can view status of the containers (see -h for formatting and sorting options)
weka cluster container

# View any active alerts about nodes being down and the like:
weka alerts

# If Segmentation faults are happening it can be good to ensure the WEKA agent is running
sudo weka --agent

# View status of containers on the current node
weka local ps

# Manage resources per container on a local node (cores, ram, network card, etc)
weka local resources -h

File System Benchmarking

The benchmarking is slightly less automated and requires configuration for the specific system you are running it on. First clone the benchmarking repo. In the folder the repo creates, the script poc_tests/benchmark.sh runs a full benchmarking suite. Before running it, you need to install fio (via yum or nix) and fio-plot, then run fio --server on each node you want the benchmark to generate IO from (i.e. the nodes that will access the WEKA filesystem and perform IO). Also update the poc_tests/hosts.txt file to list the hostnames of those nodes.
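A hedged setup sketch (package names, hostnames, and log paths here are examples; adjust to the actual repo layout and distro):

# Install the IO tools on the node you will launch the benchmark from
sudo yum install -y fio
pip install fio-plot          # also provides the bench-fio tool used by the long benchmarks
# Start an fio server on every node that should generate IO (hostnames are examples)
pdsh -w compute-01,compute-02 'nohup fio --server > /var/tmp/fio-server.log 2>&1 &'
# List the same hostnames in poc_tests/hosts.txt, one per line
printf 'compute-01\ncompute-02\n' > poc_tests/hosts.txt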

The benchmarks:

- First it runs a single-client benchmark (from the client the command is run on) against /data. These results are written to poc_tests/results/fio_weka_single_client.out.
- Next it runs the same benchmark suite with all the hosts specified in hosts.txt. These results are written to poc_tests/results/fio_weka_multi_client.out.
- Finally it starts the long benchmarks. These are full performance characterizations that thoroughly test the system but take many hours to run. If the benchmarks make it to this point without failing, the system is almost certainly stable, but we don't have perfect numbers for comparing and analyzing performance. This stage uses the bench-fio tool to perform a broad-spectrum analysis from a single client and from the clients specified in hosts.txt, and outputs the results to directories under poc_tests/results.

Parsing bench-fio Results

This is done with the two Python scripts located at the top level of the repo. Be sure to set the parameters folder_name (batch size), iodepth_values, numjobs_values, and rw_values to match those configured in benchmark.ini and benchmark_all_hosts.ini for the normal import and import_multihost respectively. You will also want to double-check all the directory paths in the Python scripts and make sure they resolve properly. Once you have done that, you should be good to go with all the benchmarking tools. The output CSVs can be imported into Google Sheets and formatted however you like (examples).