Slurm is one of the most popular workload managers used by the HPC community, and is present in ~50% of the Top 100 Supercomputers in the world. Slurm provides an open-source, fault-tolerant, and highly-scalable workload management and job scheduling system for small and large Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained.

As a cluster workload manager, Slurm has three key functions:

It allocates exclusive or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
It arbitrates contention for resources, and maintains a high utilization of a given set of resources by managing a queue of pending work.

In 2017 Google Cloud partnered with SchedMD, the commercial backers and maintainers of Slurm, to co-develop Google Cloud-native capabilities for Slurm, and release it publicly as Open Source Software on Github. The Slurm on Google Cloud software supports three usage models:

A stand-alone auto-scaling Slurm Cluster in Google Cloud
Hybrid burst from on-premise to Google Cloud
Federated multi-cluster environment with stand-alone Slurm cluster(s) in Google Cloud

Click here to see the public Github repository, including a detailed README about the project.

Click here to watch a video overview of the Slurm on GCP integration, and a demonstration of them in action by SchedMD’s Director of Training Shawn Hoopes.

This document describes the deployment, configuration, operation and maintenance of the Slurm on GCP software, and is intended for administrators and users of the Slurm on GCP software.

If you have feedback on this document, please post on the Google Cloud Slurm Discussion Group: https://groups.google.com/g/google-cloud-slurm-discuss.

Configuring Slurm

Terraform Configuration

With the Slurm on GCP V4 scripts, Terraform is now the main tool used to automate the deployment of a Slurm cluster. Specific to the Slum deployment, there is an example tfvars file that shows available deployment options. These fields determine how the Slurm cluster is deployed, configured, and managed. This section outlines the available configuration fields.

Summary

Below is a list of the fields offered by the TFVars, including the field’s type, and a short description of the purpose of each field. Links to more information in public Google Cloud documentation, relevant sections of this User Guide, and external websites have been included. Each field has the format “name (type): Description”.

cluster_name (string): Name of the Slurm cluster.
project (string): Google Cloud Project ID for where resources will be deployed.
zone (string): Google Cloud zone (i.e. “us-central1-a”) which will contain the controller and login instances of this cluster. If “regional_capacity” is enabled in the partition configuration, this can also be the Google Cloud region (i.e. “europe-west4”).
network_name (string): Virtual Private Cloud network to deploy the Slurm cluster into.
subnetwork_name (string): Virtual Private Cloud subnetwork to deploy the Slurm cluster into. Be sure to also specify this subnetwork name or another subnetwork name using the “vpc_subnet” field in your partitions!
shared_vpc_host_project (string): Shared VPC network to deploy the Slurm cluster into.
disable_controller_public_ips (boolean): Assign an external IP to the Slurm controller?
disable_login_public_ips (boolean): Disable external IPs to the Slurm login node?
disable_compute_public_ips (boolean): Disable external IPs to the Slurm login node?
suspend_time (integer): Slurm “SuspendTime”. Time to wait after a node is idle before suspending the node.
controller_machine_type (string): Controller node instance type.
controller_image (string): Google Cloud Image used to create the Slurm controller instance.
controller_disk_type (string): Controller instance’s boot persistent disk type.
controller_disk_size_gb (integer): Controller instance’s boot persistent disk size.
controller_labels (object): Label(s) to attach to the controller instance.
controller_service_account (string): Service account to be used on the controller instance.
controller_scopes (comma separated strings): Access scope of the controller instance.
cloudsql (object): Google CloudSQL, or other SQL-compatible server to use as the Slurm database instead of hosting a database on the controller instance.

server_ip (string): Server IP
user (string): Username
password (string): Password
db_name (string): Database name

controller_secondary_disk (boolean): Add a secondary disk for NFS server storage?

controller_secondary_disk_type (string): Controller instance’s secondary disk persistent disk type. Types available:

pd-standard
pd-ssd
pd-balanced
pd-extreme

controller_secondary_disk_size_gb (integer): Controller instance’s secondary disk persistent disk size.
controller_instance_template (string): The GCP instance template to use for the controller instance. See the “Instance Templates” section for more information.
login_machine_type (string): Login (interaction) node instance type.
login_image (string): Google Cloud Image used to create the Slurm login instance.
login_disk_type (string): Login instance boot persistent disk type. See “controller_secondary_disk_type” for more information.
login_disk_size_gb (integer): Login instance boot persistent disk size
login_labels (object): Label(s) to attach to the login instance
login_node_count (integer): Number of login nodes to create
login_node_service_account (string): Service account to be used on the login instance(s)
login_node_scopes (comma separated strings): Access scope of the login instance
login_instance_template (string): The GCP instance template to use for the login instance. See the “Instance Templates” section for more information.

network_storage (object): Network storage to mount on all nodes. Each network_storage object can be repeated for additional mounts. See the “Network Storage” section for more detail.

server_ip (string): Storage server IP
remote_mount (string): Storage mount name (filesystem name). Include the leading “/”.
local_mount (string): Local mount directory. Include the leading “/”.
fs_type (string): Filesystem type. NFS, CIFS, Lustre, GCSFuse installed by default. To add support for a filesystem type, see Adding a new file system type.
mount_options (string): Comma separated filesystem mount options. By default, the “defaults,_netdev” options will be included.

login_network_storage (object): Network storage to mount on login and controller nodes only. See the “Network Storage” section for more detail
compute_node_service_account (string): Service account to be used on the compute instance(s)
compute_node_scopes (comma separated strings): Access scope of the compute instances
partitions (object): Slurm partition configuration. Can be repeated for additional partitions.

name (string): Slurm partition name
machine_type (string): Compute node(s) instance type
static_node_count (integer): Number of always-on compute nodes
max_node_count (integer): Maximum number of total compute nodes allowed in this partition.
zone (string): Google Cloud zone (i.e. “us-central1-a”) which will contain the resources of this partition
image (string): Google Cloud Image used to create the partition’s compute instances
image_hyperthreads (boolean): Enable Symmetric Multithreading (SMT) on the instance?
compute_disk_type (string): Compute instance’s boot persistent disk type. See “controller_secondary_disk_type” for more information.
compute_disk_size_gb (integer): Compute instance’s boot persistent disk size.
compute_labels: Label(s) to attach to the compute instance
cpu_platform (string): Minimum CPU platform required for all compute nodes
gpu_count (integer): Number of GPUs to attach to each instance in the partition
gpu_type (string): GPU type to attach to the partition’s instances. Types available:

nvidia-tesla-a100
nvidia-tesla-t4
nvidia-tesla-v100
nvidia-tesla-p100
nvidia-tesla-p4
nvidia-tesla-k80
nvidia-tesla-t4-vws
nvidia-tesla-p100-vws
nvidia-tesla-p4-vws

network_storage (object): Network storage to mount on all compute nodes in the partition. See the “Network Storage” section for more detail.
preemptible_bursting (boolean): Will the instances be preemptible instances?
vpc_subnet (string): Virtual Private Cloud subnetwork to deploy the Slurm partition into. This field is required if “network_name” and “subnetwork_name” are specified, the partition’s resource creation will fail without it!
exclusive (boolean): Enables Slurm’s exclusive option for the entire partition, which prevents a node from being shared by multiple jobs, ensuring 1 node is used for 1 job.
enable_placement (boolean): Enable placement policies where instances will be located close to each other for low network latency between the instances.
regional_capacity (boolean): Enables the Google Cloud Bulk API, which improves large-scale VM creation performance, and allows an instance to be placed in any zone in the region based on availability. See the “Bulk API” section for more information.
regional_policy (object): If “regional_capacity” is true, users can define a location policy to define any zones to not use.
Instance_template (string): The GCP instance template to use for compute instances. See the “Instance Templates” section for more information.

Network Storage

The Network Storage YAML dictionaries allow the YAML configuration to specify one or multiple network storage devices to be automatically mounted. Fields will be added directly to fstab, and align closely with the fstab structure, and the mount command. For example, the fields below would be used in a mount command this way: “mount -t fs_type -o mount_options server_ip:remote_count local_mount”.

There are three places in the YAML configuration file where you can specify a “network_storage” type YAML dictionary. Two outside of partitions, one specifically for the controller and login node(s) called “login_network_storage”, one for all nodes in the cluster called “network_storage”, and one inside of the partitions dictionary, which can specify network_storage configurations per partition. This allows users to easily specify network storage that should be available only to controller and login nodes (admin storage), network storage available to all nodes in the cluster (home storage), and network storage available only to specific partitions (user-specific scratch).

The “network_storage” dictionaries contain five fields, which can be repeated multiple times for multiple entries. These fields include:

server_ip: Storage server IP (i.e. “192.169.0.10”). For Filestore this can be found after creating the filestore instance. We recommend IP over DNS name.
remote_mount: Remote storage mount point. This is your filesystem or share name preceded by a slash (i.e. “/scratch”). Be sure to include the leading “/”.
local_mount: Local mount directory to mount over (i.e. “/mnt/scratch”). This directory will be created if it is not already. Be sure to include the leading “/”.
fs_type: Filesystem type. NFS, CIFS, Lustre, GCSFuse installed by default. To add support for a filesystem type, see Adding a new file system type.
mount_options: Comma separated filesystem mount options. By default, the “defaults,_netdev” options will be included.

Filestore Example

For example, let’s take the following Filestore deployment as a storage device and configure a network_storage field for it:

Below is the network_storage field which would mount this filestore device, assuming that the Filestore is paired with or shared the same VPC as is defined in the vpc fields in the YAML configuration file.

network_storage :

- server_ip: 10.18.193.58

remote_mount: /home

local_mount: /home

fs_type: nfs

mount_options:

Above we set the “server_ip” as the Filestore “IP address”; the “remote_mount” field as the “File share name”; “local_mount” as “/home” since we’ll be using this Filestore as our home directory storage, and will mount over the “/home” directory on the cluster; “fs_type” as “nfs” reflecting that Filestore is an NFS-based storage system; and left “mount_options” blank as the defaults of “defaults,_netdev” are satisfactory for our uses.

Google Cloud Storage Bucket Example

For example, let’s take the following Google Cloud Storage (GCS) bucket as a storage device and configure a network_storage field for it:

Below is the network_storage field which would mount this GCS bucket using Cloud Storage FUSE (GCSFuse), assuming that the service account used on the compute nodes has permissions to access that bucket.

network_storage :

- server_ip: none

remote_mount: test_bucket

local_mount: /mnt/test_bucket

fs_type: gcsfuse

mount_options: file_mode=664,dir_mode=775,allow_other

Above we set the “server_ip” to “none” because we specify the bucket name in “remote_mount”; we set the “remote_mount” field as the GCS bucket name, “test_bucket” here; “local_mount” is set as “/mnt/test_bucket”; “fs_type” as “gcsfuse”, this will install gcsfuse during deployment; and set “mount_options” to “file_mode=664,dir_mode=777,allow_other” to set posix permissions to allow all users and groups on the cluster to have read/write access the storage.

NOTE: In order to allow auto-scaling VMs with GCSFuse configured as a storage type, you must specify “https://www.googleapis.com/auth/devstorage.full_control” to the “scopes” TFVars fields for the instances that are using GCSFuse (controller_scopes, login_node_scopes, compute_node_scopes).

Instance Templates

Instance Templates are supported in the Slurm TFVars fields to define the Controller (“controller_instance_template”), Login (“login_instance_template”), and Partition instances (“instance_template”). Instance Templates can be used to easily define a reusable instance configuration, consisting of configuration options including machine type, machine size, GPUs, disks, networking, metadata, labels, and more.

There are some options in Google Compute Engine (GCE) which may not be exposed directly in the Slurm TFVars fields. For example, Local SSD options are not available in the partition configuration. There may also be new features or products released in the future that may not have support for them in the Slurm TFVars fields. These options are able to be enabled and utilized in a Slurm cluster using the Instance Template option, but configuring an Instance Template with the features and configuration desired, and specifying that Instance Template ID in the relevant instance’s or partition’s “instance_template” configuration.

NOTE: Any compute fields specified in the TFVars will override the template properties.For example, if “controller_image” is specified, it will overwrite the image in the instance template.

Bulk API

The Bulk API enables two features to improve scalability and reliability of large-scale instance creation in the Google Compute Engine (GCE) API. First, it supports creation of up to 1,000 instances in a single API call. Second, Bulk API supports regional creation, allowing capacity-finding abilities across zones in a region.

Supporting up to 1,000 instances per API call reduces cluster deployment time versus the previous 1:1 API call to instance creation model. In tests of scaling up a 5,000 node cluster with and without Bulk API, Google saw an improvement in the cluster spin-up time of 500%.

Regional creation changes a GCE creation API call from a call to a single zone, which can lead to “Resource Exhaustion” errors in cases of stockouts of the required configuration in that zone, to a regional API call which will round-robin the zones in the specified region in case one zone cannot satisfy the entire request. Currently, the Bulk API will place your entire request within a single zone, and does not support multi-zone deployment. This can help alleviate delays and failures in deploying large instances, GPU VMs, and so on.

Enable Bulk API in Slurm on GCP using the “regional_capacity” and “regional_policy” options. Setting “regional_capacity” to TRUE enables Bulk API, and the partition’s “zone” field will either be parsed for the region if a full zone is provided (“us-central1-a”), or can accept a region (“us-central1”). The “regional_policy” field can be used to limit the Bulk API call using the location policy.

Deploying Slurm

There are two basic ways to run Slurm clusters, depending on the requirements of your environment. The simplest is to run a stand-alone cluster on GCP. The second is to run an on-premise cluster connecting to GCP as required to meet additional needs.

Stand-alone auto-scaling Slurm Cluster

Rather than repeat the instructions on deploying a stand-alone Slurm cluster, you can find various resources describing the deployment of a simple, stand-alone auto-scaling Slurm cluster on GCP:

Deploying a Slurm cluster on Compute Engine
Slurm on GCP Github Readme
Google Cloud Codelab
Qwiklab (requires a Qwiklab login and credits)

Controller configuration recommendations

When considering the machine type and size of your controller instance, take this guidance into account:

VM families with fewer cores and a higher clock speed will perform better than more cores at a lower clock speed. This means that the Compute Optimized VMs, such as the C2 VM family, are a better choice than the General Purpose VMs, such as the N2D VM family.
At least four vCPUs are recommended for the controller instance.
PD-SSD disks are recommended over PD-Standard when the Slurm controller is hosting the SlurmDB and the NFS share.
At least 32GB of RAM is recommended for the Slurm on GCP controller, which hosts the slurmctld daemon and also typically hosts the slurmdbd daemon.

RAM requirement increases as the number of jobs increases.

Your controller instance sizing should also increase with the number of total instances the controller will manage. Here are some guidelines for sizing your controller instance according to the number of compute nodes in the cluster:

Number of Compute Nodes	Recommended Controller Instance Type
50	c2-standard-4
400	c2-standard-8
1000	c2-standard-16
2000	c2-standard-16
4500	c2-standard-30

Hybrid burst from on-premise

Bursting out individual Virtual Machines from an on-premise Slurm cluster to a Google Cloud Project is done using the Power Saving feature in Slurm, specifically the ResumeProgram and the SuspendProgram parameters in the slurm.conf. In the example provided, these parameters point to the Python programs, resume.py and suspend.py, found in the scripts directory. Your config.yaml should be configured so that the scripts can create and destroy compute instances in a GCP project. See Cloud Scheduling Guide for more information.

Prerequisites

Slurm installed on-premise
A GCP Project with VPC network and subnet (the Default VPC is fine)
VPN between On-Premise and GCP

Working bi-directional communications between GCP and on-premise cluster

Create bi-directional DNS between on-premise and GCP according to the DNS Best Practices
Open ports on the on-premise firewall:

Slurmctld (default: Ports 6817)
Srun (default: Port 6818, outbound)
Slurmdbd (default: Port 6819)
SrunPortRange (typically: Ports 60001-63000)

Open firewall ports in GCP for NFS from on-premise:

Port 111 (TCP and UDP) and 2049 (TCP and UDP)

Configuration Steps

Node Addressing

There are two options:

Configure DNS peering between the on-premise network and the GCP network.
Configure Slurm to use NodeAddr to communicate with cloud compute nodes.

In the end, the slurmctld and any login nodes should be able to communicate with cloud compute nodes, and the cloud compute nodes should be able to communicate with the controller.

Configure DNS peering

GCP instances need to be resolvable by name from the controller and any login nodes.
The controller needs to be resolvable by name from GCP instances, or the controller IP address needs to be added to /etc/hosts. Refer to the DNS Best practices.

Use IP addresses with NodeAddr

Remove “cloud_dns” from the “SlurmctldParameters” field in the on-premise slurm.conf, if it is being used. Also, add “cloud_reg_addrs” to the “SlurmctldParameters” field.
Disable hierarchical communication in slurm.conf by setting “TreeWidth=65533”
Set “update_node_addrs” to “true” in config.yaml
Add controller's ip address to /etc/hosts on compute image

Create a compute node image

Create a Compute Instance

Create a new instance using the Slurm on GCP public operating system image:

gcloud compute instances create slurm-image

--zone us-central1-a

--image-family schedmd-slurm-20-11-4-hpc-centos-7

--image-project schedmd-slurm-public

Install and configure any additional packages that you are used to for a Slurm compute node.

Create a Compute Image

Then create an image from it, creating a family either in the form "<cluster_name>-compute-#-image-family", or in a name of your choosing:

gcloud compute images create slurm-compute-0-image-v1 \

--source-disk slurm-image \

--source-disk-zone us-central1-a \

--family slurm-compute-cloud-image-family

Create Service Accounts

In order to securely allow each instance type (Compute, Controller, Login) to perform their duties (adding/removing nodes, accessing cloud storage and other APIs), we must configure at least two Service Accounts with different levels of IAM permissions: the controller’s service account, and login and worker node service accounts.

Create a service account that will have permissions to create and delete instances (e.g. with the “compute.instanceAdmin.v1” role) in the remote project.

gcloud iam service-accounts create sa-name \

--description="sa-description" \

--display-name="sa-display-name"

Add roles to the service account that you just created that will allow the service account to create and delete instances. The “compute.instanceAdmin.v1” role provides this permission. Run this command to give the “compute.instanceAdmin.v1” role to the Service Account:

gcloud project add-iam-policy-binding <My Project ID> \

--member=sa-name --role=roles/compute.instanceAdmin.v1

On your on-premise Slurm controller, create a service account key that we will use to authenticate to Google Cloud from our on-premise.

gcloud iam service-accounts keys create /shared/slurm/scripts/service_account.key --iam-account=sa-name

We will configure Slurm to use the “/shared/slurm/scripts/service_account.key” file in another section.

Install scripts

Clone the slurm-gcp git repo to your Slurm controller node.

NOTE: This assumes internet access from the Slurm controller node.

git clone https://github.com/SchedMD/slurm-gcp.git

In the scripts directory, copy the resume.py, suspend.py, slurmsync.py and config.yaml.example from the slurm-gcp repository's scripts directory to a location on the slurmctld.

sudo cp resume.py suspend.py slurmsync.py config.yaml.example /apps/slurm/scripts

Rename config.yaml.example to config.yaml and modify the highlighted values.

cluster_name: slurm

project: slurm-184304

region: us-central1

zone: us-central1-a

external_compute_ips: false

google_app_cred_path: /shared/slurm/scripts/service_account.key

shared_vpc_host_project: <Shared VPC Project>

vpc_subnet: <Shared VPC Subnet>

slurm_cmd_path: /shared/slurm/current/bin

log_dir: /var/log/slurm

compute_node_prefix: <compute_node_prefix>

compute_node_scopes:

- https://www.googleapis.com/auth/monitoring.write

- https://www.googleapis.com/auth/logging.write

compute_node_service_account: sa-name

update_node_addrs: true

partitions:

- name: cloud

machine_type: n1-standard-16

zone: us-central1-a

compute_disk_size_gb: 20

compute_disk_type: pd-standard

compute_labels: null

gpu_count: 0

max_node_count: 20

preemptible_bursting: false

static_node_count: 0

network_storage: []

compute_image_family: slurm-compute-cloud-image-family

vpc_subnet: <VPC Subnet>

Ensure that you specify the path of the service account key we created in the “google_app_cred_path” field in config.yaml. Also, specify the “compute_image_family” field for each partition if different than the naming schema, "<cluster_name>-compute-#-image-family".

Configure Slurm

Modify the on-premise cluster’s slurm.conf to add the following configurations. Specifically, this includes specifying the parameters required for Slurm PowerSave, modifying the SlurmctldParameters to resolve cloud nodes currently, adding burst partition(s), and adding a cronjob for a Slurm State Sync script. For more details on the below options, see the Power Saving documentation.

Modify Slurm Controller Configuration

Be sure to modify the highlighted values:

PrivateData=cloud

## Slurm PowerSave Configuration

SuspendProgram=/path/to/suspend.py

ResumeProgram=/path/to/resume.py

ResumeFailProgram=/path/to/suspend.py

SuspendTimeout=600

ResumeTimeout=600

ResumeRate=0

SuspendRate=0

SuspendTime=300

## Tell Slurm to not power off non-cloud nodes. By default, it will try to power off all nodes including on-prem nodes.

## SuspendExcParts will probably be the easiest one to use, and will exclude entire partitions.

## SuspendExcNodes specifies individual or ranges of nodes.

#SuspendExcNodes=

#SuspendExcParts=

## Slurm Controller configuration

SchedulerParameters=salloc_wait_nodes

## If using Cloud DNS, uncomment this line

#SlurmctldParameters=cloud_dns,idle_on_node_suspend

## If not using Cloud DNS, uncomment this line

#SlurmctldParameters=idle_on_node_suspend

CommunicationParameters=NoAddrCache

LaunchParameters=enable_nss_slurm

SrunPortRange=60001-63000

Note the “SuspendExcParts” and “SuspendExcNodes”. Make sure and add any nodes or partitions to these lists that are not cloud nodes or Slurm will set them to a power saving state.

Add Burst Partition(s)

Next you will need to add one or multiple cloud partitions to the end of your slurm.conf, where partitions are defined. Add a new section with the following contents, matching them with what you configured in config.yaml earlier. Be sure to modify the highlighted values:

NodeName=DEFAULT CPUs=16 RealMemory=59240 State=UNKNOWN

NodeName=slurm-compute-cloud-[0-19] State=CLOUD

PartitionName=cloud Nodes=slurm-compute-cloud-[0-19] MaxTime=INFINITE State=UP DefMemPerCPU=3702 LLN=yes

Slurm will not burst beyond the number of nodes configured.

Restart Slurm Daemons

Once your changes to slurm.conf have been made, you need to restart the Slurm Controller Daemon and Slurm Daemons on all instances in the cluster.

To restart the SlurmCtld daemon on the controller node, log in to the controller node and run the following command:

sudo systemctl restart slurmctld

To restart the Slurmd daemon on the compute nodes, you can run the following command with the online instances specified by the -w flag:

sbatch -w <node range> --wrap=”srun sudo -i systemctl restart slurmd”

Once the daemons are restarted, sinfo and scontrol should reflect your updates and you should be able to deploy instances with your new configurations.

Configure Slurm State Sync

Add a cronjob/crontab to call slurmsync.py to be called by SlurmUser:

*/1 * * * * /path/to/slurmsync.py

This script ensures that the Slurm cluster state matches the actual instance state according to Google Cloud.

Test the configuration

Manual Test

First, try creating and deleting instances in GCP by calling the resume and suspend scripts directly as SlurmUser:

su slurm

./resume.py slurm-compute-cloud-0

This should create an instance called slurm-compute-cloud-0, with the configuration you specified in config.yaml. You can see this instance in the Google Cloud Console, or by running this command:

gcloud compute instances list

In order to delete that instance, run this command:

./suspend.py slurm-compute-cloud-0

This should delete the instance slurm-compute-cloud-0. You can see this instance has been deleted in the Google Cloud Console, or by running this command:

gcloud compute instances list

Slurm Job Test

Next, try launching a job to the cloud burst partition(s) you’ve created:

srun -p cloud -N1 hostname

This will create a single instance in the cloud partition, and run hostname on it. If the node is created in the cloud and the job succeeds, you’ve configured your Hybrid cloud-bursting partition correctly!

Users and Groups in a Hybrid Cluster

The simplest way to handle user synchronization in a hybrid cluster is to use Slurm’s nss_slurm plugin. This permits passwd and group resolution for a job on the compute node to be serviced by the local slurmstepd process, rather than some other network-based service. It works by sending user information from the controller for each job and is handled on the compute instance by the slurm step daemon.

The nss_slurm plugin needs to be installed on the compute node image, when the image is created. The Slurm HPC image already has nss_slurm installed. Check the Slurm documentation for details on how to configure nss_slurm.

Installing Applications

There are three methods available to install applications on a Slurm cluster on GCP. First, installing software at deployment time, through the custom installation scripts. Second, installing software to the shared NFS server at /apps. Third, is using environment modules.

Custom Installation Scripts

Slurm offers custom installation scripts for users to define a set of commands to be run either deployment time, which allows them to execute arbitrary commands to configure instances, download or compile software, and so on.

There are two files, “custom-controller-install” and “custom-compute-install”, in the “scripts” directory that can be used to add custom installations for the given instance type. The “custom-controller-install” script will run once on the controller instance at deployment time, and the “custom-compute-install” scripts will run on each partition’s compute-image instance before an image is made of that instance’s disk.

The files will be executed as root during startup of the instance types, and can either be run as Python or Bash scripts by specifying an interpreter in the first line of the script:

Bash:

#!/bin/bash

Python:

#!/bin/python

If using the custom installation scripts to compile software, it is advisable to use the custom-controller-install script for any commands performed on a common file system that are to be run only once (for example, compiling a common piece of software). This is because the custom-compute-install scripts will be run once per partition, so a cluster with 5 partitions will run the same command five times, which may have unintended side effects. From within the script, it is possible to check which number partition that instance is using a command like “hostname | rev | cut -d’-’ -f2 | rev”, which will return the partition index. The partition index is the second to last number in the node hostname. For example, an instance in the zero’th partition would be named “slurm-compute-0-0”, an instance in the first partition would be named “slurm-compute-1-0”, and the third instance in the first partition would be named “slurm-compute-1-2”.

Install to NFS

You can find instructions on how to install software to the shared NFS server at /apps, or any other shared storage, here.

Cluster Management

User Authentication

OS Login

The OS Login tool is a Google Cloud-specific daemon which provides your Google Directory credentials to compute instances in order to maintain consistency in user attributes like UID/GID and Username. This can replace traditional systems like Active Directory and LDAP.

OS Login is enabled by default on all Slurm instances. You can tell you are using OS Login because, unless specified otherwise for your user in Google Directory, your username when logged in to instances will appear as your full email address, including domain, with special characters like “@” and “.” replaced with underscores (“_”). For example, a Google Cloud user at “someone@domain.com” will have the username “someone_domain_com”. Users from external organizations will have “ext_” prepended to their username. For example, a Google Cloud user at “someone@external.com” will have the username “ext_someone_external_com”.

IAM Permissions

A user’s IAM permissions determines how access to the Slurm cluster is handled.

Adding or Modifying a Partitions

The process for adding a new partition or modifying an existing partition’s configuration is straightforward, and requires modifying two files and restarting the Slurm daemons on all cluster nodes. An administrator may want to add a partition or modify a partition’s configuration on a live cluster in order to increase or decrease the maximum number of instances in a partition, or modify the specific configuration of a partition.

Modify config.yaml

First, modify the config.yaml file located at “/slurm/scripts/config.yaml”. This file defines the partition configurations for all the partitions in the cluster, and is where Slurm pulls the partition and instance configuration from every time it bursts a new instance. This file contains a number of fields. See the "Terraform Configuration Summary” section for more details on those fields.

Note that the order the partitions refers to the naming of the nodes. e.g.

<compute_node_prefix> - <pid> - <nid>

where <pid> represents the partitions index (0 based) in config.yaml.

Once you’ve modified the values in config.yaml and saved the file, you may need to modify slurm.conf to match these changes as well.

Modify slurm.conf

Open slurm.conf on the controller instance at “/usr/local/etc/slurm/slurm.conf”. At the bottom of the file you will find a section marked “# COMPUTE NODES”, with the partition definitions, as seen below:

# COMPUTE NODES

NodeName=DEFAULT CPUs=4 RealMemory=15504 State=UNKNOWN

NodeName=slurm-demo-compute-0-[0-9] State=CLOUD

PartitionName=covm Nodes=slurm-demo-compute-0-[0-9] MaxTime=INFINITE State=UP DefMemPerCPU=3876 LLN=yes Default=YES

If you’ve changed information in config.yaml that is also defined here, including the machine type, partition name, or maximum node count, you will need to update the information in slurm.conf as well. This ensures that Slurm is aware of the changes.

For example, if you’ve changes the machine type from a c2-standard-4 defined above to an c2-standard-8, you would need to modify “CPUs=4” to “CPUs=8”, and double the amount of Memory by changing “RealMemory=15504” to “RealMemory=31008” and “DefMemPerCPU=3876” to “DefMemPerCPU=7752”. If you changed the maximum node count for the partition from 10 to 20, you would need to modify “NodeName=slurm-demo-compute-0-[0-9]” to “NodeName=slurm-demo-compute-0-[0-19]”.

Once you’ve modified Slurm.conf to reflect your changes, you need to restart the Slurm daemons to propagate the changes. SlurmCtld daemon on the controller node, and the Slurmd daemon on the compute/login nodes.

Restart Slurm Daemons

In order for the cluster to take the changes, you must restart the SlurmCtld daemon on the controller node, and the Slurmd daemon on the compute/login nodes.

To restart the SlurmCtld daemon on the controller node, log in to the controller node and run the following command:

sudo systemctl restart slurmctld

To restart the Slurmd daemon on the compute nodes, you can run the following command with the online instances specified by the -w flag:

sbatch -w <node range> --wrap=”srun sudo -i systemctl restart slurmd”

Once the daemons are restarted, sinfo and scontrol should reflect your updates and you should be able to deploy instances with your new configurations.

Resizing shared storage

By default the /home and /apps directories are mounted on NFS storage hosted on the controller instance. These can also be replaced by external storage using the “network_storage” and “login_network_storage” tfvars fields.

If using the default controller-hosted shared storage, it may be necessary to increase or decrease the size of the shared storage, which can be done by modifying the size of the controller’s disk hosting the shared storage. In the case that your requirements grow beyond the size or performance of a single Persistent Disk, we recommend that you consider options including Filestore, which is Google Cloud’s fully managed NFS service, NetApp, Dell EMC, DDN EXAScaler, or self-managed Lustre.

To resize the controller instance’s disk which hosts the shared storage, follow the process described here. This process can be done online.

Adding a new file system type

In order to add a new file system type you simply need to install the client software so that the desired instances can mount the file system, and add the appropriate entry to the “network_storage” fields in the tfvars file.

You can install the client software either by adding the file system client installation steps in the custom install scripts (best if using a user-space client), by building a new image from the default Slurm image with the client software installed, or by customizing the setup.py script in the foundry scripts to install the client software if you’re using Slurm’s image foundry to create images.

For example, if a user wanted to install the CernVM file system client to their controller and compute nodes, they could write these lines to their “custom-compute-install” and “custom-controller-install” scripts:

#!/bin/bash

sudo yum install https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest.noarch.rpm

sudo yum install -y cvmfs

This code would execute at the compute and controller instance’s startup, and add the CernVM FS RPM and client software on the instance.

Once the client software is configured to be installed, the tfvars file can be configured the same way that a fstab entry would be, including the “fs_type” field, which is a 1:1 translation to fstab’s “type” field.

For example, if using a CernVM file system hosted at 10.0.0.10, you can use the following “network_storage” entry to mount it:

network_storage :

- server_ip: 10.0.0.10

remote_mount: /cernvmfs

local_mount: /mnt/cernvmfs

fs_type: cvmfs

mount_options:

This tfvars entry would add an fstab entry with the server IP configured to 10.0.0.10, the remote file system name as “cernvmfs”, and the local mount directory as “/mnt/cernvmfs”, with a file system type of “cvmfs”, and no mount options provided, which will default to use “defaults,_netdev”.

Running Jobs

There are two ways to run jobs within the Slurm workload manager: srun, and sbatch.

There are also multiple considerations to take into account when running jobs, depending on the resources required. We will cover some of these topics below.

GPU Jobs

In order to execute a job with GPUs you must include the “--gpus-per-node” option in the options of the sbatch script or srun or salloc command. Without specifying the number of GPUs Per Node that you wish to allocate, Slurm’s cgroups will prevent access to any GPUs on the VM, even if there are GPUs provisioned and you are the only user on the VM.

For example, the following command with the “--gpus-per-node” option will run nvidia-smi properly on a GPU compute node run in a “gpu” partition:

sbatch -p gpu --gpus-per-node=1 --wrap=”nvidia-smi”

However, if that command is run without the “--gpus-per-node” option, or on a compute node without GPUs provisioned, the nvidia-smi command will fail.

Cluster Monitoring

The cluster’s status has two components, the Slurm Cluster’s status, and the Cloud Infrastructure’s status. The Slurm Cluster’s status can be viewed using the Slurm CLI. The Cloud Infrastructure’s status can be viewed using the Google Cloud Console and Google Cloud API, specifically using the Google Cloud Ops Suite.

Slurm Cluster Status

Slurm uses a number of Command Line Interface (CLI) tools to monitor various parts of the Slurm Infrastructure, including jobs, and nodes.

Jobs

Slurm uses squeue and scontrol to view job queues and specific job information.

You can view all your Slurm Queues and jobs by running the following command from the controller node with a valid Slurm user:

squeue

This command will return information about any jobs that are in the Slurm queues, and the output includes the following fields:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

You may also use various options to control the output of squeue. Common options used are “-A” to specify to show only jobs submitted by a specific account, and “-p” to specify to show only jobs submitted to a specific partition.

You can find more information about how to use squeue here.

You can view detailed information of a specific job by running the follow command from the controller node with a valid Slurm user:

scontrol show job <JOB_ID>

The scontrol command will return detailed information about the job:

JobId=6 JobName=test

UserId=slurm_user(1000) GroupId=slurm_user(1000) MCS_label=N/A

Priority=4294901755 Nice=0 Account=default QOS=normal

JobState=COMPLETED Reason=None Dependency=(null)

Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

RunTime=00:00:01 TimeLimit=UNLIMITED TimeMin=N/A

SubmitTime=2020-07-14T18:56:09 EligibleTime=2020-07-14T18:56:09

AccrueTime=Unknown

StartTime=2020-07-14T18:59:00 EndTime=2020-07-14T18:59:01 Deadline=N/A

SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-14T18:56:09

Partition=c2-60 AllocNode:Sid=slurm-controller:28360

ReqNodeList=(null) ExcNodeList=(null)

NodeList=slurm-compute-0-0

BatchHost=slurm-compute-0-0

NumNodes=1 NumCPUs=30 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

TRES=cpu=30,mem=119070M,node=1,billing=30

Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

MinCPUsNode=1 MinMemoryCPU=3969M MinTmpDiskNode=0

Features=(null) DelayBoot=00:00:00

OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)

Command=test

WorkDir=/home/slurm_user

Power=

This information includes the account that submitted the job, runtime and time limit of the job, and instances included in the job and details of instance configuration.

You can find more information about how to use scontrol here.

Nodes

Cloud-based nodes have many states in Slurm which indicate their status (see the “Node State” table of the Slurm elastic computing documentation page). These states can be seen in the output of the “sinfo” command, under the STATE column. For example:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

pvm-c2 up infinite 4 alloc g1-compute-0-[0-3]

pvm-c2 up infinite 6 idle~ g1-compute-0-[4-9]

n2d up infinite 20 idle~ g1-compute-1-[0-19]

gpu up infinite 2 idle% g1-compute-2-[0-1]

gpu up infinite 2 mix# g1-compute-1-[2-3]

gpu up infinite 18 idle~ g1-compute-1-[4-19]

In cloud, the important states include:

idle~ indicates that the Slurm node(s) do not have corresponding Google Cloud VMs created, but that VMs can be provisioned if a job is assigned to that node.
mix# indicates that the Slurm node(s) are being powered up, and provisioned in Google Cloud.
alloc indicates that the Slurm node(s) are allocated to one or many jobs.
idle% indicates that the Slurm node(s) are powering down, and being deprovisioned in Google Cloud.
* indicates that the Slurm node(s) are not responsive, and they will be marked down. This can happen during preemption events on Preemptible VMs, or if a node is otherwise unresponsive to the Slurm controller.

In the example above, the pvm-c2 partition has g1-compute-0-[0-3] allocated to job(s) with VMs running, and g1-compute-0-[4-9] idle without VMs running; the n2d partition has all 20 of it’s nodes g1-compute-1-[0-19] idle without VMs running; the gpu partition has g1-compute-2-[0-1] powering down with their VMs being destroyed, as well as g1-compute-1-[2-3] being provisioned for job(s) to run on them, and g1-compute-1-[4-19] idle without VMs running.

Troubleshooting

Logs

The Slurm on GCP scripts have several layers of logs to consider when troubleshooting.

The controller instance maintains a set of logs in “/var/log/slurm”. These include:

resume.log - Contains logs from the VM creation process, includes errors and warnings on resource quotas, resource availability, and permissions
suspend.log - Contains logs from the VM destruction process.
slurmctld.log - Contains logs from the Slurm Controller daemon (slurmctld).
slurmdbd.log - Contains logs from the Slurm Database daemon (slurmdbd).
slurmsync.log - Contains logs from the Slurm Sync cron script, which aligns VM state with Slurm state.

Disable Shielded VMs and vTPM

In the scenario where you need to turn off Shielded VMs, this can disabled by adding the following red, bolded lines in the resume.py file

if instance_def.gpu_count:

config['guestAccelerators'] = [{

'acceleratorCount': instance_def.gpu_count,

'acceleratorType': instance_def.gpu_type

}]

config['scheduling'] = {'onHostMaintenance': 'TERMINATE'}

config['shieldedInstanceConfig'] = [{

'enableIntegrityMonitoring': False

'enableSecureBoot': False

'enableVtpm': False

}]

Support

Support for Slurm on GCP is available either via commercial support from SchedMD, or community-based support from the Slurm on Google Cloud user and developer community.

SchedMD Support

Commercial support is available directly from SchedMD, the commercial backers, developers, and maintainers of Slurm. You can read more about SchedMD’s commercial support offerings and contact information here at their website.

Community-based Support

Community support is available on the Google Cloud Slurm Discuss google group. Questions will be answered on a best-effort basis by community members, including SchedMD and Google employees.