Slurm on Google Cloud
User Guide
A guide to configuring, deploying, and administering Slurm on Google Cloud
_______
Updated: Apr 19, 2022
Self-Link: https://goo.gle/slurm-gcp-user-guide
Feedback: https://groups.google.com/g/google-cloud-slurm-discuss
_______
Table of Contents
Google Cloud Storage Bucket Example
Stand-alone auto-scaling Slurm Cluster
Controller configuration recommendations
Use IP addresses with NodeAddr
Modify Slurm Controller Configuration
Users and Groups in a Hybrid Cluster
Adding or Modifying a Partition
_______
Welcome to the User Guide for the Slurm on Google Cloud software!
Slurm is one of the most popular workload managers used by the HPC community, and is present in ~50% of the Top 100 Supercomputers in the world. Slurm provides an open-source, fault-tolerant, and highly-scalable workload management and job scheduling system for small and large Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained.
As a cluster workload manager, Slurm has three key functions: it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time; it provides a framework for starting, executing, and monitoring work on the allocated nodes; and it arbitrates contention for resources by managing a queue of pending work.
In 2017 Google Cloud partnered with SchedMD, the commercial backers and maintainers of Slurm, to co-develop Google Cloud-native capabilities for Slurm, and release it publicly as Open Source Software on GitHub. The Slurm on Google Cloud software supports three usage models:
Click here to see the public GitHub repository, including a detailed README about the project.
Click here to watch a video overview of the Slurm on GCP integration, and a demonstration of it in action by SchedMD’s Director of Training Shawn Hoopes.
This document describes the deployment, configuration, operation and maintenance of the Slurm on GCP software, and is intended for administrators and users of the Slurm on GCP software.
If you have feedback on this document, please post on the Google Cloud Slurm Discussion Group: https://groups.google.com/g/google-cloud-slurm-discuss.
With the Slurm on GCP V4 scripts, Terraform is now the main tool used to automate the deployment of a Slurm cluster. Specific to the Slurm deployment, there is an example tfvars file that shows available deployment options. These fields determine how the Slurm cluster is deployed, configured, and managed. This section outlines the available configuration fields.
Below is a list of the fields offered by the TFVars, including the field’s type, and a short description of the purpose of each field. Links to more information in public Google Cloud documentation, relevant sections of this User Guide, and external websites have been included. Each field has the format “name (type): Description”.
The Network Storage YAML dictionaries allow the YAML configuration to specify one or multiple network storage devices to be automatically mounted. Fields are added directly to fstab, and align closely with the fstab structure and the mount command. For example, the fields below would be used in a mount command this way: “mount -t fs_type -o mount_options server_ip:remote_mount local_mount”.
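That field-to-mount-command mapping can be sketched as follows (illustrative only; “build_mount_command” is a hypothetical helper, not part of the slurm-gcp scripts):

```python
# Sketch: how the five network_storage fields map onto a mount invocation.
# Parameter names mirror the YAML dictionary keys described in this section.
def build_mount_command(server_ip, remote_mount, local_mount, fs_type,
                        mount_options="defaults,_netdev"):
    # An empty mount_options field falls back to "defaults,_netdev",
    # matching the default behavior described later in this section.
    return (f"mount -t {fs_type} -o {mount_options} "
            f"{server_ip}:{remote_mount} {local_mount}")

print(build_mount_command("10.18.193.58", "/home", "/home", "nfs"))
# mount -t nfs -o defaults,_netdev 10.18.193.58:/home /home
```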
There are three places in the YAML configuration file where you can specify a “network_storage” type YAML dictionary: two outside of partitions — one specifically for the controller and login node(s) called “login_network_storage”, and one for all nodes in the cluster called “network_storage” — and one inside the partitions dictionary, which specifies network_storage configurations per partition. This allows users to easily specify network storage that should be available only to controller and login nodes (admin storage), network storage available to all nodes in the cluster (home storage), and network storage available only to specific partitions (user-specific scratch).
The “network_storage” dictionaries contain five fields, and entries can be repeated multiple times for multiple storage devices. The fields are: “server_ip”, “remote_mount”, “local_mount”, “fs_type”, and “mount_options”.
For example, let’s take the following Filestore deployment as a storage device and configure a network_storage field for it:
Below is the network_storage field which would mount this Filestore device, assuming that the Filestore shares the same VPC as is defined in the vpc fields in the YAML configuration file.
network_storage:
  - server_ip: 10.18.193.58
    remote_mount: /home
    local_mount: /home
    fs_type: nfs
    mount_options:
Above we set the “server_ip” as the Filestore “IP address”; the “remote_mount” field as the “File share name”; “local_mount” as “/home” since we’ll be using this Filestore as our home directory storage, and will mount over the “/home” directory on the cluster; “fs_type” as “nfs” reflecting that Filestore is an NFS-based storage system; and left “mount_options” blank as the defaults of “defaults,_netdev” are satisfactory for our uses.
For example, let’s take the following Google Cloud Storage (GCS) bucket as a storage device and configure a network_storage field for it:
Below is the network_storage field which would mount this GCS bucket using Cloud Storage FUSE (GCSFuse), assuming that the service account used on the compute nodes has permissions to access that bucket.
network_storage:
  - server_ip: none
    remote_mount: test_bucket
    local_mount: /mnt/test_bucket
    fs_type: gcsfuse
    mount_options: file_mode=664,dir_mode=775,allow_other
Above we set the “server_ip” to “none” because we specify the bucket name in “remote_mount”; we set the “remote_mount” field to the GCS bucket name, “test_bucket” here; “local_mount” is set to “/mnt/test_bucket”; “fs_type” is “gcsfuse”, which causes GCSFuse to be installed during deployment; and “mount_options” is set to “file_mode=664,dir_mode=775,allow_other” to set POSIX permissions that allow all users and groups on the cluster read/write access to the storage.
NOTE: In order to allow auto-scaling VMs with GCSFuse configured as a storage type, you must specify “https://www.googleapis.com/auth/devstorage.full_control” to the “scopes” TFVars fields for the instances that are using GCSFuse (controller_scopes, login_node_scopes, compute_node_scopes).
Instance Templates are supported in the Slurm TFVars fields to define the Controller (“controller_instance_template”), Login (“login_instance_template”), and Partition instances (“instance_template”). Instance Templates can be used to easily define a reusable instance configuration, consisting of configuration options including machine type, machine size, GPUs, disks, networking, metadata, labels, and more.
There are some options in Google Compute Engine (GCE) which may not be exposed directly in the Slurm TFVars fields. For example, Local SSD options are not available in the partition configuration. New features or products may also be released in the future before support for them is added to the Slurm TFVars fields. These options can be enabled and used in a Slurm cluster via the Instance Template option, by configuring an Instance Template with the desired features and configuration, and specifying that Instance Template ID in the relevant instance’s or partition’s “instance_template” field.
NOTE: Any compute fields specified in the TFVars will override the template properties. For example, if “controller_image” is specified, it will override the image in the instance template.
The Bulk API enables two features to improve scalability and reliability of large-scale instance creation in the Google Compute Engine (GCE) API. First, it supports creation of up to 1,000 instances in a single API call. Second, Bulk API supports regional creation, allowing capacity-finding abilities across zones in a region.
Supporting up to 1,000 instances per API call reduces cluster deployment time versus the previous 1:1 API-call-to-instance-creation model. In tests scaling up a 5,000-node cluster with and without the Bulk API, Google saw a 500% improvement in cluster spin-up time.
Regional creation changes a GCE creation API call from a call to a single zone, which can lead to “Resource Exhaustion” errors when the required configuration is stocked out in that zone, to a regional API call that will round-robin the zones in the specified region if one zone cannot satisfy the entire request. Currently, the Bulk API will place your entire request within a single zone, and does not support multi-zone deployment. This can help alleviate delays and failures in deploying large numbers of instances, GPU VMs, and so on.
Enable Bulk API in Slurm on GCP using the “regional_capacity” and “regional_policy” options. Setting “regional_capacity” to TRUE enables Bulk API, and the partition’s “zone” field will either be parsed for the region if a full zone is provided (“us-central1-a”), or can accept a region (“us-central1”). The “regional_policy” field can be used to limit the Bulk API call using the location policy.
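The zone-or-region parsing described above can be sketched as follows (illustrative only; “to_region” is a hypothetical helper, and the real parsing lives in the slurm-gcp scripts):

```python
# Sketch: a full zone such as "us-central1-a" is reduced to its region,
# while a bare region such as "us-central1" passes through unchanged.
def to_region(zone_or_region):
    parts = zone_or_region.split("-")
    # A full zone has three dash-separated components (e.g. us-central1-a);
    # a region has only two (e.g. us-central1).
    return "-".join(parts[:2])

print(to_region("us-central1-a"))  # us-central1
print(to_region("us-central1"))    # us-central1
```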
There are two basic ways to run Slurm clusters, depending on the requirements of your environment. The simplest is to run a stand-alone cluster on GCP. The second is to run an on-premise cluster connecting to GCP as required to meet additional needs.
Rather than repeating those instructions here, the following resources describe the deployment of a simple, stand-alone auto-scaling Slurm cluster on GCP:
When considering the machine type and size of your controller instance, take this guidance into account:
Number of Compute Nodes    Recommended Controller Instance Type
50                         c2-standard-4
400                        c2-standard-8
1000                       c2-standard-16
2000                       c2-standard-16
4500                       c2-standard-30
Bursting out individual Virtual Machines from an on-premise Slurm cluster to a Google Cloud Project is done using the Power Saving feature in Slurm, specifically the ResumeProgram and the SuspendProgram parameters in the slurm.conf. In the example provided, these parameters point to the Python programs, resume.py and suspend.py, found in the scripts directory. Your config.yaml should be configured so that the scripts can create and destroy compute instances in a GCP project. See Cloud Scheduling Guide for more information.
There are two options:
In the end, the slurmctld and any login nodes should be able to communicate with cloud compute nodes, and the cloud compute nodes should be able to communicate with the controller.
Create a new instance using the Slurm on GCP public operating system image:
gcloud compute instances create slurm-image \
  --zone us-central1-a \
  --image-family schedmd-slurm-20-11-4-hpc-centos-7 \
  --image-project schedmd-slurm-public
Install and configure any additional packages that you typically use on a Slurm compute node.
Then create an image from it, creating a family either in the form "<cluster_name>-compute-#-image-family", or in a name of your choosing:
gcloud compute images create slurm-compute-0-image-v1 \
  --source-disk slurm-image \
  --source-disk-zone us-central1-a \
  --family slurm-compute-cloud-image-family
In order to securely allow each instance type (Compute, Controller, Login) to perform their duties (adding/removing nodes, accessing cloud storage and other APIs), we must configure at least two Service Accounts with different levels of IAM permissions: the controller’s service account, and login and worker node service accounts.
Create a service account that will have permissions to create and delete instances (e.g. with the “compute.instanceAdmin.v1” role) in the remote project.
gcloud iam service-accounts create sa-name \
  --description="sa-description" \
  --display-name="sa-display-name"
Add roles to the service account that you just created that will allow the service account to create and delete instances. The “compute.instanceAdmin.v1” role provides this permission. Run this command to give the “compute.instanceAdmin.v1” role to the Service Account:
gcloud projects add-iam-policy-binding <My Project ID> \
  --member="serviceAccount:sa-name@<My Project ID>.iam.gserviceaccount.com" \
  --role=roles/compute.instanceAdmin.v1
On your on-premise Slurm controller, create a service account key that we will use to authenticate to Google Cloud from the on-premise environment.
gcloud iam service-accounts keys create /shared/slurm/scripts/service_account.key \
  --iam-account=sa-name@<My Project ID>.iam.gserviceaccount.com
We will configure Slurm to use the “/shared/slurm/scripts/service_account.key” file in another section.
Clone the slurm-gcp git repo to your Slurm controller node.
NOTE: This assumes internet access from the Slurm controller node.
git clone https://github.com/SchedMD/slurm-gcp.git
Copy resume.py, suspend.py, slurmsync.py, and config.yaml.example from the slurm-gcp repository's scripts directory to a location on the slurmctld node.
sudo cp resume.py suspend.py slurmsync.py config.yaml.example /apps/slurm/scripts
Rename config.yaml.example to config.yaml and modify the highlighted values.
cluster_name: slurm
project: slurm-184304
region: us-central1
zone: us-central1-a
external_compute_ips: false
google_app_cred_path: /shared/slurm/scripts/service_account.key
shared_vpc_host_project: <Shared VPC Project>
vpc_subnet: <Shared VPC Subnet>
slurm_cmd_path: /shared/slurm/current/bin
log_dir: /var/log/slurm
compute_node_prefix: <compute_node_prefix>
compute_node_scopes:
  - https://www.googleapis.com/auth/monitoring.write
  - https://www.googleapis.com/auth/logging.write
compute_node_service_account: sa-name
update_node_addrs: true
partitions:
  - name: cloud
    machine_type: n1-standard-16
    zone: us-central1-a
    compute_disk_size_gb: 20
    compute_disk_type: pd-standard
    compute_labels: null
    gpu_count: 0
    max_node_count: 20
    preemptible_bursting: false
    static_node_count: 0
    network_storage: []
    compute_image_family: slurm-compute-cloud-image-family
    vpc_subnet: <VPC Subnet>
Ensure that you specify the path of the service account key we created in the “google_app_cred_path” field in config.yaml. Also, specify the “compute_image_family” field for each partition if different than the naming schema, "<cluster_name>-compute-#-image-family".
Modify the on-premise cluster’s slurm.conf to add the following configurations. Specifically, this includes specifying the parameters required for Slurm PowerSave, modifying the SlurmctldParameters to resolve cloud nodes correctly, adding burst partition(s), and adding a cronjob for a Slurm State Sync script. For more details on the below options, see the Power Saving documentation.
Be sure to modify the highlighted values:
PrivateData=cloud

## Slurm PowerSave Configuration
SuspendProgram=/path/to/suspend.py
ResumeProgram=/path/to/resume.py
ResumeFailProgram=/path/to/suspend.py
SuspendTimeout=600
ResumeTimeout=600
ResumeRate=0
SuspendRate=0
SuspendTime=300

## Tell Slurm to not power off non-cloud nodes. By default, it will try to power off all nodes including on-prem nodes.
## SuspendExcParts will probably be the easiest one to use, and will exclude entire partitions.
## SuspendExcNodes specifies individual or ranges of nodes.
#SuspendExcNodes=
#SuspendExcParts=

## Slurm Controller configuration
SchedulerParameters=salloc_wait_nodes
## If using Cloud DNS, uncomment this line
#SlurmctldParameters=cloud_dns,idle_on_node_suspend
## If not using Cloud DNS, uncomment this line
#SlurmctldParameters=idle_on_node_suspend
CommunicationParameters=NoAddrCache
LaunchParameters=enable_nss_slurm
SrunPortRange=60001-63000
Note the “SuspendExcParts” and “SuspendExcNodes” options. Make sure to add any nodes or partitions that are not cloud nodes to these lists, or Slurm will put them into a power saving state.
Next you will need to add one or multiple cloud partitions to the end of your slurm.conf, where partitions are defined. Add a new section with the following contents, matching them with what you configured in config.yaml earlier. Be sure to modify the highlighted values:
NodeName=DEFAULT CPUs=16 RealMemory=59240 State=UNKNOWN
NodeName=slurm-compute-cloud-[0-19] State=CLOUD
PartitionName=cloud Nodes=slurm-compute-cloud-[0-19] MaxTime=INFINITE State=UP DefMemPerCPU=3702 LLN=yes
Slurm will not burst beyond the number of nodes configured.
Once your changes to slurm.conf have been made, you need to restart the Slurm Controller Daemon and Slurm Daemons on all instances in the cluster.
To restart the SlurmCtld daemon on the controller node, log in to the controller node and run the following command:
sudo systemctl restart slurmctld
To restart the Slurmd daemon on the compute nodes, you can run the following command with the online instances specified by the -w flag:
sbatch -w <node range> --wrap="srun sudo -i systemctl restart slurmd"
Once the daemons are restarted, sinfo and scontrol should reflect your updates and you should be able to deploy instances with your new configurations.
Add a crontab entry, run as SlurmUser, that calls slurmsync.py every minute:
*/1 * * * * /path/to/slurmsync.py
This script ensures that the Slurm cluster state matches the actual instance state according to Google Cloud.
First, try creating and deleting instances in GCP by calling the resume and suspend scripts directly as SlurmUser:
su slurm
./resume.py slurm-compute-cloud-0
This should create an instance called slurm-compute-cloud-0, with the configuration you specified in config.yaml. You can see this instance in the Google Cloud Console, or by running this command:
gcloud compute instances list
In order to delete that instance, run this command:
./suspend.py slurm-compute-cloud-0
This should delete the instance slurm-compute-cloud-0. You can see this instance has been deleted in the Google Cloud Console, or by running this command:
gcloud compute instances list
Next, try launching a job to the cloud burst partition(s) you’ve created:
srun -p cloud -N1 hostname
This will create a single instance in the cloud partition, and run hostname on it. If the node is created in the cloud and the job succeeds, you’ve configured your Hybrid cloud-bursting partition correctly!
The simplest way to handle user synchronization in a hybrid cluster is to use Slurm’s nss_slurm plugin. This permits passwd and group resolution for a job on the compute node to be serviced by the local slurmstepd process, rather than some other network-based service. It works by sending user information from the controller for each job and is handled on the compute instance by the slurm step daemon.
The nss_slurm plugin needs to be installed on the compute node image, when the image is created. The Slurm HPC image already has nss_slurm installed. Check the Slurm documentation for details on how to configure nss_slurm.
There are three methods available to install applications on a Slurm cluster on GCP: first, installing software at deployment time through the custom installation scripts; second, installing software to the shared NFS server at /apps; and third, using environment modules.
Slurm offers custom installation scripts for users to define a set of commands to be run at deployment time, which allows them to execute arbitrary commands to configure instances, download or compile software, and so on.
There are two files, “custom-controller-install” and “custom-compute-install”, in the “scripts” directory that can be used to add custom installations for the given instance type. The “custom-controller-install” script will run once on the controller instance at deployment time, and the “custom-compute-install” scripts will run on each partition’s compute-image instance before an image is made of that instance’s disk.
The files will be executed as root during startup of the instance types, and can either be run as Python or Bash scripts by specifying an interpreter in the first line of the script:
Bash:
#!/bin/bash
Python:
#!/bin/python
If using the custom installation scripts to compile software, it is advisable to use the custom-controller-install script for any commands performed on a common file system that are to be run only once (for example, compiling a common piece of software). This is because the custom-compute-install scripts will be run once per partition, so a cluster with 5 partitions will run the same command five times, which may have unintended side effects. From within the script, it is possible to check which partition an instance belongs to using a command like “hostname | rev | cut -d'-' -f2 | rev”, which will return the partition index. The partition index is the second-to-last number in the node hostname. For example, an instance in the zeroth partition would be named “slurm-compute-0-0”, an instance in the first partition would be named “slurm-compute-1-0”, and the third instance in the first partition would be named “slurm-compute-1-2”.
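The shell pipeline above has a straightforward Python equivalent, shown here as an illustrative sketch (“partition_index” is a hypothetical helper, not part of the slurm-gcp scripts):

```python
# The partition index is the second-to-last dash-separated field of the
# node hostname, equivalent to: hostname | rev | cut -d'-' -f2 | rev
def partition_index(hostname):
    return int(hostname.split("-")[-2])

print(partition_index("slurm-compute-0-0"))  # 0
print(partition_index("slurm-compute-1-2"))  # 1
```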
You can find instructions on how to install software to the shared NFS server at /apps, or any other shared storage, here.
The OS Login tool is a Google Cloud-specific daemon which provides your Google Directory credentials to compute instances in order to maintain consistency in user attributes like UID/GID and Username. This can replace traditional systems like Active Directory and LDAP.
OS Login is enabled by default on all Slurm instances. You can tell you are using OS Login because, unless specified otherwise for your user in Google Directory, your username when logged in to instances will appear as your full email address, including domain, with special characters like “@” and “.” replaced with underscores (“_”). For example, a Google Cloud user at “someone@domain.com” will have the username “someone_domain_com”. Users from external organizations will have “ext_” prepended to their username. For example, a Google Cloud user at “someone@external.com” will have the username “ext_someone_external_com”.
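The username mapping described above can be approximated as follows (an illustrative sketch only; OS Login itself performs the authoritative mapping, and Google Directory can override the username per user):

```python
import re

# Approximation of the OS Login username mapping: lowercase the email and
# replace any character outside [a-z0-9_] (such as "@" and ".") with "_".
# Users from external organizations additionally get an "ext_" prefix.
def os_login_username(email, external=False):
    name = re.sub(r"[^a-z0-9_]", "_", email.lower())
    return "ext_" + name if external else name

print(os_login_username("someone@domain.com"))                  # someone_domain_com
print(os_login_username("someone@external.com", external=True)) # ext_someone_external_com
```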
A user’s IAM permissions determine how access to the Slurm cluster is handled.
The process for adding a new partition or modifying an existing partition’s configuration is straightforward, and requires modifying two files and restarting the Slurm daemons on all cluster nodes. An administrator may want to add a partition or modify a partition’s configuration on a live cluster in order to increase or decrease the maximum number of instances in a partition, or modify the specific configuration of a partition.
First, modify the config.yaml file located at “/slurm/scripts/config.yaml”. This file defines the partition configurations for all the partitions in the cluster, and is where Slurm pulls the partition and instance configuration from every time it bursts a new instance. This file contains a number of fields. See the "Terraform Configuration Summary” section for more details on those fields.
Note that the order of the partitions determines the naming of the nodes, e.g.

<compute_node_prefix>-<pid>-<nid>

where <pid> is the partition's index (0-based) in config.yaml and <nid> is the node's index within that partition.
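As an illustrative sketch of this naming scheme (“node_name” is a hypothetical helper, not part of the slurm-gcp scripts):

```python
# Node names are built from the compute node prefix, the 0-based partition
# index (pid), and the node index within that partition (nid).
def node_name(prefix, pid, nid):
    return f"{prefix}-{pid}-{nid}"

print(node_name("slurm-compute", 0, 4))  # slurm-compute-0-4
print(node_name("slurm-compute", 1, 0))  # slurm-compute-1-0
```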
Once you’ve modified the values in config.yaml and saved the file, you may need to modify slurm.conf to match these changes as well.
Open slurm.conf on the controller instance at “/usr/local/etc/slurm/slurm.conf”. At the bottom of the file you will find a section marked “# COMPUTE NODES”, with the partition definitions, as seen below:
# COMPUTE NODES
NodeName=DEFAULT CPUs=4 RealMemory=15504 State=UNKNOWN
NodeName=slurm-demo-compute-0-[0-9] State=CLOUD
PartitionName=covm Nodes=slurm-demo-compute-0-[0-9] MaxTime=INFINITE State=UP DefMemPerCPU=3876 LLN=yes Default=YES
If you’ve changed information in config.yaml that is also defined here, including the machine type, partition name, or maximum node count, you will need to update the information in slurm.conf as well. This ensures that Slurm is aware of the changes.
For example, if you’ve changed the machine type from the c2-standard-4 defined above to a c2-standard-8, you would need to modify “CPUs=4” to “CPUs=8” and double the amount of memory by changing “RealMemory=15504” to “RealMemory=31008” (“DefMemPerCPU” can stay at 3876, since the memory per vCPU is unchanged between those machine types). If you changed the maximum node count for the partition from 10 to 20, you would need to modify “NodeName=slurm-demo-compute-0-[0-9]” to “NodeName=slurm-demo-compute-0-[0-19]”, and likewise the “Nodes=” range in the PartitionName line.
Once you’ve modified slurm.conf to reflect your changes, you must restart the Slurm daemons to propagate them: the SlurmCtld daemon on the controller node, and the Slurmd daemon on the compute/login nodes.
To restart the SlurmCtld daemon on the controller node, log in to the controller node and run the following command:
sudo systemctl restart slurmctld
To restart the Slurmd daemon on the compute nodes, you can run the following command with the online instances specified by the -w flag:
sbatch -w <node range> --wrap="srun sudo -i systemctl restart slurmd"
Once the daemons are restarted, sinfo and scontrol should reflect your updates and you should be able to deploy instances with your new configurations.
By default the /home and /apps directories are mounted on NFS storage hosted on the controller instance. These can also be replaced by external storage using the “network_storage” and “login_network_storage” tfvars fields.
If using the default controller-hosted shared storage, it may be necessary to increase or decrease the size of the shared storage, which can be done by modifying the size of the controller’s disk hosting the shared storage. In the case that your requirements grow beyond the size or performance of a single Persistent Disk, we recommend that you consider options including Filestore, which is Google Cloud’s fully managed NFS service, NetApp, Dell EMC, DDN EXAScaler, or self-managed Lustre.
To resize the controller instance’s disk which hosts the shared storage, follow the process described here. This process can be done online.
In order to add a new file system type you simply need to install the client software so that the desired instances can mount the file system, and add the appropriate entry to the “network_storage” fields in the tfvars file.
You can install the client software either by adding the file system client installation steps in the custom install scripts (best if using a user-space client), by building a new image from the default Slurm image with the client software installed, or by customizing the setup.py script in the foundry scripts to install the client software if you’re using Slurm’s image foundry to create images.
For example, if a user wanted to install the CernVM file system client to their controller and compute nodes, they could write these lines to their “custom-compute-install” and “custom-controller-install” scripts:
#!/bin/bash
sudo yum install -y https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest.noarch.rpm
sudo yum install -y cvmfs
This code would execute at the compute and controller instance’s startup, and add the CernVM FS RPM and client software on the instance.
Once the client software is configured to be installed, the tfvars file can be configured the same way an fstab entry would be, including the “fs_type” field, which is a 1:1 translation of fstab’s “type” field.
For example, if using a CernVM file system hosted at 10.0.0.10, you can use the following “network_storage” entry to mount it:
network_storage:
  - server_ip: 10.0.0.10
    remote_mount: /cernvmfs
    local_mount: /mnt/cernvmfs
    fs_type: cvmfs
    mount_options:
This tfvars entry would add an fstab entry with the server IP configured to 10.0.0.10, the remote file system name as “cernvmfs”, and the local mount directory as “/mnt/cernvmfs”, with a file system type of “cvmfs”, and no mount options provided, which will default to use “defaults,_netdev”.
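The rendering of such an entry into an fstab line can be sketched as follows (illustrative only; “fstab_line” is a hypothetical helper, and the real rendering is done by the slurm-gcp setup scripts):

```python
# Sketch: a network_storage entry rendered as the fstab line it produces.
# Empty mount_options fall back to "defaults,_netdev", as described above.
def fstab_line(server_ip, remote_mount, local_mount, fs_type, mount_options=""):
    options = mount_options or "defaults,_netdev"
    return f"{server_ip}:{remote_mount} {local_mount} {fs_type} {options} 0 0"

print(fstab_line("10.0.0.10", "/cernvmfs", "/mnt/cernvmfs", "cvmfs"))
# 10.0.0.10:/cernvmfs /mnt/cernvmfs cvmfs defaults,_netdev 0 0
```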
There are two ways to run jobs within the Slurm workload manager: srun, and sbatch.
There are also multiple considerations to take into account when running jobs, depending on the resources required. We will cover some of these topics below.
In order to execute a job with GPUs you must include the “--gpus-per-node” option in the options of the sbatch script or srun or salloc command. Without specifying the number of GPUs Per Node that you wish to allocate, Slurm’s cgroups will prevent access to any GPUs on the VM, even if there are GPUs provisioned and you are the only user on the VM.
For example, the following command with the “--gpus-per-node” option will run nvidia-smi properly on a GPU compute node run in a “gpu” partition:
sbatch -p gpu --gpus-per-node=1 --wrap="nvidia-smi"
However, if that command is run without the “--gpus-per-node” option, or on a compute node without GPUs provisioned, the nvidia-smi command will fail.
The cluster’s status has two components, the Slurm Cluster’s status, and the Cloud Infrastructure’s status. The Slurm Cluster’s status can be viewed using the Slurm CLI. The Cloud Infrastructure’s status can be viewed using the Google Cloud Console and Google Cloud API, specifically using the Google Cloud Ops Suite.
Slurm uses a number of Command Line Interface (CLI) tools to monitor various parts of the Slurm Infrastructure, including jobs, and nodes.
Slurm uses squeue and scontrol to view job queues and specific job information.
You can view all your Slurm Queues and jobs by running the following command from the controller node with a valid Slurm user:
squeue
This command will return information about any jobs that are in the Slurm queues, and the output includes the following fields:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
You may also use various options to control the output of squeue. Common options used are “-A” to specify to show only jobs submitted by a specific account, and “-p” to specify to show only jobs submitted to a specific partition.
You can find more information about how to use squeue here.
You can view detailed information on a specific job by running the following command from the controller node with a valid Slurm user:
scontrol show job <JOB_ID>
The scontrol command will return detailed information about the job:
JobId=6 JobName=test
   UserId=slurm_user(1000) GroupId=slurm_user(1000) MCS_label=N/A
   Priority=4294901755 Nice=0 Account=default QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-07-14T18:56:09 EligibleTime=2020-07-14T18:56:09
   AccrueTime=Unknown
   StartTime=2020-07-14T18:59:00 EndTime=2020-07-14T18:59:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-14T18:56:09
   Partition=c2-60 AllocNode:Sid=slurm-controller:28360
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=slurm-compute-0-0
   BatchHost=slurm-compute-0-0
   NumNodes=1 NumCPUs=30 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=30,mem=119070M,node=1,billing=30
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3969M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=test
   WorkDir=/home/slurm_user
   Power=
This information includes the account that submitted the job, runtime and time limit of the job, and instances included in the job and details of instance configuration.
You can find more information about how to use scontrol here.
Cloud-based nodes have many states in Slurm which indicate their status (see the “Node State” table of the Slurm elastic computing documentation page). These states can be seen in the output of the “sinfo” command, under the STATE column. For example:
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
pvm-c2    up     infinite       4  alloc  g1-compute-0-[0-3]
pvm-c2    up     infinite       6  idle~  g1-compute-0-[4-9]
n2d       up     infinite      20  idle~  g1-compute-1-[0-19]
gpu       up     infinite       2  idle%  g1-compute-2-[0-1]
gpu       up     infinite       2  mix#   g1-compute-1-[2-3]
gpu       up     infinite      18  idle~  g1-compute-1-[4-19]
In the cloud context, the important states include:
- “~” suffix: the node is powered down (no VM is running)
- “#” suffix: the node is powering up (a VM is being created)
- “%” suffix: the node is powering down (the VM is being destroyed)
- “alloc”: the node is fully allocated to running job(s)
- “mix”: the node is partially allocated to running job(s)
In the example above, the pvm-c2 partition has g1-compute-0-[0-3] allocated to job(s) with VMs running, and g1-compute-0-[4-9] idle without VMs running; the n2d partition has all 20 of its nodes, g1-compute-1-[0-19], idle without VMs running; the gpu partition has g1-compute-2-[0-1] powering down with their VMs being destroyed, g1-compute-1-[2-3] being provisioned for job(s) to run on them, and g1-compute-1-[4-19] idle without VMs running.
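The state suffixes shown above can be decoded with a small sketch like the following (illustrative only; the suffix meanings come from the Slurm elastic computing documentation, and “decode_state” is a hypothetical helper):

```python
# Cloud node-state suffixes used in sinfo output:
#   ~  node is powered down (no VM)
#   #  node is powering up (VM being created)
#   %  node is powering down (VM being destroyed)
SUFFIXES = {"~": "powered down", "#": "powering up", "%": "powering down"}

def decode_state(state):
    if state and state[-1] in SUFFIXES:
        return state[:-1], SUFFIXES[state[-1]]
    # No suffix: the node is powered up (e.g. "alloc", "mix", "idle").
    return state, "powered up"

print(decode_state("idle~"))  # ('idle', 'powered down')
print(decode_state("alloc"))  # ('alloc', 'powered up')
```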
The Slurm on GCP scripts have several layers of logs to consider when troubleshooting.
The controller instance maintains a set of logs in “/var/log/slurm”. These include:
In the scenario where you need to turn off Shielded VM features, they can be disabled by adding a “shieldedInstanceConfig” section in the resume.py file, as shown in the last block below:
if instance_def.gpu_count:
    config['guestAccelerators'] = [{
        'acceleratorCount': instance_def.gpu_count,
        'acceleratorType': instance_def.gpu_type
    }]
    config['scheduling'] = {'onHostMaintenance': 'TERMINATE'}
    config['shieldedInstanceConfig'] = {
        'enableIntegrityMonitoring': False,
        'enableSecureBoot': False,
        'enableVtpm': False
    }
Support for Slurm on GCP is available either via commercial support from SchedMD, or community-based support from the Slurm on Google Cloud user and developer community.
Commercial support is available directly from SchedMD, the commercial backers, developers, and maintainers of Slurm. You can read more about SchedMD’s commercial support offerings and contact information here at their website.
Community support is available on the Google Cloud Slurm Discuss Google group. Questions will be answered on a best-effort basis by community members, including SchedMD and Google employees.