Building a compute cluster


So you want to run code on more than one computer. How do you do it? Follow this simple three-step process.

  1. Obtain computers.
  2. Configure a cluster work scheduler.
  3. Run jobs on the computers through the work scheduler.

Okay. Maybe not that simple after all. Let’s back up and revisit some reasons why we would want to go beyond a single computer in the first place: scale and reliability. And by “we would want” I mean “let me tell you about my experiences that led me to want to do this project”.

My first experience with compute clusters was during my time in undergrad doing a research project to simulate the swimming motion of sea lampreys with fifty coupled oscillators. Optimizing the coupling parameters by repeatedly solving the differential equations was too slow to complete in a single semester on my Pentium 4 laptop. The math department had a Beowulf cluster and I wrote a bash script to ssh and scp files to a handful of computers to distribute the work and collect the results. nohup became my best friend that semester. I didn’t know it at the time, but this experience was the genesis of my lifelong obsession with using computers to solve problems.

Wrapping up my physics degree and moving on to grad school, I participated in a handful of physics projects[1] that granted me access to the university-wide compute clusters. These clusters contained orders of magnitude more processors, had a centralized work scheduler, and used a shared filesystem. This is where I learned the utility of Slurm and of having the right abstractions. Enjoying this topic led me to switch my focus from physics simulation to distributed computing and to start contributing to our department’s custom distributed workflow system.[2]

Fast forward a decade and I began an ongoing project to build a game-playing agent for Pokémon. One of my primary constraints for that project is to avoid encoding my knowledge of the game into the agents; they need to learn it themselves. That in turn requires simulating thousands of hours of Game Boy play. Emulation is fast (headless emulation reaches roughly 1,500 fps) and the screen resolution is tiny (160x144px), but the time adds up and it is much easier to experiment with multiple computers. Conveniently, I have a handful of old computers lying around the house. I spent a weekend adding ssh keys to each box and installing Julia. I then reverted to my “just scp and ssh things” lifestyle with the excuse of “it will be quick, don’t waste the time setting up a real cluster”. Two months later I’m still going. Time to invest in running a cluster myself.

Hardware

To keep things self contained and manage power consumption while I learn, I opted for the Turing Pi cluster board as my hardware platform for experimenting with cluster schedulers. As shown in the diagram to the right, the cluster board has four expansion slots for compute nodes all connected to an on-board ethernet switch. There are many options for compute boards including Raspberry Pi, Nvidia Jetson, and Turing's own RK1. I chose the RK1 since the price was competitive and there are no Raspberry Pi 5 compute modules yet.

The CPU layout of the RK1's Rockchip RK3588 SoC is heterogeneous. lscpu reports eight CPUs spread across three sockets: the four Cortex-A55 cores sit on one socket and the four Cortex-A76 cores come in two pairs. This will cause some challenges when configuring Slurm.

The nodes are identical apart from their attached peripherals. Each node is powered from the same ATX power supply, connected to a dedicated M.2 SSD, has its own active cooler, and is configurable from the board management controller. Node 3 is unique since it is connected to two SATA 3 ports. This means that Node 3 will be our shared file server and control node. The others will be workers.

Operating System

Hardware obtained. On to the software, starting with the operating system. Turing Pi provides documentation and an Ubuntu image for the boards. Using the web interface provided by the board management controller, I flashed the operating system image to each node sequentially. Surprisingly painless, as long as I remembered not to refresh the page.

With the operating system installed I ran a handful of manual steps to prepare the nodes as targets for my ansible playbook.

$ sudo su
# useradd -rm ansible
# echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible
# chmod 440 /etc/sudoers.d/ansible
# mkdir -m 700 /home/ansible/.ssh
# echo "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIO7kqBPUpVpPS4pTJXk5zQu2FMLZkNda2Q521btGCRDI" > /home/ansible/.ssh/authorized_keys
# chmod 600 /home/ansible/.ssh/authorized_keys
# chown -R ansible:ansible /home/ansible/.ssh
# exit
$ exit

Now we can use ansible to format and mount the SSD on each node for local scratch space. One bash script later and we are done.

#!/usr/bin/env bash

# Show the block devices so we know what we are about to format.
sudo lsblk -io NAME,TYPE,SIZE,MOUNTPOINT,FSTYPE,MODEL

# Create an LVM physical volume and volume group on the NVMe drive.
sudo pvcreate /dev/nvme0n1
sudo vgcreate vgnfs /dev/nvme0n1
sudo vgs -v

# Allocate the entire volume group to one logical volume and format it.
sudo lvcreate -l 100%VG -n lvnfs vgnfs
sudo lvdisplay
sudo mkfs.ext4 /dev/vgnfs/lvnfs

# Mount it as local scratch space.
sudo mkdir -p /mnt/bulk
sudo mount /dev/vgnfs/lvnfs /mnt/bulk
sudo df -H

# Persist the mount across reboots. Note: "sudo echo ... >> /etc/fstab" would fail
# because the redirect runs as the unprivileged user, so pipe through tee instead.
echo "/dev/mapper/vgnfs-lvnfs /mnt/bulk ext4 defaults 0 0" | sudo tee -a /etc/fstab
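
Rather than logging in to each node, I run the script across the cluster with ansible. One way to do that is an ad-hoc invocation of the script module, sketched below; the inventory file name (hosts.ini) and script file name (scratch-disk.sh) are placeholders of mine, not part of the setup above.

$ ansible all -i hosts.ini -u ansible -m script -a scratch-disk.sh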

Slurm

Slurm is a cluster resource manager that allocates resources (CPU, memory, GPU, ...) across multiple computers and manages contention between competing requests with a job queue. There are four main components we need to set up in order to have a functional cluster: shared storage, authentication, workers, and the job controller.

Shared Storage

Why shared storage? This way all nodes have access to the same file system and submitting jobs is just “scp to node 3 and submit the job”. Otherwise, the user would have to manually copy files to every node before the job starts, undermining the entire goal of this setup.

The simplest way to create a shared file system is to use NFS. We will be using node 3 to share a file system from the attached SATA drives. We are also going to completely ignore multi-user permissions since I am the only user. fdisk, mkfs, and nfs-kernel-server are all that is needed on node 3. Then on every other node we need to add the network volume to fstab with echo "node3:/sharedfs /sharedfs nfs defaults 0 0" >> /etc/fstab.
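
As a rough sketch (assuming the SATA drive shows up as /dev/sda with a single partition and the export lives at /sharedfs; both are my assumptions, adjust to taste), the server side on node 3 looks something like:

$ sudo apt install nfs-kernel-server
$ sudo mkfs.ext4 /dev/sda1                 # after partitioning with fdisk
$ sudo mkdir -p /sharedfs
$ sudo mount /dev/sda1 /sharedfs
$ echo "/sharedfs *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
$ sudo exportfs -ra

And each worker mounts it:

$ sudo apt install nfs-common
$ sudo mkdir -p /sharedfs
$ echo "node3:/sharedfs /sharedfs nfs defaults 0 0" | sudo tee -a /etc/fstab
$ sudo mount /sharedfs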

Authentication

Slurm uses MUNGE to manage authentication between nodes in the cluster. Users of the cluster are already authenticated to the head node. Therefore, the cluster only needs a way to reliably encode this information and forward it to the workers. MUNGE uses a shared encryption key to securely transmit information between the nodes. The man page for MUNGE provides more details on the rationale and methodology. All we need to do is install munge, generate the key, copy the key to each node (conveniently reusing the shared file system we previously set up), and start the daemon. Make sure that the permissions on the munge key are correct; otherwise, the munged service will not start.
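
A sketch of those steps, using the NFS share from the previous section to distribute the key (paths follow the Ubuntu munge package; if the install does not leave a key at /etc/munge/munge.key, generate one first). On node 3:

$ sudo apt install munge
$ sudo cp /etc/munge/munge.key /sharedfs/munge.key   # fine for a single-user cluster

On each worker:

$ sudo apt install munge
$ sudo cp /sharedfs/munge.key /etc/munge/munge.key
$ sudo chown munge:munge /etc/munge/munge.key        # munged refuses to start otherwise
$ sudo chmod 400 /etc/munge/munge.key
$ sudo systemctl restart munge

A quick round trip from node 3 confirms that a credential minted on one node decodes on another:

$ munge -n | ssh node1 unmunge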

Workers

slurmd is the worker process. This is the agent that connects to the controller and runs jobs on a single node. Install the slurm-wlm apt package, configure the process, and start the service. Relatively straightforward, with a few steps that initially tripped me up.

  1. Every node, including the controller, should have the same configuration file.
  2. Slurm has support for cgroups to isolate jobs on the worker nodes. This needs to be configured, and the allowed devices need to be explicitly listed.
  3. The node description needs to match the discovered hardware on the box. hwloc extracts the processor information and Slurm compares the output to the config list. In my particular case, the processor is asymmetric, which violates one of Slurm's assumptions (the number of cores is an integer multiple of the number of sockets). For Intel processors with P- and E-cores there is an option to enable or disable the E-cores; no such option exists for the Rockchip processors. So we need to tell Slurm that the config takes precedence with SlurmdParameters=config_overrides (see the config sketch below).
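
For reference, the relevant fragments of the configuration end up looking roughly like the sketch below. The cluster name, memory size, and cgroup constraints are illustrative values, not prescriptive; the node and partition names match the sinfo output further down.

$ cat /etc/slurm/slurm.conf        # identical on every node, including node 3
ClusterName=turing
SlurmctldHost=node3
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# Trust this file over the socket/core layout slurmd discovers via hwloc.
SlurmdParameters=config_overrides
NodeName=node[1-2,4] CPUs=8 RealMemory=7900 State=UNKNOWN
PartitionName=workers Nodes=node[1-2,4] Default=YES MaxTime=INFINITE State=UP

$ cat /etc/slurm/cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes               # allowed devices go in cgroup_allowed_devices_file.conf

Then on each worker:

$ sudo apt install slurm-wlm
$ sudo systemctl enable --now slurmd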

Job Controller

Now that we have the workers configured, the controller, slurmctld, is trivial. Make sure the head node has the slurm-wlm apt package installed, copy the same configs, and start the service.
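
On Ubuntu the slurm-wlm package already includes slurmctld, so with the shared config in place this is just (a minimal sketch):

$ sudo apt install slurm-wlm
$ sudo systemctl enable --now slurmctld

Once it is running you can check that everything is connected with sinfo.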

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
workers*     up   infinite      3   idle node[1-2,4]

Then run a simple job to test that everything is working: hostname on three nodes.

$ srun --nodes=3 hostname
node4
node2
node1

What Next?

  1. Run distributed Julia jobs on the cluster.
  2. Repeat this process on the other computers in my house to build a compute cluster with more processing power than my laptop.


  1. Rapid-fire summary: simulating hydrogenation of graphene sheets to introduce bandgaps by bending the sheet at the bond site, using OpenCV to analyze video of frog leg muscles stretching (again for robotics, like the sea lamprey work), and searching for paths through the potential energy landscape of folding proteins.
  2. My outdated publication from 2013 building a packaging system for workflows, soon to be rendered functionally obsolete by the Docker monoculture.