Containers
This section describes a basic test for containerized environments using Enroot and Pyxis, both from NVIDIA.
First, we test Enroot:
enroot import docker://ubuntu
enroot create -n ubuntu ubuntu.sqsh
enroot start ubuntu sh -c 'grep PRETTY /etc/os-release'
> PRETTY_NAME="Ubuntu 24.04.2 LTS"
Second, we check that Pyxis gives the same result:
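A minimal sketch of the Pyxis check, assuming a Slurm cluster with the Pyxis plugin installed. It uses the Pyxis `--container-image` option of `srun` to run the same `grep` inside the `ubuntu` image. The `command -v` guard is our addition, so that the snippet does nothing harmful on a machine without Slurm.

```shell
# Image to test; "ubuntu" matches the Enroot test above.
IMAGE="ubuntu"

if command -v srun >/dev/null 2>&1; then
    # --container-image is provided by the Pyxis plugin for Slurm.
    srun --container-image="$IMAGE" grep PRETTY /etc/os-release
else
    echo "srun not found; run this on a Slurm cluster with Pyxis" >&2
fi
```

If Pyxis is working, the output should report the same PRETTY_NAME line as the Enroot test.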
Alternatively, to use a custom image built with dockerd:
- Build a custom image from a Dockerfile.
- Import the dockerd image into Enroot (an image in a registry can be imported with docker://IMAGE:TAG instead).
- Use the Pyxis --container-image flag, pointing it at the resulting name:tag.sqsh file.
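The three steps above can be sketched as follows. The image name `my_image`, the tag `latest`, the output file name, and a Dockerfile in the current directory are all hypothetical placeholders, not names taken from this guide. The sketch relies on `enroot import` accepting a `dockerd://` URI and an `-o` output path, and on Pyxis treating an absolute path passed to `--container-image` as a squash file.

```shell
# Hypothetical names for illustration; replace with your own.
IMAGE="my_image"
TAG="latest"
SQSH="$PWD/${IMAGE}_${TAG}.sqsh"

if command -v docker >/dev/null 2>&1 \
    && command -v enroot >/dev/null 2>&1 \
    && command -v srun >/dev/null 2>&1; then
    # 1. Build the image with the local Docker daemon (assumes ./Dockerfile).
    docker build -t "$IMAGE:$TAG" .
    # 2. Import the image from dockerd into a squash file.
    enroot import -o "$SQSH" "dockerd://$IMAGE:$TAG"
    # 3. Run a command in the container through Pyxis.
    srun --container-image="$SQSH" grep PRETTY /etc/os-release
else
    echo "docker, enroot and srun are all required for this workflow" >&2
fi
```

Naming the output explicitly with `-o` avoids depending on the default file name that Enroot derives from the image and tag.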
Example: torchtitan multi-node
We clone cluster-tests into /home/ubuntu:
We build the image based on torchtitan.dockerfile:
NOTE: HF_TOKEN must be set in .bashrc or exported in the current bash session, and the token must have been granted access to the Llama 3 family of models.
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
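Because the build passes HF_TOKEN as a build argument, an unset variable may only surface later as a failed model download. A small guard like the one below can catch this before building. The function name `require_hf_token` is hypothetical and not part of cluster-tests, and the check only verifies that the variable is non-empty, not that the token has access to the models.

```shell
# Hypothetical helper: fail early if HF_TOKEN is unset or empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before building" >&2
        return 1
    fi
}
```

It can be chained in front of the build, for example `require_hf_token && docker build ...`, so the build only starts when the token is present.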
Then we import the image into a squash file, which Enroot will use:
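A minimal sketch of this import, assuming the image was built with the tag used in the `docker build` command above. The output file name is an assumption: torchtitan_multinode.sh may expect a different name or location, so adjust `SQSH` to match the script.

```shell
# Image name from the docker build step; the .sqsh file name is an assumption.
IMAGE="torchtitan_cuda128_torch27"
SQSH="${IMAGE}.sqsh"

if command -v enroot >/dev/null 2>&1; then
    # dockerd:// imports from the local Docker daemon rather than a registry.
    enroot import -o "$SQSH" "dockerd://$IMAGE"
else
    echo "enroot not found; install Enroot before importing" >&2
fi
```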
Now, we execute torchtitan_multinode.sh: