-
Set up NeMo Framework Container
This script makes a few environment variable modifications to the nvcr.io/nvidia/nemo:23.11.framework container and submits a Slurm job that copies the framework launcher scripts and a few other auxiliary files into your working directory.
sbatch setup_nemo.sh
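Since the setup runs as a batch job, you may want to confirm it has been queued and has completed before moving on. A minimal check, assuming you are on a Slurm login node (the fallback message is only for environments without Slurm):

```shell
# Show your queued/running jobs; the setup job should appear here until it
# finishes. Falls back to a hint if Slurm tools are not on the PATH.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER"
else
  echo "squeue not found; run this on a Slurm login node"
fi
```

Once the job no longer appears in the queue, the launcher scripts should be present in your working directory.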
-
Install NeMo Framework Requirements
We suggest using a virtual environment. The following commands install the components needed to submit jobs with the NeMo Framework.
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt  # Copied from the NeMo Framework Container earlier
-
Run an example NeMo Framework Pre-Training
This runs an example that trains a 5B-parameter GPT-3 model for 10 steps using mock data as input.
cd launcher_scripts
mkdir data
python main.py \
    launcher_scripts_path=${PWD} \
    stages=[training] \
    env_vars.TRANSFORMERS_OFFLINE=0 \
    container=../nemofw+tcpx-23.11.sqsh \
    container_mounts='["/var/lib/tcpx/lib64","/run/tcpx-\${SLURM_JOB_ID}:/run/tcpx"]' \
    cluster.srun_args=["--container-writable"] \
    training.model.data.data_impl=mock \
    training.model.data.data_prefix=[] \
    training.trainer.max_steps=10 \
    training.trainer.val_check_interval=10 \
    training.exp_manager.create_checkpoint_callback=False
This will submit a pre-training job to your Slurm cluster. Once it starts, you will see results appearing in results/gpt3_5b/. For this example, the job should only take a few minutes.
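To check on the run's progress, you can watch the results directory fill in. A small sketch, assuming the results/gpt3_5b/ path from above (exact file names inside it depend on the experiment manager settings, so this just lists the newest entries):

```shell
# List the most recent files in the results directory, if it exists yet.
RESULTS_DIR="results/gpt3_5b"
if [ -d "$RESULTS_DIR" ]; then
  ls -lt "$RESULTS_DIR" | head
else
  echo "no results yet"
fi
```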
Now that you've run an example training workload, you may prefer to customize conf/cluster/bcm.yaml, conf/config.yaml, and the training configuration file of your choosing rather than passing command-line arguments. For real training workloads you'll also want to use real data instead of the mock dataset used here, and to explore all of the tuning and configuration parameters the NeMo Framework offers for your use case.
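As an illustration of moving overrides out of the command line, the training-related settings from the example above could be expressed in YAML instead. This is only a sketch based on the override names used earlier, not a verified excerpt of conf/config.yaml; check the key layout against the copies in your working directory before using it:

```yaml
# Illustrative fragment only -- mirrors the command-line overrides above.
stages:
  - training
env_vars:
  TRANSFORMERS_OFFLINE: 0
training:
  trainer:
    max_steps: 10
    val_check_interval: 10
  model:
    data:
      data_impl: mock
      data_prefix: []
  exp_manager:
    create_checkpoint_callback: False
```

Settings that belong to a specific model size (such as the trainer and data keys) typically live in the training configuration file selected in conf/config.yaml rather than in config.yaml itself.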