Authors: Yoel Drori, Augustus Odena, Joel Shor
-
Set up your Cloud resources. This involves setting up a Cloud Storage bucket (disk), a TPU to run the computation, and a Virtual Machine to run your code. There are multiple ways of bringing each of these up. The easiest is to follow the instructions in this tutorial. Start at the top and finish after the section "Verify your Compute Engine VM".
- Note: Be sure to complete the "Clean Up" steps below after finishing to avoid incurring charges to your GCP account.
- Note: Be sure to read the TPU pricing guide and Cloud Storage pricing guide.
- Note: If you want to profile TPU utilization, be sure to install the Cloud TPU profiler.
-
The final command should leave you connected to your new VM. If it hasn't, please follow these instructions to connect.
-
On your new VM, run the following commands to download ImageNet from TensorFlow Datasets and convert it to TFRecords for easy training. This could take hours, so start it and take a coffee break:
    pip install --upgrade tensorflow_datasets --user
    tmux
    STORAGE_BUCKET=gs://YOUR-BUCKET-NAME
    python -c 'import tensorflow_datasets as tfds; ds = tfds.load("imagenet2012:5.*.*", split="train", data_dir="'${STORAGE_BUCKET}/data'"); tfds.as_numpy(ds)'
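The same download can also be written as a short script instead of a one-liner. The following is a minimal sketch and not part of the TF-GAN example code: `imagenet_data_dir` and `download_imagenet` are hypothetical helper names, and the only TensorFlow Datasets calls used are the `tfds.load` and `tfds.as_numpy` calls from the command above.

```python
def imagenet_data_dir(storage_bucket: str) -> str:
    """Return the dataset directory used throughout this README.

    Mirrors the shell expression ${STORAGE_BUCKET}/data, but also tolerates
    a trailing slash on the bucket URL.
    """
    if not storage_bucket.startswith("gs://"):
        raise ValueError(f"expected a gs:// bucket URL, got {storage_bucket!r}")
    return storage_bucket.rstrip("/") + "/data"


def download_imagenet(storage_bucket: str):
    """Hypothetical wrapper around the one-liner above."""
    # Imported lazily so the path helper works without the package installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(
        "imagenet2012:5.*.*",
        split="train",
        data_dir=imagenet_data_dir(storage_bucket),
    )
    # Same final call as the one-liner.
    return tfds.as_numpy(ds)


# Example (run on the VM, with your own bucket name):
# download_imagenet("gs://YOUR-BUCKET-NAME")
```

Keeping the path logic in one helper makes it easier to pass the same directory to `--imagenet_data_dir` later, which is what lets training skip the download.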
-
Install the necessary packages and download the example code:
    git clone https://github.com/tensorflow/gan.git
    pip install tensorflow_gan --user
-
Run the setup instructions in tensorflow_gan/examples/README.md to properly set up the `PYTHONPATH`.
-
Save the location of your cloud resources.
    export STORAGE_BUCKET=gs://YOUR-BUCKET-NAME
    export TPU_NAME=TPU-NAME
    export PROJECT_ID=PROJECT-ID
    export TPU_ZONE=ZONE
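Before launching a long job, it can help to confirm that these four variables are set and no longer hold the placeholder values shown above. The following is a minimal sketch under that assumption; `unset_or_placeholder` is a hypothetical helper, not part of the example code.

```python
import os

REQUIRED_VARS = ("STORAGE_BUCKET", "TPU_NAME", "PROJECT_ID", "TPU_ZONE")
# Placeholder values copied from the export commands in this README.
PLACEHOLDERS = {"gs://YOUR-BUCKET-NAME", "TPU-NAME", "PROJECT-ID", "ZONE"}


def unset_or_placeholder(env=None):
    """Return the required variables that are missing, empty, or placeholders."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or value in PLACEHOLDERS:
            problems.append(name)
    return problems


# Illustrative values only: PROJECT_ID is still a placeholder, TPU_ZONE is missing.
example = {
    "STORAGE_BUCKET": "gs://my-bucket",
    "TPU_NAME": "my-tpu",
    "PROJECT_ID": "PROJECT-ID",
}
print(unset_or_placeholder(example))  # ['PROJECT_ID', 'TPU_ZONE']
```

Calling `unset_or_placeholder()` with no argument checks the current shell environment instead of the example dictionary.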
-
Run the example:
    cd gan/tensorflow_gan/examples
    python self_attention_estimator/train_experiment_main.py \
      --use_tpu=true \
      --eval_on_tpu=true \
      --use_tpu_estimator=true \
      --mode=train_and_eval \
      --max_number_of_steps=999999 \
      --train_batch_size=1024 \
      --eval_batch_size=1024 \
      --predict_batch_size=128 \
      --num_eval_steps=49 \
      --train_steps_per_eval=1000 \
      --tpu=$TPU_NAME \
      --gcp_project=$PROJECT_ID \
      --tpu_zone=$TPU_ZONE \
      --model_dir=$STORAGE_BUCKET/logdir \
      --imagenet_data_dir=$STORAGE_BUCKET/data \
      --alsologtostderr
- Note: If you've run the data download step, training should start almost immediately. Otherwise, the first run will take a long time to start, since the code needs to download the ImageNet dataset.
- Note: If your job starts downloading the data even though you ran the pre-download step, you probably didn't enter the same `STORAGE_BUCKET` location as in the previous step.
- Note: If your job fails with something like "Could not write to the internal temporary file.", you might need to follow these instructions and give the TPU permission to write to your cloud bucket.
- Note: If your job fails with "IOError: [Errno 2] No usable temporary directory found in ...", you might have run out of disk space. Try clearing the temp directories listed and run the job again.
- Note: If your job fails with `Bad hardware status: ...`, try restarting your TPU.
- Note: The batch sizes `train_batch_size`, `eval_batch_size`, and `predict_batch_size` must each be a multiple of the number of TPU shards in your machine (note that each TPU core contains two computation shards). In addition, `predict_batch_size` must be at least 16.
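The batch-size rules in the last note are easy to check before starting a job. The following is a minimal sketch of such a check; `check_batch_sizes` is a hypothetical helper rather than part of the example code, and the shard count is passed in explicitly because it depends on your TPU (the value 16 below is an assumed, illustrative number).

```python
MIN_PREDICT_BATCH_SIZE = 16  # lower bound stated in the note above


def check_batch_sizes(train_batch_size, eval_batch_size, predict_batch_size,
                      num_shards):
    """Return a list of problems; an empty list means the sizes are valid."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    problems = []
    sizes = {
        "train_batch_size": train_batch_size,
        "eval_batch_size": eval_batch_size,
        "predict_batch_size": predict_batch_size,
    }
    # Every batch size must be a multiple of the number of shards.
    for name, size in sizes.items():
        if size % num_shards != 0:
            problems.append(
                f"{name}={size} is not a multiple of {num_shards} shards")
    # predict_batch_size has an additional lower bound.
    if predict_batch_size < MIN_PREDICT_BATCH_SIZE:
        problems.append(
            f"predict_batch_size={predict_batch_size} is below "
            f"{MIN_PREDICT_BATCH_SIZE}")
    return problems


# Batch sizes from the command above; num_shards=16 is an assumed value.
print(check_batch_sizes(1024, 1024, 128, num_shards=16))  # []
```

Returning a list of messages rather than raising on the first failure shows every violated rule at once, which saves a relaunch when more than one flag is wrong.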
-
(Recommended) You can set up TensorBoard to track your training progress using these instructions.
-
Clean up by following the "Clean up" instructions in this tutorial.
This code is a TF-GAN Estimator implementation of Self-Attention Generative Adversarial Networks. It can run locally, on GPU, and on cloud TPU.
Real images | Generated images (GPU, 27 days) | Generated images (TPU, 2 days) |
---|---|---|
Inception score and Frechet Inception distance based on step number:
In this example, we compare the running time of a system with 8 Tesla V100 GPUs to a system with 128 TPU v2 cores. As a function of training step, the GPU and TPU jobs perform similarly. In terms of wall-clock time, however, the TPU job is more than 12x faster: