self_attention_estimator

Self-Attention GAN Estimator on ImageNet

Authors: Yoel Drori, Augustus Odena, Joel Shor

How to run

Run on cloud TPU

Set up your Cloud resources. This involves setting up a Cloud Storage bucket (disk), a TPU to run the computation, and a Virtual Machine (VM) to run your code. There are multiple ways to bring each of these up; the easiest is to follow the instructions in this tutorial. Start at the top and stop after the section "Verify your Compute Engine VM".
- Note: Be sure to complete the Clean Up steps below after finishing to avoid incurring charges to your GCP account.
- Note: Be sure to read the TPU pricing guide and Cloud Storage pricing guide.
- Note: If you want to profile TPU utilization, be sure to install the Cloud TPU profiler.
The final command should leave you connected to your new VM. If it hasn't, please follow these instructions to connect.

On your new VM, run the following commands to download ImageNet from TensorFlow Datasets and convert it to TFRecords for training. This can take hours, so start it and take a coffee break:

```
pip install --upgrade tensorflow_datasets --user
tmux
STORAGE_BUCKET=gs://YOUR-BUCKET-NAME
python -c 'import tensorflow_datasets as tfds; ds = tfds.load("imagenet2012:5.*.*", split="train", data_dir="'${STORAGE_BUCKET}/data'"); tfds.as_numpy(ds)'
```
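The one-liner above can also be written as a short script, which makes it easier to see that the data directory is derived from the bucket name. This is a minimal sketch, not part of the example code: the helper names `imagenet_data_dir` and `download_imagenet` are hypothetical, and it assumes that `tfds.load` with its default settings downloads and prepares the dataset when it is not already present in `data_dir`.

```python
def imagenet_data_dir(storage_bucket):
    """Return the dataset location used in this README: <bucket>/data."""
    return storage_bucket.rstrip("/") + "/data"


def download_imagenet(storage_bucket):
    """Load (and, if needed, download and prepare) the ImageNet train split."""
    # Imported lazily so the path helper works without tensorflow_datasets.
    import tensorflow_datasets as tfds

    return tfds.load(
        "imagenet2012:5.*.*",
        split="train",
        data_dir=imagenet_data_dir(storage_bucket),
    )


# Placeholder bucket name; a trailing slash is normalized away.
print(imagenet_data_dir("gs://YOUR-BUCKET-NAME/"))
```

Using the same bucket here and in `--imagenet_data_dir=$STORAGE_BUCKET/data` when running the example should keep the training job from downloading the data a second time.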

Install the necessary packages and download the example code:

```
git clone https://github.com/tensorflow/gan.git
pip install tensorflow_gan --user
```

Run the setup instructions in tensorflow_gan/examples/README.md to properly set up the PYTHONPATH.

Save the location of your cloud resources.

```
export STORAGE_BUCKET=gs://YOUR-BUCKET-NAME
export TPU_NAME=TPU-NAME
export PROJECT_ID=PROJECT-ID
export TPU_ZONE=ZONE
```
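Before launching a long job, it can help to confirm that all four variables are set in the current shell. The following is a small illustrative check, not part of the example code; the function name `find_problems` and the rule that the bucket must start with `gs://` are assumptions made for this sketch.

```python
import os

# The four variables exported in the step above.
REQUIRED_VARS = ("STORAGE_BUCKET", "TPU_NAME", "PROJECT_ID", "TPU_ZONE")


def find_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    bucket = env.get("STORAGE_BUCKET", "")
    # Assumption for this sketch: bucket locations are written as gs:// URLs.
    if bucket and not bucket.startswith("gs://"):
        problems.append("STORAGE_BUCKET should start with gs://")
    return problems


if __name__ == "__main__":
    for problem in find_problems(os.environ):
        print("warning:", problem)
```

Run it in the same shell session in which you exported the variables; it prints nothing when everything looks consistent.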

Run the example:
```
cd gan/tensorflow_gan/examples
python self_attention_estimator/train_experiment_main.py \
  --use_tpu=true \
  --eval_on_tpu=true \
  --use_tpu_estimator=true \
  --mode=train_and_eval \
  --max_number_of_steps=999999 \
  --train_batch_size=1024 \
  --eval_batch_size=1024 \
  --predict_batch_size=128 \
  --num_eval_steps=49 \
  --train_steps_per_eval=1000 \
  --tpu=$TPU_NAME \
  --gcp_project=$PROJECT_ID \
  --tpu_zone=$TPU_ZONE \
  --model_dir=$STORAGE_BUCKET/logdir \
  --imagenet_data_dir=$STORAGE_BUCKET/data \
  --alsologtostderr
```
- Note: If you've run the data download step, training should start almost immediately. Otherwise, the first run will take a long time to start, since the code needs to download the ImageNet dataset first.
- Note: If your job starts downloading the data even though you ran the pre-download step, you probably didn't enter the same STORAGE_BUCKET location as in the previous step.
- Note: If your job fails with something like "Could not write to the internal temporary file.", you might need to follow these instructions and give the TPU permission to write to your cloud bucket.
- Note: If your job fails with "IOError: [Errno 2] No usable temporary directory found in ...", you might have run out of disk space. Try clearing the temp directories listed and run the job again.
- Note: If your job fails with "Bad hardware status: ...", try restarting your TPU.
- Note: The batch sizes train_batch_size, eval_batch_size, and predict_batch_size must each be a multiple of the number of TPU shards in your machine (note that each TPU core contains two computation shards). In addition, predict_batch_size must be at least 16.
(Recommended) You can set up TensorBoard to track your training progress using these instructions.
Clean up by following the Clean up instructions in this tutorial.
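
The batch-size rules from the note above can be checked before starting a job. This is a minimal sketch under stated assumptions: `check_batch_sizes` is a hypothetical helper, not part of the example code, and `num_shards` must be supplied by you for your own TPU configuration (the value 8 below is only an illustration).

```python
def check_batch_sizes(train_batch_size, eval_batch_size, predict_batch_size, num_shards):
    """Return a list of violations of the batch-size rules; empty if all pass."""
    sizes = {
        "train_batch_size": train_batch_size,
        "eval_batch_size": eval_batch_size,
        "predict_batch_size": predict_batch_size,
    }
    problems = []
    # Rule 1: every batch size must be a multiple of the number of TPU shards.
    for name, size in sizes.items():
        if size % num_shards != 0:
            problems.append(f"{name}={size} is not a multiple of {num_shards} shards")
    # Rule 2: predict_batch_size must be at least 16.
    if predict_batch_size < 16:
        problems.append(f"predict_batch_size={predict_batch_size} is smaller than 16")
    return problems


# Batch sizes from the example command; num_shards=8 is an assumed value.
print(check_batch_sizes(1024, 1024, 128, num_shards=8))
```

An empty list means the flag values satisfy both rules for the shard count you passed in.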

Description

This code is a TF-GAN Estimator implementation of Self-Attention Generative Adversarial Networks. It can run locally, on GPU, and on cloud TPU.

Sample images (left to right): real images; generated images (GPU, 27 days); generated images (TPU, 2 days).

Inception score and Fréchet Inception distance as a function of step number:

In this example, we compare the running time of a system with 8 Tesla V100 GPUs to a system with 128 TPU v2 cores. As a function of training step, the GPU and TPU jobs perform similarly. In terms of wall-clock time, however, the TPU job is more than 12x faster.
