Using multiple GPUs

Theano has a feature to allow the use of multiple GPUs at the same time in one function. The multiple gpu feature requires the use of the GpuArray Backend backend, so make sure that works correctly.

In order to keep a reasonably high level of abstraction you do not refer to device names directly for multiple-gpu use. You instead refer to what we call context names. These are then mapped to a device using the theano configuration. This allows portability of models between machines.


The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to double-check your results before publishing a paper or anything of the sort.


For data-parallelism, you probably are better using platoon.

Defining the context map

The mapping from context names to devices is done through the config.contexts option. The format looks like this:


Let’s break it down. First there is a list of mappings. Each of these mappings is separated by a semicolon ‘;’. There can be any number of such mappings, but in the example above we have two of them: dev0->cuda0 and dev1->cuda1.

The mappings themselves are composed of a context name followed by the two characters ‘->’ and the device name. The context name is a simple string which does not have any special meaning for Theano. For parsing reasons, the context name cannot contain the sequence ‘->’ or ‘;’. To avoid confusion context names that begin with ‘cuda’ or ‘opencl’ are disallowed. The device name is a device in the form that gpuarray expects like ‘cuda0’ or ‘opencl0:0’.


Since there are a bunch of shell special characters in the syntax, defining this on the command-line will require proper quoting, like this:

$ THEANO_FLAGS="contexts=dev0->cuda0"

When you define a context map, if config.print_active_device is True (the default), Theano will print the mappings as they are defined. This will look like this:

$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)

If you don’t have enough GPUs for a certain model, you can assign the same device to more than one name. You can also assign extra names that a model doesn’t need to some other devices. However, a proliferation of names is not always a good idea since theano often assumes that different context names will be on different devices and will optimize accordingly. So you may get faster performance for a single name and a single device.


It is often the case that multi-gpu operation requires or assumes that all the GPUs involved are equivalent. This is not the case for this implementation. Since the user has the task of distributing the jobs across the different device a model can be built on the assumption that one of the GPU is slower or has smaller memory.

A simple graph on two GPUs

The following simple program works on two GPUs. It builds a function which perform two dot products on two different GPUs.

import numpy
import theano

v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),

f = theano.function([], [, v02),
               , v12)])


This model requires a context map with assignations for ‘dev0’ and ‘dev1’. It should run twice as fast when the devices are different.

Explicit transfers of data

Since operations themselves cannot work on more than one device, they will pick a device to work on based on their inputs and automatically insert transfers for any input which is not on the right device.

However you may want some explicit control over where and how these transfers are done at some points. This is done by using the new transfer() method that is present on variables. It works for moving data between GPUs and also between the host and the GPUs. Here is a example.

import theano

v = theano.tensor.fmatrix()

# Move to the device associated with 'gpudev'
gv = v.transfer('gpudev')

# Move back to the cpu
cv = gv.transfer('cpu')

Of course you can mix transfers and operations in any order you choose. However you should try to minimize transfer operations because they will introduce overhead that may reduce performance.