Utility functions
Optimization

theano.gpuarray.opt_util.alpha_merge(cls, alpha_in, beta_in)
Decorator to merge multiplication by a scalar on the output.
This will find a pattern of ts * <yourop>(some, params, alpha, beta) and update it so that the scalar multiplication happens as part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
out = Op() * alpha + out_like * beta
Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from blas.
The decorated function must have this signature:
maker(node, *inputs)
The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:
def maker(node, *inputs): return node.op(*inputs)
Parameters:  cls (op class) – The class of the op you want to merge
 alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
 beta_in (int) – The input index for the beta scalar for your op (in node.inputs).
Returns: an unregistered local optimizer that has the same name as the decorated function.
Return type: local optimizer
Notes
This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.
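The rewrite relies on the scalar distributing over both terms of the op's output. Below is a minimal plain-Python sketch of that identity. The `gemm_like` function and the list-based data are hypothetical stand-ins for illustration, not part of Theano:

```python
def gemm_like(op_result, out_like, alpha, beta):
    """Hypothetical stand-in for an op computing Op() * alpha + out_like * beta."""
    return [o * alpha + b * beta for o, b in zip(op_result, out_like)]


op_result = [1.0, 2.0, 3.0]     # what Op() would produce on its own
out_like = [10.0, 20.0, 30.0]   # buffer with the same size as the output
alpha, beta, ts = 0.5, 2.0, 4.0

# Pattern before the rewrite: ts * <yourop>(..., alpha, beta)
before = [ts * v for v in gemm_like(op_result, out_like, alpha, beta)]

# Pattern after the rewrite: the scalar is folded into both alpha and beta
after = gemm_like(op_result, out_like, ts * alpha, ts * beta)
```

Both lists hold the same values, which is why the multiplication can happen inside the op instead of as a separate elementwise step.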

theano.gpuarray.opt_util.find_node(fgraph, v, cls, ignore_clients=False)
Find the node that has an op of type cls in v.
This digs through possibly redundant transfers to find the node that has the type cls. If ignore_clients is False (the default), it will only dig through nodes that have a single client, to avoid duplicating computations.
Parameters:  v – The variable to dig through
 cls (Op class) – The type of the node we are looking for
 ignore_clients (bool, optional) – Whether to ignore multiple clients or not.
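The search can be pictured on a toy graph. The sketch below is a simplified model under stated assumptions: `Var`, `Node`, the op names in `TRANSFERS`, and `find_node_sketch` are all hypothetical and only mimic the behaviour described above. It is not the Theano implementation.

```python
class Var:
    """Hypothetical variable: produced by `owner`, used by `n_clients` nodes."""
    def __init__(self, owner=None, n_clients=1):
        self.owner = owner
        self.n_clients = n_clients


class Node:
    """Hypothetical apply node: an op name plus its input variables."""
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs


TRANSFERS = {"host_from_gpu", "gpu_from_host"}  # assumed transfer op names


def find_node_sketch(v, op_name, ignore_clients=False):
    while v.owner is not None:
        if not ignore_clients and v.n_clients != 1:
            return None          # digging further could duplicate computation
        if v.owner.op == op_name:
            return v.owner       # found the node we were looking for
        if v.owner.op not in TRANSFERS:
            return None          # only transfers may be skipped
        v = v.owner.inputs[0]    # step through the transfer
    return None


gemm = Node("gemm", [Var()])
single = Var(Node("gpu_from_host", [Var(Node("host_from_gpu", [Var(gemm)]))]))
shared = Var(Node("host_from_gpu", [Var(gemm, n_clients=2)]))
```

With `single`, every intermediate variable has one client, so the search reaches `gemm`. With `shared`, the search stops unless `ignore_clients=True` is passed.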

theano.gpuarray.opt_util.grab_cpu_scalar(v, nd)
Get a scalar variable value from the tree at v.
This function will dig through transfers and dimshuffles to get the constant value. If no such constant is found, it returns None.
Parameters:  v – Theano variable to extract the constant value from.
 nd (int) – Expected number of dimensions for the variable (for broadcasted constants).

theano.gpuarray.opt_util.inplace_allocempty(op, idx)
Wrapper to make an inplace optimization that deals with AllocEmpty.
This will duplicate the alloc input if it has more than one client to allow the op to work on it inplace.
The decorated function must have this signature:
maker(node, inputs)
The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. You should also switch the op to work inplace. The inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:
def maker(node, inputs): return [node.op.__class__(inplace=True)(*inputs)]
Parameters:  op (op class) – The op class to look for to make inplace
 idx (int) – The index of the (possibly) AllocEmpty input (in node.inputs).
Returns: an unregistered inplace local optimizer that has the same name as the decorated function.
Return type: local optimizer

theano.gpuarray.opt_util.is_equal(var, val)
Returns True if var is always equal to val.
This will only return True if the variable will always be equal to the value. If it might not be true in some cases then it returns False.
Parameters:  var – Variable to compare
 val – Python value
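The conservative behaviour can be illustrated with a toy model. In this sketch, `Constant`, `Symbolic`, and `is_equal_sketch` are hypothetical names; the assumption is that only a variable with a value known at graph-construction time can be proven equal to anything.

```python
class Constant:
    """Hypothetical variable whose value is fixed in the graph."""
    def __init__(self, value):
        self.value = value


class Symbolic:
    """Hypothetical variable whose value is only known at run time."""
    def __init__(self, name):
        self.name = name


def is_equal_sketch(var, val):
    if isinstance(var, Constant):
        return var.value == val
    # The value might happen to equal val at run time, but that cannot
    # be guaranteed, so the answer is False.
    return False


zero = Constant(0.0)
x = Symbolic("x")
```

Here `is_equal_sketch(zero, 0.0)` is True, while `is_equal_sketch(x, 0.0)` is False even though `x` could be zero for some inputs.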

theano.gpuarray.opt_util.op_lifter(OP, cuda_only=False)
OP(…, host_from_gpu(), …) -> host_from_gpu(GpuOP(…))
gpu_from_host(OP(inp0, …)) -> GpuOP(inp0, …)

theano.gpuarray.opt_util.output_merge(cls, alpha_in, beta_in, out_in)
Decorator to merge addition by a value on the output.
This will find a pattern of val + <yourop>(some, params, alpha, beta, out_like) and update it so that the addition happens as part of your op.
The op needs to accept an alpha and a beta scalar which act this way:
out = Op() * alpha + out_like * beta
Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from blas.
The decorated function must have this signature:
maker(node, *inputs)
The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:
def maker(node, *inputs): return node.op(*inputs)
Parameters:  cls (op class) – The class of the op you want to merge
 alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
 beta_in (int) – The input index for the beta scalar for your op (in node.inputs).
 out_in (int) – The input index for the out_like input for your op (in node.inputs).
Returns: an unregistered local optimizer that has the same name as the decorated function.
Return type: local optimizer
Notes
This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.
This also correctly handles the case where the added value is broadcasted (by not performing the replacement).
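One case where the addition can be folded in safely is when the op's existing beta is zero: the out_like slot is then unused and can carry the added value, with beta set to one. Whether the optimizer restricts itself to exactly this case is an assumption of this sketch, and `gemm_like` is a hypothetical stand-in rather than a Theano op.

```python
def gemm_like(op_result, out_like, alpha, beta):
    """Hypothetical stand-in for an op computing Op() * alpha + out_like * beta."""
    return [o * alpha + b * beta for o, b in zip(op_result, out_like)]


op_result = [1.0, 2.0, 3.0]    # what Op() would produce on its own
unused = [0.0, 0.0, 0.0]       # out_like is ignored because beta is 0
val = [10.0, 20.0, 30.0]       # value added to the op's output
alpha = 0.5

# Pattern before the rewrite: a separate elementwise addition
before = [v + r for v, r in zip(val, gemm_like(op_result, unused, alpha, 0.0))]

# Pattern after the rewrite: val takes the out_like slot and beta becomes 1
after = gemm_like(op_result, val, alpha, 1.0)
```

If `val` had to be broadcast to the output shape, it could not serve as a same-sized out_like buffer, which is consistent with the note above about skipping the replacement in that case.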

theano.gpuarray.opt_util.pad_dims(input, leftdims, rightdims)
Reshapes the input to a (leftdims + rightdims)-dimensional tensor.
This helper function is used to convert pooling inputs with arbitrary non-pooling dimensions to the correct number of dimensions for the GPU pooling ops.
This reduces or expands the dimensions to the left of the last rightdims dimensions to exactly leftdims, either by adding singleton dimensions on the left or by combining some of the existing leftmost dimensions.
Use unpad_dims to reshape back to the original dimensions.
Examples
Given input of shape (3, 5, 7),
pad_dims(input, 2, 2)
adds a singleton dimension and reshapes to (1, 3, 5, 7).
Given that output from pad_dims,
unpad_dims(output, input, 2, 2)
reshapes back to (3, 5, 7).
Given input of shape (3, 5, 7, 9),
pad_dims(input, 2, 2)
does not reshape and returns output with shape (3, 5, 7, 9).
Given input of shape (3, 5, 7, 9, 11),
pad_dims(input, 2, 2)
combines the first two dimensions and reshapes to (15, 7, 9, 11).
Given input of shape (3, 5, 7, 9),
pad_dims(input, 2, 3)
adds a singleton dimension and reshapes to (1, 3, 5, 7, 9).
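The shape arithmetic in these examples can be reproduced with a small pure-Python helper. `padded_shape` is a hypothetical name used only for illustration. It works on shape tuples rather than tensors, and it assumes leftdims is at least 1.

```python
from functools import reduce
from operator import mul


def padded_shape(shape, leftdims, rightdims):
    """Return the shape pad_dims would produce, per the examples above."""
    shape = tuple(shape)
    assert len(shape) >= rightdims
    n_left = len(shape) - rightdims          # dims left of the rightmost block
    left, right = shape[:n_left], shape[n_left:]
    if n_left < leftdims:
        # Too few dimensions on the left: prepend singleton dimensions.
        left = (1,) * (leftdims - n_left) + left
    elif n_left > leftdims:
        # Too many: merge the leftmost dimensions into a single one.
        n_merge = n_left - leftdims + 1
        left = (reduce(mul, left[:n_merge], 1),) + left[n_merge:]
    return left + right


shapes = [
    padded_shape((3, 5, 7), 2, 2),
    padded_shape((3, 5, 7, 9), 2, 2),
    padded_shape((3, 5, 7, 9, 11), 2, 2),
    padded_shape((3, 5, 7, 9), 2, 3),
]
```

The rightmost rightdims dimensions are never touched; only the dimensions to their left are padded or merged.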
Kernel generation
Helper routines for generating gpu kernels for nvcc.

theano.gpuarray.kernel_codegen.code_version(version)
Decorator to support version-based cache mechanism.

theano.gpuarray.kernel_codegen.inline_reduce(N, buf, pos, count, manner_fn)
Return C++ code for a function that reduces a contiguous buffer.
Parameters:  N – Length of the buffer.
 buf – buffer pointer.
 pos – Index of executing thread.
 count – Number of executing threads.
 manner_fn – A function that accepts strings of arguments a and b, and returns C code for their reduction. For example, return “%(a)s + %(b)s” for a sum reduction.
Notes
buf should be in GPU shared memory, since it is accessed many times.
This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.
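A manner_fn is an ordinary Python function that builds a C expression from two operand strings. The sketch below shows a sum and a max variant. The function names and the operand strings passed in are illustrative assumptions, not values taken from the library.

```python
def sum_fn(a, b):
    """Return C code that adds the two operand expressions."""
    return "%(a)s + %(b)s" % {"a": a, "b": b}


def max_fn(a, b):
    """Return C code that takes the larger of the two operand expressions."""
    return "max(%(a)s, %(b)s)" % {"a": a, "b": b}


# Operand strings are whatever C expressions the code generator supplies;
# these two are made up for the example.
expr_sum = sum_fn("buf[pos]", "buf[pos + 16]")
expr_max = max_fn("buf[pos]", "buf[pos + 16]")
```

Because the callback only returns text, the same reduction skeleton can be reused for any associative operation that can be written as a C expression.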
Return C++ code for a function that reduces a contiguous buffer.
This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.
Parameters:  N – Length of the buffer.
 buf – Buffer pointer of size warpSize * sizeof(dtype).
 x – Input data.
 stride_x – Input data stride.
 load_x – Wrapper to read from x.
 pos – Index of executing thread.
 count – Number of executing threads.
 manner_fn – A function that accepts strings of arguments a and b, and returns C code for their reduction. For example, return “%(a)s + %(b)s” for a sum reduction.
 manner_init – A function that accepts a string argument a and returns C code for its initialization.
 b – Optional, pointer to the bias.
 stride_b – Optional, the stride of b if b is provided.
 load_b – Optional, wrapper to read from b if b is provided.
 dtype – Optional, the dtype of the output.
Notes
buf should be in GPU shared memory, since it is accessed many times.

theano.gpuarray.kernel_codegen.inline_softmax(N, buf, buf2, threadPos, threadCount, dtype='float32')
Generate code for a softmax.
On entry, buf and buf2 must contain two identical copies of the input to softmax.
After the code returns, buf contains the softmax and buf2 contains the unnormalized softmax.
Parameters:  N – Length of the buffer.
 threadPos – Index of executing thread.
 threadCount – Number of executing threads.
 dtype – Dtype of the softmax’s output.
Notes
buf and buf2 should be in GPU shared memory, since they are accessed many times.
We use __i as an int variable in a loop.
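The entry and exit contract for the two buffers can be mirrored by a CPU reference in plain Python. This is a sketch under one assumption: the row maximum is subtracted before exponentiation for numerical stability, as is usual for softmax, so the "unnormalized softmax" is exp(x - max). `softmax_reference` is a hypothetical name.

```python
import math


def softmax_reference(row):
    """Return (buf, buf2) as they would look after the generated code runs."""
    buf = list(row)    # on entry, both buffers hold the softmax input
    buf2 = list(row)
    row_max = max(buf)                            # first reduction: max
    buf2 = [math.exp(x - row_max) for x in buf2]  # unnormalized softmax
    row_sum = sum(buf2)                           # second reduction: sum
    buf = [e / row_sum for e in buf2]             # normalized softmax
    return buf, buf2


buf, buf2 = softmax_reference([1.0, 2.0, 3.0])
```

The entries of `buf` sum to one, and the largest entry of `buf2` is exactly 1.0 because it corresponds to exp(0).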
Generate code to perform softmax with a fixed amount of shared memory.
On entry, buf is assumed to be empty.
On exit, buf[0] contains the softmax, buf2 contains unnormalized softmax.
Parameters:  N – Length of the buffer, atleast waprSize(32).
 buf – A shared memory buffer of size warpSize * sizeof(dtype).
 x – A ptr to the gpu memory where the row is stored.
 stride_x – The stride between each element in x.
 load_x – Wrapper to read from x.
 sm – A ptr to the gpu memory to store the result.
 sm_stride – The stride between each sm element.
 write_sm – Wrapper before writing to sm.
 threadPos – Index of executing thread.
 threadCount – Number of executing threads.
 b – Optional, pointer to the bias.
 stride_b – Optional, the stride of b if b is provided.
 load_b – Optional, wrapper to read from b if b is provided.
 dtype – Optional, the dtype of the softmax’s output if not float32.
Notes
buf should be in GPU shared memory, since it is accessed many times.
We use tx as an int variable in a loop.