mygrad.nnet.layers.gru

mygrad.nnet.layers.gru(X, Uz, Wz, bz, Ur, Wr, br, Uh, Wh, bh, s0=None, bp_lim=None, dropout=0.0, constant=None)

Performs a forward pass of sequential data through a Gated Recurrent Unit (GRU) layer, returning the ‘hidden-descriptors’ computed from the trainable parameters as follows:

Z_{t} = sigmoid(X_{t} Uz + S_{t-1} Wz + bz)
R_{t} = sigmoid(X_{t} Ur + S_{t-1} Wr + br)
H_{t} =    tanh(X_{t} Uh + (R_{t} * S_{t-1}) Wh + bh)
S_{t} = (1 - Z_{t}) * H_{t} + Z_{t} * S_{t-1}
Parameters:
X : array_like, shape=(T, N, C)

The sequential data to be passed forward.

Uz : array_like, shape=(C, D)

The weights used to map the sequential data into the update gate, Z.

Wz : array_like, shape=(D, D)

The weights used to map the previous hidden-descriptor into the update gate, Z.

bz : array_like, shape=(D,)

The bias for the update gate, Z.

Ur : array_like, shape=(C, D)

The weights used to map the sequential data into the reset gate, R.

Wr : array_like, shape=(D, D)

The weights used to map the previous hidden-descriptor into the reset gate, R.

br : array_like, shape=(D,)

The bias for the reset gate, R.

Uh : array_like, shape=(C, D)

The weights used to map the sequential data into the candidate hidden state, H.

Wh : array_like, shape=(D, D)

The weights used to map the reset-scaled previous hidden-descriptor into the candidate hidden state, H.

bh : array_like, shape=(D,)

The bias for the candidate hidden state, H.

s0 : Optional[array_like], shape=(N, D)

The ‘seed’ hidden descriptors to feed into the RNN. If None, a Tensor of zeros of shape (N, D) is created.

bp_lim : Optional[int]

This feature is experimental and is currently untested. The (non-zero) limit on the depth of back-propagation through time to be performed. If None, back-propagation is performed through the entire sequence.

E.g. bp_lim=3 will propagate gradients only up to 3 steps backward through the recursive sequence.

dropout : float (default=0.0), 0 <= dropout < 1

If non-zero, the dropout scheme described in [1] is applied. See Notes for more details.

constant : bool, optional (default=False)

If True, the resulting Tensor is a constant.

Returns:
mygrad.Tensor, shape=(T+1, N, D)

The sequence of ‘hidden-descriptors’ produced by the forward pass of the RNN.
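
A minimal usage sketch is given below, assuming gru is importable from mygrad.nnet.layers as the module path above indicates; the shapes follow the Parameters section, and the randomly-initialized parameters are placeholders rather than a trained model.

import numpy as np

from mygrad import Tensor
from mygrad.nnet.layers import gru

T, N, C, D = 5, 2, 3, 4  # sequence length, batch size, datum length, hidden size

X = Tensor(np.random.rand(T, N, C))  # the sequential data

# randomly-initialized parameters for the update gate (z), reset gate (r),
# and candidate hidden state (h)
Uz, Ur, Uh = (Tensor(np.random.randn(C, D) * 0.1) for _ in range(3))
Wz, Wr, Wh = (Tensor(np.random.randn(D, D) * 0.1) for _ in range(3))
bz, br, bh = (Tensor(np.zeros((D,))) for _ in range(3))

S = gru(X, Uz, Wz, bz, Ur, Wr, br, Uh, Wh, bh)
print(S.shape)  # (6, 2, 4), i.e. (T + 1, N, D): the sequence of hidden-descriptors

# back-propagate from a scalar downstream of the final hidden-descriptors
S[-1].sum().backward()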

Notes

  • \(T\) : Sequence length

  • \(N\) : Batch size

  • \(C\) : Length of single datum

  • \(D\) : Length of ‘hidden’ descriptor

The GRU system of equations is given by:

\[\begin{aligned}
Z_{t} &= \sigma (X_{t} U_z + S_{t-1} W_z + b_z)\\
R_{t} &= \sigma (X_{t} U_r + S_{t-1} W_r + b_r)\\
H_{t} &= \tanh (X_{t} U_h + (R_{t} * S_{t-1}) W_h + b_h)\\
S_{t} &= (1 - Z_{t}) * H_{t} + Z_{t} * S_{t-1}
\end{aligned}\]
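
For concreteness, a single step of this recurrence can be written directly in NumPy. This is an illustrative sketch of the equations above (the helper names sigmoid and gru_step are introduced here for illustration only), not MyGrad's internal implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, Uz, Wz, bz, Ur, Wr, br, Uh, Wh, bh):
    """One step of the GRU recurrence.

    x_t    : shape-(N, C) batch of data at time t
    s_prev : shape-(N, D) hidden-descriptors, S_{t-1}
    """
    z = sigmoid(x_t @ Uz + s_prev @ Wz + bz)        # update gate, Z_t
    r = sigmoid(x_t @ Ur + s_prev @ Wr + br)        # reset gate, R_t
    h = np.tanh(x_t @ Uh + (r * s_prev) @ Wh + bh)  # candidate state, H_t
    return (1 - z) * h + z * s_prev                 # new hidden-descriptor, S_t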

Following the dropout scheme specified in [1], the hidden-to-hidden weight matrices (Wz, Wr, Wh) have entries randomly dropped prior to the forward/backward pass. The input connections (via Uz, Ur, Uh) have variational dropout ([2]) applied to them, with a common dropout mask shared across all time steps t. That is, three static dropout masks, each of shape (N, D), are applied to

\[\begin{aligned}
X_{t} U_z\\
X_{t} U_r\\
X_{t} U_h
\end{aligned}\]

respectively, for all \(t\).
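
The shared-over-time nature of these input masks can be sketched as follows; this illustrates the scheme only (the inverted-dropout scaling by 1/(1 - p) is an assumption here, and MyGrad's internal implementation may differ):

import numpy as np

rng = np.random.default_rng(0)
T, N, C, D = 5, 2, 3, 4
p = 0.25  # dropout probability

X = rng.random((T, N, C))
Uz, Ur, Uh = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

# three static shape-(N, D) masks, drawn once and reused at every time step
mask_z, mask_r, mask_h = ((rng.random((N, D)) >= p) / (1 - p) for _ in range(3))

for t in range(T):
    xz = (X[t] @ Uz) * mask_z  # the same mask is applied at every t (variational dropout)
    xr = (X[t] @ Ur) * mask_r
    xh = (X[t] @ Uh) * mask_h
    # ...these terms then feed the gate equations given above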

References

[1]

S. Merity, et al., “Regularizing and Optimizing LSTM Language Models”, arXiv:1708.02182v1, 2017.

[2]

Y. Gal, Z. Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks”, arXiv:1512.05287v5, 2016.