MODEL
Classes
- class models.moe_adapters_utils.model.AttentionPool2d(spacial_dim, embed_dim, num_heads, output_dim=None)
Bases:
Module
- class models.moe_adapters_utils.model.Bottleneck(inplanes, planes, stride=1)
Bases:
Module
- expansion = 4
- class models.moe_adapters_utils.model.CLIP(embed_dim, image_resolution, vision_layers, vision_width, vision_patch_size, context_length, vocab_size, transformer_width, transformer_heads, transformer_layers, baseline=False)
Bases:
Module
- property dtype
- class models.moe_adapters_utils.model.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, device=None, dtype=None)
Bases:
LayerNorm
Subclass torch’s LayerNorm to handle fp16.
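The fp16 handling can be sketched roughly as follows. This is a minimal illustration, assuming the subclass upcasts its input to float32, normalizes, and casts the result back to the input dtype; the class name `Fp16SafeLayerNorm` is hypothetical, and the actual `forward` method is not shown on this page.

```python
import torch
from torch import nn


class Fp16SafeLayerNorm(nn.LayerNorm):
    """Hypothetical sketch: normalize in float32, return the caller's dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        # Assumption: the statistics are computed in float32 so that
        # half-precision inputs do not overflow or lose precision.
        out = super().forward(x.to(torch.float32))
        return out.to(orig_dtype)


norm = Fp16SafeLayerNorm(8)
x = torch.randn(2, 8).half()  # fp16 input
y = norm(x)                   # same shape and dtype as x
```

The layer's own parameters stay in float32 here; only the activations are cast on the way in and out.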
- class models.moe_adapters_utils.model.ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64)
Bases:
Module
A ResNet class that is similar to torchvision’s but contains the following changes:
- There are now 3 “stem” convolutions as opposed to 1, with an average pool instead of a max pool.
- Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1.
- The final pooling layer is a QKV attention instead of an average pool.
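The anti-aliasing change in the list above can be illustrated with a small sketch. It assumes the downsampling is done by an average pool whose kernel equals the stride, followed by a stride-1 convolution; the helper name `antialiased_downsample` and the 1x1 kernel are illustrative choices, not taken from this page.

```python
import torch
from torch import nn


def antialiased_downsample(in_channels: int, out_channels: int, stride: int) -> nn.Sequential:
    """Hypothetical helper: avgpool first, then a stride-1 convolution."""
    layers = []
    if stride > 1:
        # The average pool performs the spatial reduction, so the
        # convolution that follows never skips over input pixels.
        layers.append(nn.AvgPool2d(stride))
    layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False))
    return nn.Sequential(*layers)


x = torch.randn(2, 64, 56, 56)
block = antialiased_downsample(64, 64 * 4, stride=2)  # 4 matches Bottleneck.expansion
y = block(x)  # spatial size halves, channels grow by the expansion factor
```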
- class models.moe_adapters_utils.model.ResidualAttentionBlock(d_model, n_head, attn_mask=None, text_or_image=None)
Bases:
Module
- cv_squared(x)
The squared coefficient of variation of a sample. Useful as a loss to encourage a positive distribution to be more uniform. Epsilons are added for numerical stability. Returns 0 for an empty Tensor.
- Parameters:
x – a Tensor.
- Returns:
a Scalar.
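To make the quantity concrete, here is a framework-free sketch of the squared coefficient of variation, variance divided by squared mean. It works on a plain list instead of a Tensor; the sample (n - 1) variance, the `eps` value, and returning 0 for fewer than two values are assumptions made for the illustration.

```python
from statistics import mean, variance


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation of a list of numbers (illustrative)."""
    if len(values) < 2:
        # Covers the empty case; a single value has no spread either.
        return 0.0
    m = mean(values)
    # eps keeps the division finite when the mean is zero.
    return variance(values) / (m * m + eps)


balanced = cv_squared([1.0, 1.0, 1.0, 1.0])  # every expert used equally
skewed = cv_squared([4.0, 0.0, 0.0, 0.0])    # one expert takes everything
```

A perfectly uniform sample gives 0, and the value grows as usage concentrates on fewer experts, which is why it can serve as a balancing loss.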
- noisy_top_k_gating(x, train, w_gate, w_noise, noise_epsilon=0.01)
Noisy top-k gating. See paper: https://arxiv.org/abs/1701.06538.
- Parameters:
x – input Tensor with shape [batch_size, input_size]
train – a boolean; we only add noise at training time.
noise_epsilon – a float
- Returns:
gates – a Tensor with shape [batch_size, num_experts]
load – a Tensor with shape [num_experts]
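The gating step can be sketched without any tensor library, following the formulation in the linked paper: add softplus-scaled Gaussian noise to the logits during training, keep the top k, and apply a softmax over the kept entries. Several things here are assumptions for illustration: the inputs are the precomputed products of x with w_gate and w_noise, `k` is passed explicitly although the documented signature has no such argument, and `load` is simplified to a count of examples routed to each expert.

```python
import math
import random


def softplus(v):
    return math.log1p(math.exp(v))


def noisy_top_k_gates(clean_logits, raw_noise, k, train, noise_epsilon=0.01, rng=random):
    """Gate weights for one example; non-selected experts get exactly 0."""
    logits = list(clean_logits)
    if train:
        # Noise is only added at training time; its scale is learned per expert.
        logits = [c + rng.gauss(0.0, 1.0) * (softplus(r) + noise_epsilon)
                  for c, r in zip(clean_logits, raw_noise)]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}  # stable softmax over the top k
    total = sum(exps.values())
    gates = [0.0] * len(logits)
    for i in top:
        gates[i] = exps[i] / total
    return gates


clean = [[2.0, 0.5, 1.0, -1.0], [1.5, 3.0, 0.1, 0.2]]  # stands in for x @ w_gate
noise = [[0.0] * 4, [0.0] * 4]                         # stands in for x @ w_noise
gates = [noisy_top_k_gates(c, n, k=2, train=False) for c, n in zip(clean, noise)]
# Simplified load: how many examples send a nonzero gate to each expert.
load = [sum(1 for row in gates if row[e] > 0) for e in range(4)]
```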
- class models.moe_adapters_utils.model.SparseDispatcher(num_experts, gates)
Bases:
object
Helper for implementing a mixture of experts. The purpose of this class is to create input minibatches for the experts and to combine the results of the experts to form a unified output tensor. There are two functions:
- dispatch: take an input Tensor and create input Tensors for each expert.
- combine: take output Tensors from each expert and form a combined output Tensor. Outputs from different experts for the same batch element are summed together, weighted by the provided “gates”.
The class is initialized with a “gates” Tensor, which specifies which batch elements go to which experts, and the weights to use when combining the outputs. Batch element b is sent to expert e iff gates[b, e] != 0.
The inputs and outputs are all two-dimensional [batch, depth]. The caller is responsible for collapsing additional dimensions prior to calling this class and reshaping the output to the original shape. See common_layers.reshape_like().
Example use:
    gates: a float32 Tensor with shape [batch_size, num_experts]
    inputs: a float32 Tensor with shape [batch_size, input_size]
    experts: a list of length num_experts containing sub-networks.
    dispatcher = SparseDispatcher(num_experts, gates)
    expert_inputs = dispatcher.dispatch(inputs)
    expert_outputs = [experts[i](expert_inputs[i]) for i in range(num_experts)]
    outputs = dispatcher.combine(expert_outputs)
The preceding code sets the output for a particular example b to:
    output[b] = Sum_i(gates[b, i] * experts[i](inputs[b]))
This class takes advantage of sparsity in the gate matrix by including in the Tensors for expert i only the batch elements for which gates[b, i] > 0.
- combine(expert_out, multiply_by_gates=True)
Sum together the expert output, weighted by the gates. The slice corresponding to a particular batch element b is computed as the sum over all experts i of the expert output, weighted by the corresponding gate values. If multiply_by_gates is set to False, the gate values are ignored.
- Parameters:
expert_out – a list of num_experts Tensors, each with shape [expert_batch_size_i, <extra_output_dims>].
multiply_by_gates – a boolean
- Returns:
a Tensor with shape [batch_size, <extra_output_dims>].
- dispatch(inp)
Create one input Tensor for each expert. The Tensor for an expert i contains the slices of inp corresponding to the batch elements b where gates[b, i] > 0.
- Parameters:
inp – a Tensor of shape [batch_size, <extra_input_dims>]
- Returns:
a list of num_experts Tensors with shapes [expert_batch_size_i, <extra_input_dims>].
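The dispatch and combine behaviour described for this class can be traced on a tiny example. The sketch below re-implements the idea with nested Python lists rather than Tensors; `ListDispatcher`, `make_expert`, and the scaling "experts" are hypothetical stand-ins, and only the routing rule (gates[b, i] > 0) and the weighted sum come from the description above.

```python
class ListDispatcher:
    """List-based illustration of sparse dispatch/combine (not the real class)."""

    def __init__(self, num_experts, gates):
        self.gates = gates
        # For each expert, the batch indices with a positive gate, in batch order.
        self.batch_index = [[b for b, row in enumerate(gates) if row[e] > 0]
                            for e in range(num_experts)]

    def dispatch(self, inputs):
        # One (possibly empty) mini-batch per expert.
        return [[inputs[b] for b in idx] for idx in self.batch_index]

    def combine(self, expert_out, multiply_by_gates=True):
        depth = next((len(rows[0]) for rows in expert_out if rows), 0)
        combined = [[0.0] * depth for _ in self.gates]
        for e, rows in enumerate(expert_out):
            for b, row in zip(self.batch_index[e], rows):
                weight = self.gates[b][e] if multiply_by_gates else 1.0
                for d, value in enumerate(row):
                    combined[b][d] += weight * value
        return combined


def make_expert(scale):
    # Toy expert: multiplies every feature by a constant.
    return lambda rows: [[scale * v for v in row] for row in rows]


gates = [[0.75, 0.25, 0.0], [0.0, 0.0, 1.0], [0.5, 0.0, 0.5]]
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]

dispatcher = ListDispatcher(len(experts), gates)
expert_inputs = dispatcher.dispatch(inputs)
expert_outputs = [experts[i](expert_inputs[i]) for i in range(len(experts))]
outputs = dispatcher.combine(expert_outputs)
unweighted = dispatcher.combine(expert_outputs, multiply_by_gates=False)
```

Each expert only sees the rows routed to it, so expert 1 processes a single example here while experts 0 and 2 each process two.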