MODEL#

Classes#

class models.moe_adapters_utils.model.AttentionPool2d(spacial_dim, embed_dim, num_heads, output_dim=None)[source]#

Bases: Module

forward(x)[source]#

class models.moe_adapters_utils.model.Bottleneck(inplanes, planes, stride=1)[source]#

Bases: Module

expansion = 4#

forward(x)[source]#

class models.moe_adapters_utils.model.CLIP(args, embed_dim, image_resolution, vision_layers, vision_width, vision_patch_size, context_length, vocab_size, transformer_width, transformer_heads, transformer_layers, baseline=False)[source]#

Bases: Module

build_attention_mask()[source]#

property dtype#

encode_image(image)[source]#

encode_text(text)[source]#

forward(image, text, taskid, is_train)[source]#

initialize_parameters()[source]#

class models.moe_adapters_utils.model.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, device=None, dtype=None)[source]#

Bases: LayerNorm

Subclass torch’s LayerNorm to handle fp16.

forward(x)[source]#
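
The one-line description above can be illustrated with a minimal sketch. It assumes that "handle fp16" means normalizing in float32 and casting the result back to the input dtype; the class name `Fp16SafeLayerNorm` is hypothetical and is not this module's code.

```python
import torch
import torch.nn as nn


class Fp16SafeLayerNorm(nn.LayerNorm):
    """Hypothetical name: normalizes in float32, returns the input dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        # Upcast so the mean/variance reduction is not computed in half precision.
        out = super().forward(x.to(torch.float32))
        # Cast back so downstream fp16 layers receive the dtype they expect.
        return out.to(orig_dtype)


x = torch.randn(2, 8).half()
y = Fp16SafeLayerNorm(8)(x)
```

The layer's own weight and bias stay in float32 in this sketch; only the activations are cast on the way in and out.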

class models.moe_adapters_utils.model.ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64)[source]#

Bases: Module

A ResNet class that is similar to torchvision’s but contains the following changes:

- There are now 3 “stem” convolutions as opposed to 1, with an average pool instead of a max pool.
- Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1.
- The final pooling layer is a QKV attention instead of an average pool.

forward(x)[source]#
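
The anti-aliasing change described for this class can be sketched as follows. This is an illustration only: the helper name `antialiased_downsample` and the choice of a 1x1 convolution are assumptions, not taken from this module.

```python
import torch
import torch.nn as nn


def antialiased_downsample(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """Hypothetical helper: average-pool first, then convolve with stride 1."""
    layers = []
    if stride > 1:
        # The pooling layer performs the spatial reduction instead of the conv.
        layers.append(nn.AvgPool2d(stride))
    # Assumption: a 1x1 convolution; it always runs with stride 1 here.
    layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False))
    return nn.Sequential(*layers)


block = antialiased_downsample(8, 16, stride=2)
out = block(torch.randn(1, 8, 16, 16))
```

Averaging before subsampling acts as a low-pass filter, which is the reason a pooled stride-1 convolution is less prone to aliasing than a plain strided one.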

class models.moe_adapters_utils.model.QuickGELU(*args, **kwargs)[source]#

Bases: Module

forward(x)[source]#

class models.moe_adapters_utils.model.ResidualAttentionBlock(args, d_model, n_head, attn_mask=None, text_or_image=None)[source]#

Bases: Module

attention(x)[source]#

cv_squared(x)[source]#

The squared coefficient of variation of a sample. Useful as a loss to encourage a positive distribution to be more uniform. Epsilons added for numerical stability. Returns 0 for an empty Tensor.

Parameters: x – a Tensor.
Returns: a Scalar.
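
As a worked illustration of the quantity described above, the sketch below computes variance divided by squared mean. The function name, the `eps` value, and the handling of single-element inputs are assumptions for this example, not details confirmed by the source.

```python
import torch


def cv_squared_sketch(x: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Hypothetical stand-in: Var(x) / (Mean(x)^2 + eps)."""
    x = x.float()
    # Assumption: treat empty and single-element inputs as perfectly uniform,
    # since the sample variance is undefined for them.
    if x.numel() <= 1:
        return torch.zeros(())
    return x.var() / (x.mean() ** 2 + eps)


balanced = cv_squared_sketch(torch.ones(4))                      # uniform load
skewed = cv_squared_sketch(torch.tensor([4.0, 0.0, 0.0, 0.0]))   # one busy expert
```

A uniform vector scores 0 and a concentrated one scores higher, so adding this value to the loss pushes the per-expert totals toward balance.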

forward(x)[source]#

noisy_top_k_gating(x, train, w_gate, w_noise, noise_epsilon=0.01)[source]#

Noisy top-k gating. See paper: https://arxiv.org/abs/1701.06538.

Parameters:
x – input Tensor with shape [batch_size, input_size]
train – a boolean; noise is added only at training time
noise_epsilon – a float

Returns:
gates – a Tensor with shape [batch_size, num_experts]
load – a Tensor with shape [num_experts]
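
The gating scheme from the referenced paper can be sketched roughly as below. This is a simplified standalone function, not the method's actual body: the explicit `k` argument is an assumption (the documented signature does not take it), and `load` is approximated by a hard count of selected examples rather than the smooth estimator used in the paper.

```python
import torch
import torch.nn.functional as F


def noisy_top_k_gating_sketch(x, train, w_gate, w_noise, k=2, noise_epsilon=0.01):
    """Hypothetical standalone version; `k` is an assumed extra argument."""
    clean_logits = x @ w_gate                       # [batch_size, num_experts]
    if train:
        # Learned, input-dependent noise scale, kept strictly positive.
        noise_stddev = F.softplus(x @ w_noise) + noise_epsilon
        logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev
    else:
        logits = clean_logits
    # Keep the k largest logits per example and renormalize over them only.
    top_logits, top_idx = logits.topk(k, dim=1)
    top_gates = F.softmax(top_logits, dim=1)
    gates = torch.zeros_like(logits).scatter(1, top_idx, top_gates)
    # Simplification: load = number of examples routed to each expert.
    load = (gates > 0).sum(dim=0)
    return gates, load


torch.manual_seed(0)
x = torch.randn(4, 5)          # batch_size=4, input_size=5
w_gate = torch.randn(5, 3)     # num_experts=3
w_noise = torch.zeros(5, 3)
gates, load = noisy_top_k_gating_sketch(x, True, w_gate, w_noise)
```

Each row of `gates` has exactly `k` nonzero entries that sum to 1, which is the sparsity SparseDispatcher relies on.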

class models.moe_adapters_utils.model.SparseDispatcher(num_experts, gates)[source]#

Bases: object

Helper for implementing a mixture of experts. The purpose of this class is to create input minibatches for the experts and to combine the results of the experts to form a unified output tensor.

There are two functions:

- dispatch - take an input Tensor and create input Tensors for each expert.
- combine - take output Tensors from each expert and form a combined output Tensor. Outputs from different experts for the same batch element are summed together, weighted by the provided “gates”.

The class is initialized with a “gates” Tensor, which specifies which batch elements go to which experts, and the weights to use when combining the outputs. Batch element b is sent to expert e iff gates[b, e] != 0.

The inputs and outputs are all two-dimensional [batch, depth]. Caller is responsible for collapsing additional dimensions prior to calling this class and reshaping the output to the original shape. See common_layers.reshape_like().

Example use:

gates: a float32 Tensor with shape [batch_size, num_experts]
inputs: a float32 Tensor with shape [batch_size, input_size]
experts: a list of length num_experts containing sub-networks.

    dispatcher = SparseDispatcher(num_experts, gates)
    expert_inputs = dispatcher.dispatch(inputs)
    expert_outputs = [experts[i](expert_inputs[i]) for i in range(num_experts)]
    outputs = dispatcher.combine(expert_outputs)

The preceding code sets the output for a particular example b to:

    output[b] = Sum_i(gates[b, i] * experts[i](inputs[b]))

This class takes advantage of sparsity in the gate matrix by including in the Tensors for expert i only the batch elements for which gates[b, i] > 0.

combine(expert_out, multiply_by_gates=True)[source]#

Sum together the expert output, weighted by the gates. The slice corresponding to a particular batch element b is computed as the sum over all experts i of the expert output, weighted by the corresponding gate values. If multiply_by_gates is set to False, the gate values are ignored.

Parameters:
expert_out – a list of num_experts Tensors, each with shape [expert_batch_size_i, <extra_output_dims>].
multiply_by_gates – a boolean

Returns: a Tensor with shape [batch_size, <extra_output_dims>].

dispatch(inp)[source]#

Create one input Tensor for each expert. The Tensor for an expert i contains the slices of inp corresponding to the batch elements b where gates[b, i] > 0.

Parameters: inp – a Tensor of shape [batch_size, <extra_input_dims>]

Returns: a list of num_experts Tensors with shapes [expert_batch_size_i, <extra_input_dims>].

expert_to_gates()[source]#

Gate values corresponding to the examples in the per-expert Tensors.

Returns: a list of num_experts one-dimensional Tensors with type tf.float32 and shapes [expert_batch_size_i].
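
The dispatch/combine contract documented for this class can be illustrated with two small free functions. These are hypothetical stand-ins written for this example (`dispatch_sketch`, `combine_sketch`), not the class's implementation, and they assume two-dimensional inputs and toy "experts" that only scale their input.

```python
import torch


def dispatch_sketch(inp, gates):
    """One tensor per expert, holding the rows b where gates[b, e] > 0."""
    return [inp[gates[:, e] > 0] for e in range(gates.shape[1])]


def combine_sketch(expert_out, gates, multiply_by_gates=True):
    """Scatter expert outputs back to batch order and sum them per example."""
    batch_size, num_experts = gates.shape
    out = torch.zeros(batch_size, expert_out[0].shape[1])
    for e in range(num_experts):
        rows = torch.nonzero(gates[:, e] > 0).squeeze(1)
        y = expert_out[e]
        if multiply_by_gates:
            y = y * gates[rows, e].unsqueeze(1)
        # Accumulate, since one example may be routed to several experts.
        out.index_add_(0, rows, y)
    return out


gates = torch.tensor([[0.7, 0.3, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.5, 0.0, 0.5]])
inputs = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Toy experts: expert e multiplies its input by (e + 1).
experts = [lambda t, s=e + 1: t * s for e in range(3)]

expert_inputs = dispatch_sketch(inputs, gates)
expert_outputs = [experts[e](expert_inputs[e]) for e in range(3)]
outputs = combine_sketch(expert_outputs, gates)
```

For the first example, the result is 0.7 * 1 * [1, 2] + 0.3 * 2 * [1, 2] = [1.3, 2.6], matching the Sum_i(gates[b, i] * experts[i](inputs[b])) formula in the class description.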

class models.moe_adapters_utils.model.Transformer(args, width, layers, heads, attn_mask=None, text_or_image=None)[source]#

Bases: Module

forward(x)[source]#

class models.moe_adapters_utils.model.VisualTransformer(args, input_resolution, patch_size, width, layers, heads, output_dim, text_or_image=None)[source]#

Bases: Module

forward(x)[source]#

Functions#

models.moe_adapters_utils.model.build_model(args, state_dict)[source]#

models.moe_adapters_utils.model.convert_weights(model)[source]#

Convert applicable model parameters to fp16
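
A rough sketch of what such a conversion can look like is given below. Which layer types count as "applicable" is an assumption here (convolution and linear layers only), and `convert_weights_sketch` is a hypothetical name rather than this function's source.

```python
import torch
import torch.nn as nn


def convert_weights_sketch(model: nn.Module) -> None:
    """Hypothetical stand-in: cast conv/linear parameters to fp16 in place."""

    def _to_fp16(layer: nn.Module) -> None:
        # Assumption: only these layer types are converted; normalization
        # layers are left in float32.
        if isinstance(layer, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            layer.weight.data = layer.weight.data.half()
            if layer.bias is not None:
                layer.bias.data = layer.bias.data.half()

    # Module.apply visits every submodule recursively, then the module itself.
    model.apply(_to_fp16)


model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))
convert_weights_sketch(model)
```

Leaving normalization layers in float32 pairs naturally with a LayerNorm subclass that casts its input, as documented for the LayerNorm class on this page.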
