CLIP#

Arguments#

Options

--clip_backbone : str

Help: Backbone architecture for CLIP

  • Default: ViT-L/14

  • Choices: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px

--save_predictions : 0|1|True|False -> bool

Help: Whether to save predictions of the TRAINING set after each task

  • Default: 0

--use_templates : 0|1|True|False -> bool

Help: Whether to use prompt templates for CLIP. NOTE: Datasets NEED to have a get_prompt_templates method implemented.

  • Default: 0
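For example, these options might be combined on the command line as in the sketch below. This assumes the standard utils/main.py entry point and that this CLIP baseline is registered under the model name clip; the dataset name is only a placeholder and may differ in your setup:

    python utils/main.py --model clip --dataset seq-cifar100 \
        --clip_backbone ViT-B/16 --use_templates 1 --save_predictions 0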

Functions#

models.moe_adapters_utils.clip.available_models()[source]#

Returns the names of available CLIP models

Return type:

List[str]
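A minimal sketch of querying the available backbones, assuming the module is importable as models.moe_adapters_utils.clip from the repository root:

    from models.moe_adapters_utils import clip

    # Lists the backbone names accepted by clip.load() and by --clip_backbone
    print(clip.available_models())
    # e.g. ['RN50', 'RN101', ..., 'ViT-L/14', 'ViT-L/14@336px']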

models.moe_adapters_utils.clip.load(name, device='cpu', jit=True, is_train=False, pretrained=True)[source]#

Load a CLIP model.

Parameters:

  • name (str) – A model name listed by clip.available_models(), or the path to a model checkpoint containing the state_dict

  • device (Union[str, torch.device]) – The device on which to put the loaded model

  • jit (bool) – Whether to load the optimized JIT model (default) or the more hackable non-JIT model

Returns:

  • model (torch.nn.Module) – The CLIP model

  • preprocess (Callable[[PIL.Image], torch.Tensor]) – A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
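A minimal usage sketch, assuming the vendored module mirrors the upstream OpenAI CLIP interface (so the returned model exposes encode_image) and using a placeholder image path:

    import torch
    from PIL import Image
    from models.moe_adapters_utils import clip

    # Load the non-JIT model so it remains easy to inspect and modify
    model, preprocess = clip.load("ViT-L/14", device="cpu", jit=False)

    # preprocess turns a PIL image into the tensor the model expects
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # "photo.jpg" is a placeholder

    with torch.no_grad():
        image_features = model.encode_image(image)  # shape: [1, embedding_dim]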

models.moe_adapters_utils.clip.tokenize(texts, context_length=77)[source]#

Returns the tokenized representation of the given input string(s).

Parameters:

  • texts (Union[str, List[str]]) – An input string or a list of input strings to tokenize

  • context_length (int) – The context length to use; all CLIP models use 77 as the context length

Returns:

A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
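A short sketch of tokenizing a batch of prompts; the resulting shape follows from the documented return value (same import assumption as above):

    from models.moe_adapters_utils import clip

    tokens = clip.tokenize(["a photo of a cat", "a photo of a dog"])
    print(tokens.shape)  # torch.Size([2, 77]) -> [number of input strings, context_length]

These tokens can then be passed to model.encode_text() on the model returned by clip.load(), again assuming the upstream CLIP interface.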