CLIP#
Arguments#
Options
- --clip_backbone str
Help: Backbone architecture for CLIP
Default: ViT-L/14
Choices: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
- --save_predictions 0|1|True|False -> bool
Help: Whether to save predictions of the TRAINING set after each task
Default: 0
- --use_templates 0|1|True|False -> bool
Help: Whether to use prompt templates for CLIP. NOTE: datasets NEED to have a get_prompt_templates method implemented.
Default: 0
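A minimal sketch of how these flags might be combined on the command line. The main.py entry point and the --model/--dataset values are assumptions used only for illustration; only the three CLIP flags above are documented on this page.

    python main.py --model moe_adapters --dataset seq-cifar100 \
        --clip_backbone ViT-B/16 --use_templates 1 --save_predictions 0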
Functions#
- models.moe_adapters_utils.clip.available_models()[source]#
Returns the names of available CLIP models
- models.moe_adapters_utils.clip.load(name, device='cpu', jit=True, is_train=False, pretrained=True)[source]#
Load a CLIP model.
- Parameters:
name (str) – A model name listed by clip.available_models(), or the path to a model checkpoint containing the state_dict
device (Union[str, torch.device]) – The device to put the loaded model on
jit (bool) – Whether to load the optimized JIT model (default) or the more hackable non-JIT model
- Returns:
model (torch.nn.Module) – The CLIP model
preprocess (Callable[[PIL.Image], torch.Tensor]) – A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
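A minimal usage sketch of available_models() and load(), assuming the module is importable as models.moe_adapters_utils.clip; the chosen backbone is just one of the documented choices.

    from models.moe_adapters_utils import clip

    # List the supported backbones, e.g. ['RN50', ..., 'ViT-L/14@336px']
    print(clip.available_models())

    # Load the non-JIT model for easier inspection; returns the model and its preprocessing transform
    model, preprocess = clip.load("ViT-B/16", device="cpu", jit=False)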
- models.moe_adapters_utils.clip.tokenize(texts, context_length=77)[source]#
Returns the tokenized representation of the given input string(s).
- Parameters:
texts (Union[str, List[str]]) – An input string or a list of input strings to tokenize
context_length (int) – The context length to use; all CLIP models use 77 as the context length
- Return type:
A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
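A short sketch combining tokenize() with a loaded model. Passing the tokens to encode_text() assumes the returned module exposes the standard CLIP text-encoder interface, which is not documented on this page.

    tokens = clip.tokenize(["a photo of a dog", "a photo of a cat"])
    # tokens has shape (2, 77): one row per input string, context_length columns

    text_features = model.encode_text(tokens)  # assumption: standard CLIP encode_text() method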