VIT PROMPT#
Vision Transformer (ViT) in PyTorch
A PyTorch implementation of Vision Transformers as described in:
- ‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale’
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
The official JAX code is released and available at https://github.com/google-research/vision_transformer
Acknowledgments:
- The paper authors for releasing code and weights, thanks!
- I fixed my class token impl based on Phil Wang’s https://github.com/lucidrains/vit-pytorch … check it out for some einops/einsum fun
- Simple transformer style inspired by Andrej Karpathy’s https://github.com/karpathy/minGPT
- BERT reference code checks against Hugging Face Transformers and TensorFlow BERT
Hacked together by / Copyright 2020, Ross Wightman

Modification:
- Added code for L2P implementation
- Jaeho Lee, dlwogh9344@khu.ac.kr
Classes#
- class models.l2p_utils.vit_prompt.VisionTransformer(prompt_length=None, embedding_key='cls', prompt_init='uniform', prompt_pool=False, prompt_key=False, pool_size=None, top_k=None, batchwise_prompt=False, prompt_key_init='uniform', head_type='token', use_prompt_mask=False, prompt_shuffle=False, args=None, **kwargs)[source]#
Bases: VisionTransformer
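This class extends timm's VisionTransformer with the L2P prompt-pool arguments listed in the signature. A minimal construction sketch follows; the base ViT arguments (img_size, patch_size, embed_dim, depth, num_heads, num_classes) are assumed to be forwarded to the timm parent via **kwargs, and all values shown are illustrative only, not a recommended configuration.

```python
from models.l2p_utils.vit_prompt import VisionTransformer

# Illustrative sketch; base ViT kwargs are assumed to pass through **kwargs
# to timm's VisionTransformer.
model = VisionTransformer(
    img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
    num_classes=100,
    prompt_length=5,        # tokens contributed by each selected prompt
    embedding_key='cls',    # feature used as the query when matching prompt keys
    prompt_pool=True,       # enable the shared prompt pool
    prompt_key=True,        # learn a key per prompt for selection
    pool_size=10,           # number of prompts in the pool
    top_k=5,                # prompts selected per input
    batchwise_prompt=True,  # use one shared prompt selection for the whole batch
)
```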
Functions#
- models.l2p_utils.vit_prompt.checkpoint_filter_fn(state_dict, model, adapt_layer_scale=False)[source]#
Convert the patch embedding weight from a manual patchify + linear projection layout to the equivalent convolution layout.
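The core of that conversion is reshaping the Linear projection weight into a Conv2d-shaped kernel. The sketch below illustrates the idea with hypothetical tensor names and ViT-B/16 shapes; it is not the exact filter code.

```python
import torch

# Hypothetical ViT-B/16 shapes: embed_dim=768, 16x16 patches, 3 input channels.
embed_dim, patch, in_chans = 768, 16, 3

# Old checkpoints store the patch projection as a Linear weight ...
linear_w = torch.randn(embed_dim, patch * patch * in_chans)

# ... which maps onto a Conv2d(in_chans, embed_dim, kernel_size=patch, stride=patch)
# weight by reshaping to (out_channels, in_channels, kH, kW).
conv_w = linear_w.reshape(embed_dim, -1, patch, patch)
assert conv_w.shape == (embed_dim, in_chans, patch, patch)
```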
- models.l2p_utils.vit_prompt.resize_pos_embed(posemb, posemb_new, num_prefix_tokens=1, gs_new=())[source]#
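No docstring is given for this function; based on the signature, a hedged usage sketch for adapting a pretrained positional embedding to a new patch grid (keeping prefix tokens such as the class token separate) might look like the following. The shapes are assumptions for ViT-B/16-style models.

```python
import torch
from models.l2p_utils.vit_prompt import resize_pos_embed

# Assumed shapes: 14x14 patch grid + 1 class token (224px checkpoint),
# resized to a 24x24 grid + 1 class token (e.g. a 384px model).
posemb_ckpt = torch.randn(1, 1 + 14 * 14, 768)
posemb_model = torch.zeros(1, 1 + 24 * 24, 768)

posemb_resized = resize_pos_embed(posemb_ckpt, posemb_model, num_prefix_tokens=1)
assert posemb_resized.shape == posemb_model.shape
```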
- models.l2p_utils.vit_prompt.vit_base_patch16_224_l2p(pretrained=False, **kwargs)[source]#
ViT-Base (ViT-B/16) from the original paper (https://arxiv.org/abs/2010.11929). ImageNet-1k weights fine-tuned from ImageNet-21k at 224x224; source: https://github.com/google-research/vision_transformer.
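A hedged usage sketch for the factory: **kwargs (including the prompt-pool arguments listed for the class above) are assumed to be forwarded to the VisionTransformer constructor, and the values shown are illustrative.

```python
from models.l2p_utils.vit_prompt import vit_base_patch16_224_l2p

# Illustrative configuration; kwargs are assumed to be forwarded to VisionTransformer.
model = vit_base_patch16_224_l2p(
    pretrained=True,   # load the ImageNet-1k weights fine-tuned from ImageNet-21k
    num_classes=100,
    prompt_length=5,
    prompt_pool=True,
    prompt_key=True,
    pool_size=10,
    top_k=5,
)
```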