
Train Vision Transformer model and run Inference



Please follow and star my GitHub repository: https://github.com/xinyuwei-david/david-share.git. It contains a lot of useful code!

Current CV models are mainly based on convolutional neural networks. However, with the advent of Transformers, Vision Transformers are gradually being applied.

Next, let’s look at the mainstream CV implementations and their features.

CV Architecture

U-Net

  • Characteristics: Encoder-decoder architecture with skip connections (see the sketch below).
  • Network type: Convolutional neural network (CNN).
  • Applications: Image segmentation, medical image processing.
  • Advantages: Efficient for segmentation tasks; preserves fine detail.
  • Disadvantages: Limited scalability on very large datasets.
  • Usage: Widely used in medical image segmentation.
  • Main models: Original U-Net, 3D U-Net, the U-Net backbone in Stable Diffusion.
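The defining trait here is the skip connection, which concatenates encoder features with decoder features at the same spatial resolution. Below is a minimal PyTorch sketch of that idea; layer sizes and the single down/up stage are arbitrary choices for illustration, not a full U-Net.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style block: one downsampling stage, one upsampling stage,
    with a skip connection concatenating encoder and decoder features."""
    def __init__(self, in_ch=1, base_ch=16, out_ch=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(base_ch, base_ch * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        # The decoder sees base_ch (upsampled) + base_ch (skip) channels
        self.dec = nn.Sequential(nn.Conv2d(base_ch * 2, base_ch, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base_ch, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                          # encoder features kept for the skip
        b = self.bottleneck(self.down(e))
        d = self.up(b)
        d = self.dec(torch.cat([d, e], dim=1))   # skip connection: concatenate along channels
        return self.head(d)                      # per-pixel prediction (e.g. segmentation logits)

# Example: a 1-channel 64x64 image -> a 1-channel map of the same size
print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])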

R-CNN

  • Characteristics: Selective search to generate candidate regions (region proposals).
  • Network type: CNN based.
  • Applications: Object detection.
  • Advantages: High detection accuracy.
  • Disadvantages: High computational cost and slow inference.
  • Usage: Largely replaced by faster successors such as Faster R-CNN.
  • Main models: Fast R-CNN, Faster R-CNN.

GAN

  • Characteristics: Adversarial training between a generator and a discriminator (see the sketch below).
  • Network type: A framework that typically uses CNNs.
  • Applications: Image generation, style transfer.
  • Advantages: Produces high-quality images.
  • Disadvantages: Unstable training, prone to mode collapse.
  • Usage: Widely used for generative tasks.
  • Main models: DCGAN, StyleGAN.
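The adversarial setup is easiest to see in the training loop: the discriminator learns to separate real from generated samples, while the generator learns to fool it. The sketch below is a deliberately tiny, hypothetical example on flat vectors rather than images; the sizes and random "real" batch are stand-ins for illustration only.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))   # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))    # discriminator (outputs a logit)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)   # stand-in for one batch of real data

# Discriminator step: push real samples toward label 1, generated samples toward label 0
fake = G(torch.randn(32, latent_dim)).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator output 1 on generated samples
fake = G(torch.randn(32, latent_dim))
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()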

RNN/LSTM

  • Characteristics: Processes sequential data and remembers long-term dependencies (see the sketch below).
  • Network type: Recurrent neural network.
  • Applications: Time-series forecasting, video analysis.
  • Advantages: Well suited to sequential data.
  • Disadvantages: Difficult to train; suffers from vanishing gradients.
  • Usage: Commonly used for sequential tasks.
  • Main models: LSTM, GRU.
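The recurrent state is what carries information across time steps. A minimal sketch using PyTorch's built-in LSTM for one-step-ahead forecasting; the feature and hidden sizes are arbitrary and chosen only for illustration.

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                  # x: (batch, time, features)
        out, _ = self.lstm(x)              # out: (batch, time, hidden)
        return self.head(out[:, -1])       # predict the next value from the last hidden state

x = torch.randn(8, 20, 1)                  # 8 sequences, 20 time steps, 1 feature
print(LSTMForecaster()(x).shape)           # torch.Size([8, 1])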

GNN

  • Characteristics: Processes graph-structured data (see the sketch below).
  • Network type: Graph neural network.
  • Applications: Social network analysis, molecular modeling in chemistry.
  • Advantages: Captures graph-structure information.
  • Disadvantages: Limited scalability on large graphs.
  • Usage: Used for working with graph data.
  • Main models: GCN, GraphSAGE.
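The core idea of a GCN layer is to update each node's features with a normalized aggregation of its neighbors' features. Below is a minimal sketch of one such layer written directly in PyTorch (no graph library required); the toy adjacency matrix and feature sizes are made up for illustration.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))             # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(norm @ self.linear(x))         # transform, then aggregate neighbors

# 4 nodes with 8-dim features on a small ring graph
adj = torch.tensor([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=torch.float)
x = torch.randn(4, 8)
print(GCNLayer(8, 16)(x, adj).shape)                     # torch.Size([4, 16])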

Capsule Network

  • Characteristics: Capsule structure captures spatial hierarchies.
  • Network type: CNN based.
  • Applications: Image recognition.
  • Advantages: Captures pose variations.
  • Disadvantages: High computational complexity.
  • Usage: Still at the research stage; not widely deployed.
  • Main models: CapsNet with dynamic routing.

Autoencoder

  • Characteristics: Encoder-decoder architecture (see the sketch below).
  • Network type: Can be CNN based.
  • Applications: Dimensionality reduction, feature learning.
  • Advantages: Unsupervised learning.
  • Disadvantages: Limited generation quality.
  • Usage: Used for feature extraction and dimensionality reduction.
  • Main models: Variational Autoencoder (VAE).
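The encoder compresses the input to a low-dimensional code and the decoder reconstructs it; training minimizes reconstruction error with no labels. A minimal sketch of a plain (non-variational) autoencoder on flattened inputs; all sizes are arbitrary and chosen only for illustration.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.encoder(x)                 # low-dimensional representation (feature learning)
        return self.decoder(code), code

model = AutoEncoder()
x = torch.randn(16, 784)
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)        # unsupervised objective: reconstruct the input itself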

Vision Transformer (ViT)

  • Characteristics: Processes image patches with a self-attention mechanism.
  • Network type: Transformer.
  • Applications: Image classification.
  • Advantages: Captures global information.
  • Disadvantages: Requires large amounts of training data.
  • Usage: Gaining popularity, especially on large datasets.
  • Main models: Original ViT, DeiT.

ViT and U-Net

According to the paper “Understanding the Efficacy of U-Net and Vision Transformer for Groundwater Numerical Modeling”, U-Net is generally more efficient than ViT, especially in sparse data scenarios. The architecture of U-Net is simpler with fewer parameters, making it more efficient in terms of computational resources and time. ViT has the advantage of capturing global information, but its self-attention mechanism has high computational complexity, especially when dealing with large-scale data.

In the experiments reported there, a model combining U-Net and ViT outperforms the Fourier Neural Operator (FNO) in both accuracy and efficiency, especially under sparse data conditions.

In image processing, sparse data generally refers to information that is incomplete or unevenly distributed in an image. For example:

  • Low-resolution images: fewer pixels and missing detail.
  • Occlusion or missing data: part of the image is blocked or data is absent.
  • Non-uniform sampling: low pixel density in certain areas.

In such cases, the model must infer the full image content from limited pixel information.


After the introduction of the Vision Transformer, new research lines and variants emerged.

  • DeiT (Data-efficient Image Transformers) by Facebook AI: DeiT models are distilled Vision Transformers. The authors also released more efficiently trained ViT models that can be plugged directly into ViTModel or ViTForImageClassification. Four variants are available (in three different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Images must be prepared with DeiTImageProcessor.
  • BEiT (BERT Pre-training of Image Transformers) by Microsoft Research: BEiT models outperform supervised pre-trained Vision Transformers by using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
  • DINO (a self-supervised training method for Vision Transformers) by Facebook AI: Vision Transformers trained with the DINO method exhibit an interesting property not found in convolutional models: they can segment objects without ever being explicitly trained to do so. DINO checkpoints can be found on the Hugging Face Hub.
  • MAE (Masked Autoencoder) by Facebook AI: by pre-training Vision Transformers (with an asymmetric encoder-decoder architecture) to reconstruct the pixel values of a large proportion (75%) of masked patches, the authors show that this simple method outperforms supervised pre-training after fine-tuning.
The workflow of the Vision Transformer (ViT) can be summarized as follows:

  1. Image patching: The input image is divided into small, fixed-size patches.
  2. Linear projection: Each image patch is flattened and converted to a vector via a linear projection.
  3. Position embedding: Position embeddings are added to each patch embedding to retain spatial information.
  4. CLS token: A learnable CLS token is prepended to the sequence for classification tasks.
  5. Transformer encoder: The embedded vectors (including the CLS token) pass through multiple Transformer encoder layers, each containing multi-head self-attention and a feed-forward network.
  6. MLP head: After the encoder, the output of the CLS token is passed to a multilayer perceptron (MLP) head for the final classification decision.

This process shows how the Transformer architecture performs image classification by directly processing a sequence of image patches.
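To make the patching step concrete: with the CIFAR-10 configuration used in the training script below (32x32 images, 4x4 patches), an image becomes 8 x 8 = 64 patch tokens, each flattened to 4 x 4 x 3 = 48 values. A quick shape check with einops:

import torch
from einops import rearrange

img = torch.randn(1, 3, 32, 32)   # one CIFAR-10-sized RGB image
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
print(patches.shape)              # torch.Size([1, 64, 48])
# A linear layer then projects each 48-dim patch to the model dimension,
# a CLS token is prepended, and position embeddings are added (see the code below).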

Training ViT

Pure ViT is mainly used for image classification. The script below trains a small ViT from scratch on CIFAR-10.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# Multi-head self-attention
class Attention(nn.Module):  
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):  
        super().__init__()  
        inner_dim = dim_head * heads  
        project_out = not (heads == 1 and dim_head == dim)  
        self.heads = heads  
        self.scale = dim_head ** -0.5  
        self.norm = nn.LayerNorm(dim)  
        self.attend = nn.Softmax(dim=-1)  
        self.dropout = nn.Dropout(dropout)  
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)  
        self.to_out = nn.Sequential(  
            nn.Linear(inner_dim, dim),  
            nn.Dropout(dropout)  
        ) if project_out else nn.Identity()  
  
    def forward(self, x):  
        x = self.norm(x)  
        qkv = self.to_qkv(x).chunk(3, dim=-1)  
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=self.heads), qkv)  
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale  
        attn = self.attend(dots)  
        attn = self.dropout(attn)  
        out = torch.matmul(attn, v)  
        out = rearrange(out, 'b h n d -> b n (h d)')  
        return self.to_out(out)  
  
# Feed-forward network (FFN)  
class FFN(nn.Module):  
    def __init__(self, dim, hidden_dim, dropout=0.):  
        super().__init__()  
        self.net = nn.Sequential(  
            nn.LayerNorm(dim),  
            nn.Linear(dim, hidden_dim),  
            nn.GELU(),  
            nn.Dropout(dropout),  
            nn.Linear(hidden_dim, dim),  
            nn.Dropout(dropout)  
        )  
  
    def forward(self, x):  
        return self.net(x)  
  
# Transformer encoder  
class Transformer(nn.Module):  
    def __init__(self, dim, depth, heads, dim_head, mlp_dim_ratio, dropout):  
        super().__init__()  
        self.layers = nn.ModuleList([])  
        mlp_dim = mlp_dim_ratio * dim  
        for _ in range(depth):  
            self.layers.append(nn.ModuleList([  
                Attention(dim=dim, heads=heads, dim_head=dim_head, dropout=dropout),  
                FFN(dim=dim, hidden_dim=mlp_dim, dropout=dropout)  
            ]))  
  
    def forward(self, x):  
        for attn, ffn in self.layers:  
            x = attn(x) + x  
            x = ffn(x) + x  
        return x  
  
# Vision Transformer (ViT)  
class ViT(nn.Module):  
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim_ratio, pool="cls", channels=3, dim_head=64, dropout=0.):  
        super().__init__()  
        image_height, image_width = pair(image_size)  
        patch_height, patch_width = pair(patch_size)  
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'  
        num_patches = (image_height // patch_height) * (image_width // patch_width)  
        patch_dim = channels * patch_height * patch_width  
  
        self.to_patch_embedding = nn.Sequential(  
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_height, p2=patch_width),  
            nn.LayerNorm(patch_dim),  
            nn.Linear(patch_dim, dim),  
            nn.LayerNorm(dim)  
        )  
  
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))  
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  
        self.dropout = nn.Dropout(dropout)  
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim_ratio, dropout)  
        self.pool = pool  
        self.to_latent = nn.Identity()  
        self.mlp_head = nn.Linear(dim, num_classes)  
  
    def forward(self, img):  
        x = self.to_patch_embedding(img)  
        b, n, _ = x.shape  
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b=b)  
        x = torch.cat((cls_tokens, x), dim=1)  
        x += self.pos_embedding[:, :(n + 1)]  
        x = self.dropout(x)  
        x = self.transformer(x)  
        cls_token = x[:, 0]  
        feature_map = x[:, 1:]  
        pooled_output = cls_token if self.pool == 'cls' else feature_map.mean(dim=1)  
        pooled_output = self.to_latent(pooled_output)  
        classification_result = self.mlp_head(pooled_output)  
        return classification_result  
  
# Helper: turn an int into an (h, w) pair  
def pair(t):  
    return t if isinstance(t, tuple) else (t, t)  
  
# Data preprocessing  
transform = transforms.Compose([  
    transforms.Resize((32, 32)),  
    transforms.ToTensor(),  
    transforms.Normalize((0.5,), (0.5,))  
])  
  
# Load the CIFAR-10 training set  
train_dataset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)  
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  
  
# Initialize the ViT model  
model = ViT(  
    image_size=32,  
    patch_size=4,  
    num_classes=10,  
    dim=128,  
    depth=6,  
    heads=8,  
    mlp_dim_ratio=4,  
    dropout=0.1  
)  
  
# Loss function and optimizer  
criterion = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters(), lr=3e-4)  
  
# Train the model  
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
model.to(device)  
  
for epoch in range(10):  # train for 10 epochs  
    model.train()  
    total_loss = 0  
    for images, labels in train_loader:  
        images, labels = images.to(device), labels.to(device)  
        optimizer.zero_grad()  
        outputs = model(images)  
        loss = criterion(outputs, labels)  
        loss.backward()  
        optimizer.step()  
        total_loss += loss.item()  
  
    print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}')  
  
# Save the entire model  
torch.save(model, 'vit_complete_model.pth')  
print("Training finished; model saved!") 

Training Results:

Files already downloaded and verified
Epoch 1, Loss: 1.5606277365513774
Epoch 2, Loss: 1.2305729564498453
Epoch 3, Loss: 1.0941925532067829
Epoch 4, Loss: 1.0005672584714183
Epoch 5, Loss: 0.9230595080139082
Epoch 6, Loss: 0.8589703797379418
Epoch 7, Loss: 0.7988450761188937
Epoch 8, Loss: 0.7343863746546724
Epoch 9, Loss: 0.6837297593388716
Epoch 10, Loss: 0.6306750321632151
Training finished; model saved!

Inference Test:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

# Data preprocessing  
transform = transforms.Compose([  
    transforms.Resize((32, 32)),  
    transforms.ToTensor(),  
    transforms.Normalize((0.5,), (0.5,))  
])  
  
# Load the CIFAR-10 test set  
test_dataset = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)  
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)  
  
# Load the saved model (the ViT class definition from the training script must be importable here)  
model = torch.load('vit_complete_model.pth')  
model.eval()  
  
# Device setup  
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
model.to(device)  
  
# Run inference  
with torch.no_grad():  
    for images, labels in test_loader:  
        images, labels = images.to(device), labels.to(device)  
        outputs = model(images)  
        _, predicted = torch.max(outputs, 1)  
  
        # Show predictions and images for the first 5 samples  
        for i in range(5):  
            image = images[i].cpu().numpy().transpose((1, 2, 0))  
            image = (image * 0.5) + 0.5  # un-normalize  
            plt.imshow(image)  
            plt.title(f'Predicted: {test_dataset.classes[predicted[i]]}, Actual: {test_dataset.classes[labels[i]]}')  
            plt.show()  
  
        break  # only show one batch 
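
The loop above only visualizes a handful of predictions. A quick way to quantify the trained model is a top-1 accuracy pass over the whole test set; the small add-on sketch below reuses the test_loader, model, and device already defined in this script:

import torch

# Add-on: overall top-1 accuracy on CIFAR-10, reusing test_loader, model and device from above
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        predicted = model(images).argmax(dim=1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
print(f'Test accuracy: {100.0 * correct / total:.2f}%')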

Inference Results:

(Images: predicted vs. actual class labels for the first few CIFAR-10 test samples.)

Florence-2

Microsoft’s Florence-2 uses a Transformer-based architecture; in particular, it adopts DeiT (Data-efficient Image Transformer) as its visual encoder. DeiT’s architecture is the same as ViT’s, with the addition of a distillation token to the input tokens. Distillation is a way to improve training performance, which matters because ViT’s performance degrades when training data is insufficient.

Phi-3 Vision, for comparison, is also based on ViT (specifically ViT-L).

The Florence-2 model architecture uses a sequence-to-sequence learning approach. That is, the model incrementally processes an input sequence (e.g., an image with a text prompt) and produces a sequence of outputs (e.g., a description or label). In the sequence-to-sequence framework, each task is treated as a translation problem. The model receives an input image and a specific task prompt, and then produces the corresponding output.
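
As a concrete illustration of this prompt-as-task pattern, the sketch below follows the usage published on the Hugging Face model card for microsoft/Florence-2-base. The model ID, the "<CAPTION>" task prompt, and the image path are assumptions for illustration, not part of this article's experiments.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"   # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
                                             trust_remote_code=True).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task_prompt = "<CAPTION>"                # the task is selected purely by the text prompt
image = Image.open("/root/image0.jpg")   # assumed local image path

inputs = processor(text=task_prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(generated_text, task=task_prompt,
                                            image_size=(image.width, image.height))
print(result)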


For more information about Florence-2, see my repository.

https://github.com/xinyuwei-david/david-share/tree/master/Multimodal-Models/Florence-2-Inference-and…

Qwen2-VL

Qwen2-VL adopts an encoder-decoder design that combines a Vision Transformer (ViT) with the Qwen2 language model. This architecture allows Qwen2-VL to process image and video inputs and support multimodal tasks.


Qwen2-VL also introduces a novel Multimodal Rotary Position Embedding (M-RoPE). The position embedding is decomposed into components that capture 1D text, 2D visual, and 3D video position information, enhancing the model’s ability to process multimodal data.
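
Conceptually, M-RoPE gives every token three position components (temporal, height, width) instead of a single index: for text tokens all three components are just the running token index, while for image patches the temporal component stays constant and the height/width components follow the patch grid. The toy sketch below only illustrates how such position IDs could be laid out; it is not Qwen2-VL's actual implementation.

import torch

def toy_mrope_position_ids(num_text_tokens, grid_h, grid_w):
    # Text tokens: temporal = height = width = running index
    idx = torch.arange(num_text_tokens)
    text_pos = torch.stack([idx, idx, idx])                      # shape (3, num_text_tokens)

    # Image patch tokens (a single frame): constant temporal component,
    # height/width follow the patch's row/column in the grid
    rows = torch.arange(grid_h).repeat_interleave(grid_w)
    cols = torch.arange(grid_w).repeat(grid_h)
    t = torch.full((grid_h * grid_w,), num_text_tokens)
    image_pos = torch.stack([t, num_text_tokens + rows, num_text_tokens + cols])

    return torch.cat([text_pos, image_pos], dim=1)               # shape (3, total_tokens)

# 4 text tokens followed by a 2x3 grid of image patches
print(toy_mrope_position_ids(4, 2, 3))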

Training of Qwen2-VL

Pre-training phase:

  • Purpose: Optimize the visual encoder and the adapter while keeping the language model (LLM) frozen.
  • Dataset: A large, curated dataset of image-text pairs, which is essential for the model to learn the relationship between visual elements and text.
  • Optimization goal: Minimize the cross-entropy of the text tokens, improving the model’s ability to generate accurate text descriptions for images (see the sketch after this list).

Multi-task pre-training phase:

  • Full model training: At this stage the entire model, including the LLM, is trained.
  • Task types: The model is trained on a variety of vision-language tasks, such as image captioning and visual question answering.
  • Data quality: Higher-quality, more detailed data provides richer visual and linguistic information.
  • Input resolution: Increasing the visual encoder’s input resolution reduces information loss and helps the model capture image details.

Instruction fine-tuning phase:

  • Purpose: Improve the model’s conversational and instruction-following abilities.
  • Frozen visual encoder: The visual encoder remains frozen; optimization focuses on the LLM and the adapter.
  • Data type: A mix of multimodal conversation data and pure text conversation data, which helps the model understand and generate natural language when handling multimodal input.
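
The "minimize the cross-entropy of the text tokens" objective mentioned above is commonly implemented by masking out the non-text positions in the label tensor so that only text tokens contribute to the loss. A generic, hypothetical sketch (not Qwen2-VL's actual training code):

import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 6
logits = torch.randn(1, seq_len, vocab_size)                  # model outputs for one sequence

# Suppose positions 0-2 hold visual tokens and positions 3-5 hold text tokens.
# The label value -100 is ignored by the loss, so only text tokens are optimized.
labels = torch.tensor([[-100, -100, -100, 101, 2057, 1045]])

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss)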

Qwen2-VL Inference

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/root/image0.jpg",
            },
            {"type": "text", "text": "How many dogs do you see? What are they doing? Reply in Chinese."},
        ],
    }
]
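
The snippet above only defines the chat messages. The remaining steps are the same as in the video example shown further below; here is a sketch of completing the image inference, assuming model, processor, and process_vision_info have been set up exactly as in that video example.

# Assumes model, processor and process_vision_info are initialized as in the video example below.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True,
                             clean_up_tokenization_spaces=False))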

(Input image /root/image0.jpg: two dogs outdoors.)

[‘在这张图片中,我看到两只狗。左边的狗看起来像是柯基犬,而右边的狗看起来像是约克夏梗犬。它们似乎在户外的环境中奔跑,可能是散步或玩耍。‘]

The English translation is as follows:

[‘In this picture, I see two dogs. The dog on the left looks like a Corgi, while the dog on the right appears to be a Yorkshire Terrier. They seem to be running outdoors, possibly taking a walk or playing. ’]

The model also supports video analysis, which it performs by splitting the video into frames; it does not analyze the audio track.

import torch  
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor  
from qwen_vl_utils import process_vision_info  
  
model_name = "Qwen/Qwen2-VL-2B-Instruct"  
model = Qwen2VLForConditionalGeneration.from_pretrained(  
    model_name,   
    torch_dtype=torch.bfloat16,   
    attn_implementation="flash_attention_2",   
    device_map="auto"  
)  
processor = AutoProcessor.from_pretrained(model_name)  
  
messages = [  
    {  
        "role": "user",  
        "content": [  
            {  
                "type": "video",  
                "video": "/root/cars.mp4",  
                "max_pixels": 360 * 420,  
                "fps": 1.0,  # 确保 fps 正确传递  
                "video_fps": 1.0,  # 添加 video_fps  
            },  
            {"type": "text", "text": "Describe this video in Chinese."},  
        ],  
    }  
]  
  
text = processor.apply_chat_template(  
    messages, tokenize=False, add_generation_prompt=True  
)  
  
image_inputs, video_inputs = process_vision_info(messages)  
  
inputs = processor(  
    text=[text],  
    images=image_inputs,  
    videos=video_inputs,  
    padding=True,  
    return_tensors="pt",  
)  
  
inputs = inputs.to("cuda")  
  
generated_ids = model.generate(**inputs, max_new_tokens=256)  
generated_ids_trimmed = [  
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)  
]  
  
output_text = processor.batch_decode(  
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False  
)  
  
print(output_text)  

(A frame from the input video /root/cars.mp4: a busy street with heavy traffic.)

[‘视频中展示了一条繁忙的街道,车辆密集,交通堵塞。街道两旁是高楼大厦,天空阴沉,可能是傍晚或清晨。‘]

The English translation is as follows:

[‘The video shows a busy street with heavy traffic and congestion. Tall buildings line both sides of the street, and the sky is overcast, suggesting it might be either dusk or dawn.’]




