Adopting multiple vision transformer layers for fine-grained image representation

Accurate discriminative region proposal is important for fine-grained image recognition. The vision transformer (ViT) has brought striking gains to computer vision due to its innate multi-head self-attention mechanism. However, its attention maps grow increasingly similar after a certain number of layers, and because ViT relies on a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently deepens the network to learn diverse features and utilizes the integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to address the attention-collapse issue. We then fuse the attention weights of all heads within each layer to produce an attention weight map. After that, we alternately apply recurrent residual refinement blocks to promote salient feature detection, and finally use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves state-of-the-art performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs, and CUB-200-2011.
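The attention-fusion step described in the abstract (fusing the heads within each layer, then integrating the raw attention maps across layers to rank image patches) can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration, not the authors' released AMTrans implementation: it averages the heads within each layer, chains the layer maps multiplicatively in the spirit of attention rollout, and reads the classification-token row to select salient patches. The function name fuse_attention_maps and the choice of mean-over-heads and matrix-product integration are assumptions made for illustration.

    import torch

    def fuse_attention_maps(attn_per_layer, top_k=12):
        """Fuse multi-head attention weights into per-patch saliency scores.

        attn_per_layer: list of tensors, one per transformer layer, each of
            shape (batch, heads, tokens, tokens), where token 0 is the
            classification (CLS) token and the rest are image patches.
        Returns the indices of the top_k most salient patches per image.
        NOTE: illustrative sketch only; the paper's exact fusion operator
        may differ.
        """
        fused_layers = []
        for attn in attn_per_layer:
            # Fuse the heads within each layer by averaging their
            # attention weights (one simple choice of fusion).
            fused_layers.append(attn.mean(dim=1))   # (batch, tokens, tokens)

        # Integrate the raw attention maps across layers by chaining the
        # per-layer maps with a batched matrix product (attention rollout).
        integrated = fused_layers[0]
        for layer_map in fused_layers[1:]:
            integrated = torch.bmm(layer_map, integrated)

        # The CLS token's row scores how strongly each patch feeds the
        # final prediction; drop the CLS column itself before ranking.
        cls_to_patches = integrated[:, 0, 1:]       # (batch, num_patches)
        return cls_to_patches.topk(top_k, dim=-1).indices

    # Example with random attention maps: 12 layers, 197 tokens
    # (1 CLS token + 196 patches for a 14x14 patch grid).
    attn = [torch.rand(2, 12, 197, 197).softmax(-1) for _ in range(12)]
    idx = fuse_attention_maps(attn)
    print(idx.shape)  # torch.Size([2, 12])

In a full pipeline, the selected patch indices would feed the subsequent refinement and semantic-grouping stages the abstract mentions; those stages are omitted here.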

Bibliographic Details
Main Authors: Sun, Fayou, Ngo, Hea Choon, Yu, Yelan, Xiao, Zhengyu, Meng, Zuqiang
Format: Conference or Workshop Item
Language: English
Published: 2023
Online Access:http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf
http://eprints.utem.edu.my/id/eprint/27940/
https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot=
id my.utem.eprints.27940
record_format eprints
spelling my.utem.eprints.27940 2024-10-09T17:10:08Z http://eprints.utem.edu.my/id/eprint/27940/ Adopting multiple vision transformer layers for fine-grained image representation Sun, Fayou Ngo, Hea Choon Yu, Yelan Xiao, Zhengyu Meng, Zuqiang Accurate discriminative region proposal is important for fine-grained image recognition. The vision transformer (ViT) has brought striking gains to computer vision due to its innate multi-head self-attention mechanism. However, its attention maps grow increasingly similar after a certain number of layers, and because ViT relies on a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently deepens the network to learn diverse features and utilizes the integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to address the attention-collapse issue. We then fuse the attention weights of all heads within each layer to produce an attention weight map. After that, we alternately apply recurrent residual refinement blocks to promote salient feature detection, and finally use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves state-of-the-art performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs, and CUB-200-2011. 2023 Conference or Workshop Item PeerReviewed text en http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf Sun, Fayou and Ngo, Hea Choon and Yu, Yelan and Xiao, Zhengyu and Meng, Zuqiang (2023) Adopting multiple vision transformer layers for fine-grained image representation. In: 1st International Conference on Computer Technology and Information Science, CTIS 2023, 17 June 2023 through 18 June 2023, Virtual, Online. https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot=
institution Universiti Teknikal Malaysia Melaka
building UTEM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknikal Malaysia Melaka
content_source UTEM Institutional Repository
url_provider http://eprints.utem.edu.my/
language English
description Accurate discriminative region proposal is important for fine-grained image recognition. The vision transformer (ViT) has brought striking gains to computer vision due to its innate multi-head self-attention mechanism. However, its attention maps grow increasingly similar after a certain number of layers, and because ViT relies on a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently deepens the network to learn diverse features and utilizes the integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to address the attention-collapse issue. We then fuse the attention weights of all heads within each layer to produce an attention weight map. After that, we alternately apply recurrent residual refinement blocks to promote salient feature detection, and finally use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves state-of-the-art performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs, and CUB-200-2011.
format Conference or Workshop Item
author Sun, Fayou
Ngo, Hea Choon
Yu, Yelan
Xiao, Zhengyu
Meng, Zuqiang
spellingShingle Sun, Fayou
Ngo, Hea Choon
Yu, Yelan
Xiao, Zhengyu
Meng, Zuqiang
Adopting multiple vision transformer layers for fine-grained image representation
author_facet Sun, Fayou
Ngo, Hea Choon
Yu, Yelan
Xiao, Zhengyu
Meng, Zuqiang
author_sort Sun, Fayou
title Adopting multiple vision transformer layers for fine-grained image representation
title_short Adopting multiple vision transformer layers for fine-grained image representation
title_full Adopting multiple vision transformer layers for fine-grained image representation
title_fullStr Adopting multiple vision transformer layers for fine-grained image representation
title_full_unstemmed Adopting multiple vision transformer layers for fine-grained image representation
title_sort adopting multiple vision transformer layers for fine-grained image representation
publishDate 2023
url http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf
http://eprints.utem.edu.my/id/eprint/27940/
https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot=