Adopting multiple vision transformer layers for fine-grained image representation
Accurate discriminative region proposal has an important effect on fine-grained image recognition. The vision transformer (ViT) has achieved striking results in computer vision due to its innate multi-head self-attention mechanism. However, the attention maps become increasingly similar after certain la...
Saved in:
Main Authors: | Sun, Fayou, Ngo, Hea Choon, Yu, Yelan, Xiao, Zhengyu, Meng, Zuqiang |
---|---|
Format: | Conference or Workshop Item |
Language: | English |
Published: |
2023
|
Online Access: | http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf http://eprints.utem.edu.my/id/eprint/27940/ https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot= |
id |
my.utem.eprints.27940 |
---|---|
record_format |
eprints |
spelling |
my.utem.eprints.279402024-10-09T17:10:08Z http://eprints.utem.edu.my/id/eprint/27940/ Adopting multiple vision transformer layers for fine-grained image representation Sun, Fayou Ngo, Hea Choon Yu, Yelan Xiao, Zhengyu Meng, Zuqiang Accurate discriminative region proposal has an important effect on fine-grained image recognition. The vision transformer (ViT) has achieved striking results in computer vision due to its innate multi-head self-attention mechanism. However, the attention maps become increasingly similar after certain layers, and since ViT adds a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently increases layers to learn diverse features and utilizes integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to solve the attention-collapse issue. Then, we fuse each head's attention weights within each layer to produce an attention weight map. After that, we alternately use recurrent residual refinement blocks to promote salient-feature detection and then utilize a semantic grouping method to propose the discriminative feature region. Extensive experiments demonstrate that AMTrans achieves state-of-the-art performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs and CUB-200-2011. 2023 Conference or Workshop Item PeerReviewed text en http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf Sun, Fayou and Ngo, Hea Choon and Yu, Yelan and Xiao, Zhengyu and Meng, Zuqiang (2023) Adopting multiple vision transformer layers for fine-grained image representation. 
In: 1st International Conference on Computer Technology and Information Science, CTIS 2023, 17 June 2023 through 18 June 2023, Virtual, Online. https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot= |
institution |
Universiti Teknikal Malaysia Melaka |
building |
UTEM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknikal Malaysia Melaka |
content_source |
UTEM Institutional Repository |
url_provider |
http://eprints.utem.edu.my/ |
language |
English |
description |
Accurate discriminative region proposal has an important effect on fine-grained image recognition. The vision transformer (ViT) has achieved striking results in computer vision due to its innate multi-head self-attention mechanism. However, the attention maps become increasingly similar after certain layers, and since ViT adds a classification token to perform classification, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently increases layers to learn diverse features and utilizes integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to solve the attention-collapse issue. Then, we fuse each head's attention weights within each layer to produce an attention weight map. After that, we alternately use recurrent residual refinement blocks to promote salient-feature detection and then utilize a semantic grouping method to propose the discriminative feature region. Extensive experiments demonstrate that AMTrans achieves state-of-the-art performance under the same settings on three widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs and CUB-200-2011. |
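The abstract mentions fusing each head's attention weights within a layer into a single attention weight map, but this record does not specify the fusion rule. As a hedged sketch only — assuming a simple average over heads, which is one plausible reading and not necessarily the authors' method — the step could look like:

```python
import numpy as np

def fuse_head_attention(attn: np.ndarray) -> np.ndarray:
    """Fuse per-head attention maps into one map for a layer.

    attn: shape (num_heads, num_tokens, num_tokens); each row of each
          head is a softmax distribution over tokens.
    Returns a (num_tokens, num_tokens) fused map whose rows sum to 1.
    """
    fused = attn.mean(axis=0)                    # average across heads
    return fused / fused.sum(axis=-1, keepdims=True)  # guard numeric drift

# Toy example: 2 heads, 3 tokens, rows built with a softmax.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 3))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
fused = fuse_head_attention(attn)
print(fused.shape)  # (3, 3)
```

The hypothetical `fuse_head_attention` name and the head-averaging choice are illustrative assumptions; the paper itself should be consulted for the actual AMTrans fusion and the downstream refinement blocks.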
format |
Conference or Workshop Item |
author |
Sun, Fayou Ngo, Hea Choon Yu, Yelan Xiao, Zhengyu Meng, Zuqiang |
spellingShingle |
Sun, Fayou Ngo, Hea Choon Yu, Yelan Xiao, Zhengyu Meng, Zuqiang Adopting multiple vision transformer layers for fine-grained image representation |
author_facet |
Sun, Fayou Ngo, Hea Choon Yu, Yelan Xiao, Zhengyu Meng, Zuqiang |
author_sort |
Sun, Fayou |
title |
Adopting multiple vision transformer layers for fine-grained image representation |
title_short |
Adopting multiple vision transformer layers for fine-grained image representation |
title_full |
Adopting multiple vision transformer layers for fine-grained image representation |
title_fullStr |
Adopting multiple vision transformer layers for fine-grained image representation |
title_full_unstemmed |
Adopting multiple vision transformer layers for fine-grained image representation |
title_sort |
adopting multiple vision transformer layers for fine-grained image representation |
publishDate |
2023 |
url |
http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf http://eprints.utem.edu.my/id/eprint/27940/ https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot= |
_version_ |
1814061436508307456 |