實物特徵: Adopting multiple vision transformer layers for fine-grained image representation

Adopting multiple vision transformer layers for fine-grained image representation

Accurate discriminative regions proposal has an important effect for fine-grained image recognition. The vision transformer (ViT) brings about a striking effect in computer vision duo to its innate muti-head self-attention mechanism. However, the attention maps are gradually similar after certain la...

全面介紹

Saved in:

書目詳細資料
Main Authors:	Sun, Fayou, Ngo, Hea Choon, Yu, Yelan, Xiao, Zhengyu, Meng, Zuqiang
格式:	Conference or Workshop Item
語言:	English
出版:	2023
在線閱讀:	http://eprints.utem.edu.my/id/eprint/27940/1/Adopting%20multiple%20vision%20transformer%20layers%20for%20fine-grained%20image%20representation.pdf http://eprints.utem.edu.my/id/eprint/27940/ https://www-scopus-com.uitm.idm.oclc.org/record/display.uri?eid=2-s2.0-85174249067&origin=resultslist&sort=plf-f&src=s&sid=b55ff89e31efefd098739d17ce669a37&sot=
標簽:	添加標簽沒有標簽, 成為第一個標記此記錄!

實物特徵
總結:	Accurate discriminative regions proposal has an important effect for fine-grained image recognition. The vision transformer (ViT) brings about a striking effect in computer vision duo to its innate muti-head self-attention mechanism. However, the attention maps are gradually similar after certain layers and since ViT adds classification token for perform classification, it is unable to effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network AMTrans, which efficiently increases layers to learn diverse features and utilizes integrated raw attention maps to capture more salient feature. Specifically, we employ DeepViT as backbone to solve the attention collapse issue. Then, we fuse each head attention weight within each layer to produce attention weight map. After that, we alternatively use recurrent residual refinement blocks to promote salient feature detection and then utilize semantic grouping method to propose the discriminative feature region. A lot of experiments prove that AMTrans acquires the SOTA performance on three widely used fine-grained datasets under the same settings, involving Stanford-Cars, Stanford-Dogs and CUB-200-2011.

Adopting multiple vision transformer layers for fine-grained image representation

相似書籍