Loop and distillation: Attention weights fusion transformer for fine‐grained representation
Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field due to its multi-head self-attention mechanism. Unfortunately, ViT ca...
Saved in:
Main Authors: | Sun, Fayou, Ngo, Hea Choon, Zuqiang, Meng, Sek, Yong Wee |
---|---|
Format: | Article |
Language: | English |
Published: |
John Wiley & Sons Ltd
2023
|
Online Access: | http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf http://eprints.utem.edu.my/id/eprint/27758/ https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181 https://doi.org/10.1049/cvi2.12181 |
id |
my.utem.eprints.27758 |
---|---|
record_format |
eprints |
spelling |
my.utem.eprints.27758 2024-10-07T12:41:28Z http://eprints.utem.edu.my/id/eprint/27758/ Loop and distillation: Attention weights fusion transformer for fine‐grained representation Sun, Fayou Ngo, Hea Choon Zuqiang, Meng Sek, Yong Wee Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field due to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture critical feature regions for FGVC because it focuses only on the classification token and processes the image in a single pass. Moreover, the advantage of attention weights fusion is not exploited by ViT. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified as the next input to finally locate the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate the generalisation ability. Concurrently, RDTrans can be easily trained end-to-end in a weakly supervised manner. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017. John Wiley & Sons Ltd 2023-01 Article PeerReviewed text en http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf Sun, Fayou and Ngo, Hea Choon and Zuqiang, Meng and Sek, Yong Wee (2023) Loop and distillation: Attention weights fusion transformer for fine‐grained representation. IET Computer Vision, 17 (4). pp. 473-482. ISSN 1751-9632 https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181 https://doi.org/10.1049/cvi2.12181 |
institution |
Universiti Teknikal Malaysia Melaka |
building |
UTEM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknikal Malaysia Melaka |
content_source |
UTEM Institutional Repository |
url_provider |
http://eprints.utem.edu.my/ |
language |
English |
description |
Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field due to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture critical feature regions for FGVC because it focuses only on the classification token and processes the image in a single pass. Moreover, the advantage of attention weights fusion is not exploited by ViT. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified as the next input to finally locate the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate the generalisation ability. Concurrently, RDTrans can be easily trained end-to-end in a weakly supervised manner. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017. |
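The recurrent crop-and-amplify loop described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' RDTrans implementation: the single-peak attention map, the fixed crop fraction, and the nearest-neighbour upscaling are all simplifying assumptions made for the sketch.

```python
import numpy as np

def crop_and_amplify(image, attn, keep_frac=0.5):
    """One step of a recurrent region-proposal loop (illustrative sketch):
    take the patch with the highest attention weight, crop a window around
    it, and upscale the crop back to the input resolution so it can be fed
    into the next pass."""
    h, w = image.shape[:2]
    gh, gw = attn.shape                            # attention grid over patches
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the winning patch centre back to pixel coordinates.
    cy, cx = int((py + 0.5) * h / gh), int((px + 0.5) * w / gw)
    ch, cw = int(h * keep_frac), int(w * keep_frac)
    y0 = min(max(cy - ch // 2, 0), h - ch)         # clamp crop inside the image
    x0 = min(max(cx - cw // 2, 0), w - cw)
    crop = image[y0:y0 + ch, x0:x0 + cw]
    # "Amplify": nearest-neighbour upscale of the crop back to (h, w).
    ys = np.arange(h) * ch // h
    xs = np.arange(w) * cw // w
    return crop[np.ix_(ys, xs)]

# Toy example: attention peaks in the top-left patch, so the amplified
# output is drawn entirely from the top-left quarter of the image.
image = np.arange(64).reshape(8, 8)
attn = np.zeros((4, 4))
attn[0, 0] = 1.0
zoomed = crop_and_amplify(image, attn)
```

In the paper's recurrent setting, the output of one such step would be passed through the transformer again, with each scale refining the previous proposal until the most discriminative region is located.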
format |
Article |
author |
Sun, Fayou Ngo, Hea Choon Zuqiang, Meng Sek, Yong Wee |
spellingShingle |
Sun, Fayou Ngo, Hea Choon Zuqiang, Meng Sek, Yong Wee Loop and distillation: Attention weights fusion transformer for fine‐grained representation |
author_facet |
Sun, Fayou Ngo, Hea Choon Zuqiang, Meng Sek, Yong Wee |
author_sort |
Sun, Fayou |
title |
Loop and distillation: Attention weights fusion transformer for
fine‐grained representation |
title_short |
Loop and distillation: Attention weights fusion transformer for
fine‐grained representation |
title_full |
Loop and distillation: Attention weights fusion transformer for
fine‐grained representation |
title_fullStr |
Loop and distillation: Attention weights fusion transformer for
fine‐grained representation |
title_full_unstemmed |
Loop and distillation: Attention weights fusion transformer for
fine‐grained representation |
title_sort |
loop and distillation: attention weights fusion transformer for
fine‐grained representation |
publisher |
John Wiley & Sons Ltd |
publishDate |
2023 |
url |
http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf http://eprints.utem.edu.my/id/eprint/27758/ https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181 https://doi.org/10.1049/cvi2.12181 |
_version_ |
1814061423675834368 |