Loop and distillation: Attention weights fusion transformer for fine‐grained representation

Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in traditional image classification owing to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture the critical feature regions needed for FGVC because it focuses only on the classification token and adopts a one-shot image input strategy. Moreover, ViT does not exploit the advantage of attention-weight fusion. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified as the next input to finally locate the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate the generalisation ability. Concurrently, RDTrans can easily be trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017.
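The recurrent region-selection step the abstract describes (pick the most attended region, crop it, amplify it, feed it back in) can be sketched roughly as below. This is a hypothetical illustration only, not the paper's implementation: `fuse_attention` assumes a rollout-style fusion of per-layer attention maps, and `crop_and_amplify`, `patch_grid`, and `crop_frac` are invented names and parameters.

```python
import numpy as np

def fuse_attention(attn_layers):
    """Fuse per-layer attention maps by recursive matrix product
    (rollout-style fusion; an assumption, not the paper's exact rule)."""
    fused = attn_layers[0]
    for a in attn_layers[1:]:
        fused = a @ fused
    return fused

def crop_and_amplify(image, attn, patch_grid=14, crop_frac=0.5):
    """Locate the most-attended patch via the CLS row of the fused
    attention map, crop a window around it, and upscale the crop back
    to the input resolution (nearest-neighbour) for the next pass."""
    h, w = image.shape[:2]
    cls_to_patches = attn[0, 1:]           # CLS-token attention over patches
    idx = int(np.argmax(cls_to_patches))   # most discriminative patch
    cy = (idx // patch_grid + 0.5) * h / patch_grid
    cx = (idx % patch_grid + 0.5) * w / patch_grid
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    y0 = int(np.clip(cy - ch / 2, 0, h - ch))
    x0 = int(np.clip(cx - cw / 2, 0, w - cw))
    crop = image[y0:y0 + ch, x0:x0 + cw]
    ys = np.arange(h) * ch // h            # nearest-neighbour upsample indices
    xs = np.arange(w) * cw // w
    return crop[np.ix_(ys, xs)]            # amplified crop, same size as input
```

Iterating this function on its own output approximates the "recurrent learning way" of the abstract: each scale zooms further into the region the fused attention weights rank highest.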

Bibliographic Details
Main Authors: Sun, Fayou, Ngo, Hea Choon, Zuqiang, Meng, Sek, Yong Wee
Format: Article
Language: English
Published: John Wiley & Sons Ltd 2023
Online Access:http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf
http://eprints.utem.edu.my/id/eprint/27758/
https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181
https://doi.org/10.1049/cvi2.12181
id my.utem.eprints.27758
record_format eprints
spelling my.utem.eprints.277582024-10-07T12:41:28Z http://eprints.utem.edu.my/id/eprint/27758/ Loop and distillation: Attention weights fusion transformer for fine‐grained representation Sun, Fayou Ngo, Hea Choon Zuqiang, Meng Sek, Yong Wee Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in traditional image classification owing to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture the critical feature regions needed for FGVC because it focuses only on the classification token and adopts a one-shot image input strategy. Moreover, ViT does not exploit the advantage of attention-weight fusion. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified as the next input to finally locate the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate the generalisation ability. Concurrently, RDTrans can easily be trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017. John Wiley & Sons Ltd 2023-01 Article PeerReviewed text en http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf Sun, Fayou and Ngo, Hea Choon and Zuqiang, Meng and Sek, Yong Wee (2023) Loop and distillation: Attention weights fusion transformer for fine‐grained representation. IET Computer Vision, 17 (4). pp. 473-482. 
ISSN 1751-9632 https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181 https://doi.org/10.1049/cvi2.12181
institution Universiti Teknikal Malaysia Melaka
building UTEM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknikal Malaysia Melaka
content_source UTEM Institutional Repository
url_provider http://eprints.utem.edu.my/
language English
description Learning subtle discriminative feature representation plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in traditional image classification owing to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture the critical feature regions needed for FGVC because it focuses only on the classification token and adopts a one-shot image input strategy. Moreover, ViT does not exploit the advantage of attention-weight fusion. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified as the next input to finally locate the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate the generalisation ability. Concurrently, RDTrans can easily be trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017.
format Article
author Sun, Fayou
Ngo, Hea Choon
Zuqiang, Meng
Sek, Yong Wee
spellingShingle Sun, Fayou
Ngo, Hea Choon
Zuqiang, Meng
Sek, Yong Wee
Loop and distillation: Attention weights fusion transformer for fine‐grained representation
author_facet Sun, Fayou
Ngo, Hea Choon
Zuqiang, Meng
Sek, Yong Wee
author_sort Sun, Fayou
title Loop and distillation: Attention weights fusion transformer for fine‐grained representation
title_short Loop and distillation: Attention weights fusion transformer for fine‐grained representation
title_full Loop and distillation: Attention weights fusion transformer for fine‐grained representation
title_fullStr Loop and distillation: Attention weights fusion transformer for fine‐grained representation
title_full_unstemmed Loop and distillation: Attention weights fusion transformer for fine‐grained representation
title_sort loop and distillation: attention weights fusion transformer for fine‐grained representation
publisher John Wiley & Sons Ltd
publishDate 2023
url http://eprints.utem.edu.my/id/eprint/27758/2/0130221062024102412871.pdf
http://eprints.utem.edu.my/id/eprint/27758/
https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12181
https://doi.org/10.1049/cvi2.12181
_version_ 1814061423675834368