Clustering swap prediction for image-text pre-training

It is essential to examine the pre-training strategy of multimodal models, which has a clear impact on downstream tasks. Clustering learning has recently yielded notable gains in several methods. However, due to the availability of open image-text pairs, it is challenging for multimoda...

Bibliographic Details
Main Authors: Fayou, Sun, Meng, Zuqiang, Ngo, Hea Choon, Sek, Yong Wee
Format: Article
Language: English
Published: Nature Research 2024
Online Access: http://eprints.utem.edu.my/id/eprint/27536/2/0130221062024105857.PDF
http://eprints.utem.edu.my/id/eprint/27536/
https://www.nature.com/articles/s41598-024-60832-x#:~:text=We%20argue%20that%20the%20advantages,can%20be%20dynamically%20adjusted%20with
https://doi.org/10.1038/s41598-024-60832-x
id my.utem.eprints.27536
record_format eprints
spelling my.utem.eprints.27536 2024-07-25T09:04:32Z Nature Research 2024-05 Article PeerReviewed text en Fayou, Sun and Meng, Zuqiang and Ngo, Hea Choon and Sek, Yong Wee (2024) Clustering swap prediction for image-text pre-training. Scientific Reports, 14 (1). ISSN 2045-2322
institution Universiti Teknikal Malaysia Melaka
building UTEM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknikal Malaysia Melaka
content_source UTEM Institutional Repository
url_provider http://eprints.utem.edu.my/
language English
description It is essential to examine the pre-training strategy of multimodal models, which has a clear impact on downstream tasks. Clustering learning has recently yielded notable gains in several methods. However, given the availability of open image-text pairs, clustering learning remains challenging for multimodal models. In this paper, we propose an approach that uses a clustering swap prediction strategy to learn an image-text clustering embedding space through interactive prediction between image and text features. Unlike existing models with clustering learning, our method (Clus) allows for an open number of clusters for web-scale alt-text data. Furthermore, to train the image and text encoders efficiently, we introduce a distillation learning approach and evaluate the performance of the image encoder on downstream visual tasks. In addition, Clus is pre-trained end-to-end on large-scale image-text pairs. Specifically, both text and image serve as ground truth for swap prediction, enabling effective representation learning. Extensive experiments demonstrate that Clus achieves state-of-the-art performance on multiple downstream fine-tuning and zero-shot tasks (i.e., Image-Text Retrieval, VQA, NLVR2, Image Captioning, Object Detection, and Semantic Segmentation).
format Article
author Fayou, Sun
Meng, Zuqiang
Ngo, Hea Choon
Sek, Yong Wee
title Clustering swap prediction for image-text pre-training
publisher Nature Research
publishDate 2024
url http://eprints.utem.edu.my/id/eprint/27536/2/0130221062024105857.PDF
http://eprints.utem.edu.my/id/eprint/27536/
https://www.nature.com/articles/s41598-024-60832-x#:~:text=We%20argue%20that%20the%20advantages,can%20be%20dynamically%20adjusted%20with
https://doi.org/10.1038/s41598-024-60832-x
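
The abstract in this record says that text and image each serve as ground truth for the other in swap prediction. The following is a minimal sketch of what such a swapped objective could look like, not the authors' implementation: the record does not give the Clus loss, so the function names, the temperatures, the fixed prototype list, and the use of a sharpened softmax to produce target assignments are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature):
    # Numerically stable softmax over a list of scores.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def swap_prediction_loss(image_feats, text_feats, prototypes,
                         target_temp=0.05, pred_temp=0.1):
    """Hypothetical symmetric swapped-prediction loss for paired features.

    Assumption: cluster prototypes are a fixed list of vectors, and target
    assignments come from a sharpened softmax. Neither is taken from the paper.
    """
    protos = [normalize(p) for p in prototypes]
    total = 0.0
    for img, txt in zip(image_feats, text_feats):
        img_scores = [dot(normalize(img), p) for p in protos]
        txt_scores = [dot(normalize(txt), p) for p in protos]
        # Soft cluster assignments used as targets (sharper temperature).
        q_img = softmax(img_scores, target_temp)
        q_txt = softmax(txt_scores, target_temp)
        # Predicted cluster distributions (softer temperature).
        p_img = softmax(img_scores, pred_temp)
        p_txt = softmax(txt_scores, pred_temp)
        # Swap: the text assignment supervises the image prediction,
        # and the image assignment supervises the text prediction.
        loss_img = -sum(q * math.log(p) for q, p in zip(q_txt, p_img))
        loss_txt = -sum(q * math.log(p) for q, p in zip(q_img, p_txt))
        total += 0.5 * (loss_img + loss_txt)
    return total / len(image_feats)


# Toy usage with two prototypes in a 2-D embedding space.
protos = [[1.0, 0.0], [0.0, 1.0]]
aligned = swap_prediction_loss([[0.9, 0.1]], [[0.8, 0.2]], protos)
misaligned = swap_prediction_loss([[0.9, 0.1]], [[0.1, 0.9]], protos)
print(aligned < misaligned)
```

In this toy setup, a pair whose image and text features fall near the same prototype incurs a smaller loss than a pair whose features fall near different prototypes, which is the behaviour a swapped objective is meant to reward. The fixed prototype list is the main simplification, since the abstract describes Clus as allowing an open number of clusters.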