Clustering swap prediction for image-text pre-training
It is essential to delve into the strategy of multimodal model pre-training, which has an obvious impact on downstream tasks. Currently, clustering learning has achieved noteworthy benefits in multiple methods. However, due to the availability of open image-text pairs, it is challenging for multimoda...
Main Authors: | Fayou, Sun; Meng, Zuqiang; Ngo, Hea Choon; Sek, Yong Wee |
Format: | Article |
Language: | English |
Published: | Nature Research, 2024 |
Online Access: | http://eprints.utem.edu.my/id/eprint/27536/2/0130221062024105857.PDF http://eprints.utem.edu.my/id/eprint/27536/ https://www.nature.com/articles/s41598-024-60832-x#:~:text=We%20argue%20that%20the%20advantages,can%20be%20dynamically%20adjusted%20with https://doi.org/10.1038/s41598-024-60832-x |
id |
my.utem.eprints.27536 |
record_format |
eprints |
spelling |
my.utem.eprints.275362024-07-25T09:04:32Z http://eprints.utem.edu.my/id/eprint/27536/ Clustering swap prediction for image-text pre-training Fayou, Sun Meng, Zuqiang Ngo, Hea Choon Sek, Yong Wee It is essential to delve into the strategy of multimodal model pre-training, which has an obvious impact on downstream tasks. Currently, clustering learning has achieved noteworthy benefits in multiple methods. However, due to the availability of open image-text pairs, it is challenging for multimodal models to adopt clustering learning. In this paper, we propose an approach that utilizes a clustering swap prediction strategy to learn an image-text clustering embedding space through interaction prediction between image and text features. Unlike existing models with clustering learning, our method (Clus) allows for an open number of clusters for web-scale alt-text data. Furthermore, in order to train the image and text encoders efficiently, we introduce a distillation learning approach and evaluate the performance of the image encoder on downstream visual tasks. In addition, Clus is pre-trained end-to-end using large-scale image-text pairs. Specifically, both text and image serve as ground truth for swap prediction, enabling effective representation learning. Extensive experiments demonstrate that Clus achieves state-of-the-art performance on multiple downstream fine-tuning and zero-shot tasks (i.e., Image-Text Retrieval, VQA, NLVR2, Image Captioning, Object Detection, and Semantic Segmentation). Nature Research 2024-05 Article PeerReviewed text en http://eprints.utem.edu.my/id/eprint/27536/2/0130221062024105857.PDF Fayou, Sun and Meng, Zuqiang and Ngo, Hea Choon and Sek, Yong Wee (2024) Clustering swap prediction for image-text pre-training. Scientific Reports, 14 (1). ISSN 2045-2322 https://www.nature.com/articles/s41598-024-60832-x#:~:text=We%20argue%20that%20the%20advantages,can%20be%20dynamically%20adjusted%20with https://doi.org/10.1038/s41598-024-60832-x |
institution |
Universiti Teknikal Malaysia Melaka |
building |
UTEM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknikal Malaysia Melaka |
content_source |
UTEM Institutional Repository |
url_provider |
http://eprints.utem.edu.my/ |
language |
English |
description |
It is essential to delve into the strategy of multimodal model pre-training, which has an obvious impact on downstream tasks. Currently, clustering learning has achieved noteworthy benefits in multiple methods. However, due to the availability of open image-text pairs, it is challenging for multimodal models to adopt clustering learning. In this paper, we propose an approach that utilizes a clustering swap prediction strategy to learn an image-text clustering embedding space through interaction prediction between image and text features. Unlike existing models with clustering learning, our method (Clus) allows for an open number of clusters for web-scale alt-text data. Furthermore, in order to train the image and text encoders efficiently, we introduce a distillation learning approach and evaluate the performance of the image encoder on downstream visual tasks. In addition, Clus is pre-trained end-to-end using large-scale image-text pairs. Specifically, both text and image serve as ground truth for swap prediction, enabling effective representation learning. Extensive experiments demonstrate that Clus achieves state-of-the-art performance on multiple downstream fine-tuning and zero-shot tasks (i.e., Image-Text Retrieval, VQA, NLVR2, Image Captioning, Object Detection, and Semantic Segmentation). |
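The "swap prediction" idea in the abstract — each modality's cluster assignment serving as the target for the other modality's prediction — can be sketched as follows. This is a minimal, hypothetical illustration of a SwAV-style swapped cross-entropy between modalities, not the authors' actual implementation: the function name, temperatures, and shapes are assumptions, and the paper's method may differ (e.g., real systems typically sharpen targets with Sinkhorn-Knopp rather than a plain softmax).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swap_prediction_loss(img_feats, txt_feats, prototypes,
                         pred_temp=0.1, target_temp=0.05):
    """Hypothetical SwAV-style swapped prediction between modalities.

    img_feats:  (B, D) image embeddings
    txt_feats:  (B, D) text embeddings
    prototypes: (K, D) shared cluster centers
    """
    # Similarity of each modality to the shared prototypes.
    img_scores = img_feats @ prototypes.T  # (B, K)
    txt_scores = txt_feats @ prototypes.T  # (B, K)
    # Targets: sharpened soft assignments (a stand-in for Sinkhorn-Knopp).
    q_img = softmax(img_scores / target_temp)
    q_txt = softmax(txt_scores / target_temp)
    # Predictions: softer assignments from the same scores.
    p_img = softmax(img_scores / pred_temp)
    p_txt = softmax(txt_scores / pred_temp)
    # Swapped cross-entropy: predict the image's assignment from the text
    # features and vice versa, so each modality is the other's ground truth.
    loss_i2t = -(q_img * np.log(p_txt + 1e-9)).sum(axis=1).mean()
    loss_t2i = -(q_txt * np.log(p_img + 1e-9)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the two cross-entropy terms are averaged, the loss is symmetric in the two modalities; minimizing it pushes paired image and text features toward the same prototypes.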
format |
Article |
author |
Fayou, Sun Meng, Zuqiang Ngo, Hea Choon Sek, Yong Wee |
spellingShingle |
Fayou, Sun Meng, Zuqiang Ngo, Hea Choon Sek, Yong Wee Clustering swap prediction for image-text pre-training |
author_facet |
Fayou, Sun Meng, Zuqiang Ngo, Hea Choon Sek, Yong Wee |
author_sort |
Fayou, Sun |
title |
Clustering swap prediction for image-text pre-training |
title_short |
Clustering swap prediction for image-text pre-training |
title_full |
Clustering swap prediction for image-text pre-training |
title_fullStr |
Clustering swap prediction for image-text pre-training |
title_full_unstemmed |
Clustering swap prediction for image-text pre-training |
title_sort |
clustering swap prediction for image-text pre-training |
publisher |
Nature Research |
publishDate |
2024 |
url |
http://eprints.utem.edu.my/id/eprint/27536/2/0130221062024105857.PDF http://eprints.utem.edu.my/id/eprint/27536/ https://www.nature.com/articles/s41598-024-60832-x#:~:text=We%20argue%20that%20the%20advantages,can%20be%20dynamically%20adjusted%20with https://doi.org/10.1038/s41598-024-60832-x |