Clustering swap prediction for image-text pre-training