Benchmark for Thai sentence representation on Thai STS-B.
Sentence representation plays a crucial role in NLP downstream tasks such as NLI, text classification, and STS. Recent sentence representation training techniques require NLI or STS datasets. However, there are no equivalent Thai NLI or STS datasets for sentence representation training. To address this problem, we train a sentence representation model with an unsupervised technique called SimCSE.
We show that it is possible to train SimCSE on 1.3M sentences from Thai Wikipedia within 2 hours on Google Colab (V100), where the resulting SimCSE-XLM-R performs on par with mDistil-BERT←mUSE (trained on >1B sentences).
Moreover, we provide a Thai sentence vector benchmark: we evaluate each sentence representation by its Spearman correlation score on Thai STS-B (a machine-translated version of STS-B).
How do we train unsupervised sentence representation?
- We use SimCSE (Simple Contrastive Learning of Sentence Embeddings) on multilingual LMs (mBERT, distil-mBERT, XLM-R)
- Training data: Thai Wikipedia
- Example: SimCSE-Thai.ipynb
- Easy to train
- Compatible with every model
- Does not require any annotated dataset
- The performance of XLM-R (unsupervised SimCSE) and m-Distil-BERT (trained on >1B sentences) is similar (about a 1% difference in correlation)
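The core of SimCSE is an in-batch InfoNCE loss where each sentence, encoded twice with different dropout masks, serves as its own positive. The actual training is in SimCSE-Thai.ipynb; below is a minimal, library-free sketch of that objective (vector values and the temperature default are illustrative, not the notebook's exact settings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def simcse_loss(view1, view2, temperature=0.05):
    """In-batch InfoNCE: view1[i] and view2[i] are the SAME sentence
    encoded twice with different dropout masks (the positive pair);
    every other sentence in the batch is a negative."""
    losses = []
    for i, h in enumerate(view1):
        sims = [cosine(h, h2) / temperature for h2 in view2]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_denom - sims[i])  # -log softmax of the positive
    return sum(losses) / len(losses)

# Toy batch of two sentences: aligned dropout views give a lower loss
# than mismatched ones, which is what the gradient pushes toward.
v1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(simcse_loss(v1, aligned) < simcse_loss(v1, swapped))  # True
```

In practice the two views come for free: passing the same batch through the encoder twice with dropout enabled yields `view1` and `view2`.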
What about Supervised Learning?
- We recommend Sentence-BERT, which you can train on NLI or STS data with triplet loss, contrastive loss, etc.
- We use a translated version of STS-B, produced by machine-translating the STS-B data from SentEval with Google Translate.
- How to evaluate sentence representation: SentEval.ipynb
| Base Model | Spearman's Correlation (×100) | Supervised? |
|------------|-------------------------------|-------------|
- Evaluation: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SentEval.ipynb
- Training Example: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SimCSE-Thai.ipynb
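The benchmark metric is the Spearman correlation between the cosine similarity of each sentence pair's embeddings and the gold STS score. The evaluation notebook uses SentEval for this; the following is a self-contained sketch of the computation (the toy embeddings and gold scores are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank(values):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy evaluation: embed both sentences of each pair, score by cosine,
# then correlate the predicted similarities with the gold STS scores.
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.5, 0.5]
pred = [cosine(a, b) for a, b in pairs]
print(spearman(pred, gold) * 100)  # ×100, as reported in the table
```

The benchmark table reports this correlation multiplied by 100.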
You can submit a pull request to add your model's results to the benchmark table!
Thanks to
- Can: proofreading
- Charin: proofreading + ideas