Bootstrapping Embeddings for Low Resource Languages

arXiv:2603.01732v1 Announce Type: new Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.

相关推荐

Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles

EigenBench: A Comparative Behavioral Measure of Value Alignment

CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion

See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles

Toward Clinically Explainable AI for Medical Diagnosis: A Foundation Model with Human-Compatible Reasoning via Reinforcement Learning