Among the services provided by Softcatalà, a non-profit 25-year-old grassroots organization that localizes software into Catalan and develops software to ease the generation of Catalan content, one of the most us...
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentia...
SmartBiC, an 18-month innovation project funded by the Spanish Government, aims at improving the full process of collecting, filtering and selecting in-domain parallel content to be used for machine translation and la...
We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model ...
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale usin...
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collec...
This paper describes the joint submission of Universitat d'Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bic...