hplt-project.org
High Performance Language Technologies

DATASETS AVAILABLE: version 1.2, version 2.0

A SPACE THAT COMBINES PETABYTES OF NATURAL LANGUAGE DATA WITH LARGE-SCALE MODEL TRAINING
Lots of monolingual and multilingual data, consistently formatted and curated. Efficient and high-quality language and translation models. Sustainable and reusable workflows using high-performance computing.
HPLT's factsheet

FAIR
Our data and models will be shared through FAIR repositories, catalogues and marketplaces for easy discovery, access, replication and exploitation.

TRANSPARENT
Our models will be reproducible, with information and evaluation metrics shown in publicly available dashboards and leaderboards.

HIGH-QUALITY
Applying consistent cleaning, anonymization, bias-reduction and metadata routines will enhance the quality and ethical properties of texts.

EFFICIENT
Our models will make use of NLP-aware supercomputing power in HPC centres to produce efficient models and pipelines.

CONTRIBUTED DATASETS
We would like to thank the following institutions for their contributed datasets:
* The Institute of the Estonian Language contributed several versions of the Estonian National Corpus in a suitable format to run HPLT cleaning tools. We redistribute the contributed datasets and the HPLT-cleaned versions under the original CC-BY license.

ESTONIAN NATIONAL CORPUS 19, 21 AND 23 (ORIGINAL, UNDER CC-BY): 16.43M docs, 3.25B words
ESTONIAN NATIONAL CORPUS 19, 21 AND 23 (HPLT CLEANING APPLIED): 11.50M docs, 2.95B words

SUCCESS STORIES

Dataset
HPLT CURATION: INSTITUTE OF THE ESTONIAN LANGUAGE CORPUS
The Institute of the Estonian Language contributed several versions of the Estonian National Corpus in a suitable format to run HPLT cleaning tools. We redistribute the contributed datasets and the HPLT-cleaned versions under the original CC-BY license.

Dataset
CULTURAX: A REFILTERED HPLT DATASET
From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed) / 3TB (zstd-compressed) that applies the same dataset cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model, Vistral-7B-Chat.
https://huggingface.co/datasets/ontocord/CulturaP

Dataset
CULTURAP: A MORE PERMISSIVE HPLT DATASET
From the team that brought you CulturaX and CulturaY, we present CulturaP, a filtered subset of the multilingual dataset CulturaY that we believe is more likely to be copyright-permissive and usable. CulturaY is in turn based on the HPLT v1.1 dataset. Ultimately, this dataset is based on Common Crawl and the Internet Archive.
https://huggingface.co/datasets/ontocord/CulturaY

Dataset
HUGGINGFACE-FRIENDLY HPLT DATASETS
This repository contains the means to access the datasets created by the HPLT project. These large-scale web-crawled corpora, based on Common Crawl and the Internet Archive, are accessible in 75 languages. The full dump is available, as well as deduplicated and further cleaned versions, depending on the config that you use (see the usage example below).
https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2
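As a rough, unofficial sketch of how one might load a single language configuration with the Hugging Face datasets library (the config name "en" and the "text" field below are assumptions; please check the dataset card linked above for the exact config names and for the full, deduplicated and cleaned variants):

    # Minimal sketch, not taken from the HPLT documentation.
    # The config name "en" and the "text" field are assumptions;
    # see the dataset card above for the names actually exposed.
    from datasets import load_dataset

    ds = load_dataset(
        "HPLT/hplt_monolingual_v1_2",
        "en",              # assumed language/config name
        split="train",
        streaming=True,    # stream instead of downloading the full dump
    )

    for i, record in enumerate(ds):
        print(record.get("text", "")[:200])  # first 200 characters of a document
        if i >= 2:                            # only peek at the first few records
            break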
LLM MODELS

HPLT-BASED NORA.LLM USED BY SCHIBSTED MEDIA
The Norwegian media group Schibsted Media uses NORA.LLM language models in its NLP pipelines for Norwegian. These models were trained on HPLT datasets (among other data).

CONTACT HPLT
I WOULD LIKE TO...
contribute a dataset
send suggestions and feedback

--------------------------------------------------------------------------------

HPLT EVENTS

HPLT & NLPL WINTER SCHOOL ON LARGE-SCALE LANGUAGE MODELING AND NEURAL MACHINE TRANSLATION WITH WEB DATA
After a two-year pandemic hiatus, the NLPL network and Horizon Europe project High-Performance Language Technologies (HPLT) join f...
6-8 February, 2023

WORKSHOP ON OPEN COMMUNITY-DRIVEN MACHINE TRANSLATION
The 1st edition of the Workshop on Open Community-Driven Machine Translation (CrowdMT 2023) will be held in Tampere, Finland, on J...
15 June, 2023

HPLT HACKATHON
On 17-25 June, 2023, the HPLT consortium will hold a hackathon around a set of topics related to corpora curation: language iden...
17-25 June, 2023

HPLT TOOLS (see all)

OPUSTRAINER
The purpose of the trainer is to provide the user with a flexible way of scheduling various sources of input data, as well as augm...
GitHub

OPUSCLEANER
OpusCleaner is a machine translation/language model data cleaner and training scheduler. The training scheduler has moved to OpusT...
GitHub

HPLT ANALYTICS
This tool provides a full range of analytics automatically computed on either monolingual or bilingual data sets to help making in...
GitHub

KEY ASPECTS OF HPLT

+ WHAT LANGUAGES WILL HPLT COVER?
We aim to cover around 80 languages: the ones for which we have counts of at least 100 million words in large web crawl collections. For more information, please see the language table in the About section.

+ IS HPLT PLANNING TO DELIVER NEW DATA SETS?
Yes! We will explore 7PB from the Internet Archive collections and 5PB from Common Crawl. We hope to deliver lots of new data. And not only that: we will also reprocess available datasets to enhance their quality.

+ WILL HPLT TRAIN GPT, BERT OR T5-LIKE LARGE LANGUAGE MODELS?
Yes! We intend to train hundreds to thousands of large language models of different flavours. HPLT will give the NLP community access to a landscape of efficient and high-quality language models for a variety of languages.

+ DO HPLT'S GOALS INCLUDE MACHINE TRANSLATION MODELS AS WELL?
Sure. Efficient machine translation models at scale are one of the ambitions of this project. We want to release models that run on CPU, are easily reproducible, and are of the highest possible quality.

+ CAN ONE CONTRIBUTE TO HPLT?
Of course. If you find a dataset that you would like to contribute, or one that needs reprocessing by HPLT, please contact us. We will also need people to help us review corpora and the output of models for particular tasks. Just get in touch with us!

+ CAN I USE THE DATA YOU ARE PRODUCING?
Yes, we will publish several growing versions of both monolingual and parallel plain-text datasets. They will be available to all. However, since we do not own the original data, it is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.

+ WHY HPLT?
Because we want to make language modelling better served by consistent open datasets and reproducible, efficient models. Because we want HPC centres to be suitable for NLP processing at scale. And because we still need transparent large language models and machine translation models for many languages, to open research and business opportunities for them. If you are interested in the name, please visit the About section.

CONSORTIUM PARTNERS

--------------------------------------------------------------------------------

CHARLES UNIVERSITY
UNIVERSITY OF OSLO
UNIVERSITY OF EDINBURGH
UNIVERSITY OF TURKU
UNIVERSITY OF HELSINKI
PROMPSIT
CESNET
SIGMA2

--------------------------------------------------------------------------------

Stay up-to-date with us! Get information about new releases, content and more! Visit our Twitter.

© HPLT 2024

This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546]. The contents of this publication are the sole responsibility of the HPLT consortium and do not necessarily reflect the opinion of the European Union.