
High Performance
Language Technologies

Tools & Pipelines · About · Publications · Dashboards · Deliverables · Models
Datasets: v1 Release · v1.1 Release · v1.2 Release · v2.0 Release


DATASETS AVAILABLE

version 1.2 · version 2.0


A SPACE THAT COMBINES PETABYTES OF NATURAL LANGUAGE DATA WITH LARGE-SCALE MODEL
TRAINING

Lots of monolingual and multilingual data consistently formatted and curated

Efficient and high-quality language and translation models

Sustainable and reusable workflows using high-performance computing

HPLT's factsheet

Image from storyset Freepik


FAIR

Our data and models will be shared through FAIR repositories, catalogues and
marketplaces for easy discovery, access, replication and exploitation.


TRANSPARENT

Our models will be reproducible with information and evaluation metrics shown in
publicly available dashboards and leaderboards.


HIGH-QUALITY

Consistent cleaning, anonymization, bias-reduction, and metadata routines will
enhance the quality and ethical properties of texts.


EFFICIENT

We will make use of NLP-aware supercomputing power in HPC centres to produce
efficient models and pipelines.


CONTRIBUTED DATASETS

We would like to thank the following institutions for their contributed
datasets:

 * The Institute of the Estonian Language contributed several versions of the
   Estonian National Corpus in a format suitable for running the HPLT cleaning
   tools. We redistribute both the contributed datasets and the HPLT-cleaned
   versions under the original CC-BY license.


ESTONIAN NATIONAL CORPUS 19 AND 21 AND 23 (ORIGINAL UNDER CC-BY)

16.43M docs

3.25B words


ESTONIAN NATIONAL CORPUS 19 AND 21 AND 23 (HPLT CLEANING APPLIED)

11.50M docs

2.95B words


SUCCESS STORIES

Dataset


HPLT CURATION: INSTITUTE OF THE ESTONIAN LANGUAGE CORPUS

The Institute of the Estonian Language contributed several versions of the
Estonian National Corpus in a format suitable for running the HPLT cleaning
tools. We redistribute both the contributed datasets and the HPLT-cleaned
versions under the original CC-BY license.

Dataset


CULTURAY: A REFILTERED HPLT DATASET

From the team that brought you CulturaX, we present CulturaY, another
substantial multilingual dataset of 15TB (uncompressed) / 3TB (zstd-compressed)
that applies the same dataset cleaning methodology to the HPLT v1.1 dataset.
Please note that HPLT v1.2 has also been released and is an alternative version
with different cleaning methodologies. This data was used in part to train our
SOTA Vietnamese model: Vistral-7B-Chat.
https://huggingface.co/datasets/ontocord/CulturaY

Dataset


CULTURAP: A MORE PERMISSIVE HPLT DATASET

From the team that brought you CulturaX and CulturaY, we present CulturaP, a
filtered subset of the multilingual dataset CulturaY that we believe is more
likely to be copyright-permissive and usable. CulturaY is in turn based on the
HPLT v1.1 dataset. Ultimately, this dataset is based on Common Crawl and the
Internet Archive. https://huggingface.co/datasets/ontocord/CulturaP

Dataset


HUGGINGFACE-FRIENDLY HPLT DATASETS

This repository contains the means to access the datasets created by the HPLT
project. These large-scale web-crawled corpora based on CommonCrawl and the
Internet Archive are accessible in 75 languages. The full dump is available as
well as deduplicated and further cleaned versions depending on the config that
you use (see the usage example below).
https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2
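
A minimal Python sketch of that access pattern, assuming the Hugging Face
datasets library, is shown below; the language config name ("et") and the
record field ("text") are assumptions and should be checked against the
dataset card.

# Minimal sketch: stream one language from the HPLT v1.2 monolingual release.
# The config name ("et") and the record field ("text") are assumptions;
# check the dataset card for the exact names.
from datasets import load_dataset

dataset = load_dataset("HPLT/hplt_monolingual_v1_2", "et", streaming=True)

for i, record in enumerate(dataset["train"]):
    print(record.get("text", "")[:200])  # first 200 characters of a document
    if i >= 2:
        break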

LLM Models


HPLT-BASED NORA.LLM USED BY SCHIBSTED MEDIA

The Norwegian media group Schibsted uses NORA.LLM language models in its NLP
pipelines for Norwegian. These models were trained on HPLT datasets (among
other data).


Contact HPLT


I WOULD LIKE TO...

contribute a dataset · send suggestions and feedback

--------------------------------------------------------------------------------


HPLT EVENTS


HPLT & NLPL WINTER SCHOOL ON LARGE-SCALE LANGUAGE MODELING AND NEURAL MACHINE
TRANSLATION WITH WEB DATA

After a two-year pandemic hiatus, the NLPL network and Horizon Europe project
High-Performance Language Technologies (HPLT) join f...

6-8 February, 2023


WORKSHOP ON OPEN COMMUNITY-DRIVEN MACHINE TRANSLATION

The 1st edition of the Workshop on Open Community-Driven Machine Translation
(CrowdMT 2023) will be held in Tampere, Finland, on J...

15 June, 2023


HPLT HACKATHON

June 17th-25th, 2023, the HPLT consortium will hold a hackathon around a set of
topics related to corpora curation: language iden...

17-25 June, 2023


HPLT TOOLS

See all


OPUSTRAINER

The purpose of the trainer is to provide the user with a flexible way of
scheduling various sources of input data, as well as augm...

GitHub
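
OpusTrainer's actual configuration format is documented in the GitHub
repository above; purely as a conceptual illustration of weighted scheduling
over several data sources (not OpusTrainer's interface), the Python sketch
below interleaves lines from two hypothetical corpora according to fixed
sampling weights.

# Conceptual sketch of weighted data-source scheduling (not OpusTrainer's API).
# File names and weights below are hypothetical.
import random

def scheduled_lines(sources, weights, seed=1):
    """Yield lines from several corpora, picking a source per line by weight."""
    rng = random.Random(seed)
    handles = {name: open(path, encoding="utf-8") for name, path in sources.items()}
    names = list(handles)
    try:
        while names:
            name = rng.choices(names, weights=[weights[n] for n in names])[0]
            line = handles[name].readline()
            if not line:                      # this source is exhausted
                handles.pop(name).close()
                names.remove(name)
                continue
            yield line.rstrip("\n")
    finally:
        for handle in handles.values():
            handle.close()

# Example: mix 80% clean data with 20% noisier web-crawled data.
for line in scheduled_lines({"clean": "clean.tsv", "web": "web.tsv"},
                            {"clean": 0.8, "web": 0.2}):
    pass  # feed each line to the trainer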


OPUSCLEANER

OpusCleaner is a machine translation/language model data cleaner and training
scheduler. The training scheduler has moved to OpusT...

GitHub


HPLT ANALYTICS

This tool provides a full range of analytics automatically computed on either
monolingual or bilingual data sets to help making in...

GitHub
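
The tool's actual metrics and interface are documented in the repository above;
as a rough, hypothetical illustration of the kind of statistics such analytics
can report for a monolingual data set, the sketch below counts documents, words
and mean document length in a plain-text file with one document per line.

# Hypothetical illustration of simple corpus analytics (not the HPLT tool's API):
# document count, word count and mean document length, one document per line.
import sys

def corpus_stats(path):
    docs = words = 0
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            text = line.strip()
            if not text:
                continue
            docs += 1
            words += len(text.split())
    return {"documents": docs, "words": words,
            "mean_words_per_doc": words / docs if docs else 0.0}

if __name__ == "__main__":
    print(corpus_stats(sys.argv[1]))  # e.g. python corpus_stats.py corpus.txt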


KEY ASPECTS OF HPLT

+


WHAT LANGUAGES WILL HPLT COVER?

We aim to cover around 80 languages: those for which we count at least 100
million words in large web-crawl collections. For more information, please see
the language table in the About section.

+


IS HPLT PLANNING TO DELIVER NEW DATA SETS?

Yes! We will explore 7PB from the Internet Archive collections and 5PB from
Common Crawl, and we hope to deliver lots of new data. But not only that: we
will also reprocess available datasets to enhance their quality.

+


WILL HPLT TRAIN GPT, BERT OR T5-LIKE LARGE LANGUAGE MODELS?

Exactly. We intend to train hundreds to thousands of large language models of
different flavours. HPLT will give the NLP community access to a landscape of
efficient and high-quality language models for a variety of languages.

+


DO HPLT'S GOALS INCLUDE MACHINE TRANSLATION MODELS AS WELL?

Sure. Efficient machine translation models at scale are one of the ambitions of
this project. We want to release models that run on CPU, are easily reproducible
and are of the highest possible quality.

+


CAN ONE CONTRIBUTE TO HPLT?

Of course. If you find a data set that you would like to contribute, or one that
needs reprocessing by HPLT, please contact us. We will also need people to help
us review corpora and the output of models for particular tasks. Just get in
touch with us!

+


CAN I USE THE DATA YOU ARE PRODUCING?

Yes, we will publish several growing versions of both monolingual and parallel
plain-text datasets, available to all. However, since we do not own the original
data, it is your responsibility to ensure that any use of the data complies with
the applicable legal frameworks, such as, among others, the EU Copyright
Directive 2019/790 and the General Data Protection Regulation 2018, as amended.

+


WHY HPLT?

Because we want to support language modelling with consistent open data sets and
reproducible, efficient models. Because we want HPC centres to be suitable for
NLP processing at scale. And because we still need transparent large language
models and machine translation models for many languages, to open up research
and business opportunities for them. If you are interested in the name, please
visit the About section.


CONSORTIUM PARTNERS

--------------------------------------------------------------------------------

CHARLES UNIVERSITY

UNIVERSITY OF OSLO

UNIVERSITY OF EDINBURGH

UNIVERSITY OF TURKU

UNIVERSITY OF HELSINKI

PROMPSIT

CESNET

SIGMA2

--------------------------------------------------------------------------------

Stay up-to-date with us! Get information about new releases, content and more!

Visit our Twitter

© HPLT 2024

This project has received funding from the European Union’s Horizon Europe
research and innovation programme under grant agreement No 101070350 and from UK
Research and Innovation (UKRI) under the UK government’s Horizon Europe funding
guarantee [grant number 10052546]

The contents of this publication are the sole responsibility of the HPLT
consortium and do not necessarily reflect the opinion of the European Union.
