https://huggingface.co/datasets/PleIAs/common_corpus

You can also browse general-purpose platforms like Kaggle or Data Commons, but here are some sample datasets to try:

One of the largest free datasets available for training large language models (LLMs) is Common Corpus, published by PleIAs. It contains approximately 500 billion words across multiple languages, including English, French, German, Spanish, Dutch, and Italian. The dataset is designed to be open and free of copyright concerns, making it well suited to training open and reproducible LLMs.
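If you want to inspect Common Corpus before committing to a full download, the Hugging Face `datasets` library can stream records on demand. Here is a minimal sketch, assuming a standard `train` split and a `text` column (field names may vary between releases):

```python
from datasets import load_dataset  # pip install datasets

# Stream records instead of downloading all ~500B words up front.
ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at the first few documents.
for i, record in enumerate(ds):
    print(record.get("text", "")[:200])  # first 200 characters
    if i >= 2:
        break
```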

Another notable resource is The Pile, a dataset curated by EleutherAI. It consists of roughly 825 GiB of diverse text data, including academic papers, books, and web content, and is widely used for pretraining LLMs.
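The same streaming pattern works for The Pile. The original EleutherAI release is distributed as JSONL, but mirrors exist on the Hugging Face Hub; the mirror name and the `meta` field in this sketch are assumptions, so verify availability and licensing before relying on them:

```python
from datasets import load_dataset  # pip install datasets

# "monology/pile-uncopyrighted" is one commonly referenced mirror
# (an assumption -- substitute whichever copy you have access to).
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

# Most mirrors keep per-record provenance in a "meta" dict.
for i, record in enumerate(pile):
    subset = record.get("meta", {}).get("pile_set_name", "unknown")
    print(f"[{subset}] {len(record['text'])} characters")
    if i >= 4:
        break
```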

If you’re exploring datasets for specific purposes, platforms like LLMDataHub provide a curated collection of datasets tailored for LLM training. Let me know if you’d like more details!

https://kili-technology.com/large-language-models-llms/9-open-sourced-datasets-for-training-large-language-models

Other organizations that host open datasets include Hugging Face, Occiglot, EleutherAI, and Nomic AI.

https://github.com/mlabonne/llm-datasets