
OLMo Code Clean Dataset

This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.

Dataset Description

  • Repository: olmo-code-dataset
  • Type: Code dataset
  • Languages: Python 2, Python 3
  • Format: JSONL (JSON Lines)
  • Purpose: Fine-tuning language models for code generation

Files

The dataset contains multiple JSONL files:

  • python2_chunk_*.jsonl: Python 2 code chunks
  • python3_chunk_*.jsonl: Python 3 code chunks
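The naming patterns above can be used to separate the chunk files by language. The following is a minimal sketch using only the standard library; the helper name `group_chunk_files` and the example file names are hypothetical, and the actual chunk suffixes in the repository may differ.

```python
from fnmatch import fnmatch

# Patterns taken from the file list above.
PATTERNS = {
    "python2": "python2_chunk_*.jsonl",
    "python3": "python3_chunk_*.jsonl",
}


def group_chunk_files(filenames):
    """Group file names by language according to the chunk naming patterns."""
    groups = {language: [] for language in PATTERNS}
    for name in sorted(filenames):
        for language, pattern in PATTERNS.items():
            if fnmatch(name, pattern):
                groups[language].append(name)
    return groups


# Hypothetical file names, used only for illustration.
example_names = ["python3_chunk_0001.jsonl", "python2_chunk_0001.jsonl", "README.md"]
print(group_chunk_files(example_names))
```

Files that match neither pattern are ignored, so the helper can be pointed at a full directory listing.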

Data Format

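Each line in the JSONL files contains one JSON object with a "text" field holding the code and a "metadata" object. The "extension" field is either "python2" or "python3"; the other values shown below are placeholders describing each field:

{
    "text": "code content here",
    "metadata": {
        "extension": "python2",
        "source": "original source information",
        "length": "token length"
    }
}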
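To make the record layout concrete, here is a small sketch that parses one JSONL line and checks it against the fields described above. The function name `parse_record` and the sample values are hypothetical; in particular, the card does not state whether "length" is stored as a number or a string, so the sketch does not check its type.

```python
import json

# The two extension values named in the format description.
ALLOWED_EXTENSIONS = {"python2", "python3"}


def parse_record(line):
    """Parse one JSONL line and check the fields described in this card."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must contain a JSON object")
    if not isinstance(record.get("text"), str):
        raise ValueError("record is missing a string 'text' field")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        raise ValueError("record is missing a 'metadata' object")
    if metadata.get("extension") not in ALLOWED_EXTENSIONS:
        raise ValueError("'extension' must be 'python2' or 'python3'")
    return record


# Made-up sample line; the values are illustrative, not taken from the dataset.
sample_line = json.dumps({
    "text": "print('hello')",
    "metadata": {"extension": "python3", "source": "example", "length": 5},
})
record = parse_record(sample_line)
print(record["metadata"]["extension"])
```

Raising on malformed records is one possible design; a loader that skips or logs bad lines instead would work equally well for large files.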

Usage

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dipikakhullar/olmo-code-dataset")

# Access training data
train_data = dataset["train"]
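Once records are loaded, a common next step is to keep only one Python version. The sketch below works on plain dictionaries shaped like the records in the Data Format section; the helper names `has_extension` and `select_by_extension` are hypothetical and not part of this repository.

```python
def has_extension(example, extension):
    """Return True if the record's metadata names the given extension."""
    metadata = example.get("metadata") or {}
    return metadata.get("extension") == extension


def select_by_extension(examples, extension):
    """Keep only the records whose metadata matches the given extension."""
    return [example for example in examples if has_extension(example, extension)]


# Made-up records, used only for illustration.
records = [
    {"text": "print 'hi'", "metadata": {"extension": "python2"}},
    {"text": "print('hi')", "metadata": {"extension": "python3"}},
]
python3_only = select_by_extension(records, "python3")
print(len(python3_only))
```

Assuming the loaded split exposes "metadata" as a nested dictionary, the same predicate could be passed to the `filter` method of the split, for example `train_data.filter(lambda example: has_extension(example, "python3"))`.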

Citation

If you use this dataset, please cite the original sources and this repository.

License

MIT License
