OLMo Code Clean Dataset
This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.
Dataset Description
- Repository: olmo-code-dataset
- Type: Code dataset
- Languages: Python 2, Python 3
- Format: JSONL (JSON Lines)
- Purpose: Fine-tuning language models for code generation
Files
The dataset contains multiple JSONL files:
- python2_chunk_*.jsonl: Python 2 code chunks
- python3_chunk_*.jsonl: Python 3 code chunks
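As a rough illustration of the naming scheme, the sketch below maps a file name to its language using the two patterns listed here. The helper name `language_of` and the sample file names (including the chunk numbering) are hypothetical and not taken from the repository.

```python
from fnmatch import fnmatch

def language_of(filename):
    """Return "python2" or "python3" based on the file name pattern, else None."""
    if fnmatch(filename, "python2_chunk_*.jsonl"):
        return "python2"
    if fnmatch(filename, "python3_chunk_*.jsonl"):
        return "python3"
    return None

# Hypothetical file names; the real chunk numbering may differ.
names = ["python2_chunk_0001.jsonl", "python3_chunk_0001.jsonl", "README.md"]
for name in names:
    print(name, language_of(name))
```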
Data Format
Each line in the JSONL files is a single JSON object with the following fields (the values shown describe each field and are not literal data):
{
"text": "code content here",
"metadata": {
"extension": "python2" or "python3",
"source": "original source information",
"length": "token length"
}
}
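A minimal sketch of reading one record in this format, assuming only what is described above: a string `text` field and a `metadata` object whose `extension` is either `python2` or `python3`. The function name `parse_record` and the sample line are hypothetical, and the sample stores `length` as an integer, which is an assumption; the actual type is not specified here.

```python
import json

ALLOWED_EXTENSIONS = ("python2", "python3")

def parse_record(line):
    """Parse one JSONL line and check the fields described in this card."""
    record = json.loads(line)
    if not isinstance(record.get("text"), str):
        raise ValueError("record is missing a string 'text' field")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        raise ValueError("record is missing a 'metadata' object")
    if metadata.get("extension") not in ALLOWED_EXTENSIONS:
        raise ValueError("metadata 'extension' must be 'python2' or 'python3'")
    return record

# Hypothetical sample line; 'length' as an integer is an assumption.
sample = (
    '{"text": "print(1)", '
    '"metadata": {"extension": "python3", "source": "example", "length": 3}}'
)
record = parse_record(sample)
print(record["metadata"]["extension"])
```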
Usage
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("dipikakhullar/olmo-code-dataset")
# Access training data
train_data = dataset["train"]
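To work with only one language, records can be selected on the `extension` field. The sketch below applies a predicate to a few hypothetical in-memory records so that it runs without downloading anything; the predicate name `is_python3` and the records themselves are illustrative.

```python
def is_python3(example):
    """Return True when the record's metadata marks it as Python 3 code."""
    return example["metadata"]["extension"] == "python3"

# Hypothetical records shaped like the Data Format section describes.
records = [
    {"text": "print 'hi'", "metadata": {"extension": "python2"}},
    {"text": "print('hi')", "metadata": {"extension": "python3"}},
    {"text": "x = 1", "metadata": {"extension": "python3"}},
]

python3_only = [r for r in records if is_python3(r)]
print(len(python3_only))
```

With the loaded dataset, the same predicate could likely be passed to `train_data.filter(is_python3)`, assuming the `metadata` column is loaded as a nested dictionary.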
Citation
If you use this dataset, please cite the original sources and this repository.
License
MIT License