CodeParrot-Multi 🦜 (small)
CodeParrot-Multi 🦜 is a GPT-2 model (110M parameters) trained to generate code in 9 programming languages: Java, JavaScript, PHP, Python, C#, C++, GO, Ruby and TypeScript.
Usage
You can load the CodeParrot-Multi model and tokenizer directly in transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)
```
or with a pipeline:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")
```
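Note that calling the model directly only returns logits; to get actual completions, use `model.generate` or the pipeline with sampling enabled. The snippet below is a minimal sketch: the generation settings (`max_new_tokens`, `temperature`, `top_p`, `num_return_sequences`) are illustrative choices, not values recommended in this card.
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")

# Sample a few candidate completions; these generation settings are
# illustrative, not tuned values from the model card.
candidates = pipe(
    "def hello_world():",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=3,
)
for candidate in candidates:
    print(candidate["generated_text"])
```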
Training
The model was trained on the Github code small dataset after near-deduplication, a subset of the Github code dataset, with the following settings:
| Config | Value | 
|---|---|
| Batch size | 192 | 
| Context size | 1024 | 
| Training steps | 300,000 | 
| Gradient accumulation | 2 | 
| Gradient checkpointing | False | 
| Learning rate | 5e-4 | 
| Weight decay | 0.1 | 
| Warmup steps | 2000 | 
| Schedule | Cosine | 
The training was executed on 16 x A100 (40GB) GPUs. With these settings, the model saw roughly 58 billion tokens.
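As a rough sanity check, that token count follows directly from the table above (a minimal sketch, assuming the batch size of 192 is the effective batch after gradient accumulation):
```python
# Rough token-count check from the training settings above; assumes the
# batch size of 192 is the effective (post-accumulation) batch.
batch_size = 192
context_size = 1024      # tokens per sequence
training_steps = 300_000

total_tokens = batch_size * context_size * training_steps
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~59B, i.e. roughly 58-59 billion
```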
Performance
We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:
| Metric | Value | 
|---|---|
| pass@1 | --% | 
| pass@10 | --% | 
| pass@100 | --% | 
The pass@k metric is the probability that at least one out of k generated programs passes the unit tests.
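For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper: generate n ≥ k samples per problem, count the c samples that pass, and estimate 1 - C(n-c, k) / C(n, k). A minimal sketch:
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total generations per problem, c: number that pass the tests.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical counts for illustration: 200 samples per problem, 13 pass.
print([round(pass_at_k(200, 13, k), 3) for k in (1, 10, 100)])
```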
Resources
- Code: repository