BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artifacts of this collaboration: **StarCoder**, a state-of-the-art language model for code; **OctoPack**, artifacts for instruction tuning large code models; **The Stack**, the largest available pretraining dataset of permissively licensed code; and **SantaCoder**, a 1.1B-parameter model for code.

---

<details>
<summary>
<b><font size="+1">💫StarCoder</font></b>
</summary>

- [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search over code in the pretraining dataset.
- [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing-fast test of whether code was present in the pretraining dataset.
</details>
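StarCoder supports fill-in-the-middle (FIM) completion in addition to left-to-right generation. A minimal sketch of how such a prompt is assembled, assuming the sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` in prefix-suffix-middle order (verify the exact spelling against the model card before use):

```python
# Sketch: assembling a fill-in-the-middle (FIM) prompt.
# Assumption: prefix-suffix-middle ordering with the sentinel tokens below;
# check the StarCoder model card for the exact tokens.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Place the code before and after the cursor so the model
    generates the missing middle after the final sentinel."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt("def fib(n):\n    ", "\n    return a")
print(prompt)
```

The model's completion is then read from whatever it generates after the closing sentinel.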

---

<details>
<summary>
<b><font size="+1">🐙OctoPack</font></b>
</summary>

OctoPack consists of data, evaluations, and models for Code LLMs that follow human instructions.

- [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
- [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
- [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
- [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
- [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
- [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for code fixing, explaining, and synthesizing across Python, JavaScript, Java, Go, C++, and Rust.
- [OctoCoder](https://huggingface.co/bigcode/octocoder): StarCoder instruction-tuned on CommitPackFT.
- [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
- [OctoGeeX](https://huggingface.co/bigcode/octogeex): CodeGeeX2 instruction-tuned on CommitPackFT.
</details>
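Because OctoCoder and OctoGeeX are instruction-tuned, they expect a natural-language instruction wrapped in a fixed template rather than a bare code prefix. A minimal sketch, assuming the "Question:/Answer:" template described on the OctoCoder model card (verify the exact template there):

```python
# Sketch: formatting an instruction for an OctoCoder-style model.
# Assumption: the "Question: ... Answer:" template from the model card.

def build_instruction_prompt(instruction: str) -> str:
    """Wrap a human instruction; the model continues after 'Answer:'."""
    return f"Question: {instruction}\n\nAnswer:"

prompt = build_instruction_prompt(
    "Write a Python function that reverses a string."
)
print(prompt)
```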

---

<details>
<summary>
<b><font size="+1">📑The Stack</font></b>
</summary>

The Stack is 6.4TB of permissively licensed source code in 358 programming languages.

- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
</details>
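The distinction between the two dataset variants matters for training: exact deduplication removes byte-identical files, while near deduplication also removes files that differ only slightly (The Stack's near dedup is based on MinHash LSH). A toy sketch of the underlying idea using plain Jaccard similarity over token shingles; all names here are illustrative, not the actual pipeline:

```python
# Toy sketch: exact dedup drops identical files; near dedup also drops
# files whose token shingles overlap heavily. Real pipelines approximate
# Jaccard similarity with MinHash LSH instead of computing it exactly.

def shingles(code: str, k: int = 3) -> set:
    """All runs of k consecutive whitespace-separated tokens."""
    toks = code.split()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two files' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

x = "def add ( a , b ) : return a + b"
y = "def add ( a , b ) : return b + a"
# x and y are not byte-identical (exact dedup keeps both), but their
# high shingle overlap would flag them as near duplicates.
print(jaccard(x, y))
```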

---

<details>
<summary>
<b><font size="+1">🎅SantaCoder</font></b>
</summary>

SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.

- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder model.
- [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
- [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
- [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
</details>