Update README.md
README.md
CHANGED
@@ -35,7 +35,7 @@ BigCode is an open scientific collaboration working on responsible training of l
 
 ### Data & Governance
 - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
-- [
+- [StarCoder2 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
 - [The Stack train smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids): The Software Heritage identifiers for the training dataset of StarCoder2 3B and 7B with 600B+ unique tokens.
 - [The Stack train full](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids): The Software Heritage identifiers for the training dataset of StarCoder2 15B with 900B+ unique tokens.
 - [StarCoder2 Search](https://huggingface.co/spaces/bigcode/search-v2): Full-text search code in the pretraining dataset.
@@ -66,7 +66,7 @@ BigCode is an open scientific collaboration working on responsible training of l
 
 ### Data & Governance
 - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
-- [
+- [StarCoder2 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
 - [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
 - [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search code in the pretraining dataset.
 - [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing fast test if code was present in pretraining dataset.
@@ -94,8 +94,8 @@ BigCode is an open scientific collaboration working on responsible training of l
 <summary>
 <b><font size="+1">📑The Stack</font></b>
 </summary>
-The Stack v1 is a 6.4TB of source code in 358 programming languages from permissive licenses.<br>
-The Stack v2 is a 67.5TB of source code in over 600 programming languages with permissive licenses or no license.
+The Stack v1 is a 6.4TB dataset of source code in 358 programming languages from permissive licenses.<br>
+The Stack v2 is a 67.5TB dataset of source code in over 600 programming languages with permissive licenses or no license.
 
 - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
 - [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2): Exact deduplicated version of The Stack v2.