Update README.md

README.md CHANGED

@@ -9,5 +9,24 @@ app_file: app.py
 pinned: false
 license: mit
 ---
-
-
+## ChatGPT Datasets 📚
+- WebText
+- Common Crawl
+- BooksCorpus
+- English Wikipedia
+- Toronto Books Corpus
+- OpenWebText
+## ChatGPT Datasets - Details 📚
+- **WebText:** A dataset of web pages crawled from domains on the Alexa top 5,000 list. This dataset was used to pretrain GPT-2.
+  - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.
+- **Common Crawl:** A regularly updated dataset of web pages from a wide variety of domains. A filtered subset of this dataset was used to pretrain GPT-3.
+  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
+- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+  - [Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
+- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+  - [Improving Language Understanding by Generative Pre-Training](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build), a Space for Wikipedia search
+- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
+  - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Artetxe and Schwenk.
+- **OpenWebText:** An open-source recreation of the WebText corpus, with web pages filtered to remove content likely to be low-quality or spam.
+  - [OpenWebText Corpus](https://paperswithcode.com/dataset/openwebtext) by Gokaslan and Cohen.
+
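+A minimal sketch for exploring the corpora above that have public releases, assuming the Hugging Face `datasets` library and its community Hub copies (the dataset IDs `openwebtext`, `bookcorpus`, and the date-stamped `wikipedia` dump are assumptions about the Hub, not part of the original releases). WebText itself and GPT-3's filtered Common Crawl snapshot were never publicly released, so they are omitted:
+
+```python
+from datasets import load_dataset
+
+# Stream each corpus so nothing is downloaded in full up front.
+# (Script-based loaders may additionally require trust_remote_code=True
+# on newer versions of the datasets library.)
+openwebtext = load_dataset("openwebtext", split="train", streaming=True)  # open recreation of WebText
+bookcorpus = load_dataset("bookcorpus", split="train", streaming=True)    # a.k.a. Toronto Books Corpus
+wikipedia = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
+
+# Peek at the first record of each corpus.
+for name, ds in [("openwebtext", openwebtext), ("bookcorpus", bookcorpus), ("wikipedia", wikipedia)]:
+    print(name, next(iter(ds))["text"][:80])
+```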