Update curated.py
curated.py +3 -3
curated.py
CHANGED
@@ -33,7 +33,7 @@ curated_sources_intro = Div(
 P(
 "Curated sources comprise high-quality datasets that contain domain-specificity.",
 B(
-" TxT360 was strongly influenced by The Pile regarding both inclusion of the dataset and filtering techniques."
+" TxT360 was strongly influenced by The Pile", D_cite(bibtex_key="thepile"), " regarding both inclusion of the dataset and filtering techniques."
 ),
 " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
 ),
@@ -685,7 +685,7 @@ filtering_process = Div(
 ),
 P(
 B("Download and Extraction: "),
-"All the data was downloaded in original latex format from
+"All the data was downloaded in original latex format from ArXiv official S3 repo: ",
 A("s3://arxic/src", href="s3://arxic/src"),
 ". We try to encode the downloaded data into utf-8 or guess encoding using chardet library. After that pandoc was used to extract information from the latex files and saved as markdown format",
 D_code(
@@ -703,7 +703,7 @@ filtering_process = Div(
 ),
 P(
 B(" Filters Applied: "),
-"multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset
+"multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset", D_cite(bibtex_key="peS2o"),
 ),
 Ul(
 Li(
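Reviewer note: the "Download and Extraction" text in this change describes decoding each downloaded file as UTF-8 and otherwise guessing the encoding with chardet. The snippet below is a minimal sketch of that fallback and is not taken from the TxT360 pipeline. The helper name `decode_latex_bytes` is hypothetical, and returning `None` for undecodable input is an assumption. The pandoc conversion step is not covered.

```python
from typing import Optional

try:
    import chardet  # third-party library named in the changed text
except ImportError:  # keep the sketch runnable when chardet is not installed
    chardet = None


def decode_latex_bytes(raw: bytes) -> Optional[str]:
    """Hypothetical helper: decode as UTF-8, else try a chardet-guessed encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if chardet is None:
        return None  # assumption: skip files we cannot decode

    guess = chardet.detect(raw)["encoding"]  # may be None if detection fails
    if guess is None:
        return None
    try:
        return raw.decode(guess)
    except (UnicodeDecodeError, LookupError):
        # The guessed codec was wrong or is unknown to Python.
        return None
```

A caller would read each source file in binary mode, pass the bytes to the helper, and drop the file when `None` comes back.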