DuckDB
DuckDB is a database that can read and query Parquet files very quickly. Begin by creating a connection to DuckDB, then install and load the httpfs extension, which lets DuckDB read and write remote files:
import duckdb
url = "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
Now you can write and execute your SQL query on the Parquet file:
con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
┌───────────┬──────────────┬────────────────────┐
│   sign    │ count_star() │  avg_blog_length   │
│  varchar  │    int64     │       double       │
├───────────┼──────────────┼────────────────────┤
│ Cancer    │        38956 │ 1206.5212034089743 │
│ Leo       │        35487 │ 1180.0673767858652 │
│ Aquarius  │        32723 │ 1152.1136815084192 │
│ Virgo     │        36189 │ 1117.1982094006466 │
│ Capricorn │        31825 │  1102.397360565593 │
└───────────┴──────────────┴────────────────────┘
To query multiple files, for example if the dataset is sharded:
urls = ["https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet"]
con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls}) GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
┌──────────┬──────────────┬────────────────────┐
│   sign   │ count_star() │  avg_blog_length   │
│ varchar  │    int64     │       double       │
├──────────┼──────────────┼────────────────────┤
│ Aquarius │        49687 │  1191.417211745527 │
│ Leo      │        53811 │ 1183.8782219248853 │
│ Cancer   │        65048 │ 1158.9691612347804 │
│ Gemini   │        51985 │ 1156.0693084543618 │
│ Virgo    │        60399 │ 1140.9584430205798 │
└──────────┴──────────────┴────────────────────┘
DuckDB-Wasm, a package powered by WebAssembly, is also available for running DuckDB in any browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!
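When a dataset has many shards, the URL list used in the multi-file query can be generated instead of typed out. The following is a minimal sketch, assuming the shards follow the same four-digit, zero-padded naming (0000.parquet, 0001.parquet, and so on) seen in the URLs on this page. The helper names shard_urls and build_query are hypothetical, and the number of shards has to be known in advance:

```python
# Hypothetical helpers (not part of DuckDB or the Hub API): they only
# build strings, following the URL pattern used in the examples above.
BASE = (
    "https://huggingface.co/datasets/{dataset}/resolve/"
    "refs%2Fconvert%2Fparquet/{config}/{split}/{index:04d}.parquet"
)


def shard_urls(dataset, config, split, num_shards):
    """Return the URLs of the first `num_shards` Parquet shards.

    Assumes shards are numbered from 0000 with four-digit zero padding.
    """
    return [
        BASE.format(dataset=dataset, config=config, split=split, index=i)
        for i in range(num_shards)
    ]


def build_query(urls):
    """Build the aggregation query over a list of Parquet URLs."""
    # A Python list of strings formats as ['a', 'b'], which is the
    # list literal form that read_parquet() is given in the example above.
    return (
        "SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length "
        f"FROM read_parquet({urls}) "
        "GROUP BY sign ORDER BY avg_blog_length DESC LIMIT 5"
    )


urls = shard_urls("tasksource/blog_authorship_corpus", "default", "train", 2)
query = build_query(urls)
print(query)
```

The resulting string can then be passed to con.sql(query), using the connection created at the start of this page.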