Many organizations, such as Commoncrawl, WebRecorder, Archive.org, and libraries around the world, use the WARC format to archive and store web data.
The full datasets of these services run to a few pebibytes (PiB), making them impractical to query using non-distributed systems.
This project aims to make subsets of such data easier to access and query using SQL.
Crawl a site with wget and import it into WarcDB:
wget --warc-file changelog "https://changelog.com"
warcdb import archive.warcdb changelog.warc.gz
Then you can query the archive using SQL. For example, this query returns all response headers:
sqlite3 archive.warcdb <<'SQL'
select json_extract(h.value, '$.header') as header,
       json_extract(h.value, '$.value') as value
from response, json_each(http_headers) h;
SQL
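The same query can also be run from Python's built-in sqlite3 module. The sketch below is illustrative rather than part of WarcDB: it builds an in-memory stand-in for the archive with a hypothetical `response` table containing only the `http_headers` column, and it assumes that column holds a JSON array of objects with `header` and `value` keys, which is what the query above implies. It also assumes the SQLite build bundled with Python includes the JSON functions.

```python
import json
import sqlite3

# Stand-in for archive.warcdb: an in-memory database with a hypothetical
# `response` table that has only the column the query reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE response (http_headers TEXT)")

# Assumed layout: a JSON array of {"header": ..., "value": ...} objects.
sample_headers = [
    {"header": "Content-Type", "value": "text/html"},
    {"header": "Server", "value": "nginx"},
]
conn.execute(
    "INSERT INTO response (http_headers) VALUES (?)",
    (json.dumps(sample_headers),),
)

# Same query as the sqlite3 command-line example: json_each expands the
# array into one row per header, json_extract pulls out each field.
QUERY = """
SELECT json_extract(h.value, '$.header') AS header,
       json_extract(h.value, '$.value')  AS value
FROM response, json_each(http_headers) h
"""

rows = conn.execute(QUERY).fetchall()
for header, value in rows:
    print(f"{header}: {value}")

conn.close()
```

To run this against a real archive, connect to "archive.warcdb" instead of ":memory:" and drop the CREATE TABLE and INSERT statements.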