Web crawl data as SQLite databases
Many organizations, such as Common Crawl, Webrecorder, Archive.org, and libraries around the world, use the WARC format to archive and store web data.
The full datasets of these services span several pebibytes (PiB), making them impractical to query with non-distributed systems.
This project aims to make subsets of such data easier to access and query using SQL.
Crawl a site with wget and import it into WarcDB:
wget --warc-file changelog "https://changelog.com"
warcdb import archive.warcdb changelog.warc.gz
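If you have more than one WARC file, the import step can simply be repeated against the same database. Here is a small shell sketch; the *.warc.gz filenames are placeholders, and it assumes each import appends records to the existing archive.warcdb:

# Import every WARC file in the current directory into one database
for f in *.warc.gz; do
    warcdb import archive.warcdb "$f"
done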
Then you can query away using SQL. For example, the following query extracts all response headers:
sqlite3 archive.warcdb <<SQL
select json_extract(h.value, '$.header') as header,
       json_extract(h.value, '$.value') as value
from response,
     json_each(http_headers) h;
SQL
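The same json_each trick works for aggregate questions too. As a rough sketch that reuses only the http_headers JSON column shown above, this query counts the most common response header names in the archive:

sqlite3 archive.warcdb <<SQL
select json_extract(h.value, '$.header') as header,
       count(*) as occurrences
from response,
     json_each(http_headers) h
group by header
order by occurrences desc
limit 10;
SQL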