Web crawl data as SQLite databases

Many organizations, such as Common Crawl, WebRecorder, Archive.org, and libraries around the world, use the WARC format to archive and store web data.

The full datasets of these services run to a few pebibytes (PiB), which makes them impractical to query without a distributed system.

This project aims to make subsets of such data easier to access and query using SQL.

Crawl a site with wget and import it into WarcDB:

wget --warc-file changelog "https://changelog.com"

warcdb import archive.warcdb changelog.warc.gz

You can then query the archive with SQL. For example, this query lists all response headers:

sqlite3 archive.warcdb <<SQL
-- One row per header of every stored response
select  json_extract(h.value, '$.header') as header,
        json_extract(h.value, '$.value') as value
from response,
     json_each(http_headers) h;
SQL
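
Because the archive is an ordinary SQLite file, the query above can also be run from a script. The following is a minimal sketch using Python's standard sqlite3 module. It assumes only what the query itself implies: a response table whose http_headers column holds a JSON array of objects with header and value keys. The helper names (response_headers, make_demo_db) and the sample rows are hypothetical, and the in-memory table is a stand-in rather than the real WarcDB schema.

```python
import json
import sqlite3

# Same query as in the shell example. It assumes a `response` table whose
# `http_headers` column holds a JSON array of {"header": ..., "value": ...}.
HEADERS_SQL = """
select json_extract(h.value, '$.header') as header,
       json_extract(h.value, '$.value')  as value
from response,
     json_each(http_headers) h
"""


def response_headers(conn):
    """Return every (header, value) pair stored across all responses."""
    return conn.execute(HEADERS_SQL).fetchall()


def make_demo_db():
    """Build an in-memory stand-in for archive.warcdb (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table response (http_headers text)")
    headers = [
        {"header": "Content-Type", "value": "text/html"},
        {"header": "Server", "value": "demo"},
    ]
    conn.execute("insert into response values (?)", (json.dumps(headers),))
    return conn


if __name__ == "__main__":
    # For a real archive, use: conn = sqlite3.connect("archive.warcdb")
    conn = make_demo_db()
    for header, value in response_headers(conn):
        print(f"{header}: {value}")
```

To run this against a real crawl, replace the demo connection with sqlite3.connect("archive.warcdb"). The query text stays the same.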