
I’ve been working on a structured inventory of the datasets with a slightly different angle: rather than maximizing scrape coverage, I’m focusing on understanding what’s present vs. what appears to be structurally missing based on filename patterns, numeric continuity, file sizes, and anchor adjacency.
For Dataset 9 specifically, collapsing hundreds of thousands of files down into a small number of high-confidence “missing blocks” has been useful for auditing completeness once large merged sets (like yours) exist. The goal isn’t to assume missing content, but to identify ranges where the structure strongly suggests attachments or exhibits likely existed.
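For anyone wanting to replicate the gap-collapsing step, here's a minimal sketch. The filename pattern and example names are hypothetical placeholders, not the actual Dataset 9 naming scheme; the idea is just to extract the numeric index from each filename and collapse runs of absent numbers into contiguous ranges:

```python
import re

def missing_blocks(filenames, pattern=r"(\d+)"):
    """Collapse numeric gaps in a filename sequence into contiguous
    missing ranges, e.g. indices [1, 2, 5, 6] -> [(3, 4)]."""
    nums = sorted({int(m.group(1)) for f in filenames
                   if (m := re.search(pattern, f))})
    gaps = []
    for prev, cur in zip(nums, nums[1:]):
        if cur - prev > 1:
            # every index strictly between two present neighbors is absent
            gaps.append((prev + 1, cur - 1))
    return gaps

# hypothetical filenames for illustration only
files = ["DOC-0001.pdf", "DOC-0002.pdf", "DOC-0005.pdf", "DOC-0009.pdf"]
print(missing_blocks(files))  # [(3, 4), (6, 8)]
```

This is where the "hundreds of thousands of files down to a small number of blocks" compression comes from: most gaps merge into a handful of large ranges, which are then the candidates for the adjacency and size checks.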
If anyone else here is doing similar inventory or diff work, I’d be interested in comparing methodology and sanity-checking assumptions. No requests for files (yet), just notes on structure and verification.
Just tested whether numeric gaps represent missing files or page-level numbering. In at least one major Dataset 9 block, the adjacent PDF’s page count exactly matches the numeric span, indicating page bundling rather than missing documents. I’m incorporating page counts into the audit model to distinguish the two.
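The check described above can be expressed as a one-liner once page counts are extracted (e.g. with a PDF library). This is a sketch under one assumption about the numbering convention: a PDF at index `prev_num` that bundles the pages numbered `prev_num` through `next_num - 1` should have exactly `next_num - prev_num` pages. The function and variable names are mine, not from the audit model itself:

```python
def gap_explained_by_bundling(prev_num, next_num, prev_page_count):
    """True if the PDF preceding a numeric gap has exactly as many
    pages as the span it would cover under page-level numbering,
    i.e. the gap reflects page bundling, not missing documents."""
    return prev_page_count == next_num - prev_num

# e.g. doc 100 present, next present doc is 110, and doc 100 has 10 pages:
print(gap_explained_by_bundling(100, 110, 10))  # True
print(gap_explained_by_bundling(100, 110, 3))   # False -> gap still suspect
```

A mismatch doesn't prove documents are missing (other numbering conventions exist), but an exact match is strong evidence for bundling, which is why it moves a block out of the "missing" category in the audit.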
Thanks so much for setting that straight.