
I would really like a chat of some kind, Matrix maybe?

Perfect, I’m on --startByte 134211436544 --endByte 144472801279 (9.56 GB)

These would be the three largest gaps from what I have:
--startByte 49981423616 --endByte 60299411455 (9.61 GB)
--startByte 110131937280 --endByte 120424759295 (9.59 GB)
--startByte 134211436544 --endByte 144472801279 (9.56 GB)
The next question is who goes after what part.
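In case it helps split up the work, here is a rough sketch of how gaps like these can be computed from the chunks directory (this assumes the chunk files are named <startByte>-<endByte>.bin, same style as the primed torrent file; adjust the regex, directory name and total size if your setup differs):
# Rough sketch: find the largest un-downloaded gaps in a chunks directory.
# Assumes chunk files are named "<startByte>-<endByte>.bin" with inclusive ranges.
import os
import re

CHUNK_DIR = "DataSet 9.zip.chunks"      # assumed directory name
TOTAL_BYTES = 192613274080              # full file size from the thread

# Collect the byte ranges we already have.
ranges = []
for name in os.listdir(CHUNK_DIR):
    m = re.fullmatch(r"(\d+)-(\d+)\.bin", name)
    if m:
        ranges.append((int(m.group(1)), int(m.group(2))))
ranges.sort()

# Walk the covered ranges and record the holes between them.
gaps = []
pos = 0  # next byte we still need
for start, end in ranges:
    if start > pos:
        gaps.append((pos, start - 1))
    pos = max(pos, end + 1)
if pos < TOTAL_BYTES:
    gaps.append((pos, TOTAL_BYTES - 1))

# Print the three largest gaps as ready-to-use script arguments.
gaps.sort(key=lambda g: g[1] - g[0], reverse=True)
for start, end in gaps[:3]:
    size_gb = (end - start + 1) / 1024**3
    print(f"--startByte {start} --endByte {end} ({size_gb:.2f} GB)")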

OK, updated the script. Added --startByte, --endByte, and --totalFileBytes.
Using --totalFileBytes 192613274080 avoids an HTTP HEAD request at the beginning of the script, making it slightly less brittle.
To grab the last 5 GB of the file you would add the following to your command:
--startByte 187244564960 --endByte 192613274079 --totalFileBytes 192613274080
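For reference, the arithmetic behind those numbers looks roughly like this (a minimal sketch, not the script itself; the HEAD request line is only there to show what passing --totalFileBytes skips):
# Sketch of where --totalFileBytes comes from and how the "last 5 GB" range is derived.
import requests

URL = "https://www.justice.gov/epstein/files/DataSet%209.zip"

# Without --totalFileBytes the script would need something like a HEAD request
# (the brittle part, since it can get blocked or redirected):
# total = int(requests.head(URL, allow_redirects=True).headers["Content-Length"])
total = 192613274080  # passing it explicitly skips that request

# Byte ranges are inclusive, so the last byte is total - 1.
last_gib = 5
start = total - last_gib * 1024**3   # 187244564960
end = total - 1                      # 192613274079
print(f"--startByte {start} --endByte {end} --totalFileBytes {total}")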

Great idea, let me see what I can do!

I don’t know exactly, but it seems to be about an hour or two if you get a 401 Unauthorized.
Would you be interested in joining our effort here? I’m hoping to crowdsource these chunks and then combine our efforts.

Yeah, when I run into this I’ve switched browsers and it’s helped. Switching IP addresses has also helped.

Updated the script to display information better: https://pastebin.com/S4gvw9q1
It has one library dependency, so you’ll have to run:
pip install rich
I haven’t been getting blocked with this:
python script.py 'https://www.justice.gov/epstein/files/DataSet%209.zip' -o 'DataSet 9.zip' --cookies cookie.txt --retries 2 --referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' --ua '<set-this>' --timeout 90 -t 16 -c auto
The new script can auto-set threads and chunks; I updated the main comment with more info about those.
I’m setting the --ua option, which lets you override the User-Agent header. I’m making sure it matches the browser that I use to request the cookie.
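If you want to sanity-check your cookie and user agent combo outside the script, a single ranged request looks roughly like this (just a sketch, not the script’s code; swap in the User-Agent string your browser actually sends):
# Minimal sketch: one ranged request with the same User-Agent, Referer and
# cookies the browser used, to check whether Akamai will serve bytes.
from http.cookiejar import MozillaCookieJar
import requests

URL = "https://www.justice.gov/epstein/files/DataSet%209.zip"
REFERER = "https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip"
UA = "Mozilla/5.0 ..."  # must match the browser that produced the cookie

jar = MozillaCookieJar("cookie.txt")                  # Netscape-format export
jar.load(ignore_discard=True, ignore_expires=True)    # keep session cookies too

sess = requests.Session()
sess.cookies = jar
sess.headers.update({"User-Agent": UA, "Referer": REFERER})

# Ask for the first 1 MiB; a 206 with a Content-Range header means we're good.
resp = sess.get(URL, headers={"Range": "bytes=0-1048575"}, timeout=90)
print(resp.status_code, resp.headers.get("Content-Range"), len(resp.content))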

What happens when you go to https://www.justice.gov/epstein/files/DataSet%209.zip in your browser?

I would be interested in obtaining the chunks that you gathered and stitching them together with what I gathered.

Yeah :/ I hadn’t been able to pull anything in a while, but I was just able to pull 6 chunks, so the data is still out there!

How big is the partial that you managed to get?

I’m working on a different method of obtaining a complete dataset zip for Dataset 9. For those who are unaware, for a time yesterday there was an official zip available from the DOJ. To my knowledge no one was able to fully grab it, but I believe the 49 GB zip is a partial of that file from before downloads got cut. My thought is that this original zip likely contained incriminating information and that’s why it got halted.
What I’ve observed is that Akamai still serves that zip sporadically in small chunks. It’s really strange and I’m not sure why it does, but I have verified with strings that there are PDF file names in the zip data. I’ve been able to use a script to pull small chunks from the CDN across the entire span of the file’s byte range.
Using the 49 GB file as a starting point, I’m working on piecing the file together; however, progress is extremely slow. If there is anyone willing to team up on this and combine the chunks, please let me know.
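By the way, if you want to verify chunks the same way I did with strings (checking for PDF file names in the raw zip bytes), a rough Python equivalent is below; the chunk path is just an example:
# Sketch: scan a downloaded chunk for printable runs containing ".pdf",
# roughly what `strings <chunk>` piped through a .pdf filter would show.
import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "DataSet 9.zip.chunks/0-48995762175.bin"

with open(path, "rb") as f:
    data = f.read(64 * 1024 * 1024)  # first 64 MiB is plenty to spot file names

# Printable ASCII runs of 6+ characters, filtered to ones mentioning .pdf
for run in re.findall(rb"[ -~]{6,}", data):
    if b".pdf" in run.lower():
        print(run.decode("ascii", "replace"))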
How to grab the chunked data:
Script link: https://pastebin.com/sjMBCnzm
For the script you will probably have to:
pip install rich
Grab DATASET 9, INCOMPLETE AT ~48GB:
magnet:?xt=urn:btih:0a3d4b84a77bd982c9c2761f40944402b94f9c64&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
Then name the downloaded file 0-(the last byte the file spans).bin
So, for example, for the 48 GB file it would be: 0-48995762175.bin
Next to the python script make a directory called: DataSet 9.zip.chunks
Move the renamed 48 GB first-byte-range file into that directory.
Make a new file next to the script called cookies.txt
Install the cookie editor browser extension (https://cookie-editor.com/)
With the browser extension open go to: https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip
The download should start in your browser; cancel it.
Export the cookies in Netscape format; they will be copied to your clipboard.
Paste those into your cookies.txt, then save and close it.
You can run the script like so:
python3 script.py \
'https://www.justice.gov/epstein/files/DataSet%209.zip' \
-o 'DataSet 9.zip' \
--cookies cookies.txt --retries 3 \
--backoff 5.0 \
--referer 'https://www.justice.gov/age-verify?destination=%2Fepstein%2Ffiles%2FDataSet+9.zip' \
-t auto -c auto
Script Options:
-t - The number of concurrent threads to use, which results in trying that many byte ranges at the same time. Setting this to auto will auto-calculate based on your CPU, but it caps at 8 to be safe and avoid getting banned by Akamai.
-c - The chunk size to request from the server in MB. This is not always respected by the server and you may get a smaller or larger chunk, but the script should handle that. Setting this to auto scales with the file size, though feel free to try different sizes.
--backoff - The backoff factor between failures; helps prevent Akamai from throttling your requests.
--retries - The number of times to retry a byte range in that iteration before moving on to the next byte range. If it moves on, it will come back to it again on the next loop (there is a small sketch of this retry loop after this list).
--cookies - The path to the file containing your Netscape-formatted cookies.
-o - The final file name. The chunks directory is derived from this, so make sure it matches the name of the chunk directory that you primed with the torrent chunk.
--referer - Just leave this for Akamai; it sets the Referer HTTP header.
There are more options if you run the script with the --help option.
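To make the --retries and --backoff behaviour concrete, the per-range retry loop is conceptually something like this (a simplified sketch, not the actual script code):
# Conceptual sketch of the per-range retry loop: each failed attempt waits
# backoff * attempt seconds, and after `retries` failures the range is skipped
# and picked up again on the next pass over the file.
import time
import requests

def fetch_range(sess, url, start, end, retries=3, backoff=5.0, timeout=90):
    for attempt in range(1, retries + 1):
        try:
            r = sess.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=timeout)
            if r.status_code == 206:
                return r.content          # got the partial content we asked for
        except requests.RequestException:
            pass                           # treat network errors like a failed attempt
        time.sleep(backoff * attempt)      # ease off so Akamai doesn't throttle harder
    return None                            # caller re-queues this range for the next loop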
If you start to receive HTML and/or HTTP 200 responses, then you need to refresh your cookie.
If you start to receive HTTP 400 responses, then you need to refresh your cookie in a different browser; Akamai is very fussy.
A VPN and multiple browsers might be useful for changing your cookie and location combo.
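A rough way to tell the "refresh your cookie" cases apart from real chunk data in your own tooling (just a sketch of the signals described above, taking a requests-style response object):
# Sketch: decide whether a response is real zip data or a sign the cookie is stale.
def looks_like_stale_cookie(resp):
    ctype = resp.headers.get("Content-Type", "")
    # HTML instead of bytes, or a 200 without Content-Range, usually means the
    # age-verify/challenge page came back instead of the ranged zip data.
    if "text/html" in ctype:
        return True
    if resp.status_code == 200 and "Content-Range" not in resp.headers:
        return True
    # 400/401/403 mean the cookie (or browser/IP combo) needs refreshing too.
    return resp.status_code in (400, 401, 403)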
Edit
I tested the script on Dataset 8 and it was able to stitch a valid zip together, so assuming we’re getting valid data from Dataset 9 it should work.

Sounds good, thank you. I’m just thinking we should avoid platforms like Discord and look for something more respectful of privacy.