I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.
Would love to hear how you would design it.
1. `chunk_size := file_size / cpu_cores`. Compile the regex.
2. Spawn `cpu_cores` workers:
   2.a. Worker #n starts at byte `n * chunk_size`. If `n > 0`, it skips bytes until it encounters a newline.
   2.b. The worker feeds bytes from its chunk of the file into the regex. When a match is found, it writes it to the output (`stdout` or a file, whichever performs better). When a newline is encountered, it resets the regex automaton state.
   2.c. After having read `chunk_size` bytes, the worker continues until it encounters a newline, so that the parallel search covers the whole file.
Optionally, keep track of the byte offset and attach it to each match when outputting, to make it easier to de-duplicate matches later and/or navigate to a match in the file.
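
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names `chunk_ranges`, `scan_chunk`, `parallel_grep` and the `executor_cls` parameter are made up for this example, and instead of feeding bytes into a regex automaton and resetting it at each newline, it reads whole lines and runs Python's `re` on each one, which should have the same effect for line-delimited data. Each match carries its byte offset, as suggested above.

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor


def chunk_ranges(file_size, workers):
    """Step 1: one (n, start, end) byte range per worker."""
    chunk_size = file_size // workers
    ranges = []
    for n in range(workers):
        start = n * chunk_size
        # The last worker absorbs the remainder left by integer division.
        end = file_size if n == workers - 1 else start + chunk_size
        ranges.append((n, start, end))
    return ranges


def scan_chunk(path, pattern, n, start, end):
    """Step 2: return (byte_offset, line) for each matching line owned by worker n."""
    regex = re.compile(pattern)  # bytes pattern, compiled once per worker
    matches = []
    with open(path, "rb") as f:
        f.seek(start)
        if n > 0:
            f.readline()  # 2.a: skip ahead to just past the first newline
        while True:
            offset = f.tell()
            # 2.c: a line starting at or before `end` still belongs to this
            # worker, even if it runs past the chunk boundary.
            if offset > end:
                break
            line = f.readline()
            if not line:
                break  # end of file
            if regex.search(line):  # 2.b: match one line at a time
                matches.append((offset, line.rstrip(b"\r\n")))
    return matches


def parallel_grep(path, pattern, workers=None, executor_cls=ProcessPoolExecutor):
    workers = workers or os.cpu_count() or 1
    ranges = chunk_ranges(os.path.getsize(path), workers)
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(scan_chunk, path, pattern, *r) for r in ranges]
        # Collect in worker order, so results come back sorted by byte offset.
        return [m for fut in futures for m in fut.result()]
```

A call might look like `parallel_grep("events.jsonl", rb'"level": "error"')`, where the file name and pattern are placeholders. With a process pool, the call should sit under `if __name__ == "__main__":` on platforms that start workers by spawning a fresh interpreter. Because worker n skips its first partial line and worker n-1 reads past its boundary to the next newline, every line should be scanned exactly once.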
To avoid interleaved or corrupted output from concurrent writes, have each worker write to its own file, and combine these output files only once all workers have finished.
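
A minimal sketch of that output scheme, assuming each worker ends up with a list of `(byte_offset, line)` matches. The helper names `part_path`, `write_part` and `combine_parts`, the `part-NNNNN.out` naming, and the `offset:line` output format are all hypothetical choices for this example.

```python
import os
import shutil


def part_path(out_dir, n):
    """Hypothetical naming scheme: one output file per worker index."""
    return os.path.join(out_dir, f"part-{n:05d}.out")


def write_part(out_dir, n, matches):
    """Run inside worker n: write its matches to its own file only."""
    with open(part_path(out_dir, n), "wb") as out:
        for offset, line in matches:
            out.write(b"%d:%s\n" % (offset, line))


def combine_parts(out_dir, workers, dest):
    """Run once, after every worker has finished: concatenate in worker order."""
    with open(dest, "wb") as merged:
        for n in range(workers):
            with open(part_path(out_dir, n), "rb") as part:
                shutil.copyfileobj(part, merged)
            os.remove(part_path(out_dir, n))  # clean up the per-worker file
```

Since worker n only covers bytes that come before worker n+1's chunk, concatenating the parts in worker order should keep the merged output sorted by byte offset without a separate sort.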
As others have said, it's going to be hard to get much more speedup than this: if the whole file cannot fit into memory, you will ultimately be limited by your storage's read throughput.