I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.
Would love to hear how you would design it.
It depends on the programming language and the built-in methods you use. I described it in a more fundamental way, assuming the OS will use at least one thread to read the file. From my point of view, that is the main thread: it runs a loop that reads the file line by line and hands ready chunks to other threads for processing. As I mentioned in another comment here, we can also simplify the pipeline: first read the whole file into RAM and split it into pieces, and only then process those pieces in parallel. I agree the second approach is more convenient and easier to implement.
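A minimal sketch of the first pipeline in Python, assuming a substring "grep" over JSONL records (the chunk size, `needle` parameter, and use of `ThreadPoolExecutor` are my choices, not anything prescribed by the thread):

```python
import json
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # lines per chunk; tune for your workload


def scan_chunk(lines, needle):
    """Worker: return parsed records whose raw line contains the needle."""
    hits = []
    for line in lines:
        if needle in line:  # cheap grep-style substring pre-filter
            hits.append(json.loads(line))
    return hits


def parallel_scan(path, needle, workers=4):
    """Main thread reads line by line and hands ready chunks to workers."""
    results = []
    with open(path) as f, ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) >= CHUNK_SIZE:
                futures.append(pool.submit(scan_chunk, chunk, needle))
                chunk = []
        if chunk:  # flush the last partial chunk
            futures.append(pool.submit(scan_chunk, chunk, needle))
        for fut in futures:
            results.extend(fut.result())
    return results
```

The second approach collapses the reading loop into a single `f.readlines()` followed by slicing the list into chunks, which is indeed simpler. Note that in CPython, threads only help here if the per-line work releases the GIL; for CPU-heavy parsing you would swap in `ProcessPoolExecutor` with the same interface.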