I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.

Would love to hear how you would design it.

  • Bazell@lemmy.zip · 2 hours ago

    It depends on the programming language and the built-in methods you use. I described in a more fundamental way how it might work, assuming the OS itself will use at least one thread to read the file. From my point of view, that becomes the main thread: it runs a loop that reads the file line by line and hands completed chunks to other threads for processing. As I described in another comment somewhere here, we can also simplify this pipeline: first read the file into RAM, splitting it into pieces, and only then process the pieces in parallel. I agree that the second approach is the more convenient one and easier to implement.
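
    A minimal sketch of the first pipeline in Python, assuming the goal is a grep-style substring match over JSONL. The chunk size, the `scan`/`match_chunk` names, and the thread-pool sizing are all my own illustrative choices, not anything from the thread; the main thread reads line by line and hands chunks off, as described above.

    ```python
    import json
    from concurrent.futures import ThreadPoolExecutor

    CHUNK_LINES = 10_000  # hypothetical chunk size; tune for your workload

    def match_chunk(lines, needle):
        # Worker: cheap substring pre-filter, then parse only the hits.
        hits = []
        for line in lines:
            if needle in line:
                try:
                    hits.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # skip malformed lines
        return hits

    def scan(path, needle, workers=4):
        # Main thread: read the file line by line, submit ready chunks
        # to a pool of worker threads, then collect the results.
        results = []
        with open(path, encoding="utf-8") as f, \
                ThreadPoolExecutor(max_workers=workers) as pool:
            futures, chunk = [], []
            for line in f:
                chunk.append(line)
                if len(chunk) >= CHUNK_LINES:
                    futures.append(pool.submit(match_chunk, chunk, needle))
                    chunk = []
            if chunk:  # flush the final partial chunk
                futures.append(pool.submit(match_chunk, chunk, needle))
            for fut in futures:
                results.extend(fut.result())
        return results
    ```

    The second approach from the comment is even simpler: replace the reading loop with `f.readlines()`, slice the list into pieces, and submit each piece. Note that in CPython the GIL limits the benefit of threads for CPU-bound JSON parsing, so a process pool may scan faster at the cost of pickling the chunks.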