Context: my father is a lawyer and therefore has a bajillion PDF files that were digitised and stored on a server. I've got an idea of how to run OCR on all of them.

But after that, how can I make them easily searchable? (Keep in mind that, unfortunately, the directory structure is important information for classifying the files, i.e. you may have a path like clientABC/caseAV1/d.pdf.)

  • greyfox@lemmy.world
    2 days ago

    If you want the search to be flexible, for example with stemming (so that a query also matches plural and other inflected forms of a word), you might want to put the text into Elasticsearch.
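    As a rough sketch of what that could look like: Elasticsearch ships a built-in `english` analyzer that applies stemming, and you select it per field in the index mapping. The field names below (`path`, `client`, `case`, `text`) are my own assumptions, not anything Elasticsearch requires; the dictionary is what you would send as the body of the index-creation request.

```python
import json

# Hypothetical index layout. The field names are illustrative assumptions.
# "keyword" fields are matched exactly (good for client/case filters);
# the "text" field is analyzed with the built-in "english" analyzer,
# which stems words so "contracts" and "contract" match each other.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "path": {"type": "keyword"},
            "client": {"type": "keyword"},
            "case": {"type": "keyword"},
            "text": {"type": "text", "analyzer": "english"},
        }
    }
}


def index_body_json() -> str:
    """Serialize the mapping so it can be sent when creating the index."""
    return json.dumps(INDEX_BODY, indent=2)


if __name__ == "__main__":
    print(index_body_json())
```

    If the documents are not in English, you would swap in the matching language analyzer instead.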

    You might run into problems with the field length if these are long documents. A possible solution would be to put each page into its own field inside the document.
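    A minimal sketch of the page-per-field idea, assuming the OCR step gives you a list of strings (one per page) and that paths look like client/case/file. The function name `build_document` and the field names (`client`, `case`, `page_0001`, ...) are made up for illustration. It also keeps the directory structure as separate fields so you can filter by client or case.

```python
from pathlib import PurePosixPath


def build_document(relative_path: str, pages: list[str]) -> dict:
    """Turn one OCR'd PDF into a dict ready to be indexed as one document.

    relative_path: path relative to the archive root, e.g. "clientABC/caseAV1/d.pdf"
    pages: OCR text, one string per page (assumed output of the OCR step)
    """
    parts = PurePosixPath(relative_path).parts
    if not parts:
        raise ValueError("relative_path must not be empty")

    doc = {
        "path": relative_path,
        # Assumed layout: first directory is the client, second is the case.
        "client": parts[0] if len(parts) > 1 else None,
        "case": parts[1] if len(parts) > 2 else None,
        "filename": parts[-1],
        "page_count": len(pages),
    }
    # One field per page, zero-padded so the field names sort naturally.
    for number, text in enumerate(pages, start=1):
        doc[f"page_{number:04d}"] = text
    return doc


if __name__ == "__main__":
    example = build_document("clientABC/caseAV1/d.pdf", ["first page", "second page"])
    for key, value in example.items():
        print(f"{key}: {value}")
```

    Keep in mind that an index limits the total number of fields (1000 by default), so for very long PDFs it may be simpler to index each page as its own small document that carries the same path, client and case values.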

    If this is for a non-technical user to search, the Kibana interface should be relatively easy for anyone to use.