cross-posted from: https://thelemmy.club/post/17993801
First of all, let me explain what “hapax legomena” are: words (and, by extension, concepts) that occur just once throughout an entire corpus of text. An example is the word “hebenon”, which occurs just once in Shakespeare’s Hamlet. Therefore, “hebenon” is a hapax legomenon. The “hapax legomenon” concept itself is a kind of hapax legomenon, IMO.
According to Wikipedia, hapax legomena are generally discarded in NLP as they hold “little value for computational techniques”. By extension, the same applies to LLMs, I guess.
While “hapax legomena” originally refers to words/tokens, I’m extending it to the entire concepts described by these extremely obscure words.
I am a curious mind, actively seeking knowledge, and I’m constantly trying to learn a myriad of “random” topics across the many fields of human knowledge, especially rare/unknown concepts (that’s how I learnt about “hapax legomena”, for example). I use three LLMs on a daily basis (GPT-3, Llama and Gemini), expecting to get to know words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge, but I now know, according to Wikipedia, that general LLMs won’t point me to anything “obscure” enough.
This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax? Are there LLMs that favor less frequent data over more frequent data?
They cut everything but the middle part of the dataset: the most frequent words, as you said, as well as words that occurred just once across the entire dataset. At least that’s what Wikipedia states on Hapax legomenon:
I thought of LLMs because they’re trained on really, really big and vast datasets, datasets that we normally can’t really access, let alone process on our personal computers (mine is a Linux laptop with 12GB of RAM; it’s a good Core i5 machine, but not enough for really big datasets). I mean, there are lots of downloadable datasets on platforms such as Kaggle and Hugging Face, as well as internet archives of plain-text articles, books, BBS and so on, but I guess they’re just a tiny fraction of the datasets used to train OpenAI’s GPT, Meta’s Llama and Google’s Gemini. And I have a “gut feeling” that somewhere, somehow, those least-mentioned things (words, entire concepts, places, mythological figures and ancient deities, forgotten philosophical nomenclatures and so on) are lurking, waiting to be excavated from the depths of these vast datasets.
Maybe the ideal scenario would be having entire datasets and applying parsers and tokenizers to all of them (as the original comment suggested, parsers such as PEG or FLEX), then cutting out the slice of words/tokens that appeared just once across all of them. For this to work properly, several datasets are really needed. For example: I tried it with two versions of the Bible (because it’s a long book readily available throughout the Web and ready to be parsed; I used both a JSON containing KJV verses and a JSON containing BBE verses) and I got around 3200 unique occurrences using the “Poor man’s technique” I described in the other comment (Node.js + Regex + JS dictionary object to count occurrences, not the best of approaches). If I were to add more English versions/translations, maybe this would converge to more specific unique words.
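The counting approach described above (a regex tokenizer plus a dictionary of occurrence counts) could be sketched roughly like this in Node.js. This is only an illustration, not the actual script: the function names (tokenize, countWords, hapaxLegomena), the tokenizer regex and the two-sentence toy corpus are all assumptions made for the example, and real input would be the verse strings read from the JSON files.

```javascript
// Hypothetical helper: lowercase the text and split it into words.
// The regex is an assumption: runs of letters, allowing internal apostrophes.
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+(?:['’]\p{L}+)*/gu) ?? [];
}

// Count occurrences of every word across all texts.
// A Map is used instead of a plain object so that words such as
// "constructor" do not collide with inherited object properties.
function countWords(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const word of tokenize(text)) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return counts;
}

// Keep only the words that were seen exactly once in the combined corpus.
function hapaxLegomena(texts) {
  const counts = countWords(texts);
  return [...counts].filter(([, n]) => n === 1).map(([word]) => word);
}

// Toy corpus (assumption) standing in for verses from two translations.
const corpus = [
  "In the beginning God created the heaven and the earth.",
  "At the first God made the heaven and the earth.",
];

console.log(hapaxLegomena(corpus));
```

For this toy corpus the result is in, beginning, created, at, first and made: words shared by both sentences (the, god, heaven, and, earth) drop out, which is the same effect as adding more translations to the real run. Counting over the combined corpus means a word only survives if it is unique across every text, not just within one of them.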