
We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user’s Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
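
The headline figure in the abstract ("68% recall at 90% precision") can be made concrete with a small worked example. The sketch below is not the paper's evaluation code; the function name `recall_at_precision`, its arguments, and the toy scores are all assumptions made for illustration. It ranks proposed matches by confidence and reports the highest recall reachable at any cutoff where precision still meets the target.

```python
from typing import List, Tuple


def recall_at_precision(
    scored: List[Tuple[float, bool]],  # (confidence, is_correct) per proposed match (hypothetical format)
    total_true: int,                   # number of users that have a true match at all
    target: float = 0.90,              # required precision
) -> float:
    """Best recall over all confidence cutoffs whose precision is >= target."""
    if total_true <= 0:
        raise ValueError("total_true must be positive")
    # Consider the most confident proposals first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    correct = 0
    for accepted, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / accepted
        if precision >= target:
            best_recall = max(best_recall, correct / total_true)
    return best_recall


# Toy data: six proposed matches, five users with a true match.
toy = [(0.99, True), (0.95, True), (0.90, True),
       (0.80, False), (0.70, True), (0.60, False)]
print(recall_at_precision(toy, total_true=5))
```

In this toy data only the top three proposals can be accepted before a wrong one drags precision below 90%, so the attacker abstains on the rest. A high-precision attack in this sense is one that stays silent unless it is fairly sure.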

  • Kissaki@feddit.org
    English
    arrow-up
    1
    ·
    4 hours ago

    Germans with a website: well, it’s in clear text in the Impressum already, required by law

  • 🌞 Alexander Daychilde 🌞@lemmy.world
    English
    arrow-up
    5
    ·
    18 hours ago

    Too late for me, I’ve been Daychilde since 1996, didn’t keep it separate from my real name, and I’m on wikipedia, so it’s trivial to find me. lol.

    The good is that I can report that it’s pretty safe to have an open identity. So far. heh

    • Goodman@discuss.tchncs.de
      English
      arrow-up
      1
      ·
      14 minutes ago

      I read one of your blog posts about empathy this morning. It resonated with me and my recent views on the world.

      • 🌞 Alexander Daychilde 🌞@lemmy.world
        English
        arrow-up
        4
        ·
        10 hours ago

        haha, oh man. That actually reminds me. I know I mentioned the wiki thing - this is me: https://en.wikipedia.org/wiki/Beck_v._Eiland-Hall

        Basically, back in 2009, I created glennbeckrapedandmurderedayounggirlin1990.com. It was largely in response to Glenn Beck’s stupid technique of interviewing people - like to our first sitting Muslim member of Congress: “Now, I wouldn’t say this, but some people are asking: Are you working for our enemies?” - to an elected member of Congress!

        Of course, this was back in the Muslim-scare days after 9/11 still in 2009… and now we definitely have people in Congress working for our enemies.

        But anyway. So the parody site.

        My wife found a forum where some idiots were trying to track me down. I mean, my real name and address were out there, but they were looking for more information about me and the site. They were talking about what organizations must be funding this attack on their beloved Beck.

        There was controversy at the time because an organization called ACORN was trying to get people to register to vote and supposedly signing up on behalf of people. IIRC the allegations were either bullshit or it wasn’t a big deal or maybe it was and it was dealt with. All I remember for sure is that I thought it would be hilarious to offer these chucklefucks “evidence” for their conspiracies.

        So I went out and copied the raw HTML from a 404 page on the ACORN website and made that the custom 404 page for my site. And then, to help these idiots “find” it, I made a “mistake” - I announced something on the main page and linked to a page that supposedly had the full story, only I intentionally put a typo in the link so the 404 page would come up. lol.

        Oh, man, they went N U T S over in the forum “HOLY SHIT ITS ACORN BEHIND THIS” lolololol…

        But anyway, your gif absolutely reminded me of those morons. That’s how I envisioned their “hacking” of me. lol

  • Silver Needle@lemmy.ca
    English
    arrow-up
    11
    arrow-down
    1
    ·
    24 hours ago

    I call BS. We’ll see false positives go through the roof. Just another tool to arbitrarily harass opponents.

  • Goodman@discuss.tchncs.de
    English
    arrow-up
    12
    ·
    1 day ago

    Every day the internet gets a little worse. I hate it here in this technological hellscape. I have more to say, but this bullshit makes me so so tired. Goodnight.

    • Silver Needle@lemmy.ca
      English
      arrow-up
      3
      arrow-down
      1
      ·
      24 hours ago

      Don’t hate the technology. It’s great. Just how people organize themselves around technology is not up to date. Markets are not meant to coexist with an extremely fast global communication network that everyone can access, why do you think economies restrict internet access?

      Let the internet as a social activity die. It’s got to in order to be reborn haha

      • Goodman@discuss.tchncs.de
        English
        arrow-up
        1
        ·
        14 hours ago

        The internet can mostly die as far as I’m concerned. Just roll it back to file servers again, or something like gemspace. But being able to talk freely with people across cultures and borders is really important. It’s a tragedy that all these people will be hurt by the dystopification of the web. The new web needs a way to converse socially that is safe and easy enough for lay people to use. I have so much more to say on this, but real life is calling so I’ll leave it at this.

        I don’t really get your point about markets though. I’m genuinely trying to understand, so bear with me. This is what I got from your post:

        Our market has coexisted with an extremely fast global communication network for decades now. Given that the market feels like a quite organic thing, on what authority is the market not meant to coexist with the internet?

        I think that internet access is restricted because of technological constraints, a technological lag in rolling out higher speed infrastructure, and a lack of demand for that access which is driven by technological and practical constraints. Some complex function of those factors haha. Still, I don’t really know what you are trying to get across.

        • Silver Needle@lemmy.ca
          English
          arrow-up
          2
          ·
          13 hours ago

          Our market has coexisted with an extremely fast global communication network for decades now. Given that the market feels like a quite organic thing, on what authority is the market not meant to coexist with the internet?

          I’ll try to explain my thought.

          The condition for markets to exist as self-reproducing and self-stabilizing objects is government, usu. in the form of a state-entity, which itself is an economic actor that exists in competition with other states and in cooperation within free trade zones. Important note: government forms from market activity, specifically from the control of estates. Taxation is a form of rent, for example. I am not putting the state before the market.

          There is an interest for governments to:

          1. Maximize economic output

          2. To do so through cleverly tricking other economic actors outside of their own taxation system, e.g. via trade agreements with built-in asymmetries.

          3. And to minimize damage to domestic production. Outsourcing can lead to cornerstones of the economy eroding.

          Throw in the internet. We can now communicate and exchange with actors that are not in the same tax system. First and foremost this leads to issues with intellectual property. I’d cite geolocked internet radio stations and piracy. Japan doesn’t care about its citizens pirating manhwas, and vice-versa, Korea doesn’t care about anime piracy, and so on and so on. Then there is trade of physical objects. Say you need a laptop battery for your Linuxed MacBook M1 and a Chinese seller has batteries in stock that are cheaper and better than Apple’s own (happens rather frequently), with taxation at the border factored in you are still getting the most optimal deal. Some might find ways of circumventing customs which sweetens the pot further. Obviously there are issues to the domestic economy that can arise from this.

          Trade speeds up and global supply chains gain importance as cross border communication speeds up. At the level of national governments there is a distinct threat presenting itself. There is less control over market activity leading to a speedup of the self-polluting nature of trade, in other words the boom and bust cycle shortens. As a national government you’d want to lengthen the boom and bust cycle as crises are the natural killer of states, along with expansionist nations.

          Everything you are seeing, from Chat Control to China’s firewall are attempts to stabilize economies. The internet enables one to build structures that are wholly outside of state control. The state fails to direct the economy as planning starts happening between turfs. The internet due to its nation-decentralized function can aid in forming structures that oppose the state, should it falter.

          Let’s not forget one of the biggest threats to the economy that is open source. Patents and DRM are threatened by the unstoppable pace of Blender, Open Office and co… It’s as if people said YOLO, let’s stop exchanging goods and services and at the same time solve very real and pressing issues, some of the biggest problems in fact. It works with much less friction than anything before, it exists as this hobbyist thing that we cannot call economical in any sense of the current understanding of the word and it would not exist if it wasn’t for the internet.

          I think that internet access is restricted because of technological constraints, a technological lag in rolling out higher speed infrastructure, and a lack of demand for that access which is driven by technological and practical constraints. Some complex function of those factors haha. Still, I don’t really know what you are trying to get across.

          India and China have smartphone ownership rates of over 85%. There are no significant technological constraints if you are not someone who needs exorbitant download upload speed and low latency. The Chinese have pretty decent internet speeds, faster than most European countries. I also do not at all believe that there is a lack of demand for practical access. The internet is most generally a sensible thing to have access to no matter who you are.

          • Goodman@discuss.tchncs.de
            English
            arrow-up
            1
            ·
            22 minutes ago

            Thanks for explaining your thoughts. So to paraphrase you: you are saying that the market and, by proxy, nations too are still adapting to the concept of the internet. One way to cope with the effects is to restrict access?

    • thinkercharmercoderfarmer@slrpnk.net
      English
      arrow-up
      4
      arrow-down
      1
      ·
      edit-2
      19 hours ago

      Why not? If LLMs are good at predicting mean outcomes for the next symbol in a string, and humans have idiosyncrasies that deviate from that mean in a predictable way, I don’t see why you couldn’t detect and correlate certain language features that map to a specific user. You could use things like word choice, punctuation, slang, common misspellings, sentence structure… For example, I started with a contradicting question, I used “idiosyncrasies”, I wrote “LLMs” without an apostrophe, “language features” is a term of art, as is “map” as a verb, etc. None of these are indicative on their own, but unless people are taking exceptional care to either hyper-normalize their style or explicitly spike their language with confounding elements, I don’t see why an LLM wouldn’t be useful for this kind of espionage.

      I wonder if this will have a homogenizing effect on the anonymous web. It might become an accepted practice to communicate in a highly formalized style to make this kind of style fingerprinting harder.

      • thedeadwalking4242@lemmy.world
        English
        arrow-up
        1
        ·
        12 hours ago

        It’s a language model, not a classification model. People have already tried a similar experiment to have LLMs detect whether an LLM wrote a text or not, and they couldn’t.

        • thinkercharmercoderfarmer@slrpnk.net
          English
          arrow-up
          2
          ·
          6 hours ago

          This is in some ways an easier problem than classifying LLM vs non-LLM authorship. That only has two possible outcomes, and it’s pretty noisy because LLMs are trained to emulate the average human. Here, you can generate an agreement score based on language features per comment, and cluster the comments by how they disagree with the model. Comments that disagree in particular ways (never uses semicolons, claims to live in Canada, calls interlocutors “buddy”, writes run-on sentences, etc.) would be clustered together more tightly. The more comments two profiles have in the same cluster(s), the more confident the match becomes. I’m not saying this attack is novel or couldn’t be accomplished without an LLM, but it seems like a good fit for what LLMs actually do.
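
The style-matching idea discussed in this sub-thread can be illustrated with a far cruder technique than the LLM-based scoring the commenter describes. The sketch below swaps in a classical stylometry baseline: character trigram counts compared by cosine similarity. All function names (`char_ngrams`, `profile`, `cosine`) and the sample comments are made up for illustration, and nothing here reflects what the paper implements.

```python
import math
from collections import Counter
from typing import Iterable


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def profile(comments: Iterable[str], n: int = 3) -> Counter:
    """Aggregate the n-gram counts of all comments written under one account."""
    total: Counter = Counter()
    for comment in comments:
        total.update(char_ngrams(comment, n))
    return total


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical accounts: compare one unknown profile against two candidates.
unknown = profile(["Listen buddy, I never use semicolons and I live in Canada eh"])
candidate_a = profile(["Sure thing buddy, up here in Canada we just keep typing and typing"])
candidate_b = profile(["Per my previous message; kindly review the attached document."])
print(cosine(unknown, candidate_a), cosine(unknown, candidate_b))
```

A score like this only ranks candidates. As the commenters note, no single feature is indicative on its own, and a real attack would still need a verification step to keep false positives down.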

  • doug@lemmy.today
    English
    arrow-up
    97
    ·
    edit-2
    2 days ago

    I think it was a Reddit scraper years ago that taught me that I should probably lie more often on the internet about my work, friends, family details, etc.

    Just like, little lies that don’t really matter in the comment, but would misdirect an AI or investigator into things that aren’t true.

    It’s just so much woooooork to think about this shit. And to come up with different screen names everywhere? And to like, sub to a city I don’t live in and comment there about shit I know nothing about? Exhausting.

    Thankfully my brothers and three uncles are here to support me. And my alligator.

    • Anarki_@lemmy.blahaj.zone
      English
      arrow-up
      4
      ·
      22 hours ago

      Oh hey my dearest friend. Say, did you end up moving to Perth or was that just a thought out loud? Well if you’re ever in the area let me know and we can meet up at that restaurant we enjoyed so much!

      xoxo

    • stickly@lemmy.world
      English
      arrow-up
      4
      ·
      edit-2
      1 day ago

      The solution is simple, just launder each comment through an LLM to fudge the style and details a bit

      Edit, tried it for fun:

      lowkey just run every comment through an llm and let it switch up the words and details a bit so it dosnt sound like you wrote it

    • Insekticus@aussie.zone
      English
      arrow-up
      6
      arrow-down
      1
      ·
      2 days ago

      Yeah exactly, like if you’re 25, say you’re 27. Then in another post 24. You’re still around that age, but the exact age is muddied.

      You can also use Americanized spelling in some sentences and or if you’re American, use British English, and become Unamericanised. Say you’re a half-Brit half-American dual citizen even though you’re from South Africa or something.

      • MountingSuspicion@reddthat.com
        English
        arrow-up
        3
        ·
        2 days ago

        I feel like that may be worse. Kind of like how if you have certain security measures while browsing the web it’s almost easier to fingerprint you. It’ll get a good idea of your age and that’ll be enough rather than sticking to a specific lie. Just always be 3 years older with one additional sibling or a sibling of the opposite sex. If the sex of your sibling is relevant just describe them as a close family friend or close cousin in that instance. I can’t say for sure, but if I had to guess having a static lie is maybe more obfuscation than a variable one. Though even posting on this thread is bad opsec.

  • Bruncvik@lemmy.world
    English
    arrow-up
    5
    arrow-down
    1
    ·
    2 days ago

    I have an account where I only post after I have translated my writing through three different languages and back to English. The original input and the output convey the same message, but have very distinct styles. Randomizing the three languages in my translation sequence introduces enough variety that I doubt current LLMs can identify me. (Full disclosure: I don’t post any sensitive information under any account; I do it just for fun.)

  • XLE@piefed.social
    English
    arrow-up
    17
    ·
    2 days ago

    The doxxing efforts will be funded by venture capital.

    What can LLM providers do? Refusal guardrails and usage monitoring can help, but both have significant limitations. Our deanonymization framework splits an attack into seemingly benign tasks – summarizing profiles, computing embeddings, ranking candidates – that individually look like normal usage, making misuse hard to detect. Refusals can be bypassed through task decomposition.

    “Guardrails” are a joke and we all know Sam Altman and Elon Musk care about ethics as much as they care about not abusing their siblings or employees.

  • CerebralHawks@lemmy.dbzer0.com
    English
    arrow-up
    10
    ·
    2 days ago

    It is absolutely possible to identify users who post a lot on a public forum with a real name (e.g. Facebook or the like) as well as Reddit. So say you have some politician who claims to have X, Y, Z values and a Reddit user who has A, B, and C values that are antonymous to X, Y, and Z. By comparing common phrases, as well as by charting when the two seemingly separate users are online, you could say with reasonable certainty that the two people are one and the same, especially if you prompt them carefully to say the kinds of things they would say about neutral topics on both accounts. It would be hard to get 100% certainty, but you’d be close enough to imply it’s them.

    AIs (LLMs) just make it faster.

    Don’t post about controversial politics if you also post under your real name. It’s not a matter of “mask yourself better.” There will always be tells.

    • LwL@lemmy.world
      English
      arrow-up
      1
      ·
      1 day ago

      I’ve always acted assuming this to be possible, but it used to require either an unhinged individual or some other reason for a very dedicated investigation. The barrier being potentially that much lower is scary, particularly for anyone with a bit of internet fame that would rather stay anonymous

  • Supervisor194@lemmy.world
    English
    arrow-up
    4
    ·
    2 days ago

    I’ve never once posted on the Internet using a real name. I’ve never been a member of any social anything other than Reddit and Lemmy. I only even found Reddit because an IRC link aggregator I used to browse for news/memes went tits up.

  • Iconoclast@feddit.uk
    English
    arrow-up
    6
    ·
    2 days ago

    For the past 10 years or so I’ve pretty much lived under the assumption that at some point someone figures out a system that digs through the entire internet and everything anyone has ever posted gets linked back to them.

    At the same time, it’s both great and absolutely horrifying.

    What’s horrifying is that everything you’ve ever posted gets linked back to you.

    What’s great is that none of it can really be used against you anymore - because we now know that absolutely everyone is a massive hypocrite and nobody is without sin.

    • Silver Needle@lemmy.ca
      English
      arrow-up
      1
      ·
      edit-2
      23 hours ago

      That’ll never work. The internet is messy like a jungle, I might find bird crap somewhere but it will not get me the bird. I might find a turned leaf, but what turned the leaf will never be known to me. All despite me being able to reason and investigate phenomena that occur.

      I view all things like particle systems: There are general trends, sometimes we can observe how single particles travel and we can derive rules from their behavior. Yet we are never able to see everything at full resolution, let alone know everyone in the way the “evil” “AI” thought experiments portray all-knowing bots. What people say about Palantir is very similar: it falls into the category of we-don’t-know-the-rest-of-it.

      No use going paranoid over preliminary results from a tool we readily use but don’t fully comprehend the limitations of (in the meaning of: we don’t know how shitty and unreliable they are in actuality).

    • Scrollone@feddit.it
      English
      arrow-up
      2
      ·
      2 days ago

      I mean, there’s even a website (don’t remember the name) that lets you upload a photo of a person and it will show all pictures of that person that are on the web.

      Like a Google search but for your face. Super creepy.

    • Jrockwar@feddit.uk
      English
      arrow-up
      3
      ·
      2 days ago

      Some really good advice that someone gave me once is that the internet doesn’t exist.

      Sure, it obviously does exist, but this was about communication style. When you send an email, you change codes and don’t write in the same way as a WhatsApp - you can expand your points more… But you should never forget you’re talking to a person - just because it’s internet, you shouldn’t talk any different to them.

      You shouldn’t assume that the message is anonymous just because it’s internet. You shouldn’t assume certain things are okay “just because it’s internet”.

      I don’t think they were 100% right because they were disregarding that code changing between different mediums and audiences is normal (you don’t talk the same way to your boss and your partner, or in written form vs spoken), but I do stand by the point that you shouldn’t change code or make assumptions just because “internet”.

      • krashmo@lemmy.world
        English
        arrow-up
        1
        ·
        2 days ago

        Seems like we could all just mellow out a bit. You shouldn’t need to be afraid of saying stuff that isn’t perfectly pc now or in the past. Obviously there’s a difference between an off color joke and shit you would find in the Epstein files but I’m not particularly concerned about anything I’ve posted coming back to me. I’ve had bad takes (I’m sure I still do) and said things in the past that I no longer agree with, but who cares? That’s what life is like. You change over time in more ways than one. If someone wants to judge me harshly for that then we probably weren’t going to hang out anyway so fuck em. Let them react how they want.

        That being said, the implications of this kind of technology being used by corporations or the government are quite different. There may be value in what you’re saying from that perspective.

  • MalReynolds@slrpnk.net
    English
    arrow-up
    4
    ·
    2 days ago

    So, pretty much what Meta/Facebook (and the three letter agencies / GovInt) have been doing with deterministic code (like they’re not scraping Reddit et al., including Lemmy) for ages, but probabilistic, with more errors and new, improved hallucination.

    Competition, filling in gaps or just looking to be bought out. Evil.