Hi all,

First off, I want to apologize for all the server instability. We long ago outgrew our instance size, but I was unable to afford a larger node on our provider, Vultr. We were maxing out every part of the server whenever even a modest number of users were active on the fediverse.

I’ve finally found the time to migrate us to a new provider, which allows us to step up to a much more powerful configuration. That migration has now been completed. I actually intended to post about the downtime on this community this morning before beginning, but when I went to do so, the server was already down and struggling to come back up. So I went ahead with the migration.

Server before: 4 CPU / 16 GB RAM / 400 GB NVMe
Server after: 8 CPU / 64 GB RAM / 1 TB NVMe

Please update this thread if you are seeing issues with any part of the site: duplicate threads, things that aren’t federating, inability to load profiles, etc.

There is still database tuning that needs to occur, so you should expect some downtime here and there, but otherwise the instance should be much more stable from now on.

During this process I also improved several other aspects of operating the server. Any ‘actual’ downtime should now be accompanied by a proper maintenance page (one that hopefully doesn’t get wiped by Ansible anymore), which will also be a good indicator of legitimate maintenance.

Once again, I really apologize for all of the downtime. I understand how frustrating it is to use a server that operates like this.

snowe

  • tatterdemalion@programming.dev · 8 days ago

    Very happy to hear this. I was noticing frequent slowness recently. I never really got to the point of considering leaving because I’m not on here that much anyway. But I do sponsor you on GitHub in a small way. I hope the sponsorships cover at least the hosting cost.

    You might want to update your GitHub sponsor page, as it sounds like some of that info is out of date after this migration.

    • snowe@programming.devOPM · 8 days ago

      it is helping, thank you for the sponsorship. I should have migrated a long time ago because the costs really were adding up. I’ll update my sponsor page after I have a fresh month of data for the bucket costs (which are still on Vultr) and the new server costs (which hopefully should be static). Thanks for the suggestion!

    • snowe@programming.devOPM · 8 days ago

      and thank you for being a great contributor to the community! this site would be nothing without all of you!

    • snowe@programming.devOPM · 8 days ago

      Hetzner. Honestly every provider was cheaper. I literally didn’t find a single provider that was even close to as expensive as Vultr. You can look at Vultr’s deploy page here (might need to be logged in for that). For 16GB of RAM on any product, the minimum cost is $80 a month. We were paying $120+.

      It’s honestly crazy how expensive Vultr is. The servers might have better processors (I didn’t really check that), but all our performance depends on RAM and cores, so none of that really matters.

      I was also able to get 64GB of ECC RAM on Hetzner. No clue if Vultr provided that, but they don’t list it anywhere.

      Providers I looked at:

      • Scaleway
      • Hetzner
      • Contabo
      • Netcup
      • OVH
      • Space Hosting
      • I think one more, but can’t remember it right now.
  • Kissaki@programming.dev · 8 days ago

    When programming.dev became practically unusable most of the time over the last several days, I was considering moving elsewhere. The lack of any post or announcement acknowledging the issue here in meta didn’t make me hopeful that the issues would subside at all. (I’ve experienced instance death before on feddit.de.)

    Great to see this development.

    Thank you for your continued work and efforts! 👍

    • snowe@programming.devOPM · 6 days ago

      Sorry for the silence. My years on the internet have made me hesitant to publicly say I’ll do something until I’m already almost done. Otherwise it’s unlikely to get done, and then I’m not keeping my word.

    • fuzzzerd@programming.dev · 7 days ago

      Felt the same way, and when I had the time and thought to post, the server was down. Glad to see it back up with an upgrade.

  • ISO@lemmy.zip · 8 days ago

    Good news. Thank you.

    And it’s good to hear that it wasn’t something nefarious messing with the instance.

    • snowe@programming.devOPM · 8 days ago

      Oh, we are getting attacked constantly, and that definitely didn’t help, but a large portion of it was just thrashing from postgres not getting enough memory.
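
      For anyone curious, the “not enough memory” part mostly comes down to the usual postgres memory settings, which default to pretty tiny values. A rough sketch of the kind of knobs involved (illustrative numbers for a 64GB box, not our actual config):

      ```sql
      -- Illustrative starting points, not programming.dev's real settings.
      -- Common rule of thumb: shared_buffers ~25% of RAM,
      -- effective_cache_size ~50-75% of RAM.
      ALTER SYSTEM SET shared_buffers = '16GB';        -- requires a restart to take effect
      ALTER SYSTEM SET effective_cache_size = '48GB';  -- planner hint; a reload is enough
      ALTER SYSTEM SET work_mem = '32MB';              -- per sort/hash operation, per connection
      SELECT pg_reload_conf();                         -- applies the reloadable settings
      ```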

  • bitcrafter@programming.dev · 8 days ago

    Oh, cool, to be honest I was actually in the process of transitioning to another instance due to the instability, but it sounds like I may not need to!

    Thanks a lot!

    • snowe@programming.devOPM · 8 days ago

      sorry for all the trouble. I would understand it if you still do, as I haven’t been the best operator.

  • entwine@programming.dev · 8 days ago

    Sidebar stats say this instance gets 23 users/day, which seems absolutely tiny and within the capabilities of a 4c/16GB cloud instance.

    We were maxing out every part of the server whenever even a modest number of users were active on the fediverse.

    Idk anything about how lemmy/fediverse works, but does that mean tiny instances like this get hit when the rest of the network is experiencing high load? Seems problematic.

    EDIT: btw, thanks for the free service and the effort you put in to keep it running!

    • snowe@programming.devOPM · 8 days ago

      programming.dev is the 9th largest lemmy server. https://join-lemmy.org/instances

      That stat was probably that low due to the server being down for around 90% of the last two weeks. If you look now it’s at 220 and it will continue to go up.

      On top of that, every action on every federated server is relayed to every instance. So all of lemmy.world’s activity is still relayed to us and we have to handle it. Same for the other servers.

      On top of that we also operate many other services:

      • bytes.programming.dev
      • git.programming.dev
      • blocks.programming.dev
      • etc (there’s a lot)

      But really it was mostly just postgres thrashing on all the requests; the number of requests on our Cloudflare dashboard made that pretty clear.

      Yes, this should be handleable by a server that small (think actor paradigm), but I was unable to tune postgres to get it to that point, as I’m not great at database stuff. I’m sure a DBA would have done a better job. I will note that some of the queries in the lemmy code are very badly optimized and were taking 20+ seconds to run each time, locking up the instance. With that on top of some other badly optimized selects for things like reading comments (which had a mean time of around 7 seconds), there wasn’t much I could do.
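
      If you want to poke at this kind of thing on your own instance, pg_stat_statements is the usual way to find the worst queries. A minimal sketch, assuming the extension is enabled (column names are for PostgreSQL 13+; older versions use total_time / mean_time instead):

      ```sql
      -- Top queries by total time spent, with call counts and mean latency.
      SELECT
          calls,
          round(mean_exec_time::numeric, 1)  AS mean_ms,
          round(total_exec_time::numeric, 1) AS total_ms,
          left(query, 80)                    AS query
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 20;
      ```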

      With the cost difference, it was well worth it to just upgrade to a cheaper, better server all around.

      • fuzzzerd@programming.dev · 7 days ago

        For all of the attention in the early days about Lemmy being Rust-based and thus focused on performance, the database seems to be the main bottleneck, and from anecdotal monitoring of other admins’ complaints I’d say that seems true.

        Seems like some design issues lead to heavy database usage, and it’s going to be really hard to optimize away from that.

        I don’t really have a better idea, just acknowledging that even a small instance has to scale disproportionately to its size when the rest of the network grows, and that’s heavy on the database specifically.

        • BB_C@programming.dev · 7 days ago

          The push-based ActivityPub (apub) federation itself is bad design anyway. Something pull-based with aggregation and well-defined synchronisation would have been much better.

          There are ideas beyond that. For example, complete separation between content and moderation. But that would diverge from the decentralized family of protocols apub belongs to, and may not attract a lot of users and traffic. And those who care and don’t mind smaller networks prefer fully distributed solutions anyway.