BLUF: The operations team at Beehaw has worked to increase site performance and uptime. This includes proactive monitoring to prevent problems from escalating and planning for likely future events.
Problem: Emails only sent to approved users, not denials; denied users can’t reapply with the same username
Disabled Docker postfix container; Lemmy runs on a Linux host that can use postfix itself, without any overhead
Modified various postfix components to accept localhost (same system) email traffic only
Created two different scripts to:
Sending so many emails from our provider caused the emails to end up in spam!! We had to change a bit of the outgoing flow
Configure Lemmy containers to use the host postfix as mail transport
All is well?
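As a rough illustration of the flow above, here is a minimal Python sketch of a script that composes a denial notice and hands it to a postfix instance listening on the same host. The function names, addresses, and message wording are all hypothetical; this is not the actual Beehaw script, and it assumes postfix accepts unauthenticated mail on localhost port 25.

```python
import smtplib
from email.message import EmailMessage


def build_denial_email(username: str, recipient: str, sender: str) -> EmailMessage:
    """Compose a plain-text notice for a denied application (wording is illustrative)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your application was not approved"
    msg.set_content(
        f"Hello {username},\n\n"
        "Your application was not approved this time. "
        "You are welcome to apply again.\n"
    )
    return msg


def send_via_local_postfix(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to the mail server on the same machine (assumed: postfix)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Building the message separately from sending it makes it easy to check the content before anything leaves the box, which matters when a batch of notices could land you in spam folders.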
Problem: NO file level backups, only full image snapshots
Requested funds from Beehaw org to be spent on the purchase of cloud-based storage, B2 - approved (thank you for the donations)
Installed and configured restic encrypted backups of key system files -> b2 ‘offsite’. This means even the Beehaw data saved there is encrypted, and no one else can read it
Verified scheduled backups run every day to b2. They include important information such as the Lemmy volumes, pictures, configurations for various services, and a database dump
Verified restoration works! Had a small issue with the pictrs migration to object storage (b2). Restored the entire pictrs volume from restic b2 backup successfully. Backups work!
sorry for that downtime, but hey… it worked
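For anyone wanting to script something similar, the backup and restore steps above could be wrapped roughly like this. It is a sketch under assumptions: the bucket name, repository path, and directories are placeholders, and restic is expected to read `RESTIC_PASSWORD`, `B2_ACCOUNT_ID`, and `B2_ACCOUNT_KEY` from the environment. Our actual paths and schedule are not shown here.

```python
import os
import subprocess


def restic_backup_cmd(repo: str, paths: list[str]) -> list[str]:
    """Build the command that backs up the given paths into the repository."""
    return ["restic", "-r", repo, "backup", *paths]


def restic_restore_cmd(repo: str, target: str, snapshot: str = "latest") -> list[str]:
    """Build the command that restores a snapshot into a target directory."""
    return ["restic", "-r", repo, "restore", snapshot, "--target", target]


def run(cmd: list[str]) -> None:
    """Run restic with the current environment; raises if restic exits non-zero."""
    subprocess.run(cmd, check=True, env=os.environ.copy())


if __name__ == "__main__":
    # Placeholder repository and paths, not the real Beehaw values.
    repo = "b2:example-bucket:lemmy"
    run(restic_backup_cmd(repo, ["/srv/lemmy/volumes", "/srv/lemmy/db-dump.sql"]))
```

Restoring into a scratch directory first, then comparing, is a cheap way to prove a backup is usable before you actually need it.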
Problem: No metrics/monitoring; what do we focus on to fix?
With this information we’ve determined the areas to focus on are database performance and storage concerns. We’ll be moving our image storage to a CDN if possible to help with bandwidth and storage costs.
Peace of mind, and let the poor admins sleep!
Problem: Lemmy is really slow and more resources for it are REALLY expensive
This gets its own section. Look, the largest issue with Lemmy performance is currently the database. We’ve spent a lot of time attempting to track down why and what it is, and then fixing what we reliably can. However, none of us are Rust developers or database admins. We know where Lemmy spends its time in the DB, but not why, and we really don’t know how to fix it in the code. If you’ve complained about why Lemmy/Beehaw is so slow, this is it; this is the reason.
So since I can’t code Rust, what do we do? Fix it where we can: PostgreSQL server setting tuning and changes. We changed the following items in PostgreSQL to give better performance based on our load and hardware:
huge_pages = on # requires sysctl.conf changes and a system reboot
shared_buffers = 2GB
max_connections = 150
work_mem = 3MB
maintenance_work_mem = 256MB
temp_file_limit = 4GB
min_wal_size = 1GB
max_wal_size = 4GB
effective_cache_size = 3GB
random_page_cost = 1.2
wal_buffers = 16MB
bgwriter_delay = 100ms
bgwriter_lru_maxpages = 150
effective_io_concurrency = 200
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_maintenance_workers = 2
max_parallel_workers = 6
synchronous_commit = off
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
Now, I’m not saying all of these had an effect, or even a cumulative effect; these are just the values we’ve changed. Be sure to use your own system values and not copy the above. The three largest changes I’d say are key to do are synchronous_commit = off, huge_pages = on and work_mem = 3MB. This article may help you understand a few of those changes.
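To make the "use your own system values" advice a bit more concrete, here is a small Python sketch of two back-of-the-envelope checks for these settings. It assumes 2 MiB huge pages (common on x86-64 Linux) and hypothetical helper names. Note that PostgreSQL's real shared memory segment is somewhat larger than shared_buffers alone, so treat the page count as a lower bound; on PostgreSQL 15 and later, the read-only shared_memory_size_in_huge_pages setting reports the actual figure.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def huge_pages_needed(shared_memory_bytes: int, huge_page_bytes: int = 2 * MIB) -> int:
    """Round up to the number of huge pages that cover the given memory size."""
    return -(-shared_memory_bytes // huge_page_bytes)  # ceiling division


def one_sort_per_connection_bytes(max_connections: int, work_mem_bytes: int) -> int:
    """Memory used if every connection runs exactly one work_mem-sized operation.

    work_mem applies per sort/hash operation, so a complex query can use
    several multiples of it; this is a rough yardstick, not a hard limit.
    """
    return max_connections * work_mem_bytes


if __name__ == "__main__":
    print(huge_pages_needed(2 * GIB))                           # shared_buffers = 2GB
    print(one_sort_per_connection_bytes(150, 3 * MIB) // MIB)   # in MiB
```

The first number is a starting point for vm.nr_hugepages in sysctl.conf; the second is a reminder of why work_mem is kept small when max_connections is high.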
With these changes, the database seems to be working a damn sight better even under heavier loads. There are still a lot of inefficiencies that can be fixed with the Lemmy app for these queries. A user phiresky has made some huge improvements there and we’re hoping to see those pulled into main Lemmy on the next full release.
Problem: Lemmy errors aren’t helpful and sometimes don’t even reach the user (UI)
No, not by far. But I am about to hit the character limit for Lemmy posts. There have been many other changes and additions to Beehaw operations; these are but a few of the key ones. I’m sharing them with the broader community so those of you also running Lemmy can see if these changes help you too. Ask questions and I’ll discuss and answer what I can; no secret sauce or passwords though; I’m not ChatGPT.
Shout out to @Lionir@beehaw.org , @Helix@beehaw.org and @admin@beehaw.org for continuing to work with me to keep Beehaw running smoothly.
Thanks all you Beeple, for being here and putting up with our growing pains!
Awesome write up. I appreciate the transparency and the knowledge transfer for others trying to keep their Lemmy instances up under load.
I too use B2 as a restic target in my home network. I’m not delivering any content, but I’m assuming you are aware of the B2 integrations with CDNs that at least don’t incur B2 egress charges? https://www.backblaze.com/b2/solutions/content-delivery.html
I’d assume there are CDN bandwidth charges involved. I have no idea how those compare to the VPS charges. I’m also certain the setup isn’t simple.
I’m looking forward to similar posts and can’t express how much I appreciate the transparency.