This should have ended up differently

oftheair@lemmy.blahaj.zone

Okay, it seems there are only two, since nepenthes is apparently no longer developed.

dave@lemmy.nz

Yeah, so Anubis is the bot-blocking one, and bots have already gotten past it.

Iocaine is an LLM maze and poisoner intended to trap bots, but your site still needs the resources to serve all the requests, and it's not clear what happens when a real user is accidentally identified as a bot.

oftheair@lemmy.blahaj.zone

Ah, okay.

Thanks for the info!

rekabis@lemmy.ca

Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It's telling my instance about every post, comment, or vote.

And yet, federation means that each instance should know all the other domain names, yes? So do daily DNS lookups for all the domains associated with federation and auto-whitelist the resulting IP addresses.

Sure, if you then have to configure Cloudflare with these IPs, you'll need to use its API to do so automatically.

But otherwise if you are running some sort of throttling protection on the actual box or VM the instance is sitting on, it should be rather trivial to update it directly, especially if said throttling software is doing Linux correctly and drawing its whitelist from a flat file.
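A minimal sketch of the daily lookup described above, in Python. The function names, the idea of passing in a list of known federated domains, and the one-IP-per-line file format are all assumptions for illustration; how a given throttling tool actually reads its allowlist will differ.

```python
import socket
from typing import Callable, Iterable


def resolve_ips(domain: str) -> set[str]:
    """Return every IPv4/IPv6 address DNS currently reports for a domain."""
    try:
        infos = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Domain did not resolve; skip it rather than abort the whole run.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return {info[4][0] for info in infos}


def build_allowlist(
    domains: Iterable[str],
    resolver: Callable[[str], set[str]] = resolve_ips,
) -> list[str]:
    """Resolve all domains and return a sorted, de-duplicated IP list."""
    ips: set[str] = set()
    for domain in domains:
        ips |= resolver(domain)
    return sorted(ips)


def write_allowlist(path: str, ips: list[str]) -> None:
    """Write one IP per line to a flat file (hypothetical format)."""
    with open(path, "w", encoding="utf-8") as fh:
        for ip in ips:
            fh.write(ip + "\n")
```

Run from a daily cron job, something like `write_allowlist("allowlist.txt", build_allowlist(known_domains))` would refresh the file, where `known_domains` is whatever list of federated instances the admin can export. Note that this only covers instances already known to the server.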

oftheair@lemmy.blahaj.zone

Guess they didn't live up to their name.

cooper8@feddit.online

"you could block logged out users but that would impact many lurkers"

"regardless you might not be logged in at all, you should still be allowed to browse content"

Fundamentally, what I'm suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users, or they can maintain their instance as public and deal with more arcane methods to attempt to eliminate scraping.

The issue is that if the infrastructure isn't in place for the instance operator to decide to make their service private, then everyone is opted in to the Scrapers vs Countermeasures war with no alternative.

Privacy and encryption just work; not building the infrastructure that lets the network function with them in place seems like a mistake.

To me, and to many users, what we want is fast load times, quick federation, and reliable service, all things that benefit from reducing traffic load to only registered users.

dave@lemmy.nz

"Fundamentally, what I’m suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users,"

Yeah, it would perhaps require more changes (since instances newly subscribed to a community need the ability to fetch content ad hoc), but even just not showing the website when someone isn't logged in would probably make a big difference. That might be pretty easy: just redirect requests that load the web app (except the login page) to the login page, and exclude the API. Apps would still get logged-out access, but I doubt that's much of a problem compared to the website, since the bots seem to just be indiscriminately scraping web pages.
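The redirect rule described above could be sketched roughly like this in Python. The `/api/` prefix and `/login` path are assumptions for illustration, not confirmed Lemmy routes, and the sketch ignores static assets and federation fetches, which a real setup would also have to let through.

```python
from typing import Optional, Tuple

# Hypothetical paths, chosen only for this example.
API_PREFIX = "/api/"
LOGIN_PATH = "/login"


def decide(path: str, logged_in: bool) -> Tuple[str, Optional[str]]:
    """Return ("serve", None) or ("redirect", target) for a request."""
    if logged_in:
        # Registered users see the web app as normal.
        return ("serve", None)
    if path.startswith(API_PREFIX):
        # The API is excluded, so apps and other servers keep working.
        return ("serve", None)
    if path == LOGIN_PATH:
        # The login page must stay reachable or nobody can log in.
        return ("serve", None)
    # Every other web app page sends logged-out visitors to the login page.
    return ("redirect", LOGIN_PATH)
```

In practice this decision would more likely live in a reverse proxy or CDN rule than in application code; the function only shows the order of the checks.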

dave@lemmy.nz

New instances (and not just Lemmy instances, but Mastodon and other fediverse instances) are coming online all the time, so you need a way to let them through to start the federation process. There are thousands, so it needs to be automatic. You can't require that a new instance send whitelisting requests to every server one of their users might want to interact with (instances aren't linked unless a local user subscribes to something on a remote instance).

Given the AI bots seem to just be indiscriminately scraping web pages, I excluded API endpoints from blocking anyway. Another admin showed me a nice Cloudflare rule to do this. Media can still be a problem, though: it's loaded by individual users on other instances, so it's hard to block scrapers without blocking users. That is another way Cloudflare helps (static media files are easily cached by their CDN).

cooper8@feddit.online

Definitely true.

rekabis@lemmy.ca

"you need a way to let them through to start the federation process."

Isn’t this done via an API endpoint explicitly for that purpose, one that bots would not normally use?

And why not have a process by which admins from a new instance poke the admins of another instance - any other instance, so long as it’s already a part of the network - to do an initial manual whitelist that could cascade through the entire system?

Then there should be ways that the software itself can auth with other instances of itself, via a common encryption protocol. While this would only work with like software, the key point is that only a toehold is needed to start propagating.

The point being, there are options. Some of them quite simple.

dave@lemmy.nz

Realistically, federation is not the main concern. You can leave all your API endpoints open to bots and not have a problem because they are loading the web app. Just block the web app for suspicious traffic.

ActivityPub already uses authentication to some extent with other instances; it's the first contact where you have to have trust.

My main concern is still that media is loaded directly by users in most cases; the APIs are not a problem right now, as the bots aren't specifically targeting Lemmy. There are ways to address this, but Lemmy (and other threadiverse services) don't have full-time dev teams; they work on what they can or want to work on given the very low hourly rate.
