This should have ended up differently

dave@lemmy.nz

Cloudflare's bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).

For example, Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It's telling my instance about every post, comment, or vote. AI scrapers also send hundreds of thousands or millions of requests a day in a near steady stream.

For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high traffic ASNs (a group of IPs run by the same entity, because blackhat AI scrapers use many IPs) and showing a Cloudflare challenge (which will typically have a 0% pass rate). If it's from one IP then it's probably a federated instance, but I typically see many IPs from the same area with requests spread evenly across them.
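A minimal sketch of the ASN grouping described above, assuming you already have request logs reduced to (ASN, IP) pairs. The function name and both thresholds are made up for illustration and would need tuning per instance.

```python
from collections import defaultdict

def suspicious_asns(requests, min_requests=1000, min_ips=20):
    """requests: iterable of (asn, ip) pairs, one per logged request.

    Returns the set of ASNs with heavy traffic spread across many IPs.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    per_asn = defaultdict(lambda: defaultdict(int))
    for asn, ip in requests:
        per_asn[asn][ip] += 1

    flagged = set()
    for asn, per_ip in per_asn.items():
        total = sum(per_ip.values())
        # Heavy traffic from a single IP is more likely a federated
        # instance, so only flag ASNs that use many distinct IPs.
        if total >= min_requests and len(per_ip) >= min_ips:
            flagged.add(asn)
    return flagged
```

The flagged ASNs would then be the ones shown a challenge; the sketch does not do the blocking itself.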

I also try to exclude federation/API endpoints, which can help stop false positives as scrapers are generally loading the web page.
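As a rough sketch of that exclusion, assuming the rule can see the request path and Accept header. The path prefixes here are illustrative guesses, not a verified list of Lemmy routes, so check them against your own instance.

```python
# Assumed prefixes for federation/API traffic; adjust for your software.
EXEMPT_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo")

def should_challenge(path, accept_header=""):
    """Return True if a request should face the bot challenge."""
    # ActivityPub fetches ask for JSON rather than an HTML page.
    if "application/activity+json" in accept_header:
        return False
    # Federation and API endpoints are skipped to avoid false positives.
    if path.startswith(EXEMPT_PREFIXES):
        return False
    # Everything else is a web page load, which is what scrapers hit.
    return True
```

A scraper could fake the Accept header, so this only helps against bots that are not targeting the fediverse specifically.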

This is something Lemmy (and PieFed, Mbin) admins try to help each other with by sharing strategies, because one day a bot will find you and suddenly your instance is down because it is hammering you too hard.

I bet if you are in China, Brazil, Singapore, Argentina, etc then you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc will typically respect the robots.txt so US traffic is less of an issue).

cooper8@feddit.online

The thing that confuses me is, wouldn't a whitelist for federated instances and request frequency throttling at the account level solve this issue?
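For reference, account-level throttling is often done with a token bucket. This is a generic sketch, not something Lemmy ships as far as this thread establishes; the class name, rate, and burst values are assumptions.

```python
import time

class AccountThrottle:
    """Token bucket per account: `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec=1.0, burst=10, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # account -> (tokens, time of last request)

    def allow(self, account):
        now = self.clock()
        tokens, last = self.buckets.get(account, (self.burst, now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[account] = (tokens, now)
            return False
        self.buckets[account] = (tokens - 1, now)
        return True
```

Note this only covers requests that carry an account, which is the gap the following replies get into.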

I suppose this would require that the client not have a public front end that keeps full navigation functionality, but for a smaller instance that seems like an easy sacrifice to make in exchange for stability.

"But then how will new instances get federated?" Maybe they have to actually talk to the admins of other instances to get vouched in to the whitelist. Just because the network is distributed doesn't mean it needs to be fully inclusive by default, and in fact it explicitly isn't.

I'm assuming I'm missing something super basic that makes all this not enough, bots spoofing the requests with the credentials of a whitelisted instance maybe?

Seems like maybe the instances should have encrypted keys that handshake each other with batch requests.

Am I on to something or just wildly gesticulating?

dave@lemmy.nz

There are thousands of instances and it's not really about admins. If a Mastodon user wants to go and follow a Lemmy community, they can. They shouldn't need to ask their admin to contact the admin of the Lemmy instance to be allowed to.

However, there is something called Fediseer which allows a chain of trust. Some instances guarantee other instances who then guarantee others down a chain. If an instance turns out bad then their guarantor can revoke it and any instances lower in the chain (that the spammy instance guarantees) also lose their trusted status. It doesn't share IPs to my knowledge though, and outbound IPs are different than the inbound one on the domain if there is a CDN like Cloudflare in the mix. The intent is actually to identify and block instances set up to spam (or other reasons to defederate).
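To make the cascade concrete, here is a toy model of a guarantee chain. It is not Fediseer's API or data model, just the behaviour described above; the class and method names are invented for the example.

```python
class TrustChain:
    """Toy guarantee chain: each instance is vouched for by exactly one guarantor."""

    def __init__(self, roots):
        # instance -> who vouched for it (None for the starting roots)
        self.guarantor = {root: None for root in roots}

    def guarantee(self, guarantor, instance):
        if guarantor not in self.guarantor:
            raise ValueError(f"{guarantor} is not trusted, so it cannot vouch")
        if instance in self.guarantor:
            raise ValueError(f"{instance} is already guaranteed")
        self.guarantor[instance] = guarantor

    def revoke(self, instance):
        """Remove an instance and everything it vouched for, directly or not."""
        removed = {instance}
        changed = True
        while changed:
            changed = False
            for inst, g in self.guarantor.items():
                if g in removed and inst not in removed:
                    removed.add(inst)
                    changed = True
        for inst in removed:
            self.guarantor.pop(inst, None)
        return removed

    def is_trusted(self, instance):
        return instance in self.guarantor
```

Revoking one spammy instance drops everything below it in the chain, while instances vouched for by someone else are untouched.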

I think the other part missing is that it's not just instances. If you upload an image to Lemmy.world and then someone on feddit.online views it, the feddit.online user's browser loads that image directly from Lemmy.world. That means if you block any IP that's not an instance, people won't be able to see content uploaded by your users. So you have to be able to tell what is a Brazil-hosted AI bot and what's a Brazilian user viewing a meme your user uploaded.

There are of course different parts that you can or can't block which is basically the idea, working out which endpoints can be blocked and which will break things for genuine users. With static images they can be basically ignored because Cloudflare will cache it, but having thousands of post or feed loads in a hurry can bring down an instance.

dragonfucker@lemmy.nz

What about Anubis?

dave@lemmy.nz

Yeah, so Anubis is like a Cloudflare challenge; it fits into a certain part of the process.

My point is basically that Cloudflare provides a service that stands in for many things an admin could be doing. There are many instances that don't use Cloudflare, and I commend them for that. It's more work but certainly possible.

There's also the additional problem that AI bots are breaking through Anubis, so it can't be the only line of defence.

E.g. https://news.ycombinator.com/item?id=44914773

dragonfucker@lemmy.nz

Interesting, thanks

redacted@lemmy.zip

Thank you for the detailed response, I even understood most of it

irelephant@lemmy.dbzer0.com

cursed wojak

kierunkowy74@piefed.social

crap.itdidnt.work

cooper8@feddit.online

Fediseer seems like a good solution, essentially a whitelist vouch system with vouching at second hand.

Regarding the media hosting, again it seems like something that could rely on a method of identifying the user request directly with their user account before responding to the request. Cookies could be an option for this, though they are falling out of favor. Alternately, and more securely, it could be a cryptographic handshake where the user's home instance and the instance hosting the post generate a public key using their two private keys for the user, and the user provides the public key when making pull requests from the federated instance. The keys could be batch generated when an instance first federates content with another and then assigned to user accounts the first time the user makes a pull request through a link from their home instance to the federated instance.

Secure Scuttlebutt Protocol already developed the encryption methodology that could be cross-applied for a lot of this: https://ssbc.github.io/scuttlebutt-protocol-guide/ though I am of course not suggesting SSP be adopted wholesale, and there are a bunch of other open source projects with encryption that could be used. This is just the one that comes to mind.

(edit: also I am in favor of finding methodologies that work whether CloudFlare is used by the instance or not, obviously CloudFlare has advantages but as we have seen also is a vulnerability of the network.)

dave@lemmy.nz

> Regarding the media hosting, again it seems like something that could rely on a method of identifying the user request directly with their user account before responding to the request.

Yeah, so far it works to just check for a JWT in the cookie (regardless of what it is) to allow logged in users to bypass the rules. This works on Lemmy because the bots aren't specifically targeting Lemmy so they don't try to fake this (although if they were, just make an instance and our instances will send you all the data lol).
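Roughly, that check amounts to the following sketch. The cookie name `jwt` is an assumption here, and the token is deliberately not validated, matching the "regardless of what it is" approach.

```python
def has_login_cookie(cookie_header, name="jwt"):
    """True if the Cookie header carries a non-empty cookie called `name`.

    The value is not verified; this only separates requests that look
    logged in from anonymous page loads.
    """
    for part in (cookie_header or "").split(";"):
        key, _, value = part.strip().partition("=")
        if key == name and value:
            return True
    return False
```

Requests that pass would skip the challenge rules; anything without the cookie is treated like any other anonymous visitor.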

> Alternately, and more securely, it could be a cryptographic handshake where the user’s home instance and the instance hosting the post generate a public key using their two private keys for the user, and the user provides the public key when making pull requests from the federated instance.

This is already basically how ActivityPub works for communication between instances. But the activities are one thing, it's the page loads that are the killer because of the database queries needed to compile a unique, sorted home page of subscriptions. You could block logged out users but that impacts many lurkers.

For media, that's difficult as media is often being loaded from a remote instance that doesn't know who you are, along with the problem that the media provider is not technically part of Lemmy (it's a separate service called pict-rs) so doesn't know if you're logged in. I'm not sure how that works on PieFed or Mbin, but regardless you might not be logged in at all, and you should still be allowed to browse content.

Lemmy has a proxy option where the instance can fetch content from the other servers to provide to the user, which does get around this issue for logged out users. But the proxy caches the media, and when this happens you are now the host of whatever media is in any post that made its way to your instance, along with all the legal risks that involves.

> (edit: also I am in favor of finding methodologies that work whether CloudFlare is used by the instance or not, obviously CloudFlare has advantages but as we have seen also is a vulnerability of the network.)

All of the things being discussed around mitigations in Cloudflare are also possible to do without Cloudflare, but it just means setting it all up yourself. I'll just wait for someone smarter than me to build a tool I can host myself that does all this automatically, then I'll consider it

ulterno@programming.dev

Yeah mine too.
I started wondering if the requests from desktop Lemmy apps also go through Cloudflare (probably do).

cheesenoodle@lemmy.world

So my takeaway from this thread is existing mega corporations have found a legal way (deliberately or not) to run endless denial of service attacks on potential competition?

oftheair@lemmy.blahaj.zone

> because AI scrapers bring it to its knees

There are currently at least three pieces of web software to protect against AI scrapers, so it should be more than possible without Cloudflare.

dave@lemmy.nz

It's not even possible to do a good job of it with Cloudflare. What are the three you are referring to? The most commonly known one is Anubis, which Codeberg found AI bots had learnt to solve.

oftheair@lemmy.blahaj.zone

Okay, seems there are only two, as it seems Nepenthes is no longer developed.

dave@lemmy.nz

Yeah, so Anubis is the bot-blocking one, already breached by bots.

Iocaine is an LLM maze and poisoner, intended to trap a bot but your site still needs the resources to serve all the requests, and it's not clear what happens when a user is accidentally identified as a bot.

oftheair@lemmy.blahaj.zone

Ah, okay.

Thanks for the info!

rekabis@lemmy.ca

> Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It's telling my instance about every post, comment, or vote.

And yet, federation means that each instance should know all the other domain names, yes? So do daily DNS lookups of all IP addresses associated with federation and auto-whitelist them.

Sure, if you have to then configure cloudflare with these IPs, it’ll require an API to do so automatically.

But otherwise if you are running some sort of throttling protection on the actual box or VM the instance is sitting on, it should be rather trivial to update it directly, especially if said throttling software is doing Linux correctly and drawing its whitelist from a flat file.
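A sketch of what that daily job could look like, assuming you can export the list of federated domains yourself and that your throttling software reads one IP per line. The function names and file format are assumptions. As noted earlier in the thread, an instance behind a CDN may send from different IPs than its domain resolves to, so this would miss those.

```python
import socket

def resolve_ips(domain, resolver=socket.getaddrinfo):
    """Return the set of IP addresses a domain currently resolves to."""
    try:
        infos = resolver(domain, 443, proto=socket.IPPROTO_TCP)
    except OSError:
        return set()  # unresolvable today; skip rather than fail the whole run
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}

def build_allowlist(domains, resolver=socket.getaddrinfo):
    """Resolve every federated domain and return a sorted list of unique IPs."""
    ips = set()
    for domain in domains:
        ips |= resolve_ips(domain, resolver)
    return sorted(ips)

def write_allowlist(path, ips):
    """Write one IP per line to a flat file for the throttling software to read."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(ips) + "\n")
```

Run from cron once a day, this would rewrite the flat file from scratch so stale addresses drop out on their own.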

oftheair@lemmy.blahaj.zone

Guess they didn't live up to their name.
