This should have end up differently

44 Posts 19 Posters 0 Views
  • dave@lemmy.nz

    It's not even possible to do a good job of it with Cloudflare. What are the three you are referring to? The most commonly known one is Anubis, which Codeberg found AI bots had learnt to solve.

    oftheair@lemmy.blahaj.zone
    wrote last edited by oftheair@lemmy.blahaj.zone
    #34

    Okay, seems there are only two, as it seems Nepenthes is no longer developed.

    • Anubis
    • Iocaine
      dave@lemmy.nz
      wrote last edited by
      #35

      Yeah, so Anubis is the bot-blocking one, already breached by bots.

      Iocaine is an LLM maze and poisoner, intended to trap a bot, but your site still needs the resources to serve all the requests, and it's not clear what happens when a user is accidentally identified as a bot.

        oftheair@lemmy.blahaj.zone
        wrote last edited by
        #36

        Ah, okay.

        Thanks for the info!

          • dave@lemmy.nz

          Cloudflare's bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).

          For example, Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It's telling my instance about every post, comment, or vote. AI scrapers also send hundreds of thousands or millions of requests each day, in a near steady stream.

          For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high-traffic ASNs (an ASN is a group of IPs run by the same entity, and blackhat AI scrapers use many IPs) and showing a Cloudflare challenge (which will typically have a 0% pass rate). If the traffic comes from one IP then it's probably a federated instance, but I typically see many IPs from the same area with an even spread of requests.
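
To make the ASN heuristic above concrete, here is a minimal sketch. It assumes you already have one `(asn, ip)` pair per request, for example from access logs joined against an IP-to-ASN database. The function name and all three thresholds are made up for illustration and would need tuning against real traffic.

```python
from collections import Counter, defaultdict


def suspicious_asns(requests, min_requests=1000, min_ips=20, max_top_share=0.2):
    """Flag ASNs with lots of traffic spread evenly over many IPs.

    requests: iterable of (asn, ip) tuples, one tuple per request.
    The thresholds are illustrative guesses, not recommended values.
    """
    per_asn = defaultdict(Counter)
    for asn, ip in requests:
        per_asn[asn][ip] += 1

    flagged = []
    for asn, ip_counts in per_asn.items():
        total = sum(ip_counts.values())
        # Share of the ASN's traffic that comes from its busiest single IP.
        top_share = max(ip_counts.values()) / total
        if total >= min_requests and len(ip_counts) >= min_ips and top_share <= max_top_share:
            flagged.append(asn)
    return sorted(flagged)
```

A federated instance sending everything from one IP fails the `min_ips` check and is left alone, while a scraper rotating through many addresses in the same network gets flagged.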

          I also try to exclude federation/API endpoints, which can help stop false positives as scrapers are generally loading the web page.
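
A rough sketch of that exclusion, written as plain Python rather than as an actual Cloudflare rule. The path prefixes and the `from_flagged_network` flag are assumptions for illustration; check which paths your own instance software uses for its API and federation before relying on a list like this.

```python
from urllib.parse import urlsplit

# Assumed API/federation paths; adjust for your software and version.
EXEMPT_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo/")
# Content types commonly requested by ActivityPub servers.
ACTIVITY_TYPES = ("application/activity+json", "application/ld+json")


def should_challenge(url, accept_header="", from_flagged_network=False):
    """Return True if a request should be shown a challenge page."""
    if not from_flagged_network:
        return False
    path = urlsplit(url).path
    # Never challenge API and federation endpoints.
    if path.startswith(EXEMPT_PREFIXES) or path.endswith("/inbox"):
        return False
    # Requests asking for ActivityPub JSON are treated as federation traffic.
    if any(t in accept_header for t in ACTIVITY_TYPES):
        return False
    return True
```

The `Accept` header is trivially spoofable, so this only helps while scrapers are loading ordinary web pages and not targeting the software specifically.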

          Lemmy (and PieFed, Mbin) admins try to help each other with strategies for this, because one day a bot will find you and suddenly your instance is down because it is hammering you too hard.

          I bet if you are in China, Brazil, Singapore, Argentina, etc. then you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc. will typically respect the robots.txt, so US traffic is less of an issue).

          rekabis@lemmy.ca
          wrote last edited by
          #37

          Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It's telling my instance about every post, comment, or vote.

          And yet, federation means that each instance should know all the other domain names, yes? So do daily DNS lookups of all the domains associated with federation and auto-whitelist their IP addresses.

          Sure, if you then have to configure Cloudflare with these IPs, it’ll require an API to do so automatically.

          But otherwise if you are running some sort of throttling protection on the actual box or VM the instance is sitting on, it should be rather trivial to update it directly, especially if said throttling software is doing Linux correctly and drawing its whitelist from a flat file.
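
A minimal sketch of the flat-file idea, assuming you can already export the list of federated domains from your instance. The function names and the one-IP-per-line file format are hypothetical; whatever throttling tool you run will have its own format. The resolver is injectable so the logic can be tried without touching the network.

```python
import socket


def resolve_ips(domain, resolver=socket.getaddrinfo):
    """Return the set of IP addresses a domain currently resolves to."""
    try:
        infos = resolver(domain, 443, proto=socket.IPPROTO_TCP)
    except OSError:
        # Lookup failed (instance offline, typo in the domain, ...): skip it.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def build_allowlist(domains, resolver=socket.getaddrinfo):
    """Resolve every domain and return a sorted, de-duplicated IP list."""
    ips = set()
    for domain in domains:
        ips |= resolve_ips(domain, resolver)
    return sorted(ips)


def write_allowlist(path, ips):
    """Write one IP per line (assumed format) for a throttling tool to read."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(ips) + "\n")
```

Run from a daily cron job, this only covers instances you already know about, and the addresses it finds are the ones an instance receives traffic on, which may differ from the ones it sends from.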

            • slothrop@lemmy.ca

            sh.itjust.works was out

            oftheair@lemmy.blahaj.zone
            wrote last edited by
            #38

            Guess they didn't live up to their name.

              • dave@lemmy.nz

              Regarding the media hosting, again it seems like something that could rely on a method of identifying the user request directly with their user account before responding to the request.

              Yeah, so far it works to just check for a JWT in the cookie (regardless of what it is) to allow logged-in users to bypass the rules. This works on Lemmy because the bots aren't specifically targeting Lemmy, so they don't try to fake this (although if they were, they could just make an instance and our instances would send them all the data lol).
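
A small sketch of that cookie check. It assumes the session cookie is called `jwt`, so confirm the name your frontend actually sets. The three-segment shape test is an addition of this sketch; nothing is verified cryptographically, which matches the "regardless of what it is" approach above.

```python
from http.cookies import CookieError, SimpleCookie


def has_jwt_cookie(cookie_header, cookie_name="jwt"):
    """Return True if the Cookie header carries something shaped like a JWT.

    This only checks for presence and shape (header.payload.signature).
    It does NOT validate the token, so it is trivially fakeable.
    """
    cookie = SimpleCookie()
    try:
        cookie.load(cookie_header)
    except CookieError:
        return False
    morsel = cookie.get(cookie_name)
    if morsel is None:
        return False
    parts = morsel.value.split(".")
    return len(parts) == 3 and all(parts)
```

Because the token is never validated, this is a cheap filter for indiscriminate scrapers and not an authentication mechanism.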

              Alternately, and more securely, it could be a cryptographic handshake where the user’s home instance and the instance hosting the post generate a public key using their two private keys for the user, and the user provides the public key when making pull requests from the federated instance.

              This is already basically how ActivityPub works for communication between instances. But the activities are one thing, it's the page loads that are the killer because of the database queries needed to compile a unique, sorted home page of subscriptions. You could block logged out users but that impacts many lurkers.

              For media, that's difficult as media is often being loaded from a remote instance that doesn't know who you are, along with the problem that the media provider is not technically part of Lemmy (it's a separate service called pict-rs) so doesn't know if you're logged in. I'm not sure how that worked on PieFed or Mbin, but regardless you might not be logged in at all, and you should still be allowed to browse content.

              Lemmy has a proxy option where the instance can fetch content from the other servers to provide to the user, which does get around this issue for logged out users. But the proxy caches the media, and when this happens you are now the host of whatever media is in any post that made its way to your instance, along with all the legal risks that involves.

              (edit: also I am in favor of finding methodologies that work whether CloudFlare is used by the instance or not, obviously CloudFlare has advantages but as we have seen also is a vulnerability of the network.)

              All of the things being discussed around mitigations in Cloudflare are also possible to do without Cloudflare, but it just means setting it all up yourself. I'll just wait for someone smarter than me to build a tool I can host myself that does all this automatically, then I'll consider it 😅

              cooper8@feddit.online
              wrote last edited by
              #39

              "you could block logged out users but that would impact many lurkers"

              "regardless you might not be logged in at all, you should still be allowed to browse content"

              Fundamentally, what I'm suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users,

              or they can maintain their instance as public and deal with more arcane methods to attempt to eliminate scraping.

              The issue is that if the infrastructure isn't in place for the instance operator to decide to make their service private, then everyone is opted in to the Scrapers vs Countermeasures war with no alternative.

              Privacy and encryption just work; it seems like a mistake not to build the infrastructure that lets the network function with them in place.

              To me, and to many users, what we want is fast load times, quick federation, and reliable service, all things that benefit from reducing traffic load to only registered users.

                dave@lemmy.nz
                wrote last edited by
                #40

                Fundamentally, what I’m suggesting is a fork in the road. Either an instance admin can set up to eliminate scrapers by making the instance private to only registered users,

                Yeah, it would perhaps require more changes (since instances newly subscribed to a community need the ability to fetch content ad hoc), but even just not showing the website when someone isn't logged in would probably make a big difference. That might be pretty easy: just redirect requests to load the web app (except the login page) to the login page, and exclude the API. Apps would still get logged-out access, but I doubt that's much of a problem compared to the website, since the bots seem to just be indiscriminately scraping web pages.
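
The redirect rule described above could look something like this. It is a sketch, not how any fediverse software implements it; the `/login` and `/api/` paths are assumptions, and a real deployment would also need to let through the static assets the login page loads, plus federation endpoints.

```python
def route_logged_out(path, logged_in, login_path="/login", exempt_prefixes=("/api/",)):
    """Return the path to redirect to, or None to serve the request as-is.

    login_path and exempt_prefixes are assumed values for illustration.
    """
    if logged_in:
        return None
    # The login page itself and the API stay reachable without a session.
    if path == login_path or path.startswith(exempt_prefixes):
        return None
    return login_path
```

For example, `route_logged_out("/post/1", logged_in=False)` returns `"/login"`, while an app calling an `/api/` path without a session is left alone.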

                  dave@lemmy.nz
                  wrote last edited by
                  #41

                  New instances (and not just Lemmy instances, but Mastodon and other fediverse instances) are coming online all the time, so you need a way to let them through to start the federation process. There are thousands, so it needs to be automatic; you can't require that a new instance send whitelisting requests to every server one of their users might want to interact with (instances aren't linked unless a local user subscribes to something on a remote instance).

                  Given the AI bots seem to just be indiscriminately scraping web pages, I excluded API endpoints from blocking anyway. Another admin showed me a nice Cloudflare rule to do this. Media can still be a problem, though: it's individual users on other instances that are loading it, so it's hard to block scrapers without blocking users. That is another way Cloudflare helps (static media files are easily cached by their CDN).

                    cooper8@feddit.online
                    wrote last edited by
                    #42

                    Definitely true.

                      rekabis@lemmy.ca
                      wrote last edited by rekabis@lemmy.ca
                      #43

                      you need a way to let them through to start the federation process.

                      Isn’t this done via an API endpoint explicitly for that purpose, one that bots would normally not use?

                      And why not have a process by which admins from a new instance poke the admins of another instance - any other instance, so long as it’s already a part of the network - to do an initial manual whitelist that could cascade through the entire system?

                      Then there should be ways that the software itself can auth with other instances of itself, via a common encryption protocol. While this would only work with like software, the key point is that only a toehold is needed to start propagating.

                      The point being, there are options. Some of them quite simple.

                        dave@lemmy.nz
                        wrote last edited by
                        #44

                        Realistically, federation is not the main concern. You can leave all your API endpoints open to bots and not have a problem because they are loading the web app. Just block the web app for suspicious traffic.

                        ActivityPub already uses authentication to some extent with other instances; it's the first contact where you have to have trust.

                        My main concern is still that media is loaded directly by users in most cases; the APIs are not a problem right now, as the bots aren't specifically targeting Lemmy. There are ways to address this, but Lemmy (and other threadiverse services) don't have full-time dev teams; they work on what they can or want to work on given the very low hourly rate.
