Privacy Your Rights Online

Meet Cyveillancebot

gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."
This discussion has been archived. No new comments can be posted.


  • Available to all.

    ObviousGuy's axiom
  • oh my god! a web crawler not honouring the robots.txt file! and lying about what it is!

    what has the world come to?!?!?!

    for the sarcasm impaired: why are you even reading things on the web anyway? just give up already.
    • Amusement! (Score:3, Insightful)

      by fm6 ( 162816 )
      What's really dumb about this article is the belief that any documents on a public web site can be considered "private". Indeed, the guy seems to totally misunderstand the purpose of robots.txt. It's not there to specify what's private, it's there to control the way your site is presented on public web servers, and also to help spiders avoid overloading your site.

      And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your priv

      • There's a difference between private and copyright.
        All my website is copyright me, but not private. I have no problem with sharing the results of my research with humans, however, I don't want my copyrights violated. I'm happy with google caching them, I consider that a favour, as it does a public service like a library. This is different though, it's not a public resource.

        If every website were to contain a query-response entry page which screened out non-humans (or unintelligent ones, or ones that can't r
        • Well then, you must think very highly of Cyveillance's intrusive spybot. Its only purpose is to sniff out copyright violations!
          • I think highly of Spyveillance's bot in the same way that I'd like every airport security guard to stick his finger up my arse in order to see if I was smuggling heroin.

            Maybe some people approve of such things, but I ain't one of them.

            YAW

  • It's even friendly enough to grab that robots.txt file. If you want to snatch a whole site for (uhum) research just tell it it's your site, and wait for the great slurping sound.
  • Don't blame the software, blame the users.
  • by swmccracken ( 106576 ) on Tuesday May 06, 2003 @10:03PM (#5897757) Homepage
    This guy is a moron, right?

    Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.

    I mean, I could easily spider his site using wget ignoring his robots.txt.. (For the record, his robots.txt is disallow everyone).

    "It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: Crawling all of slashdot would be much larger than slashdot itself because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.

    Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

    Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
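To make the "sign on an unlocked door" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `ExampleBot` name, and the example.com URLs are assumptions for illustration, not anything taken from the site in the article. The check runs entirely on the crawler's side, so a bot that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every agent to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_decision(user_agent: str, url: str) -> str:
    """A well-behaved crawler chooses to skip; nothing forces it to."""
    return "fetch" if allowed(user_agent, url) else "skip"
```

A polite spider would call something like `polite_fetch_decision("ExampleBot", "http://example.com/page.html")` before every request and honor the answer. Actually keeping people out takes server-side authentication.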
    • Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...

      Just because the Donut Brigade won't move enough to make the powdered sugar fall from their rotund bellies doesn't make breaking and entering any more legal, regardless of how little effort *you* think 'breaking' requires...
      • Ironically, it takes *more* effort to not "break in" in this case.

        Yeah, it's a bit of a stretch though, I know.

        But, thinking that "reading links on your site that you don't want them to even though you didn't try and stop them is an invasion" is just naive and stupid.
    • Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.
      I mean, I could easily spider his site using wget ignoring his robots.txt.. (For the record, his robots.txt is disallow everyone).

      And this would get you sued if they didn't like what you were doing anymore, as you would be a trespasser. While cases such as eBay v. Bidder's Edge didn't explicitly hold that robots.tx

      • Thankfully, US Case Law isn't binding on me, yet.

        Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.

        (Don't get me started on the so-called "justice" system. :-)

        I would prefer to hope that it becomes accepted knowledge that putting anything on a website is considered publication of that information.. but this could just be idle hope.
    • """
      Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

      Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
      """

      You need to see "Bowling for Columbine", particularly the parts about Canada and front doors.

      YAW.

      • Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.


        No, it is more like a sign at the airport that says "Employees only" and then when you are surrounded by the police, you claim "but there was no lock on the door."


        Or at a Radio Shack, there is a sign on the back room door, "Private, employees only."

    • Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...

      On the contrary, if there were a strange figure that was rifling through my house and refused to identify himself, I would sure as hell hope that the police would concern themselves, despite the fact that my doormat says "Welcome"...

      You are equating technology with law, and that is a very dangerous thing to do. That I have a technological means to commit a crime does not invalidate the fact that

    • He's a bit of an idiot.

      I agree with the basic principles that this robot is being a little impolite though. The guy opens up his website, hoping that people will act in a civil manner. Cyveillancebot marches in there with the digital equivalent of hobnail boots, ignores the signs, and takes copies of everything, assuming that anything there is probably stolen.

      Equating it to mugging or breaking and entering is a bit much, but the shifty unshaven lurker seemed quite apt.
  • by Anonymous Coward on Tuesday May 06, 2003 @10:04PM (#5897759)
    Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

    The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.

    Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.

    Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).

    Drop 63.148.99.0/24 into the bit bucket and save your server some strain.

    (By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)
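For anyone wondering how little it takes to avoid the two failure modes described above, here is a rough sketch. It is not how any real spider is implemented. The names `crawl_order` and `HostThrottle`, the link graph, and the 30-second default (picked to match "once or twice per minute") are all assumptions for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_order(start, links, max_pages=100):
    """Breadth-first walk over links (url -> list of urls), visiting each URL once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:  # this check breaks index -> thread -> index loops
                seen.add(nxt)
                queue.append(nxt)
    return order


class HostThrottle:
    """Tracks the earliest time each host may be hit again."""

    def __init__(self, min_delay=30.0):
        self.min_delay = min_delay
        self.next_ok = {}  # host -> earliest allowed fetch time

    def wait_needed(self, url, now):
        """Seconds to wait before fetching url at time now; records the fetch."""
        host = urlsplit(url).netloc
        wait = max(0.0, self.next_ok.get(host, 0.0) - now)
        self.next_ok[host] = now + wait + self.min_delay
        return wait
```

With a graph like `{"/forum": ["/thread/1"], "/thread/1": ["/forum"]}`, `crawl_order("/forum", ...)` fetches each page once and stops. A real crawler would pass `time.monotonic()` as `now` and sleep for whatever `wait_needed` returns.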
    • Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window.

      you'd think a corp. would take more care. is it too hard to believe this bot will get stuck in a loop and tie up someone's bandwidth... perhaps sending them over their limit and costing them money? when you've got a name and an address that can be sued, it's best to use some common sense.

      maybe this bot doesn't do that, so feel free to explain why

    • by PurpleFloyd ( 149812 ) <`zeno20' `at' `attbi.com'> on Wednesday May 07, 2003 @02:14AM (#5898941) Homepage
      To me, these actions (hammering databases, getting caught in recursive loops that could be easily avoided) are much worse than ignoring robots.txt. While the whole robots.txt issue could be justifiable from their position (so people couldn't hide copyrighted info via robots.txt), bringing down servers through what amounts to a DOS attack is simply inexcusable.

      There are any [google.com] number [altavista.com] of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).

    • The ironic part is, they may well download material copyrighted by the web host, protected by a digital notice of the unacceptability of doing so...sounds like these guys want to play with the DMCA...
    • Actually, they only have a /27

      OrgName: Cyveillance
      OrgID: CYVEIL
      Address: 1555 Wilson Blvd., Ste. 404
      City: Arlington
      StateProv: VA
      PostalCode: 22209-2405
      Country: US

      NetRange: 63.148.99.224 - 63.148.99.255
      CIDR: 63.148.99.224/27


      If you block the whole /24, you're hitting a few unrelated (probably innocent) organizations.
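The size difference is easy to check with Python's standard `ipaddress` module. This is only a sketch: the two networks come from the whois output above, while the `is_blocked` helper and the sample addresses used to try it are made up for illustration.

```python
import ipaddress

# The range from the whois record, and the wider block suggested earlier.
cyveillance = ipaddress.ip_network("63.148.99.224/27")
whole_block = ipaddress.ip_network("63.148.99.0/24")


def is_blocked(addr: str, network=cyveillance) -> bool:
    """Return True if addr falls inside the network being blocked."""
    return ipaddress.ip_address(addr) in network


# A /27 covers 32 addresses and a /24 covers 256, so blocking the /24
# also drops 224 addresses that are not in the listed range.
extra_addresses = whole_block.num_addresses - cyveillance.num_addresses
```

So `is_blocked("63.148.99.230")` is true, while an address such as 63.148.99.10 is only caught if you block the whole /24.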
    • ok, it looks like the abuse could lead to change. What will the inconvenience likely be for us?

      -> what are the defences for aggressive spiders and
      --> what is the impact of these defences?

      And, a case study. What happens if I copy+paste a WP posting to my own free site when:

      - site is hosted under cuban domain?
      - I copy data to paper word for word and fly to cuba, then submit and host there?

      ^ Laws for US/EU?

      Where might be a good source to answer these ridiculous legal copyright related questions? They se
    • Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

      What I don't understand is why scouring the web for Copyrighted material is considered being violated. If you are depending on the copyright laws, then you must abide by the limitations on those rights. Once the copyright o
      • What I don't understand is why scouring the web for Copyrighted material is considered being violated.

        Well, I certainly don't consider it wrong for copyright holders to search the web for theft of their IP. Problem is, Cyveillance does it in an extremely disruptive manner. It's probably not reasonable to expect the cyveillancebot to honor robots.txt, as Chris Gulker thinks it should. But if it doesn't act nicer than it currently does, then web masters will just lock it out -- and it will defeat its own p

  • IP-BLOCK TO BLOCK (Score:5, Informative)

    by Oriumpor ( 446718 ) * on Tuesday May 06, 2003 @10:33PM (#5897905) Homepage Journal
    I used SAMSPADE [samspade.org] to look up the IP block they own (taken from the wonderful article). This is most definitely not their ONLY IP block, but if anyone does have more, it would be great to compile a whole list of "mean" IPs.

    If you do not care for this kind of intrusion (I equate this to exactly what spammers do to harvest your email...) then you can block these IPs (route 'em to never never land.)
  • by plsuh ( 129598 ) <plsuh@noSpAM.goodeast.com> on Tuesday May 06, 2003 @10:58PM (#5898076) Homepage
    Take a look at one guy's experiences [diveintomark.org] with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.

    --Paul
    • I'm not a webmaster, but it sounds like a spambot trap is close to being a necessary feature for a small web site. But I can't say I like the idea of using a firewall this way. Mark also provides a link to a site that supposedly does the same thing with Apache, but that site is offline. (!)

      I find it interesting that he can lock out Cyveillancebot and other spybots simply by banning their IP addresses. Sounds like Cyveillance and other "ebusiness intelligence" companies are being less than diligent in provi

  • Sorry, mate, but as much as I dislike abuse of copyright (I've had some of my own works pillaged in the past), if you don't take steps to protect it, you can assume someone will copy it and use it illegitimately.

    The best you can do is chase - legally if necessary - those who steal your work, and gain whatever compensation you can. Oh, and make sure that copyright is broadly proclaimed in the first instance, too.

    No, the `bot shouldn't crawl past robots.txt (rfc-ignorant [rfc-ignorant.org], anyone?). But, given that it do

  • RewriteCond %{REMOTE_HOST} ^www\.cyveillance\.com$
    RewriteRule ^.*$ - [F]


    Of course the actual address of the bot may vary.
    • Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.

      See the above comment: Cyveillance in a nutshell [slashdot.org]

      You need to block its IP:

      # Cyveillance
      RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.(22[4-9]|2[3-4][0-9]|25[0-5])$

      # FILTER BOTS : 403-Forbidden
      RewriteRule ^.* - [F,L]
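A regex over dotted quads is easy to get subtly wrong, so here is a quick sanity check in Python. It is a sketch, not part of any Apache tooling. It assumes a version of the pattern with the dots escaped so they only match literal dots, and compares it against the 63.148.99.224/27 range from the whois record posted in this discussion. The `mismatches` helper is a made-up name.

```python
import ipaddress
import re

# Same idea as the RewriteCond pattern, with the dots escaped so they match literally.
PATTERN = re.compile(r"^63\.148\.99\.(22[4-9]|2[3-4][0-9]|25[0-5])$")
NETWORK = ipaddress.ip_network("63.148.99.224/27")


def mismatches():
    """Return addresses in 63.148.99.0/24 where the regex and the /27 disagree."""
    bad = []
    for host in ipaddress.ip_network("63.148.99.0/24"):
        by_regex = PATTERN.match(str(host)) is not None
        by_cidr = host in NETWORK
        if by_regex != by_cidr:
            bad.append(str(host))
    return bad
```

An empty list from `mismatches()` means the pattern covers the 32 addresses from .224 through .255 and nothing else in that /24.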
      • Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.

        You need to block its IP


        Uh huh, and did you see my rule mention HTTP_USER_AGENT anywhere in it? No. Look at what you wrote--the only difference between your rule and mine is that you followed my advice and used an IP address range instead of the host name.
  • by moc.tfosorcimgllib ( 602636 ) on Wednesday May 07, 2003 @08:30AM (#5900059) Journal
    C EVIL BOT CAN LYE
  • 64.68.82.39 - - [05/May/2003:15:18:23 -0700] "GET /robots.txt HTTP/1.0" 404 275 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

    Unless I am misunderstanding the log entry, robots.txt doesn't actually exist on this guy's server. So why does he spend so much time complaining about this thing not looking for it?
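If you'd rather not read the entry by eye, here is a small sketch that pulls the fields out of that log line with a regular expression. The pattern is my own approximation of the Apache combined log format, not an official parser, and the names `COMBINED` and `parse` are made up.

```python
import re

LOG_LINE = ('64.68.82.39 - - [05/May/2003:15:18:23 -0700] '
            '"GET /robots.txt HTTP/1.0" 404 275 "-" '
            '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"')

# Approximate combined log format: host, ident, user, time, request line,
# status, response size, referer, user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None
```

`parse(LOG_LINE)` gives a `status` of 404 for `/robots.txt`, which supports the reading that the file was missing when Googlebot asked for it. As far as I know, crawlers conventionally treat a missing robots.txt as "no restrictions".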

  • HTMLized version of Cyveillance's robots.txt file [cyveillance.com], for your browsing pleasure:


    User-agent: *
    Disallow: /web/us/partners/submit_pw.asp [cyveillance.com]
    Disallow: /web/uk/partners/submit_pw.asp [cyveillance.com]
    Disallow: /web1/us/partners/submit_pw.asp [cyveillance.com]
    Desallow: /web1/uk/partners/submit_pw.asp [cyveillance.com]

    Notice how they misspelled "Disallow" in the fourth item, and that none of the pages seem to exist. Good job, Cyveillance!
