Meet Cyveillancebot

Posted by timothy on Tuesday May 06, 2003 @09:32PM from the or-better-yet-don't dept.

gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."

This discussion has been archived. No new comments can be posted.

On the web? (Score:1)

by ObviousGuy ( 578567 ) writes:

Available to all.

ObviousGuy's axiom
shock! (Score:1)

by kevin lyda ( 4803 ) * writes:

oh my god! a web crawler not honouring the robots.txt file! and lying about what it is!

what has the world come to?!?!?!

for the sarcasm impaired: why are you even reading things on the web anyway? just give up already.
- Amusement! (Score:3, Insightful)
  
  by fm6 ( 162816 ) writes:
  
  What's really dumb about this article is the belief that any documents on a public web site can be considered "private". Indeed, the guy seems to totally misunderstand the purpose of robots.txt. It's not there to specify what's private, it's there to control the way your site is presented on public web servers, and also to help spiders avoid overloading your site.
  And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your priv
  - Re:Amusement! (Score:2, Insightful)
    
    by You're All Wrong ( 573825 ) writes:
    
    There's a difference between private and copyright.
    All my website is copyright me, but not private. I have no problem with sharing the results of my research with humans, however, I don't want my copyrights violated. I'm happy with google caching them, I consider that a favour, as it does a public service like a library. This is different though, it's not a public resource.
    
    If every website were to contain a query-response entry page which screened out non-humans (or unintelligent ones, or ones that can't r
    - Re:Amusement! (Score:2)
      
      by fm6 ( 162816 ) writes:
      
      Well then, you must think very highly of Cyveillance's intrusive spybot. Its only purpose is to sniff out copyright violations!
      - Re:Amusement! (Score:2, Insightful)
        
        by You're All Wrong ( 573825 ) writes:
        
        I think highly of Spyveillance's bot in the same way that I'd like every airport security guard to stick his finger up my arse in order to see if I was smuggling heroin.
        
        Maybe some people approve of such things, but I ain't one of them.
        
        YAW
        
        Re:Amusement! (Score:2)
        
        by fm6 ( 162816 ) writes:
        
        Sloppy of me, I forgot the smiley. What's the smiley for "irony", anyway?
NetObjects Fusion does that too (Score:2)

by infonography ( 566403 ) writes:

It's even friendly enough to grab that robots.txt file. If you want to snatch a whole site for (uhum) research just tell it it's your site, and wait for the great slurping sound.
This is the same as dealing with Gnutella (Score:1)

by Rares Marian ( 83629 ) writes:

Don't blame the software, blame the users.
This guy is a bit stupid, right? (Score:5, Informative)

by swmccracken ( 106576 ) writes: on Tuesday May 06, 2003 @10:03PM (#5897757) Homepage

This guy is a moron, right?

Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.

I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt is "disallow everyone".)

"It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: crawling all of Slashdot would yield far more data than Slashdot itself holds, because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.

Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
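
To make the "sign on the door" point concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt text is a generic "disallow everyone" file of the kind described above, not a copy of the actual file, and the bot name and URL are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# A generic "disallow everyone" robots.txt (assumed content, for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Report what robots.txt asks of this crawler. Nothing enforces the answer."""
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and example.com are hypothetical placeholders.
allowed = may_fetch("ExampleBot", "http://example.com/private/notes.html")
print(allowed)  # False
```

The check runs entirely inside the crawler. A bot that never calls it, or ignores the result, can still request every URL, which is why robots.txt works as a request rather than a lock.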

- Re:This guy is a bit stupid, right? (Score:1)
  
  by Erebus ( 13033 ) writes:
  
  Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
  
  Just because the Donut Brigade won't move enough to make the powdered sugar fall from their rotund bellies doesn't make breaking and entering any more legal, regardless of how little effort *you* think 'breaking' requires...
  - Re:This guy is a bit stupid, right? (Score:1)
    
    by swmccracken ( 106576 ) writes:
    
    Ironically, it takes *more* effort to not "break in" in this case.
    
    Yeah, it's a bit of a stretch though, I know.
    
    But, thinking that "reading links on your site that you don't want them to even though you didn't try and stop them is an invasion" is just naive and stupid.
- current state of things (Score:1)
  
  by danoatvulaw ( 625376 ) writes:
  
  Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.
  I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt is "disallow everyone".)
  
  And this would get you sued if they didn't like what you were doing anymore, as you would be a trespasser. While cases such as eBay v. Bidder's Edge didn't explicitly hold that robots.tx
  - Re:current state of things (Score:1)
    
    by swmccracken ( 106576 ) writes:
    
    Thankfully, US Case Law isn't binding on me, yet.
    
    Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.
    
    (Don't get me started on the so-called "justice" system. :-)
    
    I would prefer to hope that it becomes accepted knowledge that putting anything on a website is considered publication of that information... but this could just be idle hope.
    - that is a good statement. (Score:2)
      
      by www.sorehands.com ( 142825 ) writes:
      
      Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.
      
      I like that wording. robots.txt is a terms of use that a computer can usually understand.
- Re:This guy is a bit stupid, right? (Score:1)
  
  by You're All Wrong ( 573825 ) writes:
  
  """
  Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.
  
  Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
  """
  
  You need to see "Bowling for Columbine", particularly the parts about Canada and front doors.
  
  YAW.
  - Re:This guy is a bit stupid, right? (Score:2)
    
    by www.sorehands.com ( 142825 ) writes:
    
    Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.
    
    No, it is more like a sign at the airport that says "Employees only" and then when you are surrounded by the police, you claim "but there was no lock on the door."
    
    Or at a Radio Shack, there is a sign on the back room door, "Private, employees only."
- Re:This guy is a bit stupid, right? (Score:2)
  
  by Hard_Code ( 49548 ) writes:
  
  Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
  On the contrary, if there were a strange figure that was rifling through my house and refused to identify himself, I would sure as hell hope that the police would concern themselves, despite the fact that my doormat says "Welcome"...
  
  You are equating technology with law, and that is a very dangerous thing to do. That I have a technological means to commit a crime does not invalidate the fact that
- Re:This guy is a bit stupid, right? (Score:2, Insightful)
  
  by 91degrees ( 207121 ) writes:
  
  He's a bit of an idiot.
  
  I agree with the basic principles that this robot is being a little impolite though. The guy opens up his website, hoping that people will act in a civil manner. Cyveillancebot marches in there with the digital equivalent of hobnail boots, ignores the signs, and takes copies of everything, assuming that anything there is probably stolen.
  
  Equating it to mugging or breaking and entering is a bit much, but the shifty unshaven lurker seemed quite apt.
Cyveillance in a nutshell (Score:5, Informative)

by Anonymous Coward writes: on Tuesday May 06, 2003 @10:04PM (#5897759)

Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.

Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.

Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).

Drop 63.148.99.0/24 into the bit bucket and save your server some strain.

(By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)
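
The two habits described above (never refetching a URL, and pausing between requests) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SITE` link graph, the `polite_crawl` name, and the delay parameter are all invented for the example, and no real HTTP requests are made.

```python
import time
from collections import deque


def polite_crawl(start, get_links, delay_seconds=1.0, max_pages=100):
    """Breadth-first crawl that visits each URL once and pauses between fetches."""
    seen = {start}          # every URL ever queued, so loops cannot refetch
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        if fetched:
            time.sleep(delay_seconds)  # crude rate limit between requests
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


# Hypothetical forum: every thread links straight back to the index.
SITE = {
    "/forum": ["/forum/thread?id=1", "/forum/thread?id=2"],
    "/forum/thread?id=1": ["/forum"],
    "/forum/thread?id=2": ["/forum", "/forum/thread?id=1"],
}

order = polite_crawl("/forum", lambda url: SITE.get(url, []), delay_seconds=0)
print(order)  # ['/forum', '/forum/thread?id=1', '/forum/thread?id=2']
```

Without the `seen` set, the index and its threads would be fetched over and over, which is the index-to-thread-to-index loop the comment describes.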

- Re:Cyveillance in a nutshell (Score:1)
  
  by TubeSteak ( 669689 ) writes:
  
  Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window.
  you'd think a corp. would take more care. is it too hard to believe this bot will get stuck in a loop and tie up someone's bandwidth... perhaps sending them over their limit and costing them money? when you've got a name and an address that can be sued, it's best to use some common sense.
  maybe this bot doesn't do that, so feel free to explain why
- Re:Cyveillance in a nutshell (Score:5, Insightful)
  
  by PurpleFloyd ( 149812 ) writes: <`zeno20' `at' `attbi.com'> on Wednesday May 07, 2003 @02:14AM (#5898941) Homepage
  
  To me, these actions (hammering databases, getting caught in recursive loops that could be easily avoided) are much worse than ignoring robots.txt. While the whole robots.txt issue could be justifiable from their position (so people couldn't hide copyrighted info via robots.txt), bringing down servers through what amounts to a DoS attack is simply inexcusable.
  There are any [google.com] number [altavista.com] of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).
  
- Re:Cyveillance in a nutshell (Score:3, Interesting)
  
  by mdielmann ( 514750 ) writes:
  
  The ironic part is, they may well download material copyrighted by the web host, protected by a digital notice of the unacceptability of doing so...sounds like these guys want to play with the DMCA...
- Re:Cyveillance in a nutshell (Score:2)
  
  by toastyman ( 23954 ) * writes:
  
  Actually, they only have a /27
  
  OrgName: Cyveillance
  OrgID: CYVEIL
  Address: 1555 Wilson Blvd., Ste. 404
  City: Arlington
  StateProv: VA
  PostalCode: 22209-2405
  Country: US
  NetRange: 63.148.99.224 - 63.148.99.255
  CIDR: 63.148.99.224/27
  
  If you block the whole /24, you're hitting a few unrelated (probably innocent) organizations.
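
The /27 versus /24 difference is easy to check with Python's standard `ipaddress` module. This sketch uses only the two ranges quoted in the thread; the `is_blocked` helper and the sample addresses are made up for illustration.

```python
import ipaddress

# The range from the whois record above, and the wider /24 suggested earlier.
cyveillance = ipaddress.ip_network("63.148.99.224/27")
whole_block = ipaddress.ip_network("63.148.99.0/24")

print(cyveillance.num_addresses)        # 32
print(whole_block.num_addresses)        # 256
print(cyveillance[0], cyveillance[-1])  # 63.148.99.224 63.148.99.255

# Addresses a /24 block would catch that are outside the /27.
collateral = whole_block.num_addresses - cyveillance.num_addresses
print(collateral)  # 224


def is_blocked(address: str, network=cyveillance) -> bool:
    """True if the address falls inside the blocked network."""
    return ipaddress.ip_address(address) in network


print(is_blocked("63.148.99.230"))  # True
print(is_blocked("63.148.99.10"))   # False
```

Blocking the /27 covers all 32 addresses in the whois record while leaving the other 224 addresses in the /24 alone.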
- Re:Cyveillance in a nutshell (Score:1)
  
  by jago25_98 ( 566531 ) writes:
  
  ok, it looks like the abuse could lead to change. What will the inconvenience likely be for us?
  
  -> what are the defences for aggressive spiders and
  --> what is the impact of these defences?
  
  And, a case study. What happens if I copy+paste a WP posting to my own free site when:
  
  - site is hosted under cuban domain?
  - I copy data to paper word for word and fly to cuba, then submit and host there?
  
  ^ Laws for US/EU?
  
  Where might be a good source to answer these ridiculous legal copyright related questions? They se
- Re:Cyveillance in a nutshell (Score:3, Insightful)
  
  by Cy Guy ( 56083 ) * writes:
  
  Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.
  
  What I don't understand is why scouring the web for Copyrighted material is considered being violated. If you are depending on the copyright laws, then you must abide by the limitations on those rights. Once the copyright o
  - Intrusive Spybots (Score:2)
    
    by fm6 ( 162816 ) writes:
    
    What I don't understand is why scouring the web for Copyrighted material is considered being violated.
    
    Well, I certainly don't consider it wrong for copyright holders to search the web for theft of their IP. Problem is, Cyveillance does it in an extremely disruptive manner. It's probably not reasonable to expect the cyveillancebot to honor robots.txt, as Chris Gulker thinks it should. But if it doesn't act nicer than it currently does, then web masters will just lock it out -- and it will defeat its own p
- Re:And this is why many ISPs don't give log access (Score:4, Insightful)
  
  by km790816 ( 78280 ) writes: <wqhq3gx02 AT sneakemail DOT com> on Tuesday May 06, 2003 @11:30PM (#5898272)
  
  I totally agree...but...
  
  This is classic American business practice.
  
  We are a good, upstanding corporation.
  We want to protect our turf.
  We employ a company to help us.
  We don't ask about that company's means or, more likely, turn a blind eye.
  
  Dell would never agree that applications on the Internet should, in general, act the way that Cyveillancebot does.
  
  I believe that the author understands your point. He's not whining.
  
  He is, however, pointing out the hypocrisy, which I think is valuable. I'll think twice about buying another Dell.
  
IP-BLOCK TO BLOCK (Score:5, Informative)

by Oriumpor ( 446718 ) * writes: on Tuesday May 06, 2003 @10:33PM (#5897905) Homepage Journal

I used SAMSPADE [samspade.org] to look up the IP block they own (from the wonderful article). This is most definitely not their ONLY IP block, but if anyone has more, it would be great to compile a whole list of "mean" IPs.

If you do not care for this kind of intrusion (I equate it to exactly what spammers do to harvest your email...), you can block these IPs (route 'em to never-never land).

Another guy's experiences (Score:5, Informative)

by plsuh ( 129598 ) writes: <plsuh@noSpAM.goodeast.com> on Tuesday May 06, 2003 @10:58PM (#5898076) Homepage

Take a look at one guy's experiences [diveintomark.org] with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.

--Paul

- Traps, ripoffs. (Score:2)
  
  by fm6 ( 162816 ) writes:
  
  I'm not a webmaster, but it sounds like a spambot trap is close to being a necessary feature for a small web site. But I can't say I like the idea of using a firewall this way. Mark also provides a link to a site that supposedly does the same thing with Apache, but that site is offline. (!)
  I find it interesting that he can lock out Cyveillancebot and other spybots simply by banning their IP addresses. Sounds like Cyveillance and other "ebusiness intelligence" companies are being less than diligent in provi
- Re:Saddam, Cyveillance, etc. etc. (Score:5, Interesting)
  
  by gulker ( 174365 ) writes: on Wednesday May 07, 2003 @12:40AM (#5898586) Homepage
  
  The point isn't that I'm shocked to see material downloaded from a public Web site... the point is that Cyveillance brags about how it protects copyright: their PR placed a BusinessWeek piece about how they had forced a site that was using Washington Post content to pay up.
  
  Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.
  
  Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.
  
Sorry, but no sympathy at all. (Score:2)

by The Fink ( 300855 ) writes:

Sorry, mate, but as much as I dislike abuse of copyright (I've had some of my own works pillaged in the past), if you don't take steps to protect it, you can assume someone will copy it and use it illegitimately.
The best you can do is chase - legally if necessary - those who steal your work, and gain whatever compensation you can. Oh, and make sure that copyright is broadly proclaimed in the first instance, too.
No, the `bot shouldn't crawl past robots.txt (rfc-ignorant [rfc-ignorant.org], anyone?). But, given that it do
Block it with Apache and mod_rewrite! (Score:2)

by scrod ( 136965 ) writes:

RewriteCond %{REMOTE_HOST} ^www\.cyveillance\.com$
RewriteRule ^.*$ - [F]

Of course the actual address of the bot may vary.
- Re:Block it with Apache and mod_rewrite! (Score:1)
  
  by hansk ( 107187 ) writes:
  
  Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.
  
  See the above comment: Cyveillance in a nutshell [slashdot.org]
  
  You need to block its IP:
  # Cyveillance (63.148.99.224 - 63.148.99.255)
  RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.(22[4-9]|2[3-5][0-9])$
  
  # FILTER BOTS : 403-Forbidden
  RewriteRule ^.* - [F,L]
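
A quick way to sanity-check the last-octet alternation in the rule above is to run it against every possible final octet. The sketch below uses Python's `re` module rather than Apache's regex engine, on the assumption that the two agree for a pattern this simple, and it compares a dot-escaped form with a bare-dot form.

```python
import re

# Dot-escaped form of the REMOTE_ADDR pattern.
ESCAPED = re.compile(r"^63\.148\.99\.(22[4-9]|2[3-5][0-9])$")
# Same pattern with bare dots, where "." matches any single character.
UNESCAPED = re.compile(r"^63.148.99.(22[4-9]|2[3-5][0-9])$")

# Which final octets (0-255) does the escaped pattern accept?
matched_octets = [n for n in range(256) if ESCAPED.match(f"63.148.99.{n}")]
print(matched_octets[0], matched_octets[-1], len(matched_octets))  # 224 255 32

# A string that is not an IP address at all.
not_an_ip = "63x148y99z224"
print(bool(UNESCAPED.match(not_an_ip)))  # True
print(bool(ESCAPED.match(not_an_ip)))    # False
```

The alternation covers exactly 224 through 255, the /27 from the whois record. With bare dots the rule would still catch the bot, since REMOTE_ADDR only ever holds real addresses, but escaping them states the intent precisely.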
  - Re:Block it with Apache and mod_rewrite! (Score:2)
    
    by scrod ( 136965 ) writes:
    
    Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.
    
    You need to block its IP
    
    Uh huh, and did you see my rule mention HTTP_USER_AGENT anywhere in it? No. Look at what you wrote--the only difference between your rule and mine is that you followed my advice and used an IP address range instead of the host name.
    - Re:Block it with Apache and mod_rewrite! (Score:1)
      
      by hansk ( 107187 ) writes:
      
      Yup, you are correct. But using REMOTE_HOST may not work if your server does not have reliable reverse DNS lookups. It can also add overhead because of the lookup time. Therefore, banning by IP is better.
CYVEILLANCEBOT (Score:4, Funny)

by moc.tfosorcimgllib ( 602636 ) writes: on Wednesday May 07, 2003 @08:30AM (#5900059) Journal

C EVIL BOT CAN LYE

What robots.txt? (Score:1)

by eet23 ( 563082 ) writes:

64.68.82.39 - - [05/May/2003:15:18:23 -0700] "GET /robots.txt HTTP/1.0" 404 275 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Unless I am misunderstanding the log entry, robots.txt doesn't actually exist on this guy's server. So why does he spend so much time complaining about this thing not looking for it?
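
For anyone who wants to check that reading of the log entry, here is a small Python sketch that splits the line into fields. It assumes the entry is in Apache's combined log format; the regular expression and field names are my own, not anything taken from the server in question.

```python
import re

LOG_LINE = (
    '64.68.82.39 - - [05/May/2003:15:18:23 -0700] '
    '"GET /robots.txt HTTP/1.0" 404 275 "-" '
    '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
)

# Assumed layout: host ident user [time] "request" status size "referer" "agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

entry = COMBINED.match(LOG_LINE).groupdict()
print(entry["path"], entry["status"])  # /robots.txt 404
print(entry["agent"])  # Googlebot/2.1 (+http://www.googlebot.com/bot.html)
```

The 404 status says the server had no /robots.txt to return for that one request. A single log line cannot show whether the file existed at other times.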
And now, Cyveillance's robots.txt file (Score:2)

by tregoweth ( 13591 ) writes:

HTMLized version of Cyveillance's robots.txt file [cyveillance.com], for your browsing pleasure:
User-agent: *
Disallow: /web/us/partners/submit_pw.asp [cyveillance.com]
Disallow: /web/uk/partners/submit_pw.asp [cyveillance.com]
Disallow: /web1/us/partners/submit_pw.asp [cyveillance.com]
Desallow: /web1/uk/partners/submit_pw.asp [cyveillance.com]

Notice how they misspelled "Disallow" in the fourth item, and that none of the pages seem to exist. Good job, Cyveillance!
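
The practical effect of the misspelling can be sketched with Python's standard `urllib.robotparser`, which, like parsers generally, skips directive names it does not recognize. The rules below are the four lines quoted above with the link markers removed; the bot name and host are hypothetical placeholders, and other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# The quoted rules, including the misspelled fourth directive.
ROBOTS_TXT = """\
User-agent: *
Disallow: /web/us/partners/submit_pw.asp
Disallow: /web/uk/partners/submit_pw.asp
Disallow: /web1/us/partners/submit_pw.asp
Desallow: /web1/uk/partners/submit_pw.asp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

PATHS = [
    "/web/us/partners/submit_pw.asp",
    "/web/uk/partners/submit_pw.asp",
    "/web1/us/partners/submit_pw.asp",
    "/web1/uk/partners/submit_pw.asp",
]

# "ExampleBot" and the host are placeholders; only the path matters here.
results = {
    path: parser.can_fetch("ExampleBot", "http://example.com" + path)
    for path in PATHS
}
for path, allowed in results.items():
    print(path, allowed)
```

With this parser the first three paths come back disallowed and the fourth comes back allowed, because the "Desallow" line is silently dropped.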
