Meet Cyveillancebot 47
gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."
On the web? (Score:1)
ObviousGuy's axiom
shock! (Score:1)
what has the world come to?!?!?!
for the sarcasm impaired: why are you even reading things on the web anyway? just give up already.
Amusement! (Score:3, Insightful)
And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your priv
Re:Amusement! (Score:2, Insightful)
All my website is copyright me, but not private. I have no problem with sharing the results of my research with humans; however, I don't want my copyrights violated. I'm happy with Google caching them; I consider that a favour, as it does a public service like a library. This is different though: it's not a public resource.
If every website were to contain a query-response entry page which screened out non-humans (or unintelligent ones, or ones that can't r
Re:Amusement! (Score:2)
Re:Amusement! (Score:2, Insightful)
Maybe some people approve of such things, but I ain't one of them.
YAW
Re:Amusement! (Score:2)
NetObjects Fusion does that too (Score:2)
This is the same as dealing with Gnutella (Score:1)
This guy is a bit stupid, right? (Score:5, Informative)
Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.
I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt disallows everyone.)
"It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: Crawling all of slashdot would be much larger than slashdot itself because of all the different views of comments you can have. That's why the robots.txt of
Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.
Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
Re:This guy is a bit stupid, right? (Score:1)
Just because the Donut Brigade won't move enough to make the powdered sugar fall from their rotund bellies doesn't make breaking and entering any more legal, regardless of how little effort *you* think 'breaking' requires...
Re:This guy is a bit stupid, right? (Score:1)
Yeah, it's a bit of a stretch though, I know.
But, thinking that "reading links on your site that you don't want them to even though you didn't try and stop them is an invasion" is just naive and stupid.
current state of things (Score:1)
And this would get you sued if they didn't like what you were doing anymore, as you would be a trespasser. While cases such as eBay v. Bidder's Edge didn't explicitly hold that robots.tx
Re:current state of things (Score:1)
Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.
(Don't get me started on the so-called "justice" system.)
I would prefer to hope that it becomes accepted knowledge that putting anything on a website is considered publication of that information... but this could just be idle hope.
that is a good statement. (Score:2)
I like that wording. robots.txt is a terms of use that a computer can usually understand.
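To make the "terms of use a computer can understand" idea concrete, here is a minimal sketch of what a well-behaved crawler does before fetching a page, using Python's standard urllib.robotparser. The bot name "ExampleBot", the example.com URLs, and the robots.txt contents are made up for illustration; the first policy just mirrors the "disallow everyone" file described upthread. Nothing here stops a crawler that never asks, which is the whole complaint about Cyveillance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policies for illustration only.
DISALLOW_EVERYONE = """\
User-agent: *
Disallow: /
"""

PARTIAL = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite bot asks first and skips the page when the answer is no.
    print(may_fetch(DISALLOW_EVERYONE, "ExampleBot", "http://example.com/essay.html"))
    print(may_fetch(PARTIAL, "ExampleBot", "http://example.com/public.html"))
```

The check is entirely voluntary: it runs on the crawler's side, so the file only works as a request, not as a lock.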
Re:This guy is a bit stupid, right? (Score:1)
"""
Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.
Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
"""
You need to see "Bowling for Columbine", particularly the parts about Canada and front doors.
YAW.
Re:This guy is a bit stupid, right? (Score:2)
No, it is more like a sign at the airport that says "Employees only" and then when you are surrounded by the police, you claim "but there was no lock on the door."
Or at a Radio Shack, there is a sign on the back room door, "Private, employees only."
Re:This guy is a bit stupid, right? (Score:2)
On the contrary, if there were a strange figure that was rifling through my house and refused to identify himself, I would sure as hell hope that the police would concern themselves, despite the fact that my doormat says "Welcome"...
You are equating technology with law, and that is a very dangerous thing to do. That I have a technological means to commit a crime does not invalidate the fact that
Re:This guy is a bit stupid, right? (Score:2, Insightful)
I agree with the basic principles that this robot is being a little impolite though. The guy opens up his website, hoping that people will act in a civil manner. Cyveillancebot marches in there with the digital equivalent of hobnail boots, ignores the signs, and takes copies of everything, assuming that anything there is probably stolen.
Equating it to mugging or breaking and entering is a bit much, but the shifty unshaven lurker seemed quite apt.
Cyveillance in a nutshell (Score:5, Informative)
The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.
Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.
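For reference, the two habits described above (never refetch a URL you have already seen, and pause between requests) take only a few lines. This is a rough sketch, not anyone's actual spider: the SITE link graph, the crawl function, and the 30-second default delay are all invented for illustration, and the "fetch" is just a dictionary lookup so the example runs without touching a network.

```python
import time
from collections import deque
from typing import Callable, List

# Made-up link graph: the forum index links to threads, and every thread
# links straight back to the index (the loop described above).
SITE = {
    "/forum": ["/thread?id=1", "/thread?id=2"],
    "/thread?id=1": ["/forum"],
    "/thread?id=2": ["/forum", "/thread?id=1"],
}


def crawl(
    start: str,
    get_links: Callable[[str], List[str]],
    delay_seconds: float = 30.0,
    max_pages: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Breadth-first crawl that visits each URL once and waits between fetches."""
    seen = {start}
    queue = deque([start])
    fetched: List[str] = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        if fetched:
            sleep(delay_seconds)  # rate limit: pause before every fetch after the first
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:  # loop guard: never queue a URL twice
                seen.add(link)
                queue.append(link)
    return fetched


if __name__ == "__main__":
    # A no-op sleep keeps the demo instant; a real crawler would keep the delay.
    print(crawl("/forum", lambda u: SITE.get(u, []), sleep=lambda _seconds: None))
```

With the seen set in place, the index-to-thread-to-index cycle ends after three fetches instead of running forever.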
Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).
Drop 63.148.99.0/24 into the bit bucket and save your server some strain.
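If you can't touch the router or firewall, the same block can be approximated in application code. A minimal sketch using Python's standard ipaddress module; the BLOCKED list and the is_blocked helper are hypothetical names, and the only network in the list is the /24 quoted above.

```python
import ipaddress

# The range named in this comment; add further networks as needed.
BLOCKED = [ipaddress.ip_network("63.148.99.0/24")]


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)


if __name__ == "__main__":
    for candidate in ("63.148.99.230", "63.148.100.1"):
        print(candidate, is_blocked(candidate))
```

Dropping the packets at the network edge is still cheaper, since the request never reaches the web server at all.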
(By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)
Re:Cyveillance in a nutshell (Score:1)
you'd think a corp. would take more care. is it too hard to believe this bot will get stuck in a loop and tie up someone's bandwidth... perhaps sending them over their limit and costing them money? when you've got a name and an address that can be sued, it's best to use some common sense.
maybe this bot doesn't do that, so feel free to explain why
Re:Cyveillance in a nutshell (Score:5, Insightful)
There are any [google.com] number [altavista.com] of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).
Re:Cyveillance in a nutshell (Score:3, Interesting)
Re:Cyveillance in a nutshell (Score:2)
OrgName: Cyveillance
OrgID: CYVEIL
Address: 1555 Wilson Blvd., Ste. 404
City: Arlington
StateProv: VA
PostalCode: 22209-2405
Country: US
NetRange: 63.148.99.224 - 63.148.99.255
CIDR: 63.148.99.224/27
If you block the whole
Re:Cyveillance in a nutshell (Score:1)
-> what are the defences for aggressive spiders and
--> what is the impact of these defences?
And, a case study. What happens if I copy+paste a WP posting to my own free site when:
- site is hosted under cuban domain?
- I copy data to paper word for word and fly to cuba, then submit and host there?
^ Laws for US/EU?
Where might be a good source to answer these ridiculous legal copyright related questions? They se
Re:Cyveillance in a nutshell (Score:3, Insightful)
What I don't understand is why scouring the web for Copyrighted material is considered being violated. If you are depending on the copyright laws, then you must abide by the limitations on those rights. Once the copyright o
Intrusive Spybots (Score:2)
Well, I certainly don't consider it wrong for copyright holders to search the web for theft of their IP. Problem is, Cyveillance does it in an extremely disruptive manner. It's probably not reasonable to expect the cyveillancebot to honor robots.txt, as Chris Gulker thinks it should. But if it doesn't act nicer than it currently does, then web masters will just lock it out -- and it will defeat its own p
Re:And this is why many ISPs don't give log access (Score:4, Insightful)
This is classic American business practice.
We are a good, upstanding corporation.
We want to protect our turf.
We employ a company to help us.
We don't ask about that company's means or, more likely, turn a blind eye.
Dell would never agree that applications on the Internet should, in general, act the way that Cyveillancebot does.
I believe that the author understands your point. He's not whining.
He is, however, pointing out the hypocrisy, which I think is valuable. I'll think twice about buying another Dell.
IP-BLOCK TO BLOCK (Score:5, Informative)
If you do not care for this kind of intrusion (I equate this to exactly what spammers do to harvest your email...) then you can block these IPs (route 'em to never-never land).
Another guy's experiences (Score:5, Informative)
--Paul
Traps, ripoffs. (Score:2)
I find it interesting that he can lock out Cyveillancebot and other spybots simply by banning their IP addresses. Sounds like Cyveillance and other "ebusiness intelligence" companies are being less than diligent in provi
Re:Saddam, Cyveillance, etc. etc. (Score:5, Interesting)
Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.
Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.
Sorry, but no sympathy at all. (Score:2)
The best you can do is chase - legally if necessary - those who steal your work, and gain whatever compensation you can. Oh, and make sure that copyright is broadly proclaimed in the first instance, too.
No, the `bot shouldn't crawl past robots.txt (rfc-ignorant [rfc-ignorant.org], anyone?). But, given that it do
Block it with Apache and mod_rewrite! (Score:2)
RewriteRule ^.*$ - [F]
Of course the actual address of the bot may vary.
Re:Block it with Apache and mod_rewrite! (Score:1)
Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.
See the above comment: Cyveillance in a nutshell [slashdot.org]
You need to block its IP:
# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.(22[4-9]|2[3-5][0-9])$
# FILTER BOTS : 403-Forbidden
RewriteRule ^.* - [F,L]
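A quick way to sanity-check a hand-written address regex like this one is to compare it against the CIDR block from the whois record quoted upthread (63.148.99.224/27). The sketch below does that with Python's standard re and ipaddress modules. The pattern is written here with the dots escaped so they match literally; the function names are made up for the example.

```python
import ipaddress
import re

# Same alternation as the rule above, with literal dots.
PATTERN = re.compile(r"^63\.148\.99\.(22[4-9]|2[3-5][0-9])$")

# The block listed in the whois output upthread.
NETWORK = ipaddress.ip_network("63.148.99.224/27")


def regex_matches() -> set:
    """Every address in 63.148.99.0-255 that the regex accepts."""
    candidates = (f"63.148.99.{n}" for n in range(256))
    return {addr for addr in candidates if PATTERN.match(addr)}


def network_addresses() -> set:
    """Every address in the /27, including the first and last."""
    return {str(ip) for ip in NETWORK}


if __name__ == "__main__":
    print(regex_matches() == network_addresses())
    print(len(network_addresses()))
```

The alternation would also accept a last octet of 256 to 259 as text, but no real client address can look like that, so for valid addresses the regex and the /27 cover the same range.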
Re:Block it with Apache and mod_rewrite! (Score:2)
Uh huh, and did you see my rule mention HTTP_USER_AGENT anywhere in it? No. Look at what you wrote--the only difference between your rule and mine is that you followed my advice and used an IP address range instead of the host name.
Re:Block it with Apache and mod_rewrite! (Score:1)
CYVEILLANCEBOT (Score:4, Funny)
What robots.txt? (Score:1)
Unless I am misunderstanding the log entry, robots.txt doesn't actually exist on this guy's server. So why does he spend so much time complaining about this thing not looking for it?
And now, Cyveillance's robots.txt file (Score:2)
Notice how they misspelled "Disallow" in the fourth item, and that none of the pages seem to exist. Good job, Cyveillance!