
LinkedIn Says It's Illegal To Scrape Its Website Without Permission (arstechnica.com)

A small company called hiQ is locked in a high-stakes battle over web scraping with LinkedIn. It's a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the web. From a report: HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting. LinkedIn, which was acquired by Microsoft last year, sent hiQ a cease-and-desist letter warning that this scraping violated the Computer Fraud and Abuse Act, the controversial 1986 law that makes computer hacking a crime. HiQ sued, asking courts to rule that its activities did not, in fact, violate the CFAA. James Grimmelmann, a professor at Cornell Law School, told Ars that the stakes here go well beyond the fate of one little-known company. "Lots of businesses are built on connecting data from a lot of sources," Grimmelmann said. He argued that scraping is a key way that companies bootstrap themselves into "having the scale to do something interesting with that data." [...] But the law may be on the side of LinkedIn -- especially in Northern California, where the case is being heard. In a 2016 ruling, the 9th Circuit Court of Appeals, which has jurisdiction over California, found that a startup called Power Ventures had violated the CFAA when it continued accessing Facebook's servers despite a cease-and-desist letter from Facebook.

  • by Anonymous Coward on Monday July 31, 2017 @01:42PM (#54914951)

    don't make it public if you don't want it read

    • by Anonymous Coward on Monday July 31, 2017 @02:07PM (#54915221)

      don't make it public if you don't want it read

      They want it read. By people. (And search engines.) They don't want it read by companies that take the information and then sell it as their business model.

      If we support hiQ, saying that scraping publicly-accessible content from another site and then using that for profit is permissible, then doesn't that mean it's also applicable to other sites? Slashdot's content is public: can I scrape everything, host it on my site, insert ads, and make money?

      Sorry hiQ, as much as software and internet legislation is behind the times and technically inappropriate, there are some things in law which follow common sense - and one of them is you can't take someone else's stuff and sell it for yourself. If you want to use their content then you need to follow the (common) practice of establishing some sort of licensing agreement.

      But anyways, what about their user agreement?

      You agree that you will not: [...] Develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology or manual work) to scrape the Services or otherwise copy profiles and other data from the Services;

      Is that not enough for at least an injunction and civil suit?

      • by BronsCon ( 927697 ) <social@bronstrup.com> on Monday July 31, 2017 @02:19PM (#54915327) Journal

        They don't want it read by companies that take the information and then sell it as their business model.

        What do search engines do, then?

        • by tattood ( 855883 )

          What do search engines do, then?

          Search engines create an index that is searchable and make money by selling ads on the search page. Search engines are NOT collecting the website data, making correlations about the data on the website, and selling that data to companies.

          • 1) collecting the website data. Check. The spider downloads all of the text content and stores an index along with contextual relationships.

            2) make correlations about the data on the website. Check. The hyperlinks on the web site are used to evaluate the relative importance of the linked web site.

            3) selling that data to companies. Nearly. They don't charge for the search engine directly - they charge advertisers and then provide the data for free to visitors. More or less the same thing effectively sp

            • but the "less" and "effectively speaking" are the keys to the whole thing. Along with that pesky little thing called copying without permission for the intended usage. For some concrete, everyday examples, wander into any Catholic or Methodist or LDS (Mormon) church and look through the hymnal. You will find something at the beginning explaining how all the music can be copied for non-commercial use except as otherwise noted. And then you will find that some of the songs are marked with phrases that fit the

        • Where exactly in the complaint by LinkedIn are they telling search engines to not index linkedin.com?
          • Not relevant to the argument being made, which was in response to someone claiming they want search engines to be able to do what search engines do then, in the very next sentence, claiming they don't want search engines to be able to do what search engines do.
            • But it seems that LinkedIn wants precisely that, i.e., for search engines to continue to do what they do but not let hiQ do what hiQ does.
              • Right. I was replying to this, though:

                They want it read. By people. (And search engines.) They don't want it read by companies that take the information and then sell it as their business model.

                I was pointing out that search engines "take the information and then sell it as their business model."

                Sorry you missed that.

                • I didn't miss that; it just looks like you think that extracting the title of a page constitutes "take the information and then sell it", something that is covered by fair use. It would be a whole different affair if, e.g., Google extracted and resold the amount of information that hiQ does, which of course was the point of the GP.
                  • I didn't miss that; it just looks like you think that extracting the title of a page constitutes "take the information and then sell it"

                    No, it looks like you missed where they're taking the entire content of the page and not just the title, since you keep coming back to "just the title".

                    It would be a whole different affair if, e.g., Google extracted and resold the amount of information that hiQ does, which of course was the point of the GP.

                    So, if Google took only key pieces of information, rather than the entire page, that would be problematic? Because Google takes the whole page, while hiQ takes key pieces of data; Google is actually taking, repackaging, and profiting from more of LinkedIn's data than hiQ is.

                    But, all of that is still highly irrelevant to what I was replying to.

                    • But Google is doing something of which LinkedIn approves and has given Google permission to do. hiQ, on the other hand, is doing something of which LinkedIn does not approve and has not given hiQ permission to do. That is entirely the difference here. LinkedIn believes that they benefit from the way Google indexes their pages and allows them to be searched, but LinkedIn believes that what hiQ does is harmful to LinkedIn as it will tend to drive people away.

                      I understand that people who have never created anyth

                    • But Google is doing something of which LinkedIn approves and has given Google permission to do.

                      Have they, though? Or have they simply not asked them to stop?

                      I understand that people who have never created anything of value or who believe strongly in socialism have no concept of ownership of property

                      Lovely assumption, but incorrect. I, in fact, have created quite a bit of value in this world. Just as a small sample, my clients value me enough to keep me employed long-term and my employees value the income and stability I provide them. So, then, you must think I'm a socialist? Why is that? Wait, no, you can't possibly think I have no concept of ownership of property when I've stated that LinkedIn has ownership of the data they've collected.

        • by bws111 ( 1216812 )

          It doesn't matter what search engines do. The owner of the site is perfectly within his rights to say 'these accesses are allowed, these are not'.

          Being indexed by a search engine is probably beneficial to LinkedIn. Both parties gain from being indexed, it is a symbiotic relationship.

          HiQ is probably not beneficial. By ratting out LinkedIn's users to their employers, they are potentially decreasing the number of people who will use LinkedIn. That is a parasitic relationship.

          • The owner of the site is perfectly within his rights to say 'these accesses are allowed, these are not'.

            Yes, and they can do that with HTTP 200 and HTTP 403 status codes, respectively.
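
As a rough illustration of the point about status codes, here is a minimal Python sketch of a site answering 200 to most clients and 403 to ones it has decided not to serve. The function name `status_for`, the `BLOCKED_AGENT_SUBSTRINGS` tuple, and the `examplescraper` agent are hypothetical and do not come from LinkedIn or the article. Matching on the User-Agent header is also an assumption, and a client can send any User-Agent string it likes.

```python
from http import HTTPStatus

# Hypothetical policy: User-Agent substrings the site operator refuses to serve.
BLOCKED_AGENT_SUBSTRINGS = ("examplescraper",)


def status_for(user_agent: str) -> HTTPStatus:
    """Return the status a server following this policy would answer with."""
    agent = user_agent.lower()
    if any(blocked in agent for blocked in BLOCKED_AGENT_SUBSTRINGS):
        # "These accesses are not allowed."
        return HTTPStatus.FORBIDDEN
    # "These accesses are allowed."
    return HTTPStatus.OK
```

A real deployment would apply a check like this in the web server or a proxy in front of it; the sketch only shows the allow/deny decision itself.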

            • by bws111 ( 1216812 )

              Sure, they CAN do that, but they don't HAVE to do that. Once you have been told you don't have permission, you don't have permission.

              • Actually, anything you're able to view from a public space is fair game under current laws, with the exception of court orders stating otherwise. If hiQ's servers can view the content from the public internet (that is, if LinkedIn's servers serve it to them without them hacking around some technical measure), it's fair game unless LinkedIn gets an injunction against hiQ. That is, what you're claiming is really for the courts to decide.

                Or, you know, LinkedIn could just claim copyright on their data and iss
                • Data is not copyrightable, because it isn't a "work of authorship" under the copyright statutes. That's why LinkedIn is using this hacking law in a contorted way to try to stop the use of this content.
                • A library shelf is a public space and so is a museum wall. Are you claiming that anybody has the right to walk in, take pictures or photocopies of anything in those public spaces and resell those copies and that they are not violating current law? I would be asking those several lawyers that you consulted with for a refund.

                  • When it comes to reading (viewing in your museum example), which is what was discussed in the argument I was originally replying to, the above is absolutely true. When it comes to copying, it's a little more nuanced than that, of course; but, then, I was writing a Slashdot post, not a fucking dissertation, I certainly was not giving legal advice and, again, was arguing against someone who claimed that merely viewing something viewable from public space, which the owner readily serves up to you with no techn
                  • by AK Marc ( 707885 )
                    I can go into the Louvre, sit in front of the Mona Lisa, and sketch an exact replica of it, down to the brush stroke (except for the fact that there are always people standing in front of it trying to take a selfie), then sell that copy. Wait, what was your point again?
        • by AHuxley ( 892839 )
          Re "What do search engines do, then?"
          Connecting people who worked on secret mil/gov projects with people looking for staff to work on other secret mil/gov projects.
          So people list all the projects they worked with and can show they are trusted in plain text.
          They used the same methods in the gov/mil and just expect the same results on the net.
      • I guess you can't print a page either.
      • by Dog-Cow ( 21281 )

        If MS asserted a copyright claim, that would be different. There is no fraud and no hacking taking place when scraping publicly-accessible data.

      • Slashdot's content is public: can I scrape everything, host it on my site, insert ads, and make money?

        Copyright law clearly makes that illegal. This case is a little different in that it seems to be about the kind of data that can't be copyrighted.

        • by AK Marc ( 707885 )
          Then why didn't they file a copyright complaint? Instead, they are claiming "hacking" for viewing public information. (not copyright for using it, but "hacking" for viewing). Copyright is irrelevant, and not the complaint.
      • Slashdot's content is public: can I scrape everything, host it on my site, insert ads, and make money?

        Plain old copyright law is enough to put an end to that. However, facts are not copyrightable, and LinkedIn has a lot of valuable facts in its database.

        But anyways, what about their user agreement?

        Can you put a EULA in a document folder (page 50 in a stack of 200 pages) and throw it on the ground in the park, and expect to enforce it when it tells people not to read the other pages in the envelope? That's the physical-world equivalent.

      • by AK Marc ( 707885 )
        So the solution is to provide public APIs, and request scrapers use those, so the data access can be tracked and identified just like when humans and search engines use it.

        If they make it public and predictable so search engines point to them, then they have given a robots.txt that allows that use, so it's "licensed" by the lack of controls, same as search engines.

        But anyways, what about their user agreement?

        The search engines never log in or agree to the user agreement, and this use seems to be a search engine that doesn't simply direct views to the

    • Exactly! (Score:2, Informative)

      by Anonymous Coward

      I refuse to use any social media site including LinkedIN. A lot of companies - such as Goodwill - recruit exclusively from LinkedIN. Fuck'em.

      I don't work for any company that uses social media for recruiting.

  • Because if it's not illegal to scrape their websites, black hat hackers will have a field day.

  • by GerryGilmore ( 663905 ) on Monday July 31, 2017 @01:49PM (#54915017)
    Using some add-on Python packages, it is ridiculously easy to scrape any web page, even those that use ASP (it's a PITA to get set up the first time, but...). The ONLY thing that stops it - aside from legal action, apparently - is to have a login mechanism in front. Without authenticating, it's a no-go.
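
To give a sense of how little code is involved, here is a minimal sketch that pulls headings out of a page using only Python's standard library `html.parser`. It is not one of the add-on packages the comment refers to, and nothing here reflects what hiQ actually runs. The `HeadingExtractor` class, the `extract_headings` helper, and the `SAMPLE_PAGE` markup are all hypothetical, and the example parses a hard-coded string rather than fetching anything over the network.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text content of every <h1> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headings.append("")  # start a new heading buffer

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings[-1] += data


def extract_headings(html: str) -> list:
    """Return the stripped text of each <h1> in the given HTML string."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# Hypothetical markup standing in for a public profile page.
SAMPLE_PAGE = "<html><body><h1>Jane Doe</h1><p>Engineer</p><h1> Skills </h1></body></html>"
```

Calling `extract_headings(SAMPLE_PAGE)` walks the markup once and returns the heading text; a page behind a login would never reach the parser without credentials, which is the commenter's point.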
  • by ErichTheRed ( 39327 ) on Monday July 31, 2017 @01:54PM (#54915075)

    Airline websites have this same problem -- the online "cheap ticket" engines regularly scrape the publicly available data by essentially running the "book a trip" workflow millions of times to try to pull the entire set of fares for different city pairs. It's a cat-and-mouse game because the information has to be available for normal humans to book trips; no one is going to solve a CAPTCHA to look up fares. Basically these engines are looking for any irregularities like mis-filed fares or fares that happen to be a particularly good deal. (Airlines have to publish their fares in advance and make them available to online sources that are available to travel agents. This is why you'll occasionally see stuff like a transatlantic business class ticket for $50 or similar...)

    I'm not sure if LinkedIn can actually bar someone from scraping their public data. If that were the case, no one could run wget on a website and pull down all the static content.

    • by shuz ( 706678 )

      I have direct experience with this myself.

      This is why companies like Akamai have products geared specifically for this problem. However, stopping bots is nearly impossible unless you deal with them on a real-time basis. It would be interesting if LinkedIn could get the entire world to make website scrapers illegal and then actually enforce that illegality. As of now, when a bot owner is shut down, they just move the operation overnight to the ISP that will take their business in the same country or move countrie

      • Wouldn't it just be easier to run your bots through multiple VPNs with endpoints in different countries?

      • No offense but you are a complete noob if you're trying to scrape sites without connecting through proxies. LinkedIn will start sending 403s almost right away.
  • This is bonkers! (Score:5, Interesting)

    by Zobeid ( 314469 ) on Monday July 31, 2017 @02:05PM (#54915205)

    Here's why it seems bonkers to me. . . When you access a website, you are merely sending that site a request for information. That's all. Assuming it responds with the requested information, one must presume that's because the operator (and, by proxy, the owner) of the website set it up for that purpose. So what we have here is effectively. . .

    LinkedIn: Don't request information from us!

    hiQ: Please send the following information.

    LinkedIn: OK, here you go.

    LinkedIn: Dammit, you requested information after we told you not to! WE'RE GONNA SUE!!

    • Re:This is bonkers! (Score:5, Interesting)

      by bluefoxlucid ( 723572 ) on Monday July 31, 2017 @02:19PM (#54915323) Homepage Journal

      Actually, LinkedIn has a point.

      LinkedIn supplies service to the public at-large, in the same way that a MicroCenter supplies retail service to the public at-large. All members of the public are allowed to enter a MicroCenter. You walk up to the doors and they open automatically.

      You can be trespassed for no reason by a retail center or other physical location open to the public at-large. The doors still open to you, but you're not allowed in. It's the same with a Web site: it's difficult in-practice to establish a verifiable packet identity on the Internet. IP addresses change, and you can do goofy shit like put the data scrapes in AJAX requests to distribute their source.

      In other words: you're by default authorized to access LinkedIn's public assets. You're not allowed to access stuff requiring a logged-in session until you've gotten log-in credentials, because there are actual systems in place to stop you from doing that, implying that you're not supposed to force access there. Basically, civilized understanding of the expectations of your host on the face.

      If LinkedIn tells you to stop, you've now had your authorization revoked. You can't claim a restraining order is invalid because someone's outside and you can also be anywhere outside, and you also can't claim that LinkedIn can't de-authorize you unless they specifically identify and block you. Blocking an individual entity from a Web site is hard and has collateral damage.

      So the CFAA is actually a valid vehicle here, since "abuse" is essentially defined as "accessing a system to which you are not authorized." The reasonable person test holds up a lot of behavior, largely because it's unreasonable for a person to determine if a certain behavior or function on a Web site might not be something they're allowed to touch, or whatnot, given the reasonable behavior of people at-large. A lot of stuff happens that won't pass CFAA as fraud or abuse, even though it's inconvenient and unintended. By the same token, when somebody has told you to stop accessing their systems in a certain way and you do it anyway, a reasonable person might assume you were, you know, told not to, and not allowed to do that, and that you know damned well you're not allowed to do that.

      That's not to say threats, lawyers, and other anti-social behavior are good business. Poor diplomacy here. Effective in the legal field, but not your best option.

      • Then blacklist IPs at the firewall(s) for endpoints that are scraping your site.
        • by tattood ( 855883 ) on Monday July 31, 2017 @02:53PM (#54915581)

          Then blacklist IPs at the firewall(s) for endpoints that are scraping your site.

          IP addresses are fairly easy to change. You can use something like Tor, so your public IP always changes.
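
For readers who have not seen one, an IP blocklist check amounts to something like the following Python sketch. The `BLOCKED_NETWORKS` list and the `is_blocked` helper are hypothetical, and the addresses come from the ranges reserved for documentation, not from any real scraper.

```python
import ipaddress

# Hypothetical blocklist: one whole /24 range plus one single address.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

The weakness the reply describes is visible here: the check only knows addresses, so a scraper that comes back from an address outside the list passes straight through, while blocking wider ranges starts catching unrelated visitors.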

        • Let's try this again.

          it's difficult in-practice to establish a verifiable packet identity on the Internet. IP addresses change, and you can do goofy shit like put the data scrapes in AJAX requests to distribute their source.

          Blocking an individual entity from a Web site is hard and has collateral damage.

          Wikipedia has tried this, with collateral damage and limited success. I've seen people get sent to jail for harassment and legally barred from accessing certain sites and systems under restraining order, and then continue to access them with no reasonable way to prove their identity (i.e. could be someone else pretending to be said person).

          These days, it's different. Those IP addresses are probably automatically-assigned or internal to cloud infrastructure. IAAS may share address

      • by Ichijo ( 607641 )

        Except when you enter a MicroCenter, you are setting foot on their property. When you anonymously request a public web page from a web server, you're standing on the public sidewalk at the walk-up window. Since you as a taxpayer own that sidewalk, can the store owner restrain you from your own property as a way to make you stop placing orders at the window?

        From TFA:

        [Orin Kerr, a legal scholar at George Washington University] argues sites wanting to limit access to their site should be required to use a tec

        • When you request a public Web page, you're accessing and using their machinery.

          My entire argument was "available to the public" versus "except you; you get the hell out right now." A technical mechanism is infeasible: if they want the data to be publicly-viewable and don't want people to do certain things, then a password doesn't work; and firewalls and the like will have to contend with modern global, auto-scaling, IP-changing data centers where you can't just single out a particular actor by IP addres

      • A trespass notice can't stop you looking at MicroCenter from a public space.

        If you want to restrict someone in a public space, you need a restraining order from a judge.

        Transferring that to the internet, a cease and desist letter is like a trespass notice. Probably appropriate for telling someone to stop creating new logins to access restricted content after you disable their old ones.

        Asking a judge for an injunction would be appropriate to stop someone accessing publicly available content. Of course, this

        • Actually, transferring that to the Internet, you have to walk into the MicroCenter and turn the display around, then go back outside the window and look at it again to get a view of what's there. Every time you want to see it, you have to walk inside, fiddle with things, then walk back out.

          You do know that nothing is actually "on the Internet", right? Do we need to explain to you how the Internet works?

      • LinkedIn supplies service to the public at-large

        OK, there's where you're wrong.

      • You're not allowed to access stuff requiring a logged-in session until you've gotten log-in credentials, because there are actual systems in place to stop you from doing that, implying that you're not supposed to force access there.

        Actually, if the scraper used a valid username and password (or other valid credentials) to gain access, access was authorized. It might have violated a user agreement perhaps, but that's a separate civil matter. The Computer Fraud and Abuse Act specifies criminal acts that a private entity (like LinkedIn) can't use as a basis for its suit.

        • The point wasn't that they used a password; there was a further point down that LinkedIn had de-authorized them from non-password-protected mechanisms: they told them they're now specifically not allowed to do that, which means they're not.

          Imagine if you ssh'd to a bank's accounting system across the 'net and found that it just lets you log in as root, no password. Is that also legal?

    • by bws111 ( 1216812 )

      Many stores have doors that you can open by pushing a button. Assuming the door opens, one must presume that is because the management (and, by proxy, the owner) of the store has set it up for that purpose. So what we have here is effectively..

      Store: You have been banned from this store. Do not come back
      You: Push the button
      Store: Door opens, you go in
      Store: We told you to stay out, we're having you arrested for trespassing

      This, of course, happens all the time (except for the idiotic assertion that the

    • Trying to make it illegal to scrape the data is beside the point -- what linkedin really wants to do is prevent others from publishing the data. Just because you can find a book in the library and the book doesn't fire lasers at your eyes to blind you and stop you reading it doesn't mean you have permission to sell your own book which consists of photocopies of that book with a few small changes.

  • I refer to the robots.txt used to tell search engines what's out of bounds. http://www.searchtools.com/rob... [searchtools.com]

    • But they want to be indexed by Google, just not by the company that tells employers their staff is looking.

      The solution is just to never, ever, stop looking. Even if you love your job, having a current resume on LinkedIn will get you better raises.

      • by Mandrel ( 765308 )

        But they want to be indexed by Google, just not by the company that tells employers their staff is looking.

        A robots.txt file can state which HTTP User Agent strings are allowed. For example, Slashdot only allows [slashdot.org] access by certain search engines. If you're starting a new one, you have to misrepresent yourself, or you're buggered. The question is when such misrepresentation is legal and moral, and whether it is instead up to sites to more accurately detect who they want to serve, and serve errors to those they don't.
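
The per-crawler rules described above can be sketched with Python's standard `urllib.robotparser`. The `ROBOTS_TXT` contents, the `allowed` helper, the `NewSearchBot` name, and the example.com URLs are hypothetical; this is not LinkedIn's or Slashdot's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler may fetch anything,
# every other user agent is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `allowed("Googlebot", "https://example.com/in/someone")` is permitted and `allowed("NewSearchBot", "https://example.com/in/someone")` is not. Note that robots.txt is only a request to well-behaved crawlers: the server still answers whoever asks, and a crawler that claims to be an allowed agent gets the allowed rules, which is exactly the misrepresentation question raised above.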

        The solution is just to never, ever, stop looking. Even if you love your job, having a current resume on LinkedIn will get you better raises.

        Again it pays to be the selfish squeaky wheel. The basis of advertising.

  • Now if LinkedIn had instead posted "ecto gammat", all the nerds would be in their corner.

  • LinkedIn's whole business model is "scraping" information from people. It's not like they pay people to enter that information.

    When CDDB tried this sort of B.S. it led to FreeDB. Maybe LinkedIn being assholes will lead to something similar.

  • Can we talk about what HiQ is doing with the data for a sec? "HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting" I mean WTF?

    • Like it or not, if you (or an employee in your example) choose to publish information about yourself in a publicly-accessible place, then you've voluntarily relinquished whatever privacy rights you had in that information. Whatever you believe about HiQ, they are only organizing and re-releasing public information. LinkedIn has no copyright in it (as they didn't create the data, nor is it a work of authorship), and they were complicit in the act by delivering it up upon request.
  • The Computer Fraud and Abuse Act is part of the Federal Criminal Code, and no private entity can use it to bring a suit. A prosecuting attorney for the government could make a criminal charge, but LinkedIn would have to persuade him/them to take that action. This is much ado about nothing.
