Patents

Checksumming Webpages Patented

Just when you thought nothing else stupid could be patented, Wahfuz noted a story running about a company called Pumatech, which has apparently patented storing a checksum of a webpage to determine whether it has been updated. I guess from now on everyone who wants to detect changes in web pages will need to store full copies of the pages in question, because I'm sure nobody thought of anything so complex as piping it through md5 and saving the output.
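
For what it's worth, a minimal sketch of that "complex" pipeline (assuming curl and md5sum are installed; the URL and state file are placeholders, not anything Pumatech ships):

    #!/bin/sh
    # Fetch a page, hash it, and report when the hash differs from the stored one.
    url="http://example.com/"
    state="$HOME/.pagehash"
    new=`curl -s "$url" | md5sum | cut -d' ' -f1`
    old=`cat "$state" 2>/dev/null`
    if [ "$new" != "$old" ]
    then
        echo "$url has changed"
        echo "$new" > "$state"
    fi
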
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    I wonder how quickly this will get added to BountyQuest?
  • by Anonymous Coward
    Huh? You're not making any sense. I've implemented content-based caching using the ETag header and If-None-Match. The variant caching is another feature ETags enable, but they are certainly not orthogonal.

    Using ETags for caching instead of If-Modified-Since was prompted by variants, since the multiple language versions usually have the same timestamp. Just because it's useful for that doesn't make it any less useful for fully generalized content-based caching.
  • So, if I just stored it and didn't ever do anything with it, I'd be okay.

    Ever heard the story of the CD-WOM? It was a device consisting of two blocks of ordinary wood and a cable connecting it to the user's PC. CD media was placed between the two blocks and data was written to the CD. The process was foolproof (I challenge you to prove to me that no data was written to write-only media!)

    That's about how useful storing a checksum of a webpage would be without *doing* anything with the data. Sure, the checksum exists, but if you don't bother to do anything with it, the data is as worthless as a CD-WOM. Obviously, someone creating MD5 hashes of all their webpages would also build some sort of system around it to make use of those hashes!

    - A.P.

    --
    Forget Napster. Why not really break the law?

  • The program urlmon [syr.edu] that checks URLs for changes has had this feature for quite a while.

    The README [syr.edu] has the lowdown:

    urlmon makes a connection to a web site and records the last_modified time

    for that url. Upon subsequent calls, it will check the URL again, this
    time comparing the information to the previously recorded times. (Note
    that if the subsequent time is older (less than) the first, urlmon will
    still assume that the URL has been updated. I figured I'd play it safe.)
    Since the last_modified data is not required to be given by the http (it's
    optional), urlmon will then take an MD5 checksum.


    DISCLAIMER: I contributed to this project
  • You can have your web page send the ETag header. Generate a new ETag when the page changes. I mostly use it to change web page contents every x minutes, without messing with date stamps and worrying about screwed up clocks on browsers' computers. The RFC in question is available at http://www.faqs.org/ [faqs.org] - it is the very lengthy HTTP1.1 protocol spec. From what I understand (not that the actual mechanism matters), the browser sends a request for a page along with the ETag (if page was previously cached) and the server will determine whether to send a 304 or the updated page with a new ETag. ETags are essentially disconnected from file date stamps and page content, which makes them great for use in dynamic pages.
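
    As a rough sketch of that idea (not anyone's production code; page.html and the MD5-derived ETag are just illustrative choices), a CGI along these lines would do it:

    #!/bin/sh
    # Serve page.html with an MD5-based ETag and honour If-None-Match.
    sum=`md5sum page.html | cut -d' ' -f1`
    etag="\"$sum\""
    if [ "$HTTP_IF_NONE_MATCH" = "$etag" ]
    then
        echo "Status: 304 Not Modified"
        echo
    else
        echo "Content-Type: text/html"
        echo "ETag: $etag"
        echo
        cat page.html
    fi
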
  • by jCaT ( 1320 ) on Monday April 23, 2001 @01:30PM (#270350)
    I'm sure nobody thought of anything so complex as piping it through md5 and saving the output.

    Yeah- this is one of those "Why didn't I think of that?" things- but I have yet to hear of a web cache or proxy that uses md5sums instead of last-modified headers- are there any out there? And if so, wouldn't that count as the all-important prior art?

    Just because something seems simple once somebody else thought of it doesn't mean it wasn't a good idea in the first place.
  • Sure, someone invented those concepts, but it wasn't these guys.
  • Akamai, among probably lots of others, uses md5 checksums as one of the methods to detect updated pages. I don't know when they started and when the patent was applied for, but it's a possible example of prior art that came right to mind when I heard of the patent.
  • This is a method that is public knowledge and has been for some time. Mudge discussed this as a "web security" technique at Black Hat back in '98. Heck, CNN was there and broadcast pieces of that particular panel. Since he released it into the public domain by open discussion at a national conference, I do believe that voids the patent on the basis of a widely known public method. Of course, I'm not a lawyer even though I don't play one on TV.
  • If you read the press release, the patent isn't on storing checksums of HTML pages, but is for storing checksums of sections of a page between pre-identified HTML nodes.

    Now, perhaps there is prior art for this, but it's a damn good idea, and I sort of doubt it because I've been around the block a few times and haven't seen ANY caching mechanisms that can determine if a page has changed based on a checksum calculated from just a portion of the page (presumably so things like today's date on a page don't affect the state of the cache).

    That seems pretty damn innovative to me. I'm no big fan of software patents, but as software patents go, this is a lot more justifiable than most.

    So flame away, but there is a lot of posturing going on here about prior art, and none of the examples seem to come close.
  • And, unfortunately, probably perfectly valid in the US where something as stupid as software patents can be "valid".

    I quote:

    a checksum generator, coupled to receive the fresh copy of the document from the periodic fetcher, for generating a fresh checksum of a portion of the fresh copy of the document and comparing the fresh checksum to the original checksum, the checksum generator signaling a detected change to the remote client when the fresh checksum does not match the original checksum,

    Note the key phrase in that claim: the checksum is generated for "a portion of the fresh copy of the document". Contrary to the inflammatory headlines, this patent does NOT cover blindly checksumming webpages, but rather strategically checksumming the critical part of a page, so the fluff doesn't affect the cache status.
  • Noel Bell has had his web page up since 1996 on signing web pages using pgp. His key is 2.6.3i, which is probably the last "safe" version anyway. :)

    Here [pobox.com] is a link to his page, with a copyright on it:

    © 1996,1997 Edward James Noel Bell
  • Off the top of my head I can think of a dozen or so programs that will apply a checksum to a file (regardless of whether a browser will render it badly or not)... Transfer protocols like ZModem should certainly qualify as prior art?
  • Yup, mindit got one less user today.
    Of course, I also took the time to fill out their poll, to explain why I unsubscribed.

    Also, look for MD5, Content-MD5 or ETags on www.w3c.org [w3c.org]; their silly patent doesn't fly for a second.

  • Yes, it's certainly good that we have patents; why, before patents neither the wheel nor fire had even been developed. Why would anyone want to invent things if not for the reward of being able to deny them to others without compensation?
  • Erm, well, I have been running a twice-daily cronjob called "urlwatch" on my workstation since - oooh, about 1997 - the guts of which are:

    cat $cf |
    while read url sig junk
    do
        test "$url" = "" && continue
        if www diagwww -aceh "$url" >$tf 2>/dev/null
        then
            newsig=`md5 <$tf`
            if [ "$sig" != "" ]
            then
                if [ $sig != $newsig ]
                then
                    reminder $url $sig $newsig
                fi
            fi
            sig="$newsig"
        else
            ( echo ERROR $url ; cat $tf ; echo -- ) 1>&2
        fi

        echo $url $sig
    done > $nf

    ...and I can probably dig-up corporate off-site backups to prove it.

    For those not familiar with my toolkit, the script retrieves a URL, MD5's it, and mails me a reminder-note when the signature changes due to modification of content.

    I would deem this to be an obvious idea, and would happily support an effort to squash the patent.

    - alec

  • ps: I apologise for some corruption in the above, but /. doesn't seem to like large wodges of code, and it got munged in transit.

    if anybody wants the real thing, drop me a line. usual anti-spam provisions apply.

  • I have a script I have been running for over a year that fetches a remote page, MD5s it, and compares the MD5 to the last one; if it's different, it saves the page and updates the stored MD5, otherwise it drops the page.

    Is this prior art? I was developing a small script or two to do this with arbitrary pages; do I have to stop now?
  • You have misunderstood the use of the ETag field. A server may provide different versions (variants) of a resource depending on request fields such as Accept (accepted types), Accept-Language (accepted languages), or User-Agent. The ETag response field is used to identify the variant that is being returned. The Vary response field specifies which request fields the server may use in selecting variants of the resource. If a user-agent or proxy needs to fetch a resource, and has one or more variants in its cache (that haven't expired), but the values it would use for the request fields listed in the Vary response field differ from those used to obtain the cached variants, it needs to make a new request from the server. It can use the If-None-Match request field to avoid re-fetching data if the server selects one of the variants it already has. This is all completely orthogonal to checking for modification of the resource.
  • Not that it matters anyway - with many web pages often having dynamic content for dates and menus, taking a checksum is a bit pointless.
    Better to keep a DB of last-edited timestamps. This is how I work with a site of mine that uses HTML::Mason and needs to know when to serve a cached copy, when not to, or when to update the cache.

  • http://www.delphion.com/details?&pn=US06219818__

    Ok, there's a little bit more to it than just storing checksums, but is this really non-obvious and original?

    --
  • There's a simple perl CGI tool called JD What's New. I use it quite a bit myself. You can find it on Freshmeat here. [freshmeat.net]
    Last change, MD5, Checksum, and size are all applicable methods for checking for updates.
  • Hebrew scribes would add up the total of the letters on a page to assure that they had correctly copied the text. (in Hebrew, the same characters are used for both letters and numbers, as any qabalist could tell you.)
  • by Francis ( 5885 ) on Monday April 23, 2001 @02:06PM (#270368) Homepage

    I used to work at Pumatech. (Actually, I worked in the wireless web-browsing [proxinet.com] end of things, as an engineer.)

    Anyways, we were checking our emails one day (this was about 6 months ago) and there's some big "congratulations" email - we got another patent!

    A large part of the company's business is synchronization software. (Synchronize your PIM, laptop, whatever.) We'd just received a patent on a revolutionary new technique - time-based syncing! Syncing data based on their TIME STAMPS!

    We had a good laugh.


    --
  • I think that if it benefits society as a whole then some ideas should not be owned by a single person. This goes right back to the generic drug debate: if the "Intellectual Property" is something that could change people's lives, then I don't think that a single company has the right to charge exorbitant amounts for it.


    I also want to point out that in theory Communism is a GREAT idea; it just sucks in practice because of corruption on the part of people in power. I don't think that a single person should be able to have well over a billion dollars while other people die of starvation.
  • Actually, a number of technologies relevant to nuclear weapons were patented prior to and during the Manhattan Project. For some reason, Mr. Stalin failed to adhere to such Intellectual Property law as might have existed at that time. Now that I think about it, I can't imagine a notion more antithetical to the Communist Manifesto than intellectual "property".
  • Hmm... you could use the MD5 of a document as the entity tag (Etag) and use the If-None-Match conditional header. By spec, Etags are totally opaque, but there's no reason they couldn't be checksums.

    IOW, the mechanism is there, but I'm not aware of that particular policy (tag==MD5sum) ever being used.

  • Ehrm?

    You have an old copy of the page and a checksum of that copy. You send a request to the server saying "If the checksum is no longer X, please send me the new copy, otherwise send me a 304 Not Modified message". The server has a checksum Y of whatever version of the page is current. If X==Y, send "304 Not Modified" (a few hundred bytes). If X!=Y, send the new version. This is standardized behavior (see ETag and If-Match/If-None-Match in RFC2616).

    You and Taco must be smoking the same crack today.

  • If they're using a simple checksum, then someone should figure out how to fool it--add like a comment field to a webpage with the correct characters to make the checksum the same.

    If they're using md5sums, well, I guess this won't work.
  • US 4 ever and world domination 4 the US.
    Well, the US can force (to a certain degree) other countries to do things they do not want to do. Also, if someone has a US patent, it is unfortunately not too hard to get an EU patent for it.

    However, why is the US the only country that has the right to a good economy? The people in Japan worked hard. Why do US pharmaceutical concerns have the right to tell African states which pills they have to buy?

    You say that so much good can be done by a proper patent system. Good for whom? The US? Perhaps you did not realize it, but there are human beings outside the US as well.

  • So, you claim that without patents (which came into force a bare 200 years ago) there would be no scientific progress? Have you ever actually read a history book? How about a history of science? Obviously not.

    I suggest you read Kuhn's The Structure of Scientific Revolutions and a few other historical documents, and then get back to us when you have some familiarity with the subject.

    The "sad, sick thing" is that you put personal profit above intellectual honesty.

    mp

  • by Christopher Thomas ( 11717 ) on Monday April 23, 2001 @02:02PM (#270376)
    Just because something seems simple once somebody else thought of it doesn't mean it wasn't a good idea in the first place.

    And just because they (allegedly) were the first to think of it, doesn't mean it's patentable.

    Patents are supposed to be given only for things that aren't "obvious to anyone skilled in the art". In practice, this isn't assessed well by the patent office, but that's another can of worms.
  • I rattled off specs for just such a thing during a lunchtime meeting months ago. I suppose since we threw out the McDonald's wrappers I wrote it on, I cannot have it considered as prior art.
    Damn Ketchup Packets.

    The only issue was how to get the end users' browsers to use the system. IANAProgrammer so I was relying on the technical ability of my lunchmates.

    With the success of the new.net dns plugin [new.net] and other such plugins I see that writing a plugin to the browser is the simplest way to do this.
    -miket
  • I have thousands of MD5 sums stored from web pages and various files linked to web pages, along with many of the original files. I've been sucking such info off the net and using MD5 sums to uniquely identify these files for a couple of years at least. Never even considered the lame-ass idea of patenting such a thing. Damn, maybe I should patent all my shell scripts. :)
  • Likewise. Company I used to work for did something very similar, using a CRC calculated using the text of a web page to determine web page "identity". I would be surprised if the Lycos (or Altavista, or Webcrawler, or Hotbot...) spiders didn't do something very similar.

    Which brings up an interesting question - if, by 1997, there were enough companies implementing this sort of "technology" already, then can't it be argued that the Pumatech patent is obviously invalid because, at the time they applied for it, it was already in use by multiple companies... which seems to me to indicate that their "innovative" technology is "obvious to a practitioner skilled in the arts".

  • Shouldn't that be... if you patent it, they will pay?
  • Either this patent is limited in scope, or even very common programs like tripwire are prior art...
  • They've got a lot of work to do before I'll believe that they're even trying. And hearing about cases like this... well, it sure doesn't make me think more highly of them!


    Caution: Now approaching the (technological) singularity.
  • You have neglected one significant cost. These *** patents make it much more difficult for a small company. A small company won't have cross-license agreements, won't have a large legal staff, won't get a "good-buddy" licensing price, and is generally operating on a shoe-string budget anyway.

    So this is one of the factors that causes many new companies to fold. Think of it as a social control mechanism ... and it is, whether intentional or not. Because of this, I tend to think of these "spurious" patents as a large evil. Not the biggest one, but not a small one either.


    Caution: Now approaching the (technological) singularity.
  • Yes indeed. Text is so highly differentiated that if you know about doing something to the whole thing, doing something to a part of it is patentworthy. ????

    You have an extremely low standard for what should be patentable. Considering the cost of defending against a patent, if trivialities are patentable, soon only the rich will be able to legally initiate any action. Is this a social good? Is it in compliance with the constitutional provisions enabling the patent law? (I don't remember the precise phrasing, sorry. It isn't "To promote the general welfare...", but that was the idea behind it.)

    E.g.: There may be no prior art in the archives of the patent law covering eating using a metallic or otherwise rigid, or somewhat stiff, divided instrument to convey the nutritive material from a holding container to the grinding apparatus. Should this be patentable?


    Caution: Now approaching the (technological) singularity.
  • Ok. I haven't read their patent, but I did implement and present a system that stored the SHA-1 and time of a generated web page (by url) so that a dynamic web site could correctly answer the HTTP If-Modified-Since header.

    I presented a little paper at a small gathering in '98.

    see the pdf [usyd.edu.au]

    Anyway, I can't remember thinking this was novel enough to patent. Obviously I'm never going to be rich.

  • Can they testify that they have been doing this since prior to Feb 18, 1999?

  • I would've thought there would be prior art for this type of thing already... Oops, wait, the USPTO doesn't take that into account before granting the patent.

    Still, this should be easy to defeat.

    I've been checksumming files on file servers for years, to detect whether they have been changed. How is that any different from this?

    grumble.
  • Go read section 14.27 in the same RFC I referred to above. It is about "If-Range", which is the range-based version of If-Match.

    So no, it's not different. And yes, it's an obvious extension. So obvious that the HTTP/1.1 people included it.
  • by kijiki ( 16916 ) on Monday April 23, 2001 @02:20PM (#270389) Homepage
    You couldn't be more wrong.

    January 1997 -- rfc2068 HTTP/1.1

    See section 14.20, 14.25, 14.26, and 14.43.

    It describes the "ETag: " header, which is usually an MD5 hash of the resource.

    The client can then validate the resources in its cache by sending a request with a "If-None-Match: " header with the ETag associated with the copy in its cache.

    The server will either respond "Not modified" in which case the client simply uses the version in its cache, or the server will resend the resource if the ETags don't match.

    Since this patent was filed for in 1999, this is pretty clear prior art, in the most commonly used protocol on the largest network in the world. If the patent office can't locate prior art in incredibly obvious (obvious to anyone skilled in the art, that is) cases like this one, what hope do we have for them intelligently handling more subtle cases?
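
    To see the mechanism for yourself (a rough sketch; the URL and ETag value below are made up, and the server has to emit ETags for this to work):

    # First request: note the ETag the server returns.
    curl -sI http://example.com/index.html | grep -i '^ETag:'

    # Revalidate with that ETag; an unchanged page comes back as "304 Not Modified".
    curl -sI -H 'If-None-Match: "686897696a7c876b7e"' http://example.com/index.html | head -1
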
  • I wrote a script to do that for me when I was 12...

    ---------------------------
    "I'm not gonna say anything inspirational, I'm just gonna fucking swear a lot"
  • One of the ways Akamai uses to see if cached content needs to be updated is to fingerprint the content (HTML/gif/jpg/etc.) with an MD5 hash. They even supply a server filter that modifies the content URL to reference that MD5 fingerprint, so that as soon as the content changes the Akamai servers see a new fingerprint in the request and know that it's time to refresh its cache.
  • check_www is a series of scripts and filters that I created under the GPL last year to automatically advise me of when web pages change, popping up alert boxes and pre-loaded browsers as appropriate. It includes filters to remove unwanted constantly changing information and to search for terms. It is available at http://olliver.family.gen.nz/check_www.tgz Ironically, I was alerted to this article by it. Vik :v)
  • This is an obvious, but well-written, troll. My compliments to the chef!
  • You never found anyone better than netmind at actually monitoring web pages for you then? I've got netmind watching about 50 web pages for me, and you're right, they often take days and days to do their "daily" update. I've found spyonit.com, but that (despite their claims) doesn't seem to be as customisable, and found a few other things which can unhelpfully tell you "this page has changed" but not actually tell you the changes.

    If you've found a site that can tell you when AND HOW a web page has changed, and can be taught to ignore simple date-changes, and preferably attach the page to an HTML-format email, and do it punctually, I'd appreciate knowing about it!

  • by FreeMars ( 20478 )
    It ought to do well indexing pages with text hit counters...
  • I remember reading an article in Dr. Dobb's April issue [ddj.com] about a search engine using checksums to see if a page has changed and needs to be re-indexed.
  • Good lord this is lame. Back when I was a wee programmer knee-high to Linus Torvalds I wrote some Perl to create a searchable web index from the HTML on a server, and I generated an MD5 checksum on the pages as I indexed them and stored it as part of the change history for a page, then if something 'touch'ed the page and changed the mod date my indexer still knew it hadn't really changed. This was before 1997. I didn't know I was smart enough to have a patentable idea.
  • Now I know you can go after the police for malicious prosecution, and I know people have sued to recover court costs before. Could something like that be used to go after companies that file obvious patents that have been in use for a long time?

    Say you're an independent coder, and you create a way to check if a file is current using checksums, and you use it on your personal web site, never thinking about it. Years later a company patents exactly what you're doing.

    A normal reaction might be to yell and scream about how you were already doing it and how the patent is worthless. What about if you instead copied their product, using their supposedly patented technology. Seeing that, they'd come after you for patent violations. You could then show you were using the algorithm for much longer than them. Then, after you won the case, you could sue them to recover the costs associated with defending the case.

    I dunno, maybe some variation on this might work. It sure would be nice to be able to turn the screws on the screwers.

    Disclaimer: I am not a lawyer licensed in your jurisdiction or in any other jurisdiction. I'm not a lawyer at all, and I'm probably not even in your country. If I were in your jurisdiction and were a lawyer I'd probably not want to give out free legal advice anyhow... but who knows what I'd do, cuz I'd probably be pretty depressed at being a lawyer.

  • Publicly available prior art: the [Harvest] distributed Internet search system, programmed in 1994, and still freely available for download, compilation and use today, includes exactly what is claimed here. (Related to Zeinfeld's work?)
  • rsync [samba.org] does a block by block checksum of a file, then searches another file for matching blocks, thus making it a generalisation of this idea to /any/ file. It's been around for a /long/ time - the mailing list archives go back to 1991.

    rproxy [sourceforge.net] applies the rsync protocol to http caching. I first heard about it at CALU in July 1999, and checked out some cvs code that worked at that time.

    The general idea has been floating around for ages, though - look on the rproxy site for links to other people's ideas about this kind of thing.

    This /is/ yet another case of a really dumb patent.

    himi

    --


  • % telnet slashdot.org 80
    Trying 64.28.67.150...
    Connected to slashdot.org.
    Escape character is '^]'.
    HEAD / HTTP/1.0

    HTTP/1.1 200 OK
    Date: Tue, 24 Apr 2001 05:22:53 GMT
    Server: Apache/1.3.12 (Unix) mod_perl/1.24
    Connection: close
    Content-Type: text/html

    Connection closed by foreign host.

    --

  • If anyone wants to challenge this patent, I believe I can show prior art (I haven't actually read the patent yet.) I used an MD5 checksum to check if a page had changed for the Excite Newstracker service in 1996. As virtually any competent programmer would have done...

    Actually, the problem is harder than that, because you have to filter out things that change every time you access the page, like embedded banner ads, counts of how many times the page has been accessed, and so forth. Another approach I considered was to compare a vector of word counts, and consider the document unchanged if the new vector was sufficiently close to the old one.
  • Nee Arrowpoint, the web balancers Slashdot itself uses.

    It stores an MD5 checksum of a webpage to determine if the page it retrieved is complete. This is part of its timing mechanism to determine load. Pretty sure they did this prior to Feb. 99.

  • What you're missing is that the machine that's doing the checksumming isn't necessarily the same machine that's viewing the page.

    If the machine that's doing the checking is on a nice, big, fat pipe - it can check a page regularly (very quickly) - then send a notification to the user, who may be on a slow (dialup) link... this way the user doesn't have to keep visiting a page (they just wait for the change notification)
  • Yeah- this is one of those "Why didn't I think of that?" things

    No, it isn't.

    but I have yet to hear of a web cache or proxy that uses md5sums instead of last-modified headers- are there any out there?

    No, because that's a completely different question.

    Just FYI, this has been going on for _ages_ There was a 'web page change detector' available back in my 14.4kbps modem days (early 1995 - I can't remember what it was called, tho - been too damn long) that used this very technique... you fed a URL into a CGI, and it would poll the page every so often and email you if it had changed. And guess what? It used a checksum of the page to determine if it had changed (since storing all those pages would just take way too much storage space.)

    This is _NOT_ new, and it's _NOT_ non-obvious.
  • Ask web crawler designers. When I was working on a web crawler, I wondered what would happen when pages got updated and how I would go about getting the latest update, so I had the crawler store a page with the date it was fetched and a checksum of the page. If a page hasn't been fetched in 10 days and is crawled, it is fetched, the checksum is compared, and if different it is parsed for potential new links/keywords... This is so obvious, I am sure that Google and the major search engines probably do this.

  • Here's the URL for the patent [164.195.100.11], from the US Patent & Trademark Office Database [uspto.gov].

    "Oh, Lisa, that's a load of rich creamery butter." - Homer Simpson

  • http://www.geek-girl.com/ids/1995/0306.html

    lots of postings here from 1995 about tripwire and its predecessors. . .

    maybe the USPTO should post their patent requests to slashdot and let us find the prior art before they issue patents.

    How about a site like http://find-prior-art.com that pays out money to the first people to find prior art for patent requests?
  • I wrote a program in 1998 which monitored several pages and emailed me if any of them had been changed. I used it when I was having problems with a content generation system losing its connection to the database. I couldn't use the Last-Modified header, because this was dynamically generated content without one.
  • I can think of at least two excellent reasons off the top of my head.

    First, it's a considerable expense and hassle. Patent attorneys are not optional - the claims have to be properly worded for the USPTO to accept them *and* to prevent some business from stealing your idea by rewording an ineffectual claim ever so slightly. If you're a business and want to create market entry barriers for your competition, $10-20k might be a good investment. If you're a working stiff, that's a lot harder to justify. If you're still in college, forget it!

    Second, by seeking patents for "obvious" things we're implicitly accepting the validity of all other obvious patents. A sadly too common analogy is elections in corrupt regimes - you can organize a voter boycott because the election is corrupt, you can run your own candidate, but you can't do both.
  • He's an idiot; don't expect him to actually think about things like that.

    He thinks that if you disagree about a patent you are a communist. What kind of moron thinks like that?
  • I don't know why anyone needs this. There are expiration dates and conditional loading of expired pages already defined in HTTP/1.1 (RFC 2068), so instead of creating a hash, a server honouring requests such as 'If-Modified-Since' would do the job perfectly. There is also an entity tag already defined in the RFC. Deriving it from a hash is one possible way to create such a tag; encoding the document location and the date of the last change is another.
    But in general a server using the last modification date of the file as the 'Last-Modified:' header would do the job well. Otherwise an entity tag would do the job. The hash would only make sense if the document could be retrieved under different URLs, and even then sensible creation of an entity tag would do the job.

    Then there is the Content-MD5 field for an integrity check (from rfc 2068):
    The Content-MD5 entity-header field, as defined in RFC 1864 [23], is an MD5 digest of the entity-body for the purpose of providing an end-to-end message integrity check (MIC) of the entity-body. (Note: a MIC is good for detecting accidental modification of the entity-body in transit, but is not proof against malicious attacks.)

    This is in the RFC dated January 1997. There are also guidelines for how proxies or clients should use these tags to check for expired documents. It's all there.
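
    As a side note, the Content-MD5 value is just the base64 encoding of the raw 16-byte MD5 digest of the body (RFC 1864). A quick sketch for computing it on a local copy of a page, assuming the openssl tool is handy and page.html is a placeholder:

    openssl md5 -binary page.html | openssl base64
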
  • I mean, how ridiculous can it get? You look up something you deem a good idea, then modify it slightly and patent it? Note that the method in the RFC doesn't refer to patents and thus is probably not patented. The authors thought it obvious to mark the document with tags to deduce the date of last modification, a unique ID (for documents retrieved under this URL) and a checksum for integrity checking. Now some morons come along, see it already done, do it on parts and get a patent.

    I would like to patent transporting morons. In parts.
  • Besides, the US let out the REAL secret at Alamogordo, Hiroshima, and Nagasaki. Namely, that it was possible to build a working atomic bomb. Once the Russkis had that, the rest was engineering. They already knew the theory.
  • Actually, I seem to recall that Inktomi Traffic Server has this functionality. However, I'm not sure if they implemented it prior to 1999 or not.

    Their design was more geared towards hashing something like redball.gif and allowing a single instance in the cache to be referenced for multiple sites, thereby saving space in the cache.

    -Todd

    ---
  • Why would you want to checksum a file to see if it's changed? As a web server, the time stamp is adequate to determine if it's changed, and as a web browser or web proxy, HEAD is adequate to check the time stamp.

    While we're at it, I'm going to rush to the patent office and see if I can "patent" 64-bit date time stamps, so I have a lead on the next big crisis!

    -Michael
  • Did the patent office even try a Google search before stamping its approval on this patent?

    Obviously not: http://www.google.com/search?q=web+checksum [google.com]

    Hit #2 is prior art: "BIBLINK.Checksum - an MD5 message digest for Web pages" [ariadne.ac.uk] . Note that: "This article last updated/links checked on 23-Sept-1998"
  • Not that I figure prior art will be hard to come by for this, but I did this in Squeak/Smalltalk for a CS project my sophomore year in college, 1998. And they've been using this project for several years in this class.
  • Taking a look at the patent content [delphion.com], it's not as simple as running the page through a checksum generator. This wouldn't work with some dynamically generated pages, for example, because their dates of creation will change every time.

    The process in the patent allows you to select a portion of the web page, and then the server only tracks changes in that portion. It also generates a checksum for each portion of content between HTML tags, and it is smart enough not to tell you that the content changed if certain sections got reordered, but the content's the same. It will also show you exactly which portions changed, since it has a separate checksum for each section.
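
    A rough sketch of the per-section idea (not Pumatech's actual code; the markers and URL are invented for illustration): checksum only the part of the page between two known landmarks, so boilerplate outside it never triggers a change.

    curl -s http://example.com/page.html |
        sed -n '/<!-- story-start -->/,/<!-- story-end -->/p' |
        md5sum | cut -d' ' -f1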

    It's not fusion power, but it's an ok idea, and I don't think anyone has used it before. So, let them have the patent.


    ----------
  • This is the same company that developed and sold the synchronization software that supposedly worked with the Palm HotSync app to allow synchronization with other schedulers. Their conduit software worked, once you took the days required to figure out how to install it correctly.

    It figures that they'd come up with yet another harebrained scheme....

    -drin
  • The posting begins, "Just when you thought nothing else stupid could be patented" . . . um, hello? Why the heck would ANY of us think that? Did I miss the story about the patent office coming to its senses?
  • There is probably major prior art from Tripwire [tripwire.com] and other file-integrity checkers. Basically the exact same idea, with the purpose of detecting when important files have been altered through a breakin.


    ---

  • I have.

    In fact, getting the whole page and doing an md5 sum, and then comparing it to a stored value in a MySQL database, is exactly how mine works. This patent can go fuck itself, thank you very much :)

    I don't remember if I actually coded the sum/compare part, because by the time I got to that part, I was sick of the idea anyhow. But the bookmarks db and all its entries are live on my machine at home, and I use it for storing and retrieving bookmarks from anywhere I happen to be using a computer.

    A patent for this is ridiculous. I am fucking tired of people patenting totally naive and obvious approaches to trivial problems.
  • by Carnage4Life ( 106069 ) on Monday April 23, 2001 @02:02PM (#270459) Homepage Journal
    Isn't this just doing something similar to what strong validators [w3.org] à la Entity Tags [w3.org] in HTTP requests and responses use for determining whether a page has been changed (i.e. is in the cache) or not?

    The only difference I can see is that they generate an ETag-like entity for text highlighted by the user as well as for the entire webpage. Doesn't seem worthy of a patent, though.

    --
  • Claim 1 of the patent reads:

    1. A change-detection web server comprising:

    a network connection for transmitting and receiving packets from a remote client and a remote document server;

    a responder, coupled to the network connection, for communicating with the remote client, the responder registering a document for change detection by receiving from the remote client a uniform-resource-locator (URL) identifying the document, the responder fetching the document from the remote document server and generating an original checksum for a checked portion of the document, the checked portion being less than the entire document;

    archival storage means, coupled to the responder, for receiving the URL and the original checksum from the responder when the document is registered by the remote client, the archival storage means for storing a plurality of records each containing a URL and a checksum for a registered document;

    a periodic fetcher, coupled to the archival storage means and the network connection, for periodically re-fetching the document from the remote document server by transmitting the URL from the archival storage means to the network connection, the periodic fetcher receiving a fresh copy of the document from the remote document server,

    a checksum generator, coupled to receive the fresh copy of the document from the periodic fetcher, for generating a fresh checksum of a portion of the fresh copy of the document and comparing the fresh checksum to the original checksum, the checksum generator signaling a detected change to the remote client when the fresh checksum does not match the original checksum,

    whereby a change in the document is detected by comparing a checksum for the checked portion of the document, wherein changes in portions of the document outside the checked portion are not signaled to the remote client.

    So, the usual flame-before-reading crowd isn't entirely unjustified. (That's not to endorse flaming before reading, much less thinking, but hey, even a blind pig finds the occasional acorn.)

    Oh, btw, the priority date is January 14, 1997. Leave it to the guys who do the press release to give the wrong impression of when the thing was invented. Not that doing a checksum and not recording non-changes wasn't just as obvious in 1997 as 1999.

  • The last company I worked for had been doing checksums of web pages since about 1997. Depending on when Pumatech started, this may be prior art. In my brief skim, I didn't see any initial date.

    Anyways, it's a silly patent. Checksums are a pretty fundamental thing to do! I don't even think my last company tried to patent it, because it was so blatantly obvious!

  • the patent in question does checksums of *parts* of an html document, so the system is more complex than just wget | md5gen or whatever. It's supposed to be able to fetch only the diffs. Perhaps still not patentable, but still different from what has already been done.
  • by Chillas ( 144627 ) on Monday April 23, 2001 @01:28PM (#270480)
    Ahem ... no, they have patented a system for creating, storing, and using the checksum. An entire system, not just the storage of a checksum. Once again, alarmist headlines from /. I think we'd all appreciate it if these stories had accurate headlines.

  • Yeah- this is one of those "Why didn't I think of that?" things- but I have yet to hear of a web cache or proxy that uses md5sums instead of last-modified headers- are there any out there? And if so, wouldn't that count as the all-important prior art?

    I know of a website which kept an index of links to people's weblogs (it's a semi-private thing on a cable modem, so sorry but no link). It polls the websites every 15 minutes to see whether they've changed, and orders the list of links accordingly... so you can visit the page and see at a glance who's updated their weblog.

    This was all accomplished using a homegrown Perl script. Originally, it stored a checksum of the pages it retrieved for later comparison, to determine when a page was last updated. This was later replaced with a simple byte count of the page's size - using a checksum of the whole page generates "false alarms" when people are using hit counters on their page, whereas the size of the page tends to be more stable, yet is unlikely to remain the same between updates.
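
    The byte-count variant is about as simple as it gets; a one-liner sketch with a placeholder URL:

    curl -s http://example.com/ | wc -c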

  • Well now hang on a second here. Is it really the company that's trying to get rich, or the lawyers? Who has more to gain from this? The company might make some money suing another company some day, but the lawyers definitely make money (and keep their job security) by encouraging patenting technology.
  • Lawyers can be like any other consultant. A lot of their advice can be such that it requires the constant presence of a lawyer to keep you out of legal trouble. I don't trust 'em any farther than I can throw 'em.
  • nope, that would involve...ummm...technical competence
  • I wrote a similar system as a college freshman. From the UMBC Agent Web [umbc.edu]:

    A new and improved diffAgent server [industry.net] has been released which includes additional mediators. "A diffAgent watches information sources available via the web and e-mails you when it detects changes. In particular, it can:

    • Watch your FedEx package for you and e-mail you when it sees the words "Package has been Delivered!" (make a package watcher agent)
    • Monitor a list of query results at a search service like Altavista to see when new pages on your topic appear (make a web topic watcher agent)
    • Keep track of news articles on a topic and mail you when it finds new ones (make a news topic watcher agent)
    • Mail you when your name appears in a list of papers at an electronic archive (make a web page watcher agent)
    • Tell you when the word "snow" appears on the Pittsburgh weather page (make a web page watcher agent)
    8/15/96

    diffAgent had two modes. In the first mode, it stored a CRC checksum of the page, periodically compared checksums, and notified you of changes.

    In the second mode, it stored the whole page, ran diff --context=3 over it to detect changed lines, and then grep'd for user-specified words of interest.
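
    A sketch in the spirit of that second mode (the file names, URL, and search term here are all made up):

    curl -s http://example.com/weather.html > new.html
    diff --context=3 old.html new.html | grep -i snow && echo "changed, and it mentions snow"
    mv new.html old.html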

    I believe The NetMind web page was already up at that time, but they may not have had all of the features important to the patent. IMO, the NetMind technology is not worth a patent, but it is a bit beyond the diffAgent, and not entirely trivial to implement even if it is trivial to think of.

  • hashing web pages (or anything else for that matter) is a standard security procedure. You monitor the hashes, and notify if something isn't right.

    I use hashes for database work...for example, when I want to make a link to a data element I just added, I create a hash of that record to refer to (because you don't know the primary key ID that was assigned by the database, and that's how you would call the record.)

    Anyway...yes it is obvious, and yes there is quite a bit of prior art. Even if neither were the case, patenting processes is fscking stupid anyway. If you have a prototype of a device, by all means, patent it. Otherwise go away.

  • If you patent it, they will come.

  • I've ALWAYS used checksums to do that kind of stuff. Unfortunately in scripts that aren't distributed publicly, but cripes, any damn fool could come up with that idea!

    Another trick I've used is in scripts that generate static .html pages from a database: take the data used in the page (not the page itself), and make an md5 of the concatenation. Since most md5 routines can take data in chunks, you can generate it as you're getting the data. Then save the md5sum in a comment at the top. Then in the future you can compare the md5sum stored in the page with the md5sum of the data. If there is a "last modified" date on the page or something, this will only update it when the data changes.

    I also use this trick for an automatic DNS updating script that creates zone files from a master data file. Can't just update the zone files every time because then the serial numbers would be updated constantly.
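
    A bare-bones sketch of the embed-the-hash trick (the file names and the generate-page helper are made up for illustration):

    new=`md5sum data.txt | cut -d' ' -f1`
    old=`sed -n 's/^<!-- md5: \(.*\) -->$/\1/p' page.html 2>/dev/null`
    if [ "$new" != "$old" ]
    then
        { echo "<!-- md5: $new -->"; ./generate-page data.txt; } > page.html
    fi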

    So if anybody patents this silly idea (maybe they already have?), I've been using it for like eight years!! I'm publicly announcing it here on /.!!

    Blah.

    Besides I don't use NetMind anymore, I use SpyOnIt [spyonit.com].

  • I note that Linux Focus already uses md5 to allow mirrors to check for updates to the pages. See that here [linuxfocus.org].

    Did the patent office even try a Google search before stamping its approval on this patent?

  • actually mysql probably keeps it running VERY FAST. We had a mysql server take upwards of 15000 requests to write a webpage (PHP API). The page would be written within seconds.

    Now to get on topic, does the patent office do any background checking on anything dealing with a computer program? Or do they just assume that since this was the first they read about this function, that it is obviously the first time it was implemented?

  • I have prior art for the specific first patent claim, sent to the www-talk mailing list in 1994.

    The HTTP protocol itself has had Jeff Mogul's cache optimization protocol in it since at least 1996.

    It is yet another bogus patent. Time to use the proposal I made of issuing a civil action for perjury against people making fraudulent patent claims. I suspect that approach would cut down on the number of bogus applications.

  • Publicly available prior art: the [Harvest] distributed Internet search system, programmed in 1994, and still freely available for download, compilation and use today, includes exactly what is claimed here. (Related to Zeinfeld's work?)

    I had forgotten how Harvest worked; I suspect that the number of like cases is very large.

  • by MrNovember ( 310587 ) on Monday April 23, 2001 @01:30PM (#270552)
    When laws such as copyright and patent become misused in idiotic ways, the masses will simply ignore them in what amounts to large scale civil disobedience.

    The danger of patents like these is not, IMHO, that someone is going to ask you to pay a license fee for your two line Perl program that uses checksumming but that when you really invent something original and worthwhile, patent protection will have been rendered meaningless by people simply ignoring it.

  • I believe people who work hard and ethically have a right to their billion dollars.


    Hello? Heelllooo?!

    No one makes a billion dollars by working 100,000 times harder than someone making 10K.
    They make a billion dollars by having a horde of people who are earning 10K work for them. Check out Nike.
    Phil Knight doesn't work any harder than the Vietnamese girls who make the shoes. Those girls are not *lazy*.

    He makes his money by siphoning off the value from their labor, since they work under a corrupt government where unions and occupational safety codes are written by dictators who have no interest in protecting these "lazy" poor people.

    There is no relationship, for example, between executive compensation and productivity.

    What really lets people make huge amounts of money is not hard work (the Mexicans who wash the dishes in the restaurant where you dine are working very hard) and it's not intelligence (the college profs who taught you are probably pulling in 60K on average; the grad students are making 15-20K), but being able to position yourself into a role where you either manage people, or money, or both. Or maybe get a fat government monopoly on something (i.e. patents) that others use and skim off their income. That, or just let your money "work" for you.

    In either case the key to making big bucks is to park your behind right in the middle of some productivity intersection and start taking tolls.


    And if anyone objects, there will always be Ayn Rand-worshipping ideologues such as yourself to keep up the PR war, believing that this is somehow the ethical way to do business.

  • Does an individual deserve to own a patent on checksumming? Surely not. But is there an argument to be made for collective ownership of the patent? I believe there is.

    You see, when a patent is granted to an individual, the benefits aren't accrued solely by the individual. The entire society benefits, because that country now possesses a citizen who owns the patent and can wield it against other countries' citizens. The GNP as a whole is raised because of efforts like these.

    You can imagine how much richer the US economy would have been if we'd managed to patent the transistor before Japan got its own electronics markets running. You can imagine how much safer the world would be from nuclear warfare if the US had successfully patented atomic weapons before the Russians got their own projects going. Though the lifespan of a patent is only about 18 years, that would have been enough time to get some diplomatic solutions in place and prevent the escalated arms races of the Cold War.

    What does this have to do with checksumming? Not much, I'm afraid. That's a stupid patent and we all know it. But let's not cut off our nose to spite our face when so much good can be done by a proper patent system.
  • Ha! I just patented 1-Click check sums... The rest of you will have to use the inferior "2-click" check sum...
