Cracking the Google Code... Under the GoogleScope
Posted by
CmdrTaco
on Tue May 10, 2005 11:25 AM
from the something-to-read dept.
jglazer75 writes "From the analysis of the code behind Google's patents: "Google's sweeping changes confirm the search giant has launched a full out assault against artificial link inflation & declared war against search engine spam in a continuing effort to provide the best search service in the world... and if you thought you cracked the Google Code and had Google all figured out ... guess again. ... In addition to evaluating and scoring web page content, the ranking of web pages is admittedly still influenced by the frequency of page or site updates. What's new and interesting is what Google takes into account in determining the freshness of a web page.""
This discussion has been archived.
No new comments can be posted.
335 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
On the minds of all slashdotters, (Score:5, Funny)
Re:On the minds of all slashdotters, (Score:5, Funny)
(http://www.geocities.com/rrkap)
So will this make it easier or harder to find porn?
Because there's a shortage of porn on the web?
Re:On the minds of all slashdotters, (Score:5, Funny)
Re:On the minds of all slashdotters, (Score:4, Funny)
bingo. therefore, the question he should've asked is: will the pron I find make me harder?
Re:On the minds of all slashdotters, (Score:4, Informative)
Have you already seen DOMAI [domai.com]? (NSFW)
Great (Score:2, Interesting)
(http://www.aperture.ca/)
http://www.anologger.com/ [anologger.com]
Google what is best in life (Score:5, Funny)
it's a war (Score:5, Funny)
(http://booktextmark.mozdev.org/)
resistance is futile (Score:5, Funny)
(http://booktextmark.mozdev.org/)
Borgle.
in case of slashdotting, article text (Score:5, Informative)
Google's US Patent confirms information retrieval is based on historical data.
Publication Date: 5/8/2005 9:51:18 PM
Author Name: Lawrence Deon
An Introduction:
Google's sweeping changes confirm the search giant has launched a full out assault against artificial link inflation & declared war against search engine spam in a continuing effort to provide the best search service in the world... and if you thought you cracked the Google Code and had Google all figured out ... guess again.
Google has raised the bar against search engine spam and artificial link inflation to unrivaled heights with the filing of a United States Patent Application 20050071741 on March 31, 2005.
The filing unquestionably provides SEOs with valuable insight into Google's tightly guarded search intelligence and confirms that Google's information retrieval is based on historical data.
What exactly do these changes mean to you?
Your credibility and reputation on-line are going under the Googlescope! Google has defined their patent abstract as follows:
"A system identifies a document and obtains one or more types of history data associated with the document. The system may generate a score for the document based, at least in part, on the one or more types of history data."
Google's patent specification reveals a significant amount of information both old and new about the possible ways Google can (and likely does) use your web page updates to determine the ranking of your site in the SERPs.
Unfortunately, the patent filing does not prioritize or conclusively confirm any specific method one way or the other.
Here's how Google scores your web pages.
In addition to evaluating and scoring web page content, the ranking of web pages is admittedly still influenced by the frequency of page or site updates.
What's new and interesting is what Google takes into account in determining the freshness of a web page.
For example, if a stale page continues to procure incoming links, it will still be considered fresh, even if the page header (Last-Modified: tells when the file was most recently modified) hasn't changed and the content is not updated or 'stale'.
According to their patent filing Google records and scores the following web page changes to determine freshness.
The frequency of all web page changes
The actual amount of the change itself... whether it is a substantial change, redundant, or superfluous
Changes in keyword distribution or density
The actual number of new web pages that link to a web page
The change or update of anchor text (the text that is used to link to a web page)
The number of new links to low trust web sites (for example, a domain may be considered low trust for having too many affiliate links on one web page).
Although there is no specific number of links indicated in the patent it might be advisable to limit affiliate links on new web pages. Caution should also be used in linking to pages with multiple affiliate links.
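To make the list of signals above concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the patent filing gives no weights or formula, so the WEIGHTS table, the PageHistory fields, and freshness_score() are hypothetical names, not anything Google has published.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration; the patent gives none.
WEIGHTS = {
    "change_frequency": 1.0,
    "change_magnitude": 2.0,
    "keyword_shift": 0.5,
    "new_inbound_links": 1.5,
    "anchor_text_changes": 0.5,
    "new_low_trust_links": -3.0,  # links to low-trust sites count against the page
}


@dataclass
class PageHistory:
    change_frequency: float = 0.0  # content changes per month
    change_magnitude: float = 0.0  # 0..1, share of the page that really changed
    keyword_shift: float = 0.0     # 0..1, change in keyword distribution
    new_inbound_links: int = 0
    anchor_text_changes: int = 0
    new_low_trust_links: int = 0


def freshness_score(history: PageHistory) -> float:
    """Weighted sum of the history signals listed above (toy model)."""
    return sum(weight * getattr(history, name) for name, weight in WEIGHTS.items())


# A page whose content never changes but which keeps attracting links.
stale_but_linked = PageHistory(new_inbound_links=4)
# The same page with no new links at all.
stale_and_ignored = PageHistory()
# The linked page again, now also pointing at low-trust sites.
affiliate_heavy = PageHistory(new_inbound_links=4, new_low_trust_links=3)
```

Under these made-up weights the stale page with new inbound links still outscores the ignored one, and new links to low-trust sites pull the score back down, which is the behaviour the article describes.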
Developing your web page augments for page freshness.
Now I'm not suggesting that it's always beneficial or advisable to change the content of your web pages regularly, but it is very important to keep your pages fresh regularly and that may not necessarily mean a content change.
Google states that decayed or stale results might be desirable for information that doesn't necessarily need updating, while fresh content is good for results that require it.
How do you unravel that statement and differentiate between the two types of content?
An excellent example of this methodology is the roller coaster ride seasonal results might experience in Google's SERPs based on the actual season of the year.
A page related to winter clothin
Unintended side effects of the Google arms race (Score:5, Interesting)
(http://www.xpriori.com/ | Last Journal: Friday June 18 2004, @04:18PM)
Re:SEO (Score:4, Informative)
(http://www.lagom.nl/)
Doesn't work in slashdot because:
Re:Unintended side effects of the Google arms race (Score:5, Insightful)
Re:Unintended side effects of the Google arms race (Score:4, Interesting)
(http://www.xpriori.com/ | Last Journal: Friday June 18 2004, @04:18PM)
That would be great. Now that I've read TFA, it looks like Google's techniques go a long way toward eliminating the fakery currently done by SEOs.
As an aside, the article looks like it was written by an SEO consultant, as it contains a lot of advice about how to get good rankings under Google's patented approach. Interestingly, the recommended actions are mostly legitimate (offer interesting content, update regularly, don't try to create fake links to your site), but there are also some less-upfront techniques (make link-exchange deals with other sites and encourage bookmarking, for example).
Re:Unintended side effects of the Google arms race (Score:5, Interesting)
(http://www.intelligentblogger.com/ | Last Journal: Monday August 27, @11:47AM)
Companies need to start realizing that making money is about providing what customers want. Advertising is a great way of getting your name out, but only a good product or service will actually carry through. So in that frame of thinking, I highly recommend that companies:
Re:Unintended side effects of the Google arms race (Score:5, Funny)
I'm unhappy because I was grabbed off the street. May I go now?
Please?
Re:Unintended side effects of the Google arms race (Score:5, Insightful)
(http://www.intelligentblogger.com/ | Last Journal: Monday August 27, @11:47AM)
Actually, pretty much everything you list falls under the issue of usability. Many of those options have lower usability for the user, and thus the search engine by extension.
These companies don't need an SEO, they need to find a web designer that doesn't use Macromedia "tools".
Re:Unintended side effects of the Google arms race (Score:4, Insightful)
(http://www.internetisshit.org/ | Last Journal: Sunday April 03 2005, @04:42PM)
I will not say anything at all about Flash because two camps who BOTH don't get it will start the usual pointless discussion. Flash is rarely used for what it's great at, visualizing data, and plagues us with wildly unnecessary and annoying l33t-masturbation stuff instead.
Dreamweaver itself is indeed a powerful timesaver in the hands of an experienced XHTML/CSS guy. If you look at it closely, you'll find that it is a very nice graphical frontend to HTML itself, with a great set of shortcuts so that you almost don't have to touch the mouse at all. The palettes just provide access to the most commonly needed attributes of the element you're working on. If you leave all those nasty "behaviours", "timelines" and whatnot alone, it produces nicely readable and well-formed code. I've been using Dreamweaver since the early betas, and even back then this was the case. I tend to think that this was an initial design goal behind DW.
The bad comes from the 'designers' who are taught print design at the universities and apply it to the Web, using all the nutty clicky-pointy tools that produce a JS-laden horror cabinet of non-standards-compliance they dare to call "HTML". It's a classic PEBKAC. Look at it this way - if DW didn't have those features, GoLive would've taken over long ago and we don't want THIS to happen. IMNSHO the only thing worse would be Frontpage. At least the guys at Macromedia didn't invent bogus HTML extensions because they were incapable of providing a proper metadata infrastructure, like Adobe did.
(I'm not a fanboy though, I just use what works best at the moment for the things I do. If someone shows me how to reproduce this "Apply Source Formatting" feature from DW in Kate/KDevelop and how to synchronize sites like in DW, I'm switching my machine at work from Win2K with DW to KDevelop/nvu on FreeBSD tomorrow, because it better fits the things I do nowadays. It will then match my setup at home.)
While we're at it, SEO is, was and always will be BS, just like the whole Internet Advertising Myth which after nearly a decade of documented failure still isn't debunked. Duh.
After link analysis (Score:5, Interesting)
Re:After link analysis (Score:4, Insightful)
(Last Journal: Friday February 17 2006, @06:51PM)
Yes (Score:5, Funny)
Re:Yes (Score:5, Insightful)
(http://www.intelligentblogger.com/ | Last Journal: Monday August 27, @11:47AM)
Is it the general opinion of the public... (Score:3, Interesting)
Take the article with a grain of salt... (Score:5, Insightful)
The article is not written by a Google employee, nor did the author speak with anyone at Google. It's simply his analysis of the patent document filed by Google.
Also, at the bottom of the article after the author's name, there's a link to some search optimization service's website.
Non-subscribers! Damn you all! (Score:2)
(Last Journal: Monday September 25 2006, @01:19PM)
Six weeks to fix? (Score:2, Informative)
If this claim is true, I guess we'll have to wait the typical "four to six weeks for delivery."
GoogleBombs Away (Score:5, Funny)
(http://slashdot.org/~Doc%20Ruby/journal | Last Journal: Thursday March 31 2005, @01:48PM)
effect on search engine optimizers (Score:5, Informative)
Article text and Google cache link (Score:3, Informative)
(http://sourcery.blogspot.com/ | Last Journal: Tuesday September 18, @11:53AM)
Google United - Google Patent Examined
Google's newest patent application is lengthy. It is interesting in some places and enigmatic in others. Less colourful than most end user license agreements, the patent covers an enormous range of ranking analysis techniques Google wants to ensure are kept under their control.
Publication Date: 4/7/2005 7:41:24 AM
By Jim Hedger, StepForth News Editor, StepForth Placement Inc.
Thoughts on Google's patent... "Information retrieval based on historical data."
Google's newest patent application is lengthy. It is interesting in some places and enigmatic in others. Less colourful than most end user license agreements, the patent covers an enormous range of ranking analysis techniques Google wants to ensure are kept under their control. Some of the ideas and concepts covered in the document are almost certainly worked into the current algorithm running Google. Some are being worked in as this article is being written. Some may never see the blue-light of electrons but are pretty good ideas so it might have been considered wise to patent them. Google's not saying which is which. While not exactly War and Peace, it's a pretty complex document that gives readers a glimpse inside the minds of Google engineers. What it doesn't give is a 100% clear overview of how Google operates now and how the various ideas covered in the patent application will be integrated into Google's algorithms. One interesting section seems to confirm what SEOs have been saying for almost a year, Google does have a "sandbox" where it stores new links or sites for about a month before evaluation.
Google is in the midst of sweeping changes to the way it operates as a search engine. As a matter of fact, it isn't really a search engine in the fine sense of the word anymore. It isn't really a portal either. It is more of an institution, the ultimate private-public partnership. Calling itself a media-company, Google is now a multi-faceted information and multi-media delivery system that is accessed primarily through its well-known interface found at www.google.com.
Google is known for its from-the-hip style of innovation. While the face is familiar, the brains behind it are growing and changing rapidly. Four major factors (technology, revenue, user demand and competition) influence and drive these changes. Where Microsoft dithers and .dll's over its software for years before introduction, Google encourages its staff to spend up to 20% of their time tripping their way up the stairs of invention. Sometimes they produce ideas that didn't work out as they expected, as was the case with Orkut, and sometimes they produce spectacular results as with Google News. The sum total of what works and what doesn't work has served to inform Google what its users want in a search engine. After all, where the users go, the advertising dollars must follow. Such is the way of the Internet.
In its recent SEC filing, the first it has produced since going public in August 2004, Google said it was going to spend a lot of money to continue outpacing its rivals. This year they figure they will spend about $500 million to develop or enhance newer technologies. In 2004 and 2003, Google spent $319 million and $177 million respectively. The increase in innovation-spending corresponds with a doubling of Google's staff headcount which has jumped from 1628 employees in 2003 to 3021 by the end of 2004.
Over the past five years Google has produced a number of features that have proven popular enough to be included among its public-search offerings. On their front page, these features include Image Search, Google Groups, Google News, Froogle, Google Local, and Google Desktop. There are dozens of other features which can be accessed by cli
Old Story (Score:1, Informative)
From the article: GOOGLE has plans that will dramatically improve the results of internet news searches, by ranking them according to quality rather than simply by their date and relevance to search terms. The ambitious system is revealed by patents filed in the US and around the world (WO 2005/029368) by researchers based at the company's headquarters in Mountain View, California.
Frequency of changes (Score:4, Insightful)
Also, a page with frames might get penalized since its content doesn't change, although the content of the frames may change frequently.
Coral cache link (Score:3, Funny)
(http://lamphowto.com/)
FAQs (Score:2)
Since the story submission didn't end the post with a question, I feel compelled to add one:
How will this affect the ranking of insightful FAQs, which by nature may not change frequently?
Another shout-out poll to my homeboy Slashdotters: Do you pronounce FAQs as "F-A-Q's" or "Faks"?
Google's crackdown is coming (Score:4, Insightful)
(http://www.animats.com)
Note that Google is now looking at domain ownership information. This may result in a much lower level of bogus information in domain registrations. It's probably a good idea to make sure that your domain registration information, business license, D&B rating, on-site contact info, and SSL certificates all match.
"Domain cloaking" will probably mean that you don't appear anywhere near the top in Google. So that's on the way out.
Search Engine Spam (Score:1)
(http://www.jewelrymall.com/)
Google's Click History Asset (Score:5, Insightful)
(http://slashdot.org/ | Last Journal: Wednesday October 23 2002, @05:38PM)
Google has millions upon millions of click-history records on its search results that show what people are really looking for, as well as which results looked like good fodder for a first click.
No one else has such a large database of what humans have actually picked.
Such a click history and search term history asset is worth even more if it gets correlated with Evil Direct Marketing information from the cookie traders.
Although, it seems possible that large ISPs could also grab and analyze their members' Google interactions to figure out people's tastes, assuming such interactions remain unencrypted.
I have to wonder how many companies with static IP addresses have, unbeknownst to them, built up extensive history logs at Google showing their search term preferences and click selections. If I were a technology startup with a hot idea to research I'd be a little more paranoid about something like that.
Re:Google's Click History Asset (Score:4, Interesting)
Re:Google's Click History Asset (Score:4, Informative)
(http://www.celsius1414.com/)
You sure about that? Try copying and pasting a Google results link.
For example, let's search Google for "elluusive" [google.com]. The first result was your Slashdot "homepage", at http://slashdot.org/~eluusive [slashdot.org], which at first glance seems to be a direct link. But if you right-click on the link, copy it, and paste it somewhere, you'll find something along these lines:
http://www.google.com/url?sa=U&start=1&q=http%3A/
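A wrapped link of this shape carries the real destination, percent-encoded, in its q parameter. A minimal Python sketch of pulling it back out follows. The example URL is made up (the target example.com/page is an assumption, since the link quoted above is cut off), and redirect_target() is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def redirect_target(url):
    """Return the decoded destination from the 'q' parameter, or None if absent."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("q")
    return values[0] if values else None


# Made-up example in the same shape as the wrapped result link quoted above.
example = "http://www.google.com/url?sa=U&start=1&q=http%3A//example.com/page"
target = redirect_target(example)
```

Because the click goes through google.com/url first, the server sees which result was chosen before the browser is sent on to the destination.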
Re:Google's Click History Asset (Score:5, Informative)
Each link in the search results on Google has an onmousedown event handler attached.
If you have JavaScript enabled and click on it, your browser will also execute the JavaScript, which sends a GET request to Google. They do log each link you click on.
Check the source of any Google search page.
The function that gets called for each onmousedown is called clk():
Attn: Google(TM) and Apple(TM) (Score:1, Insightful)
Thanks,
Rob Malda
Thank goodness (Score:1, Interesting)
Two Keys: Data Mining and Delay (Score:5, Interesting)
(http://www.backupcritic.com/ | Last Journal: Friday October 15 2004, @01:02PM)
What does that mean? At the highest level, it means that most of the Google algorithm is constructed by a machine. You give the machine human-constructed examples of how to rank a sample set of pages (notice those want ads where Google is hiring people who can inspect and assess the quality of web pages?) and it then uses essentially brute-force techniques to test every possible combination of your ranking variables to find the simplest formula that ranks pages the same way the human did.
There is no human at Google "twisting dials" to alter individual parameters of a formula. The machine constructs the algorithm, and it can therefore easily be so complex that no human can understand it. Tweaking the algorithm becomes a process of changing or adding to your "training set" of human-ranked pages, and letting the data mining process come up with a revised algorithm.
For example, Google could invent a new variable called "category", and identify each page as belonging to category Astronomy, Botulism, Country, [...] and Other. Once that variable is thrown into the mix, then the Google "algorithm" is essentially free to vary wildly from one type of subject matter to the next. For example, you might see someone with a Real Estate site swearing up and down that inbound links are no longer as important, while someone with an Astronomy site might swear that, no, inbound links are more important than ever. You can see exactly this kind of bickering in most of the forums that people who hope to do Search Engine Optimization frequent.
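A toy version of the data-mining idea described above, as a hedged Python sketch: a human supplies the desired ordering of a few sample pages, and a brute-force search tries every combination of weights on a coarse grid, keeping the simplest one (fewest non-zero weights) that reproduces the human ranking. The pages, the three features, and the grid are all invented for illustration; nothing here is known about Google's actual variables or training process.

```python
from itertools import product

# Hypothetical feature vectors: (inbound_links, keyword_match, freshness), each 0..1.
PAGES = {
    "a": (0.9, 0.2, 0.1),
    "b": (0.4, 0.8, 0.5),
    "c": (0.1, 0.3, 0.9),
    "d": (0.2, 0.1, 0.2),
}

# The "training set": the order a human reviewer put these pages in.
HUMAN_RANKING = ["b", "a", "c", "d"]


def rank(weights):
    """Order the pages by weighted feature sum, best first."""
    def score(name):
        return sum(w * f for w, f in zip(weights, PAGES[name]))
    return sorted(PAGES, key=score, reverse=True)


def fit(steps=5):
    """Brute-force every weight combination on a grid; prefer fewer non-zero weights."""
    grid = [i / (steps - 1) for i in range(steps)]
    best = None
    for weights in product(grid, repeat=3):
        if rank(weights) == HUMAN_RANKING:
            nonzero = sum(1 for w in weights if w > 0)
            if best is None or nonzero < best[0]:
                best = (nonzero, weights)
    return best[1] if best else None


learned = fit()
```

Changing HUMAN_RANKING and re-running fit() is the toy equivalent of "tweaking the algorithm" by changing the training set rather than twisting a dial by hand.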
The other big mistake people make in trying to see how to game the Google algorithm is "delay". In studying how people manage (or fail to manage) complex systems, psychologists learned that people generally would fail if a delay was introduced between their actions and the results of their actions.
In one very simple test, people were charged with trying to stabilize the temperature in a virtual refrigerator. They had one dial, and there was exactly one piece of feedback: the current temperature in the fridge. However, they were not explicitly told that there was a delay between moving the dial and when the results of that action would stabilize.
The responses of those test subjects were eerily similar to what we see in Google-gaming webmasters these days. Some people swore up and down that some human behind the scenes was directly tweaking the results to thwart whatever they did. Others became frustrated and decided that nothing they did really mattered, so they would just swing the dial back and forth between its minimum and maximum settings.
What does this have to do with Google? These days, Google can change their algorithm relatively frequently, and the algorithm can vary by the relative date of various things. The net sum is, there's a delay between when your page is first ranked and when it is likely to arrive at a relatively stable ranking. This can drive webmasters nuts as they think they've done something clever to rank their page high, but then it drops a week later. Although it doesn't occur to them, the important question is: did the change cause the high ranking or did it cause the sudden decline?
The few people who did master the simple refrigerator system? Well, they sounded more like some of the people who are more successful at gaming Google. Those folks tend to say things like: "just make one change and then leave it alone for a while to see what happens."
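The effect of the delay can be shown with a small simulation. This is a sketch under invented assumptions, not a model of the psychology experiment or of Google: the fridge simply shows the dial setting from `delay` steps ago, and the operator corrects the dial by the full error they can currently see. simulate() and its parameters are hypothetical names.

```python
def simulate(delay, wait, steps=40, goal=4.0, start=10.0):
    """Return the temperature seen at each step.

    delay: steps before a dial change shows up in the temperature
    wait:  steps the operator lets pass between dial adjustments
    """
    dial_history = [start] * (delay + 1)  # past dial settings, oldest first
    temps = []
    for step in range(steps):
        temp = dial_history[-(delay + 1)]  # the fridge reflects an old setting
        temps.append(temp)
        dial = dial_history[-1]
        if step % wait == 0:
            dial += goal - temp  # correct by the full error currently visible
        dial_history.append(dial)
    return temps


impatient = simulate(delay=3, wait=1)  # turns the dial at every step
patient = simulate(delay=3, wait=4)    # one change, then waits out the delay
```

In this toy model the impatient operator keeps reacting to stale readings and the temperature swings further and further from the goal, while the patient one settles on the goal after a single adjustment.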
Can you still game the Google algorithm? Undoubtedly in specific cases. But it's getting harder. The Google algorithm was always complex, but what's changing is that the days when a few variables (such as inbound li
Web page "freshness?" A good thing... (Score:3, Informative)
The site is mostly static but is rich with cultural value. It's currently the number one hit on Google. I'm hoping that Google's emphasis on "freshness" won't make his site fall in ranking.
seo gets more difficult (Score:2)
Wait a second. (Score:2, Insightful)
(http://www.paris-promenades.com/ | Last Journal: Wednesday June 15 2005, @01:07PM)
Seriously, this little article is going to get Webmasters thinking a little more but I don't see anything to panic about. Not yet, anyways.
solution to get reliable results (Score:1)
(http://slashdot.org/)
What about harmful link spam? (Score:3, Insightful)
Google rhymes with "GOD" (Score:2)
It's not nice to fool Mother Nature.
So What About The Rest (Score:1)
(http://www.geocities.com/my_haz_runs/)
Content still RULEZ! Film at 11 (Score:2)
Maybe the SEOs do realize it, but can't resist the offer of easy money from the thousands of MLM and "me too" sites trying to sell useless crap.
Cracked the code alright... (Score:1)
(http://www.nitemarecafe.com/)
Find Me - Google (Score:1)
(http://www.geocities.com/josephbcotton | Last Journal: Tuesday January 10 2006, @09:27PM)
I don't care (Score:1)
(Last Journal: Friday November 09, @11:57PM)
Higher rank for valid code (Score:2)
To fix Google dramatically, STOP ALL SPAMDEXING! (Score:1)
(http://www.slashdot.org/ | Last Journal: Tuesday March 09 2004, @11:15PM)
If all sites were limited to ONE AND ONLY ONE webpage from a bona-fide unique web domain, Google would probably need only a fraction of the computer systems to store and process 4 billion webpages.
This would also get rid of all the e-commerce affiliates who have set up shop in some directory on some public homepage webserver and not paid for their own domain.
This would also improve the performance and search results given out by Google by not having to index and catalogue more than one page of an e-commerce site.
dynamic vs. static (Score:1)
(http://breasy.com/blog/)
How does this apply to dynamic content, specifically dynamic content hidden behind Apache mod_rewrite to look and act static? I would assume that any time Googlebot hits such a URL it will see a file listed as modified. This is especially true if the content varies dynamically with things like 'latest comments' or 'newest' boxes and the like.
In this way, every page on my new site [isitnormal.com] is always "fresh" to some degree as small pieces are constantly changing and random. Anyone want to venture a guess as to how Google treats this situation?
Personally I think this whole "freshness" idea is misguided. It just doesn't make much sense.
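For what it's worth, a dynamic page does not have to look "modified" on every request. One option is to derive the Last-Modified header from when the main content actually changed and to answer If-Modified-Since checks against that same timestamp. Below is a minimal Python sketch using only the standard library; last_modified_header() and is_not_modified() are hypothetical helper names, and nothing here says how Googlebot actually weighs the header.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(content_changed_at):
    """Format an aware datetime as an HTTP date, e.g. for a Last-Modified header."""
    return format_datetime(content_changed_at.astimezone(timezone.utc), usegmt=True)


def is_not_modified(if_modified_since, content_changed_at):
    """True if the client's copy is still current (the server could answer 304)."""
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # missing or unparseable header: send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return content_changed_at.replace(microsecond=0) <= since


# Timestamp of the last real content change (made up for the example).
changed = datetime(2005, 5, 10, 11, 25, tzinfo=timezone.utc)
header = last_modified_header(changed)
```

A site could tie content_changed_at to the article body and ignore the rotating 'latest comments' box, so that only substantial edits move the date.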
Re:This is under YRO? (Score:4, Insightful)
Re:A reason why *not* to use .NET? (Score:2)
(http://www.bazaah.org/)
Re:This is under YRO? (Score:5, Funny)
(http://slashdot.org/)
Re:Is it the case.. (Score:2, Insightful)
Their search dominance is a direct result of PageRank. That they have a patent on it prevents other companies from copying the idea or hiring their employees away (Microsoft is notorious for doing both these things). So yes, the patent is important.
Sorry kids, but patents and "Do no evil" are mutually incompatible concepts.
You're retarded if you think that.
Re:A reason why *not* to use .NET? (Score:3, Insightful)
Re:10 Comments and the site is down (Score:2)
(http://willcode4beer.com/ | Last Journal: Thursday May 12 2005, @07:33AM)
It seems most of the pages Google returns for this error code (0x80004005) are about people using MS Access behind the web server. Now, I like to razz MS as much as the next guy, but this may just be a case of the wrong tool for the job. Access on a public web site is just a plain bad idea.
OTOH, even if it's not Access, the nature of the article suggests that any reasonable site would employ caching and not bang the database with every request. Of course, I admit that without seeing the site or their code it's all just speculation.
I think we can expect poor results from any tool misapplied.
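As an illustration of the caching point, here is a minimal time-based cache in Python. It is a sketch only: TTLCache and its loader argument are hypothetical names, and it says nothing about how the site in question is actually built.

```python
import time


class TTLCache:
    """Keep loaded values for ttl_seconds so repeat requests skip the database."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # function that does the expensive lookup
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no database call
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

With a 60-second TTL, a page hammered by thousands of readers would cost one database query per minute per key instead of one per request.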
Re:A reason why *not* to use .NET? (Score:2)
Re:A reason why *not* to use .NET? (Score:2)
(http://www.error-417.com/blog/ | Last Journal: Thursday July 28 2005, @12:43PM)
Re:Pamela Jones EXPOSED (Score:1)
Re:A reason why *not* to use .NET? (Score:2)
At least you're getting an error message telling you what's wrong instead of just no response.
You're new here aren't you?
Re:My Last Google Article (Score:1)
I certainly hope you took the time to complain about the article posted about the play Spamalot, and the one on the latest reviews of Star Wars. I imagine you're just complaining to complain and didn't even take the time to come up with a coherent argument (as is evident from the lame "Slashdot must be paid by Google"; do you honestly think Google needs to advertise?). So what "News for Nerds" should have been posted instead?
Re:A reason why *not* to use .NET? (Score:1)
(http://lyrictalk.net/)
You must be new. Welcome to Slashdot!