Semantic Web Under Suspicion
Dr Occult writes "Much of the talk at the 2006 World Wide Web conference has been about the technologies behind the so-called semantic web. The idea is to make the web intelligent by storing data such that it can be analyzed better by our machines, instead of the user having to sort and analyze the data from search engines. From the article: 'Big business, whose motto has always been time is money, is looking forward to the day when multiple sources of financial information can be cross-referenced to show market patterns almost instantly.' However, concern is also growing about the misuses of this intelligent web as an affront to privacy and security."
It's cool! (Score:3, Funny)
All Talk (Score:5, Informative)
I think that we are all missing some very important aspects of what it takes to make something capable of what they speak of. In all the projects I have worked on, to create something geared toward this sort of solution, you need two things: training data & a robust taxonomy.
First things first, how would we define or even agree on a taxonomy? By taxonomy, I mean something rigorous, with breadth & depth, that has been used and verified. By breadth I mean that it must be capable of normalization (pharmaceutical concoctions, drugs & pills are all the same concept) and stemming (go & went are the same action, dog & dogs are the same concept); also important is how many tokens wide a concept can be. By depth I mean that we must be able to define specificity and use it to our advantage (a site about 747s is scored higher than a site about airline jets, which is scored higher than a site about planes). By rigorous I mean that it must be tried and true.
Without a taxonomy, how will we index sites and tell the difference between "water tanks" and "panzer tanks"? I think this is one of the great things Google is missing to really improve its searching abilities. If you suggest an ontology instead, the problems encountered in developing it only multiply.
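For what it's worth, the normalization and stemming described above can be sketched in a few lines of Python. The lookup tables here are hypothetical toys built from the examples in this comment; a real taxonomy would be orders of magnitude larger and curated by hand:

```python
# Toy normalizer: map surface forms to canonical concepts
# (normalization + stemming). Tables are purely illustrative.
SYNONYMS = {
    "pharmaceutical concoctions": "drug",
    "pills": "drug",
    "drugs": "drug",
}
STEMS = {
    "went": "go",   # go & went are the same action
    "dogs": "dog",  # dog & dogs are the same concept
}

def normalize(term: str) -> str:
    term = term.lower().strip()
    term = SYNONYMS.get(term, term)  # collapse synonyms (may be multi-token)
    return STEMS.get(term, term)     # crude stemming via lookup

print(normalize("Pills"))  # -> drug
print(normalize("went"))   # -> go
```

Even this toy shows the breadth problem: every table entry is a human decision, and the tables never end.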
Where is the training data? Well, one may argue that the web content out there will suffice as training data but I think that more importantly, they need collections of traffic for these sites and user behavioral patterns to quickly and adequately deduce what the surfer is in need of.
I feel that these two aspects are missing and the taxonomy may be impossible to achieve.
Why are we even concerned with security if we can't even lay the foundations for the semantic web? I would argue that once we plan it out and determine it's viable, then we can concern ourselves with everyone's rights.
Re:All Talk (Score:3, Informative)
Re:All Talk (Score:2)
For what it's worth, I can think of two reasons you feel universally ignored. Fi
Re:All Talk (Score:1)
http://wordnet.princeton.edu/ [princeton.edu]
Re:All Talk (Score:3, Funny)
Re:All Talk (Score:3, Funny)
Re:All Talk (Score:1)
Re:All Talk (Score:5, Interesting)
Re:All Talk (Score:1)
Re:All Talk (Score:1)
I think it might be easier to approach the problem from another direction. Once a semantic A.I. like Cyc [cyc.com] has reached a level at which it can begin categorizing and "understanding" the information on the Web, it could do the enormous chore of creating a semantic web for us.
Re:All Talk (Score:2, Informative)
There's VerbNet [upenn.edu], FrameNet [berkeley.edu], Arabic WordNet [globalwordnet.org], and probably others I don't know about.
WordNet has become a standa
Re:All Talk (Score:1)
Re:All Talk (Score:1)
The context of a word seems to me (obviously not a math or CS geek) to be a good, and relatively easy to calculate, indicator of the word's relation to other terms. By context I mean the "physical" proximity to other terms on a page, rather than the normal written-language cont
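That proximity idea is basically windowed co-occurrence counting. A minimal sketch (the window size and whitespace tokenization are arbitrary choices for illustration):

```python
from collections import Counter

def cooccurrences(tokens, window=3):
    """Count unordered term pairs appearing within `window` tokens of
    each other - a crude proxy for "physical" proximity on a page."""
    pairs = Counter()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            pairs[tuple(sorted((term, other)))] += 1
    return pairs

tokens = "water tank water pump panzer tank".split()
counts = cooccurrences(tokens)
print(counts[("tank", "water")])  # -> 3
```

A page where "tank" keeps landing next to "water" rather than "panzer" gives you a cheap disambiguation signal without any taxonomy at all.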
Uhmm... (Score:2)
I get people all the time dismissing the whole idea because "man, you'd have to agree on definitions" or "how does 'it' know?" Rig
Re:All Talk (Score:2)
If we can help define standards for some part of knowledge, then we have helped the world a little bit - it's a better place than where we started.
As for how we do it, well, there is lots of experience around the world at doing this. Check out the Dewey Decimal system, or the Library of Congress classification. If you want something big, then SNOME
Smarter Machines (Score:5, Interesting)
What I really want to see is the search engine reduce duplicated content to single entries (try Googling for a Java classname and you'll see how many indexed websites have the API on them), or order results by recurrence of the word or phrase, giving the context more value than the popularity of the page.
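The duplicate-collapsing wished for here could be approximated with a content fingerprint. A minimal sketch - real engines use fuzzier fingerprints (shingling and the like), since mirrored API pages rarely match byte-for-byte:

```python
import hashlib

def dedupe(pages):
    """Keep the first page per normalized-content fingerprint.
    `pages` is a list of (url, text) tuples."""
    seen, unique = set(), []
    for url, text in pages:
        # Normalize whitespace and case so trivially mirrored copies collide.
        fingerprint = hashlib.sha1(
            " ".join(text.split()).lower().encode()
        ).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((url, text))
    return unique

results = dedupe([
    ("java.sun.com/api", "class StringBuffer ..."),
    ("mirror.example.com", "Class  STRINGBUFFER ..."),
    ("blog.example.com", "how I use StringBuffer"),
])
print(len(results))  # -> 2
```

The URLs above are made up; the point is that the two mirrored pages collapse to one entry while the genuinely different page survives.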
communication and marketing? (Score:1)
I wonder how this will influence our language and communication in general. Can language itself (not only its use) be assimilated by marketing?
I shudder at the thought of 'marketspeak'...
Re:Smarter Machines (Score:5, Insightful)
There is a huge problem with this, and it goes back to the days of people jamming 1000 instances of their keywords at the bottom of their pages in the same font color as the background. Also, your desire to rate pages on context requires an ontology-type algorithm, which is NOT easy. Google has been working on this for a little while now, but it is a big hill to climb. They are using popularity as a substitute for this. It is not the most effective, but it is a pretty decent second option.
There is another issue with the approach you suggest. If Google decides that javapage.htm is the end-all, be-all of Java knowledge and removes all other listings from its database, then everyone and their grandmother will be fed information from this one source. That will ultimately reduce the effectiveness of Google at returning valid responses to people who do not search like a robot.
There is a human element at play here that Google is attempting to cater to through sheer numbers. Not everyone knows how to use search properly; hell, most people have no idea. Keyword order, booleans, quotes - these all affect the results given back, but very few people use them right off the bat. If you reduce the returned listings for a single-word search to the one area determined to be the authority, you have just made your search engine less effective in the eyes of the less skilled. I would be willing to bet that this less-skilled group composes most of Google's userbase.
If you don't cater to these people, then you lose marketshare, and then you lose revenue from advertisers, and then you go out of business.
Re:Smarter Machines (Score:1)
Re:Smarter Machines (Score:2)
Re:Smarter Machines (Score:2)
Unfortunately that is exactly what is happening today. [wikipedia.org]
Re:Smarter Machines (Score:2)
You could just thread the result. If you did a search for a certain java class, and it turned out a whack of pages w
Re:Smarter Machines (Score:1)
Re:Smarter Machines (Score:1, Flamebait)
I fear the day when typing on an electronic device will produce better-looking text and typography than my painstakingly painting every letter to produce one book a year.
Re:Smarter Machines (Score:2)
So you'd prefer Google just return all pages in its index with your keywords in them, in a random order, and let you go through the 3 million results looking for the important ones by hand?
The semantic web is all about allowing you to more precisely specify your keywords. More precise search results then follow.
This is already in place, just not on the web. (Score:1, Offtopic)
Hypothetically, if all of them decided it would be for the good of humanity to allow someone to examine their sales in real time as a whole to identify flu outbreaks early - then the process of doing that would not be too difficult.
UPS and FedEx track their packages in real time; they know who sent them, who is receiving them, and how much they weigh.
Da
Re:This is already in place, just not on the web. (Score:2)
UPS and FedEx track their packages in real time; they know who sent them, who is receiving them, and how much they weigh.
Because in and of itself, how much my package weighs doesn't amount to a hill of beans. However, if I knew a natural disaster had recently struck an area, and found some more "harmless" data to add to my filter, I could tell how much stuff people are replacing online via insurance claims, and some other very interesting things.
I just thought I'd clarify
NSA goes public (Score:1)
It's already happening... (Score:4, Insightful)
...and growing and evolving.
Take a look at the "blogosphere" and the tagging/classification initiative that's happening there.
Sure, it seems crude and unrefined but it's working, like most grass-roots initiatives do when compared with grandiose "industry standards" and the big, bulky workgroups that try to define them.
The idea is to make the web intelligent (Score:3, Funny)
5...4...3...2...1
Already took care of it (Score:2)
SKYNET vs "Intelligent web" .. (Score:3, Funny)
I dare hypothesize that if a truly intelligent web ever arose, it would have a strong porn background.
I shudder to think of what its version of Judgement Day would be.
Biz School (Score:3, Insightful)
That motto is really "anything for a buck". Even if business has to wait or waste time to get money, it will wait until the cows come home - then sell them.
Re:Biz School (Score:1, Offtopic)
100% Troll
TrollMods must get paid in dollars, because they certainly don't have sense.
Semantic Web != evil (Score:5, Informative)
The article would have us believe that this is going to expose everyone to massive amounts of privacy invasion. This is not necessarily the case. It is already the case that there are privacy mechanisms to protect information in the SW (e.g. require agents to authenticate to a site to retrieve restricted information). Beyond simple mechanisms, there is a lot of research being conducted on the idea of trust in the semantic web - e.g. how does my agent know to trust a slashdot article as absolute truth and a wikipedia article as outright fabrication (or vice versa).
As for making the content of the internet widely available, some researchers feel this will never happen. As another commenter noted, it is essential that there be agreement on the definition of concepts (ontologies) for the SW to work (if my agent believes the symbol "apple" refers to the concept Computer, and your agent believes it refers to "garbage", we may have some interesting but less than useful results). I am researching ontology generation using information extraction / NLP techniques, and it is certainly a difficult problem, and one that isn't likely to have a trivial solution (in some respects, this goes back to the origins of AI in the 1950s, and we're still hacking at it today).
For some good references on the Semantic Web (beyond Wikipedia), check out some of these links
Re:Semantic Web != evil (Score:1)
Re:Semantic Web != evil (Score:2)
Re:Semantic Web != evil (Score:2)
Is it possible to have a markup structure that could handle this issue by searching for a "secondary key" bit of information to qualify the identifier? Using your example above of "apple":
Re:Semantic Web != evil (Score:2, Informative)
For example, we could define the Apple domain as
Classes: Computer, Garbage, ComputerMfg
Roles: makesComputer, computerMadeBy
We can assign the domain of makesComputer to be a ComputerMfg, and the range to be a Computer (the inverse would be flipped).
<owl:Class rdf:ID="Computer"/>
Class rdf:ID
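To illustrate the domain/range point, here is a toy reasoner in Python using the names from the hypothetical Apple example above (a real agent would use an OWL reasoner, not hand-rolled dicts):

```python
# Declared domain/range per role, from the Apple example above.
DOMAINS = {"makesComputer": "ComputerMfg"}
RANGES = {"makesComputer": "Computer"}
INVERSES = {"makesComputer": "computerMadeBy"}

def infer_types(subject, role, obj):
    """Infer class membership for both ends of a triple from the
    role's declared domain and range."""
    inferred = {}
    if role in DOMAINS:
        inferred[subject] = DOMAINS[role]
    if role in RANGES:
        inferred[obj] = RANGES[role]
    return inferred

# The triple "Apple makesComputer MacBook" disambiguates "apple":
print(infer_types("Apple", "makesComputer", "MacBook"))
# -> {'Apple': 'ComputerMfg', 'MacBook': 'Computer'}
```

This is exactly the "secondary key" the grandparent asked about: the role's declared domain qualifies the identifier, so no agent can mistake this "Apple" for garbage.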
OT THANKS (Score:2)
Pfff, the problem is marketing (Score:5, Insightful)
You could already do this semantic web nonsense if people would just stick to a standard and be honest with what they publish.
Nobody wants to do that, however. Mobile phone companies always try to make their offering sound as attractive as possible by highlighting the good points and hiding the bad ones. Phone stores try to cut through this by making their own charts comparing phone companies, but in turn try to hide the fact that they get a bigger cut from some companies than others.
It wouldn't be at all hard to set up a standard that would make it very easy to tell what cell phone subscription is best for you. Getting the companies involved to participate is impossible however.
This is the real problem with searching the web right now. It wouldn't be at all hard to use Google today if everyone were honest with their site content - for instance, removing the word "review" from a product page if no review is available.
Do you think this is going to happen any day soon? No? Then the semantic web will not be with us any day soon either.
This is actually insightful (Score:2)
Re:This is actually insightful (Score:2)
It's because of people like you that we're getting identity cards. Would it have killed you to join Tesco's loyalty programme?
renting to avoid local government records
How does renting help? You still need to be on the electoral register and you need to pay council tax - in both cases it doesn't matter if you're renting or owning. And what about TV Licencing? Not having a TV Li
Who Web? (Score:3, Funny)
How many people read this and thought "Okay, what have they done with Norton now?"
Someone is being hysterical... (Score:2)
The idea, however,
Healthcare? (Score:1, Informative)
But all this semantic web stuff makes me giggle when they start talking about healthcare, anyway. I worked in that industry up until a couple years ago. Semantic web people want to move everybody away from EDI...while the healthcare people are struggling to upgrade to EDI. In 2003 I was setting up imports of fixed-length mainframe records. By the time healthcare is ex
Re:Healthcare? (Score:2)
I was interested that you posted about the healthcare industry, because I work in it today, and also went to a university which has done quite a bit of research into the area of health & bio informatics. From the research, it is clear that the semantic web and healthcare are actually a great match for each other, particularly when it comes to things like concepts & ontologies (for example, check out MeSH [nih.gov] if you haven't seen it before).
Another example of how semantics make sense for healthcare i
The next great leap (Score:2, Insightful)
Semantic Web vs. Contextual Web Mining (Score:3, Insightful)
The semantic search engine would then cross-reference all of the information about hotels in Majorca, including checking whether the rooms are available, and then bring back the results which match your query.
And here in all its glory is the 1999 version:
The software would then use XML to cross-reference all of the information about hotels in Majorca, including checking whether the rooms are available, and then bring back the results which match your query.
Of course, the problem with this fantasy of XML was that the lack of schema standardization led to an infinite mix of tagging, and thus the layperson's idea that "this XML document can be read and understood by any software" was pure bunk.
Granted, the semantic web addresses many of these problems, but IMHO the underlying problem remains: layers of context on top of content still need to be parsed and understood.
So the question remains: will the Semantic Web be implemented in a useful fashion before someone develops a Contextual Web Mining system that understands web content well enough to fulfill the promise of the Semantic Web without additional context?
Disclaimer: I work on contextual web content extraction software [q-phrase.com], so yes, I may be biased towards this solution, but I really think the Semantic Web has an insanely high hurdle (proper implementation in millions of web pages) before we can tell how successful it is.
Re:Semantic Web vs. Contextual Web Mining (Score:3, Interesting)
Re:Semantic Web vs. Contextual Web Mining (Score:1)
Well (Score:3, Informative)
Re:Practical Applications??? (Score:1, Interesting)
There are already lots of inferencing engines, too - Sesame, cwm, etc. It's really not a big deal; the whole point of RDF is that the architecture makes this stuff easy.
Re:Practical Applications??? (Score:1, Interesting)
CWM sucks big time. Just go ask the semantic web researchers out there how awful it is and how poorly it scales. In fact, Google it and see what results you find.
Oy Vey! (Score:1)
Now THAT would be something.
outrage !! (Score:1)
symantec web? (Score:1)
Glass Houses (Score:5, Insightful)
"All of this data is public data already," said Mr Glaser. "The problem comes when it is processed."
The privacy and security concerns are bizarre. They're saying that there is currently an implicit "security through obscurity" and that's OK - but if someone were to make that already-public data easier to find, it would be less secure?
Here's a radical thought; don't make any data public you don't want someone to see. Blaming Google because you put your home address on your blog and "bad people" found you is absurd. If data is sensitive it shouldn't be there now.
You can't really bitch about peeping Toms if you built the glass house.
Re:Glass Houses (Score:2)
The problem is that no one is willing to man
Re:Glass Houses (Score:2)
a bunch of automatically generated metadata was added to it, possibly without your knowledge. Think of all the trouble that has come from Word document metadata being put on the web.
I'm not sure that this is the gist of the article I read, but it is an interesting thought.
That's more under the heading of cute, insecure ideas that just won't die. Why does a web server tell me its name and build number? Why does web publishing software include the registered user's name in the metadata? It's val
I have a chapter on SW in my new book (Score:3, Informative)
The Protege project provides a good (and free) editor for working on ontologies - you might want to grab a copy and work through a tutorial.
I think the SW will take off, but its success will be a grass-roots type of effort: simple ontologies will be used in an ad hoc sort of way, and the popular ones might become de facto standards. I don't think a top-down standards approach is going to work.
I added a chapter on the SW to my current Ruby book project, but it just has a few simple examples because I wanted to only use standard Ruby libraries -- no dependencies makes it easier to play with.
I had a SW business idea a few years ago and hacked some Lisp code, but never went anywhere with it (I tend to stop working on my own projects when I get large consulting jobs): define a simple ontology for representing news stories and write an intelligent scraper that could create instances from certain types of news stories. Anyway, I have always intended to get back to this idea someday.
Going around the utopia (Score:1)
The Semantic Web I see coming is one where many different, limited domains are identified and semantically annotated, allowing some kind of agents to perform well-defined activities for us (i.e., book tickets, make appointments, search info, etc.). This sound
Lie to Telemarketers! (Score:1)
books, newspapers, magazines are dangerous too (Score:2)
Come on, this is absurd. If anything, this article underscores the need for privacy laws - but the privacy implications of the semantic web are hardly more significant than those of any other publishing method.
Big Business Already Has This: (Score:1)
That's the Bloomberg service in a nutshell. Yes, same company founded by current NYC mayor, Michael Bloomberg. As an example, I was able to simultaneously examine various financial ratios of about 1,200 companies along with their current market values. Depending upon where certain ratios went, I flagged them
The OTHER massive issue (Score:3, Insightful)
You're kidding (Score:2)
There are still people out there doing that stuff? That's too much! Good luck, semantic web dudes!
N.B. The above is a flippant, snide, and unhelpful comment. However, in my defence, I submit that that is _exactly_ the sort of comment that any remaining semantic web diehards should be most used to hearing.