Meaning-Based Web and Web 3.0


Yesterday I went to the NY Social Media Club meeting. (See my summary of last month’s meeting on Social Networks here.) The topic of discussion was the Semantic Web and Web 3.0. There were two panelists, moderated by Howard Greenstein. The first panelist was Tim McGuinness, Vice President of Search at Hakia.com. Hakia is a NY-based startup that has a great meaning-based search engine. They just launched a new beta version with some social networking features this week. They use Natural Language Processing techniques to produce better searches. Nate Westheimer was the other panelist. He is the founder of BricaBox.com, a site that just launched its beta this morning. Also, Marco Neumann, the leader of the NY Semantic Web Meetup, contributed a lot to the conversation. This post is not a strict summary, but rather some thoughts related to and inspired by yesterday’s discussion. I purposely use the term Meaning-Based Web and stay away from the term Semantic Web, since the latter refers more to a set of technologies than to a wider concept.

Meaning-Based Web – Motivation

First of all, the Semantic Web is really about improving the connections and the meaning that one can glean from the internet, so that when you search, you get back only the results relevant to the meaning of what you are looking for. The goal of meaning-based web technologies is to make the meaning of pages on the World Wide Web better understood by computers. This will drastically improve our ability to find things and to ask intelligent questions about the world.

To illustrate the difference: today, when somebody does a search for “George Bush”, the search engines are fundamentally looking for the string of characters in the sequence you typed in. The engine does not understand that you are talking about a person, and cannot directly relate the name to the fact that George Bush is the president. You want your search to find all the cases where George Bush is referred to through meaning, e.g. The President, the 41st President’s son, “W,” Kerry’s opponent in 2004, etc. To us humans, these are obvious connections to make. To computers, they are not.
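To make the difference concrete, here is a minimal Python sketch that contrasts literal string matching with a lookup that knows several names refer to the same person. Everything in it is an assumption made for illustration: the alias table, the sample documents, and the function names are hand-written stand-ins, and a real engine would derive this knowledge rather than hard-code it.

```python
# Illustrative data only: a hand-written alias table standing in for
# the world knowledge a meaning-based engine would need.
ENTITY_ALIASES = {
    "George W. Bush": {"george bush", "the president", "kerry's opponent in 2004"},
}

DOCUMENTS = [
    "George Bush spoke to reporters today.",
    "The President signed the bill this morning.",
    "Kerry's opponent in 2004 gave a speech.",
    "A recipe for salsa with fresh tomatoes.",
]


def keyword_search(query, documents):
    """Return documents containing the literal query string (case-insensitive)."""
    needle = query.lower()
    return [doc for doc in documents if needle in doc.lower()]


def meaning_search(query, documents, aliases=ENTITY_ALIASES):
    """Return documents mentioning any known alias of the entity the query names."""
    needle = query.lower()
    terms = {needle}
    for entity, names in aliases.items():
        # If the query names this entity, search for every alias as well.
        if needle == entity.lower() or needle in names:
            terms |= names | {entity.lower()}
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


if __name__ == "__main__":
    print(keyword_search("George Bush", DOCUMENTS))
    print(meaning_search("George Bush", DOCUMENTS))
```

The keyword version only finds the document that spells the name out, while the alias-aware version also picks up the two documents that refer to the same person indirectly.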

There are two philosophically different ways to approach this goal: Semantic Web techniques and Natural Language Processing. In a way, they are two sides of the same coin.

Approach One – Evolve the Web (Semantic Web, Microformats)

The first approach is to evolve the web by adding more information to it. This means that content producers will add more information to the Web, and thus enhance it to make it more understandable to machines. The Semantic Web is a set of W3C standards that allow you to add that data and query it. Very similar to the W3C approach, Microformats are another, more lightweight way to do the same. The goal is for content to have semantic information attached to it so that computers can read it and form connections just like humans do.

Using this approach, a page with information about a movie called “Magnolia” has hooks in the page (possibly invisible to the user) that mark it as such. A page about the magnolia flower has markings that explain that it is about a flower.
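As a rough sketch of what reading such hooks could look like, the Python snippet below pulls a type marker out of two pages that both have the title “Magnolia”. The `data-kind` attribute is a made-up placeholder, not a real Microformat or W3C vocabulary, and the two page snippets are invented for the example. Only Python’s standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class KindExtractor(HTMLParser):
    """Collect the values of the hypothetical data-kind attribute."""

    def __init__(self):
        super().__init__()
        self.kinds = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "data-kind":
                self.kinds.append(value)


def page_kinds(html):
    """Return every data-kind marker found in the given HTML string."""
    parser = KindExtractor()
    parser.feed(html)
    return parser.kinds


# Two invented pages with the same visible title but different markers.
MOVIE_PAGE = '<div data-kind="movie"><h1>Magnolia</h1></div>'
FLOWER_PAGE = '<div data-kind="flower"><h1>Magnolia</h1></div>'

if __name__ == "__main__":
    print(page_kinds(MOVIE_PAGE))
    print(page_kinds(FLOWER_PAGE))
```

The visible text is identical on both pages. The only thing that tells the machine which Magnolia it is looking at is the marker the author added.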

Approach Two – Extract Meaning (through Natural Language Analysis)

The second approach is to work harder to extract meaning from the web as it is. It assumes that enhancing the data is very cumbersome, and that while some people will do it, not everybody will. Additionally, relying on added data means that there will always be holes and things that cannot be expressed easily. It would be great if computers could get closer to the real meaning of what the web pages are talking about.

This movement espouses Natural Language Processing techniques. Natural Language Processing is a set of techniques that try to extract meaning and relationships from text. The algorithms read the texts and cull meanings from them, with the help of an ontology of relationships defined elsewhere. For example, the ontology will know that a car is a vehicle, that a car has certain actions that it can perform, and that it is different from an animate being, which means that it cannot speak. As the engine “reads” the web pages, it applies the ontology to the content and records not where a specific word can be found in a document, but rather where a specific concept is. Additionally, these ontologies are language independent, except for some minor language-specific particularities.
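A toy version of such an ontology can be sketched in a few lines of Python. The concepts, the is-a links, and the capability sets below are all assumptions made up for this example. Real ontologies are far larger and richer, and this is not how any particular product stores them.

```python
# Hypothetical is-a links: each concept points to its parent concept.
IS_A = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "inanimate object",
    "person": "animate being",
}

# Hypothetical capabilities attached to concepts and inherited by children.
CAPABILITIES = {
    "vehicle": {"move", "carry passengers"},
    "animate being": {"move", "speak"},
}


def ancestors(concept, is_a=IS_A):
    """Return the chain of parent concepts, nearest first."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain


def can(concept, action, capabilities=CAPABILITIES):
    """Check whether a concept, or anything it is a kind of, can perform an action."""
    lineage = [concept] + ancestors(concept)
    return any(action in capabilities.get(c, set()) for c in lineage)


if __name__ == "__main__":
    print(ancestors("car"))
    print(can("car", "move"), can("car", "speak"))
```

Because the links are between concepts rather than words, the same structure could be reused for text in another language once its words are mapped to these concepts.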

To come back to our example, using this approach the technology will automatically be able to tell that the page is talking about a flower because it sees words like “grows” and “soil”, the same clues that allow us humans to disambiguate the meaning of words. It is able to figure out meaning from the context.
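The simplest possible form of this idea is to count how many context clues for each sense appear around the ambiguous word. The sketch below does exactly that. The clue lists and the sample sentences are invented for illustration, and real NLP systems use far more sophisticated models than word overlap.

```python
import re

# Hypothetical clue words for each sense of "magnolia".
SENSE_CLUES = {
    "flower": {"grows", "soil", "petals", "garden", "bloom"},
    "movie": {"director", "cast", "film", "scene", "screenplay"},
}


def disambiguate(text, sense_clues=SENSE_CLUES):
    """Return the sense whose clue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & clues) for sense, clues in sense_clues.items()}
    best = max(scores, key=scores.get)
    # With no clues at all, the text alone cannot settle the meaning.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("The magnolia grows best in moist soil."))
    print(disambiguate("The cast and director of Magnolia discussed the film."))
    print(disambiguate("Magnolia"))
```

Note that the bare word “Magnolia” returns no answer, which is the situation where an engine would need more context from the user.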

In reality, both approaches have their strengths and weaknesses, although I have to admit to being more partial to the NLP approach over the long term.

Implications

The implications here are profound. As this technology improves, searching will become more seamless, and things like Search Engine Optimization will be a thing of the past. The search engines will understand the true meaning and value of the content, and will be able to direct people towards you. The very cumbersome task of thinking up all the various words that people might use to search for your content will no longer be necessary.

The ability to understand text on a higher level (natural language processing) means that ads will be targeted even more precisely. A lot of ambiguities will be resolved easily, just by the engine asking you a few disambiguating questions. As a user, you will be rewarded for putting in more search terms, since the engine will be able to find the information you are looking for faster. You will be able to have an interactive conversation with your search engine until you zero in on precisely what you are looking for.

As far as search engines are concerned, I see meaning-based searching as the future. However, that cannot happen in isolation, because understanding what a page means does not tell you whether it can be trusted. For example, there is a lot of bad information on the web about child vaccinations, coming from a vocal minority of sorts, mostly driven by laypeople. There is also a tremendous amount of authoritative research data that shows the benefits of vaccinations. One of the reasons that Google search has been successful is that they have been able to harness the power of authority: their original PageRank algorithm was based on the assumption that a page that has been linked to a lot is more authoritative than others. Since then, their search algorithms have evolved a thousandfold, but the central concept of authoritative sources is still very important on the internet (and in real life).
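The link-based notion of authority can be sketched with the textbook power-iteration form of PageRank. This is a simplified illustration, not Google’s production algorithm. The damping factor of 0.85 is the commonly quoted default, and the four-page link graph and page names are made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute simplified PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: two blogs cite a research page, a forum cites one blog.
EXAMPLE_LINKS = {
    "research": [],
    "blog_a": ["research"],
    "blog_b": ["research"],
    "forum": ["blog_a"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(EXAMPLE_LINKS).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

In this small graph the research page ends up with the highest score simply because the most pages point to it, which is the assumption about authority described above.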

On top of the natural language techniques and the authority-based approach, the next realm in search is personalization and social networking. The next generation of collaborative filtering technologies will be collaboration-based with personalization mixed in. You’ll receive not just the best content, but the best content targeted to your current interests. If the search engine knows that I am currently interested in dancing, and I search for salsa, it will automatically return sites related to dancing, as opposed to cooking. Additionally, if it can mark studios or events that my friends have been to, that would be even more valuable.
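The salsa example can be sketched as a re-ranking step that runs after the ordinary search. In the Python snippet below, the result records, the boost values, and the field names are all hypothetical. The point is only to show a base relevance score being nudged by the user’s interests and by friends’ activity.

```python
# Hypothetical boost sizes; a real system would tune or learn these.
INTEREST_BOOST = 0.5
FRIEND_BOOST = 0.3


def personalized_rank(results, interests, friend_visits=frozenset()):
    """Re-order results by base score plus interest and friend boosts.

    results is a list of dicts with 'title', 'topic', and 'score' keys.
    """

    def boosted(result):
        score = result["score"]
        if result["topic"] in interests:
            score += INTEREST_BOOST
        if result["title"] in friend_visits:
            score += FRIEND_BOOST
        return score

    return sorted(results, key=boosted, reverse=True)


# Invented search results for the query "salsa".
SALSA_RESULTS = [
    {"title": "Salsa recipes", "topic": "cooking", "score": 0.9},
    {"title": "Salsa classes downtown", "topic": "dancing", "score": 0.7},
    {"title": "Salsa night at Studio X", "topic": "dancing", "score": 0.6},
]

if __name__ == "__main__":
    ranked = personalized_rank(
        SALSA_RESULTS,
        interests={"dancing"},
        friend_visits={"Salsa night at Studio X"},
    )
    for result in ranked:
        print(result["title"])
```

Without any interests or friend data, the function falls back to the plain relevance order, so personalization here is a layer on top of search rather than a replacement for it.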

Web 3.0

So what is Web 3.0? Nobody knows yet, and neither do I. Right now, I think it’s emerging as a combination of several new technologies: meaning-based web, social networking, greater personalization, and locale-based information. I think that once you are able to create mashups based on meaning-based information, and to extract that information easily from existing data sources, we will have Web 3.0. Lastly, many sites will offer not just access APIs, but a way to really integrate your application into them. In that sense, the Facebook API, OpenSocial, and Ning are early precursors of Web 3.0.

New things will become possible. It will be easy to cross-reference unstructured documents with information stored in relational databases. It will be easy to create a personal profile page based on the information already out there on the internet. It will be easy to create something similar to a tumblelog based on your web activities. We are not quite there yet with mashups, at least not based on what I’ve seen. We are close, though. When everything becomes a data source, then we will have arrived.

Meaning-Based Enterprise

Since I work for a company, Alfresco, that is focused on bringing Web 2.0 ideas into the enterprise, I am concerned with how this will affect the people behind the firewall. Just like on the public web, I see a great opportunity to transform existing systems and ways of collaborating.

One of the reasons why many content management solutions exist is to add semantic meaning to data. When you create a taxonomy to classify your documents, you are adding semantic meaning on top of unstructured content. Much of the reason we do this is that computers cannot quite do it themselves. As the technology improves, a lot of traditional document management systems will be fundamentally changed. Whole areas such as taxonomy analysis and information architecture will be transformed, since semantic web techniques will allow these taxonomies to be extracted automatically from the documents themselves.

I also see some great short-term opportunities in Natural Language Processing technologies and services. If a document can be automatically tagged with metadata instead of humans having to do it, this leads to a much better user experience and thus more useful content management systems. Some of this will require better plumbing, and some of it will require newer interfaces, of the kind that Adobe Flex or Microsoft Silverlight are starting to enable. This is why we are firmly committed to Flex as the future evolution of our user interface.
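As a very rough sketch of automatic tagging, the Python snippet below proposes tags by counting the most frequent non-trivial words in a document. This is plain term frequency, which is a much cruder stand-in for real NLP-based tagging, and the stopword list, function name, and sample text are all made up for illustration.

```python
import re
from collections import Counter

# A tiny hand-picked stopword list; real systems use much larger ones.
STOPWORDS = {"the", "and", "is", "on", "of", "a", "an", "to", "in", "for"}


def auto_tags(text, max_tags=3):
    """Suggest tags by picking the most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]


SAMPLE_DOCUMENT = (
    "The invoice lists the invoice number, the vendor, and the payment terms. "
    "Payment is due on receipt of the invoice."
)

if __name__ == "__main__":
    print(auto_tags(SAMPLE_DOCUMENT, max_tags=2))
```

Even this naive approach removes some manual work, which hints at how much more a system that actually understands the text could do.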

Since the next generation of the web will feature much better meaning-based technologies, this will also dramatically improve collaboration and information sharing. Tools will be developed that will become agents, searching in the background for information that’s relevant to your work interests, and will automatically notify you of things you didn’t even know you were looking for. As you are working on solving a problem, an agent will also be searching for a solution to your problems, both inside the intranet and on the public web. The agents will also be able to traverse your social network, and connect you with other people in your social network or company who have expertise in the area.

Auto-tagging, auto-classification, and new ways of collaborating, whether wiki-based or mashup-based, are all transforming the public web. And these superior ways of collaborating are moving rapidly inside the enterprise. Forrester talks about tech populism: the idea that as the web becomes more user-friendly, enterprise users will demand the same simplicity and interactivity they are becoming used to.

This is the future I am excited to be a part of.

Some more resources on Meaning-Based Web and Web 3.0:

  • Great article about the Semantic Web from Scientific American by Tim Berners-Lee, the inventor of the World Wide Web.
  • Semantic Wave Report from Project 10X
  • LingPipe – an advanced NLP Java library. New York-based and partially open source software.
  • Twine – from Radar Networks, a semantic web startup that just got some funding.
  • Hakia.com – NLP-based search engine
