XML Europe 2004

Crawling the Semantic Web

Abstract

This paper examines the problem of semantic web crawling: following links from document to document and gathering the results for searching. Unlike centralised web search facilities, semantic web agents will be distributed, personalised and often highly domain-specific. How can we hold the entire world inside our laptops?

The W3C vision for the Semantic Web is "an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation". It is "the representation of data on the World Wide Web", expressed using the Resource Description Framework (RDF). Just as the web grew in usefulness as it was traversed, indexed and searched by systems such as Lycos, AltaVista and Google, the semantic web requires technologies that can crawl, aggregate and query RDF data.

This paper presents a modular semantic web crawler designed to explore how crawling and aggregation can be provided as services to applications. It highlights the differences from, and similarities to, existing web search systems that gather their source data from the public web. Rather than having web crawling and aggregation built into every semantic web application, agents will be able to call on aggregation services via web services, be notified of new resources through publish-and-subscribe mechanisms, or simply receive a stream of RDF statements as they are found.

Keywords