Abstract
This paper examines the problem of semantic web crawling - following links from document to document and gathering the results for searching. Unlike centralised web search facilities, semantic web agents will be distributed, personalised and often highly domain-specific. How can we hold the entire world inside our laptops?
The W3C vision for the Semantic Web is "an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation". It is "the representation of data on the World Wide Web", expressed using the Resource Description Framework (RDF). Just as the web grew in usefulness as it was traversed, indexed and searched by systems such as Lycos, Altavista and Google, the semantic web requires technologies that can crawl, aggregate and query the RDF data.
This paper presents a modular semantic web crawler designed to explore the provision of services to applications. It highlights differences from and similarities to existing web search systems that gather their source data from the public web. Rather than have web crawling and aggregation built into every semantic web application, agents will be able to call on aggregation services via webservices, be notified of new resources by publish-and-subscribe mechanisms, or simply receive a stream of RDF statements as they are found.
Keywords
Table of Contents
Crawling the semantic web is essentially identical to crawling the HTML content web - it's simply a case of choosing one or more starting points, downloading a resource and following the pointers in it to further resources.
The key distinction between gathering HTML and RDF data is that RDF has a well-defined mechanism for merging multiple RDF models. Any number of RDF models can be combined to produce a single unified model. Instead of building a database of keywords and lsnks to locations where HTML representations related to those keywords may be found, an RDF crawler can build a combined model of all the semantic web data found. This combined model then becomes a potentially rich resource for answering queries. Separate fragments of information stored in disparate documents may combine to form the answer to a query that could not be answer by any one individual document. Miller, Guha and McCool (http://www2003.org/cdrom/papers/refereed/p779/ess.html) give the example of combining data from a CD database, a music encyclopaedia, a geographic almanac and a weather database to answer the contrived query, "What is the current temperature in the town where the creator of the album 'Appalachian Journey' was born?".
The motivation for this exploration of semantic web crawling is providing services to applications. Depending on its functionality and available computing resources, an application might wish to hook in at one of the many levels of the semantic web 'layer cake' or 'Stack of Expressive Power' (http://www.w3.org/2003/Talks/05-gartner-tbl/slide29-0.html%22) as defined by Tim Berners-Lee. A storage and indexing system may simply require a stream of URLs from which RDF resources may be downloaded. An agent such as foafbot (http://usefulinc.com/foaf/foafbot) might require a querying service backed by a store of gathered information, and perhaps expect that the model queried have had certain normalisations or inferences performed on it. Domain-specific applications might only be interested in statements made using a fixed set of predicates from a given namespace.
In order to support this wide variety of requirements, we allow modules to be hooked into the scutter at various points in its architecture. Later in this paper we will examine the use of the Joseki (http://www.joseki.org/) RDF server to store and query an aggregated model, the use of the Spread (http://www.spread.org/) distributed messaging system to broadcast discovered resources to subscribing agents, and the application of domain-specific rules and normalisations targeted at processing of social network information expressed with the FOAF (http://www.foaf-project.org) vocabulary.
The three main activities of a web crawler are:
scheduling the download of resources
gathering new URLs to crawl
processing of downloaded resources
Accordingly, we split up the architecture of the crawler into three linked subsystems. Modules may be added to each subsystem in order to observe or alter the flow of information.
Considering the history of growth of the current content web, any semantic web scutter built to be run as a long-term service must be able to scale with a similar growth of the semantic web. The design strategy for our scutter is to minimise the state that it is necessary to store in order to perform its crawling task and still communicate with clients.
In order to monitor additions and changes to the semantic web, we require that our scutter is able to run 24 hours a day, makes the minimum possible impact on HTTP resources while still remaining useful, and records appropriate metadata on the provenance of the information it gathers.
Mark Pilgrim (http://diveintomark.org/archives/2003/07/21/atom_aggregator_behavior_http_level) has a useful list of considerations for any HTTP user agent that will repeatedly gather and monitor HTTP resources. We consider the implementation of these recommendations to be a necessary part of any scutter.
At its simplest, URL harvesting from incoming RDF is a case of following the targets of <rdfs:seealso> triples. The RDF Schema (http://www.w3.org/TR/rdf-schema/#ch_seealso) specification states:
A triple of the form S rdfs:seeAlso O states that the resource O may provide additional information about S. It may be possible to retrieve representations of O from the Web, but this is not required. When such representations may be retrieved, no constraints are placed on the format of those representations.
Therefore we may attempt to retrieve O, and attempt to parse any resulting document with an RDF parser. If successful, the retrieved model may be pushed into the analysis pipeline.
If we choose to implement dynamic schema awareness in the scutter, this same process should be applied to any subClass or subProperty of rdfs:seeAlso.
The purpose of a scutter is to discover semantic web information. If the scutter is not designed as part of a specific application, it must be able to communicate with client applications, passing on information or answering queries. Distribution of information falls into two distinct areas: communication of found triples, and metadata about those triples.
There are several levels of useful information that a client application might require from a scutter service. The simplest is that of resource discovery; a feed of URLs from which parseable RDF resources may be downloaded. A client that is capable of triple storage but wishes to do the minimum possible work around managing and downloading resources would subscribe to a flow of triples (optionally with provenance information). These triples might have had one or more pre-processing stages applied: normalisation rules, application of processes such as 'smushing', or filtering to only include statements made about a particular list of URIs or namespaces, or using a particular set of predicates.
Statements using bNodes present a problem when two distributed systems share information. For example, if a scutter module is smushing URIs following OWL inverse functional properties, it may be discovered that two distinct bNodes referenced in two different RDF documents should be merged. The OWL property owl:sameAs is sufficient to communicate this fact, but without URIs to refer to the two bNodes, we cannot form such a statement. We address this problem by rewriting bNodes in models using a URI scheme http://hackdiary.com/ns/bNode#ABC where ABC is a unique internal identifier allocated to each bNode by our RDF library at parse-time.
Joseki has many useful features that make it a fine storage testbed for distributed semantic web crawling. It has a simple HTTP interface, meaning that it is agnostic as to whether its clients are local or remote. Being based on the Jena2 toolkit, it is possible to enable an OWL reasoner on a Joseki store and have it dynamically expand its rulebase from OWL rules inserted into the store. Alternatively, it can work as a plain store without inference. The parser is able to accept N-Triples as well as RDF/XML syntax.
This allows us to test several of the modules developed for our scutter. We can use Joseki as a storage mechanism, testing the minimal-state strategy of the core crawling system. We can pass on OWL rules discovered 'in the wild' in order to explore the potential pitfalls in doing so. We can load schemas for FOAF and other vocabularies that depend on IFPs for the identity schemes, and explore querying of the resulting inference models. Results of these queries can be compared to queries against models whose IFPs have been pre-processed by the scutter's smushing module.
To achieve our aim of building a semantic web crawling service that simplifies the job of the application writer, we must consider a number of scenarios of different levels of complexity. The requirements of each scenario will shape the interfaces we provide to our service, and the way in which we communication information to clients.
The Spread toolkit "provides a high-performance messaging service that is resilient to faults across external or internal networks". It provides a distributed group communication mechanism: after joining a named group, a client may send a packet of information, and will receive all packets sent to that group by other clients. Spread defines no structure or semantics on the information packets; interpretation is up to the clients.
XML document formats present problems when delivered over streams; the well-formedness requirements mean that a document cannot be accepted until its final closing tag is encountered. Because of this, we use the N-Triples (http://www.w3.org/2001/sw/RDFCore/ntriples/) format, an RDF syntax that can be parsed line-by-line, to deliver RDF models over Spread groups.
The simplest use of Spread is to send discovered triples (with any desired transformations or normalisations) to a fixed group. Clients may subscribe to this group to receive a steady stream of information. This concept can be built on to provide specialised streams - triples involving predicates or subjects from a given list or namespace, for example.
The key problem with this model of behaviour is that the client receives no indication of the provenance of the supplied data. This means that when information at a particular URL changes, there is no way for the scutter to indicate that it is sending replacement information, and no way for the client to identify which data it needs to replace in its own store. Similarly the client cannot apply external metrics of trustworthiness to data that it does not know the origin of. This model needs further work to either define a 'wrapper' format giving provenance information, or to identify a vocabulary for making provenance statements inline with the data itself. The latter approach is trickier due to the difficulties encountered with systems that use reification in RDF.
The W3C defines (http://www.w3.org/2001/sw/) the semantic web as "the representation of data on the World Wide Web", rightly making no restriction on what the type or use of that data might be. Many applications will be built to process information for a particular domain, but resource discovery is hard to do without monitoring the entire semantic web.
By providing a mechanism for agents to register interest in a particular namespace or set of predicates, a scutter should be able to support the building of information monitor systems. Using a push mechanism based on asynchronous messaging or HTTP POST, client applications would be able to work passively, waiting for notification from a scuttering service.
A large part of the potential value of semantic web information comes from applying knowledge about the vocabularies used to make statements. RDF Schema and/or OWL are commonly used to describe qualities such as the domain and range of RDF properties, or the inferences that may be drawn from a particular set of statements. Properties may be subclassed from other properties, or one property might be the inverse of another. RDF Schema specifies a property isDefinedBy, similar to seeAlso, allowing an RDF document to indicate where a schema may be downloaded to give definitions for terms used. We optionally follow these links in our scutter in order to dynamically discover ontologies.
Code is now available in various systems that implement a large part of the rules available in OWL. Models built using Jena2 may be augmented with OWL rules to form an 'inference model' where triples implied by the rules are asserted along with the explicit triples from the original model. Queries may then be performed on this inference model.
Using this feature, care needs to be taken to avoid malicious or accidental corruption of a model. Because the OWL rules may be dynamically added to, if an incoming model discovered by the scutter contains OWL statements, these will be added to the ruleset. Simply by asserting an of owl:sameAs statement, arbitrary RDF placed on the semantic web could cause unwanted merging of nodes. In order to prevent this, we provide a module for our scutter that filters OWL statements out of an incoming model before they can be passed on to a triple store.
Implementing the full specification of OWL is a demanding task, both in terms of development effort and processing power. For our scutter, we identified two key OWL primitives and provided simple implementations of them as modules that can be plugged into the scuttering chain.
The FOAF vocabulary makes use of inverse functional properties to allow references to be made to entities such as people across multiple RDF documents with having to agree on the tricky question of what URI represents a person By making foaf:mbox (email address) an owl:inverseFunctionalProperty, two foaf:Persons with the same email address are inferred to be the same person.
This is one of the design principles underlying FOAF (and for that matter the entire Semantic Web effort): a pragmatic, pluralistic approach to resource description and identification. Rather than building big, centralised registries of people (or companies, or physical things) we look for cheaper, more lightweight shared strategies for identification. (http://rdfweb.org/mt/foaflog/archives/000039.html)
This scheme means that regular RDF graph merging is not sufficient for traversing and answering queries about the FOAF world. Building algorithms into client applications to merge on IFPs is an unnecessary burden on application writers, so we use the following algorithm to smush a model with reference to models previously seen:
Examine the incoming model for statements using a predicate in the IFP list.
For each statement, check whether we have previously seen a statement that used the same predicate and object.
If not: the subject of this statement is canonical for this predicate/object combination. Record this fact.
If so: decide that the subject of this statement will be rewritten to the subject of the statement previously made canonical for this predicate/object combination. The subject's URI is owl:sameAs the canonical URI.
If in step 4 more than one candidate canonical URI was found, merge them by declaring that each candidate canonical URI is owl:sameAs to every other.
We are in the early days of the semantic web in terms of volume. Over the past year, volume of discovered triples by various scutter experiments has grown from around half a million triples to over 7 million, with many more known to exist in semi-public (robot-protected) databases such as the Aktors triplestore at the University of Southampton.
Estimates of the current number of HTML pages on the public internet (after around 12 years of public use of HTML) run to over a billion pages. One scutter run (http://www.hackdiary.com/misc/scuttered-urls-with-counts-20031210.txt) found 14,000 RDF documents, with an average of 65 triples per document. Can we build RDF stores that will scale to 65 billion triples?
Broadly, there is just a single application for HTML web crawling: keyword search. This has allowed the centralisation of crawling and search activity in services such as Google and Altavista, employing huge clusters of processing and storage machines. The semantic web has a much wider potential; one can imagine an enormous number of different domain-specific and general applications. Keys areas for future work must include: federated query systems, perhaps working in a peer-to-peer fashion; vertical-application stores that ignore any information not essential to their core operation, with pointers kept to further potentially useful information; minimal-state systems that work on streams of RDF data instead of requiring storage systems, passing on alerts and information to other systems.
![]() ![]() |
Design & Development by deepX Ltd. |