NLP works well statistically; the SW, in contrast, requires logic and doesn't yet make substantial use of statistics. Natural language is democratic, as expressed in the slogan 'meaning is use' (see Section 5.1 for more discussion of this). The equivalents in the SW of the words of natural language are logical terms, of which URIs are prominent. Thus we have an immediate disanalogy between NLP and the SW: URIs, unlike words, have owners, and so can be regulated. That is not to say that such regulation will ensure immunity from the meaning drift that linguists detect, but it may well provide sufficient stability over the short to medium term.
It is argued - though currently the arguments are filtering only slowly into the academic literature - that folksonomies are preferable to the use of controlled, centralised ontologies [e.g. 259]. Annotating Web pages using controlled vocabularies will improve the chances of one's page turning up on the 'right' Web searches, but on the other hand the large heterogeneous user base of the Web is unlikely to contain many people (or organisations) willing to adopt or maintain a complex ontology. Using an ontology involves buying into a particular way of carving up the world, and creating an ontology requires investment in methodologies and languages, whereas tagging is informal and quick. One's tags may be unhelpful or inaccurate, and no doubt there is an art to successful tagging, but one gets results (and feedback) as one learns; ontologies, on the other hand, require something of an investment of time and resources, with feedback coming more slowly. And, crucially, the tools to lower the barriers to entry to controlled vocabularies are emerging much more slowly than those being used to support social software.
Time-stamping is of interest because the temporal element of context is essential for understanding a text (to take an obvious example, when reading a paper on global geopolitics in 2006 it is essential to know whether it was written before or after 11th September, 2001). Furthermore, some information has a 'sell-by date': after a certain point it may become unreliable. Often this point cannot be predicted exactly, but broad indications can be given; naturally much depends on whether the information is being used in some mission-critical system and how tolerant of failure that system is. General temporal information about a resource can be given in XML tags in the normal way. However, the body of the resource, which we cannot assume to be structured, may also need temporal information, which users must locate manually. In such a case it is hard to identify the necessary temporal information in a body of unstructured text, and to determine whether a time stamp refers to its own section or to some other part of the resource. It may be that some ideas can be imported from the temporal organisation of more structured resources such as databases, as long as over-prescription is avoided. In any case, it is essential to know the time of creation and the assumptions about longevity underlying information quality; if the content of a resource 'is subject to change or withdrawal without notice, then its integrity may be compromised and its value as a cultural record severely diminished'.
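As a minimal sketch of how such general temporal information might be attached to a resource, the fragment below uses Python to emit XML carrying a creation date and a rough 'sell-by' interval. The choice of Dublin Core terms (dcterms:created, dcterms:valid), the element names and the dates are all illustrative assumptions, not a prescribed scheme.

```python
# A minimal sketch of attaching temporal metadata to a resource description.
# Assumptions: Dublin Core terms (dcterms:created, dcterms:valid) serve as the
# vocabulary; the element names and dates are illustrative only.
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dcterms", DCTERMS)

resource = ET.Element("resource", attrib={"href": "http://example.org/report"})
created = ET.SubElement(resource, f"{{{DCTERMS}}}created")
created.text = "2006-03-01"                   # time of creation
valid = ET.SubElement(resource, f"{{{DCTERMS}}}valid")
valid.text = "2006-03-01/2008-03-01"          # broad 'sell-by' indication

print(ET.tostring(resource, encoding="unicode"))
```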
Another key factor in assessing the trustworthiness of a document is the reliability or otherwise of the claims expressed within it; metadata about provenance will no doubt help in such judgements but need not necessarily resolve them. Representing confidence in reliability has always been difficult in epistemic logics. In the context of knowledge representation, approaches include: subjective logic, which represents an opinion as a real-valued triple (belief, disbelief, uncertainty) whose three components sum to 1 [159, 160]; grading based on qualitative judgements, although such qualitative grades can be given numerical interpretations and then reasoned about mathematically [110, 115]; fuzzy logic; and probability. Again we see the trade-off that the most expressive formalisms are probably the most difficult to use.
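The sketch below illustrates the subjective-logic triple just described, under the assumption (following Jøsang's formulation) that an opinion also carries a base rate used to weight its uncertainty when computing a probability expectation; the class itself is illustrative rather than any library's API.

```python
# A minimal sketch of the subjective-logic opinion triple described above.
# Assumptions: the base rate 'a' and the expectation formula b + a*u follow
# Josang's formulation; the class is illustrative, not a library API.
from dataclasses import dataclass

@dataclass
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5   # prior probability in the absence of evidence

    def __post_init__(self):
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def expectation(self) -> float:
        """Probability expectation: uncertainty is weighted by the base rate."""
        return self.belief + self.base_rate * self.uncertainty

# e.g. a claim with some supporting evidence but residual uncertainty
print(Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3).expectation())  # 0.75
```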
Nevertheless, the next generation Web should not be based on the false assumption that text is predominant and keyword-based search will be adequate for all reasonable purposes. Indeed, the issues relating to navigation through multimedia repositories such as video archives and through the Web are not unrelated: both need information links to support browsing, and both need search engines to supplement manual link traversal. However, the keyword approach may falter in the multimedia context because of the greater richness of many non-textual media. The Google image search approach, for example, relies on the text surrounding an image, which allows relatively fast search, and again in general the user is often able to make the final choice by sifting through the results presented (keyword-based image searches tend to produce many fewer hits, which may mean they miss many plausible possibilities). The presence of the human in the loop is hard to avoid at the moment: human intervention in the process of integrating vision language with other modalities is usually required, although there are a number of interesting techniques for using structures generated from texts associated with image collections to aid retrieval in restricted contexts.
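A minimal sketch of the surrounding-text idea is given below: images are indexed by the words of the text around them, and a keyword query returns the images whose surrounding text contains every query term. The corpus, the tokenisation and the conjunctive matching are illustrative assumptions only, not a description of how Google's system works.

```python
# A minimal sketch of keyword-based image retrieval via surrounding text,
# as described above. The corpus and tokenisation are illustrative only.
from collections import defaultdict

def build_index(images):
    """Map each keyword in an image's surrounding text to the image URL."""
    index = defaultdict(set)
    for url, surrounding_text in images:
        for word in surrounding_text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return images whose surrounding text contains every query keyword."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

index = build_index([
    ("http://example.org/eiffel.jpg", "The Eiffel Tower at night, Paris"),
    ("http://example.org/thames.jpg", "The Thames and Tower Bridge, London"),
])
print(search(index, "tower paris"))   # only the Eiffel Tower image matches
```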
Web topology contains more complexity than simple linear chains. In this section, we will discuss attempts to measure the global structure of the Web, and how individual webpages fit into that context. Are there interesting representations that define or suggest important properties? For example, might it be possible to map knowledge on the Web? Such a map might make it possible to understand online communities, or to engage in 'plume tracing' - following a meme, or idea, or rumour, or factoid, or theory, from germination to fruition, or vice versa, by tracing the way it appears in various pages and their links. Given such maps, one could imagine spotting problems such as Slashdot surges (the slowing down or closing of a website after a new and large population of users follow links to it from a popular website, as has happened from the site of the online magazine Slashdot) before they happen - or at least being able to intervene quickly enough to restore normal or acceptable service soon afterwards. Indeed, we might even discover whether the effects of Slashdot surges have declined thanks to the constant expansion of the Web, as has been argued recently.
Perhaps the best-known paradigm for studying the Web is graph theory. The Web can be seen as a graph whose nodes are pages and whose (directed) edges are links. Because very few weblinks are random, it is clear that the edges of the graph encode much structure that is seen by designers and authors of content as important. Strongly connected parts of the webgraph correspond to what are called cybercommunities, and early investigations, for example by Kumar et al., led to the discovery and mapping of hundreds of thousands of such communities. However, the identification of cybercommunities by knowledge mapping is still something of an art, and can be controversial - approaches often produce "communities" with unexpected or missing members, and different approaches often carve up the space differently.
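As a minimal sketch of what 'strongly connected parts of the webgraph' means in practice, the fragment below runs Kosaraju's two-pass depth-first algorithm over a toy link graph to find its strongly connected clusters. The toy graph is illustrative, and real cybercommunity identification (such as Kumar et al.'s trawling of the webgraph) involves considerably more than this.

```python
# A minimal sketch of finding strongly connected clusters in a toy webgraph
# (nodes are pages, directed edges are links). The graph is illustrative only.
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: two depth-first passes over the link graph."""
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                visit(nxt)
        order.append(node)                 # record finishing order

    for node in graph:
        if node not in seen:
            visit(node)

    reverse = defaultdict(list)            # reverse all links
    for node, links in graph.items():
        for nxt in links:
            reverse[nxt].append(node)

    components, assigned = [], set()
    for node in reversed(order):           # sweep reverse graph in that order
        if node in assigned:
            continue
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in assigned:
                continue
            assigned.add(n)
            component.add(n)
            stack.extend(reverse[n])
        components.append(component)
    return components

links = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
print(strongly_connected_components(links))  # two clusters: {a, b, c} and {d, e}
```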
The connectivity of the webgraph has been analysed in detail, using such structural indicators as how nodes are connected. Various macroscopic structures have been discerned and measured; for example, one crawl of in excess of 200 million pages discovered that 90% of the Web was actually connected, if links were taken as non-directional, and that 56 million of these pages were very strongly connected. The structure thus uncovered is often referred to as a bowtie shape, as shown in Figure 4.1. The 'knot' of the tie is a strongly connected cluster (SCC) of the webgraph in which there is a path between each pair of nodes. The SCC is flanked by two sets of clusters, those which link into the SCC but from which there is no link back (marked as IN in the figure), and those which are linked to from the SCC but do not link back (OUT). The relationship between the SCC, IN and OUT gives the bowtie shape. The implications of these topological discoveries still need to be understood. Although some have suggested alterations to the PageRank algorithm to take advantage of the underlying topology, there is still plenty of work to do to exploit the structure discerned.
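The decomposition just described can be sketched directly: given the pages in the central SCC, IN is whatever can reach the SCC without being reachable from it, and OUT the reverse. The fragment below, over an illustrative toy graph, computes these sets by forward and backward reachability; it is a sketch of the idea, not of how the crawl analysis was actually performed.

```python
# A minimal sketch of the bowtie decomposition described above. Given the
# central SCC, IN can reach the SCC but is not reached back; OUT is the reverse.
# The toy graph and node names are illustrative only.
from collections import deque

def reachable(graph, start_nodes):
    """Breadth-first set of all pages reachable by following links forward."""
    seen, queue = set(start_nodes), deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bowtie(graph, scc):
    nodes = set(graph) | {n for links in graph.values() for n in links}
    reverse = {n: [m for m in graph if n in graph[m]] for n in nodes}
    from_scc = reachable(graph, scc)      # SCC plus everything downstream
    to_scc = reachable(reverse, scc)      # SCC plus everything upstream
    return {"SCC": set(scc),
            "IN": to_scc - from_scc,
            "OUT": from_scc - to_scc,
            "OTHER": nodes - (from_scc | to_scc)}

links = {"in1": ["a"], "a": ["b"], "b": ["a", "out1"], "out1": [], "island": []}
print(bowtie(links, scc={"a", "b"}))
# IN = {'in1'}, OUT = {'out1'}, OTHER = {'island'}
```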
Fig. 4.1: The bowtie shape of the Web and its fractal nature.

Indeed, the bowtie structure is prevalent at a variety of scales. Dill et al. have discovered that smaller subsets of the Web also have a bowtie shape, a hint that the Web has interesting fractal properties - i.e. that each thematically-unified region displays (many of) the same characteristics as the Web at large. The Web is sufficiently sparsely connected to mean that the subgraph induced by a random set of nodes will be almost empty, but if we look for non-random clusters (thematically-unified clusters or TUCs) which are much more connected, then we see the bowtie shape appearing again. Each TUC will have its own SCC, and its own IN and OUT flank, contained within the wider SCC. The larger-scale SCC, because it is strongly connected, can then act as a navigational backbone between TUCs.
In this way the fractal nature of the Web gives us an indication of how well it is managing the compromise between stability and diversity; a reasonably constant number of connections at various levels of scale means more effective communication. Too many connections produce a high overhead for communication, while too few mean that essential communications may fail to happen. The assumption that levels of connectivity are reasonably constant at each level of scale is of importance for planning long-range and short-range bandwidth capacity, for example. The Web develops as a result of a number of essentially independent stochastic processes that evolve at various scales, which is why structural properties remain constant as we change scale. If we assume that the Web has this sort of fractal property, then for designing efficient algorithms for data services on the Web at various scales it is sufficient to understand the structure that emerges from one simple stochastic process.
IR is the focus of an arms race: algorithms to extract information from repositories must keep improving as those repositories get larger and more complex, and as users' demands get harder to satisfy (whether in terms of response time or complexity of query).
One obvious issue with respect to IR over the Web is that the Web has no quality assurance (QA) authority. Anyone with an ISP account can place a page on the Web, and as is well known, the Web has been the site of a proliferation of conspiracy theories, urban legends, trivia and fantasy, as well as suffering from all the symptoms of unmanaged information such as out-of-date pages and duplicates, all the difficulties pertaining to multimedia representations, and all the indeterminacies introduced by the lack of strictly constrained knowledge representation. Understanding exactly what information is available on a page waiting to be retrieved remains a serious problem.
Perhaps more to the point, traditional IR has been used in benign environments where a mass of data was mined for nuggets of sense; the typical problems were complexity and lack of pattern. Benchmark collections of documents for IR researchers tend to be high-quality and almost never intentionally misleading, such as collections of scientific papers in particular journals. Other Web-like mini-structures that can be used, such as intranets, are also characterised by the good faith with which information is presented. But malicious attempts to subvert the very IR systems that support the Web so well are increasingly common. Web-based IR has to cope not only with the scale and complexity of the information, but also with potential attempts to skew its results with content intended to mislead.
One view is reminiscent of the philosophical idea of supervenience [168, 169]. One discourse or set of expressions A supervenes on another set B when a change in A entails a change in B but not vice versa. So, on a supervenience theory of the mind/brain, any change in mental state entails some change in brain state, but a change in brain state need not necessarily result in a change in mental state. Supervenience is a weaker concept than reduction (a reductionist theory of the mind/brain would mean one could deduce mental state from brain state, that psychology follows from neuroscience). And it has been thought over the years that supervenience is a good way of explaining the generation of meaning: uninterpreted material in the lower layers of discourse is organised in significant ways so that the material in the upper layers is constrained to be meaningful. It may be appropriate to think of the Web as having this sort of supervenience layering: the meaningful constructs at the top depending crucially on meaningless constructs in HTML or XML or whatever below.
If we are to see the higher levels of the Web as supervenient on the lower, then the question arises as to what the foundational levels of the Web are, and the further question of whether they have to take some particular form or other.
There are many different types of reasoning, but not many have been successfully automated beyond deductive linear reasoning and various statistical methods. What alternative methods has the Web facilitated? One obvious candidate is associative reasoning, where reasoning on the basis of associations - which can be extremely unpredictable and personalised - takes one down a train of thought. The classic case of associative reasoning is given in Proust's novel Remembrance of Things Past, where the middle-aged narrator, upon eating a madeleine dipped in tea, finds himself transported to his childhood in Combray, when his Aunt Léonie would give him a madeleine on Sunday mornings. On the Web, the potential of associative reasoning is immense, given the vast number of associative hyperlinks, and the small world properties of the Web. Google-like searches, valuable though they undoubtedly are, cannot be the whole story in a world of small pervasive devices, software agents and distributed systems.
However, associative reasoning via hyperlinks, though an attractive and important method, is not the only way to go about it. This type of reasoning is not strictly associative reasoning proper, as the associations are those of the author, the person who puts the hyperlinks into the document. In Proust's scene, this would be like Marcel taking a bite of his madeleine and suddenly and unexpectedly finding himself experiencing the baker's memories. Open hyperlinking allows the reader to place link structures over existing Web pages, using such information as metadata about the page in question, relevant ontologies and user models. Associativity is clearly one of the major driving forces of the Web as a store of knowledge and a source of information. Associative reasoning, for example, has been used for collaborative filtering in recommender systems.
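To make the collaborative-filtering connection concrete, here is a minimal sketch in which items become associated because the same users have visited them, and recommendations follow the strongest associations from what a user has already seen. The data, the co-occurrence weighting and the function names are illustrative assumptions, not a description of any particular recommender.

```python
# A minimal sketch of association-based collaborative filtering, as mentioned
# above: items co-visited by the same users become associated, and the
# strongest associations from already-seen items are recommended.
from collections import Counter, defaultdict

def co_occurrence(histories):
    """Count how often pairs of items appear together in users' histories."""
    assoc = defaultdict(Counter)
    for items in histories.values():
        for a in items:
            for b in items:
                if a != b:
                    assoc[a][b] += 1
    return assoc

def recommend(assoc, seen, top_n=3):
    """Follow associations from items already seen to items not yet seen."""
    scores = Counter()
    for item in seen:
        scores.update(assoc.get(item, Counter()))
    for item in seen:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(top_n)]

histories = {"u1": {"a", "b", "c"}, "u2": {"a", "c"}, "u3": {"b", "d"}}
print(recommend(co_occurrence(histories), seen={"a"}))  # ['c', 'b']
```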
Another type is analogical reasoning, a further highly uncertain form of reasoning that humans are nevertheless remarkably successful at using. Reasoning by analogy works by spotting similar characteristics between two subjects, and then assuming that those subjects have more characteristics in common - specifically that if subject A has property P, then by analogy so does subject B. Obviously the success of analogical reasoning depends on having representations of the two subjects which make it possible to spot the analogies, and on being suitably cautious (yet creative) in actually reasoning. Case-based reasoning (CBR) is a well-explored type of analogical reasoning.
Analogical reasoning can be made to work in interesting contexts, and reasoning engines exist. Sketches of an approach using analogical reasoning to generate metadata about resources have appeared recently, and case-based explanations can be useful in domains where causal models are weak. In a domain described by multiple ontologies, analogical reasoning techniques may well be useful as the reasoning moves from one set of ontological descriptions to another, although equally the change of viewpoint may also complicate matters. There have been interesting attempts to support analogical reasoning (i.e. CBR) across such complex decentralised knowledge structures, and also extensions to XML to express case-based knowledge.
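A minimal sketch of the analogical step described above: retrieve the stored case that shares the most features with a new subject and carry its property across. The similarity measure (a simple shared-feature count) and the cases themselves are illustrative assumptions; a serious CBR system would adapt as well as retrieve.

```python
# A minimal sketch of the analogical inference described above: find the stored
# case most similar to a new subject and assume its property carries over.
# The similarity measure and the cases are illustrative only.
def most_similar(cases, subject_features):
    """Return the stored case sharing the most features with the new subject."""
    return max(cases, key=lambda case: len(case["features"] & subject_features))

cases = [
    {"name": "dolphin", "features": {"breathes air", "warm-blooded", "aquatic"},
     "property": "mammal"},
    {"name": "trout", "features": {"has gills", "cold-blooded", "aquatic"},
     "property": "fish"},
]

new_subject = {"breathes air", "warm-blooded", "lives on land"}
analogue = most_similar(cases, new_subject)
# Analogical inference: assume the new subject shares the analogue's property.
print(f"by analogy with the {analogue['name']}: probably a {analogue['property']}")
```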