

UBC Theses and Dissertations


Biomimetic information retrieval with spreading-activation networks Huggett, Michael William Peter 2007


Full Text

Biomimetic Information Retrieval With Spreading-Activation Networks

by

Michael William Peter Huggett

B.Sc., The University of Toronto, 1999
M.Sc., The University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia

October 2007

© Michael William Peter Huggett 2007

Abstract

Information management systems act as a prosthetic scaffold for human memory. They retain and organize information objects to be conveniently recalled in support of knowledge-based tasks. We note a striking similarity between the functions of human memory and the processes in computational information retrieval. For this reason, we ask whether it is viable to purposely design information management systems biomimetically, i.e., in a manner inspired by biological systems. Based on a comparison of cognitive models of human memory and computational information retrieval algorithms, we propose the Principles of Mnemonic Associative Knowledge (P-MAK) to describe the necessary components of biomimetic systems: the constraints of computing machines, the properties of human memory, how semantic knowledge representations are constructed, and the contexts in which information is usefully retrieved. The goal of P-MAK is to describe systems that are simple, inspectable, comprehensible, and easy to use. Since human memory as described by cognitive network models is analogous to a large associative hypertext repository, P-MAK's principles suggest that networks would be an appropriate representation format. Therefore, we build a semantic similarity network from a document corpus using information retrieval (IR) algorithms, and describe how these processes are comparable to the functions of human semantic memory.
To approximate an optimal link distribution, we introduce a novel link-pruning technique to tune the network to a small-world topology. We show in a user study that a semantic network based on cognitive models can improve user access to information. The ability to recall information in appropriate contexts is also a useful property of human memory. Based on models of human episodic memory, we propose a real-time, incremental temporal index that captures some of the regularity of human information behaviour. Temporal patterns are represented using a novel cue-event-object (CEO) model, in which observed events are related to a collection of cues. The cues describe time, place, or sensory qualities and are analogous to cognitive schemas. Cues are combined to represent an event, analogous to cognitive convergence zones. The model connects related cues, events, and objects together to encode the relations present in observed occurrences. The CEO model simulates cognitive reinforcement learning to build patterns of user information behaviour. If an object is used consistently at a given time, the links connecting cues, event, and object all grow stronger; otherwise, they decay and are "forgotten". The resulting network structure can function as a recommender system by using spreading activation to retrieve objects at times and under circumstances where they have previously proven themselves useful. The model also allows users to pose queries such as when an event typically occurs, or what items are used at particular times. In a user-log experiment, we show that the CEO model quickly learns to make correct predictions of user behaviour, and increases in accuracy the more data that it is given.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Co-Authorship Statement

1 Introduction
  1.1 Motivation
    1.1.1 Goals of Biomimetic IR
    1.1.2 Context of this Thesis
  1.2 Existing Models of Semantics
    1.2.1 Human Memory as Information Processing
    1.2.2 Information Retrieval Systems
  1.3 Existing Models of Context
    1.3.1 Cognitive Notions of Context
    1.3.2 Context for IR
  1.4 Our Approach: Network Models of Semantics and Context
    1.4.1 The Semantic Network
    1.4.2 The Context Network
    1.4.3 Contributions
    1.4.4 Structure of the Thesis
  Bibliography

2 Cognitive Principles for Information Management (P-MAK)
  2.1 Introduction
  2.2 The Challenge of Human Memory
    2.2.1 Failures of Memory
    2.2.2 Computational Models of Memory
    2.2.3 Memory Prosthesis
  2.3 Human-Centred Information Management
    2.3.1 The Basic Operations of Information Management
    2.3.2 Mind-Machine Symbiosis
    2.3.3 Introduction to Principles: P-MAK
  2.4 The Fundamental Principles
    2.4.1 Mechanistic Principles: Making Machines Effective
    2.4.2 Anthropic Principles: Making Knowledge Comprehensible
  2.5 The Organizational Principles
    2.5.1 Epistemic Principles: Building Knowledge
    2.5.2 Situational Principles: Capturing Context
  2.6 Associative Network Representation
    2.6.1 The Advantage of Networks
    2.6.2 Basic Network Elements
    2.6.3 Networks for Similarity, Usage, and Situations
  2.7 Conclusion and Future Work
  Bibliography

3 Testing the P-MAK Semantic Network
  3.1 Introduction
  3.2 Related Work
    3.2.1 Query by Reformulation
    3.2.2 Similarity Hypertext
    3.2.3 Information-Searching Behaviour
    3.2.4 Network Topology
  3.3 Experimental Design
    3.3.1 User Interface
    3.3.2 Corpora
    3.3.3 Similarity Network
    3.3.4 Design
    3.3.5 Subjects
    3.3.6 Apparatus
    3.3.7 Procedure
    3.3.8 Task Design
    3.3.9 Measures
  3.4 Results and Analysis
    3.4.1 Performance Comparison of Interfaces
    3.4.2 Task Analysis
    3.4.3 Navigation Behaviour
    3.4.4 Recall Over Time
  3.5 Discussion
    3.5.1 Exceptions of Interest
    3.5.2 Future Work
  3.6 Conclusions
  Bibliography

4 Testing the P-MAK Context Network
  4.1 Introduction
  4.2 Related Work
    4.2.1 Temporal Indexing
    4.2.2 Contextual IR
  4.3 The CEO Model
    4.3.1 Cues, Events, and Objects
    4.3.2 Temporal Subsumption Graph (TSG)
  4.4 Temporal Patterns
    4.4.1 Encoding the Temporal Patterns of Events
    4.4.2 Temporal Aggregation
    4.4.3 Temporal Disaggregation
    4.4.4 Retrieval
  4.5 Experiment: Real-World Trip Data
    4.5.1 Motivation and Goals
    4.5.2 Set-Up
    4.5.3 Results
  4.6 Extending the Model to Personal Information Management
    4.6.1 Searches and Queries
    4.6.2 Reminders
  4.7 Conclusion
  Bibliography

5 Conclusion
  5.1 Outcomes: The Semantic Network
  5.2 Outcomes: The Context Network
  5.3 Future Work
    5.3.1 Semantic Network Building
    5.3.2 Context
    5.3.3 Abstraction
    5.3.4 User Interaction
  5.4 Biomimetic Information Retrieval
  Bibliography

I Appendices

A System Design
  A.1 The Semantic Network
    A.1.1 Pre-Processing
    A.1.2 Semantic Indexing
    A.1.3 Semantic Linking
    A.1.4 Small-World Link Pruning
    A.1.5 Semantic Retrieval
    A.1.6 Adding Individual Documents
  A.2 The Context Network
    A.2.1 Data Types
    A.2.2 Predicting Destinations
    A.2.3 Contextual Encoding
    A.2.4 Contextual Retrieval
    A.2.5 Temporal Aggregation
  Bibliography

B Data for User Study
  B.1 User Tasks
    B.1.1 NYT Task Descriptions
    B.1.2 Reuters Task Descriptions
  B.2 User Questionnaires
    B.2.1 Pre-Questionnaire
    B.2.2 Post-Questionnaire
  B.3 Order of Tasks
  B.4 Table of Results

C Data for Context Experiment

D Ethics Approval for User Study

List of Tables

2.1 The P-MAK framework
3.1 Analysis of tasks by ANOVA
B.1 Order of task completion
B.2 Result data for the user study
C.1 Data for the context experiment

List of Figures

1.1 Global models of memory
1.2 Network theories of memory
1.3 Neural networks
1.4 Semantic networks
1.5 Network-IR models
1.6 Network episodic memory
1.7 The scope of information management
2.1 The information-mapping process of human memory
2.2 A simple similarity network
2.3 The Cue-Event-Object (CEO) model as a network
2.4 Triggering actions in the environment
3.1 The graphical user interface
3.2 Task dimensions of the user study
3.3 Plots of average user score per task
3.4 Recall over time for Search vs. Browse interfaces
4.1 The CEO Model
4.2 Temporal Subsumption Graph (TSG)
4.3 Temporal pattern before aggregation
4.4 Temporal pattern after aggregation
4.5 Determining support for aggregation
4.6 Disaggregation of a temporal pattern
4.7 Route prediction map
4.8 Increase in predictive accuracy
4.9 Accuracy versus routes travelled
4.10 Accuracy versus termini
4.11 Increase in routes and termini
5.1 Biomimetic information retrieval in context
A.1 Degree distribution with increased θ
A.2 Relationship between context components
A.3 The aggregation mapping Θ
A.4 The event mapping Λ

Acknowledgements

I wish to thank —

My committee—Edie Rasmussen, Richard Rosenberg, David Poole, and Tamara Munzner—for calling me back from the wilderness.
Ian McKellen (Anthropology) and Ori Simchen (Philosophy), for their interesting extra-departmental interpretations and applications of network-based ontologies. Ron Fussell for preparing and maintaining the experimental machines for our study. Heidi Lam and Karen Parker for their helpful insights as pilot-study guinea pigs. Anne Condon for consulting enthusiastically on set theory notation. John Lloyd for his restorative martinis.

In particular I wish to thank my parents, without whom I could not have taken education for granted.

Co-Authorship Statement

Other than indicated below, all elements of this thesis were prepared solely by the author.

Cognitive Principles for Information Management (Chapter 2) was written in discussion with co-authors Holger Hoos and Ron Rensink, in their capacity as graduate advisors.

Static Reformulation (Chapter 3) was co-authored with Joel Lanir. As first author, I decided on the research question and took the lead in manuscript preparation, particularly the pre-print revisions; designed the user interface and the experimental code base, and did most of the programming; and compiled the corpora that we used as experimental data. My co-author and I worked closely to compare readings, refine and revise our goals, devise task questions and user questionnaires, gather data, meet with experts in statistics and user studies, and write the analysis and discussion; my co-author took the lead on data analysis, and wrote the initial draft of that section. Holger Hoos and Ron Rensink contributed to discussions about experimental design; Avinoam Borowsky, Brian Fisher, Barry Po, and Joanna McGrenere contributed to discussions on user studies and statistical analysis; and Frank Shipman suggested readings, helped form the hypothesis, and shepherded the paper to publication.
Chapter 1

Introduction

Decades before personal computers became a viable commodity, visionaries such as Vannevar Bush were already projecting them as everyday cognitive tools: "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to ... memory" (Bush, 1945). Since then, the memex has become a touchstone, a holy grail representing the perfect system for information retrieval (IR) and personal information management (PIM). As an appealing vision of easy collaboration between humans and machines, the memex has become an ongoing, often-cited inspiration for important and influential research in the information sciences. One of the attractions of memex is that it is biomimetic—that is, it imitates a natural biological process—by combining concepts¹ and information objects associatively, as does human memory.

The goals of memex articulated by Bush have not yet been achieved, and the memex vision for information access is still a long way off, awaiting breakthroughs in the robustness and affordability of natural language processing, cognitive modeling, portability, and interfaces. The challenge is made more acute by the growing collections of personal information generated by digital consumer products, in the form of emails, news articles, product releases, journals, photographs, music, videos, etc. Even 20 years ago, prominent researchers were commenting that "the amount of available information is increasing rapidly and offering accurate and speedy access to this information is becoming ever more difficult. This ... is still valid nowadays if you consider the amount of information offered on the Internet" (Berger et al., 2004). The influence of memex has been profound.
Despite the emphasis of IR research on algorithmic benchmarks, the final report generated by the Challenges in Information Retrieval and Language Modeling workshop, attended by many prominent researchers in the area, identified the singular importance of "global information access" that would "satisfy human needs through natural effective interaction" (Allan and Croft, 2003). The importance of effective, human-friendly information access is also appreciated beyond the field of IR. In his acceptance speech for the Turing Award, the highest award in computing, database researcher Jim Gray spoke of a personal and world memex as long-term goals of computer science (CS), arguing that the long-term goals of information retrieval research are important for all of CS, and proposing greater collaboration between the two fields (Gray, 1999).

This thesis considers the issue of creating an "enlarged intimate supplement to human memory". In the remainder of this chapter, we look at the motivations for our approach, then compare the background literature for minds and machines, first on the subject of semantic encoding, then with respect to contextual models. Finally, we introduce our cognitively-inspired network model for semantics and context.

¹Throughout this thesis we define concept as "an idea or mental picture of a group or class of objects formed by combining all their aspects" (Canadian Oxford Dictionary, 1998)—a definition consistent with the attribute-based framework that we develop.

1.1 Motivation

We begin with three broad motivating claims:

• Information retrieval is a memory prosthetic. Human memory is limited, and must be extended artificially in order for civilization to operate with any degree of sophistication. Libraries are one example of extended human memory, if one includes the built-in search processes of the Dewey-Decimal code and helpful librarians.
However, IR systems are particularly worthy of the title memory prosthetic, as their operation is fast and often includes personal facets of our lives. We contrast memory prosthesis, a backwards-looking process of retrieving past experience, with sensory enhancement, the forward-looking process of finding new information, much as we would take binoculars with us on a hike to see clearly beyond our immediate area.

• Human memory is associative. This phenomenon has been noted almost since the beginning of written language. Given any meaningful cue ("hamburger"), listeners instantly retrieve related attributes ("two all-beef patties, special sauce, lettuce, cheese, pickles, onions, on a sesame-seed bun"), and other commonly co-occurring objects ("fries"; "antacid").

• Human memory is contextually dependent. What we remember depends on factors such as where we are, what we see, and who we're with. Certain locations (home, versus school) and activities (picnics, examinations) imply their own mental schemas that focus memory on concepts that are contextually relevant.

We combine these three claims to form our mission statement: Information retrieval in aid of human memory should be associative and contextually dependent.

We believe that there are advantages to a system built on this basis. A system that, in the memex sense, works as memory does may provide a natural scaffolding to a user's memory processes. It will display important information that the user certainly remembers, but also show peripheral information relevant to the current situation, barely forgotten on the edge of consciousness. An automatic, dynamic system that can believably mimic human memory function in this way may both build the user's confidence in it and extend the limits of (conscious) working memory.
While these claims describe what we expect, a full exposition of this position is beyond the scope of a single thesis: rather, it is appropriate to an extended multidisciplinary research program. The goals of this thesis are more modest. First we sketch out the boundaries of the problem space with a set of principles that are closely based on experimentally-verified models of human memory, and on developments in computational information retrieval. The principles compare the goals and processes of computational information retrieval to the functional processes of human memory, and show two essential approaches to organizing information objects: by their internal semantic characteristics (i.e. by what they say) and by their external contextual characteristics (i.e. by when and where they are used). The principles thus describe a philosophical basis for user-centered information retrieval.

The models of semantics and context that we derive both use a network representation. We do not claim that networks are inherently better than other forms of representation, but rather that networks are most appropriate to our set of principles.² Another reason why we use networks is that despite their dominance in describing cognitive memory models, there is little work on network representation in information retrieval. This thesis suggests that there are advantages to a networked approach that have yet to be exploited.

We build two systems directly based on our set of principles. The first system models semantic relations between information objects (text documents, in this case); the second models the context (i.e. time and place) in which information objects are used. We test both of these systems and show them to work with real-world data.

1.1.1 Goals of Biomimetic IR

Biomimicry is a synonym of bionics, defined as "the study of mechanical systems that function like living organisms or parts of living organisms" (Canadian Oxford Dictionary, 1998).
Human memory is a very successful implementation of information retrieval, and while we cannot mimic the brain's parallel physical architecture in detail without considerable cost, we can at least mimic mid-level functions between the fine-grained cellular level and the high-level behavioural level. To provide a personal memory prosthesis inspired by human memory, our approach should be user-centered, and our principles therefore specify system properties that are —

• clear:
  simple — parsimonious and easy to understand for both users and system administrators
  inspectable — providing users with representations of its processes and data

• friendly:
  automatic — providing services to the user unobtrusively, without requiring much user feedback
  tuneable — permitting users to adjust the system's parameters
  adaptive — staying current and up-to-date

• useful:
  scalable — able to handle a data collection that grows to arbitrary size
  extensible — able to incorporate new types of data and meta-data
  efficient — able to process queries quickly
  effective — able to significantly improve a user's experience of memory expansion

The general idea is that if we want a system to scaffold and support human memory, then it is reasonable to use human memory as a model. Human memory has been well-studied for decades, and many of its properties are well-understood. Taking advantage of the way that memory works seems to be a reasonable basis for building retrieval systems that are better at meeting human needs (Foltz, 1991; Henninger, 1995).

²The principles of associationism and navigation in particular are well-suited for network representation.

As for whether machines will one day think as humans:

Thinking is a complex procedure, which is necessary in order to deal with a complex world. Machines that are able to help us handle this complexity can be regarded as intelligent tools that support our thinking capabilities.
The need for such intelligent tools will grow as new forms of complexity evolve, for example those of our increasingly globally networked information society. In order for machines to be capable of intelligently predigesting information, they should perform in a way similar to the way humans think. First, this is a necessary prerequisite in order that people can communicate naturally with these machines. Second, humans have developed sophisticated methods for dealing with the world's complexity, and it is worthwhile to adapt to some of them. Such "natural thinking machines" (NTM) with their additional conventional mathematical skills can leverage our intellectual capabilities (Binnig et al., 2002).

Such statements posit information retrieval as an extension of the human conceptual system. However, the physical architectures of mind and machine are categorically different: minds are chemical, parallel, massive, and slow, while machines are electronic, serial, small, and fast. Modeling highly complex parallel human brains on a serial machine is not tractable. For this reason, in this thesis we focus on the mid-level functions of human memory, which represent a pragmatic distillation of the processes involved. We do not show that a biomimetic approach is superior to other approaches, but rather that it is both simple to implement and viable.

1.1.2 Context of this Thesis

This thesis is interdisciplinary in nature, both human-centered and computational, and is strongly influenced by three areas: cognitive science, computer science, and information retrieval. Cognitive science is defined as the study of mental function, including development, attention, language, perception, planning, and memory. Within these topics, this thesis draws primarily upon functional symbolic memory models, which represent information objects as nodes in a network.
Computer science covers an increasingly broad area, but might be most simply defined as the study of computability, its applications, and its effects on individuals, organizations, and society. There are many topics within computer science that are relevant to our interests, but which conflict with the above-stated biomimetic goals. For example, knowledge representation provides relevant research in semantic networks, which as hand-built entities are neither simple nor automatic; user modeling examines user context and preferences, but its methods are ethnographic and focus on behaviour rather than cognitive function; database research appears most relevant to human memory, but it is mostly process-oriented, focusing on data structures (e.g., XML) and offline data mining. The role of computer science in this thesis, therefore, is a supportive one, to provide processing tools and systems engineering that will enable us to show the value of our approach in information retrieval.

The purpose of information retrieval is to satisfy the information needs of human endeavour. This is no simple goal, as the demands of modern society have grown critically complex since the days when memex was proposed, particularly the burdens of demographics (in terms of education and caring for an aging society) that seem best managed with technologies that leverage our ability to think. In the next two sections, we look at the similarities between human and machine models of semantic and contextual encoding.

1.2 Existing Models of Semantics

Semantics is defined as "the interpretation or meaning of a sentence, word etc." (Canadian Oxford Dictionary, 1998). Cognitive science and information retrieval have similar models of semantic encoding, using both "global" vector models, and "local" network models; in this section we compare these paradigms.
1.2.1 Human Memory as Information Processing

The workings of the human mind have often been described in terms of the technological idiom of the day: by pneumatics and hydraulics, as clockwork, and most recently as computation. These descriptions have sought to explain the associationist tendencies of human memory, which Aristotle described by the principles of similarity, as well as spatial or temporal contiguity. Associationism was first rigorously defined by Hume, and remains current in the modern age, as it describes the mental basis for acquiring, structuring, and exploring knowledge. Arguably, there are four defining principles shared by all associationists (Anderson and Bower, 1973):

• concepts are associated in memory through experience
• all concepts can be reduced to a set of atomic concepts
• atomic concepts are grounded in perception and sensation
• the properties of complex ideas can be summed from the properties of their component atomic concepts

Cognitive science has provided strong evidence for the associativity of human memory. Semantic priming has been observed where subjects judged word pairs to be real words more quickly if they were semantically related (Meyer and Schvaneveldt, 1971), and words seem to be arranged in memory according to their semantic relatedness (Blank and Foss, 1978). Associationism is appropriate to implementation on a machine, essentially as a stimulus-response model in the mechanical tradition of Watson (1919), or in the functional tradition of B.F. Skinner (1957). Associationism can therefore accommodate Hebbian learning behaviours such as paired experience and the principle of reinforcement (Hebb, 1949).

          concept 1   concept 2   concept 3   concept 4
  hot         0           1           0           0
  smooth      1           0           0           0
  blue        0           0           1           0
  sweet       0           1           0           0
  red         0           1           0           0
  bright      0           0           0           1
  furry       0           0           0           0

Figure 1.1: An example of a global memory model.
Each concept in memory is represented as a vector of attributes; Concept 2 could be a partial description of cherry pie. Concepts are considered more similar the more that they share attributes.

With its emphasis on atomic concepts, associationism describes discrete objects, each defined by discrete attributes. This model would appear to be at odds with the neuronal, parallel architecture of the brain, if not for a recent interesting discovery. Cell assemblies are collections of synchronized neurons that represent individual atomic concepts, as well as sets of atomic concepts (much like the binding layers discussed below) (Huyck, 2004). As such, cell assemblies bridge connectionist neural networks with symbolic representations: each symbol is defined by a neural network, and grounded in the mind's perception of the external world.

In applying functional cognitive principles to information management, we focus on functional, algorithmically described computational models of memory. There are two main types of memory model, the global memory models that make simultaneous use of all of memory's resources, and the network models that simplify memory modeling by treating encoding and retrieval as limited, local phenomena.

Global Memory Models

Global theories of human memory are meant to reflect the human brain's parallel computational nature (e.g., Hintzman, 1984; Lund and Burgess, 1996). Each concept in memory is cast as a collection of attributes represented as a vector (Figure 1.1). Parallelism is the essential feature of global theories: cues are matched simultaneously against all concepts and memories. Retrieval is effected in two ways. First, a compound cue of attributes is consciously assembled in working memory (i.e., as an executive process in the "conscious" part of the brain) to act as a query, and is matched against all memories that contain these attributes. The best match is then retrieved into working memory. The second process is similar, but more exhaustive: once a memory is retrieved, it can act in turn as a query. This iterative process of associative retrieval quickly chains together a series of similar concepts.

There are several problems with global theories. First, they are too broad, not specific enough to adequately account for and predict cognitive phenomena. Second, they are static and do not readily account for dynamic change and learning over time. Third, since each concept will use only a small proportion of the available attributes, the resulting concept vectors will be mostly empty, representing inefficient resource usage. Most important, they are impractical if re-tasked directly as information retrieval models: simultaneously matching a large set of concepts would be punishingly expensive on a standard serial machine, while parallel machines are uncommon and quantum computing is not yet viable.

Figure 1.2: Network theories of memory. In the Collins & Loftus model (1975), concepts are defined by internal attributes, and are linked more or less closely to the degree that those attributes are shared. In the Anderson model (1983), concept units are linked through attribute elements, and activating a particular set of elements will retrieve the most strongly connected units past the threshold line into working memory.

Network Memory Models

Network theories of human memory are essentially a refinement of the global models. They address the shortcomings of expensive computation, lack of dynamism, etc. by focusing on the functions of human memory rather than its parallel computational architecture. In network models, concepts are represented as nodes instead of vectors, but the concepts are still defined by discrete attributes, and relations between concepts are still determined by attribute matching.
The main insight of network theories is that once attributes have been matched, sufficiently strong similarities are indicated by inserting a link between the nodes, and the link is weighted by assigning it the numerical result of the matching function. Thus network models can be viewed as long-term means of storing inter-nodal valuations.

Network memory models may be assembled in two ways, with attributes either external or internal to the concept nodes (Figure 1.2). In the case of external attributes, attributes are represented as nodes and are connected to all concepts that they describe. With internal attributes, each concept node contains a list of attributes that describe it, and concepts are directly linked the more that they share attributes; the attribute list may be ranked to indicate the most descriptive attributes for a particular concept. In both cases, the links are typically weighted to indicate the strength of relation, e.g., the attribute feathered would be more strongly linked to the concept bird than the rarer but also valid attribute swims.

Retrieval in network models is performed by spreading activation (SA; also called spread of activation, SoA). Activation can be described as a flow of energy from a given query-cue to a retrieved memory. In the external case, activation spreads from a set of attributes that describe a query to all nodes that are described by those attributes. Concepts sum all the activation that they receive as the result of a query: the more attributes that pass activation to a concept, the more likely that the concept will be retrieved. SoA is usually constrained in some way to ensure that propagation will not continue unchecked. Most common is the decay constraint that "taxes" activation at each propagation step; propagation ceases when activation falls below a threshold.
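A minimal sketch of decay-constrained spreading activation, also dividing each node's outgoing activation among its links (the fan effect). The toy graph, decay rate, and threshold are illustrative assumptions, not the thesis's actual network or parameters.

```python
# Hedged sketch of constrained spreading activation over a weighted graph.
# GRAPH, decay, and threshold are hypothetical illustrative values.

GRAPH = {  # node -> {neighbour: link weight}
    "query": {"a": 1.0, "b": 0.5},
    "a": {"c": 1.0},
    "b": {"c": 1.0, "d": 1.0},
    "c": {},
    "d": {},
}

def spread(source, energy=1.0, decay=0.5, threshold=0.05):
    """Propagate activation outward from a query-cue node. The decay
    constraint "taxes" activation at each step; dividing by the number of
    outgoing links models the fan effect; propagation along a branch stops
    once its activation falls below the threshold."""
    activation = {source: energy}
    frontier = [(source, energy)]
    while frontier:
        node, act = frontier.pop()
        links = GRAPH.get(node, {})
        if not links:
            continue
        passed = act * decay / len(links)  # decay "tax" plus fan-effect split
        for neighbour, weight in links.items():
            gain = passed * weight
            if gain < threshold:
                continue  # activation has decayed away; prune this branch
            # nodes sum all the activation they receive
            activation[neighbour] = activation.get(neighbour, 0.0) + gain
            frontier.append((neighbour, gain))
    return activation

print(spread("query"))
# {'query': 1.0, 'a': 0.25, 'b': 0.125, 'c': 0.125}
```

Ranking the non-source nodes by summed activation then yields the retrieved concepts; a distance constraint could be added by also tracking hop counts.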
Decay is often assisted by the fan effect (Anderson, 1983), which divides the activation propagated from a node among its outgoing links; the greater the average number of links per node, the less far activation is likely to spread through the network. Other constraints include the distance constraint that ends propagation after a given number of iterations (de Groot, 1983), and the structural constraint that specifies which link types may be used for propagation (Cohen and Kjeldsen, 1987).

1.2.2 Information Retrieval Systems

Information retrieval (IR) is the science of searching for documents, within documents, and for document descriptors, and examines the storage, organization, and fast access to large bodies of data. The data may be textual, visual, or auditory; however, most IR systems concentrate on textual data in the form of documents. Although far from a household term, IR has produced information-management tools that have fundamentally changed the nature of modern intellectual life. Search engines in particular have, in a very short time, emerged as the dominant consumer tool for information access. The roots of IR begin with statistical language analysis, which efficiently and usefully indexes large document corpora based on the occurrence frequencies of the words that they contain. The field of IR has also fostered evaluation as a research program in itself, based on algorithmic measures of large test corpora, and (increasingly) on user studies. From its start in the printed document domain, IR has expanded to include Web-based search and user context. Inspired by the growth of Human-Computer Interaction studies (HCI), there has been growing interest in examining users' information behaviour, leading to a proposal to establish a new field of human-information interaction (HII) (Jones et al., 2006). The position of IR in the information sciences seems in general poorly understood.
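As a concrete illustration, the decay constraint and the fan effect can be sketched in a few lines. The network, link weights, and parameter values below are invented for illustration; this is not an implementation of any of the cited models.

```python
# Minimal sketch of spreading activation with two constraints described
# above: decay ("taxing" activation at each hop) and the fan effect
# (dividing a node's outgoing activation among its links).
# The network and weights are illustrative only.

network = {  # node -> list of (neighbour, link weight in [0, 1])
    "bird":      [("feathered", 0.9), ("swims", 0.2), ("robin", 0.8)],
    "feathered": [("bird", 0.9), ("robin", 0.7)],
    "robin":     [("bird", 0.8), ("feathered", 0.7)],
    "swims":     [("bird", 0.2), ("fish", 0.9)],
    "fish":      [("swims", 0.9)],
}

def spread(network, cue, decay=0.5, threshold=0.05):
    """Propagate activation outward from a cue node until it decays
    below a threshold; return the total activation received per node."""
    received = {}
    frontier = [(cue, 1.0)]
    while frontier:
        node, activation = frontier.pop()
        links = network.get(node, [])
        if not links:
            continue
        share = activation / len(links)  # fan effect: divide among links
        for neighbour, weight in links:
            passed = share * weight * decay  # decay constraint
            if passed < threshold:           # propagation ceases
                continue
            received[neighbour] = received.get(neighbour, 0.0) + passed
            frontier.append((neighbour, passed))
    return received

activation = spread(network, "bird")
# Strongly linked neighbours (feathered, robin) accumulate activation;
# the weakly linked "swims" falls below threshold and is never reached.
```

Because the fan effect fragments activation at every hop, propagation in this sketch terminates after only a step or two, mirroring the bounded spread attributed to human memory.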
In accepting the Salton Award, the field's highest honour, Bruce Croft recently claimed, "although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR" (Croft, 2003). Today CS and IR are less strongly bound, and although many of the same techniques are employed in both camps, IR research is more likely to be found in departments of library and information science (LIS), and in the new multi-disciplinary i-schools, than in computer science. Whereas database research in CS seeks to provide exact answers to a query, IR seeks to produce ranked lists of uncertain, "best guess" results that reflect the user's inferred information need. IR seeks to:

• manage large and growing information corpora
• build, interpret, and maintain complex representations
• find relevant information
• specify accurate and effective queries

At the same time, given the increasing torrent of digital information—some new, some old but newly scanned—it becomes imperative to automate and optimize the processes of information management. This entails a conscious avoidance of rigid hand-built ontologies that rely heavily on expert knowledge, and also an attempt to avoid placing a "burden of decision" on users to continuously and manually refine their own queries and provide the system with ongoing feedback. As with human memory models, the two main types of IR model are the global models that process all information at retrieval time, and the network models that store global evaluations to allow more efficient local processing at run-time.

Global-IR Models (standard IR)

Although not widely acknowledged, it can be argued that global models of IR are much like global memory models.
In both cases, objects (i.e., memories or documents) are defined by discrete descriptors, and compared on the basis of the descriptors that they share. In IR, organization of objects is typically based on vector methods, where the salient features of each document are extracted and used as descriptors for that document. There are many methods by which such keywords are extracted, but the key point is that words that appear in a few documents in a corpus are likely to be good descriptors for those documents, while words that appear in all documents do nothing to differentiate between them. Documents are represented by vectors, much as depicted in Figure 1.1³: all keywords that are used in a corpus are given a value in each of the vectors. As shown, values could be strictly binary, with a 1 given to a keyword that appears in a document, and a 0 otherwise. In the extreme case, all words in all documents could be used as keywords, but this leads to a great deal of unnecessary computation: words that appear in most or all documents in the corpus do not help differentiate between documents, and as such should not appear in the keyword vectors. As larger corpora will likely produce larger vectors, the ideal goal at all times is to find less frequent keywords that are uniquely representative of a document and use those as keywords (Sparck Jones, 1972). Since linguistic statistical analysis has shown word usage to be sparsely distributed (Sigman and Cecchi, 2002; Steyvers and Tenenbaum, 2005), each document will not contain most of the available keywords and the majority of the vector cells will be empty. Greater retrieval accuracy can be achieved if real numbers are used instead of binary values to indicate which keywords are the best descriptors of a document, particularly if the document has a great many keywords.

³ Here the vector for Concept 2 could represent a document that discusses the merits of cherry pie.
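The vector representation and frequency-based weighting described above can be illustrated with a toy sketch. The tiny corpus, the particular tf-idf formulation, and the cosine matching function are illustrative assumptions, not the method used in this thesis.

```python
# Toy sketch of the vector model described above. Keywords are weighted
# so that words concentrated in few documents score highly (in the
# spirit of Sparck Jones, 1972), while words appearing in every
# document score zero.
import math

corpus = {
    "doc1": "cherry pie is a classic pie",
    "doc2": "apple pie recipe with a flaky crust",
    "doc3": "a short history of the bicycle",
}
docs = {name: text.split() for name, text in corpus.items()}

def weight(term, words):
    tf = words.count(term) / len(words)              # term frequency
    df = sum(1 for w in docs.values() if term in w)  # document frequency
    return tf * math.log(len(docs) / df)             # rarer terms score higher

# Sparse vectors: only the terms occurring in a document get a value.
vectors = {name: {t: weight(t, words) for t in set(words)}
           for name, words in docs.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# A query vector is matched against every document vector; a query
# about cherry pie ranks the pie documents above the bicycle one.
query = {"pie": 1.0, "cherry": 1.0}
ranking = sorted(vectors, key=lambda d: cosine(query, vectors[d]),
                 reverse=True)
```

Note that the word "a", which appears in every document, receives a weight of exactly zero and so does nothing to differentiate the vectors, as the text argues.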
In such cases, the values are normalized in the range 0 to 1. Retrieval is performed with matching methods similar to those of global memory models: users select a set of search terms that they believe reflects their information need, and these are added to a query vector, which is then matched against the vectors of the corpus to find those most similar. The problem with this approach is that it ignores synonymy: users must match their search terms to the available keywords exactly in order to elicit good results. In human memory, necessary associations are made automatically on the basis of perceptual similarities, but computers do not have such innate flexibility. IR systems commonly use keyword expansion to retrieve accurate results from imprecise queries, for example by using a thesaurus to match the user's search terms to other keywords available in the system. As with global memory models, once the user has found a useful document, the document's vector can itself be used as a query to retrieve other similar documents, allowing the user to navigate through the corpus's keyword-defined "information space". The foremost method for calculating semantic trends in a set of document vectors is Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997). LSA starts with the document vectors, collected into a sparse keyword-document matrix. LSA finds higher-order concepts (i.e., collections of keywords) that tend to co-occur strongly, and thus represent a semantic summary of the content of the corpus. To accomplish this, LSA uses the rank-lowering method of singular value decomposition (SVD), which keeps the strongest term correlations and prunes away those that are less descriptive, producing a less-sparse representation. Global models of IR face some problems.
First, an exhaustive method for matching a query (or document) vector against all documents in a corpus will be computationally intensive, and unlikely to be executable in real time, leading to bothersome delays for the user. This problem is partly addressed through the use of an inverted index, which lists for each keyword the documents that it references; finding a set of relevant documents is then a matter of retrieving the sets of documents referenced by each keyword and performing an intersection of the sets. Second, the fact that document vectors will be sparse means that a significant amount of storage space will be wasted. Third, methods to characterize the semantic trends of the corpus by calculating its term eigenvectors (e.g., LSA) are highly compute-intensive, and create a static model of semantic trends that must be recalculated if the corpus grows or if its semantic composition changes significantly. Fourth, such methods are black boxes, in that their operation is not inspectable, and their products (as term eigenvectors), although statistically accurate, may not be meaningful to humans. Nonetheless, such problems are relatively trivial compared with the effectiveness of the models, and although decades old, global methods remain the basis of modern IR.

Network-IR Models (Net-IR)

Network models are comprised of nodes that represent documents; some models also use nodes to represent search terms and keywords. All models link nodes together to indicate relatedness, and perform retrieval by a spreading activation process that propagates activation along links from nodes of interest to connected nodes that are hopefully also of interest.

Figure 1.3: A neural network appropriate for information retrieval. The nodes of the input layer can be used to represent keywords. When a particular set of input keywords is stimulated, a trained network will pass activation through the hidden layer to the outputs. The link weight patterns determine which of the outputs will be selected as the most appropriate document. In this example, a range of search terms is mapped onto a single output document, although more nodes can be added to the output layer to represent additional documents.

From this basic architecture network-IR models can exhibit bewildering variety. Although several types of network have been used for information retrieval, we are interested only in those that best fit our motivating goals: the networks should be inspectable, automatically constructed, tuneable, and scalable. Since the neural network (NN) is sometimes used for IR (e.g., Mozer, 1984), and is popular within computer science, we should explain why it is not appropriate to our needs. NNs are typically arranged as a set of input nodes that are completely connected to a layer of hidden nodes, which are in turn connected to a layer of output nodes (Figure 1.3). NNs are robust, good for online learning, accommodate large data sets well, and are often used for function approximation, classification, and clustering. However, NNs are a black-box process: as a whole, they represent useful relations between sets of inputs and outputs, but it is impossible to point to any part of an NN and say where the knowledge lies. Rather, knowledge is distributed throughout the network in its pattern of link weights. Neural networks are built by hand, are relatively difficult to use, and require a good understanding of the underlying theory. Significant experimentation is required to select and tune an algorithm for training on unseen data. As a computational equivalent of low-level neuronal processes, NNs are too fine-grained for application to a large corpus. In terms of our goals, they are neither simple nor inspectable, are difficult to tune, are not built automatically, and are difficult to extend and scale up.
Figure 1.4: A semantic network for information retrieval. In the Cohen and Kjeldsen (1987) model pictured here, concepts are defined by domain experts who link them together manually using various types of relationship. Such networks can provide usefully nuanced valuations, but require an enormous amount of effort to construct, and cannot be considered practical for a large, heterogeneous semantic domain.

Some of these problems are alleviated by using fully symbolic networks, that is, networks in which all nodes represent an object, and links between objects represent identifiable relations. Such semantic networks have virtually no limitations on what types of objects may be represented by their nodes, and what relations by their links (Figure 1.4). Semantic networks are often used for explicitly modeling human-like or expert knowledge, and the representation is often highly taxonomic and logic-based in a manner more appropriate to problem-solving than information-finding (Sowa, 1991). As they provide a visual "map" of a complex description, they are easier to read and understand than textual descriptions. However, semantic networks also exhibit some shortcomings. Finding information relies on a detailed understanding of the network's classification scheme, which is usually the purview of an expert user. Different taxonomic structures may be necessary to satisfy different information needs (Lakoff, 1987). Users have been observed to become disoriented and ignore possibly relevant information due to the cognitive load involved in navigation (Foltz and Kintsch, 1988). The GRANT system of Cohen and Kjeldsen (1987; Figure 1.4) uses spreading activation to find funding agencies that best match a given research agenda. Its large number of link types require carefully hand-tuned activation rules to constrain the spread of activation.
This underscores the weakness of semantic networks with complex ontologies: they are time-consuming to build, and their manual construction is impractical for all but the smallest corpora.

Figure 1.5: Two influential network-IR models. In the Jones model (1986), a query node is created by linking to the terms that describe it best; activation then propagates from the terms to connected documents, and the documents that receive the greatest sum of activation are retrieved. The Belew model (1989) operates in essentially the same manner, although elements are also inter-connected within their own layer. Since the model describes a citation-matching engine, author names are used as a privileged type of keyword and given their own layer.⁴

With respect to our biomimetic goals, the most promising form of existing Net-IR model might be termed a "simplified semantic network". In fact, these networks are remarkably like the network models of human memory described above. As with human memory models, in IR the network models can also be viewed as abstractions of global models. The goal of human memory is to find associated memories; the goal of IR is to find relevant documents. If the relationships between objects (e.g., their similarity) are unlikely to change significantly, why not store them? Using a network for storage is particularly efficient. Unlike the wasted space in the sparse vectors of global-IR, networks need only use as many resources as necessary to indicate only what is there, without the large number of empty cells implied by a sparse matrix model. As with the network memory models, the object features (such as keywords) of a document may be stored inside the (document) nodes, or externally in nodes of their own. Links represent the strength of the relationships between documents, between terms and documents, or between terms.
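The storage efficiency claimed above can be made concrete: a keyword-document matrix reserves a cell for every possible pairing, while a network stores only the links that actually exist. The counts below use invented data purely for illustration.

```python
# Sketch of the storage argument above: a sparse matrix reserves a
# cell for every (keyword, document) pair, mostly empty, while a
# network's adjacency structure grows only with the actual relations.
# All data is invented for illustration.

keywords = ["cherry", "pie", "apple", "crust", "bicycle", "history"]
documents = ["doc1", "doc2", "doc3"]

# Matrix view: every cell exists whether or not it holds a value.
matrix_cells = len(keywords) * len(documents)  # 6 x 3 = 18 cells

# Network view: a weighted link only where a keyword describes a document.
links = {
    ("cherry", "doc1"): 0.8, ("pie", "doc1"): 0.5,
    ("apple", "doc2"): 0.7, ("pie", "doc2"): 0.5, ("crust", "doc2"): 0.4,
    ("bicycle", "doc3"): 0.9, ("history", "doc3"): 0.6,
}
network_cells = len(links)  # 7 links: storage tracks actual relations
```

Since word usage is sparsely distributed, the gap between the two counts widens as the corpus grows, which is the efficiency the network representation exploits.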
The Memory Extender (ME) model (Jones, 1986; Figure 1.5, left) represents both keywords and documents as nodes. Retrieval is a two-step process: a query activates a set of keywords that represent the user's information need, and activation then passes from those keywords to the documents that they describe; the distribution of activation to neighbouring nodes mimics the fan effect of human memory network models (Anderson, 1983). Each document node accumulates incoming activation, and documents with the highest resulting activation levels are retrieved. The model performs a kind of keyword expansion⁵ by reflecting activation back to all keywords connected to the retrieved documents. The new keywords are used to retrieve additional documents, so that related documents can be retrieved even if their keywords did not exactly match the initial query. The Associative Information Retrieval (AIR) system (Belew, 1989; Figure 1.5, right) operates similarly to the Memory Extender, in that it implements the fan effect and a simulation of keyword expansion, but also uses user relevance feedback to tune the weights of the network to improve future retrieval. The AIR model also connects related keywords to other keywords and documents to documents, whereas the ME model does not. As with human memory models, spreading activation in network-IR models is constrained so that propagation will not continue endlessly or even "blow up" to infinite values. The methods of constraint may be more or less "cognitively plausible".

⁴ Images from Jones (1986) and Belew (1989) are copyright by the Association for Computing Machinery (ACM). Reproduced by permission.

⁵ Query expansion is defined as the addition to, or substitution of, search terms with their synonyms, typically through reference to a thesaurus; the reformulated query is used to find related documents that may not contain the given search terms although discussing the same topic (Chowdhury, 2004, pp. 144-145).
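The Memory Extender's two-step retrieval can be sketched as follows. The keyword-document index and its weights are invented, and the sketch is a simplification in the spirit of Jones's model, not his implementation.

```python
# Sketch of ME-style two-step retrieval: activation flows from query
# keywords to the documents they describe (step 1, with the fan effect
# dividing each keyword's activation among its links), then reflects
# back to connected keywords (step 2, simulating keyword expansion).
# Index and weights are illustrative only.

# keyword -> list of (document, link weight)
index = {
    "pie":     [("doc1", 0.5), ("doc2", 0.5)],
    "cherry":  [("doc1", 0.8)],
    "apple":   [("doc2", 0.7)],
    "bicycle": [("doc3", 0.9)],
}

def retrieve(query_terms, index):
    """Step 1: spread activation from keywords to documents."""
    scores = {}
    for term in query_terms:
        links = index.get(term, [])
        if not links:
            continue
        share = 1.0 / len(links)  # fan effect: divide among documents
        for doc, weight in links:
            scores[doc] = scores.get(doc, 0.0) + share * weight
    return scores

def expand(docs, index):
    """Step 2: reflect activation back to all keywords connected to
    the retrieved documents."""
    return {term for term, links in index.items()
            if any(doc in docs for doc, _ in links)}

scores = retrieve({"cherry", "pie"}, index)
top = max(scores, key=scores.get)   # doc1 accumulates both terms
new_terms = expand({top}, index)    # expanded keyword set for re-query
```

The document sharing the most query keywords accumulates the most activation, and the reflected keywords can then retrieve related documents that the original query would have missed.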
Neural networks are path constrained, in that activation passes in a directed way from input to output and is not allowed to wander; path constraints also include limitations on what type of link may be followed under different conditions (Cohen and Kjeldsen, 1987). The simplest distance method merely stops propagation after a fixed number of steps (Cohen and Kjeldsen, 1987), with four steps being roughly equivalent to the estimated distance of propagation in human memory (de Groot, 1983). Other methods perform decay damping that "taxes" activation each time it makes a hop; propagation ends when activation falls below a set threshold. As with human memory models, propagation based on activation levels terminates faster if the activation is fragmented by the fan effect at each hop. Existing Net-IR models exhibit some problems: where they require user feedback to tune their weights, they enforce a burden of decision upon the user; they do not adequately explain or encode the context in which the information is used; they tend to assume small corpora; they are hand-built and thus not easily scalable; they largely ignore useful topological properties of networks (such as clustering and link distribution); and they perform little or no user testing. More generally, despite the clear parallels between human memory and information retrieval, the IR models are not connected to human memory in a principled manner.

1.3 Existing Models of Context

We define context as "the circumstances relevant to something under consideration" (Canadian Oxford Dictionary, 1998). There is more to information use than the semantic content of documents, the search terms that are used to retrieve them, and the semantic relations that are used to connect them. Information use is always situated in a context, such as location or time of day.
For example, the agenda for a regularly scheduled meeting may always be used at 9am on Mondays, regardless of changes to its content—it represents a particular, persistent type of organized activity. Human memory works as much by the context of activity as by semantics (Tulving, 1972). It would therefore be useful to allow users to retrieve information objects by the context of their use as much as by their content—not just by what is said, but when, and under what circumstances. Ideally, contextual retrieval would work by pre-fetching information automatically without conscious intervention from the user. What is retrieved will depend on, and be appropriate to, current activities and demonstrated patterns of usage. This notion brings us back to our mission statement: Information retrieval in aid of human memory should be associative and contextually dependent.

1.3.1 Cognitive Notions of Context

Cognitive science has developed two useful ways to encode context: as schemas and as episodic memories. Schemas define a knowledge structure for rapid human understanding of typical environments, procedures, and social situations; by extension, schemas determine the types of information that are most easily remembered. Episodic memory refers to the human ability to remember events based on the perceptual features of those events.

Schemas

The term schemata was introduced by the German philosopher Immanuel Kant, who believed that we are born with a priori structures that help to organize our perceptions of the world (Gleitman, 1995). Schemas combine the typical temporal and spatial features of well-known activities, such as going on a picnic: the physically observable sensations (flowers, grass, sandwiches) are encoded as a scene schema (Brewer and Treyens, 1981), and an event schema describes the ordered series of actions involved (making sandwiches, packing, finding a good spot, unpacking, eating) (Mandler, 1984).
Schemas act as knowledge scaffolds that enable a degree of automaticity: less cognitive load is required if the actions and objects of a particular task are already well understood. Schemas have been defined by three processes: (1) in accretion, a new experience (e.g., today's picnic) is added to an existing appropriate schema; (2) tuning adjusts existing schemas incrementally to reflect variable circumstances; and (3) restructuring involves changing an existing schema to map onto a novel, unfamiliar event (Rumelhart and Norman, 1981). The importance of schemas for human memory is such that new information that cannot be readily matched to pre-existing patterns is easily lost (Eckhardt, 1990), biasing memory against the retention of atypical details.

Episodic Memory

In his seminal work on memory types, Tulving identifies declarative memory for factual, encyclopedic knowledge (such as recalling the capital of England); procedural memory of how to perform actions (such as riding a bicycle); and episodic memory for personally experienced events (Tulving, 1972). Episodic memory is often explained in terms of encoding specificity, by which memories are more easily retrieved if the cues that were present at encoding are also present at retrieval. Encoding specificity has clear implications for information retrieval: both global and net-IR models use keyword attributes to aid in semantic, declarative retrieval. It would be relatively simple to add new non-semantic attribute types that represent the context of time and place of retrieval. The convergence-zone model of episodic memory (Moll et al., 1994) gathers attributes by type (colour, hardness, etc.) into feature maps (Figure 1.6). These feature maps approximate parts of the brain devoted to different perceptual stimuli.
When a new event is experienced, a binding layer is created to encode the convergence of the event's various attributes, more or less strongly depending on their prominence in the experience. Retrieval occurs when some of the event's attributes reappear later as a set of cues. The most relevant binding layer is activated, and activation propagates back to all component feature maps, "filling in" the missing attributes and reconstructing the complete memory. This approach can be directly applied to contextual IR, since cue attributes can also represent time and place, and be used to retrieve documents based on contextual cues.

Figure 1.6: A network model of episodic memory. In the model of Moll, Miikkulainen, and Abbey (1994), feature maps are sets of attributes collected by type, e.g., colour, hardness, etc. A binding layer is created to encode an experience, and is linked to relevant attributes in the feature maps. A partial cue involving some of the attributes activates the pertinent binding layers, which then propagate activation back to other feature maps, eventually retrieving a complete memory.

1.3.2 Context for IR

Information retrieval systems typically do not encode the context of information use, and prominent textbooks in the area typically omit context as a subject of discussion (Chowdhury, 2004; Frakes and Baeza-Yates, 1992; Salton and McGill, 1983; Witten et al., 1999). Rather, the focus is on information management in large corpora, and consideration of human information needs seems restricted to semantically-based retrieval of target documents. Fortunately, this perspective is changing, particularly as information management, and the manipulation of personal data, becomes more widespread throughout society.
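The convergence-zone retrieval just described can be sketched as follows, with time and place included among the feature maps as the text suggests. All episodes and attributes below are invented for illustration, and the overlap count stands in for the model's graded activation.

```python
# Sketch of convergence-zone retrieval in the spirit of Moll et al.
# (1994): each binding layer links an event to attributes drawn from
# typed feature maps; a partial cue activates the best-matching
# binding layer, which fills in the missing attributes.

# binding layer -> {feature map: attribute}
episodes = {
    "monday_meeting": {"time": "9am", "place": "boardroom",
                       "topic": "agenda", "person": "alice"},
    "friday_lunch":   {"time": "noon", "place": "cafe",
                       "topic": "gossip", "person": "bob"},
}

def recall(cue, episodes):
    """Return the episode whose attributes best overlap a partial cue,
    reconstructing its full attribute set (the 'complete memory')."""
    def overlap(attrs):
        return sum(1 for fmap, value in cue.items()
                   if attrs.get(fmap) == value)
    best = max(episodes, key=lambda name: overlap(episodes[name]))
    return episodes[best]

# A contextual cue (time and place only) retrieves the whole episode,
# including its topic: retrieval by context rather than by content.
memory = recall({"time": "9am", "place": "boardroom"}, episodes)
```

Replacing the episodes with document nodes carrying time-of-use and place-of-use attributes gives exactly the contextual document retrieval the text proposes.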
Information Management

As a practice, information management (IM) deals with the application of information technology to corporate enterprise (Figure 1.7). IM can be divided into two broad areas: technology-oriented and content-oriented information management (Schlogl, 2005).

Figure 1.7: Increasing the scope of information management (IM), from personal to organizational. Personal information management (PIM) focuses on the individual's information needs, and as such can be described in culturally independent functional cognitive terms. Once the scope of IM increases, to the computer-supported collaborative work (CSCW) of groups, and then to the human-centred information management (HCIM) within organizations, functional cognition no longer applies, and must be supplanted by culturally dependent descriptions of group dynamics and social psychology. Beyond these levels, communication between organizations has not been described by human information behaviour, but rather in terms of economics and information technology.

Technology-oriented information management is the purview of Chief Technical Officers and Chief Information Officers, and includes data management, information technology management, and strategic information technology management, which focus on the cost-effective installation of information-processing ability intended to maintain and improve the profitability of the company. The content-oriented side of information management focuses more on the manipulation and organization of document semantics, through the study of human information behaviours, records management, induction of new information, and the provision of information resources such as libraries and intranets. All of these activities are based on developing ontologies and formats (such as fill-in forms and operational protocols) appropriate to a specific work culture.
The study of these concerns has been formalized in the disciplines of Information Architecture (IA) and Knowledge Organization (KO). In brief, information management seeks strict, formal control of the "information life cycle" from knowledge capture to document archiving and retrieval (Schlogl, 2005). Thus we define the basic operations of information management in terms of these technical and human needs as follows:

• The definition of an information object in terms of its ontology
• The induction of raw information into the system, to be represented as internal objects
• The retrieval of objects based on queries
• The discovery of trends in the semantics of objects, and in how those objects are used

Thus, information management defines the notion of both work context and work culture, which together generate specific information needs and usability challenges. However, the basis for these needs is as much social (with respect to culture) as it is cognitive (with respect to the individual). Information management, and indeed cross-over fields such as Information Behaviour (IB) (Fisher et al., 2005) and Human-Centered Information Retrieval (HCIR) / Cognitive-IR (Ingwersen and Jarvelin, 2005), focus particularly on behaviour situated within a cultural environment. By contrast, cognitive memory models are culturally unbiased because they describe the general function of memory at a fundamental level that applies to all humans, rather than focusing on the shared experience of specific socially mediated information structures. Our initial interest, therefore, is in those personal information systems that can make immediate use of general properties of human memory.

PIM and Memory Prosthesis

Personal information management (PIM) organizes information of a visceral nature, such as personal emails, writings, photographs, maps, calendars, agendas, and other documents that fulfill interests related to an individual's work and leisure.
PIM is a new sub-field of IR, born out of increasing interest in user behaviour, and closely allied with similar research in the HCI community. PIM seeks to help users organize their information, and in doing so pays increased attention to information context: when objects were used or created, and with what other objects. In allowing users to retrieve information based on the daily salients of their lives, PIM is essentially (although perhaps not explicitly) an intimate form of memory prosthesis, and when systems automatically retrieve information objects according to context, they function as memory-scaffolding reminders. Since one of our goals is to reduce the user's "burden of decision", we divide PIM research between systems that require direct intervention—sometimes even programming—on the part of the user, and those that minimize these demands. Among the "burdensome" systems, CybreMinder is "a context-aware tool that supports users in sending and receiving reminders that can be associated to richly described situations involving time, place and more sophisticated pieces of context" (Dey and Abowd, 2000). The system specifies a set of temporal and sensory "triggers" to provide reminders consistent with human prospective episodic memory (i.e., memory concerning future events), such as remembering to do something at a given time, a given place, or under particular circumstances. Autominder (Pollack et al., 2002) is an "assistive agent" for planning, monitoring, and reminders. It is purposely designed as a "cognitive orthotic" to help people with memory impairment carry out the activities of daily life. Autominder is part of the control system of a domestic robot. Using a variety of on-board sensors, the robot looks for evidence that critical activities are being carried out as planned, and generates timely reminders.
Given our biomimetic goal of automaticity, the problem with both of these systems is that details of plans and reminders must be programmed in advance by the user or caregiver, requiring technical dexterity with well-formed Boolean statements. Systems that require less user intervention include the wearable remembrance agent (Rhodes, 1997), which tags information objects with available context information (such as location and time), and recommends objects to the user with reference to an ontology that combines contextual with semantic attributes in a global-IR vector model. The iRemember system (Vemuri and Bender, 2004) records the user's spoken audio reminders on a portable PDA, and for retrieval uses audio-pattern matching to generate keywords for each audio snippet; time and physical location can also be used as search cues. The most common approach to interaction with contextual information depicts information events (such as the creation of an information object) in sequence on a time-line, and uses standard (mostly Boolean) search metaphors, as seen in the Forget-Me-Not (Lamming and Flynn, 1994), MyLifeBits (Gemmell et al., 2003), and Stuff I've Seen (Dumais et al., 2003) systems. LifeStreams (Fertig et al., 1996) can organize documents by when they were created, received, or modified; its time line also allows the user to position a document in the future, to be retrieved at a set time as a reminder. All of these systems cite memex (Bush, 1945), which suggests their ambitious cognitive aspirations, but none of them show a clear connection to the functional aspects of human memory models. Rather than requiring the conscious engagement of users, systems should use contextual information to help relieve the user of the need to reformulate queries and provide the system with relevance feedback.
As much as possible, systems should passively observe user behaviour, infer likely user actions, and provide the user with automatic, pertinent information retrieval.

1.4 Our Approach: Network Models of Semantics and Context

Memory models and IR models of semantics and context are remarkably similar, which suggests that IR, in organizing artifacts that are meaningful to humans, acts as a form of memory, or at least as an extension of human memory that externalizes the mass of detailed information that we simply cannot maintain in our heads. However, since document keywords are equivalent to the feature attributes of human memory, we suggest that IR systems act essentially as cue managers and cue scaffolds that stimulate user memories by reminding them of what they need to recall. We take a functional approach to modeling information retrieval based on the processes rather than the mechanics of memory, meaning that the models we use are neither too fine-grained nor too high-level. A fine-grained model would simulate the processes of individual neurons—a computationally expensive proposition. By contrast, an approach that is too high-level enters the realm of abstract, consciously controlled processes (reasoning, language, and so forth) that are heavily influenced by culture. By following the middle path, we can instead simulate how memory behaves in a culturally independent, generalizable way. A similar functional cognitive approach was taken by Hoenkamp (2005), who found that noun phrases are an inherently human and culturally invariant trait, richer than keywords in fulfilling information needs; he suggests that culturally flexible IR systems should be designed from such universally applicable bases. By contrast, much IR research on human information behaviour examines overt behaviours that are culturally biased, and not generalizable—say—between Europe and Africa.
As Hoenkamp demonstrates, the goals of functional research are complementary to the common behavioural approach, while providing a more general basis for design decisions. Information objects may be referenced and retrieved by either or both of our semantic and contextual networks. A valid question is: why use networks for knowledge representation? In general terms, compared to vector methods, networks are easily built and edited, are well suited to sparse data, and are easy to draw. More specifically, networks lend themselves to graph analysis, in terms of the significance of node clusters and the distribution of links per node. Associative networks in particular are good for finding related items, and for navigation by browsing.

1.4.1 The Semantic Network

To ensure the generality of the semantic network, we take a bottom-up rather than top-down approach to network building. The common approach to semantic modeling is to specify an ontological hierarchy from the top down, with classes, subclasses, types, and tokens. The problem with this approach is that the model is never truly correct: there are always new interpretations and alternative divisions of the knowledge space—in fact, the number of possible interpretations derivable from a body of data is virtually infinite (Smith, 1996; Thornton, 2000). Our approach is bottom-up in that it does not use a pre-defined ontology; rather, the knowledge space takes shape according to the variable semantics of the document corpora. Our semantic network represents documents as nodes, and we use only a single link type to connect them, based on strength of relation, as in network-based human memory models. Keywords are automatically extracted from documents using standard IR indexing methods, and stored as a weighted list inside each node. Nodes are then linked more or less strongly to the degree that they share keywords.
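As a concrete illustration, this construction (extract weighted keywords, then link documents that share them) can be sketched in a few lines of Python. This is a simplified sketch assuming tf-idf weighting and cosine similarity; the exact weighting scheme used in the thesis is described in Appendix A.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each document's terms by tf-idf (raw tf times log idf)."""
    n = len(docs)
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())                      # document frequency per term
    return [{t: tf[t] * math.log(n / df[t]) for t in tf} for tf in tfs]

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def build_network(docs, threshold=0.1):
    """Link each pair of documents whose keyword overlap exceeds threshold."""
    vecs = tfidf_vectors(docs)
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = cosine(vecs[i], vecs[j])
            if s > threshold:
                edges[(i, j)] = s   # one bidirectional link, weighted by similarity
    return edges
```

With three toy documents, the two fruit documents become linked while the unrelated document stays unconnected.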
Each document is thus automatically connected to its most similar peers with a single bidirectional link, since the similarity relation between peer documents is symmetric. The interesting aspect of this approach is that we are using a network to store the similarity valuations of a global-IR method: once the document similarities have been calculated and stored in the network, they are more efficient to retrieve than by re-running the global method (Salton and Allan, 1994). An inverted index indicates which documents contain each keyword; keyword search can then be performed as an intersection of the sets of documents retrieved for each term in the query. User search with this system is a two-step process: the user first performs a keyword search to find documents that could satisfy an information need. The user can then navigate to a document's neighbours if they appear to represent a more accurate result, at which point all of the neighbours' own neighbours in turn become available for selection. In this manner, the user can navigate from node to node through the network, exhibiting "information foraging behaviour" by following an "information scent" (Pirolli and Card, 1999). Such ostensive approaches reduce the need for the user to reformulate queries, as each time the user examines a document, the system suggests other documents similar in semantic content (Campbell and van Rijsbergen, 1996; Crestani and Lee, 2000); "Navigability through the semantic structure permits formulation of a query by means of the identification of a semantic path through the reference structure" (Agosti et al., 1991; emphasis added). In our system, keyword search is just the first step in an iterative process of automatic query reformulation by navigation; we manipulate the connectivity of the network to encourage browsing. Without some sort of link threshold, a large number of poorly related nodes will be connected if they have even one weak keyword in common.
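One simple, if crude, remedy is a single global threshold, raised as high as possible while checking that the network stays in one piece. The following is a toy sketch, purely illustrative and not the tuning algorithm contributed by this thesis:

```python
def prune(edges, threshold):
    """Drop links weaker than the threshold."""
    return {e: w for e, w in edges.items() if w >= threshold}

def is_connected(edges, n_nodes):
    """Check the pruned network has not fragmented (depth-first traversal)."""
    adj = {i: set() for i in range(n_nodes)}
    for (i, j) in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for nb in adj[node] - seen:
            seen.add(nb)
            stack.append(nb)
    return len(seen) == n_nodes

def highest_safe_threshold(edges, n_nodes, candidates):
    """Raise the threshold as far as possible without fragmenting the network."""
    best = 0.0
    for t in sorted(candidates):
        if is_connected(prune(edges, t), n_nodes):
            best = t
    return best
```

The tuning algorithm introduced below goes further, shaping the link distribution itself rather than applying one global cutoff.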
Where networks have been used in IR experiments, thresholds are typically set manually to some arbitrary acceptable value (e.g., one low enough to ensure a sufficient number of useful connections); this tends to produce reasonable results, but it is not a principled approach and does not attend to the properties of the resulting network. Recent work on small-world topologies has shown them to be present in many naturally occurring and human-made networks; the more places researchers look, the more such topologies are discovered. They also show optimal navigability with minimal resources (Barabasi, 2002). These properties seem ideal for information networks such as ours, which support browsing and for which an excess of links would be a distracting annoyance to the user. We introduce a simple, novel algorithm that tunes a semantic network to a small-world link distribution while also preventing network fragmentation. Using this approach, we build a semantic network that satisfies our biomimetic goals. It is simply defined; easy to inspect and comprehend visually; scalable (new nodes can be added easily); extensible (new keywords are readily added to the index); and efficient to build, maintain, and use. We tested the utility of the semantic network in a user experiment, which showed that navigating the network gave users easier access to information than repeatedly reformulating their own search queries.

1.4.2 The Context Network

Unlike the semantic network, the context network is built top-down, in that some meaningful ontology reflecting how humans parse events in time and place must be established before data is inducted. In our case, we created a temporal hierarchy based on the calendar, and used GPS coordinates as the basis for a spatial model. The context network, dubbed the Cue-Event-Object (CEO) model, works as a simplified model of episodic memory.
It is pre-programmed with a set of cue nodes that represent the facets of an available schema. For example, a temporal schema would include node representations of the minutes and hours of the day, the days of the week, and so on. A scene schema would be delineated by sensors sensitive to appropriate attributes such as location, velocity, colour, temperature, and proximity. Thus, instead of using expert knowledge to set up static, narrowly focused semantic networks, expertise is required only to decide which attributes will be visible to the model. The main worry then is the possibility of cue blindness, where critical phenomena go unnoticed if they lie in the perceptual gap between sensors (Thornton, 2000). The model observes events as they occur, and builds contextual patterns by linking the relevant cue nodes to an event node equivalent to the binding layer of the convergence-zone episodic memory model (Moll et al., 1994); any information objects used in the described context are also linked to the event node. Whenever possible, the patterns described by the cue nodes are simplified through a process of aggregation. The model is dynamic in that patterns of correlation that are consistent over time maintain a high level of activation, whereas less-used patterns decay to a lower activation level (Hebb, 1949). Patterns that cease to be supported are disaggregated into their component parts. The model can thus provide users with summaries of their information behaviour, and can support queries related to schema cues, such as "what happens at time t?", "when does event e occur?", and "where do I do y?" We tested the context network in an experiment based on user logs. The experimental domain was information services provided during driving, and the experiment was performed at the research laboratory of a major automobile manufacturer.
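Stepping back to the model itself, the cue-event linking and Hebbian-style reinforcement and decay described above can be sketched minimally. All names here are invented for illustration; the real CEO model also aggregates patterns over a cue hierarchy, as detailed in Appendix A.

```python
class CEOModel:
    """Toy Cue-Event-Object store: events bind cue patterns to information
    objects; repeated patterns are reinforced while unused ones decay."""

    def __init__(self, decay=0.9, boost=1.0):
        self.events = []            # each event: [cue pattern, objects, activation]
        self.decay, self.boost = decay, boost

    def observe(self, cues, objects):
        cues = frozenset(cues)
        for ev in self.events:      # all stored patterns decay a little
            ev[2] *= self.decay
        for ev in self.events:      # a matching pattern is reinforced
            if ev[0] == cues:
                ev[1].update(objects)
                ev[2] += self.boost
                return
        self.events.append([cues, set(objects), self.boost])

    def recall(self, cues):
        """Return objects bound to events whose cue pattern matches, strongest first."""
        cues = frozenset(cues)
        hits = sorted((ev for ev in self.events if ev[0] <= cues),
                      key=lambda ev: -ev[2])
        return [obj for ev in hits for obj in ev[1]]
```

For example, after repeatedly observing that e-mail and calendar objects are used on Monday mornings, presenting the cues "monday" and "morning" recalls those objects.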
The premise is that if user driving behaviour can be predicted beforehand, then a variety of information-retrieval services, such as music selection, shopping reminders, and trip advisories, can be provided to enhance "in-car driving happiness". A minimal basis for tailored services is the prediction of trip destination. In our experiment, we used real driving data streams of time and location sensors to predict the driver's destination. We found that the system quickly rose to a reasonable level of predictive accuracy, which increased as more information was input to the system.

1.4.3 Contributions

The contributions of this dissertation are as follows.

The P-MAK framework: the framework introduces a set of principles that describes the intersection between human memory and information retrieval, and provides a philosophical basis for the design of human-centered systems.

Small-world tuning for semantic networks: the introduction of a principled method to build and tune an associative similarity network to a small-world topology, to promote maximal navigability with minimal resource usage.

Static semantic networks for query reformulation: the introduction of static semantic small-world similarity networks as a tool for refining search results through information navigation.

The CEO context model: the introduction of a novel network-based model that uses spatio-temporal aggregation to capture dynamic trends in data streams.

Finally, we hope to foster a greater connection between cognitive science and information retrieval. "The two camps do not communicate much with each other and it is safe to say, that one camp generally views the other as too narrowly bound with technology whereas the other regards the former as an unusable academic exercise" (Ingwersen and Jarvelin, 2005). I believe there is much to be gained from greater collaboration.

1.4.4 Structure of the Thesis

This document is structured as a manuscript thesis.
A manuscript thesis is "constructed around one or more related manuscripts" which have been published, or are in preparation for academic publication. The introduction sets the context of the work, and the conclusion ties the chapters together and suggests avenues for future research (FoGS, 2007). Chapter 1 introduces the thesis as a network-based approach to semantic and contextual retrieval at the crossroads of cognitive science, computer science, and information retrieval. Chapter 2 provides the foundational thinking for the dissertation, defining principles based on cognitive theories of memory that describe how semantic and contextual knowledge can be stored. These principles are developed primarily from a cognitive perspective, and the chapter has been accepted for publication in the journal Minds and Machines (Springer) (Huggett et al., 2007). Chapter 3 tests the utility of the semantic network with a user study. The chapter refers to semantic similarity networks in the guise of hypertext, and has been accepted as a paper for publication in the Proceedings of the Joint Conference on Digital Libraries (JCDL) (Huggett and Lanir, 2007). Chapter 4 tests the efficacy of the context network with a study based on user logs. The chapter looks at the in-car driving experience as a domain area for contextual information retrieval. An abstract of the chapter has been accepted for publication in the Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR), and the chapter's ideas were presented in the SIGIR Doctoral Consortium (Huggett, 2007). A paper is in preparation for submission to the journal Information Processing and Management (Elsevier), for an upcoming special topic issue on adaptive information retrieval (Huggett, 2008). Chapter 5 concludes with reflections on lessons learned and discussion of implications for future work.
There are four appendices: Appendix A details how the semantic network and context network are constructed, and discusses their algorithms and data structures. Appendix B presents materials used in the user study of Chapter 3, and displays a table of results. Appendix C presents a table of data and results for the user-log experiment of Chapter 4. Appendix D presents the ethics approval form for the user study of Chapter 3.

Bibliography

Agosti, M., Colotti, R., and Gradenigo, G. (1991). Issues of data modelling in information retrieval. Electronic Publishing, 4(4):219-237.

Allan, J. and Croft, W. B. (2003). Challenges in information retrieval and language modeling. Final report of a workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002. Technical report.

Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3):261-295.

Anderson, J. R. and Bower, G. H. (1973). Human Associative Memory. V.H. Winston: Washington, DC.

Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Publishing: Cambridge, MA.

Belew, R. K. (1989). Adaptive Information Retrieval: Using a connectionist representation to retrieve and learn about documents. In Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11-20. ACM Press: New York, NY.

Berger, H., Dittenbach, M., and Merkl, D. (2004). An adaptive information retrieval system based on associative networks. In Hartmann, S. and Roddick, J., editors, Proceedings of the First Asia-Pacific Conference on Conceptual Modelling (APCCM2004), volume 31 of Conferences in Research and Practice in Information Technology (CRPIT), pages 27-36. Australian Computer Society.

Binnig, G., Baatz, M., Klenk, J., and Schmidt, G. (2002). Will machines start to think like humans? Europhysics News, 33(2):7pp.

Blank, M. A. and Foss, D. J. (1978).
Semantic facilitation and lexical access during sentence processing. Memory & Cognition, 6(6):644-652.

Brewer, W. F. and Treyens, J. C. (1981). Role of schemata in memory for places. Cognitive Psychology, 13:207-230.

Bush, V. (1945). As we may think. Atlantic Monthly, 176(1):101-108.

Campbell, I. and van Rijsbergen, C. J. (1996). The ostensive model of developing information needs. In Ingwersen, P. and Pors, N., editors, Proceedings of COLIS-96, 2nd International Conference on Conceptions of Library and Information Science, pages 251-268.

Canadian Oxford Dictionary (1998). The Canadian Oxford Dictionary. Oxford University Press: Toronto, Canada.

Chowdhury, G. G. (2004). Introduction to Modern Information Retrieval, 2nd edition. Facet Publishing: London.

Cohen, P. R. and Kjeldsen, R. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing & Management, 23(4):255-268.

Collins, A. M. and Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6):407-428.

Crestani, F. and Lee, P. L. (2000). Searching the web by constrained spreading activation. Information Processing & Management, 36(4):585-605.

Croft, W. B. (2003). Salton Award Lecture - Information retrieval and computer science: An evolving relationship. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2-3. ACM Press: New York, NY.

de Groot, A. M. B. (1983). The range of automatic spreading activation in word priming. Journal of Verbal Learning and Verbal Behavior, 22:417-436.

Dey, A. K. and Abowd, G. D. (2000). CybreMinder: A Context-Aware System for Supporting Reminders. In HUC '00: Proceedings of the 2nd International Symposium on Handheld and Ubiquitous Computing, volume 1927 of Lecture Notes in Computer Science, pages 172-186. Springer-Verlag: London, UK.

Dumais, S. T., Cutrell, E., Cadiz, J.
J., Jancke, G., Sarin, R., and Robbins, D. C. (2003). Stuff I've Seen: a system for personal information retrieval and re-use. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72-79. ACM Press: New York, NY.

Eckhardt, B. B. (1990). Elements of schema theory. Unpublished paper, University of New Mexico. Cited in Fundamentals of Cognitive Psychology, R. Hunt & H. Ellis, 1999. McGraw-Hill: New York, NY.

Fertig, S., Freeman, E., and Gelernter, D. (1996). Lifestreams: An alternative to the desktop metaphor. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '96), pages 410-414. ACM Press: New York, NY.

Fisher, K. E., Erdelez, S., and McKechnie, L. (2005). Theories of Information Behavior. ASIST Monograph Series. Information Today Inc.: Medford, NJ.

FoGS (2007). Masters and Doctoral Thesis Preparation and Submission: Manuscript-Based Thesis. Faculty of Graduate Studies, University of British Columbia.

Foltz, P. W. (1991). Models of human memory and computer information retrieval: Similar approaches to similar problems. Technical Report 91-3, University of Colorado, Boulder, CO.

Foltz, P. W. and Kintsch, W. (1988). An Empirical Study of Retrieval by Reformulation on HELGON. Technical Report 88-9, University of Colorado, Boulder, CO.

Frakes, W. B. and Baeza-Yates, R. (1992). Information Retrieval: Data Structures & Algorithms. Prentice-Hall: Upper Saddle River, NJ.

Gemmell, J., Lueder, R., and Bell, G. (2003). The MyLifeBits Lifetime Store. In Proceedings of the 2003 ACM SIGMM Workshop on Experiential Telepresence, pages 80-83.

Gleitman, H. (1995). Psychology, 4th edition. W.W. Norton & Company: New York, NY.

Gray, J. (1999). What Next? A Dozen Information-Technology Research Goals. The ACM Turing Award Lecture. Technical Report MS-TR-99-50, Microsoft Research.

Hebb, D. O. (1949). The Organization of Behavior.
John Wiley: New York.

Henninger, S. (1995). Information access tools for software reuse. Journal of Systems and Software, 30(3):231-247. Special issue on software reuse.

Hintzman, D. L. (1984). Minerva 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16(2):96-101.

Hoenkamp, E. C. M. (2005). Why information retrieval needs cognitive science: A call to arms. Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 965-970.

Huggett, M. (2007). A network model for context-dependent information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press: New York, NY. An abstract submitted for participation in the SIGIR Doctoral Consortium.

Huggett, M. (2008). A network model for context-dependent information retrieval. A paper in preparation for submission to the journal Information Processing and Management, special topic issue on adaptive information retrieval.

Huggett, M., Hoos, H., and Rensink, R. (2007). Cognitive Principles for Information Management: The Principles of Mnemonic Associative Knowledge (P-MAK). Minds and Machines. In review.

Huggett, M. and Lanir, J. (2007). Static reformulation: A user study of static hypertext for query-based reformulation. In Proceedings of the Joint Conference on Digital Libraries (JCDL). ACM Press: New York, NY.

Huyck, C. R. (2004). Overlapping cell assemblies from correlators. Neurocomputing, 56:435-439.

Ingwersen, P. and Jarvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer: Dordrecht, The Netherlands.

Jones, W. P. (1986). The Memory Extender Personal Filing System. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 298-305. ACM Press: New York, NY.

Jones, W. P., Pirolli, P., Card, S. K., Fidel, R., Gershon, N., Morville, P., Nardi, B., and Russell, D. M. (2006).
Panel: It's about the information stupid!: Why we need a separate field of human-information interaction. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, pages 65-68. ACM Press: New York, NY.

Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press: Chicago, IL.

Lamming, M. and Flynn, M. (1994). "Forget-Me-Not": Intimate computing in support of human memory. In Proceedings of FRIEND21 '94 International Symposium on Next Generation Human Interfaces, pages 1-9. Rank Xerox Research Center: Cambridge, UK.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Lund, K. and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203-208.

Mandler, J. M. (1984). Stories, Scripts, and Scenes: Aspects of Schema Theory. Lawrence Erlbaum Associates: Hillsdale, NJ.

Meyer, D. E. and Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90(2):227-234.

Moll, M., Miikkulainen, R., and Abbey, J. (1994). The capacity of convergence-zone episodic memory. In Proceedings of the 12th National Conference on Artificial Intelligence, AAAI-94, pages 68-73. MIT Press: Cambridge, MA.

Mozer, M. C. (1984). Inductive information retrieval using parallel distributed computation. Technical report, University of California at San Diego, San Diego, CA.

Pirolli, P. and Card, S. (1999). Information foraging. Psychological Review, 106(4):643-675.

Pollack, M. E., McCarthy, C. E., Tsamardinos, I., Ramakrishnan, S., Brown, L., Carrion, S., Colbry, D., Orosz, C., and Peintner, B. (2002). Autominder: A planning, monitoring, and reminding assistive agent.
In Proceedings of the Seventh International Conference on Intelligent Autonomous Systems.

Rhodes, B. J. (1997). The wearable remembrance agent: A system for augmented memory. Personal Technologies Journal (Special Issue on Wearable Computing), 1:218-224.

Rumelhart, D. E. and Norman, D. A. (1981). Analogical processes in learning. In Anderson, J., editor, Cognitive Skills and their Acquisition, pages 335-359. Erlbaum: Hillsdale, NJ.

Salton, G. and Allan, J. (1994). Automatic text decomposition and structuring. In Proceedings of the RIAO Conference: Intelligent Text and Image Handling, volume 1, pages 6-20.

Salton, G. and McGill, M. (1983). An Introduction to Modern Information Retrieval. McGraw-Hill: New York, NY.

Schlogl, C. (2005). Information and knowledge management: dimensions and approaches. Information Research, 10(4):16pp.

Sigman, M. and Cecchi, G. A. (2002). Global organization of the Wordnet lexicon. Proceedings of the National Academy of Sciences, 99(3):1742-1747.

Skinner, B. (1957). Verbal Behavior. Appleton-Century-Crofts: New York, NY.

Smith, B. C. (1996). On the Origin of Objects. MIT Press: Cambridge, MA.

Sowa, J. F. (1991). Principles of Semantic Networks: Exploration in the Representation of Knowledge. Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann: San Mateo, CA.

Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21.

Steyvers, M. and Tenenbaum, J. (2005). Small worlds in semantic networks. Cognitive Science, 29(1):41-78.

Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. MIT Press: Cambridge, MA.

Tulving, E. (1972). Episodic and Semantic Memory. In Tulving, E. and Roberts, M., editors, Organization of Memory, pages 381-403. Academic Press: New York.

Vemuri, S. and Bender, W. (2004). Next-generation personal memory aids. BT Technology Journal, 22(4):125-138.

Watson, J. B. (1919).
Psychology from the Standpoint of a Behaviourist. Lippincott: Philadelphia, PA.

Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann: San Francisco, CA.

Chapter 2

Cognitive Principles for Information Management: The Principles of Mnemonic Associative Knowledge (P-MAK)1

Information management systems are used to organize large collections of information. As such they act as memory prostheses, implying a connection to human memory models. Since humans process information by association, and situate it in the context of space and time, systems can maximize their effectiveness by mimicking these functions. Since human attentional capacity is limited, systems should scaffold their users' cognitive efforts in an easily comprehensible manner. We propose the Principles of Mnemonic Associative Knowledge (P-MAK), which describes a framework for semantically identifying, organizing, and retrieving information, and for encoding episodic events by time and stimulus. Inspired by prominent human memory models, we propose associative networks as a preferred representation. Networks are ideal for their parsimony, flexibility, and ease of inspection. Networks also possess topological properties—such as clusters, hubs, and small-world link distributions—that promote analysis and navigation of an information space. Our cognitive perspective addresses fundamental problems faced by information management systems, in particular the retrieval of related items and the representation of context. We present evidence from neuroscience and memory research in support of this approach, and discuss the implications of systems design within the constraints of P-MAK's principles, using text documents as an illustrative semantic domain.

2.1 Introduction

Memory provides the raw materials for intelligence.
Without memory an intelligent agent, whether mechanical or biological, would be unable to store and compare ideas, or respond appropriately to changing circumstances. Memory's associative nature moves fluidly between related facts to narrow in on information of interest. It provides us with the sense of a continuous personal identity, and of our world. Improved memory forms the basis for greater intelligence by providing an expanded store of knowledge and experience. One theory concerning our evolutionary cousins speculates (based on physiology and anthropological artifacts) that Homo Neanderthalensis had, despite a smaller working memory, an excellent long-term memory that enabled them to compete with and sometimes even surpass us (Wynn and Coolidge, 2004).2 Enhanced memory expands cognitive abilities, much as a crane extends the ability to lift, or a vehicle improves the ability to travel.

1. A version of this chapter has been accepted for publication: Huggett, M., Hoos, H., and Rensink, R. Cognitive Principles for Information Management: The Principles of Mnemonic Associative Knowledge (P-MAK). Minds and Machines (Springer).

How can we design more sophisticated information-management systems that are still comprehensible to our fallible "stone-age" brains? Words on paper have always been subject to physical constraints—such as printing, shipping, filing, and shelving—that make their management burdensome. In modern times, information management has evolved from tactile filing of ink-based books, papers, and folders, to channeling an increasing flow of digital media—a task beyond human capacity.
Instead, automated processes should summarize what users have available to them, facilitate their searches, and direct them to the information that best suits their needs. Humans are predisposed to a particularly human structure of ideas: we can communicate across language and culture barriers and anticipate how others think (Ekman, 1971; Osgood et al., 1975). Thus, to make information management systems more useful to a wider range of people, it seems reasonable to apply functional cognitive principles to data storage and retrieval. Daily experience shows us that our memories are highly associative: given a particular cue (e.g., red), our thoughts and perceptions stimulate a wealth of contextually relevant memories (fire, apples, cherries, stop, etc.). The general principle of associationism holds that higher mental processes—and memory in particular—result from connections between sensory and mental elements based on experience (Anderson and Bower, 1973), and associative network models are prominent in cognitive science as descriptions of memory, knowledge, and reasoning (e.g., Anderson, 1983; Collins and Loftus, 1975; Quillian, 1969; Raaijmakers and Shiffrin, 1981). The physical medium underlying such mental functions is less intuitive: the brain is a massively parallel processor with billions of neurons, each of which typically gains synaptic inputs from thousands of others. All neurons fire continually at varying rates, and we have only the vaguest notion of how this roar of activity resolves to our subjective experience of reality. Although connectionist models that simulate brain activity at a neural level (e.g., Rumelhart et al., 1986) have provided important insights into pattern learning, it would be an overwhelming task to model higher-level memory processes in such detail.
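At the functional level targeted here, association can be captured far more cheaply than in neural-level simulation. For example, a few lines suffice for a toy spreading-activation pass over a small concept network. This is an illustrative sketch only; the node names and link weights are invented.

```python
def spread(network, sources, steps=2, damping=0.5):
    """Propagate activation from cue nodes through weighted links.

    network: {node: [(neighbour, weight), ...]}
    Returns the final activation per node; cue nodes start at 1.0."""
    act = {node: 0.0 for node in network}
    for s in sources:
        act[s] = 1.0
    for _ in range(steps):
        nxt = dict(act)                       # synchronous update
        for node, out in network.items():
            for nb, w in out:
                nxt[nb] += damping * w * act[node]
        act = nxt
    return act

# A toy associative network around the cue "red".
net = {
    "red":   [("fire", 0.8), ("apple", 0.6), ("stop", 0.7)],
    "fire":  [("red", 0.8)],
    "apple": [("red", 0.6), ("fruit", 0.9)],
    "stop":  [("red", 0.7)],
    "fruit": [("apple", 0.9)],
}
act = spread(net, ["red"])
```

Activating the cue red raises fire, stop, and apple directly, and fruit only weakly via apple, mirroring the graded relevance of associative recall.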
Instead we embrace the motto "to the desirable via the possible" (Marr, 1982), and take a pragmatic interest in the overt functions of memory, which are more easily represented and implemented. We simultaneously avoid an emphasis on human behaviour, since behaviour may be influenced by cultural contexts. Our goal, then, is to suggest culturally neutral systems that are automated, simple, and easily understood. Systems are ultimately created to fulfill human needs, and in doing so should use metaphors that are easily human-interpretable. We propose the Principles of Mnemonic Associative Knowledge (P-MAK), a framework for information management systems that is both inspired and guided by several related fields: information retrieval, cognitive modeling, knowledge representation, evolutionary psychology, and neurophysiology. The resulting set of principles combines the functional strengths of machine computation and human cognition. In the short term P-MAK can suggest improvements to current information systems; in the long term it suggests the convergence of semantics, contextual interaction, and human-centred information management (HCIM). P-MAK exploits the brain's useful information-processing paradigms and presents information in a way that is familiar to human users—and thus easier to use. Although we make no claims as to whether particular cognitive models are complete and accurate descriptions of mental function, they are useful starting points for information systems in the principles that they embody. Human memory has strengths to be mimicked, but also weaknesses to be scaffolded (as discussed in Section 2.2).

2. However, the lesser working memory of Neanderthalensis eventually allowed the incremental innovation of Homo Sapiens to pull ahead.
The P-MAK framework describes four sets of theoretical principles that structure and guide the development of information systems: they are divided in pairs into fundamental principles of the properties of brains and the constraints of machines, and organizational principles for the accumulation and context of semantic knowledge.

Two sets of fundamental principles describe the necessary basis of processing in mind and machine (Section 2.4):

Mechanistic principles. These concern the computational constraints on the design and operation of computing machines and information management systems, which should be familiar to every computer scientist: minimal use of resources, scalability, heterogeneity, extensibility, and reliability.3

Anthropic principles. These concern the properties and constraints of human memory and knowledge that machines should incorporate to provide more effective user interactions. Human information processing is fundamentally associative. It also requires simple interpretations to rapidly quantize the world into discrete objects, and form meaningful abstractions as summary 'maps'.

The organizational principles are based on these fundamental principles, and describe how the structure of information can facilitate information-management operations, especially when using cognitive paradigms supplemented by the context of time and place (Section 2.5):

Epistemic principles. These describe the basics of inducting, organizing, and retrieving information objects with an associative knowledge structure. Information objects are described pragmatically in terms of discrete attributes, which are assigned by a perceptual classification process. The similarity of objects is then judged based on attributes that they have in common. Objects are retrieved both by queries that match specified attributes, and also ostensively (i.e., by example) through similar objects.

Situational principles.
These describe the main contextual characteristics of human memory: encoding regular and co-occurring events, the intervals at which they occur, and the physical conditions that accompany them. The re-occurrence of events can be predicted from this contextual structure.

P-MAK's principles are a particularly good match for representation using associative networks. Networks can encode relations between information objects optimally and explicitly, and when sparse and scale-free, they show advantages for organizing, exploring, and analyzing information spaces (Section 2.6).

Some of the terms in this paper have been used elsewhere in different contexts. Since we take a functional approach that focuses on cognitive information processing, P-MAK applies primarily to information management for individuals. Human-centred information management (HCIM) is more broadly defined by human information-seeking behaviours, and by the cultural (usually business) context in which information-seeking takes place (Schlogl, 2005). We believe that our functional approach is complementary to HCIM, and that the combination of functional and behavioural factors is necessary for the development of user-friendly systems. The challenges of information management lie between these poles of function and behaviour, determined by human characteristics and shaped by cultural context. Ultimately, HCIM functions as a cognitive prosthetic that extends memory to recover past information, and that augments perception to find new, useful information.

Footnote 3: These principles apply to brain function as well, as suggested by Marr (1982), but such an analysis is separate from our focus on improved human-machine interaction.
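To make the situational principles concrete, here is a minimal Python sketch of contextual encoding: it records event occurrences, the intervals between them, and their co-occurring conditions, and predicts recurrence from the median inter-event interval. The class and method names (EventContext, observe, predict_next) are our own illustrative assumptions, not part of P-MAK.

```python
from statistics import median

class EventContext:
    """Record occurrences of named events and the conditions that accompany them."""

    def __init__(self):
        self.times = {}     # event -> list of occurrence timestamps, in order
        self.cooccur = {}   # (event_a, event_b) -> co-occurrence count

    def observe(self, event, timestamp, concurrent=()):
        """Encode an occurrence of an event, together with co-occurring conditions."""
        self.times.setdefault(event, []).append(timestamp)
        for other in concurrent:
            key = tuple(sorted((event, other)))
            self.cooccur[key] = self.cooccur.get(key, 0) + 1

    def predict_next(self, event):
        """Estimate the next occurrence from the median inter-event interval."""
        ts = self.times.get(event, [])
        if len(ts) < 2:
            return None
        intervals = [b - a for a, b in zip(ts, ts[1:])]
        return ts[-1] + median(intervals)
```

For example, an event observed on days 0, 7, 14, and 21 would be predicted to recur on day 28, and its co-occurrence with a condition (e.g., "friday") would accumulate as contextual structure.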
2.2 The Challenge of Human Memory

The sapiens brain was well-matched to our ancestral environment (Wynn and Coolidge, 2004), but a fast-paced computer-mediated society requires increased powers of recognition and recall for general factual knowledge, and for events and experiences—in short, for improved semantic and episodic memory systems (Tulving, 1972).4 To better understand the relation between memory and information systems, we first examine the properties of human memory by way of its deficits, and how memory has been explained by computational models. We then look at how artificial memory aids—mnemonics—have been used to support memory-based tasks, and at system models specifically inspired by models of human memory.

2.2.1 Failures of Memory

Brains can simultaneously execute many memory-based tasks, such as planning, talking, risk assessment and the identification of stimuli. But while impressively powerful, human memory is also prone to frequent functional lapses. Some key examples are:

Forgetting — The timely recall of important information is critical to survival. Human memory can be fast and efficient, but its failures are often frustrating, even dangerous. The proposition that forgetting is nature's way of keeping our minds clearly focused on important aspects of a changing environment (Anderson and Schooler, 1991) is small consolation when useful facts need to be recalled accurately on demand. Memory performance can be affected by poor initial encoding, such as wandering attention, or may be the result of poor cues, such as trying to recall context-dependent details in a radically different context.

Footnote 4: Since information management is primarily concerned with long-term storage and retrieval, we do not directly discuss short-term working memory (Baddeley and Hitch, 1974), since this has more to do with attentive processes that pose queries and process retrievals.
If memories are permanent, as some researchers believe (e.g., Bahrick, 1984), then there is much to be gained by making their re-activation more reliable.

Fallacious Integration — Given several descriptions of similar situations, humans tend to integrate the details erroneously into new hybrid memories. This phenomenon increases with the complexity of the descriptions: facets of a situation are combined into a single representation of the general idea, and this gist is remembered while details are forgotten. Memories created in this way are taken as fact unless some detail is so corrupted as to contradict the gist of the original descriptions (Bransford and Franks, 1971).

Presupposition & Inference — The gist itself is also subject to corruption. For example, experiments on eyewitness testimony have found that choice of words during questioning can alter memories of actual experiences, and subjects may even claim to have seen details that did not occur. Memory often becomes a blend of true facts and erroneous information implied later (Loftus and Palmer, 1974). Distortions also occur when people find themselves in a particular situation, and based on expectations choose an inappropriate behavioural script for that situation, biasing the gist and leading to misinterpretations that are mistaken for true memories (Brewer and Treyens, 1981).

Biased Memories — Changes in context can also have a significant effect on the interpretation of a past experience during recall. Although people tend to believe that they draw from a stable semantic memory that has been faithfully abstracted from hundreds of episodes, experiments on episodic influence have shown that even a single experience can have a biasing effect and produce low-probability responses during questioning (e.g., Jacoby and Witherspoon, 1982).

Machines have properties that are complementary to human memory failures.
Machine-based data storage is less prone to degradation over time, and can be backed up for ensured permanence; where rapid data indexing fails, brute-force searching can recover data reliably in a way that brains cannot. Machines can retain raw data with high accuracy, and different interpretations of the same information objects can then be reliably generated and compared as goals and perspectives change. The relative permanence of machine-based data storage makes it impervious to suggestibility, and although bias in brains is hidden and can be difficult to detect, the biases of machines are inspectable in the algorithms they use, and in how their code is written. Brains have powerful abilities, but machines are more reliable, and can act as supports and scaffolds for the weaknesses of human memory.

2.2.2 Computational Models of Memory

A great deal of work has been done to understand how human long-term memory operates, by developing cognitive memory models that describe the functions of human memory in implementable mathematical detail. These models follow an information-processing paradigm that treats the brain as a computing device, or the computer as a "thinking machine". As we prefer a simple, discrete approach, we focus on functional symbol-manipulating models, rather than the resource-dependent neural models that mimic cellular activity. However, neural networks are not incompatible with our approach, as we shall see later. Though none of the following computational models can be considered complete, each accurately depicts aspects of memory's empirically-observed properties.

The pioneering Teachable Language Comprehender (TLC) was written as a computer program with the goal of recreating human inferential ability. It organizes knowledge into a hierarchical conceptual organization much like a taxonomy tree, with a general root node (e.g., animal) connected to subordinate nodes (e.g., fish, bird, mammal, etc.)
that are each connected to subordinates (e.g., salmon, tuna, pike, etc.). At each node, defining properties are included by listing physical characteristics and abilities. All of the characteristics of a superordinate node (e.g., animal → has-skin) are inherited by its subordinate nodes, so that a shared attribute only has to be defined once, giving a cognitive economy of attributes. The structure of TLC models the human category size effect: questions in larger domains take more time to search, and the time required is directly related to the number of links between nodes in the hierarchy (Collins and Quillian, 1969; Quillian, 1969).

Building on TLC, the Spread of Activation Model (Collins and Loftus, 1975) uses links of different lengths to indicate relative strength of association. The length of the links reflects the time required to activate related concepts, thereby encoding their semantic distance and typicality. Spread of activation also accumulates activation in the nodes related to the activated node; even when this may not be enough to make them fire, there is still a priming effect that allows nodes to reach full activation faster.

In Associative Strength Theory (Anderson, 1983), memories are recovered according to how strongly related they are to a presented cue. Activation spreads to related memory traces, which rise into consciousness if their activation level exceeds a given threshold. The theory explains observed phenomena such as slower response times when faced with more choices, modeling this delay with the fan effect, which treats activation as a finite quantity to be shared among all connected nodes.

Each of these models uses spreading activation as a relatively simple constrained process in long-term memory. By contrast, global matching models perform retrieval based on the combination of cues in short-term memory, and operate in highly parallel fashion.
One example is the Search of Associative Memory (SAM) model (Raaijmakers and Shiffrin, 1981; Gillund and Shiffrin, 1984), which describes long-term memory as a set of "images" that are "closely interconnected, relatively unitized, permanent sets of features" describing context as well as semantic content. For retrieval to occur, a set of cues is assembled in short-term memory; these cues activate the images to which they are connected. Parallelism reaches its zenith in models such as MINERVA 2, which assumes that a query is matched against all memories in parallel, and that all memories then respond in parallel, "the retrieved information reflecting their summed output" (Hintzman, 1984).

Although these computational models successfully mimic specific characteristics of human memory, they do not generalize well and fail outside the boundaries of their assumed conditions. A more general approach has been the development of cognitive architectures comprised of cognitively-justified tools and theoretical constraints, with the goal of performing a full range of human cognitive tasks; they are used to develop and test new cognitive models. The two most well-known architectures are Soar (Nason and Laird, 2005) and ACT-R (Anderson et al., 2004). ACT-R describes neural-like computation, and assumes that human cognition is optimally evolved to reflect statistical trends in the environment. By contrast, Soar is based on the premise that humans use knowledge in a rational way in order to achieve goals, and assumes that human cognition is a symbol system built upon a connectionist neural physiology (Johnson, 1997). Although more complete in their description of human cognition, cognitive architectures are too broad and powerful to serve as the basis of a cognitive IR system. Compared with spreading activation models, cognitive architectures and global matching models are computationally expensive, and thus beyond the scope of practical information management.
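The constrained spreading-activation process shared by the symbolic models above can be sketched in a few lines of Python. The sketch passes a decaying quantity of activation outward from cue nodes, dividing each node's output among its neighbours (a crude rendering of the fan effect); nodes above a firing threshold are "retrieved", while sub-threshold nodes remain merely primed. The network, decay rate, and threshold here are illustrative assumptions rather than parameters taken from any of the cited models.

```python
def spread_activation(links, cues, decay=0.5, threshold=0.1, steps=2):
    """Spread a finite quantity of activation from cue nodes through an
    associative network. At each step a node passes decay * its current
    activation, divided equally among its neighbours (the fan effect:
    more links mean less activation per link). Sources retain their own
    activation, so repeated steps model cumulative priming."""
    activation = {node: 1.0 for node in cues}
    for _ in range(steps):
        incoming = {}
        for node, level in activation.items():
            neighbours = links.get(node, [])
            if not neighbours:
                continue
            share = decay * level / len(neighbours)   # fan effect: split output
            for n in neighbours:
                incoming[n] = incoming.get(n, 0.0) + share
        for node, extra in incoming.items():
            activation[node] = activation.get(node, 0.0) + extra
    # only nodes above threshold "fire"; the rest are primed but not retrieved
    return {n: a for n, a in activation.items() if a >= threshold}
```

With links = {"red": ["fire", "apple", "stop"], "fire": ["heat"]} and cue "red", the direct associates fire while the two-step associate "heat" accumulates only sub-threshold (priming) activation.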
2.2.3 Memory Prosthesis

The challenge of human-centred information management (HCIM) is to compensate for the weaknesses of human memory while still taking advantage of its strengths. Its research has taken two paths. The first has been to develop systems that support human memory by whatever means, and not model, mimic, or explain it. These systems have been relatively narrow in scope, acting as reminders or trackers, or sometimes as special-purpose task assistants. The second path has been to develop systems based on the known characteristics of human memory. There have been relatively few of these, and as research testbeds they have seldom provided full functionality or been widely used.

Artificial Associative Memory Aids

The use of mnemonics—devices intended to assist the memory—began with cave paintings and progressed through clay tablets to the invention of paper. Computerized systems have since given us powerful indexing and search functions that organize large sets of items with relative speed and ease. Today portable personal devices (such as phones and digital notepads) are probably the most ubiquitous and familiar memory aids. They can access online information, and are useful for prospective (or forward-looking) tasks if used to plan schedules and chime notices, but they do not yet fulfill the promise of human-like associative information management as presciently described by Vannevar Bush (Bush, 1945). The invention of the computer led Bush to imagine a hypothetical multimedia information system called memex that would contain all of a person's books, records and communications in "an enlarged intimate supplement" to memory, "mechanized so that it may be consulted with exceeding speed and flexibility".
It would store enduring associative "trails" of items collected on a subject, to provide a form of clustering, navigation, and memory cuing.5 Although Bush's vision exceeded the technology of his time, its inspiration has since driven researchers to create personal information management systems that record the salient details of a person's activities (Want et al., 1992), index everything seen on their computer (Dumais et al., 2003), and display personally-relevant items on a timeline (Fertig et al., 1996; Gemmell et al., 2002). But such systems are not inherently associative, and do not provide navigation between related items as would Bush's trails. Rather, modern memory aids typically focus on indexing and mining items from a database, based on user annotations and existing property fields.

A few commercially-available systems employ an associative model, with various ways of organizing and displaying archived information. MindManager (2006) and TheBrain (2006) provide little in the way of automated classification of information, but instead provide a common graphical interface in which users themselves input and link items together to create a communal knowledge structure, consolidated into a single repository as an enterprise-wide "knowledge platform." Where the goal is the "automation of unstructured information", associative enterprise systems can be complex and expensive. Autonomy (2006) is a state-of-the-art Bayesian-inference system that provides real-time information retrieval: it analyzes words as they are typed and opens new windows showing related news releases, archived reports, and diagrams, and also displays the contact information of related experts.

Footnote 5: Memex is thus the earliest system design to embody many of P-MAK's anthropic principles, such as associationism.
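Bush's associative trails are straightforward to render computationally. The toy sketch below (the class and method names are our own inventions, not from memex or any cited system) stores named, enduring paths through a collection of items, supporting both step-by-step navigation along a trail and cuing by shared membership.

```python
class Memex:
    """A toy sketch of Bush's associative 'trails': named, enduring paths
    through a collection of items, supporting navigation between neighbours."""

    def __init__(self):
        self.items = {}    # item id -> content
        self.trails = {}   # trail name -> ordered list of item ids

    def add_item(self, item_id, content):
        self.items[item_id] = content

    def blaze_trail(self, name, item_ids):
        """Record an enduring, ordered association among collected items."""
        self.trails[name] = list(item_ids)

    def next_on_trail(self, name, item_id):
        """Follow a trail one step forward from the given item."""
        path = self.trails.get(name, [])
        if item_id in path:
            i = path.index(item_id)
            if i + 1 < len(path):
                return path[i + 1]
        return None

    def trails_through(self, item_id):
        """All trails that pass through an item: a form of memory cuing."""
        return sorted(n for n, path in self.trails.items() if item_id in path)
```

An item that appears on several trails thereby cues every subject it was collected under, which is exactly the associative navigation that timeline- and index-based systems lack.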
Cognitively-Inspired Systems

Although human memory and computation are sometimes directly compared (e.g., Anderson, 1989; Foltz, 1991), the field of memory has not yet enjoyed the same migration of concepts between human sciences and computation as seen for vision (Marr, 1982). Computer simulations are commonly used in cognitive science to support particular cognitive theories, but are rarely used to form the basis of an actual information system.

One notable exception is the Memory Extender (Jones, 1986), a "personal filing system" that seeks to combine the benefits of human memory and electronic storage. Its network representation reflects the principle of associationism, while its weighted node-and-link architecture connects computer files with the terms that describe them best, mimicking symbolic memory models. Queries using a given term activate the term's node, which then passes activation along links to nodes of files that contain the term. Queries with multiple terms increase activation in the nodes that share those terms, which then rise to the top of a ranked list of retrieved items. New files are linked to context nodes, which indicate the particular circumstances of the file's creation. The system decays the activation levels of unused nodes over time so that they are eventually "forgotten". The system is intended as an enhancement (or replacement) of the standard file-and-folder (FaF) desktop idiom, but is concerned only with file retrieval and does not connect files directly based on shared characteristics, or offer a summary overview of themes in the corpus.

A different but significant development is Latent Semantic Analysis (LSA; Deerwester et al., 1990), an indexing system that scores candidate documents against a representative corpus based on word co-occurrence statistics. It has been successfully used for cross-language information retrieval, information filtering, text analysis and essay grading.
While LSA has shown expert human-like classification abilities (Landauer et al., 1998), it has some drawbacks. As a matrix-based method in a high-dimensional space, it requires significant amounts of storage space and computation. It is a batch process that requires the induction of a large training set before use, and as a "black-box" process its parameters must be hand-tuned for each collection. The corpus's semantic dimensions are determined algorithmically at an abstract mathematical level, and are difficult to describe in humanly-meaningful terms. Once the knowledge structure is built, it cannot be easily edited or updated with new concepts (Lemaire and Denhiere, 2004; Zha and Simon, 1999). While LSA has been advanced convincingly as a cognitive knowledge model (Landauer and Dumais, 1997), its vector-based approach forces symmetrical similarity between terms, which contradicts psycholinguistic findings.6

[Figure 2.1: The information-mapping process of human memory. The shaded area reflects a mapping from raw data to representation. Memory retrieval adds a feedback loop in which user actions adjust the knowledge structure to reflect the context of a user's interests. The mapping process itself, if adaptive, is dynamically biased toward higher-value objects.]

2.3 Human-Centred Information Management

For computing machines, "the performance of the device is characterized as a mapping from one kind of information to another" (Marr, 1982, p.24), a transformative process that takes a stream of raw input and interprets it to produce a structured output. The goals of information retrieval are two-fold: prospective finding of new relevant information, which maps new information into a growing knowledge structure, and retrospective recovery of information that has already been seen, which maps user queries and behaviour onto relevant retrieved objects.
These mappings are equally pertinent to human memory or computational information systems, and we believe that information management systems could gain by basing their operation on the functions of human memory.

2.3.1 The Basic Operations of Information Management

A person's changing goals affect how items are perceived, encoded, organized, and retrieved in memory (Barsalou, 1983; Mandler, 1984). As humans recall information and use it to decide their next actions, they refine their search by repeating an interaction loop that brings them closer to their goal (Figure 2.1). The mental data stream is then more than a direct mapping from input to output: learning and retrieval are entangled, as retrieval favours items similar to those that were previously useful. This interaction between attentional processes and long-term memory is common in cognitive memory models (e.g., Anderson, 1983; Raaijmakers and Shiffrin, 1981).

Information management systems should be designed to support and exploit this human characteristic. For this they require a simple, pragmatic, easily-understood ontology: discrete objects that can be organized into sets, and a set of basic operations that create and manipulate these objects. In the broad sense, an information object is comprised of two parts: a physically-instantiated entity—typically a discrete stored datum such as a document or image—and a descriptive reference that acts as a pointer to that entity. The reference is held in the information management system, and is used to retrieve the entity, to compare it to other data, and to determine topic clusters.

Footnote 6: Our P-MAK framework similarly assumes symmetrical similarity between documents; however, we consider our documents as independent peers in a corpus, and less subject to relative evaluations of parts of speech, semantic hierarchies, or typicality considerations of individual words.
By contrast, the entity is stored in a database that may be at some remote location. Such basic operations and objects necessarily apply to any agent—human or otherwise—that interacts with an information source:

• Specification — the definition of an object's ontology: its structure, composition, properties and elements, and how objects are inter-related.

• Induction — the incorporation of objects into a knowledge structure.

• Modification — the mutation of objects and relations to increase their utility, either by inferring trends or through direct user intervention.

• Retrieval — the recovery of useful objects from the knowledge structure.

2.3.2 Mind-Machine Symbiosis

Machines need not be like humans, but human-machine symbiosis is important for systems to function as an extension of human memory. Physical comparisons are misleading: how machines achieve this symbiosis cannot depend on mimicking the details of brain architecture. The brain contains billions of highly-connected neurons, versus a few thousand linked processors in the most complex machines. This difference in scale is likely to hold for some time. Wang and Liu (2003) estimate the relative capacity of human memory at a staggering 10^8432 bits. Comparative estimates of processing power show a similar imbalance: Moravec (1998) estimates human processing power at 100 million MIPS (Million Instructions Per Second), or 10^14 individual 'commands' per second. Yet despite the massive parallelism of the brain, the human capacity for attending to multiple information sources is very limited (e.g., Baars, 1993; Kahneman and Treisman, 1984). For this reason human interaction with information sources typically involves a sequence of simple actions that narrow in on a target: the serial presentation of a few well-chosen items is less overwhelming than presenting an entire corpus all at once.
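The information-object ontology of Section 2.3.1 (a stored entity, a descriptive reference built of discrete attributes, and the basic operations that manipulate them) can be sketched as follows. Similarity by shared attributes is rendered here as Jaccard overlap, one plausible choice among many; the class and function names are our own illustrative assumptions.

```python
class Reference:
    """A descriptive reference: a pointer to a stored entity plus the
    discrete attributes assigned to it by some classification process."""

    def __init__(self, entity_uri, attributes):
        self.entity_uri = entity_uri       # where the entity itself lives
        self.attributes = set(attributes)  # Specification: discrete attributes

def similarity(a, b):
    """Jaccard overlap: objects are judged similar by shared attributes."""
    if not a.attributes or not b.attributes:
        return 0.0
    return len(a.attributes & b.attributes) / len(a.attributes | b.attributes)

class KnowledgeStructure:
    def __init__(self):
        self.refs = []

    def induct(self, ref):
        """Induction: incorporate an object into the knowledge structure."""
        self.refs.append(ref)

    def retrieve(self, query_attributes):
        """Retrieval by query: rank objects by matching specified attributes."""
        q = set(query_attributes)
        hits = [(len(q & r.attributes), r) for r in self.refs]
        return [r for score, r in sorted(hits, key=lambda h: -h[0]) if score > 0]

    def retrieve_ostensive(self, example):
        """Retrieval by example: rank objects by similarity to a given object."""
        ranked = sorted(self.refs, key=lambda r: -similarity(example, r))
        return [r for r in ranked if r is not example and similarity(example, r) > 0]
```

Modification, the remaining basic operation, would amount to editing a reference's attribute set (or the inter-object relations derived from it) as trends are inferred or as the user intervenes.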
The feedback loop in Figure 2.1 applies also to the interactive nature of information management, which requires a sequence of interaction between user and information system. Interaction is improved if the system's function is easily understood by the user, and if the user's actions are properly interpreted by the system. The design of effective systems requires that users should be presented with a set of rational and predictable behaviours, while machines should relieve users of more repetitive or complex operations, and adapt to user preferences (Hoffman et al., 2002).

Information systems are generally prone to a number of frustrating problems. Even technically-fluent users experience operational setbacks, such as hardware failures and accidental deletions, and so continue to rely on paper documents (Whittaker and Hirschberg, 2001). Typical file-and-folder (FaF) systems do not organize information semantically, which forces users repeatedly to judge the relevance of their files, to create (often naive) ad hoc organization schemes, to remember where things are stored, to remember what search terms are applicable, to search exhaustively for lost files, and so forth. Machines that do provide automatic indexing are challenged by the intricate nature of knowledge, such as recognition and manipulation of semantic subtleties, or adaptation to unanticipated changes of context. Other problems are purely algorithmic, such as efficient data management in large, ever-growing archival storage systems. Our P-MAK framework defines the constraints implied by these problems.

                    OPERATIONAL          ORGANIZATIONAL
  UNIVERSAL         Mechanistic:         Epistemic:
                      Parsimony            Identification
                      Scalability          Perception
                      Portability          Similarity
                      Plasticity           Navigation
                      Robustness
  HUMAN-CENTRED     Anthropic:           Situational:
                      Associationism       Persistence
                      Simplicity           Temporal Cueing
                      Quantization         Sensorial Cueing
                      Abstraction          Event Convergence

Table 2.1: The P-MAK framework. In terms of scope, the universal principles apply to any optimal general-purpose intelligent agent that interacts with information, while the human-centred principles describe the nature of human information processing: its key strengths and constraints. In terms of function, fundamental principles define the (internal) constraints and qualities that affect knowledge structures, while organizational principles concern an intelligent system's interpretation of and interaction with the external world, of which the system's user is a part.

2.3.3 Introduction to Principles: P-MAK

For human minds to interact well with machines, systems should be designed to support the best aspects of both. The principles that apply are divided into two general dimensions. The first concerns the scope of constraints: universal principles are applicable to any intelligent agent (human or otherwise), while human-centred principles are specific to the nature of human cognition. Universal principles originate in the information-management task itself, independent of the agent that performs it, and define general computational considerations for the design of optimal systems. For example, it is usually desirable that an operation should remain tractable as the size of a dataset increases, regardless of how an agent processes information. By contrast, the human-centred principles describe cognitive properties that can improve interaction between users and information management systems; as such, they are descriptive rather than prescriptive. For example, it is a human-centred principle that humans require meaningful summaries, since non-human agents (natural or artificial) may not have such a requirement.
The second general dimension concerns a principle's functional application: operational principles define the fundamental (internal) constraints and qualities that affect an agent's knowledge structures directly, while organizational principles concern a system's interpretation of and interaction with the external world (including the user). The scalability of data structures is an operational principle, since it is independent of both the semantics of the data and the goals of the user. Meanwhile, comparing objects based on shared attributes pertains to an agent's efforts to make sense of its environment by organizing observed information. Although operational and organizational principles can be separated out in theory, in practice they often interact. Indeed, the operational principles set the constraints by which the organizational principles function. The aspects of scope and function divide the principles as follows (Table 2.1):

i) Mechanistic principles (universal, operational) — The necessary properties for efficient computation and data retrieval.

ii) Anthropic principles (human-centred, operational) — The inherent properties of human memory that would usefully be incorporated into information-management systems.

iii) Epistemic principles (universal, organizational) — The processes necessary for an intelligent agent to induct, classify, and retrieve information.

iv) Situational principles (human-centred, organizational) — The environmental aspects of how humans organize knowledge: by combinations of co-occurrence, time, and physical context.

Next we describe the principles in more detail, after which they are used to guide the creation of knowledge, and place it in an environmental context.

2.4 The Fundamental Principles

The purpose of computers is to help people think: computers then become as much extensions of human cognitive efforts as reading, writing or drawing.
But computers often do not appear collaborative, since they are primarily designed to maximize various hardware properties such as bit depth and MIPS. The need for improved usability has led to the founding of human-computer interaction (HCI), a field that seeks to facilitate the flow of information between humans and computers (see e.g., Baecker et al., 1995). The goal of HCI is not to improve the fundamental operation of machines per se, but to interpose a layer of translation between human and machine. Thus HCI seeks to accommodate and support cognitive properties at the interface level, but seldom applies them directly to machine function.

P-MAK takes the opposite approach: it proposes that information within machines should be stored and retrieved in a manner that is inherently biomimetic (i.e., based on forms in nature) so that it is inherently comprehensible. Human memory represents a clearly successful approach to information retrieval and processing. Our goal then is to transfer this efficient memory structure to machines for better organization and retrieval of information. Conversely, the implementation of human-like processes in machines can make machines more comprehensible by providing users with a familiar (in fact ingrained) information-management paradigm, instead of an ad hoc system-specific one.

2.4.1 Mechanistic Principles: Making Machines Effective

Compared to brains, the relative simplicity of machines leads to a set of fundamental computational concerns that must be assumed for effective information-management implementation. These are encompassed by the mechanistic principles:

Parsimony — Systems should minimize the use of storage and computational resources. Computing systems can maximize simplicity and efficiency by reducing the usage of computational resources (such as cycles or storage space) to a practical minimum.
With good design of algorithms and data types, more can be computed in less time. Even as computers become more powerful, the questions that we ask of our systems will become more demanding, requiring that we continue to use appropriately minimal designs that maximize efficiency. Parsimony has a direct influence on choice of data representation and tractability of computation (Smith, 1996; Moravec, 1998).

Scalability — Systems should use scalable structures for efficient retrieval in growing datasets. As information repositories grow, systems should continue to retrieve their items in a reasonable amount of time. But information systems do not automatically scale well as more data is added, and while most people are willing to wait a few seconds for a search to complete, beyond that they become impatient (Wickens and Hollands, 1999). Indexing algorithms build reference structures that group the items of a corpus semantically so that items can be more easily found. Indexing should partition the semantics of an archive in a balanced and organized way, allowing rapid navigation from the general to the specific. For example, indexing is straightforward where items can be sorted alphabetically in binary search trees, guaranteeing fast log-time access, but complex semantic spaces can resolve to many dimensions—hundreds or even thousands (Burgess and Lund, 2000; Deerwester et al., 1990)—which would be intractable to index in real time. Access times can be kept low by pre-calculating and storing relationships, but then the incorporation of new data may require an expensive recalculation of corpus-wide properties.

Portability — Systems should apply readily to different and diverse domains. Systems should be adaptable to new uses and new applications with a minimum of effort.
In the domain of information retrieval (IR), the classification of data is dependent on its type; a classifier appropriate to sorting textual documents is useless for images or sound (Witten et al., 1999). A heterogeneous dataset will therefore require a classifier appropriate to each of its expected data types. For machines to fulfill their function as "thinking tools", ideally they should be able to accept data from any domain, since mining through such an enlarged and diverse set could find new and interesting relations.

Plasticity — Systems should use structures that are easy to reconfigure. A simple representation that can be quickly elaborated and updated to reflect changes in data relations would be ideal, and contributes to parsimony by improving efficiency and reducing usage of resources during reconfiguration. Plasticity is useful where inferring categories depends upon the needs and perspective of users; the data representation should promote assembly of ad hoc categories that match user interests. The antithesis of plasticity is found in the traditional database model, where every field in a data record is of a particular type (e.g., an integer), is allocated a set amount of memory (e.g., 32 bits), and represents a fixed property (e.g., name, address, age, income, etc.). Few such presuppositions can be made about the content of data in a heterogeneous, ever-changing, real-world situation.

Robustness — System operation should degrade minimally as the quality of information deteriorates. The brain is extremely good at reasoning under uncertainty, since human survival has depended upon making mostly correct choices under uncertain conditions. Information systems, on the other hand, follow fixed instructions; traditional database models do not easily support robust retrieval with imprecise queries, which confounds users who don't know the words best describing an item, or who erroneously consider different combinations of words to be equivalent.
To counter the requirement of exact query terms, information retrieval systems have used statistical term weighting, truncation, and synonymy (Witten et al., 1999); other methods such as vector-space, singular-value decomposition, or Bayesian comparison of terms between documents have also been successfully employed to increase robustness (Foltz, 1991).

2.4.2 Anthropic Principles: Making Knowledge Comprehensible

Human memory exhibits various characteristics that should be applied to information management systems. The anthropic principles discuss essentially ingrained user expectations of how information should be managed.

Associationism — Human memory is—functionally speaking—associative, and the most important associations are semantic. Cognitive science has provided strong evidence for the semantic associativity of human memory, which affects most aspects of acquiring, structuring, and exploring knowledge. Semantic priming has been observed where subjects judge word pairs to be real words more quickly if they are semantically related (Meyer and Schvaneveldt, 1971), and words seem to be encoded by their semantic relatedness (Blank and Foss, 1978). Associationism lends itself to straightforward implementation on a machine, essentially as a stimulus-response model in the functional tradition of Skinner (1977). As such it includes well-specified learning behaviours such as the association of co-occurring pairs and the principle of reinforcement (Hebb, 1949).

Simplicity — Humans cannot easily use information if it is too plentiful or complex. To manage information effectively, humans need clear and simple representations. Humans can hold only about 4 to 7 items in working memory (Cowan, 2000; Miller, 1956), and have inherent bounds on their rationality which require shortcuts in reasoning (Todd and Gigerenzer, 2000).
Information systems should therefore emphasize simple knowledge structures where possible, to avoid overwhelming human comprehension—simplicity refers to the limits on the quantity of information that can be absorbed at one time.

Quantization — Humans perceive the world as a collection of discrete objects and concepts. Quantization describes how humans divide up the world into types and tokens. This "chunking" seems to derive from evolutionary pressures: symbolic abstraction is a fast and efficient way to identify and reference known entities and their descriptive properties. Since language is commonly modeled as a symbol system with a generative grammar (Chomsky, 1965), and since human thought seems to occur in the context of discrete words (Whorf, 1956), biomimetic information management can be justifiably based on a symbolic cognitive paradigm. The Language of Thought hypothesis (Fodor, 1975) further supports the notion that thought is explainable by the manipulation of symbol tokens, that complex ideas are compositions of simpler (atomic) symbols, and that symbols are combined in the same structure-sensitive compositions as language. Indeed, quantization appears to be a necessary principle: the alternative continuous ontology, the representation of one's world by a real-valued density function, would require an infinite amount of information and calculation to produce reactions and decisions (Smith, 1996).

Abstraction — Humans require meaningful summaries in order to make sense of their world. Although instance theories of human memory claim that we store images of all our experiences (Medin and Schaffer, 1978; Hintzman, 1984; Nosofsky, 1984), there is no doubt that human intelligence is primarily defined by the ability to abstract—essentially, to generalize and make theorems for rapid, informed decision-making. Complex information can be made comprehensible through good organization, which can be described in two ways.
First, in informatic terms, organization can be modeled by similarity functions and by clusters of concepts. Second, organization is central to comprehension, as humans construct interpretations around a central idea (e.g., Bransford and Johnson, 1973), classify identical items differently depending on current goals (e.g., Barsalou and Sewell, 1984), and attribute different meanings to items depending on context (e.g., Labov, 1973; Lakoff, 1987; Medin and Schaffer, 1978). Abstraction produces better quality of information: inadequately accurate or descriptive summaries tax our limited human attention.

2.5 The Organizational Principles

The organizational epistemic and situational principles describe how knowledge is created and used. The epistemic principles describe universal processes for encoding and retrieving semantic information, while the situational principles describe non-semantic organization according to regularities of time and environment.

2.5.1 Epistemic Principles: Building Knowledge

The epistemic principles describe the transformation of raw information into useful knowledge; as such they encompass the basic operations of information management described earlier. The principles of identification, perception, and similarity describe how information is encoded and organized into knowledge, while the principle of navigation describes how stored information is retrieved. Raw information is inducted into discrete objects (the principle of identification) that are identified by their salient properties (perception), organized based on alikeness (similarity), and retrieved through an interactive process of one or more steps (navigation).

Identification — Objects are discrete, and described in terms of semantically discrete attributes. Identification is a necessary principle of information management, following from the principle of quantization: the alternative to discreteness, a continuous density function, is intractably complex.
Identification specifies how a semantic object describes the entity that it represents. It is based on the idea that humans tend to notice the salient attributes of an object, and use those attributes to compare objects and generalize abstractions about their world, as described in a number of feature-based knowledge models (e.g., Medin and Schaffer, 1978; Posner and Keele, 1970; Rosch and Mervis, 1975). Feature Set Theory (Smith et al., 1974) is typical of these, and describes objects in terms of defining features (those shared by all objects of a class) and characteristic features (those common but not essential). In information management, words that consistently identify a type of document would be descriptive, while words giving minor differentiation between documents of that type would be characteristic. Attributes in the P-MAK framework need not be limited to keywords, but could also encode continuous values in discrete dimensions. In an information-space of related concepts, an attribute can be defined as any semantic description that is meaningful to humans. Osgood (1952) uses a set of Likert scales of opposing qualities to describe a concept; choosing a real-valued number on a scale denotes the relative strength of its qualities. For example, given the qualities wet-dry and active-passive, rain would score toward wet and active, whereas sand toward dry and passive. Other methods define words in terms of semantic microfeatures such as humanness, softness, and form (McClelland and Kawamoto, 1986), or derive the semantic components of words by collating the subjective valuations of a large number of people (McRae et al., 1997). These are good ways to describe entities in terms of meaningful continuous attributes (cf. LSA: Landauer and Dumais, 1997; Lemaire and Denhiere, 2004), but they also require labor-intensive surveys of human subjects to determine where a particular object lies on each scale.
Automated extraction of attributes would be preferable in real-world situations with large corpora. Although identification occurs at a level above the operation of individual neurons, some fundamental concepts may be represented in the brain in the form of neuronal clusters (such as the impression of redness being contained in a cluster in the visual cortex), while compound concepts are synchronously distributed in clusters throughout the brain (such as the colour, texture, and behaviour components of a dog node). Thus an associative knowledge structure can be mapped onto the physical structure of the brain by assuming that an object or attribute is an abstract representation of a neuronal cell assembly comprised of some 10K to 100K neurons (Goertzel, 1997; Huyck, 2001; Pulvermuller, 1999). Indeed, the firing patterns of actual neural signaling systems have been interpreted as "the state symbol-alternatives of the higher-level symbol-processing description" (Cariani, 2001). Such comparisons suggest a link between low-level, high-granularity connectionist neural networks and more abstract symbolic structures.

A symbolic approach to identification has several advantages. Whereas file-and-folder indexing and the desktop metaphor of typical information management systems emphasize an object's physical location, an object identified by its attributes can be retrieved through content-based addressing, so that remembering where it is becomes less important than specifying what it is. Attributes are human-readable, conforming to the principle of simplicity—objects are then comprehensible given the set of component attributes by which they are identified. Thus identification is fundamental to any information management system that uses a language or symbol system to reference distinct objects.

Perception — Objects should be distinguished and assigned descriptive attributes by perceptual classification.
Perception isolates and identifies individual objects by extracting and registering their attributes. Automated attribute extraction is essential for information management systems, particularly where large numbers of new, uncategorized items are inducted. This requires the design of individual classifiers sensitive to the key attribute types of a particular domain—for instance, the features of images are different from those of sounds, and are processed differently. The constraints of human perception and comprehension usually determine what qualities are important in each case, but since what is notable about an object often varies according to the perspective of the observer, the best strategy is to extract as many of an object's salient attributes as possible so that the object may be interpretable in many different contexts. Information systems use the equivalent of perception to perform adequately in their "environment" of data streams. For example, data-mining systems may be considered perceptual since they detect patterns in data sets. However, the patterns that they find may be difficult to define in human terms—this is a larger challenge of automated intelligent tools: to explain clearly the basis of their decisions. If their actions are not understandable and inspectable, it is difficult to know whether they actually provide their intended service. The goal of a perceptual classifier then is to observe a real-world entity, create a data object to reference it, and extract the comprehensible semantic attributes that best describe it. Each object's attributes are weighted to reflect how descriptive they are; attributes above a certain threshold will be interpreted as descriptive, while the rest are characteristic. Since documents are a "cognitively plausible" source material for a semantic system (Frank et al., 2003; Landauer, 2002; Lemaire and Denhiere, 2004), they make an instructive example domain. 
The field of information retrieval (IR) is concerned with the classification, clustering and recovery of documents in large collections. Tfidf, an acronym for "term frequency, inverse document frequency" (also written tf-idf), is one of the simplest—and most illustrative—classification algorithms (Salton and McGill, 1983). It extracts human-readable identifying attributes ("keywords") for each document in a corpus, and thus acts as a perceptual classifier for documents. The intuition behind tfidf is that frequently-appearing terms (i.e., words) in a document tend to be descriptive of that document, and should be used as its keywords. At the same time, in a corpus of documents describing for example the many fates of fish, the term fish itself is a poor choice for a keyword, since it will not help to differentiate between documents. If the corpus has many subtopics treating the different species of entree, more specific terms such as salmon, herring, and bream would be better descriptors. With N the number of documents in a corpus, n_k the number of documents containing term k, and tf_ik the number of times term k appears in document i, the expression—

    idf_k = log(N / n_k)         for some term k
    w_ik  = tf_ik * idf_k        for some document i

—guarantees, in the first line describing inverse document frequency, a bias against terms that appear in many documents in the corpus. The second line calculates the weight of each term in a given document; the number of times that the word appears in the document is tempered by the term's corpus-wide idf. Once all the words in all the documents in the corpus have been weighted, the top-weighted words in each document are used as that document's attributes. Classifiers such as tfidf are consistent in principle with cognitive global matching models of memory such as SAM (Raaijmakers and Shiffrin, 1981) and MINERVA 2 (Hintzman, 1984), which describe a simultaneous matching of cues against all images in long-term memory.
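The weighting scheme above can be sketched in a few lines; the toy corpus, whitespace tokenization, and top-2 keyword cutoff are illustrative assumptions, not part of the original formulation:

```python
# A minimal tfidf sketch: idf_k = log(N / n_k), w_ik = tf_ik * idf_k.
import math
from collections import Counter

corpus = [
    "fish salmon migration river salmon",
    "fish herring baking recipes herring",
    "fish bream conservation depletion bream",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# n_k: the number of documents containing each term k
n = Counter(term for doc in docs for term in set(doc))

def keywords(doc, top=2):
    """Return the top-weighted terms of one document as its attributes."""
    tf = Counter(doc)
    weights = {k: tf[k] * math.log(N / n[k]) for k in tf}
    return sorted(weights, key=weights.get, reverse=True)[:top]

for doc in docs:
    print(keywords(doc))
```

Because fish appears in every document, its idf is log(3/3) = 0 and it never survives as a keyword, while the species terms differentiate the documents exactly as the intuition predicts.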
Unfortunately, the computational expense of global matching models is impractical for serial machines: in the case of tfidf all of the words in the corpus are counted before a document's keywords can be extracted, and it would be costly to repeat such a search as new objects are added to an existing large corpus. Fortunately there are more parsimonious alternatives such as the algorithm proposed by Matsuo and Ishizuka (2004), which uses the distribution of term clusterings in individual documents to extract their keywords, and shows comparable performance to tfidf without requiring a corpus-wide summation.

The implication of perception for the design of information management systems is that a homogeneous (e.g., document-only) data source requires just a single classifier tuned to its domain. Heterogeneous databases pose a greater problem: classifiers are necessarily data-dependent. For instance, the automatic extraction of attributes from images should cope with the specifics of pictorial data (e.g., Barnard and Forsyth, 2001). Thus a suite of classifiers will be needed to cover each of a range of expected data types.

Similarity — Object similarity should be based on shared attributes. Objects are compared by the attributes that they share: the more two objects share attributes, the greater their similarity score. Similarity valuations between all pairs of objects are then used to build semantically-based associative knowledge structures. In humans the judgment of similarity is a real-time process that compares an object's attributes against those of other objects stored in memory. Objects are organized first by their sensory qualities, and as we learn more about objects they gain additional attributes and become more strongly associated with similar objects. The similarity principle applies this process to the organization of information within information-management systems.
Objects that have all attributes in common are highly similar, while those with none in common are completely dissimilar. More accurate judgments are possible if attribute weights (such as provided by tfidf) are considered. For example, a document with descriptive attributes "fish, depletion, conservation, migration" would be related to another document with attributes "recipes, baking, cream, fish" at least to some degree, since both documents elaborate further on the fate of fish. Many different similarity equations are possible, depending on the desired ontology and how it should be tuned by weighting its components (see e.g., Tversky, 1977). For example, the following assumptions would be congruent with P-MAK's principles:

• Attributes that are more heavily weighted are the dominant attributes of the object (analogous to the defining features of Feature Set Theory; Smith et al., 1974).

• Two objects that share dominant attributes are more similar than are two objects that share weaker attributes.

• Two objects that share dominant attributes but whose weaker attributes diverge might be two different descriptions of the same thing.

• Where one object's set of attributes is larger than that of another object, the attributes of the latter may simply be incomplete.

From these constraints, the following simple similarity measure can be derived. Given two objects defined by attribute vectors N and M of arbitrary lengths—

    sim(N, M) = ψ( Σ_{i ∈ N ∩ M} w_i )

—which adds the weights w of all shared attributes i, where ψ is a function normalizing the output score to the range [0,1]. We could make further refinements, for example by penalizing non-matching attributes or by biasing toward supra-threshold attributes, but ultimately the determination of similarity is dependent on what similarity means—it could be context-dependent or include asymmetric relations between objects (such as chili being more closely related to red than vice versa), to better capture the nuanced interdependencies of language.
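A minimal sketch of this measure follows, assuming tfidf-style attribute-to-weight dictionaries. The choice of ψ(x) = x/(1+x) as the normalizer, the decision to sum weights from both objects, and the example weights are all assumptions; any monotonic squashing function would serve:

```python
# A sketch of shared-attribute similarity: sum the weights of shared
# attributes, then squash the raw score into [0, 1).
def similarity(obj_a, obj_b):
    """obj_a, obj_b: dicts mapping attribute -> weight (e.g., from tfidf)."""
    shared = obj_a.keys() & obj_b.keys()
    raw = sum(obj_a[k] + obj_b[k] for k in shared)  # weights from both objects
    return raw / (1.0 + raw)                        # psi: normalize to [0, 1)

doc1 = {"fish": 0.9, "depletion": 0.7, "conservation": 0.6, "migration": 0.5}
doc2 = {"recipes": 0.8, "baking": 0.7, "cream": 0.5, "fish": 0.4}

print(similarity(doc1, doc2))  # weakly related: only "fish" is shared
print(similarity(doc1, doc1))  # maximal for identical attribute sets
```

Note that the two fish documents score above zero but well below identity, matching the intuition in the text that they are related "at least to some degree".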
Explicitly storing the results of similarity calculations between objects, thereby building an associative data structure, yields significant advantages: once determined, information of relatively low volatility—such as overt similarity—does not need to be recalculated. The strength of relationship between two objects can be immediately determined, and an item's nearest neighbours can be immediately retrieved, giving an amortization of searching that obeys the principle of parsimony.7 Since information management often deals with large data sets, high dimensionality, and expensive similarity measures that would be impractical to recalculate in real time (Moreno-Seco et al., 2003), the storage of cumulative results in an associative knowledge structure leads to significant time savings.

7 By contrast, one problem with vector-space representations is that with every query, similarity must be recalculated between the query vector and all vectors that share one or more terms with the query. Recalculation is typically more costly than retrieving pre-computed values.

Navigation — Objects and categories should be organized to allow efficient retrieval from a knowledge space. Since information retrieval in the broadest sense represents an exploration of an information space, the principle most associated with retrieval is called navigation. The interaction may involve only a single step, as in direct querying, or it may involve iterative searching based on the local and global attributes of the information space.

Direct Querying — The simplest form of retrieval is direct querying, which takes users to their final goal in one step by matching search criteria against a corpus to find the best matches. Direct querying can be performed as a spreading activation process, such as in the memory model of Anderson (1983), where highly activated items are retrieved as most relevant to the query, with the single most highly activated item being the target.
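A direct query of this kind can be sketched as a single pass of activation from query attributes through an inverted index into the objects they describe; the index contents and weights below are hypothetical:

```python
# A sketch of direct querying as one step of spreading activation:
# each query attribute injects weighted activation into the objects
# indexed under it, and the most activated objects are returned.
from collections import defaultdict

# Inverted index: attribute -> [(object, weight), ...] (toy data)
index = {
    "fish":         [("doc1", 0.9), ("doc2", 0.4), ("doc3", 0.8)],
    "conservation": [("doc1", 0.6)],
    "recipes":      [("doc2", 0.8)],
}

def direct_query(attributes, top=2):
    activation = defaultdict(float)
    for attr in attributes:
        for obj, w in index.get(attr, []):  # unknown attributes activate nothing
            activation[obj] += w
    return sorted(activation, key=activation.get, reverse=True)[:top]

print(direct_query(["fish", "conservation"]))  # doc1 is the most activated target
```

The failure modes described next fall straight out of this sketch: a vague cue like "fish" activates everything, and a cue absent from the index activates nothing at all.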
Similarly, since objects in P-MAK are indexed by descriptive attributes, a search using those attributes will retrieve the most relevant objects. While direct queries are fast when the target is well-known, they may retrieve too many objects when search criteria are too vague. Conversely, an overly-specific query may return few or no results, and exclude some that may have been highly relevant. Direct queries will certainly fail if the query's target attributes (or their synonyms) do not appear in the knowledge structure. When direct querying fails, a more costly search process must be used. The amount of time required can be considerable if objects are not associated by some useful property such as similarity or co-occurrence. Moreover, if only partial information is available (in either objects or query), additional time-consuming reasoning processes may be necessary (Russell and Norvig, 1995). Searching is also inevitable when an information space grows beyond a critical size while the small number of terms in a typical user query remains the same, reducing the precision and recall of direct queries. In the worst case, the system may need to check each item in turn.

A faster, more effective form of navigation through an information space is possible given two complementary sources: (i) local cues (or signposts), based on attributes visible in an object's most similar "neighbour" objects, and (ii) global overviews (or maps), based on a summary description of a large knowledge structure. These search strategies are also found in human wayfinding, both in determining the next move with respect to immediate surroundings, and in reviewing available resources from a situating overview such as a map. Seeking behaviours seem to come naturally to humans, possibly because they are similar to the foraging and hunting behaviours of our ancestors (Sharps et al., 2002).
Local Search with Signposts — In human wayfinding, local sign-following is based on nearby cues in the environment, and occurs quickly since these cues are usually present (Hunt and Waller, 1999). Some influential cognitive models (e.g., Anderson, 1983; Raaijmakers and Shiffrin, 1981) posit that recovering a memory begins with a best guess that retrieves a set of items. The best of these items is activated to retrieve its cohorts, and then an item selected from among those cohorts is activated, repeating until a satisfactory item is found or the search is abandoned. Similarly, users of an information-management system are likely to pick a next step that appears to bring them closer to their intended target, based on attributes visible in currently-retrieved objects (Teevan et al., 2004). During the search, a document's nearest neighbours will share a significant number of keywords, but will have some distinguishing keywords closer to what is being sought—the likelihood of a successful search is improved if local attribute "signage" is adequately complete and non-ambiguous. Such iterative searching approaches items of interest, in a semantic gradient descent that quickly narrows the search for relevant items. Ideally, the system will assist in suggesting next steps, for instance by indicating items that are related to previous user choices.

Such semantic traversal is more efficient if objects are not too densely interconnected, and show the small-world property of many short-range connections and a few long-range ones (Milgram, 1967; Watts and Strogatz, 1998). Small-world systems (typically represented as networks) tend to have diameters exponentially smaller than their size, so that on average a short path may be found between any two items. For example, when the World Wide Web contained a billion pages, only 18.59 clicks on average separated any two (Albert et al., 1999).
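The sign-following loop described above can be sketched as a greedy walk over a small similarity network, moving at each step to whichever neighbour best matches the target cues and stopping when no neighbour improves the match. The documents, attribute sets, and links are hypothetical:

```python
# A sketch of local sign-following as greedy semantic gradient descent.
network = {
    "doc_fish":    {"attrs": {"fish"},                     "nbrs": ["doc_salmon", "doc_recipes"]},
    "doc_salmon":  {"attrs": {"fish", "salmon"},           "nbrs": ["doc_fish", "doc_rivers"]},
    "doc_rivers":  {"attrs": {"salmon", "rivers", "dams"}, "nbrs": ["doc_salmon"]},
    "doc_recipes": {"attrs": {"fish", "recipes"},          "nbrs": ["doc_fish"]},
}

def local_search(start, target_cues):
    """Follow neighbour 'signposts' until no neighbour improves the match."""
    current = start
    score = len(network[current]["attrs"] & target_cues)
    while True:
        best, best_score = current, score
        for nbr in network[current]["nbrs"]:  # inspect local signage
            s = len(network[nbr]["attrs"] & target_cues)
            if s > best_score:
                best, best_score = nbr, s
        if best == current:  # no improvement: stop here
            return current
        current, score = best, best_score

print(local_search("doc_fish", {"salmon", "rivers"}))  # reaches doc_rivers via doc_salmon
```

Like any greedy descent, the walk can stall at a local optimum when signage is incomplete, which is exactly why the complementary global maps of the next subsection are needed.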
The organizing principle of the Web (its "classifier") involves authors linking their pages to others based on subjective similarity judgments. This property of local self-organization has positive implications for improved navigability in other semantic information spaces, such as the sort of automatically-constructed similarity-based knowledge structures described by P-MAK.

Global Search with Maps — Global map-following occurs where users coordinate their searches using high-level summary overviews that act as guides (or substitutes) to exploration (Hunt and Waller, 1999). By itself, local sign-following does not reveal the big picture, and with too many choices at every step users can quickly become overwhelmed and disoriented. To judge quickly the relevance of a corpus to one's information needs, a summary of its offerings would be useful. Users unfamiliar with the contents of a corpus will have difficulty developing retrieval strategies since they will not know what comprises a good cue. They will also have difficulty in judging the relevance of what is retrieved relative to other items in the system. The principle of abstraction is applicable here, by providing a high-level summary that exposes points of interest and imposes some inspectable order upon them, permitting users to jump to dominant items and use local sign-following from there. Users can then ascertain how the information space may be useful to them, and orient themselves wherever they happen to be within it. To create such a semantic index, the attributes of the corpus's objects can be used to build a tree of descriptors, with the most common at the top, connected to major sub-descriptors, each of which are connected to their sub-descriptors, etc. down to the leaf level of individual objects (cf. Koller and Sahami, 1997). Objects that are general enough in content can appear at higher levels of the tree.
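One simple way to sketch such a descriptor tree is to partition objects recursively by their most widely shared attribute; this is a deliberately simplified stand-in for the hierarchical classification methods cited, using toy data:

```python
# A sketch of a general-to-specific descriptor tree: the most common
# attribute labels each level, and objects are recursively partitioned
# by their remaining attributes down to leaf-level items.
from collections import Counter

def build_map(objects):
    """objects: dict of name -> set of attributes. Returns a nested dict."""
    if not objects:
        return {}
    counts = Counter(a for attrs in objects.values() for a in attrs)
    if not counts:                       # no attributes left: leaf level
        return {"items": sorted(objects)}
    top, _ = counts.most_common(1)[0]    # most widely shared descriptor
    with_top = {n: attrs - {top} for n, attrs in objects.items() if top in attrs}
    without = {n: attrs for n, attrs in objects.items() if top not in attrs}
    node = {top: build_map(with_top)}
    if without:
        node["other"] = build_map(without)
    return node

docs = {
    "d1": {"fish", "salmon"},
    "d2": {"fish", "herring"},
    "d3": {"fish", "salmon", "rivers"},
}
print(build_map(docs))
```

On this toy corpus the root is fish (shared by all documents), with salmon and herring branches beneath it, mirroring the general-to-specific containment hierarchy described in the text.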
The resulting map exposes the main topic, as well as describing subtopics at increasing levels of detail, in a general-to-specific containment hierarchy of categories—such an organized summary of the semantics, in human-readable form, offers a comprehensible mental model that improves navigation through the information space (Hunt and Waller, 1999; Rainsford and Roddick, 1999). Because they highlight configural information (i.e., the interactions and relations between isolated components), maps are in some ways superior to exploration based on local cues. First, they show the topic categories available in an information base, how they are related (by semantic distance, or by intervening topics), the relative importance or amount of material in certain topics, the biases of over- or under-represented topics, and how the knowledge structure has been organized. Second, maps support orientation by showing users the best entry points from which to start local sign-following, where the users are currently situated within the knowledge structure, their semantic "bearing" (the topics that they appear to be approaching or leaving), and wayfinding clues such as well-used pathways, neighbourhoods, and landmarks. Local signposts and global maps are complementary; humans use either according to need and preference. Information management systems should purposely exploit them as well, since systems become more useful if they store information in a way that promotes quick contextual retrieval, and their interfaces become more useful if they include human-centred searching, browsing, and indexing facilities (Bertel et al., 2004).

2.5.2 Situational Principles: Capturing Context

Situational principles describe how an information management system should interact with the world—and the users that inhabit it—by encoding the temporal and environmental context that humans find important.
Since computers are not inherently sensitive to such properties, the most pragmatic approach is to examine how humans incorporate the statistical regularities of their experiences into their memory structures. Human memory is by definition time-based and dynamic, requiring the formation of new memory traces, the fading of less-important traces, and the formation of new associations based on discovered similarities. Cognitively-inspired information systems should therefore take into account the observed persistence of items. With the addition of modality-specific attribute types to our framework, persistence is applicable to the temporal and sensorial coding of event patterns. The situational principle of temporal cueing encodes temporal patterns of observed behaviours which can then be used to trigger reminders, while sensorial cueing triggers reactions based on real-world stimuli captured by sensors. An event is then defined as the context of interaction with an information object that encodes combinations of temporal and sensory data in an event convergence. At a functional level, the dominant cognitive theory of episodic memory is encoding specificity (Tulving and Thomson, 1973), in which memories of events are better retrieved if the cues that were present as an event was encoded are also present at retrieval. This description has formed the basis of several computational memory models, such as the Search of Associative Memory (SAM; Raaijmakers and Shiffrin, 1981): given a set of attributes as cues, both semantic memory images and episodic memories of events can be retrieved. Although instance theories of memory assume that each encounter with an object is stored individually (Medin and Schaffer, 1978; Hintzman, 1984; Nosofsky, 1984), and that cues are matched against all of these images in real time for a best fit, such a large and detailed amount of information would overwhelm an information management system.
Instead, some form of summarization is necessary. With reference to human memory models, the approach should be more constructive (e.g., Kintsch, 1974) than reconstructive (e.g., Loftus and Palmer, 1974): storing trends that emerge from the details, but not the details themselves. Human memory tends to integrate the details of experience, particularly those of repeated, similar experiences, as efficiently-stored summary heuristics for rapid reasoning (Bransford and Franks, 1971). Once these patterns are encoded, they can be used to trigger prospective episodic memory, which cues recall of intended tasks at particular times and under particular circumstances. Three types of prospective memory have been identified (Einstein and McDaniel, 1990): time-based, which refers to actions that will occur at a particular time of day, such as taking medicine at 0900h; event-based, such as returning an item the next time you see its owner; and activity-based, such as remembering to return a book the next time you go to the library. Event-based and activity-based prospective memory are relatively straightforward: if the appropriate contextual cues are strongly enough stimulated, then memories of appropriate actions will gain activation and rise to consciousness. Time-based prospective memory is more problematic: although episodic memory models describe the retrieval of specific snapshots, the explicit coding of temporal information in episodic memory has been little studied in cognitive science (Tranel and Jones, 2006). However, the contexts of time and place are important for information retrieval. The situational principles capture this context as follows:

Persistence — Human memory emphasizes events that are recently, consistently, and concurrently experienced. Items that appear regularly, those that show some degree of persistence, represent the key factors of our environment.
Persistence is closely bound to the phenomenon of human forgetting: items that rarely appear in the environment are considered less useful, and memories for them are less highly activated, than items experienced more recently and regularly (Anderson and Schooler, 1991). A bias for items that regularly co-occur is also useful: Hebbian learning describes the cognitive process whereby two items that appear together consistently become associated regardless of their similarities; retrieving one then automatically retrieves the other (Hebb, 1949). Persistence appears to reflect a deeply ingrained aspect of human intelligence: even the extremely young are good at learning the statistical regularities of their environment, even when events are not semantically related (Munakata, 2004). An information management system that models persistence should similarly encode such regularities. Persistence records the degree to which objects have been used, and used together. Encoding the prominence of an individual object is achieved simply by assigning the object a weight; the greater the value, the more that object is used. As the object is ignored in preference for other objects, its weight falls. Thus over time, persistence weights indicate the most useful and important objects in a corpus, as some must rise in activation over others that are "pushed further back" as their usage diminishes. The co-occurrence of objects is similarly encoded: when two items are used together, the system should infer provisionally that there is some meaningful, unseen relation between them, even if they are not otherwise related. A weight is assigned to the pair of co-occurring objects, and as objects continue to be used at the same time, the weight between them strengthens, until eventually as one is retrieved the other will be as well.
However, if any one object is used without its peers, then the association between them weakens, and an association may disappear altogether if the correlation that it represents is not repeated. Co-occurrence and similarity are fundamentally different. Two objects will remain similar to the extent that they share attributes. How objects are used is something more fluid: there is a categorical difference between items related by overt similarities (such as cats and dogs), and items related by co-occurrence (such as leashes and dogs). Similarity relations allow the retrieval of declarative knowledge, while co-occurrence relations reflect events experienced over time—how objects have been visited and combined, and what trails have been followed through an information space. Objects clustered based on usage indicate a context of activity, as per the ad hoc categories of Barsalou (1983). Thus P-MAK uses two fundamental types of association: those that represent the (relatively static) similarity relations between objects, and those that represent the dynamic juxtaposition of objects and cues. The principle of persistence is the basis for all adaptive learning in the P-MAK framework: temporal and sensorial cueing also use dynamic association weights to represent how objects persist and co-occur with time and stimulus.

Temporal Cueing — Humans encode the temporal regularities of past experiences. Temporal cueing describes the human tendency to retain images in episodic memory of when events are experienced, for instance "every morning" or "in springtime". Episodic memory often includes defining details and context, along with a subjective impression of when the memory was formed (such as minutes, days, or years ago).
Human awareness of time is built-in at the cellular level: the suprachiasmatic nucleus is a cell assembly at the base of the hypothalamus comprising some 10,000 neurons; it triggers the daily secretion of melatonin that induces sleep (Yamaguchi et al., 2003). This sensitivity to time—common among living things—appears to be based on the natural periods to which humans are exposed, such as heartbeats, daylight, lunar phases, and seasonal variations. New temporal information is encoded by adjusting neuronal time delays to particular temporal combinations of inputs (Cariani, 2001); the neocortical microcircuit appears to maintain a virtual continuum of timescales, with time constants ranging from milliseconds to years (Denham and Tarassenko, 2003). Encoding specificity can be used to infer the recurrence of events in temporal terms. Memories of one-time, unremarkable incidents are less useful than memory for events that have repeated, ongoing significance in our lives. Such temporal trend-based encoding is enormously useful in information management: it extends retrieval beyond basic similarity judgments to enable users to ask such questions as, "when do I usually do X?", or "what do I regularly do at time T?" Although this seems like a useful function, we know of no information management systems purposely designed to answer these sorts of questions. Such time-based indexing of events requires the definition of temporal attributes. These should not be confused with time-related terms used as semantic keywords. For example, a document that contains repeated references to a particular temporal epoch (e.g., Monday) may include it as an attribute, but this says nothing about how the document itself has been used. Rather, an implementation of temporal cueing must provide a priori a scale of permanent, weighted time-unit attributes (or rather cues) appropriate to the expected time scales that the system will encounter: for example the minutes, hours, days, etc.
of the calendar. Temporal cues representing particular times become associated with information objects if the objects are used at those times. As with persistence, temporal cues gain activation when their corresponding information objects are used, and the association weights between the cues and each object grow stronger. If the objects are not used at the expected time, the cues and associations will weaken. The strongest cues then represent the times that see the most activity, and the strongest associations indicate which objects are used most, and when.

Sensorial Cueing — Humans are sensitive to their surroundings, and react appropriately in the presence of contextually relevant stimuli. The use of sensors is related to the epistemic principle of perception. Perception can occur both abstractly, such as when "sensing" the word content of a document, and physically in quantifying motion, temperature, illumination, etc. In this case however, the "classifier" is the array of sensors connected to the system. Sensors act as inputs to a probabilistic decision process. Given a particular set of stimuli, each sensor determines to what degree its conditions have been met; its activation is then proportional to the strength of the stimulus. Sensor correlations are learned by associating objects and sensors if the objects are used while the sensors are active. With an appropriate combination of sensors an information system can learn correlations in the environment, enabling queries such as, "under what conditions is object O used?" and "which objects are used under condition C?" As with temporal cueing, sensorial cueing is based on the principle of persistence in its use of dynamic weights and associations. The more a sensor is stimulated, the stronger its weight becomes. The more an object is used while a sensor is stimulated, the stronger the association between them becomes.
A sensor's weight diminishes if it is idle while other sensors are active, and if an object is used while an associated sensor is idle, the association between them weakens and may eventually disappear. The strongest sensors represent the dominant actions in the environment, and the strongest associations indicate objects that are used most in particular contexts. Sensorial cueing is critically dependent on the choice of sensors, which may be too specific or lack sensitivity. If important properties can be identified in advance, a sensor with perfect alignment can be designed to pick out the desired attribute cheaply and without fail. Perfect alignment requires no intelligence: it will respond immediately to a stimulus and trigger a reflex (as per Skinner, 1977). An inappropriately chosen or poorly-aligned set of sensors may lead to perceptual biases that result in concept blindness, where important themes go undetected due to gaps in the sensor array. A system may diagnose such problems, for example by detecting events that are registered in the absence of a consistent stimulus pattern, and may then assume that its domain has been too narrowly specified. If greater generality of domain is assumed, then more intelligence will be required to learn a pattern of activations across a set of noisy sensors to uncover new relationships and to maintain reliably correct classifications, and the necessary computations may not be bounded by time. Expert systems reduce this burden by functioning in a well-described restricted domain, where efficient algorithms can be written given prior knowledge and assumptions about the data (Thornton, 2000).

Event Convergence — Humans experience an event as a discrete entity comprised of a dynamic set of cues. An event represents an occurrence in the world, specifically an encounter with information objects at the specified time(s) and under the specified condition(s).
Although temporal and sensorial cues may be associated individually and directly with an object, when more than one cue is used to describe a compound context they must be represented as a conjunctive set to avoid ambiguity. An "event" is precisely this conjunctive set of some combination of both temporal and sensorial cues. A single event representing multiple cues may then be associated with one or more information objects that occur in that context, and the relative typicality of the pattern's components is reflected in their individual weights. For example, both weather-report and bus-schedule objects may be consulted according to the conjunctive set "every Tuesday and Thursday when it's cloudy." This we call the cue-event-object (CEO) model: temporal and sensorial cues are combined in an event that mimics cognitive convergence zones, which similarly synchronize perceptions and concepts (Moll et al., 1994; Amedi et al., 2005). In the P-MAK framework, information objects take the place of concepts. Event convergence is related to the principles of associationism in its dynamic relation of disparate elements, abstraction in representing dynamic situations as discrete events, and quantization in summarizing an entity from a set of cues. Concept drift can then be easily simulated with the CEO model in the associations between each cue and the event, and between the event and its objects. These associations strengthen or weaken as objects conform to or deviate from the event's pattern. An event's conjunctive set can fragment as necessary: if a user's schedule changes, then the event associated with the bus-schedule and weather objects can migrate to just "every Tuesday when it's cloudy". The determining factor is the strength of the event's associations with its component time units.
An event may be represented as rare by its weak association strengths, but it will remain stable as long as all its cue associations are uniformly supported within some tolerance. Reminders of upcoming, contextually relevant events are modeled on human prospective episodic memory, based on the notion that a structure's past usage can predict its current usage (Anderson, 1989; Anderson and Schooler, 1991). Prospective memory has two principal components: cue identification that recognizes the appropriate context, and intention retrieval of the appropriate reaction to the cues (Simons et al., 2006). In our model, cue identification is performed by stimulating the timers and sensors of current conditions, and if all the components associated with an event are active, intention retrieval is performed by spreading activation to objects associated with the event. If the retrieved objects are used following retrieval, the associations become stronger; otherwise they weaken. Thus objects are retrieved at appropriate times by temporal cueing, mimicking time-based prospective memory. Objects are also retrieved under appropriate conditions by sensorial cueing, mimicking event- and action-based prospective memory. In the sense that objects that have been previously used in particular contexts can be retrieved automatically when the same conditions recur, the CEO model is similar to models of stimulus-response learning (e.g., Skinner, 1977). In the sense that if certain preconditions are observed, then certain responses must follow, the model's timers and sensors also act as inputs to production rules represented as events (cf. Taatgen et al., 2006).
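The cue-identification and intention-retrieval cycle just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the P-MAK specification: the class name, the initial link weight of 0.5, and the learning and decay rates are all assumed values.

```python
LEARN, DECAY = 0.1, 0.05   # assumed learning and decay rates

class Event:
    """A conjunctive set of cues, linked by weights to information objects."""
    def __init__(self, cues, objects):
        self.cues = set(cues)                     # conjunctive cue set
        self.weights = {o: 0.5 for o in objects}  # event-to-object link weights

def retrieve(active_cues, events):
    """Cue identification plus intention retrieval: an event fires only if
    its entire conjunctive cue set is active; activation then spreads to
    the objects it is linked to."""
    recalled = []
    for event in events:
        if event.cues <= active_cues:      # all of the event's cues present
            recalled.extend(event.weights)
    return recalled

def reinforce(event, used_objects):
    """Strengthen links to objects used after recall; decay the rest.
    Weights stay bounded in [0, 1]."""
    for obj in event.weights:
        if obj in used_objects:
            event.weights[obj] = min(1.0, event.weights[obj] + LEARN)
        else:
            event.weights[obj] = max(0.0, event.weights[obj] - DECAY)

# The thesis example: weather-report and bus-schedule objects consulted
# "every Tuesday when it's cloudy".
ev = Event({"tuesday", "cloudy"}, ["bus-schedule", "weather-report"])
print(retrieve({"tuesday", "cloudy"}, [ev]))  # both objects recalled
print(retrieve({"tuesday"}, [ev]))            # conjunction unmet: nothing
reinforce(ev, {"bus-schedule"})               # weather report was ignored
```

In a fuller implementation the recalled objects would be ranked by their activation rather than returned as a flat list, and the cue-to-event links would carry their own adaptive weights as well.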
Although rare, information management systems that use temporal and sensorial cues as triggers include the experimental CybreMinder (Dey and Abowd, 2000) that generates reminders when user-specified temporal and situational conditions are satisfied, and the wearable Forget-Me-Not (Lamming and Flynn, 1994) that records a user's wanderings and interactions for later analysis.

2.6 Associative Network Representation

We have referred repeatedly to "associative knowledge structures" without committing to any particular representation. Here we commit to a graph-based network design; we believe this to be the best representation, although others are possible. For example, relations between objects could be represented in a matrix, with similarity scores in the cells indexed by object identifier labels. This representation is fine if the pattern of relationships is dense within a corpus: a large proportion of cells will then be filled. But there is evidence that semantic knowledge structures tend to be sparse, and may generally exhibit small-world properties (Barabasi, 2002; Steyvers and Tenenbaum, 2005), as found in patterns of inter-word relations within languages (Motter et al., 2002; Sigman and Cecchi, 2002). In this regard, matrices do not appear to be the best choice for representation since resources would be wasted on unused cells, violating the principle of parsimony. Since matrix-based representations cannot accommodate new relation types without the addition of new tables, their inflexibility and rapid growth also violate plasticity and scalability.

2.6.1 The Advantage of Networks

Networks on the other hand seem ideal in many ways. They are the most parsimonious associative knowledge structure, since they only use resources to represent what is actually there.[8] Using the rules of graph theory (see e.g., Harary, 1969) networks are simple to define: each object appears as a node, with nodes linked by edges if they are related.
Relationships of different kinds may be represented by typed links. Networks are easy to reconfigure and may even be nested inside the nodes of other networks to create more complex structures, such as compound concepts and dynamically organized categories. Networks are also inherently graphical: they are straightforward to draw, and people often use network diagrams to clarify and communicate their ideas. Humans are good at wayfinding, and since networks are navigable from node to node along links, they encourage explorative browsing to supplement more typical query-based searches. As we've seen, networks are also popular in associative models of human knowledge and memory.[9] Although networks are de facto the most flexible means of organizing data, they are also potentially costly since any node may be directly connected to any other: a fully connected network of n nodes would have (n² − n)/2 links connecting them—implying a quadratic increase in links as new nodes are added. However, if a domain exhibits a small-world distribution in the associations between its objects, an explosion of links will be averted by emphasizing local connections. There are other possible benefits. A network's small-world parameters could be monitored to maintain its navigability (Kleinberg, 2000); a skewed distribution of similarity links would indicate a poorly tuned classifier, since the classifier extracts the attributes that determine connections. Maintaining a network's small-world property would be important for avoiding phase transitions that cause dramatic non-optimal changes in behaviour as networks grow in size, such as a dramatic expansion of the number of nodes reached by the spreading-activation "event horizon" (Shrager et al., 1987). As such the small-world property is related to the principle of scalability.
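The quadratic link growth noted above is easy to make concrete. The sketch below compares full connectivity with the sparse connectivity typical of small-world networks; the average degree k = 10 is an arbitrary assumed value, chosen only for illustration.

```python
def full_links(n):
    """Links in a fully connected network of n nodes: (n^2 - n) / 2."""
    return (n * n - n) // 2

def sparse_links(n, k=10):
    """Links when each node keeps roughly k local neighbours: n * k / 2."""
    return n * k // 2

for n in (100, 1_000, 10_000):
    print(n, full_links(n), sparse_links(n))
# Full connectivity grows quadratically with n; fixed-degree sparsity
# grows only linearly, which is what keeps large networks tractable.
```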
But despite these potential advantages, information management systems do not yet test for or exploit small-world properties in their domains (Perugini et al., 2004). A network-type topology offers further benefits through graph-theory analyses. Highly connected authority (or hub) nodes indicate semantic trends by their many links. In a cognitive sense authorities are like concepts that come readily to mind, since they are related to so many things, while from the point of view of graph theory, the distance from authorities to other members of their group is on average a minimum (Adamic, 1999). As with landmarks in human wayfinding, authority nodes may act as meaningful entry points to the network. Meanwhile, clusters of nodes that are closely connected represent semantic trends; they can be organized into a clique graph (Harary, 1969) that summarizes clusters into semantic "neighbourhoods" connected by "highways". Cut-points in the clique graph indicate natural divisions between larger sections of the network. Topology can also serve as a diagnostic metric: for instance a large number of unconnected nodes could indicate a poorly tuned or inappropriate classifier. Conversely, if too many nodes are clustered by the same set of attributes, the cluster could be subdivided into smaller distinct groups by increasing the sensitivity of the classifier until additional differentiating attributes emerge.

[8] For example, as Bayes nets provide a compact representation of joint probability distributions, so the network representation of P-MAK provides a compact representation of semantic and contextual relations.

[9] "Network thinking is poised to invade all domains of human activity and most fields of human inquiry. It is more than another helpful perspective or tool. Networks are by their very nature the fabric of most complex systems, and nodes and links deeply infuse all strategies aimed at approaching our interlocked universe." (Barabasi 2002, p. 222)
Increasing classifier sensitivity is equivalent to the mental processes that provide humans with a sense of distinctiveness: the more we know about something, the less it seems like other things (e.g., Rabinowitz and Andrews, 1973).

2.6.2 Basic Network Elements

In the P-MAK framework, different types of nodes and links are used to represent different types of elements. There are two types of nodes:

• semantic nodes represent the meaningful entities of the world: objects, ideas, and events. They are defined by the descriptive attributes that they contain.
  - object nodes represent an individual data object, such as a document or image.
  - event nodes represent events in the environment, such as a user's behaviour or the stimulation of environmental sensors.
  - action nodes are activated by the system to trigger an effect in the environment, such as sounding a notice or dispensing medication.
• index nodes encode the situations in which entities described by semantic nodes occur. In particular:
  - temporal nodes index the particular times at which events occur, and contain only a time-stamp.
  - sensor nodes index the observable circumstances under which events occur.
  - conjunctive nodes combine two or more temporal or sensor nodes into a more specific context representation.

Although research into semantic networks has identified many different link types, mostly with respect to linguistic relationships between words (e.g., Woods, 1975), for fundamental information management little more is required than bi-directional edges that represent strength of relatedness. For the basic operations of information management, the P-MAK framework uses two types of relatedness:

• similarity links represent the degree to which two nodes share attributes. The more attributes they share, the more they represent similar things.
• usage links represent the degree to which two nodes are activated at the same time.

With these node and link types, we are able to represent knowledge in several useful ways: by the similarity between objects, by the objects with which they are typically used, and by the temporal and environmental context in which objects occur.

Figure 2.2: An example of a simple similarity network. Each node is defined by discrete attributes, here represented as letters in lists. Nodes are linked to the degree that they share attributes. Thus object nodes n0, n1, and n2 are all equally connected for sharing the same two attributes; n0 and n3 are more strongly connected for sharing three. Node n4 is unconnected—the one attribute that it shares with another node is not enough to overcome the similarity function's weight threshold; such tuning of a similarity measure can prevent nodes from being linked when their relationship is too weak, and be used to preserve a large semantic network's small-world property.

2.6.3 Networks for Similarity, Usage, and Situations

Three network types conform to the basic operations of information management and P-MAK's principles, using the node and link elements just described. The similarity network is used to search for items based on their content; as items are retrieved, similar items can be found by navigating to them along connected links. Apart from semantics, items can also be retrieved from a usage network that links items to the degree that they co-occur, and from a situational network that indexes items by when and under what conditions they are activated.
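The thresholded attribute-sharing that drives similarity links (as in Figure 2.2) can be sketched as follows. The documents, attribute sets, and threshold value here are all hypothetical, chosen only to illustrate the mechanism:

```python
from itertools import combinations

THRESHOLD = 2  # minimum shared attributes for a link (assumed value)

# Hypothetical object nodes, each defined by a set of discrete attributes.
nodes = {
    "doc0": {"cat", "dog", "leash", "vet"},
    "doc1": {"cat", "dog", "fish"},
    "doc2": {"fish", "boat", "net"},
}

# Link each pair of nodes whose shared-attribute count clears the
# threshold; the count itself serves as the similarity link's weight.
links = {}
for a, b in combinations(nodes, 2):
    shared = len(nodes[a] & nodes[b])
    if shared >= THRESHOLD:
        links[(a, b)] = shared

print(links)  # only doc0 and doc1 share enough attributes to be linked
```

Here doc1 and doc2 share one attribute ("fish"), which falls below the threshold, mirroring the unconnected node of Figure 2.2; raising or lowering the threshold tunes the network's sparsity.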
Similarity Networks — As the epistemic principles are used to induct a corpus, objects (such as documents) are represented by object nodes created according to the principle of perception, such that each node contains a list of attributes that describe its semantics.[10] A similarity network is formed as nodes are connected by weighted similarity links—according to the similarity principle, nodes are more strongly linked the more that they share attributes and thus represent similar things. The calculated similarity value can be used as the link's weight; pairs of nodes with low similarity scores would not be linked. The network thus organizes and stores the results of the classification process in the pattern of connections between nodes (Figure 2.2); relations do not then have to be recalculated in real time. There is strong evidence from brain scans (Habib et al., 2003) that semantic and episodic memory are largely processed in different parts of the brain. It therefore seems reasonable to disentangle episodic and similarity information into separate representations that reference the semantic object nodes separately.

[10] Attributes could also be represented as nodes, as in (Jones, 1986) and our use of index nodes for cues, but for illustrative purposes we use a simpler formulation of nodes with self-contained semantics here; a simple device such as an inverted index (Salton and McGill, 1983) can then be used to retrieve all nodes that contain a given attribute.

Usage Networks — As items represented by existing object nodes are used concurrently and consistently over time, a new usage link can be created to connect them, based on the principle of persistence. The more the two objects occur together, the stronger the link grows; otherwise it decays and disappears. Once linked, spreading activation can be used to activate nodes and retrieve objects that tend to co-occur, whether or not they are semantically similar.
The combination of all usage links forms a usage network. To adhere to the principles of parsimony and scalability, activation values used for information management should be finite and bounded, so that as the activation of some links and nodes in the usage network rises, normalization places limits on their activation, and decays other unused elements in a process analogous to human forgetting. Such control of activation is useful for information systems: nodes with high activation have a demonstrated utility that makes them likely candidates for re-use (Anderson and Schooler, 1991). By contrast, nodes with a long-term activation approaching zero have little demonstrated utility. Depending on the information-management protocol, these weak nodes and the entity that they represent could be deleted, or removed from active indexing and moved to long-term storage, freeing up real-time resources for more important, recently activated items. Similarity and co-occurrence links are grouped into separate networks to avoid entanglement: usage should not affect similarity valuations, since the same object may be used in different ways, in different contexts, and by different people.[11] While the weights of usage links change with every user interaction, similarity links are virtually static: their weights are not recalculated unless the classifier's similarity metric changes, or unless an object's keywords are edited or expanded.

Situational Networks — Situational networks simulate human episodic memory by describing the times and conditions where objects are used. Parallel episodic memory models (Medin and Schaffer, 1978; Raaijmakers and Shiffrin, 1981; Hintzman, 1984; Nosofsky, 1984; Miikkulainen, 1992) would be impractical to implement on a serial machine, violating the mechanistic principle of parsimony; this motivates our use of networks to store relationships that would otherwise be expensive to recompute.
Situational networks encode the occurrence of events using both temporal and sensorial index nodes representing timers and sensors external to the system. Once an event's temporal and environmental patterns have been encoded, the network structure can be queried, reminders can be generated, and actuators can be triggered. An event node represents a real-world occurrence that involves an information object, such as a document retrieval, and may contain attributes that describe the occurrence. Systems are programmed a priori to respond to a finite number of input types, each of which can be described by attributes, and thus systems can assign attributes to events as they occur; users could edit these and also add their own 'tags' as attributes. Events can then be clustered by their similarities. An event node is created the first time a particular event type co-occurs with a given information object. Following the CEO model, the index nodes of timers and sensors that correspond to the event are linked to a new conjunctive node that is then linked to the event node. The event node is then linked to the information-object node.

[11] Nonetheless, the semantic and usage networks can work together to recover, say, all similar items that also tend to be used together.

Figure 2.3: An example of the cue-event-object model implemented as a network. Temporal and sensory index nodes typify observed events. Grey rectangles represent time and sensor cue nodes; white rectangles are conjunctive cue nodes. In this example, event node e0 is associated with usage on Wednesdays at 1100h, when objects n0 and n2 are used. Object n1 is also used on Wednesdays at 1100h, but only when the at-work sensor is active; n1 is also used every day at 0900h, regardless of location. The network can be used to retrieve objects automatically by spreading activation from index nodes when their temporal and sensorial conditions are repeated.
As the event re-occurs at other times, more conjunctive nodes are added to describe unique circumstances. Where the units of two conjunctive nodes agree, those conjunctive nodes may be aggregated to produce a more compact representation. For example, an event that occurs every day at 0900h can be reduced to a single time unit, and connected directly to the temporal node that represents 0900h. A conjunctive node's well-formed formulas can also include disjunctions such as 0900h ∧ (Mon ∨ Wed) to represent partial adherence to a particular scale, in this case the days of the week. As events re-occur at the time specified by a pattern, the weights of the nodes involved increase asymptotically, as do the weights of the links that connect them. Thus the weights indicate the frequency with which a pattern as a whole and each of its components are true. Events must receive support in order to persist, otherwise they are "forgotten": if an event does not re-occur as expected, then the weights of its corresponding elements decay. If an event pattern becomes only partially supported, then a conjunctive node is disaggregated as the entropy of its link weights exceeds a threshold, fragmenting a single pattern into two or more with different levels of support. Together, the processes of aggregation and disaggregation are essential for modeling concept drift. Like human memory, the situation network encodes unique events, but also integrates similar events into a gist-like summary. Once event patterns have been encoded into the situation network, they can be queried to determine when events tend to occur, what combinations of sensors tend to be engaged, and what times are busiest, by examining the stronger weights in the network. For specific queries, activation flows from objects through event nodes to index nodes to find out when they occur, and activation flows from time or sensor nodes through conjunctive nodes (if any) and event nodes to determine what occurred.
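A conjunctive node's well-formed formula, such as 0900h ∧ (Mon ∨ Wed), amounts to a conjunction of disjunctions over active cues, and can be evaluated with a few lines of Python. The representation below (a list of alternative-sets) is an assumption made for illustration; the thesis does not prescribe a concrete encoding:

```python
def conjunct_active(formula, active_cues):
    """formula is a list of sets: each set is a disjunction of alternative
    cues, and the list as a whole is their conjunction. The conjunctive
    node fires only if every disjunction has at least one active cue."""
    return all(bool(active_cues & alternatives) for alternatives in formula)

# 0900h AND (Mon OR Wed)
pattern = [{"0900h"}, {"Mon", "Wed"}]
print(conjunct_active(pattern, {"0900h", "Wed"}))  # every disjunction met
print(conjunct_active(pattern, {"0900h", "Tue"}))  # day disjunction unmet
```

Disaggregation would then correspond to splitting one alternative-set into separate formulas (e.g., one for Mon, one for Wed) when their support diverges.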
Figure 2.4: Index nodes for triggering an action node. In this case, medicine will not only be dispensed at given times, but also if biomedical sensors detect a potential problem. If the user's blood pressure rises above a certain level and their temperature falls, dispensation is triggered by activation flowing through the bp1 conjunctive node and the event node. Sensor and temporal index nodes can also be combined: dispensation will also occur if blood pressure is high at 0900h. The event node can also be linked to the object nodes of documents that describe the medicine's properties and dosage. Each of the nodes can be queried to determine its role with respect to other nodes, following connected links to ask such questions as what happens at 0900h, when medicines should be taken, and which medicines are available for dispensation.

Queries can also be automated to act as reminders: as time moves forward, index nodes corresponding to the current time and stimuli are activated, and the activation spreads to connected conjunctive nodes and events. Conjunctive nodes whose input links (including any sensors) are all activated will pass activation on to connected events. The events activate any objects that correspond, retrieving them in the appropriate context. Support is then increased for objects that are used after recall, while it decays for those that are ignored. Users may set their own reminders by connecting events to desired conditions. Here high link weights act as alarms by guaranteeing forceful retrieval, and users can also program the system to produce real-world effects by setting their own patterns and connecting them to action nodes that trigger actuators (Figure 2.4). Although we are unaware of previous work that uses associative networks and spreading activation for temporal encoding, symbolic networks have been used in some models that include sensory processes.
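The reminder mechanism reduces to a simple check: a conjunctive node passes activation only when every one of its input cues is currently active. The sketch below illustrates this; the cue and object names are hypothetical.

```python
# Sketch: a conjunctive node fires only when every one of its input cues
# (timers and sensors) is active; firing retrieves the linked objects.
# Cue and node names are illustrative, not from the thesis implementation.

def fire_reminders(conjunctions, active_cues):
    """Return the objects retrieved by fully satisfied conjunctive nodes."""
    retrieved = []
    for inputs, objects in conjunctions:
        if set(inputs) <= active_cues:   # all input links are activated
            retrieved.extend(objects)
    return retrieved

conjunctions = [
    ({"time:0900", "sensor:bp-high"}, ["dispense-medicine"]),
    ({"time:Wed@1100", "sensor:at-work"}, ["n2"]),
]

print(fire_reminders(conjunctions, {"time:0900", "sensor:bp-high"}))
print(fire_reminders(conjunctions, {"time:0900"}))  # partial match: nothing fires
```

The all-or-nothing test on a conjunctive node's inputs is what distinguishes it from a plain index node, which would pass activation on any match.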
The perceptual maps of Convergence-Zone Episodic Memory (Moll et al., 1994) use a handful of perceptual stimuli to retrieve full episodic memories, and the Memory Extender (Jones, 1986) uses "context" nodes that serve a sensory role by biasing node activation according to changing circumstances. Adding sensors to such models would allow the embodiment of associative networks in a machine, with potential applications in robotics and contextual computing.

2.7 Conclusion and Future Work

P-MAK describes a framework at the intersection of cognitive science and computing that fulfills the basic operations of information management—specifically the collection, identification, classification, and retrieval of meaningful objects. These processes are analogous to the cognitive memory faculties of learning and recall, while for information systems these processes are the foundations of machine learning and information retrieval. Using cognitive properties in the design of information systems provides users with increased familiarity of function, which leads to improved usability. The assertion is that the functional processes of existing highly refined and powerful biological systems—such as the brain—may suggest new and more efficient models of human-machine interaction. Based on the constraints and properties of memory and machines, and on the basic operations of information management, P-MAK inducts information as discrete objects that are described by discrete attributes; the objects are associated to the degree that they share attributes, and their use is recorded as events indexed by context. Objects are thus described both by internal attributes and external cues. Although P-MAK's principles are each already well-known, their assembly into an information-management framework is novel. An associative knowledge structure is most simply and flexibly represented as an associative semantic network.
Networks are a cognitively plausible representation, forming the basis of popular cognitive memory models; they are also well-understood with respect to exploitable topological properties—such as clustering, hubs, and the small-world property—that appear in various real-world semantic domains. A network representation is also compatible with neurophysiological models that equate neural cell assemblies with semantic nodes, and neural synchronization pathways with links that represent strength of similarity. P-MAK's symbolic associative model avoids the computational complexity of a fully parallel cognitive model by storing similarity evaluations as links for rapid retrieval. Once similarity associations are stored, an object's nearest neighbours can be immediately retrieved; all retrieval in P-MAK occurs using a constrained spreading-activation model. P-MAK's network can in principle accommodate other association types; as an illustrative example, a second link type is introduced that encodes Hebbian-type learning due to usage. Objects may also be indexed by temporal and sensorial index nodes that register when an object is used and under what circumstances. Objects can then be retrieved automatically at a particular time, or in the presence of particular stimuli. Together, this defines a set of principles for information management that responds in a plausible, comprehensible way to changing conditions, derived from well-known cognitive models and usefully applied to automated information management. P-MAK is widely applicable and domain-independent.

Implications

P-MAK's generality suggests that it can be extended into epistemology, dynamic behaviour, data visualization, and applications.
Further ideas from cognitive science could be applied to abstraction —

• Any subset of interest, such as categories implicit in node clusters, could be represented by a generated meta-node, as per the prototypes of Posner and Keele (1970), or by an existing centroid node, as per the exemplars of Brooks (1978). More generally, an ad hoc context—such as personalized starting points for navigation, or the set of nodes retrieved by a query—could be saved for re-use in a template of relevant node and link activations, as per schema theory (e.g., Mandler, 1984).

• To make high-dimensional semantic spaces more comprehensible, the network could be summarized with the cognitive abbreviations found in human wayfinding, such as landmarks, paths, districts, and boundaries (Vinson, 1999).

• P-MAK could be expanded to include rule-mining and rule inference, nodes that act as processors, and a wider assortment of links, as found in the semantic networks appropriate to expert systems (see Sowa, 1991); this implies in due course a graph-based programming language.

Several possibilities relate to learning. Based on the context of a user's interests and actions —

• Semantic link strengths could vary dynamically, reflecting the notion that meaning is not abstractly fixed, but contextually dependent on the link pattern among a number of features (Lakoff, 1987).

• P-MAK's perceptual classifier could adapt to the composition of a corpus by increasing its sensitivity within subtopics; it could similarly adjust to the drifting bias of a corpus as items are added.

• Rates of link and node decay could vary to reflect greater utility and persistence.

• To index popular subjects, a dynamic usage map could be generated as a minimum spanning tree of often-used nodes and link pathways (cf. the semantic index of the retrieval principle).
• To extend P-MAK from personal to group information management (GIM)12, users could be matched by similar interests to share usage maps; this points to possibilities in collaborative filtering and resource administration.

Some improvements are fundamental —

• An explicit activation strategy could define—for different circumstances—how activation spreads between nodes, how its strength is calculated, how far it should reach, and how forgetting is modeled.

• To refine semantics, attributes could be represented as nodes related through networks of their own, both within individual nodes and across an entire corpus.

P-MAK-based systems may be implemented on devices of various types. Its temporal and sensorial cueing are useful for context-sensitive portable devices and ubiquitous computing; for example, with the inclusion of metabolic and affective sensors, it could support medical applications and critical tasks that demand concentrated user attention. The cognitive plausibility of P-MAK's various components suggests that it could be applied to cognitive modeling simulations as well as to information systems, as an empirically based testbed. As a personal assistant, if run continuously, a P-MAK-based system can reflect ongoing trends, and as such form the basis of a prosthetic cybernetic system, a true memory extender.

12And similarly, Computer-Supported Collaborative Work (CSCW).

Bibliography

Adamic, L. A. (1999). The small world web. In Abiteboul, S. and Vercoustre, A., editors, Proceedings of the European Conference on Digital Libraries (ECDL99), volume 1696 of Lecture Notes in Computer Science, pages 443-452. Springer-Verlag: Paris, France.

Albert, R., Jeong, H., and Barabasi, A.-L. (1999). Internet: Diameter of the world-wide web. Nature, 401(6749):130-131.

Amedi, A., von Kriegstein, K., van Atteveldt, N. M., Beauchamp, M. S., and Naumer, M. J. (2005). Functional imaging of human crossmodal identification and object recognition.
Experimental Brain Research, 166(3-4):559-572.

Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3):261-295.

Anderson, J. R. (1989). A Rational Analysis of Human Memory. In Roediger, H. and Craik, F., editors, Varieties of Memory and Consciousness: Essays in Honor of Endel Tulving, chapter 11, pages 195-210. Lawrence Erlbaum Associates: Hillsdale, NJ.

Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4):1036-1060.

Anderson, J. R. and Bower, G. H. (1973). Human Associative Memory. V.H. Winston: Washington, DC.

Anderson, J. R. and Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2(6):396-408.

Autonomy (2006). Autonomy Corporation plc (LSE: AU). Corporate home page accessed on the World Wide Web; Retrieved August 31, 2006, from

Baars, B. J. (1993). How does a serial, integrated and very limited stream of consciousness emerge from a nervous system that is mostly unconscious, distributed, parallel and of enormous capacity? CIBA Foundation Symposium, 174:282-290.

Baddeley, A. D. and Hitch, G. (1974). Working Memory. In Bower, G., editor, The Psychology of Learning and Motivation, volume 8, pages 47-89. Academic Press: New York.

Baecker, R., Grudin, J., Buxton, W., and Greenberg, S. (1995). Readings in Human-Computer Interaction, 2nd edition. Morgan Kaufmann Series in Interactive Technologies. Morgan Kaufmann: San Francisco, CA.

Bahrick, H. P. (1984). Semantic memory content in permastore. Journal of Experimental Psychology: General, 113(1):1-29.

Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Publishing: Cambridge, MA.

Barnard, K. and Forsyth, D. (2001). Learning the semantics of words and pictures. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 2, pages 408-415.

Barsalou, L. W.
(1983). Ad hoc categories. Memory & Cognition, 11(3):211-227.

Barsalou, L. W. and Sewell, D. R. (1984). Constructing representation of categories from different points of view. In Emory Cognition Project Report No.2. Emory University Press: Atlanta, GA.

Bertel, S., Obendorf, H., and Richter, K.-F. (2004). User-centered views and spatial concepts for navigation in information spaces. Technical Report SFB/TR 8 (Spatial Cognition), Transregional Collaborative Research Center, Universities of Bremen and Freiburg.

Blank, M. A. and Foss, D. J. (1978). Semantic facilitation and lexical access during sentence processing. Memory & Cognition, 6(6):644-652.

Bransford, J. D. and Franks, J. J. (1971). The abstraction of linguistic ideas. Cognitive Psychology, 2:331-350.

Bransford, J. D. and Johnson, M. K. (1973). Consideration of some problems of comprehension. In Chase, W., editor, Visual Information Processing, volume 2, pages 331-350. Academic Press: New York.

Brewer, W. F. and Treyens, J. C. (1981). Role of schemata in memory for places. Cognitive Psychology, 13:207-230.

Brooks, L. R. (1978). Nonanalytic concept formation and memory for instances. In Rosch, E. and Lloyd, B., editors, Cognition and Categorization, pages 170-211. Lawrence Erlbaum Associates: Hillsdale, NJ.

Burgess, C. and Lund, K. (2000). The Dynamics of Meaning in Memory. In Dietrich, E. and Markman, A., editors, Cognitive Dynamics: Conceptual and Representational Change in Humans and Machines, pages 117-156. Lawrence Erlbaum Associates: Hillsdale, NJ.

Bush, V. (1945). As we may think. Atlantic Monthly, 176(1):101-108.

Cariani, P. (2001). Symbols and dynamics in the brain. Biosystems, 60(1-3):59-83.

Chomsky, N. A. (1965). Aspects of the Theory of Syntax. MIT Press: Cambridge, MA.

Collins, A. M. and Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6):407-428.

Collins, A. M. and Quillian, M. R. (1969).
Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240-248.

Cowan, N. (2000). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87-185.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Denham, M. and Tarassenko, L. (2003). Sensory processing. Technical Report of the Foresight Cognitive Systems Project (Research Review), Office of Science and Technology, Department of Trade and Industry, London, UK.

Dey, A. K. and Abowd, G. D. (2000). CybreMinder: A Context-Aware System for Supporting Reminders. In HUC '00: Proceedings of the 2nd International Symposium on Handheld and Ubiquitous Computing, volume 1927 of Lecture Notes in Computer Science, pages 172-186. Springer-Verlag: London, UK.

Dumais, S. T., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R., and Robbins, D. C. (2003). Stuff I've Seen: a system for personal information retrieval and re-use. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72-79. ACM Press: New York, NY.

Einstein, G. O. and McDaniel, M. A. (1990). Normal aging and prospective memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(4):717-726.

Ekman, P. (1971). Universals and cultural differences in facial expressions of emotion. In Cole, J., editor, Nebraska Symposium on Motivation 1971, volume 19, pages 207-284. University of Nebraska Press: Lincoln, NE.

Fertig, S., Freeman, E., and Gelernter, D. (1996). Lifestreams: An alternative to the desktop metaphor. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '96), pages 410-414. ACM Press: New York, NY.

Fodor, J. A. (1975). The Language of Thought. Crowell: New York.
Foltz, P. W. (1991). Models of human memory and computer information retrieval: Similar approaches to similar problems. Technical Report 91-3, University of Colorado, Boulder, CO.

Frank, S. L., Koppen, M., Noordman, L. G. M., and Vonk, W. (2003). Modeling knowledge-based inferences in story comprehension. Cognitive Science, 27:875-910.

Gemmell, J., Bell, G., Lueder, R., Drucker, S., and Wong, C. (2002). MyLifeBits: Fulfilling the Memex Vision. In Proceedings of ACM Multimedia '02, pages 235-238. ACM Press: New York, NY.

Gillund, G. and Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91:1-67.

Goertzel, B. (1997). From Complexity to Creativity: Explorations in Evolutionary, Autopoietic, and Cognitive Dynamics. IFSR International Series on Systems Science and Engineering. Plenum Press: New York, NY.

Habib, R., Nyberg, L., and Tulving, E. (2003). Hemispheric Asymmetries of Memory: The HERA Model Revisited. Trends in Cognitive Sciences, 7(8):241-245.

Harary, F. (1969). Graph Theory. Addison-Wesley: Reading, MA.

Hebb, D. O. (1949). The Organization of Behavior. John Wiley: New York.

Hintzman, D. L. (1984). Minerva 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16(2):96-101.

Hoffman, R. R., Klein, G. A., and Laughery, K. R. (2002). The state of cognitive systems engineering. Intelligent Systems, 17(1):73-75.

Hunt, E. and Waller, D. (1999). Orientation and wayfinding: A review. Technical Report N00014-96-0380, Office of Naval Research, Arlington, VA.

Huyck, C. R. (2001). Cell Assemblies as an Intermediate Level Model of Cognition. In Wermter, S., Austin, J., and Willshaw, D., editors, Emergent Neural Computational Architectures Based on Neuroscience: Towards Neuroscience-Inspired Computing, volume 2036, pages 383-397. Springer-Verlag: New York, NY.

Jacoby, L. L. and Witherspoon, D. (1982). Remembering without awareness.
Canadian Journal of Psychology, 36:300-324.

Johnson, T. R. (1997). Control in ACT-R and Soar. In Shafto, M. and Langley, P., editors, Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, pages 343-348. Lawrence Erlbaum Associates: Hillsdale, NJ.

Jones, W. P. (1986). The Memory Extender Personal Filing System. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 298-305. ACM Press: New York, NY.

Kahneman, D. and Treisman, A. (1984). Changing views of attention and automaticity. In Parasuraman, R., Davies, D., and Beatty, J., editors, Varieties of Attention, pages 29-61. Academic Press: New York, NY.

Kintsch, W. (1974). The Representation of Meaning in Memory. Halsted Press: New York, NY.

Kleinberg, J. M. (2000). Navigation in a small world. Nature, 406:845.

Koller, D. and Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning (ML), pages 170-178.

Labov, W. (1973). The boundaries of words and their meaning. In Bailey, C. and Shuy, R., editors, New Ways of Analyzing Variation in English, volume 42, pages 340-373. Georgetown Press: Washington, DC.

Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press: Chicago, IL.

Lamming, M. and Flynn, M. (1994). "Forget-Me-Not"-Intimate Computing in Support of Human Memory. In Proceedings of FRIEND21 '94 International Symposium on Next Generation Human Interfaces, pages 1-9. Rank Xerox Research Center: Cambridge, UK.

Landauer, T. K. (2002). On the computational basis of learning and cognition: Arguments from LSA. In Ross, N., editor, The Psychology of Learning and Motivation, volume 41, chapter 13, pages 43-84. Academic Press: San Diego, CA.

Landauer, T. K. and Dumais, S. T. (1997).
A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Landauer, T. K., Laham, D., and Foltz, P. W. (1998). Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report. In Jordan, M., Kearns, M., and Solla, S., editors, Advances in Neural Information Processing Systems, chapter 10, pages 45-51. MIT Press: Cambridge, MA.

Lemaire, B. and Denhiere, G. (2004). Incremental construction of an associative network from a corpus. In Forbus, K., Gentner, D., and Regier, T., editors, Proceedings of the 26th Annual Meeting of the Cognitive Science Society, pages 825-830. Lawrence Erlbaum Associates: Mahwah, NJ.

Loftus, E. F. and Palmer, J. C. (1974). Reconstruction of automobile destruction: An example of the interaction between language and memory. Journal of Verbal Learning and Verbal Behavior, 13:585-589.

Mandler, J. M. (1984). Stories, Scripts, and Scenes: Aspects of Schema Theory. Lawrence Erlbaum Associates: Hillsdale, NJ.

Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman: San Francisco, CA.

Matsuo, Y. and Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal of Artificial Intelligence Tools, 13(1):157-169.

McClelland, J. L. and Kawamoto, A. H. (1986). Mechanisms of sentence processing: Assigning roles to constituents. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models, pages 318-362. MIT Press: Cambridge, MA.

McRae, K., de Sa, V. R., and Seidenberg, M. S. (1997). On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126(3):99-130.

Medin, D. L. and Schaffer, M. M. (1978).
Context theory of classification. Psychological Review, 85:207-238.

Meyer, D. E. and Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90(2):227-234.

Miikkulainen, R. (1992). Trace feature map: A model of episodic associative memory. Biological Cybernetics, 66:273-282.

Milgram, S. (1967). The small world problem. Psychology Today, 1:60-67.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81-97.

MindManager (2006). MindJet: Software for visualizing and managing information. Corporate home page accessed on the World Wide Web; Retrieved August 31, 2006, from

Moll, M., Miikkulainen, R., and Abbey, J. (1994). The capacity of convergence-zone episodic memory. In Proceedings of the 12th National Conference on Artificial Intelligence, AAAI-94, pages 68-73. MIT Press: Cambridge, MA.

Moravec, H. (1998). ROBOT: Mere Machine to Transcendent Mind. Oxford University Press: New York, NY.

Moreno-Seco, F., Mico, L., and Oncina, J. (2003). Extending fast nearest neighbour search algorithms for approximate k-nn classification. In Perales, F., Campilho, A., and Perez, N., editors, Pattern Recognition and Image Analysis, volume 2652 of Lecture Notes in Computer Science, pages 589-597. Springer-Verlag.

Motter, A. E., de Moura, A. P. S., Lai, Y.-C., and Dasgupta, P. (2002). Topology of the conceptual network of language. Physical Review E, 65(6):Art. No. 065102 Part 2.

Munakata, Y. (2004). Computational cognitive neuroscience of early memory development. Developmental Review, 24(1):133-153.

Nason, S. and Laird, J. E. (2005). Soar-RL: Integrating Reinforcement Learning with Soar. Cognitive Systems Research, 6(1):51-59.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(1):104-114.

Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49(3):197-233.

Osgood, C. E., May, W., and Miron, M. (1975). Cross-Cultural Universals of Affective Meaning. University of Illinois Press: Champaign, IL.

Perugini, S., Goncalves, M. A., and Fox, E. A. (2004). Recommender systems research: A connection-centric survey. Journal of Intelligent Information Systems, 23(2):107-143.

Posner, M. I. and Keele, S. W. (1970). Retention of abstract ideas. Journal of Experimental Psychology, 83:304-308.

Pulvermuller, F. (1999). Words in the brain's language. Behavioral and Brain Sciences, 22(2):253-336.

Quillian, M. R. (1969). The Teachable Language Comprehender: A simulation program and theory of language. Communications of the ACM, 12(8):459-476.

Raaijmakers, J. G. W. and Shiffrin, R. M. (1981). Search of associative memory. Psychological Review, 88:93-143.

Rabinowitz, F. M. and Andrews, S. S. R. (1973). Intentional and Incidental Learning in Children and the von Restorff Effect. Journal of Experimental Psychology, 100(2):315-318.

Rainsford, C. P. and Roddick, J. F. (1999). Database issues in knowledge discovery and data mining. Australian Journal of Information Systems, 6(2):101-128.

Rosch, E. and Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7:573-605.

Rumelhart, D. E., Hinton, G. E., and McClelland, J. L. (1986). A general framework for parallel distributed processing. In Rumelhart, D., McClelland, J., and PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press: Cambridge, MA.

Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall: Englewood Cliffs, NJ.

Salton, G. and McGill, M. (1983). An Introduction to Modern Information Retrieval.
McGraw-Hill: New York, NY.

Schlogl, C. (2005). Information and knowledge management: dimensions and approaches. Information Research, 10(4):16pp.

Sharps, M. J., Villegas, A. B., Nunes, M. A., and Barber, T. L. (2002). Memory for animal tracks: A possible cognitive artifact of human evolution. Journal of Psychology, 136(5):469-492.

Shrager, J., Hogg, T., and Huberman, B. A. (1987). Observation of phase transitions in spreading activation networks. Science, 236(4805):1092-1094.

Sigman, M. and Cecchi, G. A. (2002). Global organization of the Wordnet lexicon. Proceedings of the National Academy of Sciences, 99(3):1742-1747.

Simons, J. S., Scholvinck, M. L., Gilbert, S. J., Frith, C. D., and Burgess, P. W. (2006). Differential components of prospective memory? Evidence from fMRI. Neuropsychologia, 44:1388-1397.

Skinner, B. F. (1977). Why I am not a Cognitive Psychologist. Behaviorism, 5:1-10.

Smith, B. C. (1996). On the Origin of Objects. MIT Press: Cambridge, MA.

Smith, E. E., Shoben, E. J., and Rips, L. J. (1974). Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81:214-241.

Sowa, J. F. (1991). Principles of Semantic Networks: Exploration in the Representation of Knowledge. Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann: San Mateo, CA.

Steyvers, M. and Tenenbaum, J. (2005). Small worlds in semantic networks. Cognitive Science, 29(1):41-78.

Taatgen, N., Lebiere, C., and Anderson, J. R. (2006). Modeling paradigms in ACT-R. In Sun, R., editor, Cognition and Multi-Agent Interaction: From Cognitive Modeling to Social Simulation. Cambridge University Press.

Teevan, J., Alvarado, C., Ackerman, M. S., and Karger, D. R. (2004). The perfect search engine is not enough: A study of orienteering behavior in directed search. In CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 415-422. ACM Press: New York, NY.

TheBrain (2006).
TheBrain Technologies Corporation. Corporate home page accessed on the World Wide Web; Retrieved August 31, 2006, from

Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. MIT Press: Cambridge, MA.

Todd, P. M. and Gigerenzer, G. (2000). Precis of simple heuristics that make us smart. Behavioral and Brain Sciences, 23:727-780.

Tranel, D. and Jones, R. D. (2006). Knowing "what" and knowing "when". Journal of Clinical and Experimental Neuropsychology, 28(1):43-66.

Tulving, E. (1972). Episodic and Semantic Memory. In Tulving, E. and Roberts, M., editors, Organization of Memory, pages 381-403. Academic Press: New York.

Tulving, E. and Thomson, D. M. (1973). Encoding specificity and retrieval process in episodic memory. Psychological Review, 80(5):352-373.

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4):327-352.

Vinson, N. G. (1999). Design guidelines for landmarks to support navigation in virtual environments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: The CHI is the Limit, pages 278-285. ACM Press: New York, NY.

Wang, Y. and Liu, D. (2003). On information and knowledge representation in the brain. In Proceedings of the Second IEEE International Conference on Cognitive Informatics (ICCI'03).

Want, R., Hopper, A., Falcao, V., and Gibbons, J. (1992). The active badge location system. ACM Transactions on Information Systems (TOIS), 10(1):91-102.

Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of 'small world' networks. Nature, 393(6684):440-442.

Whittaker, S. and Hirschberg, J. (2001). The character, value, and management of personal paper archives. ACM Transactions on Computer-Human Interaction (TOCHI), 8(2):150-170.

Whorf, B. L. (1956). Language, Thought, and Reality: Selected Writings. MIT Press: Cambridge, MA.

Wickens, C. D. and Hollands, J. G. (1999). Engineering Psychology and Human Performance, 3rd edition. Prentice Hall.

Witten, I. H., Moffat, A., and Bell, T. C.
(1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann: San Francisco, CA.

Woods, W. A. (1975). What's in a link: foundations for semantic networks. In Bobrow, D. and Collins, A., editors, Representation and Understanding, pages 35-82. Academic Press: New York.

Wynn, T. and Coolidge, F. L. (2004). The expert neandertal mind. Journal of Human Evolution, 46(4):467-487.

Yamaguchi, S., Isejima, H., Matsuo, T., Okura, R., Yagita, K., Kobayashi, M., and Okamura, H. (2003). Synchronization of cellular clocks in the suprachiasmatic nucleus. Science, 302(5649):1408-1412.

Zha, H. and Simon, H. D. (1999). On updating problems in latent semantic indexing. SIAM Journal of Scientific Computing, 21(2):782-791.

Chapter 3

Static Reformulation: A User Study of Static Hypertext for Query-Based Reformulation1

Hypertext allows users to navigate between related materials in digital libraries. The most fundamental automated hypertexts are those constructed on the basis of semantic similarity. Such hypertexts have been evaluated by a variety of means, but seldom by real users given simulated real-world tasks. We claim that while other methods exist, one of the best ways to prove the usefulness of hypertext is to show the benefits for users performing realistic tasks. We compare the reformulation of queries that users perform in keyword searching, to the query reformulation implicit in browsing between documents linked by similarity of content. We find that a static automatically-constructed similarity hypertext provides useful linking between related items, improving the retrieval of targets when used to augment standard keyword search.

3.1 Introduction

As digital libraries grow, it becomes increasingly difficult for human editors to organize the deluge of material, and the automatic induction and indexing of new materials becomes crucial.
Information-seeking users are similarly challenged in finding what they want in a large set of materials. To help users find information, it should be organized not just by subject, but also by similarity of content. Keyword search by itself is not enough. While keyword search is fast when the target of a search is known, it fails when users do not have a clear idea of what they seek. For example, users may not be familiar with terms common to a specific topic, or may be unskilled at composing accurate queries. If a user's query is too general or vague, too many search results may be retrieved to find targets efficiently. Conversely, if a search query is too narrow, appropriate targets may be excluded. Even with a relevant item in hand, finding related items can be difficult. When search is augmented with similarity hypertext, users have the opportunity to say to their systems, "give me more like this one". Such augmentation involves query by reformulation methods. In its simplest form, query by reformulation requires users to re-pick their search terms iteratively until they find what they want. This places a "burden of decision" on users, who may not have the necessary skills to formulate effective queries. Overtly interactive approaches ask users to identify relevant and irrelevant documents in their search results to improve the inferences of the search engine. Fully automatic approaches refine search results through the automatic adjustment of query terms by query expansion or document expansion algorithms, or through inferences of information need based on observed user behaviour. If effective, these latter automatic methods are preferred as least intrusive for the user.

1A version of this chapter has been accepted for publication. Huggett, M. and Lanir, J. (2007). Static reformulation: A user study of static hypertext for query-based reformulation. In Proceedings of the Joint Conference on Digital Libraries (JCDL). ACM Press: New York, NY.
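As a concrete illustration of the automatic query-adjustment idea mentioned above, a naive query-expansion step might add the most frequent non-query terms from a document the user has marked relevant. The weighting scheme (raw term frequency) and the cutoff are illustrative assumptions, not the chapter's method.

```python
# Naive sketch of automatic query expansion: augment the query with the
# most frequent non-query terms from a relevant document. The weighting
# (raw term frequency) and the cutoff are illustrative choices only.

from collections import Counter

def expand_query(query_terms, relevant_doc, extra=2):
    """Return the query augmented with top terms from a relevant document."""
    counts = Counter(t for t in relevant_doc.lower().split()
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(extra)]

doc = "spreading activation networks retrieve documents by spreading activation"
print(expand_query(["networks"], doc))
```

Real query-expansion algorithms would use weighted term statistics over the whole corpus rather than a single document, but the shape of the operation is the same: the relevant document implicitly reformulates the query.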
A hypertext is a network in which each node represents a document, and documents are linked if there is a relation of interest between them. Hypertexts are commonly built by hand: the most impressive example is the Web, in which millions of pages have been written and linked by individual authors. Manual linking is less difficult if performed as pages are written, but is too slow and costly for linking large pre-existing corpora, especially if the linking scheme may change. Manual linking is also subjectively dependent upon the person who creates the links; authors tend to link the same materials in different ways, leading to a lack of "inter-author consistency" (Furner et al., 1999).

Automatically-constructed hypertext produces more consistent results by using standard information retrieval (IR) methods to link documents together based on an algorithmic measure of relatedness. Since the process is automated, large numbers of new pages can be inducted reliably and quickly. Hypertext ideally helps users to find documents that bear a clear and useful relation to a document already found. In this case, the most generally useful relation is that of similarity, and for this only a single link type representing a measure of alikeness is required. Overall, the goals of automatically-constructed similarity hypertext are:

• to support vague and difficult search tasks by recovering items most related to a designated item.
• to grow with ease; given the constant expansion of digital document collections, a system should not require user intervention to decide if documents should be linked.
• to incorporate heterogeneous documents, unlike expert systems that are restricted to a particular domain.

Automatically-constructed hypertext tends to be evaluated algorithmically using IR methods rather than user studies. IR methods gauge a system's ability to retrieve correct targets in answer to predetermined queries.
Large test corpora such as TREC have been developed with specified questions and known answers as test-beds for such algorithmic evaluation (Teevan et al., 2004). However, many researchers have advocated user studies as an important part of evaluating automatically-constructed hypertexts (Agosti et al., 1997; Blustein and Staveley, 2001; Blustein et al., 1997; Green, 1999; Melucci, 1999). Some hypertext user studies, mainly in the Web domain, have asked such questions as whether browsing improves retrieval with too-general or vague queries, and whether browsing gives additional retrieval improvement over querying alone (Marchionini et al., 1993); whether ranking retrieved documents by similarity to a query improves retrieval, and whether similarity links allow faster, more accurate retrieval of more relevant documents (Carmel et al., 1992). However, there have been few user studies on automatically-constructed hypertext, particularly when used explicitly for query by reformulation.

Whether links to similar documents are pre-computed to form a network, or inferred unobtrusively at run-time, from the user's perspective there is no difference. As long as the user is presented with a choice of documents similar to a chosen document, then the chosen document acts as a query. By choosing a new focus document from the results that is more relevant to their information needs, the user is implicitly modifying the query by navigating among similar documents in the hypertext. We use standard IR methods to build a static similarity hypertext from a controlled corpus of similarly sized news articles, and link documents together more or less strongly to their closest peers based on shared keywords. Since networks with small-world properties have been shown to be optimally navigable while using a minimum of resources, we approximate small-world topology in the resulting hypertext network with a self-tuning link pruning method.
Our user study provides a standard, generic interface to the pre-computed hypertext, with user tasks that simulate real-world conditions. Our main contribution is to compare two approaches by which users reformulate queries: directly by revising search terms, versus implicitly by navigating through a static similarity network. We ask: does adding implicit reformulation to a search engine enhance the user's experience and result in better search results? When do users use similarity links, how do they use them, and in what kinds of tasks are they most useful? What is the efficacy of such simple systems? Our results show that implicit reformulation provided by a static hypertext provides significant improvement for user performance in realistic information-seeking tasks.

3.2 Related Work

Prior work includes studies of automatically constructed hypertext, how hypertext network topology has been tuned, how users interact with information, and how hypertext has been evaluated.

3.2.1 Query by Reformulation

Query by reformulation most commonly uses relevance feedback from the user to adjust the meta-data of indexes or documents, sometimes creating links dynamically from documents to other documents, or to anchors within documents, in response to queries. Reformulation is particularly useful as an active guide when users do not have a clear idea of what they need, are unfamiliar with the lexicon of their topic, or use terms that differ from the system's understanding of the topic.

The Rabbit system (Williams, 1984) is a "query assistant" that provides guided query reformulation based on properties of human memory retrieval processes. The user starts by identifying a general topic, and then interactively refines a series of queries by criticizing the results that each query returns, rating their attributes as either relevant or irrelevant to the current information need.
Following each interaction, the system reformulates and displays the new query to the user. During the process, the system shows users what keywords are available and appropriate to a particular topic, which is of particular value to users who are unfamiliar with a topic, or who are challenged by the subtleties of reformulating their own queries. Since the retrieval process is compute-intensive, the system caches "exemplar" documents that act as generic representatives of each topic. The system can then quickly provide at least somewhat reasonable results by first displaying exemplars relevant to the current query while the retrieval process continues in the background.

The CodeFinder system is designed to provide programmers with retrieval of software components for re-use (Henninger, 1994). As with Rabbit, CodeFinder provides structured iterative query refinement by allowing users to select topic classes with desired attributes, and then refine the search into available subclasses. However, this method requires a laborious up-front investment in modeling the programming context, and is difficult for users to understand. By contrast, CodeFinder's key feature is a spreading-activation process that retrieves similar and associated items that have terms in common; it can retrieve elements that are related to a query but do not match it exactly. The spreading-activation feature uses a network to link prominent attributes with the information items that contain them; the network can easily be built automatically. Comparative studies showed that CodeFinder excelled in supporting the ill-defined needs of programmers who are unsure of how to begin to solve a problem. The system shows that formulating effective queries is at least as important as the retrieval algorithm used.

The VOIR system generates dynamic links based on feedback provided by users during browsing (Golovchinsky, 1997), using the display metaphor of a newspaper layout.
VOIR is designed to navigate corpora that are too large to index manually. It performs query-expansion reformulation by adding content-bearing words to the revised query from document sentences selected by the user. The system then helps reduce the cognitive load on users by indicating which links are most likely to be contextually relevant. Results of a user study showed that queries based on this link selection method performed better than direct user-specified queries. An interesting general observation from this study was that the distinction between hypertext and information retrieval research seems to be blurring.

ClickIR is another system that generates links dynamically for large corpora (Bodner and Chignell, 1998). ClickIR combines search and browse features in the same interface, to avoid the "spiky" Web navigation pattern of keyword search followed by local browsing. Similar to VOIR, sentence-based relevance feedback is performed by using the sentence in which a link occurs as a query. The weighted average of the user's last few link selections is passed to the search engine; results are then dynamically combined into a hypertext document. Words best fitting this cumulative model of user interest are used as links in documents selected by the user. In a user study, ClickIR performed significantly better when users provided relevance feedback by indicating relevant documents. In a second study, ClickIR was compared to a standard search-and-browse interface, where search and browse functions are typically segregated. Results showed that ClickIR found significantly more correct results (i.e., higher recall) than the standard interface.

The ScentTrails system for finding information on the Web (Olston and Chi, 2003) also combines search and browse in the same interface. Persistent terms reflecting the "partial information goal" are entered into a text field to provide context; the user may alter these terms at any time to refine the search context.
Links related to these terms are then dynamically highlighted in viewed documents by enlarging the font, to indicate paths to desired content one or more hops away. The system is implemented as a proxy server between a standard HTML browser and a Web server. In a user study, ScentTrails performed significantly better than either standard search or browse alone.

The method of reformulation sessions (Amitay et al., 2005) is meant to address the situation where users may know little or nothing about a topic or its available index terms. In a form of document expansion, the system links a series of query reformulations to the last set of results retrieved by the series. This association is indexed, based on the assumption that it will give the best answer if the current topic is revisited. Users accessed the specially designed interface at a Web site. In a user study, results showed that the method elicited more correct results in shorter time. Since the method is essentially a consensus approach that votes on the semantics by which documents are indexed, less popular but valid interpretations may be suppressed by more common meanings.

Although these systems bear many similarities to our approach, there are some potential drawbacks to dynamic methods. First, a proprietary algorithm of some sort needs to be active at run-time to perform comparisons and meta-data calculations to build a model of the user's information needs. Second, there may be as many user models as there are users, and resources will need to be dedicated to store and manage these models. Third, computation at run-time may become onerous if many simultaneous comparisons are necessary in very large corpora.
Fourth, since dynamic systems alter link structure ad hoc, they have little explicit notion of network topology, although graph theory has proven integral to the analysis of static hypertext such as the Web, and clearly, link generation cannot be based on global topology if that topology is unknown. If the dynamic methods fail or are unavailable for any reason, then it is useful to have a network to fall back on; this has motivated our use of a static representation.

3.2.2 Similarity Hypertext

Many studies have shown how to construct a static similarity hypertext automatically from a corpus of documents (e.g., Agosti et al., 1997; Green, 1999). All of these studies assume that building such a network provides users with useful means to browse through documents based primarily on semantic similarity. Automatically constructed hypertexts are most commonly built using statistical measures of word similarity between documents. Salton and Allan (1994) built a hypertext network from a corpus by comparing vectors of weighted terms between documents. If the term vectors matched well enough, then the documents were linked. This method has proven popular for building hypertexts, although the method of determining keyword-based document similarity often varies between researchers, who use such measures as inner-product, cosine, Dice, product-moment, covariance, overlap, spreading-activation, and Jaccard measures (Jones and Furnas, 1987; van Rijsbergen, 1979). Other methods of computing inter-document similarity use synonymy and term expansion (Green, 1999). Our position is that any approach that produces a monotonic ranking of document similarities based on keyword matching will be sufficient for the purpose of our evaluation.

Restricting the number of node and link types makes hypertext easier to build, maintain, and navigate.
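Any of the keyword-based measures listed above yields the monotonic similarity ranking we require. As a minimal illustration (generic textbook formulations, not the exact weighting used by the cited systems), cosine and Jaccard similarity over term-weight vectors might be computed as follows:

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Jaccard coefficient on the two term sets (weights ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Both measures rank a pair of documents higher the more vocabulary they share, which is all that the evaluation in this chapter depends on.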
By contrast, some systems organize their items into semantic networks that use a variety of link types to represent different types of relations between items. More than 50 types of relations have been identified that can be represented as links (Kopak, 2000). Although semantic networks are expressive, the use of multiple link types can make them difficult to build, navigate, and interpret, and their subtlety typically requires manual construction, a time-consuming and laborious process. Such complex networks "... are very difficult to build, to maintain and keep up to date. Their construction requires in depth application domain knowledge that only experts in the application domain can provide" (Crestani, 1997, p.467). This difficulty underscores the need for a minimum number of link and node types when generating and maintaining large hypertext.

3.2.3 Information-Searching Behaviour

Humans are good at wayfinding. Such local sign-following is based on nearby cues in the environment, and occurs quickly since these cues are usually present (Hunt and Waller, 1999). When browsing through hypertext, users are likely to pick a next step that appears to bring them closer to their intended target, based on attributes visible in currently retrieved objects (Teevan et al., 2004). During the search, a document's nearest neighbours will share a significant number of keywords, but will have some distinguishing keywords closer to what is being sought. The likelihood of a successful search is improved if local attribute "signage" (e.g., the keywords of an adjacent node) is adequately complete and non-ambiguous. Such iterative searching approaches items of interest in a semantic gradient descent that quickly narrows in on relevant items. Ideally, the system will suggest next steps to the user, reducing the need for direct query formulation.
A keyword search on the hypertext network will return search results most similar to the query; if their scores are high enough, users may be confident that items linked to them may lead to other good, related results (Berger et al., 1999). Automated suggestions are a useful feature given the difficulty of keyword search: refining queries can require significant user competence (Teevan et al., 2004). Keyword search also has semantic limits: users typically formulate queries with fewer than 3 keywords (Jansen et al., 1998); the inherent ambiguity of short queries limits the precision of the results, and precision also falls as the size of a corpus grows. For these reasons, hypertext works well for difficult searches, as "navigability through the semantic structure permits formulation of a query by means of the identification of a semantic path through the reference structure" (Agosti et al., 1997). Evidence shows that users react positively to link suggestions, which lead them to navigate in a more structured way, reducing disorientation and task execution time (Amitay et al., 2005; Golovchinsky, 1997; Olston and Chi, 2003).

Although some studies (e.g., Green, 1999) seek to examine "pure" navigation with browsing only, such an approach does not simulate real-world conditions in which browsing is typically preceded by a keyword search. Neither browse nor search is sufficient by itself to fulfill complex information tasks (Marchionini, 1995; Olston and Chi, 2003; Teevan et al., 2004), but each has its strengths. Keyword search is fast and "good enough" in most cases, but ill-articulated information needs cause users to browse (Marchionini, 1995). Queries also serve to place users in the vicinity of good targets, and once a good source is found, browsing will lead to other adjacent good targets.
This technique starts from a good 'exemplar' source to find similar sources (Williams, 1984); it is frequently used by expert librarians to quickly retrieve relevant information (Hawkins and Wagers, 1982).

Although traditional IR laboratory experiments typically evaluate algorithms using large collections of documents, their limitations include a lack of real users, unrealistic queries, and unrealistic relevance judgments. Removing human subjects from the evaluation process raises questions of how relevant the model is on human tasks. Other methods use secondary data provided by the user logs of commercial IR systems. For example, an extensive study of Excite user queries (Jansen et al., 1998) found that only 5% of the queries involved searches using the MORE LIKE THIS feature. However, such studies are post-hoc; user studies are necessary for questions that cannot be answered with user logs. When research questions depend on the interaction of user and system, user studies are indispensable. "For hypertext to be useful to people, its designers must know what readers want to use it for. Hypertext being evaluated by people should be measured against the criteria of how well it helps people to complete their tasks" (Blustein et al., 1997).

3.2.4 Network Topology

Ensuring navigability and efficient use of resources (computing time and storage) is also critically dependent on how the addition of links affects the network's topology. If the hypertext is over-connected, users can become disoriented when asked to choose from too many links (Golovchinsky, 1997; Salton and Allan, 1994). This problem has led to methods that monitor the addition of new links, in an attempt to control the hypertext's topology and keep its construction feasible while scaling up. Such methods compute quantitative network properties, as opposed to qualitative methods such as those that compute document similarity. For example, Botafogo et al.
(1992) introduced the measure of compactness of a hypertext, which measures the average number of links per node, and its "stratum", which measures the number of links traversed to get from one node to another. However, an analysis of this approach has shown it not particularly useful for guiding the automated construction of large hypertext (Smeaton, 1995). Adding a link based on its effect on the overall topology will not work well if all links are not evaluated in the same way, that is, if links added later are penalized compared to earlier links whose eventual effects on topology were unknown when the links were added.

The simplest method of controlling hypertext growth is to apply a threshold to the similarity function that computes link strength: only links above a certain weight will be connected. This works poorly: when the threshold is raised, the compactness of the network goes down, but the number of singleton nodes (i.e., nodes with no connections) increases. The result is a very uneven link distribution, with some nodes hugely connected, and other nodes completely isolated (Golovchinsky, 1997; Wilkinson and Smeaton, 2000).

By contrast, networks with small-world and scale-free topologies exhibit promising characteristics. Small-world networks, in which the number of connections per node follows a power-law distribution, seem to be ideal in offering optimal connectivity using minimal resources, and many instances of semantic networks (of which similarity hypertexts are one type) have been found to have small-world properties (Barabasi, 2002). Despite these advantages, small-world properties have not been exploited in information management systems (Perugini et al., 2004), although finding some way to encourage small-world properties in large automatically constructed hypertext could have significant benefits.
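Botafogo et al.'s precise definitions are more involved than this summary suggests; as a rough illustration only, a compactness-like quantity and a stratum-like quantity can be approximated by the average degree of the link graph and its average shortest-path length (computed here by breadth-first search from every node):

```python
from collections import deque

def avg_degree(adj):
    """Average number of links per node: a compactness-like measure.
    `adj` maps each node to a list of its neighbours."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def avg_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs:
    a stratum-like measure of how many links separate nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

On a simple chain a–b–c, the average degree is 4/3 and the average path length is 8/6, illustrating how an over-connected network drives the first number up while driving the second toward 1.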
Figure 3.1: The user interface, showing the (augmented) browse condition.

3.3 Experimental Design

Our experiment compares the reformulation of queries that users perform in keyword searching, to the query reformulation implicit in browsing between documents linked by similarity of content. The experiment simulates real-world tasks in a real-world interface. Subjects were tasked with finding news articles that fit stated parameters. The articles were drawn from two corpora of typical news article collections. The interface is unambiguous, simple, and familiar, used every day by anyone who uses a Web browser. The interface is augmented by a single button that retrieves articles similar to a selected article. Our goal is to show that a simple approach with a static similarity model can produce fast, reasonable results that are significantly better than plain searching, without requiring proprietary systems at run-time.

3.3.1 User Interface

To determine how well query reformulation helps people to complete their tasks, we require a user interface similar to one that users are likely to use on a daily basis.
Compared to other studies that provided their own 76 novel, sometimes elaborate user interface (UI) (Amitay et al., 2005; Bodner and Chignell, 1998; Golovchin-sky, 1997; Henninger, 1994; Olston and Chi, 2003; Williams, 1984), our user interface was simplified as much as possible, and was given a familiar layout and familiar widgets in order to minimize the effect of UI design on results. The UI is separated into two main areas. The left area is a results panel that shows a ranked list of results when either searching or navigating. The right side of the UI is a display panel that shows the full contents of a document when its title is clicked in the ranked list at left. The user's search terms are highlighted in the document wherever they appear, to help them judge the document's relevance to the task. The document's search result entry in the results panel at left is also highlighted. A search field above the results panel is used to perform keyword searches. After the search button is clicked, search results are displayed in the results panel in a ranked list according to relevance. In the margin next to each result, a check box allows users to select a document if it matches the task; these selections are used later to evaluate the effectiveness of the interface. The strip along the bottom of the interface displays a description of the current user task, for the duration of the task. At the left end of the strip, a "Show Selected" button toggles the results panel to display all the articles that have been selected so far in a task; articles can be selected or de-selected at any time. The UI is shown in Figure 3.1. The two interfaces that we use in our experiment, dubbed Search and Browse, have identical layout. In both conditions, subjects were provided with standard direct-keyword-search facilities using the Google Desktop (GD) API (GoogleAPI, 2006). 
Google Desktop was tuned with the TweakGDS utility (TweakGDS, 2006) to ignore the file systems of the test machines, and instead index and search only within the two test corpora. Each user keyword search returned results from both corpora. As each user task was relevant only to one or the other corpus, the items from the non-relevant corpus were filtered out before results were presented to the user.

The Browse configuration also includes a navigation button next to each search result. This navigation button is the only obvious novelty in the interface. Its function is similar to the MORE LIKE THIS button evaluated in the Excite Web search study (Jansen et al., 1998). When clicked next to an article's description, that article becomes the new 'focus' node: its contents are shown in the display panel, and its nearest neighbours from the similarity network are displayed in the results panel, listed in decreasing order of similarity. The navigation button is also analogous to the Similar Pages link that appears next to each search result in a standard Google web search; however, the latter is based on link topology and judges similar pages as those that have the most hyperlinks in common. Similarity of semantic content is ignored, and the approach is ineffective if applied to a corpus without predefined links (GoogleGuide, 2006).

3.3.2 Corpora

The experimental domain consisted of two news-article corpora, each containing over 2000 documents. The NYT corpus is a random selection of daily articles from the New York Times, collected by the authors and drawn from the years 2003-2005. The Reuters corpus is drawn from the widely used Reuters-21578 test collection (Reuters-21578, 2006). Many of the articles in Reuters-21578 are brief tables of numerical data; to guarantee discursive articles comparable with the NYT corpus, the 2000 longest articles were culled from the test collection.
We used relatively small corpora to ensure that the target ratio was high enough to give users a reasonable chance of completing the task in the two minutes allowed.

3.3.3 Similarity Network

Prior to the experiment, articles were linked into two static similarity networks, one for each corpus. The networks were fixed throughout the user sessions, and constructed using simple algorithms². The tf-idf classifier was used to extract a weighted list of potential keywords for each document. Keywords were stemmed to pool terms with the same root. Terms with weights above a set threshold were selected as document keywords. After keywords were assigned to each document, the documents were linked by the keywords that they shared, that is, to the extent that they were assumed to discuss similar topics. Thus documents are represented as nodes in a similarity hypertext, and the weight of a link between nodes is equal to the normalized similarity measure:

similarity(n, m) = norm( Σ_i [ weight(n, t_i) + weight(m, t_i) ] )

for all keyword terms t_i shared by documents n and m.

The topology of the similarity network depends crucially on the keyword threshold. A higher threshold value results in fewer keywords per node; a lower threshold allows more. Since the connectivity between two nodes is based on the keywords that they share, more keywords per node increases the probability that nodes will be connected. If the threshold is too high, then the number of singleton (i.e., unconnected) nodes increases. In effect, these documents would be interpreted as being unlike all other documents in the network. If the threshold is too low, then the network becomes over-connected, and requesting similar items with the navigation button will return too many results, increasing the difficulty for subjects of differentiating useful items. Networks with a small-world distribution have been shown to use resources more efficiently and be easier to navigate (Barabasi, 2002).
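The linking step just described can be sketched minimally as follows, assuming each document has already been reduced to a dict of tf-idf term weights. This is illustrative only: `norm` is taken here to mean scaling by the largest link weight, and the full algorithms appear in Appendix A.1.

```python
def link_documents(keywords, threshold):
    """Link documents that share keywords whose tf-idf weight meets
    the threshold.  `keywords` maps doc id -> {term: weight}.
    Returns {(n, m): normalized link strength}."""
    kept = {d: {t: w for t, w in terms.items() if w >= threshold}
            for d, terms in keywords.items()}
    links = {}
    docs = sorted(kept)
    for i, n in enumerate(docs):
        for m in docs[i + 1:]:
            shared = set(kept[n]) & set(kept[m])
            if shared:
                # Link strength sums the two documents' weights for
                # every shared keyword, as in the formula above.
                links[(n, m)] = sum(kept[n][t] + kept[m][t]
                                    for t in shared)
    top = max(links.values(), default=1.0)
    return {pair: s / top for pair, s in links.items()}
```

Raising the threshold shrinks each document's keyword set and so prunes links; documents left with no shared keywords become the singletons discussed in the text.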
Thus we specifically tuned our keyword threshold to approximate the power-law distribution of links per node typical of small-world networks, by incrementally raising the keyword threshold and reconnecting any singletons to their closest neighbour. This ensured that all nodes were connected into a single network with a reasonable level of sparseness.

²The algorithms are described in detail in Appendix A.1.

3.3.4 Design

The experiment was a 2-by-18 (interface type, task) factorial design. Task and interface were both within-subject variables. The corpus used was a within-subject control variable. A within-subject design was chosen for its increased power, to allow us to compare different types of tasks, and to let each subject use and comment on both the Search and Browse interfaces.

3.3.5 Subjects

We conducted our experiment on 24 subjects recruited using an online experiment management system. Participants were compensated $10 for their participation. There were 18 males and 6 females, and 17 undergraduate and 7 graduate students. The average age was 22.4, with a range of 18-32. Subjects were all fluent in English and had on average 7.3 years of experience with WIMP (Windows, Icons, Mouse, and Pointer) interfaces and keyword-searching facilities. All used the Google search engine at least once a day but typically more often (2 also used Yahoo). 18 read the news at least every two days, and 6 once a week or less.

3.3.6 Apparatus

The experiment was conducted on standard desktop machines running the Windows XP operating system and using a 17" LCD screen. The experiment was set up in sound-insulated one-person test rooms with no distractions.

3.3.7 Procedure

During each hour-long session, subjects first answered a demographic questionnaire. They then completed a training session of two tasks like those in the experiment, in order to reduce any learning effect on the interface.
Subjects were then presented with the experimental series of 18 search tasks, 9 tasks in each corpus. All tasks, including the training sessions, were limited to two minutes, and subjects were at liberty to rest between tasks if desired. The task time was limited to reduce the chance that users would "over-think" the task and wander too far into the corpus; we also wanted the tasks to reflect the time-constrained pressures of real-world work environments. A post-experimental questionnaire asked subjects to rate and comment on the navigation-button feature.³ Each subject performed half of the tasks in the Search interface, and the other half in the Browse interface. The interface for the tasks was counterbalanced so that for each task, half of the subjects used one interface while the other half used the other interface. To minimize the order effect, the order of the corpora was also counterbalanced, resulting in four configurations.⁴

³The pre- and post-experiment questionnaires are listed in Appendix B.2.
⁴The experimental design is illustrated in Appendix B.3.

Figure 3.2: Task classification based on Bhavnani (2001). (Axes: extent, from factual to exhaustive; directness, from accurate to fuzzy.)

The purpose of all tasks was to find and mark as many articles as possible in the allotted time that fit a specified task problem. The template of the task problems was "Find all articles that discuss...", followed by a brief topic description. Task problems were evenly divided between the NYT and Reuters corpora. Tasks in the Search interface used a standard, basic keyword-search layout. Subjects were only able to enter search terms and select from the retrieved results. They were not allowed to use the similarity network. Further searching required subjects to reformulate their own queries in the Search text field. Tasks in the Browse interface used a slightly different layout.
In addition to the standard keyword search, a green navigation button appeared next to each retrieved article in the ranked list. When an article's button was clicked, its nearest neighbours were retrieved from the similarity hypertext network and displayed in a list ranked by relevance.

3.3.8 Task Design

Designing the task questions was not simple. We wanted to have "real world" tasks, yet keep them diverse and broad enough to cover most typical search tasks.⁵ Bhavnani et al. (2001) developed a taxonomy that divides real-world IR tasks along two dimensions. The first dimension concerns what the user requires from the search. This can vary between factual searches that require just a few sources of information, and in-depth or exhaustive searches that collect as much information as possible on a specific topic, usually from many sources. The second dimension concerns how much the user knows about the information being sought, which can vary between fuzzy and accurate knowledge. We designed our tasks along analogous dimensions (Figure 3.2): number of targets (extent), which maps to the factual/exhaustive dimension, and ease of search (directness), which maps to the fuzzy/accurate dimension. The extent of a task is its relative number of correct targets in the corpus. The task "Find the article that describes a missile treaty with the Soviets" has only one correct target, while the task "Find articles that talk about financial predictions of the future" has many. There were 10 tasks with many correct answers (more than 13 targets), versus 8 tasks of small extent (13 or fewer targets). Directness describes how easy it would be to find the article with a direct keyword search.

⁵The user tasks are listed in Appendix B.1. How the tasks were divided between the dimensions of extent and directness is shown in Appendix B.4.
For example, the direct task "Find all articles that discuss the jailing of Judith Miller" requires little more than a search on the terms "Judith" and "Miller" to find all related articles. On the other hand, the indirect task "Find articles that talk about financial predictions of the future" is not easy to execute using keyword search alone. Seven tasks were direct, while 11 tasks were indirect. Our hypothesis was that the Browse interface would show better performance for indirect tasks than for direct tasks. While Bhavnani's second dimension concerns the user's extent of knowledge about the search topic, it maps to our measure of directness: although we cannot control for the user's knowledge of the topics covered in our experiment, we assume that a user who knows more about a topic can more easily formulate appropriate queries. Tasks with low directness and low extent can also be considered similar to known-item search tasks, whose goal is to find a narrow set of items in a large collection that satisfies a well-understood information need (Marchionini, 1995).

3.3.9 Measures

All user manipulations of the interface were time-stamped and recorded to a log file for later analysis. Basic indicators that were recorded include the search terms that were used, which articles were selected, which documents were retrieved from the similarity network, and which documents' contents were viewed. Subject performance was scored in two ways. Consensus scoring scored a subject's selection higher if other subjects also selected it for the same task; the more subjects, the higher the score for that selection (up to a maximum score per selection of N = 24 subjects). The intuition behind consensus scoring is that as a group, subjects are apt to find the best answers to each question, and that subjects who pick these popular answers should receive higher scores. By comparison, manual scoring was performed by the authors.
We first pooled all user selections for each task, then checked whether each selection answered the task description: a correct selection received a score of 1, otherwise 0. Marginal user selections were re-checked by both authors to produce a definitive score. A subject's precision on a task was then calculated as their number of correct selections out of their total number of selections. The intuition behind manual scoring is that good answers are not always popular answers, and that expert evaluation of answers provides an accurate and unambiguous basis for evaluation. For both types of scoring, a subject's overall score for a task was calculated as the sum of their individual selection scores.

Table 3.1: ANOVA examining the effects of interface (Search vs. Browse) on correct answers for the directness and extent dimensions.

                      Search mean   Browse mean   F-value   p-value
  Direct tasks            3.76          4.02        2.46      0.119
  Indirect tasks          2.85          3.58        4.501     0.035
  Low-extent tasks        1.84          2.53       16.93      0.001
  High-extent tasks       5.92          6.19        0.21      0.603
  All tasks               3.20          3.75        6.01      0.023

3.4 Results and Analysis⁶

Before examining our results in depth, we checked the gross indicators. There was no effect of corpus: the interaction effect between corpus and interface on manual-score results was not significant {F=2.09, p>0.05}. As expected, there was a main effect of task, resulting from the inherent differences between the tasks.

3.4.1 Performance Comparison of Interfaces

We ran a two-way ANOVA with repeated measures in order to examine the effect of interface type on user performance, with task and interface as the within-subject variables. The dependent variable reported in all the analyses was each subject's manual-score result on each task. The consensus-score results showed the same trends as manual scoring, and thus are not discussed in further detail.
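The two scoring schemes described in Section 3.3.9 can be sketched as follows. The data and variable names here are illustrative, not the thesis's code: `selections` maps each subject to the set of articles they marked for one task, and `relevant` stands in for the authors' manual judgments.

```python
# Sketch of consensus and manual scoring for a single task.
from collections import Counter

selections = {
    "s1": {"a1", "a2"},
    "s2": {"a1", "a3"},
    "s3": {"a1", "a2", "a4"},
}
relevant = {"a1", "a2", "a3"}  # hypothetical expert answer key

# Consensus scoring: a selection is worth the number of subjects who made
# it (bounded by N subjects); a subject's task score sums their selections.
popularity = Counter(a for picks in selections.values() for a in picks)
consensus = {s: sum(popularity[a] for a in picks)
             for s, picks in selections.items()}

# Manual scoring: each selection scores 1 if judged correct, else 0;
# precision is correct selections over total selections.
manual = {s: sum(1 for a in picks if a in relevant)
          for s, picks in selections.items()}
precision = {s: manual[s] / len(picks) for s, picks in selections.items()}
```

With this toy data, subject s3 gets the highest consensus score (their picks are popular) but the lowest precision (one pick, "a4", is judged incorrect), illustrating why the two schemes can disagree.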
The results show that across all tasks subjects performed better in the Browse interface (mean: 3.75) than in the Search interface (mean: 3.20) {F(1,22)=6.01, p<0.05}. The results also show an interaction effect between task and interface {F(1,22)=6.005, p<0.01}, indicating that scores in each interface depend on the task given.

3.4.2 Task Analysis

To further compare subject performance in both interfaces, and to investigate the interaction effect, we performed a task-by-task analysis on our measures of directness and extent. A summary of the results is presented in Table 3.1.

Directness — We compared tasks according to their difficulty for keyword search. A two-way ANOVA was run on the direct and indirect tasks in order to examine the effect of directness on task performance in both interfaces. Results indicate that indirect-task performance was better in the Browse interface {F(1,22)=4.50, p<0.05}, while for the direct tasks no significant difference was found. Thus although the Browse interface showed better performance on all tasks, it was more beneficial to the indirect tasks. Figure 3.3 shows a task-by-task comparison of both direct (a) and indirect (b) tasks.

Figure 3.3: Number of correct selections by subjects using the Browse and Search interfaces, for direct (a) and indirect (b) tasks.

Visual inspection of the direct tasks (Figure 3.3 a) shows that performance was roughly the same in both interfaces except for Task 1, which was better in the Search interface {F(1,22)=10.51, p=0.004}, and Task 15, which was better using the Browse interface {F(1,22)=19.21, p<0.01}. The indirect-task analysis (Figure 3.3 b) shows that most of the indirect tasks had equal or better performance using the Browse interface. Task 12 is the notable exception, showing better performance in the Search interface {F(1,22)=6.80, p=0.016}.

⁶The data table for the experiment is shown in Appendix B.4.
Extent — We compared tasks according to the number of targets for each task. Performance in the high-extent tasks was not significantly better in either interface (mean 6.19 for the Browse interface, and mean 5.92 for the Search interface), while in the low-extent tasks the Browse interface showed better performance {F(1,22)=16.93, p<0.01}. These results suggest that the Browse interface was especially beneficial for low-extent tasks, i.e., tasks with few target answers.

3.4.3 Navigation Behaviour

We compared the number of searches in the Search interface with the number of combined searches and navigations in the Browse interface, to estimate the information gain in each interface. Since every search or navigation retrieves a new list of results, we treat each as a basic information-retrieval operation. Results showed that subjects performed 1162 total searches in the Search interface, for an average of 4.03 search requests per task. In the Browse interface subjects performed a total of 825 searches and a total of 479 navigations, for an average of 4.72 combined retrieval operations per task. Subjects used the navigation button at least once in 84.1% of the Browse tasks. This fits with subjects' comments in the self-reported measures: 92% (22/24) of the subjects reported that they found the navigation option helpful. Of these, 72% (16/22) found it very helpful, while 27% (6/22) found it somewhat useful. Only 8% (2/24) of the subjects found the navigation option not useful. On a scale of 0 to 10, the subjects' average rating of the navigation button's effectiveness was 7.04.

Figure 3.4: Recall measures of the Search and Browse interfaces, over the elapsed time of the tasks. (Plot of recall vs. time in seconds.)

66.8% (320/479) of navigations were through documents already selected as answers. This surprisingly high number indicates that users most often used navigation to retrieve results similar to a known strong correct result.
For example, one subject said, "I used the green button when I found a relevant article and wondered if there are any similar articles. The green button allowed [me] to search more quickly and efficiently." Open-ended answers in the post-questionnaire suggested another navigational behaviour. Some users indicated that they used the navigation button when they didn't get enough strong results in their initial search. Instead of refining their search, they used the navigation button to browse from the most likely-looking result to "home in" on good targets. One subject wrote: "For some topics, only one or two of the initial results were related to the topic I was searching for. I'd use the browse feature to find more topics related to those articles".

3.4.4 Recall over Time

To further compare the Browse and Search interfaces, we examined the distribution of answers along the 2-minute duration of each task. Recall in information retrieval is defined as the number of correct answers retrieved by a subject out of the total number of targets in the corpus. Figure 3.4 shows the average recall for all tasks over time, in both the Search and Browse interfaces. During the first 20 seconds recall was the same in both interfaces, since subjects using the Browse interface had to start with at least one keyword search before they could begin to exploit the browse option. After subjects began using the navigation button, recall increased more rapidly in the Browse interface. The longer subjects used the Browse interface, the more correct answers they found compared to the Search interface. It is possible that the point-and-click nature of the interface encouraged more interaction, as suggested by Bodner and Chignell (1998). Like ours, their system led to significantly greater recall, but they also found that this came at the expense of significantly longer time taken to perform the task.
By contrast, we show greater recall within the same amount of time, i.e., faster exposure to more relevant material than in the Search interface.

3.5 Discussion

Since users navigate between the nodes of a network on the basis of semantic similarity, our approach is essentially a type of query by reformulation. Our results suggest that link suggestions based on similarity provide significant browsing improvement, and work well even though we do not use expansion of queries, terms, or documents, or a sophisticated classifier. There is no question that dynamic approaches will produce superior results, but we demonstrate that a basic, stripped-down approach can also show significant gains. Overall, there was a clear advantage to searches using the Browse interface over the Search interface. The navigation option in the Browse interface is most helpful for indirect tasks, and for tasks of low extent. There are two main reasons why the Browse interface may be better for indirect tasks. First, the Browse condition provides results of greater relevance. It does not actively guide users in their reformulation, or synthesize new contextual queries, but rather offers avenues for navigational exploration. Users perform query refinement indirectly by choosing links that bring them closer to relevant targets. In the sense that documents themselves stand as queries to other similar documents, query reformulation is implicit in the links that users follow to find results (Amitay et al., 2005; Golovchinsky, 1997). Whereas the Search condition shows users only what they have asked for directly with their queries, in the Browse condition users have the advantage of additional relevance information. A known good document is likely to be linked to other relevant documents.
Also, subjects using the Search interface may feel that they have exhausted their search terms and stop seeking more information, while with the Browse interface it is easier for subjects to continue their search. Second, the Browse condition is less effortful. It gives users direct suggestions on possible avenues for exploration, befitting human wayfinding (Hunt and Waller, 1999). The similar documents presented to users as they navigate in the Browse condition suggest possible semantic alternatives that educate users as to what is available in the corpus, and what possible alternative search terms might be (Henninger, 1994; Williams, 1984), although the user's activity does not directly refine queries during navigation. That users can select from suggested items supports their decision-making while reducing their cognitive load (Bodner and Chignell, 1998). In the Search interface subjects must perform their own iterative query refinement. By contrast, in the Browse interface subjects are less burdened with query refinement and instead can follow their intuition to explore the information space along promising trails. Navigation exposes them faster to a greater amount of related information, which reduces the time required to find accurate results (Figure 3.4). Users take advantage of the implicit keywords inherent in the links of the similarity network, as the system makes "suggestions" for documents that seem reasonable based on the semantic content of the corpus. Subjects can then examine the descriptions and content of the suggested documents to discover other relevant terms that can be used in further keyword searches. For tasks of low extent (i.e., with few targets), the factors of greater relevance and less effort are also a benefit. The Search interface requires careful query reformulation in "near-miss" situations, whereas the Browse interface is more likely to provide links that will bring the user closer to the target(s).
Furthermore, the result set from navigation is more likely to suggest relevant query-reformulation terms along the way, so that users can refine their own search queries and "jump" closer before resuming navigation. Once one of the targets is found, it is likely to be linked directly to other targets.

3.5.1 Exceptions of Interest

To better understand the advantages and disadvantages of the navigation option, tasks which showed worse performance in the Browse interface offer some useful clues. Only two tasks showed significantly better performance in the Search interface in the task-by-task analysis. The description for Task 1 was: "find articles which mention the jailing of reporter Judith Miller". This is an excellent example of a direct task. Almost all other direct tasks showed no differences between the Search and Browse interfaces, but here the search terms Judith and Miller are highly appropriate and specific, and any use of navigation merely delays and distracts subjects from the better tactic of reformulating queries around those two terms. Task 12 was described as "find articles which mention one company referring to another company". We judged it an indirect task because it does not directly suggest any obvious search terms (unlike Task 1), but already in the design phase of the experiment we felt that it was different from the other indirect tasks. Indeed, overall the subjects' results on this task were hit-and-miss. If subjects were unable to think of appropriate search terms within the allotted time, then they were unable to fulfill the task. Given that the relationship between two companies could be described in any number of ways (by mergers, lawsuits, subcontracting, competition or cooperation, etc.), there was less likelihood that target articles would share keywords, and thus less likelihood that any two target documents would be linked.
This would render the similarity-based navigation option less effective, as finding one target would not easily lead to more. Here again, trying to navigate to similar documents just wastes time and distracts the user from more appropriate keyword searching. Another possible reason that Task 12 was better in the Search interface is that, unlike most tasks, Task 12 does not have an initial "anchor": an obvious descriptive keyword which can be used as a starting point to find a superset of targets. For example, the description for Task 8 was: "Find articles that discuss Google's business dealings". This is an indirect task because the vagueness of "business dealings" makes it harder to generate appropriate keywords, but at least the task has an anchor. Querying on the term "Google" is an obvious course of action, and the search results will offer good results that can be explored with the navigation option. Without an anchor term, the process of refinement is difficult even to begin. Task 15, "Find articles that talk about legal issues at the company Texaco", we mis-classified as a high-directness Search task prior to the experiment, perhaps because "Texaco" seemed to be a good anchor term. In fact, Task 15 did much better in the Browse interface: although a keyword search on "Texaco" got immediate results, there were many articles in the corpus discussing all aspects of the company, and thus not all retrieved articles were relevant to legal issues. In the Search condition, this forced users to reformulate the query with all the legal terms that they could think of. In the Browse condition, once one or two good legal-related results were found, it was simple to navigate to other targets.

3.5.2 Future Work

Hypertext browsing will fail when few articles share common terms within the scope of a task. This raises larger questions about the semantic peculiarities of the corpus being used.
For example, a corpus that talks overwhelmingly of global trade, tariffs, quotas, and levels of production will yield many similar results for queries on any of these topics, while tasks that are semantically orthogonal to the corpus, e.g., queries about mathematics, will yield few, isolated results. Cases where hypertext browsing is less effective can inform better design guidelines for directing users in the strategic use of search and browse tools. Although it was not a major focus of this paper, it would be interesting to compare the network topologies produced by different automatic hypertext-construction algorithms, examining the effect that network topology could have on users' search behaviour and task performance in large networks, and how the hypertext-construction algorithms can be tuned to build networks that better support common search behaviours.

3.6 Conclusions

The advantages of browsing compared to keyword search for hypertext are already well known, and have been demonstrated by many user studies based on the Web. We have shown that a static similarity hypertext, constructed from a heterogeneous document corpus using simple tools, provides query reformulation superior to that of human users in simulated real-world tasks. This indicates the viability of simple methods for integrating existing corpora into navigable digital libraries. Systems that tackle the problem of large corpora by using dynamic link generation (Bodner and Chignell, 1998; Golovchinsky, 1997) may be more efficient in terms of resource usage, or may provide superior results by adjusting with greater subtlety to changing information contexts. On the other hand, they require proprietary software at run-time (Amitay et al., 2005; Bodner and Chignell, 1998; Golovchinsky, 1997; Henninger, 1994; Olston and Chi, 2003; Williams, 1984), and their use of custom interfaces may affect the evaluation of the underlying model to some degree. Our system was kept as simple as possible.
We show that there is no need to create a user model or perform complex calculations to provide good results. Since we generate a complete network, we can more easily perform global topological analyses to improve navigability. The network that is generated can be copied to other machines as a passive file system and used immediately with a standard Web browser. Our user study confirmed that browsing in similarity hypertext is particularly effective for tasks that have few targets in the corpus (low extent) and are vaguely described (low directness). Results also showed that in the Browse condition, users were more active and were exposed to more information. We have also shown where browsing in a hypertext is actually detrimental. First, if users are asked to perform a task that does not match the semantics of the corpus, the result set will be less coherent and there will be few "trails" between targets; in this case users will waste time with browsing that would better be spent on keyword search. Second, in tasks where common search terms are highly appropriate and specific, the best tactic is to use direct search only. This indicates where users may be guided to exhibit more efficient search behaviours. We have suggested similarity hypertext as a browsing component for document corpora in digital libraries. We conclude that using such hypertext can substantially enhance user experience and improve the overall quality of results.

Bibliography

Agosti, M., Crestani, F., and Melucci, M. (1997). On the use of information retrieval techniques for the automatic construction of hypertexts. Information Processing & Management, 33(2):133-144.

Amitay, E., Darlow, A., Konopnicki, D., and Weiss, U. (2005). Queries as anchors: Selection by association. In HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia, pages 193-201. ACM Press: New York, NY.

Barabasi, A.-L. (2002). Linked: The New Science of Networks.
Perseus Publishing: Cambridge, MA.

Berger, F. C., van Bommel, P., and van der Weide, T. P. (1999). Ranking strategies for navigation based query formulation. Journal of Intelligent Information Systems, 12(1):5-25.

Bhavnani, S. K., Drabenstott, K., and Radev, D. (2001). Towards a unified framework of IR tasks and strategies. In Proceedings of the 2001 ASIST Annual Meeting.

Blustein, J. and Staveley, M. S. (2001). Methods of generating and evaluating hypertext. Annual Review of Information Science and Technology, 35:299-335.

Blustein, J., Webber, R. E., and Tague-Sutcliffe, J. (1997). Methods for evaluating the quality of hypertext links. Information Processing & Management, 33(2):255-271.

Bodner, R. C. and Chignell, M. H. (1998). ClickIR: Text retrieval using a dynamic hypertext interface. In Text REtrieval Conference, pages 506-515.

Botafogo, R. A., Rivlin, E., and Shneiderman, B. (1992). Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Transactions on Information Systems (TOIS), 10(2):142-180.

Carmel, E., Crawford, S., and Chen, H. (1992). Browsing in hypertext: A cognitive study. IEEE Transactions on Systems, Man, and Cybernetics, 22(5):865-884.

Crestani, F. (1997). Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 11:453-482.

Furner, J., Ellis, D., and Willett, P. (1999). Inter-linker consistency in the manual construction of hypertext documents. ACM Computing Surveys (CSUR), 31(4es):18.

Golovchinsky, G. (1997). What the query told the link: The integration of hypertext and information retrieval. In HYPERTEXT '97: Proceedings of the eighth ACM conference on Hypertext, pages 67-74. ACM Press: New York, NY.

GoogleAPI (2006). Google Desktop API.

GoogleGuide (2006). Google Guide: Similar Pages.

Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5):713-730.

Hawkins, D. T. and Wagers, R.
(1982). Online bibliographic search strategy development. Online, 6(3):12-19.

Henninger, S. (1994). Using iterative refinement to find reusable software. IEEE Software, 11(5):48-59.

Hunt, E. and Waller, D. (1999). Orientation and wayfinding: A review. Technical Report N00014-96-0380, Office of Naval Research, Arlington, VA.

Jansen, B. J., Spink, A., Bateman, J., and Saracevic, T. (1998). What do they search for on the web and how are they searching: A study of a large sample of Excite searches. In Proceedings of SIGIR 98: 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 5-17.

Jones, W. P. and Furnas, G. W. (1987). Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science, 38(6):420-442.

Kopak, R. W. (2000). Functional link typing in hypertext. ACM Computing Surveys (CSUR), 31(4s):5pp.

Marchionini, G. (1995). Information Seeking in Electronic Environments. Cambridge University Press: New York, NY.

Marchionini, G., Dwiggins, S., Katz, A., and Lin, X. (1993). Information seeking in full-text end-user-oriented search systems: The roles of domain and search expertise. Library and Information Science Research, 15(1):35-69.

Melucci, M. (1999). An evaluation of automatically constructed hypertexts for information retrieval. Information Retrieval, 1:91-114.

Olston, C. and Chi, E. H. (2003). ScentTrails: Integrating browsing and searching on the Web. ACM Transactions on Computer-Human Interaction (TOCHI), 10(3):177-197.

Perugini, S., Goncalves, M. A., and Fox, E. A. (2004). Recommender systems research: A connection-centric survey. Journal of Intelligent Information Systems, 23(2):107-143.

Reuters-21578 (2006). Reuters 21578 test collection.

Salton, G. and Allan, J. (1994). Automatic text decomposition and structuring.
In Proceedings of the RIAO Conference: Intelligent Text and Image Handling, volume 1, pages 6-20.

Smeaton, A. F. (1995). Building hypertexts under the influence of topology metrics. In IWHD'95: International Workshop on Hypermedia Design, pages 105-106.

Teevan, J., Alvarado, C., Ackerman, M. S., and Karger, D. R. (2004). The perfect search engine is not enough: A study of orienteering behavior in directed search. In CHI '04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 415-422. ACM Press: New York, NY.

TweakGDS (2006). TweakGDS: A Google Desktop Search plug-in.

van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths: London.

Wilkinson, R. and Smeaton, A. F. (2000). Automatic link generation. ACM Computing Surveys, 31(4es).

Williams, M. D. (1984). What makes RABBIT run? International Journal of Man-Machine Studies, 21(4):333-352.

Chapter 4

A Network Model for Context-Dependent Information Retrieval¹

Spreading-activation networks have been used to represent semantic information, but are not typically used for capturing the occurrence patterns of events. This paper proposes a real-time, incrementally updated temporal index that captures the usage patterns of information objects, independent of how those objects may be otherwise organized or indexed. The resulting structure allows rapid retrieval for temporal queries. The encoding is inspired by properties of human memory as described by the cognitive sciences, with the assumption that human-like encodings exhibit natural behaviour and are intuitive to program, since they provide an unambiguous anthropic mental model for information management.

4.1 Introduction

The ability to discover what we usually do and when it's usually done can be a powerful way to understand and organize our activities.
It can also be used prospectively to predict what we are likely to do in a particular context based on past activity, and to retrieve items that have previously proven useful in that context. Such abilities are useful in a number of ways. Personal information management tools are found to varying degrees in devices that people carry with them at all times. While useful to most (busy) persons, reliable memory prompts in a portable device would also be particularly helpful to the elderly, extending their independence and thereby lightening the growing burden of demographics with aging populations. Such reminders would be particularly useful if they could be provided automatically based on learned event patterns, rather than requiring users to write them into a schedule. Although fixed schedules are useful for reminding us of what must be done, their key disadvantage is that they are unable to reflect behaviours and preferences that change over time. It would be useful to ask of our information management systems such questions as "when is activity x usually done?", or conversely, "what activities usually take place at time t?" It would also be useful to enable sequence-related questions such as "what usually happens before or after activity x or time t?"

¹A version of this chapter has been accepted for publication. Huggett, M. (2007). Doctoral Consortium: A network model for context-dependent information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and Development in Information Retrieval. ACM Press: New York, NY. A version of this chapter will be submitted for publication. Huggett, M. A network model for context-dependent information retrieval. Information Processing and Management (Elsevier), special topic issue on adaptive information retrieval.
While there are many data-mining tools for finding temporal patterns in large databases and in sequences of measurements (Roddick and Spiliopoulou, 2002), the domain of information management systems provides no methods that we know of that build iterative, online temporal characterizations. We propose a method of context-dependent information retrieval called the cue-event-object (CEO) model. The model builds a dynamic summary of observed events incrementally and in real time, and modifies those patterns as user behaviour changes over time, to retrieve information objects when they are most likely to be needed. The representation speeds retrieval on temporal queries, and can trigger reliable reminders of upcoming events based on past behaviour. Although we focus primarily on the temporal dimension, we also suggest how the model can incorporate sensor data such as location or proximity. We propose three basic data structures for temporal spreading-activation networks. First, a temporal subsumption graph (TSG) defines a concept hierarchy of temporal units appropriate to a particular application domain. Second, a temporal node is created as a time-stamp for the occurrence of an event, based on the time units described in the TSG. Third, an event node represents an event, and is linked to temporal nodes that describe its pattern of occurrence. The CEO model uses human-like temporal encoding based on human-memory models of spread of activation (Anderson, 1983; Collins and Loftus, 1975) and forgetting (Anderson and Schooler, 1991). The method employs two structures: an associative temporal network representing time-based patterns, and a hierarchical look-up table of component time units called a temporal subsumption graph (TSG). Temporal nodes can be used to reference events in any domain of temporal interest, such as meeting times, file usage, door openings, or any other observable activities of day-to-day life. Patterns are encoded as real-world events occur.
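As a concrete sketch, the three proposed structures might be represented as follows. This is a minimal illustration in Python; all names, fields, and containers here are our assumptions for exposition, not an implementation from this chapter.

```python
from dataclasses import dataclass, field

# 1. Temporal subsumption graph (TSG): maps each time unit to the
#    larger units that subsume it (e.g., Monday is part of "weekday").
TSG = {
    "Monday": ["weekday"],
    "Saturday": ["weekend"],
    "0900": ["morning"],
}

@dataclass
class TemporalNode:
    """2. Time-stamp node: a conjunction of TSG time units, with an
    activation level reflecting ongoing support."""
    units: frozenset
    activation: float = 1.0

@dataclass
class EventNode:
    """3. Event node, linked to the temporal nodes that describe its
    pattern of occurrence."""
    name: str
    t_nodes: list = field(default_factory=list)

# Linking an event to the time at which it was first observed:
meeting = EventNode("staff meeting")
meeting.t_nodes.append(TemporalNode(frozenset({"0900", "Monday"})))
```

The choice of a frozenset for the unit conjunction makes each temporal pattern order-independent and hashable, which is convenient later when patterns must be looked up and kept unique.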
Specific events are represented by event nodes. When an event first occurs, new temporal nodes are created and linked to the appropriate event node. The temporal nodes function as time stamps. Each time the event occurs again, the corresponding event node is fired, and activation passes along links to connected temporal nodes. Temporal nodes gain activation as they continue to be true of ongoing activities, and decay if they do not. Patterns are discovered in a process of aggregation that combines temporal nodes that are simultaneously true of events. Patterns are dissolved, or disaggregated, if the conjunction of temporal nodes that they represent no longer reflects the ongoing pattern of events. There are several advantages to this approach. First, it provides a current, queryable image of trends in which all events, whether singular or repeated, are represented. Second, it allows fast searching of activity trends by different time granularities, individually or by intersection. Third, the same structure can be used prospectively as the basis for a recommender system, by activating temporal nodes to act as reminders as their time approaches. Reminders of this sort are reminiscent of properties of human memory, which remembers the gist of regularities in the environment. We believe that such cognitive justification can actually make systems more comprehensible and easier to use, by emulating human event memory in a predictable and reliable way, without forgetting. Questions such as "when am I most likely to go skiing?" and "what do I usually do on Tuesdays?" are equally valid. In enterprise environments, temporal indexing can track the usage of "tagged" resources and suggest optimizations through resource redistribution. It can be used as a batch process on databases to find patterns in temporal data and metadata (such as the time at which data is added to a corpus, used, or deleted).
With appropriate TSG timescales, it could be used in science to run batch analyses on data with intervals both short (e.g., particle physics) and long (e.g., geology), building a descriptive temporal index that can be queried in real time. In the next section we look at related work; in Section 4.3 we introduce the structures involved; in Section 4.4 we discuss how data is dynamically organized; and in Section 4.4.4 we discuss the retrieval of queries and reminders.

4.2 Related Work

Related work falls into two general areas. There is a great deal of research on temporal encoding in data mining and machine learning, although such research is not directly relevant to information retrieval (IR) and personal information management (PIM). On the other hand, the nascent area of PIM research focuses more on issues of semantic relevance and clustering, and has not yet developed the potential of temporal encodings.

4.2.1 Temporal Indexing

Temporal indexing is related to data mining, which is primarily concerned with finding patterns and relations of interest in large bodies of data stored in databases. Data mining is also concerned with predicting the evolution of these patterns. Temporal data mining can be divided into five main tasks. Segmentation involves the clustering and classification of data; dependency analysis predicts the values of attributes based on other attributes; outlier analysis finds items that exhibit unusual properties; trend discovery predicts outcomes based on correlations of events; and generalization and characterization focus on compact descriptions of data (Yao, 2003). These tasks are applied to the following time-specific structures. Temporal association rules typically find correlations between items that are included in a single transaction, such as "market basket" purchases (Agrawal et al., 1993).
In the temporal domain, an example would be the observation that marshmallows and cocoa powder are purchased together in winter, but not in summer; knowledge of this kind may be useful in devising marketing strategies. Evolution and maintenance refers to updating rules that change over time. Weak updates involve incremental modifications as more data becomes available; strong updates use an entire data set to replace an old knowledge base with a new one (Pechoucek et al., 1999). Time series (or sequences) are mined to determine after which events an interesting subsequence is expected to occur. Activity monitoring analyzes sequences of events, typically in streams, to detect the occurrence of interesting events; alarm or alert systems can be triggered by outliers (Fawcett and Provost, 1999). Sequence prediction estimates the next value in a large set of potentially long sequences based on past behaviour. To ensure low incremental computational complexity, sequence methods may employ an (exponential) "forgetting" process akin to that of human memory; a sliding window is often used (Puttagunta and Kalpakis, 2002; Yi et al., 2000). Time series analysis often involves identifying curve shapes in data sequences (such as in stock fluctuations), which are used to find similar or unusual behaviours, in the same or other sequences (e.g., Yi et al., 2000). Temporal concept hierarchies are used to structure and simplify temporal pattern finding; they typically employ common relations such as Day → Week → Quarter → Year. Most mining methods that employ hierarchies have difficulty generalizing to multiple hierarchies: for example, classifying episodes that occur on weekends in the summer months is difficult, since the same object must be generalized according to multiple time units (Roddick and Spiliopoulou, 2002).
One solution is to generate candidate-pattern tuples of multiple time units, and then match them against events stored in the database to determine their support, as in (Li et al., 2001). "Interestingness," the problem of deciding which patterns will be of interest, is difficult to specify. The number of potential patterns in a knowledge base is enormous, and the patterns that are most interesting are not necessarily those that are most frequent, or unique, or relevant outside a narrow context (Roddick and Spiliopoulou, 2002). While suggestive in terms of techniques and tools, temporal indexing has not been a focus of information retrieval research, which concentrates primarily on semantic similarity, queries, and relevance. Temporal indexing also takes a very different perspective from research in personal information management (PIM). Unlike temporal indexing, in PIM the user is the focal point, and in a sense information 'follows' the user as the user moves through time and space.

4.2.2 Contextual IR

Information retrieval systems usually index information objects using internal attributes (or cues) that serve as descriptors of an object, although a few also use cues external to the system such as time and physical conditions. The model that we propose adds external-cue indexing to information retrieval, based on established spreading-activation internal-cue methods. In contrast to most data-mining approaches, our goal is to create a system that learns patterns incrementally and in real time, and maintains accuracy despite the "drift" of changing observed behaviours.

Internal Cues

Semantic networks are a popular way to organize information and display its relationships. Nodes represent documents or other information objects, and often also represent attributes of the objects such as descriptive keywords.
Links can represent many different types of relation such as is-a or has-component (Cohen and Kjeldsen, 1987), but complex networks with many link types often need to be built manually due to their semantic subtlety; for large corpora, manual construction is too time-consuming to be practical. The simplest automatically constructed networks use just a single weighted link type that represents the degree of relatedness between objects (Belew, 1989; Jones, 1986). Relatedness can be calculated automatically using a similarity function to compare objects, and retrieval is effected through a traversal process such as spreading activation. The memory extender model (Jones, 1986) connects a layer of document nodes to a layer of attribute nodes. Given a keyword query, the corresponding keyword nodes are activated, and activation passes to connected documents. The document that receives the most combined activation from the set of keywords is the best response. Activation can spread back from nodes activated in the document layer to all connected terms, including those not part of the initial query set, creating a form of keyword expansion. The Associative Information Retrieval (AIR) model (Belew, 1989) also uses nodes to represent documents and keywords in different layers. Document nodes are connected to nodes representing descriptive keywords, but unlike the memory extender (Jones, 1986), related documents are also connected directly to each other, as are related keywords. AIR uses relevance feedback from its users to modify the strength of links, thereby increasing the accuracy of future retrievals. Such spreading-activation models are often constrained to ensure that their propagation will terminate. One method divides the activation at a node between all outgoing links, so that each neighbour node receives only a fraction of the initial activation value.
This ensures that the activation is dissipated at each propagation step, and propagation stops when the activation reaching a node falls below some threshold value. Another method is simply to set the number of propagation steps in advance; four steps seems to be a generally good value for balancing precision against recall (Cohen and Kjeldsen, 1987).

External Cues

Whereas internal cues are part of the semantic ontology within a system, external cues represent the world outside a system. The most useful and common external cues involve clock time, and sensor data such as location, temperature, velocity, luminance, etc. External cues can complement internal cues to retrieve objects under particular circumstances, such as at scheduled meetings, or on the way to particular destinations. Time Cues — As mentioned above, work in temporal indexing has explored temporal association rules used for data mining of item-purchase correlations in market baskets, time series (or sequences) used to identify recurring conditions that precede critical events such as a rise in the stock market, and temporal concept hierarchies that represent the nesting of temporal periods of different magnitude (Roddick and Spiliopoulou, 2002). Most of these are offline batch processes that perform data mining on a large data corpus. Systems for personal information management (PIM) often perform temporal indexing by treating transaction-time information such as date of object creation as an additional keyword attribute to be used as a search term (Adar et al., 1999; Dumais et al., 2003), and represent events on a linear timeline (Fertig et al., 1996; Gemmell et al., 2002). Cyclic recurrence patterns are not yet an integral part of the PIM perspective.

Figure 4.1: The CEO model showing the three layers of cues, events, and objects. The cue layer shows both basic cues (e.g., Monday) and compound cues that represent conjunctions of basic cues.
Sensor Cues — Contextual IR systems sense the environment, and then parse the information into standard terms that can be used to retrieve a rank-ordered list of pre-indexed documents (Rhodes and Maes, 2000). With the growing prevalence of mobile "wearable" computing, which includes personal digital assistants and cell phones, interest in presenting information in a context-dependent way has grown. Environmental attributes include sensors for acceleration, noise level, luminosity, humidity, etc. (Himberg et al., 2001), and devices with GPS and other positioning technologies are increasingly common. Active-badge technology has been available for some time (Want et al., 1992). Such wearable systems record the "history" of a user's wanderings and their interactions with instrumented objects such as phones and workstations, together with a time stamp. They can track where information objects have been obtained or printed, to whom they were lent, where they were last left, etc.; users can then search their histories based on time and these interactions (Lamming and Flynn, 1994).

4.3 The CEO Model

Based on existing work in IR on semantic networks and spreading-activation retrieval, we propose a model for context-dependent IR. Our Cue-Event-Object (CEO) model describes an online real-time system that automatically builds an incremental representation of the typicality of recurring events. The CEO model can be used to add context dependence to any information system that is organized in terms of discrete information objects. An event is a particular type of occurrence, such as "going to the opera"; an object is a datum associated with the event, such as "concert program" or "menu"; and a cue is a time unit that is associated with the event, such as "Saturday". A single event type may be associated with multiple cues and objects.

4.3.1 Cues, Events, and Objects

Similar to prior IR systems based on semantic networks, the CEO model uses nodes to represent both information objects and attributes, and a single link type weighted to represent the strength of relation between two nodes (Figure 4.1). The nodes are divided into three layers. The Cue layer contains all contextual attribute nodes, which represent fixed time units such as "0900h" or "Monday". The units are defined in the model in a temporal concept hierarchy that we call a Temporal Subsumption Graph (TSG). The TSG (see Section 4.3.2) defines the relative granularity of units (e.g., "Monday" is a component of "weekday"). The Cue layer also contains dynamically created compound-cue nodes (dubbed T-nodes) that represent patterns of two or more attributes, such as "0900 ∧ Monday". These T-nodes represent temporal patterns, and are created as conjunctions of the basic time units provided in a TSG. T-nodes are created when an event is first observed at a particular time, and are unique with respect to the pattern that they encode. Each T-node is linked to one or more events if the events occur at the time specified by the T-node. T-nodes have an activation level that reflects their support, i.e., the degree to which observed events continue to occur at that time; this allows fast queries to determine what times see the most activity. Below some minimum threshold of support, the T-node can be deleted to reclaim storage space. It is not always necessary to create T-nodes: event nodes may be directly connected to a time unit if events occur uniquely with respect to that unit. For example, an event that occurs without fail every day at 0900h needs only to be connected to the unit for 9am; in such a case, other time units would be redundant. The Event layer contains event nodes that act as placeholders: they are merely the "glue" between cues and objects.
Events are analogous to the convergence zones proposed by some memory models to describe how the brain associates sets of stimuli dynamically into memories (Moll et al., 1994). Event nodes represent an event in the environment whose temporal patterns may be of interest. Event nodes have an activation value that reflects degree of usage. When an event occurs, the corresponding event node receives a "jolt" of energy. Thereafter its activation decays, such that the activation level can be used for fast queries of which events occur most recently and frequently, independent of other data structures. When connected to sensors in the cue layer, event nodes may represent the usage of an object or service, which can be observed when the sensor cues are stimulated, as for a door opening or a combination of meteorological readings. More significant events could likewise be captured, such as borrowing a book from the library. The Object layer contains nodes that represent documents, pictures, and other information objects that are associated with particular events. It can also contain action nodes that trigger an effect external to the system, such as dispensing medication at scheduled times or when biosensor readings fall within a set range. Together, the layers of the CEO model encode activity patterns of when events occur. For example, the first time an object (such as a schedule) is used on Monday at 9am, a "0900 ∧ Monday" compound cue is created, as well as an event node that represents this particular combination of cues and objects. The cue nodes are linked to the event node, which is linked to the relevant objects. If the schedule is used again at 9am on the following Monday, then the activation levels of all the links and nodes in the pattern grow stronger; otherwise they decay and are "forgotten".
Figure 4.2: An example TSG. As we move up the temporal hierarchy from hours to seasons, items in curly braces are aggregable, i.e., 0600–1200h is morning, and July is in the summer. Items in square brackets are not aggregable, e.g., morning is independent of the day of the week.

4.3.2 Temporal Subsumption Graph (TSG)

TSGs are templates that define the time scales within which events can occur, and form a hierarchical lattice that shows how measurements of time are defined in terms of sets of smaller time increments. The TSG also indicates which units are aggregable, in that some temporal sub-units (e.g., Monday) may be descriptive of some super-units (e.g., weekday). The contents of a TSG are subjective and arbitrary, and are intended to describe events in a particular domain; only time units that may be useful to a particular application need be included. In terms of a cognitively based model, the TSG serves a function akin to the brain's neocortical microcircuit, which maintains a virtual continuum of timescales of information processing, with time constants of activity ranging from a few milliseconds to years (Denham and Tarassenko, 2003). Figure 4.2 shows an example of a calendar-based TSG appropriate for capturing events in a personal information archive, which is based primarily on the day-to-day rhythm of human activity. As well as objective measures of time such as hour, day, and month, the example also includes increments for standard informal divisions: people can easily have different interpretations of "morning". Some common scales, such as seconds and minutes, may be considered unnecessary. As such, TSGs are not necessarily expected to be objectively true or complete, but to fulfill a particular purpose.
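For illustration, part of the Figure 4.2 TSG can be written as a simple lookup table of aggregable groupings. This is a sketch under our own assumptions: the function name is ours, only two levels are filled in, and the month-to-season grouping below is one plausible reading of the figure.

```python
# Partial encoding of the Figure 4.2 TSG as an "aggregable" map: which
# sub-units jointly make up a super-unit. Season boundaries are one
# plausible reading of the figure, not a definitive transcription.
AGGREGABLE = {
    "weekday": {"Mon", "Tue", "Wed", "Thu", "Fri"},
    "weekend": {"Sat", "Sun"},
    "summer": {"June", "July", "August"},
}

def subsuming_unit(units):
    """Return the super-unit whose components exactly match `units`,
    or None if the set is not aggregable (e.g., {Mon, Fri})."""
    for parent, children in AGGREGABLE.items():
        if set(units) == children:
            return parent
    return None
```

Because the table is consulted only when a complete set of sub-units is present, partial sets simply fall through, which matches the "not aggregable" cases noted in the caption of Figure 4.2.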
4.4 Temporal Patterns

Finding temporal patterns and updating them to reflect changing circumstances involves three processes: adding new patterns (as T-nodes), combining existing patterns to reflect more general regularities, and deleting patterns that no longer match observed events. The CEO model captures the segment-wise periodicity of events that occur at particular times (Han et al., 1998), in the sense that not all of the segments in a time sequence will have cyclic behaviour: activity x may occur regularly at 9am on Monday mornings, but could be randomly preceded or followed by other events. Popular encoding methods such as the fast Fourier transform (FFT) cannot be applied to mining segment-wise periodicity, because FFT treats the time series as an inseparable flow of values (Han et al., 1998). A cycle is formed if, throughout the whole time series being examined, there exist (with some high probability) equally spaced similar observed values of some time-related attribute (Han et al., 1998). The representation of temporal patterns may be sparse to the degree that it only expresses the times at which events have occurred.

4.4.1 Encoding the Temporal Patterns of Events

The process of encoding temporal occurrence patterns starts with a TSG and a catalogue of recognizable events. Every time an event occurs, the model records its occurrence in one of several ways:

• If an event occurs at a time when no other events occur, a new T-node is generated for that time, and the event is connected to that T-node.
• If an event occurs for the first time at a time when other events do occur, the new event is connected to the existing T-node representing that time.
• If an event re-occurs at the time represented by a T-node to which it is connected, all the nodes and links corresponding to that specific time are stimulated.
• If an event fails to occur at the time represented by a T-node to which it is connected, all the nodes and links corresponding to that specific time are decayed.

Each newly generated T-node is temporally specific: all of the levels of the TSG are represented in the T-node's conjunction of units. For example, with reference to the TSG in Figure 4.2, one possible T-node timestamp pattern would be 0600 ∧ Monday ∧ July. The T-node is linked with a default weight to the event node and to all of its component time units in the TSG: 0600, Monday, and July. These links are dynamic, and their weight represents the support of the pattern for the event. Ultimately, we want to capture patterns at multiple levels: what happens at certain times of year, or day, etc.; but without further information, we can initially only assume that this is a once-only event. A single T-node may be connected to an arbitrary number of event nodes; in effect, this indicates that the events occur simultaneously. If a new event occurs at the time encoded in an existing T-node, the event's node is linked to that T-node with a default weight. As events continue to recur at a given time, the links between the event nodes and the T-nodes are strengthened asymptotically. Conversely, if the timestamp in a T-node becomes current but an event node to which it is connected does not fire (i.e., the event is not detected at the specified time), then the link from the T-node to the event node decays.

Figure 4.3: A temporal pattern before aggregation. When the T-node for Friday is added, it becomes clear that the set of T-nodes shows a clear pattern for weekdays.

Pattern learning occurs with a probe-and-confirm "ping". In the probe phase, the appropriate cue nodes as represented in the TSG fire when they match the current time. For example, at 9am on Monday the 0900h and Monday cue nodes will fire, and activation will flow along links to the connected T-nodes that correspond to these units.
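The recording rules above might be sketched as follows. The index structure, the default link weight, and the asymptotic strengthening and decay rates are all our assumptions; the chapter specifies only that links strengthen asymptotically toward a maximum and decay when an expected event fails to occur.

```python
DEFAULT_WEIGHT = 0.1   # weight of a newly created link (assumed value)
STEP = 0.2             # asymptotic strengthening rate (assumed value)

def record(index, timestamp, event):
    """Record that `event` occurred at `timestamp` (a frozenset of time
    units). `index` maps timestamps to T-nodes, here plain dicts that
    hold per-event link weights."""
    t_node = index.setdefault(timestamp, {"links": {}})  # new T-node if needed
    links = t_node["links"]
    if event not in links:
        links[event] = DEFAULT_WEIGHT                # first occurrence here
    else:
        links[event] += (1.0 - links[event]) * STEP  # strengthen, bounded by 1.0

def miss(index, timestamp, event):
    """The event failed to occur at a time it is linked to: decay the link."""
    t_node = index.get(timestamp)
    if t_node and event in t_node["links"]:
        t_node["links"][event] *= 0.5                # one decay step (assumed)
```

The `setdefault` call covers the first two rules (create a T-node, or reuse an existing one), while the branch on `event not in links` separates first occurrence from re-occurrence.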
T-nodes that receive activation from all their component time cues are considered "current". As a result of this priming activation, T-nodes actively listen for activity in their connected event nodes. In the confirm phase, if an event occurs while a connected T-node is primed, then activation from the event node passes along the link to the T-node, and then on through links to temporal units in the TSG. The activation levels of the event node, the T-node, and the temporal units in the TSG, and the weights of all the traversed links, increase asymptotically. If an event node does not fire before an active T-node ceases to be current, then the link from the event node to the T-node is decayed. If none of the event nodes connected to a T-node fire, then the T-node, its component time-unit nodes, and the links that connect them are all decayed. Here and elsewhere, decay is subject to standard models of forgetting (e.g., Anderson and Schooler, 1991); we believe that the CEO model should decay relations at a rate slightly slower than the typical human rate, so that users are reminded of what they are not using as those skipped patterns are gradually replaced by more current patterns. Subsequent event occurrences are processed in the same manner until an event node gains more than one T-node. A popular event node can accumulate a large number of T-nodes over time. To represent higher-order patterns and to promote a more parsimonious representation, an aggregation test is performed as new T-nodes are added. If the test is successful, then multiple T-nodes are combined into a single node that reflects trends at a higher temporal granularity.

4.4.2 Temporal Aggregation

Patterns are simplified and kept up to date through a process of aggregation. The TSG indicates which units may be subsumed. For example, if all weekend days are present in a compound cue node ("0900 ∧ Saturday ∧ Sunday"), they can be replaced in the pattern by the single "weekend" label in the TSG: "0900 ∧ weekend".

Figure 4.4: A temporal pattern after aggregation. The set of T-nodes has been replaced with a single T-node representing 9am on weekdays. This pattern can dis-aggregate if all the cue nodes of the pattern do not continue to be equally stimulated within some tolerance.

This pattern remains in effect as long as its links receive the same support. If the event ceases to take place on Sunday, then the link from "Sunday" to "0900 ∧ weekend" decays relative to the other links in the pattern, and the compound node is dis-aggregated to "0900 ∧ Saturday". Partial patterns are accommodated by using a logical statement. For example, an event that occurs consistently at 9am but only on Monday and Friday cannot be simplified to "0900 ∧ weekday". Instead, a logical statement is used in the compound cue node: "{Mon | Fri} ∧ 0900". The pattern can be further adjusted depending on the "drift" of observed behaviour. For aggregation to occur, the links from basic cues to the T-node must receive an equivalent amount of support. The support of links is determined using a combination of link activation values and event density. Activation is based on standard models of human memory (Anderson and Schooler, 1991; Hebb, 1949) in which an object rises to a maximal value (usually 1.0 on a normalized scale) when stimulated, and then decays asymptotically to zero over time unless stimulated again. In the CEO model, activation decays each time that a pattern is not stimulated as expected. For example, if a pattern exists for "Monday ∧ 0900" and the event does not occur on Monday at 9am, then the activation value of the pattern and its links is decayed by one step; it is decayed further for each subsequent step that the pattern is not stimulated.
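A sketch of the aggregation step for day units, under the assumption that the TSG supplies the component sets and that partial patterns fall back to the disjunction notation used above; the function name and groupings are illustrative.

```python
# Day-unit groupings as the TSG of Figure 4.2 would supply them.
WEEK_ORDER = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
GROUPS = {"weekday": {"Mon", "Tue", "Wed", "Thu", "Fri"},
          "weekend": {"Sat", "Sun"}}

def aggregate_days(days):
    """Collapse a set of supported day units: a complete super-unit set
    becomes its TSG label; an incomplete set becomes a disjunction such
    as '{Mon|Fri}' (notation as in Section 4.4.2)."""
    days = set(days)
    for label, members in GROUPS.items():
        if days == members:
            return label
    return "{" + "|".join(d for d in WEEK_ORDER if d in days) + "}"
```

Emitting the disjunction in a canonical weekday order keeps the compound-cue label stable regardless of the order in which support was accumulated.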
The normalized activation value a can be determined from the number of steps t since an event was last observed, with

a = 2^(−t/h)

where h is the half-life: the number of steps until the activation drops to half its current value. Activation by itself is not enough to determine support, since recent activity may not be a reliable indicator of persistent trends. Various statistical learning methods may be substituted here, but for illustrative purposes we adopt a simple sliding-window approach. Event density is used to gauge the 'reliability' of a pattern, and takes two forms. Total event density p is simply expressed as the proportion of times the pattern has been stimulated, relative to the number of times that it could have been stimulated. Recent event density u is the proportion of times that a pattern has been stimulated within a sliding window of the n most recent steps.

Figure 4.5: Determining support for aggregation for a given event ("0900 ∧ Weekday"). Each day-line (Mon–Fri) represents links to a separate T-node. For a given T-node, a indicates the recency of the event, p its frequency since first occurrence, and u its recent frequency (here in a sliding window of size 3). All three values must be within a specified ε for aggregation to occur. The values shown are:

Day   a    p    u
Mon   0.9  0.8  0.6
Tue   0.9  0.5  0.6
Wed   0.6  0.5  0.0
Thu   0.9  0.5  1.0
Fri   0.6  0.5  0.0

For pattern aggregation to occur, the component links must have tolerably similar values for activation and event densities. Figure 4.5 shows an example of determining support for the aggregation of a particular event that happens daily at a particular time (for example, buying coffee at 9am). The dots in the diagram show on which days the event has occurred over the 6 weeks since the behaviour began. The sliding window has size 3.
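The three support measures can be written directly from their definitions; the sliding-window representation below (a list of 0/1 observations, one per step) and the function names are our assumptions.

```python
def activation(t, h):
    """Normalized activation a = 2^(-t/h), where t is the number of steps
    since the event was last observed and h is the half-life."""
    return 2.0 ** (-t / h)

def total_density(history):
    """p: proportion of steps at which the pattern was stimulated, out of
    all steps since its first occurrence. `history` holds 0/1 per step."""
    return sum(history) / len(history)

def recent_density(history, n):
    """u: stimulation rate within a sliding window of the n most recent steps."""
    window = history[-n:]
    return sum(window) / len(window)

def similar_support(triple_a, triple_b, eps=0.05):
    """Aggregation test: all three (a, p, u) values must agree within eps."""
    return all(abs(x - y) <= eps for x, y in zip(triple_a, triple_b))
```

With the values from Figure 4.5, `similar_support((0.6, 0.5, 0.0), (0.6, 0.5, 0.0))` holds for Wed and Fri, which is why those two days are the aggregation candidates.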
The different days have similar values for a, p, and u, but only Wed and Fri share all three, and are thus reasonable candidates to create the new pattern "{Wed | Fri} ∧ 0900".

4.4.3 Temporal Disaggregation

T-nodes will "break apart" if the pattern that they represent is not fully and persistently maintained. Such fragmentation is a necessary part of adjusting to changing activity patterns. Disaggregation is triggered when one of the pattern's links from cue to T-node differs from the others by more than some ε, and follows a two-step process. First, the T-node is decomposed into its individual component sub-patterns. Second, the sub-patterns are re-aggregated if their support is sufficiently similar. The sub-pattern with the weakened links is thereby separated from the persistent parts of the pattern, which continue to be aggregated if they see the same amount of use. For example, if the pattern weekday ∧ 0900 ceases to be true on Fridays, then the link from the T-node to the cue node for Friday will become weaker than those to the other weekday nodes. Following the two-step disaggregation process, Friday is represented in a separate T-node, while a new partially aggregated T-node combines the remaining days that are still supported by a compound pattern (Figure 4.6). The singleton pattern is thereby able to represent a different pattern of support for the same event, and may be re-aggregated with the larger T-node if their respective support levels once again coincide. In the case where a pattern is ignored completely, all of its links decay equally, and therefore the pattern is maintained whole even as its currency fades.

Figure 4.6: A temporal pattern after disaggregation. When a component of the pattern weakens with respect to the other components, it is "broken out" of the pattern.

4.4.4 Retrieval

Retrieval is performed by spreading activation. To ask "what is usually done at time T?"
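The two-step disaggregation process might look like this in outline; the ε threshold, the link-weight dictionary, and the example weights are illustrative, not measured values.

```python
def disaggregate(links, eps):
    """Two-step disaggregation sketch. `links` maps each cue unit of a
    compound T-node to its link weight. Step 1: decompose the pattern;
    step 2: re-aggregate the units whose support stays within eps of the
    strongest link. Returns (retained_units, broken_out_units)."""
    strongest = max(links.values())
    retained = {u for u, w in links.items() if strongest - w <= eps}
    return retained, set(links) - retained

# Example: "weekday ∧ 0900" ceasing to hold on Fridays, so Friday's
# link has decayed relative to the others (weights are hypothetical).
weights = {"Mon": 0.9, "Tue": 0.9, "Wed": 0.85, "Thu": 0.9, "Fri": 0.4}
kept, split = disaggregate(weights, eps=0.1)
```

The broken-out units can then be re-examined by the same aggregation test later, mirroring the chapter's note that a singleton pattern may rejoin the larger T-node if support levels coincide again.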
activation is introduced at all temporal cue nodes that are components of T. For instance, asking "what is usually done on Monday at 9am?" activates the Monday and 9am nodes. The activation flows through any connected compound cue nodes, to connected event nodes, and finally to connected object nodes, which are retrieved. The spread of activation is structurally constrained: a spreading-activation query only propagates to the limits of the pattern into which it was originally introduced, from cues toward objects. In contrast to the explicit user-posed queries typical of database systems, implicit retrieval surfaces events as the times at which they typically occur become current, or listens for an anticipated action. Two kinds of activation are used in the encoding of temporal events:

- priming activation originates from the cues and primes events as they become current, in anticipation of use. Priming activation can incorporate a lead-in time that serves to give the user some warning in advance of the event.

- event-generated activation originates from a primed event's objects if they are used, confirming that an event has taken place as anticipated.

Only event-generated activation has a lasting effect on link and node strengths; otherwise the patterns of events would be corrupted. For the other types of activation, the original state of the nodes involved is stored so that it may be re-instated after priming, queries, or reminders are concluded.

4.5 Experiment: Real-World Trip Data

To answer our research questions, two approaches can be taken. The traditional data-driven approach is to test the constraints and properties of the algorithms with corpora of real or synthesized data. A second user-driven approach is to test the system on real users; the users may be given tasks to perform in a laboratory setting, or they may use the test system in a longitudinal setting as part of their daily lives.
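The cue-to-object flow of Section 4.4.4 might be sketched as below. This is a minimal illustration, not the thesis's implementation: we assume a layered graph running only from cues through T-nodes to events and objects (so the structural constraint is built into the edge directions), and, as a simplification, propagate the maximum incoming activation rather than a sum.

```python
# Minimal sketch of structurally constrained spreading activation.
# Graph encoding, max-propagation, and default weight are assumptions.

def spread(graph, weights, cue_nodes):
    """graph[node] lists downstream neighbours (cues -> T-nodes ->
    events -> objects); weights maps (src, dst) -> link weight.
    Returns the activation reached by every touched node."""
    act = {n: 1.0 for n in cue_nodes}
    frontier = list(cue_nodes)
    while frontier:
        nxt = []
        for node in frontier:
            for nb in graph.get(node, ()):
                a = act[node] * weights.get((node, nb), 1.0)
                if a > act.get(nb, 0.0):   # keep the strongest path
                    act[nb] = a
                    nxt.append(nb)
        frontier = nxt
    return act
```

Retrieval then amounts to ranking the object nodes by their final activation and keeping those above a threshold.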
The CEO model is appropriate for all of these testing methods.

Data-driven testing was performed as proof-of-concept. As part of an industrial internship, the CEO model was implemented at the research laboratory of a major automobile manufacturer.² The experiment tests the model's ability to retrieve appropriate route-information documents based on time and location-sensor data. The test corpus comprises a stream of in-car activity data generated by a single driver over the space of 6 months. The hypothesis is that any regularity in the driver's activities would be encoded by the CEO model, and that the relative strengths of the patterns would reflect the user's preferences and degree of consistency.

4.5.1 Motivation and Goals

Modern product developers recognize the value that information technology can bring to a user's experience of their products. New cars increasingly include information systems that provide drivers with such information as the state of the car's mechanical and control systems, navigation choices, and traffic data. As a next step, the development of automated contextually appropriate services that improve the in-car experience is of particular interest. Such services include route reminders, way-point stop reminders, navigation information, and music selection. One goal of in-car information management is to make driving safer by inferring driver preferences and environmental conditions so that control systems can present contextually appropriate menu options with minimal user intervention. Longer-term research interests include dynamically altering the driving experience in response to changing conditions, to make it more pleasurable and less stressful. According to this research agenda, cars develop their own user models: driver skill is modeled on whether drivers have had problems in the past, and under what conditions.
On-board music selection is automated to time of day and the driver's route, based on collaborative filtering models and refined by the driver's own choices. Contextual reminders (e.g., for an unplanned stop along the way) are provided based on the route taken, coupled with needs that can be detected by sensors or from passengers' other information devices. Similarly, if a particular destination can be predicted based on time, route, or persons in the car, the on-board system will retrieve appropriate traffic data to warn the driver against traffic slowdowns and suggest alternate routes before getting under way. Based on this vision, our research question is: using real driving data, can the CEO model reliably predict a trip's destination based on departure point, time of day, and day of the week?

4.5.2 Set-Up

Driving data was captured using a GPS-instrumented vehicle, representing the driving patterns of a single driver over a 6-month period. The car was driven throughout an area of south-east Japan where the driver lived and worked. The time-stamped GPS logs were introduced to the CEO model in simulated real time, i.e., incrementally with individual trips in correct order as they had actually occurred.

²The company has requested anonymity to protect its research agenda, although it has released the data and results for publication.

The CEO model was implemented for the experiment in a simplified form omitting months and seasons, but providing cue nodes for day of the week and clock time recorded in hours and minutes; as none of these units are aggregable, we leave an evaluation of the effects of aggregation to future experiments.³ Departure and arrival termini were represented as sensor cue nodes, and were added to the model as they were encountered at run-time in the GPS logs.
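The simplified temporal encoding above (day of week plus hour and minute, with no months or seasons) can be sketched as a cue-extraction step over the log timestamps. The node-naming convention here is our own assumption, purely for illustration.

```python
# Sketch of extracting the simplified temporal cue nodes described
# above from a timestamped GPS log entry. Names are our convention.

from datetime import datetime

def temporal_cues(ts):
    """Return the temporal cue-node names active for an ISO timestamp:
    day of week, clock hour, and clock minute."""
    dt = datetime.fromisoformat(ts)
    return {f"day:{dt.strftime('%a')}",
            f"hour:{dt.hour:02d}",
            f"minute:{dt.minute:02d}"}
```

On departure, these cues would be fired together with the sensor cue node for the departure terminus.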
If closely spaced, individual trip termini were combined to represent one destination; all trip termini whose 100-metre radii overlapped were treated as a single terminus. The intuition is that multiple trips to the same site, though not parking in exactly the same spot, should correctly be treated as the same destination. On departure from a terminus, the CEO model spreads activation from the sensor cue node representing that location, and also from the temporal cue nodes for the current clock-time hour and minute. The most highly activated retrieved terminus is the predicted destination. Other destinations above a set threshold are treated as possible alternative destinations. A run-time dynamic map of the data is shown in Figure 4.7. The map starts by depicting a single trip, then zooms out to include other trips as they occur in the logs. Latitude and longitude values on the map indicate the scale of the image. Routes are represented as arrows between termini; lighter arrows represent more recent trips, and fade to black the longer that they are not traveled. Route predictions from the current starting point are shown as light-coloured circles.

³The algorithms of the experiment are described in Appendix A.2.

Figure 4.7: A run-time map of route predictions. Cumulative data are displayed in the top-right corner. Other numbers on the map represent latitude and longitude readings inferred from the GPS data. Lighter-coloured arrows represent more recent trips. Three light-coloured circles show the predictions for the current time.

4.5.3 Results

Our test driver performed 133 trips over the course of the experiment, resulting in 16 terminus locations connected by 30 routes. The system created 45 T-nodes to connect the cue nodes to the event/route nodes. Results show that, despite the noise of everyday driving patterns, the CEO model quickly made correct predictions, rising rapidly after 20 trips and averaging over 80% accuracy after 80 trips (Figure 4.8).
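The terminus-merging rule of Section 4.5.2 (trip end points whose 100-metre radii overlap are treated as one terminus) can be sketched with a simple greedy pass. The thesis does not specify the clustering algorithm, so this is only one plausible reading, with positions pre-converted to metres for simplicity.

```python
# Illustrative greedy merging of trip termini: a point joins the first
# existing terminus whose 100 m radius overlaps its own (centre
# distance <= 200 m), otherwise it founds a new terminus.

import math

def merge_termini(points, radius_m=100.0):
    """points are (x, y) positions in metres; returns terminus centres."""
    termini = []
    for p in points:
        for centre in termini:
            if math.dist(p, centre) <= 2 * radius_m:
                break                      # radii overlap: same terminus
        else:
            termini.append(p)              # no overlap: new terminus
    return termini
```

A greedy pass like this is order-dependent; a more careful implementation might re-centre each terminus on the mean of its member points.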
Similar increases were seen in accuracy in terms of the number of individual routes taken (Figure 4.9) and the number of individual termini (Figure 4.10).⁴

⁴A table of experimental data with results is shown in Appendix C.

Figure 4.8: Increase of predictive accuracy as a function of number of trips recorded.

Human behaviour is conspicuously non-random—most people generate a limited number of termini over time, and are constrained by the various rhythms of work, recreation, relationships, meals, and daylight. As expected, then, we found that as the number of trips increases, the number of routes taken and of termini visited rises quickly at first (i.e., every route and trip is novel), but then more and more routes and termini match previously observed values (Figure 4.11).

4.6 Extending the Model to Personal Information Management

The CEO model can be extended to more traditional forms of personal information management, i.e., the creation, organization, and retrieval of documents, by including sensor cues that attend to file creation, access, and manipulation. In conjunction with semantic search, the addition of temporal trend information can specify a narrower context that can increase the accuracy of object retrieval.

4.6.1 Searches and Queries

The system can also be expanded to accept explicit queries with regard to any of the cues. Explicit queries propagated from cues to objects answer questions similar to the implicit queries of the automatic retrieval model (Section 4.4.4), but (in the temporal case) in the form of "what usually happens at this time?". By propagating queries from objects towards the cues, users can also ask such questions as, "when is this object usually used?" Such queries do not alter the link weight values of the CEO model; however,

Figure 4.9: Increase of predictive accuracy as a function of number of unique routes travelled.
Figure 4.10: Increase of predictive accuracy as a function of number of distinct termini.

Figure 4.11: Increase of routes and termini as a function of number of trips. As human behaviour mostly repeats past patterns, the curves are expected to grow flatter over time.

with a reflexive extension to the model, user queries could also be explicitly represented, and become events in the CEO model. Specifically, with respect to the two types of query:

what-happens-when — TSG units are fired at the queried time, and activation flows along links to connected T-nodes and event nodes. The amount of activation that reaches the nodes is determined by the link weights. Activation rises in the nodes, and the most highly activated event nodes above a threshold are retrieved as the events most likely to occur at the queried time.

when-does-it-happen — The event node of the queried event is fired, and activation flows to the connected T-nodes and TSG units. The most highly activated T-nodes and units above a threshold are retrieved as the times at which the queried event is most likely to occur.

4.6.2 Reminders

The ultimate purpose of the CEO model is to make use of its accurate predictions, to recommend information objects to users that are appropriate to their current context. It may not be enough to fetch an object at the exact time that it has been shown to be useful; for reminders to be effective, they need to lead the target time so that their use can be easily incorporated into the user's intentions. For instance, reminding a user of a meeting at the time that the meeting is scheduled to start does not give them time to prepare and arrive on time. To give users enough warning of an impending event, events can be "pre-fetched".
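The two query directions of Section 4.6.1 (what-happens-when and when-does-it-happen) can be sketched over a flattened cue-to-event weight table. Summing incoming activation, and all names here, are illustrative assumptions rather than the thesis's specification.

```python
# Sketch of the two explicit query directions described in 4.6.1.
# `links` maps (cue, event) -> link weight; both functions and the
# additive combination of activation are our own assumptions.

def what_happens_when(links, time_cues, threshold=0.0):
    """Fire the queried time cues; return events whose summed incoming
    activation exceeds the threshold."""
    scores = {}
    for (cue, event), w in links.items():
        if cue in time_cues:
            scores[event] = scores.get(event, 0.0) + w
    return {e: s for e, s in scores.items() if s > threshold}

def when_does_it_happen(links, event, threshold=0.0):
    """Fire the queried event; return the time cues it activates."""
    return {cue: w for (cue, ev), w in links.items()
            if ev == event and w > threshold}
```

Neither direction alters the stored link weights, matching the read-only behaviour of explicit queries described above.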
The amount of lead time—minutes, hours, or days—would depend on such factors as the user's proximity to the event. A meeting in the same building would require a warning of perhaps just a few minutes; one in another city would require considerably more time. This lead-in time could be calculated as a proportion of the event's periodicity, for example 10% of the cycle in advance of the event. For instance, reminders for an event that happens daily at a given time would have a lead-in time horizon of perhaps an hour, while a reminder for an event that happens once per year would have an initial lead-in time of perhaps a month. Such reminders are provided by adding the lead-in time interval to the current time, and applying activation to the cues at the resulting time. Activation begins quietly at first as a reminder, and grows stronger into an alarm as the time approaches, such as warnings to pay bills as the due date nears. The difference between reminder and alarm is therefore a difference of degree.

4.7 Conclusion

We have described a simple, incremental, real-time method of modeling user information behaviour, in the cue-event-object (CEO) model. Cues are sensitive to temporal and physical stimuli, and trigger retrieval of appropriate information objects under appropriate circumstances. To summarize event patterns, the model uses a Temporal Subsumption Graph (TSG), a hierarchy of time units (such as defined by the calendar) to aggregate frequent, consistent patterns into higher-order time units. TSGs are user-definable, such that the temporal subsumption hierarchy can be made useful to a particular domain or application. The model captures ongoing cyclic trends in behaviour, as well as highly specific temporal occurrences, such as the different destinations visited on each annual holiday. Results of a user-log experiment suggest that the CEO model is a viable approach to building accurate activity-pattern profiles in real time.
Compared to typical existing temporal indexing methods, our approach —

• does not predict sequences per se, but does predict the most likely events to occur at a given time;
• operates online in real time;
• uses human-like "forgetting" of unsupported patterns to keep the knowledge base current;
• maintains the temporal knowledge base with incremental (weak) updates as events are observed;
• allows patterns to evolve through fragmentation and re-assembly;
• detects outliers in terms of highly specific patterns (i.e., singleton nodes with low support);
• uses temporal concept hierarchies to define the time units relevant to a corpus;
• uses networks to efficiently generalize patterns to multiple time units;
• leaves up to the user the question of what is most interesting in their pattern of activities.

Future work

Given that the corpus was generated by one driver in a single driving environment, the results are difficult to generalize. Larger corpora are necessary, which may be available from other industrial or governmental sources, such as the recently released Microsoft Multiperson Location Survey (Krumm and Horvitz, 2006). It would also be particularly useful to examine how context can assist users in proactive semantic search in typical personal information management situations.

Bibliography

Adar, E., Karger, D. R., and Stein, L. A. (1999). Haystack: Per-user information environments. In Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM99), pages 413-422. ACM Press: New York, NY.

Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Buneman, P. and Jajodia, S., editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216.

Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3):261-295.

Anderson, J. R. and Schooler, L. J.
(1991). Reflections of the environment in memory. Psychological Science, 2(6):396-408.

Belew, R. K. (1989). Adaptive Information Retrieval: Using a connectionist representation to retrieve and learn about documents. In Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11-20. ACM Press: New York, NY.

Cohen, P. R. and Kjeldsen, R. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing & Management, 23(4):255-268.

Collins, A. M. and Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6):407-428.

Denham, M. and Tarassenko, L. (2003). Sensory processing. Technical Report of the Foresight Cognitive Systems Project (Research Review), Office of Science and Technology, Department of Trade and Industry, London, UK.

Dumais, S. T., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R., and Robbins, D. C. (2003). Stuff I've Seen: a system for personal information retrieval and re-use. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72-79. ACM Press: New York, NY.

Fawcett, T. and Provost, F. (1999). Activity monitoring: noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), pages 53-62. ACM Press: New York, NY.

Fertig, S., Freeman, E., and Gelernter, D. (1996). Lifestreams: An alternative to the desktop metaphor. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '96), pages 410-414. ACM Press: New York, NY.

Gemmell, J., Bell, G., Lueder, R., Drucker, S., and Wong, C. (2002). MyLifeBits: Fulfilling the Memex Vision. In Proceedings of ACM Multimedia '02, pages 235-238. ACM Press: New York, NY.

Han, J., Gong, W., and Yin, Y. (1998).
Mining segment-wise periodic patterns in time-related databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), pages 214-218. AAAI Press: Menlo Park, CA.

Hebb, D. O. (1949). The Organization of Behavior. John Wiley: New York.

Himberg, J., Korpiaho, K., Mannila, H., Tikanmaki, J., and Toivonen, H. T. T. (2001). Time series segmentation for context recognition in mobile devices. In Proceedings of the 2001 IEEE International Conference on Data Mining, pages 203-210.

Jones, W. P. (1986). The Memory Extender Personal Filing System. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 298-305. ACM Press: New York, NY.

Krumm, J. and Horvitz, E. (2006). Predestination: Inferring destinations from partial trajectories. In Ubicomp 2006: Ubiquitous Computing, 8th International Conference, volume 4206 of Lecture Notes in Computer Science, pages 243-260. Springer.

Lamming, M. and Flynn, M. (1994). "Forget-Me-Not": Intimate Computing in Support of Human Memory. In Proceedings of FRIEND21 '94 International Symposium on Next Generation Human Interfaces, pages 1-9. Rank Xerox Research Center: Cambridge, UK.

Li, Y., Wang, X. S., and Jajodia, S. (2001). Discovering temporal patterns in multiple granularities. In Roddick, J. and Hornsby, K., editors, Temporal, Spatial, and Spatio-Temporal Data Mining: First International Workshop, TSDM 2000, volume 2007 of Lecture Notes in Computer Science, pages 5-19. Springer-Verlag.

Moll, M., Miikkulainen, R., and Abbey, J. (1994). The capacity of convergence-zone episodic memory. In Proceedings of the 12th National Conference on Artificial Intelligence, AAAI-94, pages 68-73. MIT Press: Cambridge, MA.

Pechoucek, M., Stepankova, O., and Miksovsky, P. (1999). Maintenance of discovered knowledge. In Zytkow, J.
and Rauch, J., editors, Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, volume 1704 of Lecture Notes in Computer Science, pages 476-483. Springer-Verlag: London, UK.

Puttagunta, V. and Kalpakis, K. (2002). Adaptive methods for activity monitoring of streaming data. In Proceedings of the 2002 International Conferences on Machine Learning and Applications (ICMLA'02), pages 197-203.

Rhodes, B. J. and Maes, P. (2000). Just-in-time retrieval agents. IBM Systems Journal, 39(3-4):685-704.

Roddick, J. F. and Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4):750-767.

Want, R., Hopper, A., Falcao, V., and Gibbons, J. (1992). The active badge location system. ACM Transactions on Information Systems (TOIS), 10(1):91-102.

Yao, X. (2003). Research issues in spatio-temporal data mining. Technical Report (white paper) submitted to the University Consortium for Geographic Information Science (UCGIS) workshop on Geospatial Visualization and Knowledge Discovery, Lansdowne, Virginia, Nov. 18-20, 2003, Athens, GA.

Yi, B.-K., Sidiropoulos, N. D., Johnson, T., Jagadish, H. V., Faloutsos, C., and Biliris, A. (2000). Online data mining for co-evolving time sequences. In Proceedings of the 16th International Conference on Data Engineering, pages 13-22. IEEE Computer Society: Washington, DC.

Chapter 5

Conclusion

This thesis describes the first steps towards a general study of biomimetic information management. The ultimate goal of biomimetic information management is to provide the user with an artificial memory prosthesis by mimicking human memory processes. We began by examining mental models of semantic and contextual encoding.
Since we wanted our approach to be tractable and generalizable, we focused on the functional aspects of memory, treating concepts as discrete symbols connected by an associative network of simple relations, as in some prominent memory models (e.g., Anderson, 1983) and in line with recent neuroimaging research showing that collections of cells within the brain do indeed seem to act as the equivalent of symbol systems (e.g., Huyck, 2004). We found that these models are very similar to models developed independently in computational information retrieval, both in terms of large, multidimensional term spaces (e.g., Salton, 1988) and linked-node network models (e.g., Belew, 1986; Jones, 1986). Based on our comparison of cognitive and information-retrieval models, we developed a framework of principles (P-MAK) that describes their key similarities and differences. P-MAK describes general properties of both semantic and contextual information. Some of the framework's principles, and the sparsity of semantic information spaces, led us to choose networks for our data representation. Networks seem ideal for our purposes, due to their dominance in cognitive memory models. In addition, networks are easily built and edited, encode sparse data efficiently, and can be easily depicted due to their graphical nature. In particular, networks can be analyzed topologically in terms of clusters, hubs (i.e., highly connected "exemplar" nodes), and link distributions. We built two network-based systems based on our biomimetic principles, one for semantic and one for contextual encodings. In experiments, we showed these systems to have useful applications. As we pursued a network approach, we found two things: that GOFIR ("good old-fashioned IR," aka mainstream IR practice) has avoided human-centered methods in general, and network approaches in particular. The former is only recently starting to be remedied (e.g., Ingwersen and Jarvelin, 2005).
There are no obvious reasons for the latter: as we found in developing our systems, common IR algorithms are also applicable to network approaches, which suggests some equivalence. However, IR has shown little or no interest in using the topology of semantic networks to improve IR systems (Perugini et al., 2004). To our knowledge, we are the first to use small-world properties for proactive tuning of semantic networks. In this concluding chapter, we discuss the outcomes of the two systems, some possibilities for future work, and a short disquisition on the undeveloped promise of biomimetics.

5.1 Outcomes: The Semantic Network

Based on models of human memory and P-MAK's epistemic principles, our semantic network uses nodes to represent information objects (i.e., documents) as discrete entities. Similarity valuations between documents can be expensive to compute at run-time, especially if they involve global methods that require scanning the contents of an entire corpus (Sparck Jones, 1972). Therefore, our semantic network connects pairs of similar documents with a link weighted to reflect their degree of similarity. Relations between objects, and within local neighbourhoods of related objects, are then cheap to retrieve at run-time. We purposely used only this one type of link, for two reasons. First, a single link type reduces the complexity of the system, compared with more typically complex semantic networks that can have dozens of link types depending on the application (Cohen and Kjeldsen, 1987). Second, networks with fewer link types are easier for users to interpret, which speeds navigation between nodes (Foltz and Kintsch, 1988). In the course of researching, building, and testing our semantic net (Chapter 3), we found prior work in IR on hypertext, defined as "a software system allowing extensive cross-referencing between related sections of text and associated graphic material" (Canadian Oxford Dictionary, 1998).
The most obvious example of hypertext is the World Wide Web (aka "the Web"). However, the Web is a poor metaphor for our purposes. First, although there is some cross-over, it has grown into a separate subject from IR, and is given its own trade journals and conferences (i.e., WWW and SIGWEB, versus SIGIR). Second, contrary to the methods that we test in this thesis, the Web is a hand-built network, and its hyperlinks represent many different types of relation apart from similarity: definition, citation, example, next, previous, home, etc. Hand construction is enormously time-consuming and varies greatly between individual editors (Furner et al., 1999); thus for large corpora it is better to build networks by machine. Despite the success of the Web as a self-organizing system, there is still a need for systems that can be used to index and compare large corpora of pre-existing documents, such as in digital libraries and company intranets. Before the Web became popular, many systems were designed to link documents stored on a single machine. Early work on automatically-constructed hypertext (ACH) recognized the need to link large numbers of documents stored in closed systems; key findings are concentrated in a special issue of Information Processing and Management.¹ However, we found that the ACH was a product of its time, a historical cul-de-sac whose interest in indexing large corpora has been overshadowed by work that followed in dynamic hypertext, i.e., hypertext where links are spontaneously generated as needed, rather than pre-computed and stored. Whereas work on ACH has disappeared, research in dynamic hypertext continues to this day. This shift reflects the increase of computing speed in commodity machines over the last decade: links no longer need to be laboriously pre-computed and stored as run-time generation becomes feasible.
However, we do not believe that this change invalidates the use of ACHs: over the same decade memory cost has plummeted, so storage of large numbers of pre-computed links is also feasible. While one advantage of dynamic hypertext is that spontaneous link generation can adapt to changing needs and priorities during search, the comeback is that the processes that manage an ACH can likewise be programmed to adapt to changing circumstances, and to update links as necessary. One clear advantage of an ACH is that if link generation and updating is unavailable, you still have a navigable network that can be used with a generic Web browser. A last/best-guess at a useful link is better than no links at all. The most important advantage of an ACH over dynamic hypertext is that its topology can be analyzed. By contrast, it is difficult to analyze dynamic hypertext if its structure is volatile and incomplete. In this thesis, we explored a process based on small-world networks, so named because the distance between any two of its nodes is a short path on average (approximately logarithmic in the number of nodes). Small-world networks have been observed in many natural phenomena, including the structure of the brain and human language, as well as in resource-sensitive human-made systems such as power grids, airplane routes, the Internet, and the Web (Barabasi, 2002). Small-world networks have been shown to be good for navigation while using an optimal number of links (Kleinberg, 2000). Since network-based approaches for information management have few principled methods for tuning network topology (cf. Botafogo et al., 1992), we developed an algorithm that tunes the link distribution of our semantic network to a small-world distribution (see Appendix A.1.4). This semantic network was used in our user study (Chapter 3). Compared to the Web, there are few studies that examine the usability of ACHs (e.g., Blustein and Staveley, 2001).

¹Vol. 33, No. 2, 1997.
Nonetheless, there are many relevant user studies that examine Web-based searching and browsing behaviour (see, e.g., Jansen and Spink, 2006). Based on these, we examined the utility of similarity-based ACHs for query reformulation, i.e., to help users clarify their needs and narrow in on information of interest (what we have termed, in P-MAK's navigation principle, semantic gradient descent). The results of our study showed that users who used the automatically constructed similarity network showed significant improvement over keyword-based search in their ability to explore more widely and retrieve more correct information. Thus there are two contributions from our semantic network: a new non-dynamic approach to "query reformulation", and a new method of optimizing the link topologies of semantic networks.

5.2 Outcomes: The Context Network

Based on models of human memory and P-MAK's situational principles, our context network uses nodes to represent cues, and the events related to these cues, to reference information objects. The goal of the context network is to retrieve the objects most likely to be useful when certain environmental stimuli are activated. Compared to our semantic network, which imposes no pre-existing taxonomy on the information it inducts, the context network needs to be given a basic "view of the world", an interpretation of what time is and how to encode place. Our context network uses this information to learn the regularities of user information behaviour. Once patterns are learned, users can ask questions such as "what do I usually do/use in this time/place?" and "when do I usually perform this action?" Users can also be reminded of information objects that they previously found useful in particular circumstances. The context network thus functions as an incremental dynamic user model, and can also be used as a recommender system by using active cues to retrieve event-related objects.
To capture behaviour patterns, we looked at how various types of cognitive schema (Brewer and Treyens, 1981; Mandler, 1984) support memories of contextually-relevant information. Temporal and place schemas represent the time and place of remembered events, and are relatively simple to implement with timestamp and coordinate values; scene schemas encode the objects that are typically present in particular environments, but are more difficult to implement, as they require recognition of individual objects.² We also looked at context within descriptions of episodic memory for personally-experienced events (Tulving, 1972), and its use of encoding specificity to retrieve whole memories based on a handful of relevant features. We found a suggestive network model of encoding specificity in the convergence-zone model of Moll et al. (1994), which associates a set of cues in a binding layer to reference memory of a particular event. Our prior success in connecting the cognitive and information-retrieval models led us to expect the same in context models. Contrary to our expectations, we could not find clear parallels between work on context in cognitive science and in information retrieval. As yet there is very little work on context in IR. This is beginning to change with the new study of Personal Information Management (PIM), which seeks to help users find personal information based on temporal and episodic attributes; PIM systems focus on helping people find information objects (often photographs) by displaying objects on a timeline, or by retrieval based on automatic or user-specified tags. There is also a considerable body of work in database research on temporal and geographic databases; however, the focus is on prediction (e.g., of stock values) and not on helping users to retrieve information in context. To our surprise, we could find no systems specifically designed to answer the "what do I usually do/use in this time/place?" question, i.e.,
concerning the typicality of human information behaviour.3

2We have suggested that such recognition is possible with the judicious use of sensors, but this is beyond the scope of the current thesis; for now we restrict our encodings to time and space.

Thus we based our cue-event-object (CEO) model on cognitive episodic memory models and P-MAK's situational principles. Cue nodes of useful temporal and place schemas (such as the Gregorian calendar and the GPS coordinate system) are pre-programmed into the system. When events occur, an event node is created that acts as an intermediary binding layer. The event node connects all of the relevant cues to all information objects that are associated with the time and place described by the cues. Cues that regularly co-occur are compounded before connection to the binding layer. Although the CEO context network uses the same kind of weighted link as our semantic network, the CEO model is dynamic and learns patterns incrementally in a process analogous to Hebbian learning (Hebb, 1949): links and nodes grow stronger if they are used, and decay if they are not. By contrast, links and nodes in the current version of our semantic network are static, since the similarity relation between document nodes is based on keywords, and we do not change the document keywords after they are assigned. We tested the CEO model in a user study using real-world driving logs. The premise was that certain information-based services, such as retrieving particular music in certain situations (e.g., soothing music in traffic jams), or suggesting appropriate rest stops along the way, would improve the experience of driving. The inputs were a timestamp and a starting location, and the output (the retrieved "information object") was
a prediction of the route taken. With this simple dynamic model we found that accuracy of prediction rose rapidly, and continued to rise as long as new data were introduced to the system. Thus the contribution of the CEO context network is a novel dynamic model for typifying information usage, one that can also be used as a recommender system.

3This impression was strengthened in conversation at SIGIR 2007 with my doctoral consortium mentors; for instance, unlike the CEO system the contextual system of Krumm and Horvitz (2006) is not intended to be generalized beyond driving tasks, and is not designed to answer user queries.

5.3 Future Work

Ultimately the intention is that information systems built according to biomimetic principles should scaffold a user's memory, by displaying what users do remember, as well as providing them with reminders of timely information that has faded from memory. P-MAK, and its implications, are broad and require further exploration. As for the models that we have derived from P-MAK, the task in general is to find boundary conditions and attempt to break the models by inducting large numbers of different object types, and massively frequent occurrence patterns. The questions are then: do the models degrade gracefully or break suddenly? How scalable and efficient are these solutions? We partition our explorations into information networks, the nature of context, the inference of abstractions, and expanding the cognitive basis to improve user interaction.

5.3.1 Semantic Network Building

Investigations of semantic network structure will examine how it is built and maintained. Aspects that affect network structure include the choice of classifier to extract meaningful keywords, the dynamics of node and link strengths, and strategies for controlling spreading activation. Some of these issues also have implications for network topology.
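To make concrete what "controlling spreading activation" involves, the following is a minimal sketch of constrained spreading activation over a weighted network, in the general spirit of Cohen and Kjeldsen (1987). The per-hop decay factor, activation threshold, and hop limit are illustrative assumptions, not the tuned values used in the thesis:

```python
def spread(neighbors, seed, decay=0.5, threshold=0.05, max_hops=3):
    """Constrained spreading activation (generic sketch).

    neighbors: {node: [(neighbor, link_weight), ...]}
    seed: {node: initial_activation}
    Activation fans out from the seed nodes along weighted links,
    attenuated by `decay` per hop and cut off below `threshold`.
    """
    activation = dict(seed)
    frontier = dict(seed)
    for _ in range(max_hops):
        next_frontier = {}
        for node, act in frontier.items():
            for nbr, weight in neighbors.get(node, []):
                a = act * weight * decay
                if a >= threshold:  # constraint: drop negligible activation
                    next_frontier[nbr] = next_frontier.get(nbr, 0.0) + a
        for nbr, a in next_frontier.items():
            activation[nbr] = activation.get(nbr, 0.0) + a
        frontier = next_frontier
        if not frontier:
            break
    return activation
```

Tightening the threshold or lowering the decay factor narrows the activated neighbourhood, which is one of the control strategies such an investigation would vary.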
Classifiers — The effect of alternate classifiers, and of different types of classifier, can be investigated in terms of their effect on network topology and utility. In particular, the investigation of single-document keyword extraction, i.e., classifiers that do not require a global scan of a corpus (see e.g., Matsuo and Ishizuka, 2004), would be particularly useful for large-and-growing corpus situations, since global classifiers need to be re-run regularly to update changing keyword frequencies. Other approaches, such as the new incremental Generalized Hebbian Algorithm for LSA and SVD (Gorrell, 2006), may improve semantic accuracy while reducing the time required for network construction. It is also possible that there are best-use classifiers for particular situations. In this case, the choice of classifier can be based on which sensors are active in the context network.

Node and Link Dynamics — Much more could be done to explore the nature of node and link dynamics. One possibility would be to give "seniority" to items that are most consistently used over time, by employing a mass- or gravity-based model. The more an object is used over time, the slower it would decay, much like persistent and well-known concepts in human memory (Anderson and Schooler, 1991). This implies perhaps using multiple activation values with different decay rates to track short- and long-term trends; decay rates could also change adaptively within the range that they cover.
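The idea of tracking short- and long-term trends with multiple activation values can be sketched as follows. This is a toy formulation under our own assumptions (exponential decay, unit reinforcement, and the particular rate values), not the thesis's implementation:

```python
import math

class LinkStrength:
    """One activation trace per decay rate: a fast rate tracks
    short-term usage, a slow rate tracks long-term 'seniority'."""

    def __init__(self, decay_rates=(0.5, 0.01)):
        self.decay_rates = decay_rates
        self.values = [0.0] * len(decay_rates)

    def tick(self, used):
        """Advance one time step: decay every trace, reinforce if used."""
        for i, rate in enumerate(self.decay_rates):
            self.values[i] *= math.exp(-rate)  # exponential decay per step
            if used:
                self.values[i] += 1.0          # Hebbian-style reinforcement
```

After a burst of use followed by a period of disuse, the fast trace has largely faded while the slow trace persists, which is the qualitative behaviour the gravity-based "seniority" proposal calls for.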
Other possible research in network building includes —
• modeling changes in keyword distributions over time as documents are added to the corpus
• redefining objects in part by other non-semantically-related nodes with which they are nonetheless consistently used
• introducing dynamic links into the (currently static) semantic network, by emphasizing different keywords based on situational changes detected by the context network
• increasing the number of link types incrementally while gauging effects on utility and efficiency

Topology — The small-world technique that we employ for tuning the topology of the semantic network (described in Appendix A.1.4) is rudimentary. Such tuning would likely be improved by better understanding how different test corpora produce networks with different link distributions. Indeed, topology might be used to detect "weak" corpora with little variability, or perhaps "pathological" classifiers that function poorly. Specific work includes examination of why some corpora do not converge on a good link-pruning threshold, and the tuning of evaluation metrics such as the network clustering coefficient and diameter. General questions of interest involve the relation between semantics and network topology, and the effect on topology of large proportions of similar (or even duplicate) documents in a corpus.

5.3.2 Context

The most immediate work to be done with the context network involves acquiring larger and varied GPS test corpora (e.g., Krumm and Horvitz, 2006). Such large data sets are necessary for properly testing the aggregation process of the CEO model, and for developing data structures for fast aggregation across a large set of nodes. In the absence of large user-log corpora, the dynamics of the context network can be studied by generating a synthetic series of events that follow specified patterns, with various degrees of noise added to determine at what point patterns become "noticeable".
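A synthetic event series of this kind might be generated along the following lines. The function name, units, and parameters are illustrative; this is not the event-series code base itself:

```python
import random

def synthetic_events(period, count, noise, seed=0):
    """Timestamps (in minutes) for an event recurring every `period`
    minutes, with Gaussian jitter of standard deviation `noise`.
    Raising `noise` lets one probe when the pattern stops being
    detectable by an aggregation process."""
    rng = random.Random(seed)
    return [i * period + rng.gauss(0.0, noise) for i in range(count)]
```

With noise set to zero the series is perfectly periodic; increasing the noise parameter gradually buries the period until a pattern detector can no longer recover it.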
We have implemented a code base for event-series generation, and will use it as a first test for aggregation processes. In the domain of personal information management, longitudinal studies would be useful in three domains: medical monitoring systems in the home, usage of library resources, and Web browsing. As with the GPS domain, such longitudinal studies can be performed on user logs, but finding appropriate corpora may be a challenge. In expanding the scope of the context network beyond personal information management, we see applications in collaborative filtering (see e.g., Huang et al., 2004) by combining usage profiles from several users into an aggregate profile. Profiles could be clustered and combined based on different types of user behaviour, under the premise that users are likely to be interested in the activities of other users with similar interests. Other possible research in this area includes —
• expanding prediction by recommending good targets that have not yet been visited, perhaps by using aggregates of attributes in user-visited nodes to find other nodes
• exploring the effect of sensor selection by experimenting with noisy sensors, which can develop detection methods for defective sensors, such as when a sensor fails to fire in a particularly persistent pattern
• modeling regular events that occur in an arbitrary period (e.g., "every 17 minutes") that does not correspond easily to the system's existing temporal cue nodes
• testing in dynamic information-technology contexts, such as mobile, distributed, and ubiquitous computing

5.3.3 Abstraction

Automated summaries of an information space would be of particular use to users, giving them an immediate impression of what information they can hope to retrieve, or even if the available information might be appropriate to their needs before they begin. Summaries are generated by a process of abstraction that distills a body of information down to its salients.
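One simple way to distill a set of similar nodes to a single representative can be sketched as follows. We assume here that nodes are represented as numeric feature vectors, and the variability threshold is purely illustrative; low-variability sets are summarized by their average (in the spirit of prototype theory), and more variable sets by their most central member (in the spirit of exemplar theory):

```python
def meta_node(vectors, variability_threshold=0.1):
    """Reduce a set of feature vectors to one representative.

    Returns ("prototype", centroid) when the set is tight,
    or ("exemplar", closest_member) when it is spread out.
    """
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    def sqdist(v):  # squared distance from the centroid
        return sum((v[d] - mean[d]) ** 2 for d in range(dims))

    variability = sum(sqdist(v) for v in vectors) / len(vectors)
    if variability < variability_threshold:
        return ("prototype", mean)          # synthetic average member
    return ("exemplar", min(vectors, key=sqdist))  # most central real member
```

The returned representative could serve as the meta-node discussed next, offering an entry point to the cluster it summarizes.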
Finding usage patterns in our context network is essentially a form of abstraction, and a simple one at that, given the constrained process that we employ. However, semantic abstraction is a much thornier issue. One possibility may be to create representative meta-nodes that mimic simple generalizations found in functional-level memory theories. For instance, if a subset of nodes shows little variability, a new meta-node can be created to represent the set. By contrast, if the set shows more than a minimum amount of variability, then an existing node that best exemplifies family resemblance can be used as a representative. Respectively, these methods mimic prototype theory (Posner and Keele, 1970) and exemplar theory (Brooks, 1978)—both are feature-based theories of knowledge structure. Meta-nodes, while connected to the ordinary nodes in the network, would exist in a meta-layer that may summarize the dominant topics in the network; each meta-node offers an entry point to a particular topic. In the same manner that human language is formed of basic symbols that are aggregated into meta-symbols (often called chunks or tokens) that can also be aggregated, these meta-nodes could be summarized and represented in even higher-order meta-meta-layers. Ideally, this process would create global maps to dominant semantic hierarchies.

5.3.4 User Interaction

User interaction can be used to develop better, more accurate models of human information behaviour, and to display information more intuitively.

Cognitive modeling — One intriguing possibility is that information behaviour over time may reveal a user's cognitive weaknesses. Hypothetically, an information system could be made to recognize deficits of different types and adapt its behaviour to scaffold them appropriately. Alternatively, the system could be used to train and strengthen these weaknesses, hovering at the limits of ability without pushing users into frustration (Doidge, 2001; Klein et al., 2002).
This approach would require mapping various cognitive processes onto information-retrieval processes based on similar function, and would start by modeling common human information-behaviour deficits that would best be served by remediation. Other possible research in this area includes —
• examining the evolutionary constraints on human information behaviour in cross-cultural studies (see e.g., Anderson and Schooler, 1991; Brown, 2001; Sharps et al., 2002)
• developing models that describe the regularities of human behaviour (e.g., Sanderson and Dumais, 2007)

Bookmarking — Given the various contexts under which information can be retrieved, users may wish to save the state of the network as a "bookmark" to which they can return later. This could be accomplished by saving a compact profile that contains only the dominant node and link values in the user's network that exceed a set threshold. The profile can be revisited by setting all network elements below the threshold value and then assigning the stored values to the appropriate nodes.

Visualization — In our user study, we specifically avoided using a novel interface and relied instead on well-known Web metaphors, to be more certain of testing the underlying model rather than any effect of the interface. Future work demands expansion into information visualization, given that (1) networks are inherently graphical, and (2) the field of personal information management is concerned at least as much with the depiction of data as with its underlying model.

5.4 Biomimetic Information Retrieval

We believe that if humans organize concepts internally in memory as the equivalent of discrete symbol-systems, and also organize information externally in discrete objects using discrete symbol systems, then this coincidence is worthy of closer examination. This thesis has introduced the notion of biomimetic information retrieval—information retrieval based upon cognitive function—as an area of potential interest.
I am generally motivated by the large and increasingly difficult problems with which humanity has beset itself, particularly in addressing the outcomes of past indulgences. I believe that the duty of every scientist, indeed of every person, is to contribute to rational solutions that reduce the impact of human foibles and improve the quality of life. The obvious question then is to ask what one can do about it. As an information scientist, the clearest way to contribute is to help others to grapple with the mass of data that modern methods allow them to accumulate, and organize it into coherent models. Ultimately the models that are created need to be comprehensible to both scientists and policy-makers. Our work has been motivated by the question of how to organize information in a natural way. A first step to answering this question should address how the human mind organizes information. To ease users' struggles with technology, we should start with cognitive models and develop systems around such models, as opposed to the well-established method of starting with a technological solution and thereafter finding ways to make it digestible for unhappy users. Until now, this approach to development has been the legacy of the personal computer: the failures of poor design are a common topic of discussion. I believe that machine-based information access will remain a frustrating exercise until machines begin to respond in ways that are better matched to the fundamental information-processing properties of human cognition.

Figure 5.1: Biomimetic information retrieval in context, represented as a grey area that incorporates personal information management (PIM) and information behaviour (IB), and lies at the crossroads of library & information science, cognitive science, sociology, and computer science.
There is little work that applies the lessons of functional cognition to the design of information management systems. This seems odd given the close agreement between IR models and semantic memory models discussed in this thesis. A reasonable first reaction is that perhaps there's nothing there to find, and yet it seems far more likely that such work lies in a blind spot in science, a place where few people venture since few enter the divide between the component disciplines of computer science, cognitive science, and library and information science. Thus we find ourselves situated at the junction of several fields of inquiry, as shown in Figure 5.1. This junction is most fittingly described as a combination of personal information management and information behaviour, with a strong cognitive component. Our approach draws on, and provides tools for, work in context representation, semantic browsing and navigation, recommender systems, and memory prosthesis. In a project of such wide scope, there is always more potentially relevant literature to be found, and much remains to be proven. The abundance of possibilities only encourages further study, and I believe that this area is filled with promise.

Bibliography

Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3):261-295.
Anderson, J. R. and Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2(6):396-408.
Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Publishing: Cambridge, MA.
Belew, R. K. (1986). Adaptive Information Retrieval: Machine Learning in Associative Networks. PhD thesis.
Blustein, J. and Staveley, M. S. (2001). Methods of generating and evaluating hypertext. Annual Review of Information Science and Technology, 35:299-335.
Botafogo, R. A., Rivlin, E., and Shneiderman, B. (1992). Structural analysis of hypertexts: Identifying hierarchies and useful metrics.
ACM Transactions on Information Systems (TOIS), 10(2):142-180.
Brewer, W. F. and Treyens, J. C. (1981). Role of schemata in memory for places. Cognitive Psychology, 13:207-230.
Brooks, L. R. (1978). Nonanalytic concept formation and memory for instances. In Rosch, E. and Lloyd, B., editors, Cognition and Categorization, pages 170-211. Lawrence Erlbaum Associates: Hillsdale, NJ.
Brown, W. M. (2001). Natural selection of mammalian brain components. TRENDS in Ecology & Evolution, 16(9):471-473.
Canadian Oxford Dictionary (1998). The Canadian Oxford Dictionary. Oxford University Press: Toronto, Canada.
Cohen, P. R. and Kjeldsen, R. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing & Management, 23(4):255-268.
Doidge, N. (2001). Building a better brain. Saturday Night Magazine (May 1). St. Joseph Media: Toronto, Canada.
Foltz, P. W. and Kintsch, W. (1988). An Empirical Study of Retrieval by Reformulation on HELGON. Technical Report 88-9, University of Colorado, Boulder, CO.
Furner, J., Ellis, D., and Willett, P. (1999). Inter-linker consistency in the manual construction of hypertext documents. ACM Computing Surveys (CSUR), 31(4es):18.
Gorrell, G. (2006). Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. In EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics.
Hebb, D. O. (1949). The Organization of Behavior. John Wiley: New York.
Huang, Z., Chen, H., and Zeng, D. (2004). Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1):116-142.
Huyck, C. R. (2004). Overlapping cell assemblies from correlators. Neurocomputing, 56:435-439.
Ingwersen, P. and Jarvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer: Dordrecht, The Netherlands.
Jansen, B. J.
and Spink, A. (2006). How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing & Management, 42(1):248-263.
Jones, W. P. (1986). The Memory Extender Personal Filing System. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 298-305. ACM Press: New York, NY.
Klein, J., Moon, Y., and Picard, R. W. (2002). This computer responds to user frustration. Interacting with Computers, 14(2):119-140.
Kleinberg, J. M. (2000). Navigation in a small world. Nature, 406:845.
Krumm, J. and Horvitz, E. (2006). Predestination: Inferring destinations from partial trajectories. In Ubicomp 2006: Ubiquitous Computing, 8th International Conference, volume 4206 of Lecture Notes in Computer Science, pages 243-260. Springer.
Mandler, J. M. (1984). Stories, Scripts, and Scenes: Aspects of Schema Theory. Lawrence Erlbaum Associates: Hillsdale, NJ.
Matsuo, Y. and Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal of Artificial Intelligence Tools, 13(1):157-169.
Moll, M., Miikkulainen, R., and Abbey, J. (1994). The capacity of convergence-zone episodic memory. In Proceedings of the 12th National Conference on Artificial Intelligence, AAAI-94, pages 68-73. MIT Press: Cambridge, MA.
Perugini, S., Goncalves, M. A., and Fox, E. A. (2004). Recommender systems research: A connection-centric survey. Journal of Intelligent Information Systems, 23(2):107-143.
Posner, M. I. and Keele, S. W. (1970). Retention of abstract ideas. Journal of Experimental Psychology, 83:304-308.
Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley: Reading, MA.
Sanderson, M. and Dumais, S. (2007). Examining repetition in user search behavior. In Proceedings of the European Conference on Information Retrieval (ECIR).
Sharps, M. J., Villegas, A. B.
, Nunes, M. A., and Barber, T. L. (2002). Memory for animal tracks: A possible cognitive artifact of human evolution. Journal of Psychology, 136(5):469-492.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21.
Tulving, E. (1972). Episodic and Semantic Memory. In Tulving, E. and Roberts, M., editors, Organization of Memory, pages 381-403. Academic Press: New York.

Part I: Appendices

Appendix A: System Design

This thesis describes two approaches to organizing information objects: by semantics and by usage. These approaches operate independently, and both use a network representation as the organizing principle. The semantic network is used in the user experiment of Chapter 3, and comprises a set of documents represented as nodes, which are connected with static links whose weights represent the degree of similarity between nodes. The purpose of the semantic network is to organize information in a way that makes it easy to navigate and retrieve. The context network is used in the GPS activity-pattern experiment of Chapter 4, and is comprised of a set of temporal and sensor cues connected to objects to indicate when those objects are used and under what circumstances. The purpose of the context network is to provide an updated summary of activity patterns that can be used to predict future behaviour. The links in the context network are dynamic.

A.1 The Semantic Network

The algorithms that are used to build and access the semantic network are based on the epistemic principles of Chapter 2. In the description that follows, we use several specialized words:
• Term merely means "word", and is most commonly used when speaking of searching or indexing processes.
• A keyword is a word (or term) that is descriptive of a document. Keywords are typically extracted from the text of the document using statistical techniques such as tf-idf, described below as part of Algorithm 1.
• A corpus is a collection of information objects. Such objects are typically printed documents such as books, journals, articles, diaries, etc.

We build a network from a corpus of documents in three steps, based on semantic features. First, we extract keywords from each document using the well-known tf-idf algorithm (Salton and McGill, 1983). Second, we link these documents with a weight proportional to their similarity, i.e., by the number of keywords that they share. Third, we introduce a principled method for pruning links in the network to approximate a small-world topology. The motivation and justification for this approach is discussed in detail in Chapters 2 and 3. Following the description of the network-building algorithms, we show how keyword-searching and browse-searching are performed, and how an individual document is added to an existing network.

A.1.1 Pre-Processing

Before building the semantic network, some pre-processing was necessary to prepare the different raw document formats for the same algorithmic code base. We assembled our own New York Times corpus by extracting documents from the daily headlines emailed by the NYT to list-serve subscribers (New York Times, 2006). Documents were culled from the years 2003-2005. As the emails were in HTML format and contained extraneous material such as formatting and image tags, we wrote an HTML parser to remove all tags and leave only ASCII body text. The parser also extracted meta-information such as title, author, and date of publication, which were included as file headers in the document text files that we stored in a file system. The Reuters-21578 corpus was downloaded online (Reuters-21578, 2006) as a large collection of Zip-compressed plain-text files. Many documents were not discursive (i.e., written in grammatical sentences) but were rather short numerical tables of agricultural and industrial production rates that would be inappropriate for semantic keyword extraction.
To guarantee discursive documents, we set a minimum length threshold at 500 words to filter out all shorter documents, which successfully retained a useful set of documents. We filtered both corpora for special characters to ensure that different writings of the same word, e.g., glacée and glacee, would be treated as identical. All special characters (such as accented vowels) were replaced with their ASCII-only equivalents, for example by substituting all é, è, and ê with the simple vowel e. This ensured that words would be correctly counted by the indexing algorithm despite variations in accenting.

Computational Considerations

With small corpora, the documents (and their accompanying meta-data, such as word counts) may be held in RAM, but it is better to assume corpora of arbitrary size. We stored our corpora on disk and loaded documents into RAM as needed. We also wrote a paging mechanism for our indexes, although in practice we had enough RAM on our experimental machines (at least 512MB) so that all indexes could be held in memory without paging. We assigned each document in our system a unique integer identifier (ID), which was used to name the document's file on disk, and was also used as the document's reference in our indexes. IDs were assigned sequentially as documents were inducted into our system. For storing a large number of documents on disk, we partitioned the corpus into a tree hierarchy that allows access to any document in logarithmic time by using its ID. The documents were stored at the leaves of the tree, with all other levels containing only folders. Documents were stored using the standard txt suffix to indicate ASCII text; a document's keyword- and neighbour-list files were stored alongside the document in the file system with the same ID and a different suffix. The number of documents or folders held at each level of the hierarchy (i.e., the branching factor) determines the base of the logarithm.
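One reasonable implementation of such an ID-to-path calculation can be sketched as follows. The digit-per-level layout, depth, and file naming here are our own assumptions about the scheme, not the exact code used:

```python
def doc_path(doc_id, depth=5, branch=10):
    """Map a document ID to its path in a fixed-depth storage tree.

    Each base-`branch` digit of the ID names one folder level, so the
    path is computable in constant time without searching the disk.
    The document file itself sits at the leaf level.
    """
    digits = []
    n = doc_id
    for _ in range(depth - 1):      # depth-1 folder levels above the leaf
        digits.append(str(n % branch))
        n //= branch
    folders = "/".join(reversed(digits))
    return "{}/{}.txt".format(folders, doc_id)
```

For example, with a branching factor of 10 and depth 5, document 1234567 resolves to the folder chain built from its low-order digits, and the full path is derived arithmetically rather than by directory lookup.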
For example, with a branching factor of 10, a tree of 100,000 documents would have a depth of 5 levels, since log₁₀ 100,000 = 5. The absolute path to a specific document can be pre-calculated in constant time, based on its ID, and thereafter accessed directly. Since physical media access shows greater latency than calculation, calculating the entire path versus searching for identifiers on disk reduces access time.

A.1.2 Semantic Indexing

The goal of semantic indexing is to tag documents according to their dominant characteristics, so that they may be easily retrieved if those characteristics appear in a query. The standard approach to indexing documents is keyword extraction, which tags documents with dominant words found in the document's contents. Tf-idf (Salton and McGill, 1983) is perhaps the simplest and best-known keyword-extraction algorithm. We use it in our system as a simple baseline—it is a crude implement compared to more recent algorithms (such as BM25: Robertson et al., 2000). The assumption is that if our approach works well with tf-idf, then it can only improve in speed or accuracy with more sophisticated algorithms. Keyword extraction with tf-idf is necessarily performed in two steps. In the first step, a word count is made over the entire document corpus of size ||D||, simultaneously recording in table DF the number of documents that contain a particular word, and in table TF the number of times each term appears in a document (Algorithm 1, lines 4-7). Common words such as and, her, whereas, etc. that are not semantically interesting (called stopwords) are excluded from the count. We used the same list of stopwords as employed in Salton's SMART system (Salton, 1968), which uses fewer than 600 stopwords. As there are few stopwords relative to the size of the English lexicon, the comparison of each term to the stopword list invokes negligible cost, particularly if the stop-list is stored as a hash table or search tree.
Lines 6 and 7 will be executed ||D|| × t times, where t is the average number of words per document. Since binary search trees are used to store the counts, searching or incrementing TF(d, t) takes O(log ||TF||), and incrementing DF takes O(log ||DF||). Incrementing DF happens once per unique term per document, but the number of unique terms in a document is proportional to document length t, thus incurring a total cost bounded by O(Dt log ||DF||). Therefore (omitting bars for clarity) the total cost for the first step is O(Dt [2 log TF + log DF]). However, the value t is essentially a constant: individual documents are not likely to exceed a standard size, and as the corpus grows arbitrarily large, the rate at which previously unseen terms are added to TF and DF slows down, as the likelihood increases that any term has been seen before. We estimate this cost at roughly O(log ||DF||). Thus, the complexity of step one is dominated by the size of the corpus to give running time O(D log ||DF||), although with large values for ||DF|| and constant t if D includes large numbers of large documents.

Algorithm 1: Keyword extraction for all documents in a corpus
Input:
  D — a corpus of documents
  N — the maximum number of keywords per document (set to 20)
Output:
  IDX — a table of document IDs for each keyword extracted from D
  KW — a table of keywords per document d ∈ D
  DF — a table of the number of d ∈ D containing a term t
  TF — a table of term frequencies per document d ∈ D
 1  begin
 2      DF ← 0
 3      TF ← 0
 4      for d ∈ D do                                      // count words
 5          for each term t in d that is not a stopword do
 6              if TF(d, t) = 0 then increment DF(t)      // first instance
 7              increment TF(d, t)
 8      KW ← 0
 9      for d ∈ D do                                      // keyword weights
10          for each term t in TF(d) do
11              ω(t,d) ← TF(d, t) × log[ ||D|| / DF(t) ]  // tf-idf
12              if ω(t,d) > minWeight(KW) or ||KW|| < N then
13                  sort [ω(t,d) → t] into KW
14          KW ← ALPHABETIZE(KW)
15          for kw ∈ KW do
16              add [kw → ID(d)] to IDX                   // insert into index
17              writeToKeywordFile(ID(d), kw, ω(kw,d))
18  end

In the second step, the term counts are used to calculate the weights for all terms in each document, and the strongest terms in each document are then stored as that document's keywords (lines 9-17). "Strongest" may be defined in at least two ways: either by selecting all terms with weights above a set threshold, or by selecting the N terms with the highest weights. The thresholding approach is problematic: the choice of threshold is difficult, potentially allowing an excessive number of keywords for some "distinctive" documents, but few or no keywords for other "weaker" documents. Therefore, we take the approach of picking a fixed number of keywords per document, and have found in our experiments that setting 20 keywords per document seems to provide documents with adequate keyword descriptions, judged by comparing keywords to the content of a sample of documents. The best keywords tend to have weights above 0.9 on a normalized scale; in extreme cases we discarded terms within the 20-keyword set whose weights were weaker than some trivial threshold, which we set at 0.1.

Thus the second step of the keyword-extraction algorithm proceeds by using the tf-idf formula to calculate a weight ω(t,d) for each term t in each document d, by multiplying the frequency of the term in the document by the inverse document frequency (idf) of the term across the entire corpus (Line 11). The intuition of tf-idf is that if a term appears across all the documents in a corpus, then it is useless as a keyword since it will not help differentiate between documents: in that extreme case, D = DF(t), and thus log(1) = 0, so the term will receive a weight of 0, which makes it an unlikely candidate for a keyword.
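The weighting and selection of the second step can be sketched in Python for a single document, assuming precomputed TF and DF tables. The natural logarithm is used here; any base preserves the ranking:

```python
import math

def tfidf_keywords(tf_d, df, n_docs, n_keywords=20, min_weight=0.0):
    """Second step of keyword extraction, sketched for one document d.

    tf_d: {term: frequency of term in d}
    df: {term: number of documents containing term}
    Weights each term by TF(d,t) * log(||D|| / DF(t)) and keeps the
    n_keywords strongest terms whose weight exceeds min_weight.
    """
    weights = {t: tf * math.log(n_docs / df[t]) for t, tf in tf_d.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(t, w) for t, w in ranked[:n_keywords] if w > min_weight]
```

A term that occurs in every document gets log(||D||/DF(t)) = log(1) = 0 and is dropped, matching the intuition described above.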
By contrast, a term that is used in just a few documents maximizes the log expression and is much more likely to be favoured as a keyword. In Line 12, a term is selected as a keyword if its weight is greater than that of the weakest member of the keyword list, or if fewer than N = 20 keywords have been selected so far. Once all terms in the document have been processed, the resulting keyword list is re-sorted alphabetically to speed the linking stage that follows keyword extraction. Then the document's identifier and all its keywords are added to the inverted index IDX, which maps each keyword extracted from the corpus onto a list of all documents that contain that keyword. The keywords are also written to a file; the file is stored alongside the document in the file space and is specified by the document ID. In terms of run-time complexity, we use the same simplifying logic as in step one to discount the influence of TF and DF for very large corpora D. The KW table of keywords per document is also of small constant size (limited to 20 keywords). Thus, the running time of step two tends asymptotically to O(D log ||DF||) for arbitrarily large corpora D, and the asymptotic running time for both steps of keyword extraction is O(D log ||DF||) + O(D log ||DF||) = O(D log ||DF||). The main problem with tf-idf is that it is a global process; that is, all terms in all documents must be scanned before keywords can be chosen for any document. This requirement is a considerable burden for corpora that change or grow over time, as the choice of keywords depends on the corpus-wide distribution of terms. For example, a corpus of 1,000 documents may be written entirely on the topic of fish. If the term fish appears liberally throughout the corpus, then it would not be chosen as a keyword for any document, as it does not help to distinguish between documents.
However, if 999,000 other documents are added to the corpus on a topic unrelated to fish, then the term fish becomes a good way to differentiate the 0.1% of the corpus comprising the fish documents. Such significant changes to a corpus require that document keywords be recalculated by re-running tf-idf on the entire corpus. The cost of counting term frequencies in documents already in the corpus can be ameliorated by storing the frequencies of each unique term in a database, incrementing them as new documents are added, and re-using them whenever tf-idf needs to be recalculated. As the corpus becomes larger and more comprehensive, additions of new documents become smaller in proportion to the size of the corpus as a whole. The problem then becomes coping with semantic drift as the distribution of terms in the corpus changes gradually: deciding how frequently to recalculate keywords across the corpus, and under what circumstances, is a non-trivial question worthy of its own research program. Global methods such as tf-idf are best avoided for large corpora, as they represent a significant cost. More recent algorithms (e.g., Matsuo and Ishizuka, 2004) claim performance equivalent to tf-idf in extracting keywords from single documents, which would speed keyword extraction and avoid costly recalculations in large corpora. We intend to explore such methods as future work.

A.1.3 Semantic Linking

Once keywords have been extracted from each document, we use them to build the semantic network. Documents can be represented as vertices (or nodes) linked by bi-directional edges (or links) that represent semantic similarity. The weight of a link indicates the degree of similarity of the nodes that it connects. Linking proceeds by finding all nodes that share any keywords with the current node, and then computing similarity between the current node and each of the neighbours based on the keywords that they share.
Our linking policy is to generate all but the most vanishingly weak links, since a dynamic pruning function is used later to optimize the network's topology. The linking algorithm is described in Algorithm 2. The algorithm iterates through corpus D, retrieving the stored keywords for each document d (in O(1) time, Line 3). The algorithm then iterates through the keywords, retrieving for each keyword a list of IDs of all other documents that use that keyword. Retrieving the list of IDs takes time O(log k) for k keywords in the corpus, since the index IDX is organized as a binary search tree. The algorithm then retrieves the keywords for each of these neighbour documents n, and compares the keywords of documents d and n to produce a similarity value that is used as a link weight. There are two mechanisms that ensure d and n are not compared more than once (Line 7). First, since links are bi-directional, only one comparison is necessary, so comparisons are only made if n's ID number is smaller than d's (and not the reverse). Second, since document d and its neighbour n may share more than one keyword, Line 7 also checks that no prior link exists between the two nodes, i.e., that a comparison has not already been performed based on a previous matching keyword. If the keywords of documents d and n have not been previously compared, then the algorithm walks through the two (now alphabetically sorted) keyword lists in parallel, adding the weights of any keywords that match (lines 12-20). Once all keywords have been compared, the resulting similarity score is normalized. If the normalized similarity score exceeds a trivial threshold ε (set to 0.01), then the neighbour ID and the score are written to a neighbour file alongside document d in the file hierarchy. The number of neighbours, i.e., the node's degree value, is also stored in the header of the file to speed its retrieval for the link-pruning algorithm described in the next section.
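The parallel walk over two alphabetically sorted keyword lists can be sketched as below. This is a simplified illustration: the function name is ours, and the normalization step is omitted for brevity.

```python
def similarity(kw_a, kw_b):
    """Sum the weights of shared keywords in two sorted keyword lists.

    kw_a, kw_b: lists of (term, weight) pairs sorted alphabetically by term.
    Returns the raw (un-normalized) similarity score.
    """
    i = j = 0
    sim = 0.0
    while i < len(kw_a) and j < len(kw_b):
        term_a, weight_a = kw_a[i]
        term_b, weight_b = kw_b[j]
        if term_a < term_b:
            i += 1                  # advance the list holding the smaller term
        elif term_a > term_b:
            j += 1
        else:                       # matching keyword: add both weights
            sim += weight_a + weight_b
            i += 1
            j += 1
    return sim
```

Because both lists are sorted and each pointer only moves forward, the walk is linear in the (constant, at most 20-element) list lengths.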
Once similarity scores have been generated and stored between all possible pairs of nodes in the network, they are not recalculated. The size of the keyword lists KW again affects the complexity of the algorithm. Since the number of keywords per node is limited to 20, all operations that involve iterating through keyword lists (the loops at lines 4 and 12) are essentially constant-time. The key issue then is the average number of neighbours per node, which affects the loop beginning at Line 6. In a fully connected network of N nodes, each node would have N − 1 neighbours, giving a run-time complexity of O(N²). Given that tf-idf discourages keywords from being shared by more than a fraction of the documents comprising a network, this must be considered a loose estimate. Also, since our pruning algorithm described in the next section attempts to tune the network to a small-world link distribution, and since the number of links in a small-world network tends to be logarithmic in the number of nodes (Barabasi, 2002), we expect the actual bound of Algorithm 2 will be closer to O(N log N). Estimating a tighter connectivity bound will be the subject of future work.
Algorithm 2: Linking all similar documents in a corpus
Input:   IDX — a table of document IDs for each keyword extracted from D
         KW — a table of keywords per document d ∈ D
         ε — a minimum threshold for creating a link (set to 0.01)
Output:  a file for each document d containing links between d, n ∈ D, d ≠ n
 1  begin
 2      for id_d = 1 to ||D|| do
 3          KW_d ← getKeywords(id_d)
 4          for kw ∈ KW_d do
 5              neibIDs ← IDX(kw)
 6              for id_n ∈ neibIDs do
 7                  if id_n < id_d and getLink(id_d, id_n) = ∅ then
 8                      KW_n ← getKeywords(id_n)
 9                      sim ← 0
10                      ptr_d ← 0
11                      ptr_n ← 0
12                      while ptr_d < ||KW_d|| and ptr_n < ||KW_n|| do
13                          kw_d ← KW_d[ptr_d]
14                          kw_n ← KW_n[ptr_n]
15                          if kw_d < kw_n then
16                              ptr_d ← ptr_d + 1
17                          else if kw_d > kw_n then
18                              ptr_n ← ptr_n + 1
19                          else
20                              sim ← sim + ω(kw_d) + ω(kw_n); ptr_d ← ptr_d + 1; ptr_n ← ptr_n + 1
21                      sim ← normalize(sim)
22                      if sim > ε then
23                          writeToNeighbourFile(id_d, id_n, sim)
24  end

A.1.4 Small-World Link Pruning

We introduce a principled approach to automatic link pruning using the metrics of small-world networks. We discuss how that topology is approximated by adjusting the threshold for forming links between nodes, how the resulting topology is evaluated for small-world properties, and the limitations of our link-pruning method. After the semantic network has been as completely connected as possible using Algorithms 1 and 2, we adjust the link-acceptance threshold θ to prune away weaker links, in an attempt to tune the number of links per node (termed the degree or arity of the node) to a small-world distribution.

Figure A.1: As the link threshold θ is increased, the distribution of node degree moves from polynomial (θ = 5, 10) to exponential (θ = 15) to power-law (θ = 22). The number of nodes of low degree increases dramatically as the number with high degree diminishes.

Link Thresholding

The link threshold is set manually in most experimental settings.
The manual approach can be arbitrary, and depends on the semantic nature of the corpus: the relative richness of the attributes in the information space, whether attributes are clustered or evenly distributed, etc. Instead, we propose adjusting the link-acceptance threshold θ dynamically based on target small-world metrics, in particular the property that the distribution of node degree (i.e., the number of links per node) can be described by a power law with a target exponent of roughly −2.0. Our semantic network is built as described above by extracting document keywords with tf-idf, and linking documents based on the keywords that they share. The threshold θ is used to determine whether links are strong enough to be added to the model. A lower value of θ raises the connectivity of the network. As can be seen in Figure A.1, setting θ = 5 creates a high average connectivity per node of approximately 500 links. The effect of raising the threshold θ is to reduce the number of links in the network, which reduces the number of nodes with high degree and increases the number that are sparsely connected. The curve in Figure A.1 is thereby moved to the left, as shown. The test corpus in this case is the Reuters corpus used in our user experiment (Reuters-21578, 2006). As θ is raised, the curve moves from quadratic through exponential towards a power curve, with θ = 22 producing the power curve y = 2025.9x^−1.767. Algorithm 3 starts with θ set to the network's average link value, and the adjustment of θ is doubled or halved at each iteration in a binary search that narrows in on the best possible value for θ. The best θ is the one that produces a power-law distribution with a target exponent value; various sources (e.g., Adamic et al., 2001; Kleinberg, 2000) have shown that the exponent in the power-law formula y = x^−α typically approximates α = 2.0 in small-world networks, where y represents the count for each degree value x.
In our algorithm, the target exponent value is stored in the real variable target. The for-loop (Line 7) builds a histogram of the network's node-degree distribution, using the filtering function getDegree(), which ignores all links with weights below threshold θ. getDegree() uses the node's ID to retrieve the node's neighbour file in constant time; the degree value of the node is stored in the header of the file, and is thus also accessible in constant time. The function addToHistogram() adds the retrieved degree value to the histogram, which is implemented as a binary search tree mapping a degree value onto the number of nodes with that degree, i.e., the keys of the tree are node degrees, and the values are the counts. In the worst (and unlikely) case, few nodes will have the same degree, and thus the histogram is assembled in time bounded by O(||N|| log ||N||). When the histogram is complete, the function calculateCurve uses the standard least-squares method to fit a curve to the points, in running time proportional to the size of the histogram tree H, i.e., O(||H||). As many nodes in the network are expected to have the same degree (see Figure A.1), ||H|| is expected to be strictly smaller than ||N||. After the curve is calculated, its exponent is extracted for comparison with the target exponent value. There are two termination tests for the algorithm. The first is found in Line 12, and represents a successful result where the difference between the calculated exponent exp and the specified target is less than the desired degree of accuracy ε, which we arbitrarily set to 0.3. The second termination test, at Line 14, prevents the algorithm from looping indefinitely if it cannot converge upon a good result. The function getPastExponent() compares the current curve exponent exp to the exponents generated in past iterations, stored in a binary search tree, in time O(log i) for i iterations.
If the test fails, then the algorithm has already approximated the currently observed link distribution in a previous iteration. In our preliminary experiments without the termination test, this failure occurred in two ways: either the same distribution was repeated in successive iterations, or the algorithm oscillated between two distributions on either side of the target (but not within ε). Better understanding this non-optimal behaviour is the subject of future research. If the termination test succeeds, then the curve exponent is previously unobserved, and is stored for comparison in further iterations with the function storeExponent(), which also operates in time O(log i) for i iterations. Lines 16 to 21 control the adjustment of the link-filtering threshold θ in a binary search for the best value. If exp is larger than the target, then the network is too sparse and θ needs to be lowered; the current value of θ becomes the upper bound of the search, the lower bound stays the same, and θ itself is halved. If exp is smaller than the target, then the network is too dense and θ needs to be raised; the current value of θ becomes the lower bound of the search, the upper bound stays the same, and θ itself is doubled. In this manner, the algorithm quickly narrows in on the value of θ that sets the distribution of link connections in the network most closely to the desired state: approximating a small-world network.
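The binary search over θ can be sketched as follows. This is an illustrative simplification: the callable `fit_exponent` stands in for the histogram-building and least-squares curve-fitting steps, and the past-exponent termination test is replaced by a simple iteration cap.

```python
def tune_threshold(fit_exponent, w_min, w_max, target=2.0, eps=0.3, max_iter=50):
    """Binary-search the link threshold theta until the fitted power-law
    exponent of the degree distribution is within eps of the target.

    fit_exponent: callable mapping a threshold theta -> fitted exponent.
    Returns (theta, exponent) for the best threshold found.
    """
    lower, upper = w_min, w_max
    theta = (lower + upper) / 2.0
    exp = fit_exponent(theta)
    for _ in range(max_iter):
        if abs(target - exp) < eps:
            break                        # success: close enough to the target
        if exp > target:                 # too sparse: lower theta
            upper = theta
            theta = (theta + lower) / 2.0
        else:                            # too dense: raise theta
            lower = theta
            theta = (theta + upper) / 2.0
        exp = fit_exponent(theta)
    return theta, exp
```

With a monotone `fit_exponent`, each iteration halves the search interval, so convergence is logarithmic in the weight range over eps.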
Algorithm 3: Link thresholding
Input:   N — a network of nodes representing a corpus of documents
         ω_min — the minimum link weight in the network
         ω_max — the maximum link weight in the network
         target — the desired power exponent of the link distribution (set to 2.0)
         ε — the desired degree of accuracy (set to 0.3)
Output:  the links of network N tuned to a small-world topology
 1  begin
 2      lower ← ω_min
 3      upper ← ω_max
 4      θ ← (lower + upper)/2
 5      exp ← 0
 6      while true do
 7          for id_d = 1 to ||N|| do
 8              degree ← getDegree(id_d, θ)
 9              HISTOGRAM ← addToHistogram(degree)
10          eqn ← calculateCurve(HISTOGRAM)
11          exp ← getExponent(eqn)
12          if abs(target − exp) < ε then
13              END                              // success!
14          if getPastExponent(eqn) = ∅ then
15              storeExponent(eqn)
16              if (target − exp) < ε then       // too sparse
17                  upper ← θ
18                  θ ← (θ + lower)/2
19              if (target − exp) > ε then       // too dense
20                  lower ← θ
21                  θ ← (θ + upper)/2
22          else
23              END                              // can't do any better
24  end

The total running time for Algorithm 3 comprises i iterations of the outer loop times the sum of building the histogram in time O(||N||), calculating the curve in time O(||H||), and storing and retrieving the curve exponent, each in time O(log i); thus O(i × [||N|| + ||H|| + 2 log i]). The number of iterations i is not affected by the number of nodes but rather by the choice of ε: a smaller ε requires more iterations to achieve the desired accuracy (or fail). Since ε is a constant, we expect the number of iterations i to likewise be bounded by some constant. Since ||H|| is strictly smaller than ||N||, and i is constant, we simplify the expression for total running time to O(||N||). We anticipated that some nodes in the network might become completely unconnected as the link-filtering threshold θ was adjusted. We devised a simple heuristic to catch this possibility.
During the run of the link-thresholding Algorithm 3, a tree of node IDs was maintained for all singleton nodes that were not connected to any other nodes; nodes can become singletons if none of their links is stronger than θ. After the algorithm terminated, any singleton nodes in the tree were re-connected to their strongest neighbour, even if the weight of that link was below θ. In practice, the Reuters corpus required reconnection of only 2 nodes after thresholding, and the NYT corpus did not fragment. As a result, we did not focus further on network fragmentation during the experiment (Chapter 3).

Evaluating the Network

There are three tests that can be performed to test for "small-worldiness". The first is the power-law degree distribution; as we have seen above, there are pruning methods that can satisfy this requirement in the type of semantic networks that we are investigating. Once a network has the correct distribution, two additional properties are necessary to confirm a small-world network graph: a high clustering coefficient and a low diameter. We have implemented the two algorithms described below; initial experiments have shown that the algorithms work well when compared to the fully accurate, but painfully slow, brute-force alternatives. Fine-tuning the evaluation of link pruning is the subject of future work.

Clustering coefficient — The clustering coefficient was introduced by Watts and Strogatz (1998) as a measure of social networks. It calculates the proportion of a node's neighbours that are connected to each other.
With δ(G) the number of triangles in G, i.e., the number of complete subgraphs of three nodes, and with τ(G) the number of triples in G, i.e., the number of nodes in G that are the centre node in a path of length 2, the clustering coefficient C(G) of a graph G is defined as

    C(G) = 3δ(G) / τ(G)

This coefficient is costly to calculate for large graphs, as it requires a complete traversal of the network, examining each node's neighbours as it goes. Schank and Wagner (2005) show that for an input of n nodes of degree d, this naive approach has running time O(n·d_max), and the best asymptotic running time, using fast multiplication on the connection matrix of m links, is O(m^1.41). By contrast, Schank and Wagner propose a much faster approximate clustering function that runs in time O(1) by uniformly sampling node connectivities as stored in adjacency arrays, although setting a higher probability on a smaller error bound ε can significantly increase the running time. Since our approach generates adjacency arrays when the nodes of the corpus are linked (Algorithm 2), we can use this formula for the clustering coefficient.

Diameter — The diameter of a graph is defined as the maximum distance between any two points (Harary, 1969). The naive brute-force method of calculating the diameter of a graph is to calculate the average of the breadth-first network traversals starting from each node. Palmer et al. (2002) point out that such traversals incur a significant cost of O(n² log n) in a graph of n nodes. Instead, they propose an Approximate Neighbourhood Function (ANF) that shows good error bounds. The method uses a clever approximate counting technique that assigns a string of bits (or bitmap) to each node, with a single bit set to 1 in each. Roughly half of the bitmaps have a 1 in their first bit position, a quarter of them in their second position, etc., so that each bitmap has bit i set with probability 1/2^(i+1).
The network is "traversed" by performing the union of bitmaps between adjacent nodes. Estimating the number of nodes visited so far is based on the intuition that if we expect 25% of the nodes to be assigned to bit 1 and we haven't seen any such nodes yet, then we have probably seen fewer than 4 nodes. Thus the estimated number of nodes visited at any point in the traversal is proportional to 2^b, where b is the lowest bit position that has not been set in the bitmaps. The algorithm iteratively performs the union of all adjacent node bitmaps until all bitmaps in the network show the same value, indicating that all bitmaps have been merged and all nodes have been counted. Thus the number of iterations equals the diameter d of the network. The algorithm therefore has running time O(d(n + m)) for n nodes and m edges, and is expected to be fast since the diameter d is typically small.

The Limits of Tuning

At present, this small-world tuning algorithm is simply an initial approach to a principled automatic link-pruning method on associative similarity networks, and displays some shortcomings. Following our user experiment (Chapter 3), we generated some other networks with a variety of different documents, and found that some of the networks failed to converge successfully on the desired degree of accuracy. Instead, these networks oscillated between exponent values or repeated the same exponent value as θ was alternately raised and lowered. In this case, the results are as good as they will get, but are sometimes not good enough, as too many nodes in the network remain over-connected. Some of these networks fragment (i.e., decompose into unconnected subgraphs) as the threshold is raised, long before a power law is achieved. As a completely connected network is necessary to calculate a meaningful diameter value, and as a well-connected network is useful for allowing browsing search between nodes, fragmentation is to be avoided.
If the fragmentation is minor, a re-linking scheme can be employed. This scheme proceeds by finding the strongest links that fell below the link-weight cut-off threshold between pairs of isolated subgraphs, and re-connecting those links. This heuristic has been implemented and is the subject of further investigation. Based on preliminary experimental observations, premature convergence and fragmentation seem to depend on the semantics represented by the network. If the semantic attributes that contribute to its construction are not sufficiently "rich" (as an extreme example, a corpus made up entirely of the same document), then an insufficient diversity of strong keywords will lower the variability of the network's link weights. In such circumstances, raising the link threshold could remove a significant proportion of links, causing the network to break into fragments and thwarting the search for a small-world topology.

A.1.5 Semantic Retrieval

Retrieval of documents from the semantic network can take two forms: a Browse search that enables users to move from node to node through the network, and a Keyword search that allows users to specify keywords of interest and returns all nodes that match these keywords, regardless of whether the nodes are connected.

Browse Search — Once the network's links are in place, retrieval of a current node's neighbourhood is straightforward. The neighbour nodes and the weight of the links to each neighbour are stored in a metadata file alongside the document file, referenced by the ID of the current node. Since files are stored in a binary search tree based on ID number, a fully specified path name to the meta-file can be generated, and the file accessed in constant time. Thus the run-time cost is determined by the average number of neighbours that a node is likely to have, O(n) for n neighbour nodes, which in the worst case is bounded by the number of nodes in the network.
In a network tuned to a small-world topology, the link distribution is sparse; in the networks that we generated, each node had on average 10 neighbours.

Keyword Search — Keyword searching uses a simple algorithm to retrieve all nodes that are associated with a user's keyword query. As implemented, a set of document IDs is retrieved for each term in the query, and the result is the intersection of all of these ID sets. The running time to retrieve all IDs related to a specified keyword is O(log k) for k keywords indexed by the system, and merging two ordered sets takes O(m + n) for lists of length m and n. Weak keywords could be distributed among some proportion of the nodes of a network, and therefore the lengths m and n could be bounded by the size of the network, c·O(||N||) for some constant c ≤ 1. Determining the maximum proportion of the nodes of a network that can contain a given keyword is the subject of future work. The algorithm for keyword search is as follows:

Algorithm 4: Retrieving documents with a keyword search
Input:   Q — a query list of keywords
         IDX — a table listing document IDs for each keyword extracted from D
Output:  IDs — a table listing the IDs of documents corresponding to the query, ranked by their similarity to the query
 1  begin
 2      IDs ← ∅
 3      for each term t_Q in Q do
 4          IDs ← intersect(IDs, IDX(t_Q))
 5  end

A.1.6 Adding Individual Documents

Although we don't add additional, individual documents to the network once the network is built, it would be simple to do if required by a particular application. Adding a new node to a pre-existing network uses the O(D) Algorithm 1 to extract keywords and update the index, and the O(N²) Algorithm 2 to link the new node into the existing network structure. The only modification to each algorithm is the removal of the outermost loops, so that the algorithms apply to adding a single document rather than inducting an entire corpus.
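The single-document case can be sketched as below, again with dictionaries standing in for the binary-search-tree tables and with our own function name; it runs the per-document bodies of the two algorithms once, without the outer corpus loops.

```python
import math

def add_document(doc_id, terms, tf, df, idx, n_keywords=20):
    """Insert one new document into existing TF/DF tables and inverted index.

    Note: keywords of pre-existing documents are NOT recomputed, so they
    drift as the corpus-wide DF distribution changes.
    Returns the new document's (term, weight) keywords, strongest first.
    """
    counts = {}
    for t in terms:                           # step 1: count this document only
        counts[t] = counts.get(t, 0) + 1
    tf[doc_id] = counts
    for t in counts:
        df[t] = df.get(t, 0) + 1              # one DF increment per unique term

    n_docs = len(tf)                          # step 2: weight and select keywords
    weighted = sorted(((t, c * math.log(n_docs / df[t]))
                       for t, c in counts.items()),
                      key=lambda tw: tw[1], reverse=True)[:n_keywords]
    for t, _w in weighted:
        idx.setdefault(t, []).append(doc_id)  # update the inverted index
    return weighted
```

A linking pass over the new document's keywords (as in Algorithm 2) would then attach the node to its neighbours.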
The worst-case running time to add an individual document is therefore O(1) + O(N) = O(N), to link the new node to the N nodes already in the network. However, since we are tuning the network to a small-world distribution with Algorithm 3, we expect the average number of links per node to be logarithmic in the number of nodes. Therefore, a more likely bound is O(log N) to connect the new node to log N neighbours. Note, however, that adding individual documents to pre-existing networks will produce accurate keywords only for current and future new documents, while the keywords of the pre-existing nodes are likely to grow increasingly inaccurate as the distribution of term frequencies in the corpus (stored in table DF) changes gradually with each new document insertion.

A.2 The Context Network

The algorithms that are used to build and access the context network are based on the cue-event-object (CEO) model, as discussed in Chapter 2 with regard to situational principles, and in the GPS activity-pattern experiment of Chapter 4, where the algorithms are used to predict future activity based on a dynamic summary of past actions. While both the semantic network and the context network are used to organize information objects, there are some significant differences. Since the similarities between documents do not change, the links of the semantic network are static; by contrast, since modeling activity patterns in real time involves continual change, the links of the context network are dynamic. Also, unlike the semantic network, the context network does not connect information objects (such as documents) directly, but only through the events that they share. Here we describe how predictions are made, how context representations are encoded, and how the representations are retrieved. The motivation and justification for this approach is discussed in detail in Chapters 2 and 4.
A.2.1 Data Types

In general, all nodes in the context network (cue nodes, timestamp nodes, route objects, etc.) are indexed and retrieved by a unique ID number, just as with documents in the semantic network. All nodes in the system are also weighted to indicate recency of last usage. More complex approaches to weighting could include consistency of use over time, but research into the regularity of events in the environment (and human memory's adaptation to such regularity) has shown that the recency of an event is a very good predictor of its probability of happening again (Anderson and Schooler, 1991). Therefore we use recency as a simplified shorthand for consistency of occurrence. Unlike the semantic network, the context network is programmed to be held entirely in RAM: in practice the model representation never exceeded memory capacity (a minimum of 512MB) on the machines we used to develop the system, and thus we had no incentive to add code for memory paging. For larger data sets, paging of the sort that we used for the semantic network will have to be implemented, and acquiring large data sets is part of our future work. Since a surfeit of memory was available on our test machines, we designed the following data types to store references to their neighbour nodes directly, rather than using tables to store link objects as with the semantic network. The input data were parsed directly from raw GPS logs at run-time. The cues, events, and objects described in the CEO model are represented in simplified form in the implemented context network, using the following node types:

Cue Nodes — Cue nodes represent time units, and are fixed in advance, before any data are seen. The time units chosen are those required by the test domain and the scope of the system implementation, in this case minutes, hours, and days, but not seconds or dates. In general, the choice of time units determines the granularity of the timestamps that can be constructed.
As implemented, there are 35 cue nodes in the context network, representing the 7 days of the week, the 24 hours in a day, and the 4 quarter-hours in an hour. Each cue node contains a description, e.g., "Monday", a type, e.g., "day", and a real-valued weight that indicates the node's utility.

Timestamp Nodes — Timestamp nodes (T-nodes) index routes by when they have been driven. They are referenced by cue nodes that represent the start time of a trip, and point to the routes previously driven at that time. Since more than one route might be taken at any given day and time (say, in different weeks), the T-node contains a tree of route nodes, sorted by the weight of the link connecting the T-node to each route node. The strongest link indicates the route most consistently followed at that time. The link weights are not assumed to be unique. Each T-node contains a real-valued field representing each of its links to its day, hour, and minute nodes. Each T-node also contains a description that is the conjunction of its component time units, e.g., "Monday 9h 30m", and a weight that represents the node's utility.

Terminus Nodes — Terminus nodes are created ad hoc at run-time to represent the endpoints (or termini) of observed trips. If two terminus points lie within 100m of each other, then they are combined into a single terminus with an averaged coordinate value. A terminus node may be used in multiple routes, as an origin or destination. Each terminus node contains a tree structure that stores each of its constituent location points, and an average representative coordinate.

Route Nodes — Route nodes are the objects of the system. Since the same route can be travelled at different times, routes may be referenced by more than one T-node. Each route node contains two terminus nodes representing the origin and destination of the route. The relationships between these components are shown in Figure A.2.
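The four node types can be sketched as plain data structures. This is an illustrative simplification of the implemented types: the field names are ours, and the sorted trees are replaced by Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CueNode:
    description: str              # e.g. "Monday"
    kind: str                     # "day", "hour", or "quarter-hour"
    weight: float = 0.0           # utility of the cue

@dataclass
class TerminusNode:
    # constituent GPS readings pooled into this terminus, as (lat, long) pairs
    points: List[Tuple[float, float]] = field(default_factory=list)

    @property
    def centroid(self) -> Tuple[float, float]:
        """Average representative coordinate of the pooled readings."""
        lats, longs = zip(*self.points)
        return (sum(lats) / len(lats), sum(longs) / len(longs))

@dataclass
class RouteNode:
    origin: TerminusNode          # a route joins two termini
    destination: TerminusNode
    weight: float = 0.0

@dataclass
class TimestampNode:
    description: str                      # e.g. "Monday 9h 30m"
    cue_weights: Dict[str, float]         # links to its day/hour/minute cues
    # (link weight, route) pairs, kept sorted so the strongest link comes
    # first; weights need not be unique, hence a list rather than a map
    routes: List[Tuple[float, RouteNode]] = field(default_factory=list)
    weight: float = 0.0
```

Route nodes here conflate the event and object roles of the CEO model, as described below.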
Figure A.2: The relationship between context-network components. Day (d), hour (h), and minute (m) cues are "linked" to timestamps through a 3-stage binary search tree. Each timestamp contains the weights of those links, a unique ID identifier, its own activation weight, and a binary search tree (BST). The tree links the timestamp to one or more routes, and is sorted by the weights (ω) of those links. Each route contains a unique ID identifier, its own activation weight, and a reference to an origin terminus (t_o) and a destination terminus (t_d). Cues are chosen prior to run-time; timestamps, routes, and termini are created as GPS log files are read.

Regardless of type, each node in the network has its own weight value; the nodes that are most consistently used will have the highest weight values. In relation to the CEO model, the route/terminus nodes are the objects of the system. However, the CEO model's event nodes are not needed in the experimental system. Event nodes in the CEO model are used to allow multiple sets of temporal cues to be connected to multiple objects of different types. In this experiment there is only one object type, the route, and the T-nodes are considered to be independent of each other; thus the T-nodes are the sole intermediaries between cue nodes and object nodes. Route nodes in the experiment may be considered a conflation of the event and object nodes of the CEO model.

A.2.2 Predicting Destinations

The basic algorithm of the experiment is shown in Algorithm 5. The test data consist of a set of GPS logs, with each log file recording an individual trip driven by the same person over a six-month period. The
algorithm reads each of the time-ordered GPS log files in order of occurrence, attempting to predict the driver's likely destination based on past activity and updating the accuracy score whenever a prediction is made. At the start of each iteration, the time and place of the driver's starting point (or origin), and the location of the destination, are parsed from the next GPS file. Since the origin and destination are at the top and bottom of each GPS file, they can be extracted in constant time. The destination is temporarily ignored while the function predictDestination(), described below as Algorithm 7, performs contextual retrieval by querying the network model for likely destinations; the function returns a list of predicted destinations. Each predicted destination is then compared to the actual trip destination, and if they match (i.e., in practical terms, their coordinates are within 100m) then the accuracy score (the ratio of correct to total guesses) is increased. If the prediction list is empty, then no prediction is made, and the score remains the same.

The network model is updated in two ways to improve predictions in later iterations. First, the function decayInterveningEvents() decays all events that were expected but did not occur since the last observed event, by running the clock forward from the timestamp lastEventTime until the current time time_start and applying a decay formula to all retrieved T-nodes, and to all links and nodes connected to these T-nodes. Decay is performed by applying the hyperbolic tangent to the weight ω of the node or link, i.e., ω′ = tanh(ω), although any function would work equally well provided that its output is less than its input and it is monotonic, which ensures that a set of elements will maintain their relative weight rankings when they are decayed.
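The decay step above can be sketched directly; a quick check confirms the two properties the text requires of any substitute function (output smaller than input, and monotonic, so rankings are preserved):

```python
import math

def decay(weight: float) -> float:
    """Decay an activation weight. For w > 0, tanh(w) < w, and tanh is
    monotonically increasing, so a set of weights keeps its relative
    ranking after decay."""
    return math.tanh(weight)

weights = [0.9, 0.5, 0.2]          # e.g. link weights on a T-node
decayed = [decay(w) for w in weights]
```

Any replacement for `tanh` (e.g. multiplying by a constant factor below 1) would serve, as long as it satisfies the same two properties.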
Rolling the clock forward to retrieve T-nodes is not a solution that generalizes well, but is acceptable for this specific GPS data set, since the time between events is typically no more than a couple of days, i.e., 2 days × 24 hours × 4 quarter-hours = 192 searches between lastEventTime and time_start, a trivial number. Each search takes O(log(number of T-nodes)) in the worst case, but since the number of time units is fixed prior to run-time, and since the number of T-nodes is bounded by the number of possible combinations of time units, i.e., 7 days × 24 hours × 4 quarter-hours = 672 in the case of this experiment, the cost of a search is effectively constant.

The second way that the network model is updated is to encode the trip's origin and true destination in a route object. Since termini on a map are arbitrary and potentially numerous, terminus nodes and route nodes are created at run-time. The latitude and longitude (lat-long) values of the start location are compared to the lat-long values of termini already in the model with the function getTerminus(): if none of the model's termini lie within 100m of the start location, then the new lat-long reading is encoded as a new terminus with the function createNewTerminus(). If a nearby match is found, the proximate lat-long values are pooled into a single terminus node, and the node is given a lat-long value that averages the proximate readings.

Terminus nodes in the system are accessed through a two-stage binary search tree. The first stage sorts the latitudes. The latitude of the candidate location is compared to the next-highest and next-lowest value in the tree, and if either is within 100m, then the search uses that nearby value to retrieve a tree of longitudes. If no latitudes in the first tree are near enough, the search terminates.
In the second stage, if the longitude of the candidate location is within 100m of the next-highest or next-lowest value in the tree, then the terminus for the closest value is retrieved, and the candidate lat-long is added to the set of lat-longs for that terminus. The single "official" coordinate point of the terminus is the average of the lat-longs of the set. If the search terminates without retrieving an existing terminus, then the current location is inserted into the lat-long tree as a new terminus.

The search for a matching terminus is therefore O(log ||termini||), which in the worst (and unexpected) case is O(log ||trips||), if no trip ever returns to a previously visited terminus. We consider this extremely unlikely. As shown in the results of Chapter 4, most drivers are expected to return regularly to, e.g., home and workplace, and the rate at which new termini are added to the model is expected to fall off over time. Thus over the long term, the search for a matching terminus is expected to be constant.

Once terminus nodes are in hand, the cost of creating a route node is constant. Like termini, route nodes are also stored in a two-stage binary search tree. Routes are stored by the IDs of the origin and destination termini. The cost of inserting and retrieving routes is therefore bounded by the number of termini, which as we've seen is in O(log ||trips||). The occurrence of an event (i.e., driving a trip) is added to the model with the function encodeEventOccurrence(), described in Algorithm 6, of cost O(||trips|| log ||trips||).
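The two-stage terminus search above can be sketched as follows. This is a simplified illustration, not the thesis code: sorted lists with `bisect` stand in for the binary search trees, and the 100m tolerance is approximated as a fixed offset in degrees (the real system compares metric distances).

```python
import bisect

TOL = 100 / 111_000  # ~100 m expressed in degrees of latitude (approximation)

class TerminusStore:
    """Two-stage search: latitude first, then longitude. A reading that
    lands within ~100 m of an existing terminus is pooled into it."""
    def __init__(self):
        self.lats = []    # sorted latitudes (first stage)
        self.by_lat = {}  # latitude -> {longitude: terminus (list of readings)}

    def get_or_create(self, lat, lon):
        i = bisect.bisect_left(self.lats, lat)
        # examine the next-lowest and next-highest latitude in the "tree"
        for j in (i - 1, i):
            if 0 <= j < len(self.lats) and abs(self.lats[j] - lat) <= TOL:
                # second stage: scan longitudes stored under that latitude
                for known_lon, term in self.by_lat[self.lats[j]].items():
                    if abs(known_lon - lon) <= TOL:
                        term.append((lat, lon))  # pool the proximate reading
                        return term
        # no nearby terminus found: insert the reading as a new terminus
        bisect.insort(self.lats, lat)
        term = [(lat, lon)]
        self.by_lat.setdefault(lat, {})[lon] = term
        return term
```

Here a terminus is just the list of its pooled readings; the "official" coordinate would be their average, as described above.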
Thus the worst-case cost of prediction is the combined costs of —

O(1) — extracting trip-start and -end data from the logs
O(log ||trips||) — retrieving a prediction
O(1) — decay of intervening events
O(log ||trips||) — creating the terminus (and route) objects
O(||trips|| log ||trips||) — encoding the route occurrence

— or

2 O(log ||trips||) + O(||trips|| log ||trips||) = O(||trips|| log ||trips||).

In practice this has proven to be negligible.

Algorithm 5: Predicting destinations from a series of time-stamped routes
Input: GPS — a time-ordered series of GPS logs
Output: accuracy — a score in the range 0 to 1
        model — a probabilistic network model of event occurrences
begin
    numGuesses ← 0
    numCorrect ← 0
    lastEventTime ← 0
    for i = 1 to ||GPS|| do
        {time_start, place_start} ← extractStart(GPS_i)
        place_dest ← extractDestination(GPS_i)
        routeList ← predictDestination(time_start, place_start)      // Alg. 7
        for j = 1 to ||routeList|| do
            place_pred ← routeList_j
            numGuesses ← numGuesses + 1
            if place_pred = place_dest then
                numCorrect ← numCorrect + 1
        accuracy ← numCorrect / numGuesses
        model ← decayInterveningEvents(lastEventTime, time_start)    // decay
        lastEventTime ← time_start
        origin ← getTerminus(place_start)                            // get termini
        if origin = ∅ then
            origin ← createTerminus(place_start)
        destination ← getTerminus(place_dest)
        if destination = ∅ then
            destination ← createTerminus(place_dest)
        route ← getRouteObject(origin, destination)                  // get route
        if route = ∅ then
            route ← createRouteObject(origin, destination)
        model ← encodeEventOccurrence(model, route, time_start)      // Alg. 6
end

A.2.3 Contextual Encoding

Encoding in the network model takes place in the function encodeEventOccurrence() in Algorithm 5. Algorithm 6 describes the process of incrementally encoding activity patterns.
The input to the algorithm is the clock time of the start of the trip, and the route that was taken. The output per se is the updated network model. First, cue nodes for day, hour, and minute are retrieved from the model. The function getCue() is straightforward in the case of day and hour: the days and hours map one-to-one onto day and hour nodes. The function getGeneralizedCue() is used to pool the 60 minutes in an hour into 4 blocks of 15 minutes each (much as termini are pooled if close enough together). Pooling the minutes allows a more intuitive description of time, such as "around half-past the hour". The functions getCue() and getGeneralizedCue() both execute in constant time.

Algorithm 6: encodeEventOccurrence() — Encoding occurrence patterns
Input: model — a probabilistic network model of event occurrences
       route — a tuple comprised of start and destination coordinates
       time — a tuple comprised of day, hour, and minute values
Output: an updated probabilistic model of event occurrences
begin
    // retrieve appropriate cue nodes from the model
    day ← getCue(time_day)
    hour ← getCue(time_hour)
    minute ← getGeneralizedCue(time_minute)
    // process the timestamp
    t ← getTimestamp(day, hour, minute)
    if t = ∅ then
        t ← createTimestamp(day, hour, minute)
        linkToCue(t, day)
        linkToCue(t, hour)
        linkToCue(t, minute)
        linkToRoute(t, route)
    else
        jolt(day)                        // jolt all nodes
        jolt(hour)
        jolt(minute)
        jolt(route)
        jolt(getCueLink(t, day))         // jolt all links
        jolt(getCueLink(t, hour))
        jolt(getCueLink(t, minute))
        jolt(getRouteLink(t, route))
end

Once cue nodes are retrieved for day, hour, and minute, they are used to retrieve a timestamp node (T-node) with the function getTimestamp(). There is only one T-node for any unique combination of time cues.
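The encoding step of Algorithm 6 can be sketched in Python. This is a minimal illustration under simplifying assumptions: a plain dictionary stands in for the three-stage search tree, the minute value is assumed already generalized to a quarter-hour, and a route is reduced to a hashable key.

```python
JOLT = 0.9  # maximal activation level on the normalized scale

def encode_event_occurrence(model, route, time):
    """Sketch of Algorithm 6: reinforce (or create) the timestamp that
    binds the trip's time cues to the route taken.
    `model` maps (day, hour, quarter-hour) keys to timestamp records."""
    t = model.get(time)
    if t is None:
        # new timestamp: link it to its cues and route at the default weight
        model[time] = {"cue_links": {"day": JOLT, "hour": JOLT, "minute": JOLT},
                       "route_links": {route: JOLT}}
    else:
        # existing timestamp: jolt every node and link involved in this trip
        for cue in ("day", "hour", "minute"):
            t["cue_links"][cue] = JOLT
        t["route_links"][route] = JOLT
    return model
```

Repeated occurrences of the same trip at the same time thus converge on a single, strongly-weighted timestamp rather than creating duplicates.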
T-nodes are retrieved from a three-stage binary search tree, with a stage each for day, hour, and minute. For each stage, the unique integer ID of each temporal cue node is used as the key to retrieve the tree for the next stage, and the last tree in the series retrieves the T-node. Thus T-node retrieval takes time O(log(number of T-nodes)), where the number of T-nodes is bounded by the number of possible combinations of time units, i.e., 672 as seen above, and thus the retrieval runs in constant time. If no T-node fits the given arguments, then a new one is created with the function createTimestamp() and inserted into the three-stage tree with the values of the new day-hour-minute combination. The new links between the cues and the T-node are given an arbitrary default weight of 0.9 on a normalized scale, and the weight values are stored in the reserved fields in the T-node. The route is inserted into the T-node's route tree, keyed by the default weight of 0.9. If the final stage is successful and an existing T-node is retrieved, then all the links and nodes related to the current trip are jolted, i.e., their weight value is increased to a maximal level of 0.9 on a normalized scale⁴.

Creating and storing a T-node costs the same as retrieving one, i.e., constant time, and linking already-retrieved nodes by storing the link's weight value in a reserved field also takes constant time. The function getCueLink() is implemented to retrieve the link connecting a T-node to a cue node, and runs in constant time since, with a T-node in hand, the link values to the day, hour, and minute nodes are directly available in reserved fields in the T-node. The function getRouteLink() is more complex, since a T-node may reference multiple routes; these are stored as a tree in the T-node, with the weight of the link as key and the route itself as value.
Thus the route cannot be retrieved in logarithmic time, and requires an exhaustive search, comparing the route ID to each of the IDs in the route tree. Thus retrieving a route takes time O(r) for r routes indexed by the T-node. When the route is found in the tree, its corresponding key is jolted, which entails removing the weight-route mapping from the tree and re-inserting it at the new value, with cost O(log r). Thus both retrieving and jolting a route takes O(r log r). In practice, T-nodes did not reference more than three or four routes, and so the route IDs were stored in the T-node in a simple list. Thus with a T-node already retrieved, the cost of accessing the nodes connected to it is essentially constant. However, in the worst case where routes are never repeated but are all driven at the same time, the bound becomes O(||trips|| log ||trips||). We consider this scenario highly unlikely.

The upper-bound cost for encoding an occurrence pattern is thus the cost of retrieving or creating a terminus node for the trip origin, plus the cost of retrieving or creating a timestamp node:

O(log ||trips||) + O(1) + O(||trips|| log ||trips||) = O(||trips|| log ||trips||),

which is more than adequate for the purposes of our experiment.

⁴ Activation could just as easily be set to the limit of 1.0, although at development time we wanted to conserve some activation "headroom" for experimenting with different jolting strategies.

A.2.4 Contextual Retrieval

Contextual retrieval is used in the function predictDestination() in Algorithm 5, and is described below in Algorithm 7 with respect to retrieving a prediction of trip destination. As with contextual encoding, the algorithm begins by retrieving cue nodes that correspond to the time of trip origin. These retrievals all occur in constant time.
The retrieved nodes are used to retrieve the timestamp node (T-node) that they share, with the function getTimestamp(), which runs in constant time since the number of T-nodes is finite.

Algorithm 7: predictDestination() — Retrieving a prediction
Input: startTime — a tuple of the day, hour, and minute of the trip start time
       startPlace — a tuple of the lat-long coordinates of the trip's origin
       θ — a minimum threshold for predicting a route (set to 0.4)
Output: predictionList — a ranked list of predicted destinations
begin
    day ← getCue(startTime_day)
    hour ← getCue(startTime_hour)
    minute ← getGeneralizedCue(startTime_minute)
    t ← getTimestamp(day, hour, minute)
    if t = ∅ then
        return ∅
    else
        predictionList ← ∅
        routeList ← getRouteList(t)
        for i = 1 to ||routeList|| do
            {weight, route} ← routeList_i
            if startPlace = route_origin and weight > θ then
                predictionList ← predictionList ∪ route_destination
        return predictionList
end

If no timestamp is found, then a null result is returned. Otherwise the function getRouteList() returns a list of all routes referenced by the T-node, in time O(r) for r routes indexed by the T-node, if the leaves of the T-node's route tree are maintained as a list. The number of routes is bounded in the unlikely worst case by the number of trips, if every trip starts from the same place and at the same time, but ends at a different destination. While this would place the number of routes at an upper bound of O(log ||trips||), in practice we found that T-nodes were typically connected to a maximum of three or four routes, and thus route retrieval in the experiment is essentially a constant-time function. This result fits with the intuition that most drivers who leave consistently from specific trip origins at specific times are likely to be headed to very few appropriate destinations.
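Algorithm 7 can be sketched as follows. As before, this is an illustration under simplifying assumptions: the model is a dictionary keyed by (day, hour, quarter-hour), and a route is an (origin, destination) pair; the threshold value 0.4 is the one named in the text.

```python
THETA = 0.4  # minimum link weight for a route to be predicted

def predict_destination(model, start_time, start_place):
    """Sketch of Algorithm 7: collect destinations of routes previously
    driven from this origin at this time, strongest link first."""
    t = model.get(start_time)
    if t is None:
        return []  # no timestamp for this time: no prediction is made
    predictions = []
    # visit routes strongest-first, as the weight-sorted route tree would
    for route, weight in sorted(t["route_links"].items(),
                                key=lambda kv: kv[1], reverse=True):
        origin, destination = route
        if origin == start_place and weight > THETA:
            predictions.append(destination)
    return predictions
```

Routes from the wrong origin, or with link weights at or below the threshold, contribute nothing, which keeps the false-positive rate down as the text describes.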
The destination terminus node of the most strongly connected route is returned in a tuple along with the weight of the link, which is used to indicate the confidence of the prediction. The algorithm then iterates through the route list of size O(log ||trips||), and accumulates the destinations of all routes that originate from startPlace and whose link weights exceed a threshold θ. In practice, the value of θ was arbitrarily set to 0.4, as this seemed to provide the best ratio of correct to false-positive predictions. Thus the cost of contextual retrieval of a prediction is the sum of the cost of retrieving a T-node plus the cost of iterating through the route list: O(log ||trips||) + O(log ||trips||) = O(log ||trips||).

A.2.5 Temporal Aggregation

The goal of aggregation is to generalize the context network where possible, to create summaries that describe regular patterns of behaviour. The intuition for aggregation is explained in Chapters 2 and 4. Here we describe the mechanism of aggregation in more detail using the language of set theory. Aggregation follows the process described in the following algorithm:

Algorithm 8: Aggregation of timestamps
Input: t — a newly created or newly updated timestamp node
Result: an updated model
begin
    EVENTS ← getAllConnectedEvents(t)
    T_e ← getAllPeerTimestamps(EVENTS)
    T_a ← perfectAggregation(T_e)
    if T_a = ∅ then
        partialAggregation(T_e)
end

The algorithm begins by retrieving all event nodes connected to a given new or updated timestamp node t. The algorithm then retrieves all of t's peer T-nodes, i.e., all the timestamps that are connected to the same event nodes as t. With this set T_e of peer T-nodes, the algorithm attempts perfect aggregation, i.e., the substitution in node t of a complete set of time units with a single time unit of bigger granularity.
If perfect aggregation fails, then the algorithm attempts partial aggregation, i.e., the substitution in node t of an incomplete set of time units with a well-formed formula that indicates a concurrence of specific units referencing one or more particular events.

The cost of the algorithm is highly dependent on how the data types are implemented. If no special adaptations are made, then in the worst case, if most T-nodes in the system are connected to most event nodes, retrieval of T_e will take time O(||EVENTS||) × O(||T||), which, since the number of T-nodes in the system is finite and limited to the number of possible combinations of time units, is O(||EVENTS||). The number of events is potentially very large, although evidence from the driving experiment suggests (with regard to the rate of introduction of new termini and routes to the model) that the number of possible event types is limited, and perhaps virtually constant over the long term. Still, processing can be accelerated if T-nodes are organized by the distributions of aggregable and non-aggregable units that they share, so that candidates for aggregation may be more easily found. This is a topic for future work.

Once a set of aggregable T-nodes has been identified, the cost of aggregation is relatively low. One node in the aggregable set is kept, and the aggregation substitution is applied to its time units. The other aggregable T-nodes and their links are deleted. Since T-node links are stored in tables, links can be deleted en masse along with the T-nodes in constant time, and thus the cost of an individual aggregation is constant. The conditions for perfect and partial aggregation are as follows.

Fundamental Data Types and Relations

How the time units are related (e.g., hour as part-of day) is specified in a temporal subsumption graph (TSG). The TSG is coded as a simple hierarchy of sets.
There are two types of relations that need to be indicated between sets in the TSG: those that are aggregable (e.g., the days Monday to Friday can be replaced by the larger unit weekday) and those that are not (minutes are part of hours, but say nothing about which hour). In what follows, we describe how aggregation functions with respect to the structure of the TSG. T-nodes represent a conjunction of atomic time units; as such they are timestamps that can be used to index events by when they occur. T-nodes that reference the same event(s) may be aggregated if they show the same level of support, i.e., if their activation weights are the same within some tolerance. For aggregation to be possible, the following elements and relations are necessary.

Given:

U: the superset containing all time units u, i.e., the Temporal Subsumption Graph (TSG).
T: the set of all T-nodes (timestamps); each T-node is a set containing a conjunction of time units u.
E: the set of all events; e = an event ∈ E.

[Figure A.3 appears here.]

Figure A.3: The aggregation mapping Θ in the TSG hierarchy.

U is comprised of non-intersecting levels U_i. Let n be the number of levels in the TSG hierarchy U.

If U_i is aggregable (e.g., 0600h ⇒ morning):

Each aggregable level U_i is the union of aggregation sets A: U_i = {A_u | u ∈ U_{i+1}}.

An aggregation subset A of level U_i holds time units that map onto a single element u of U_{i+1}:

Fix i < n. Fix u ∈ U_{i+1}. Let Θ : U → U be a many-to-one mapping with the property that if u ∈ U_i for i < n then Θ(u) ∈ U_{i+1} (and if u ∈ U_n then Θ(u) = u). An aggregation subset A_u ⊆ U_i is the set of elements Θ⁻¹(u) for some u ∈ U_{i+1}.

If U_i is not aggregable (e.g., Monday ⇏ July):

Each non-aggregable level U_i will hold a single subset such that U_i = A, and for all u ∈ A there is no discrete mapping Θ, i.e., Θ(u) = ∅; rather, set A can map to any and all units of U_{i+1}.
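The mapping Θ and its aggregation subsets can be sketched as follows. This toy example uses days and day-classes; the unit names are illustrative, not taken from the thesis code.

```python
# A toy TSG level: days of the week (level U_i) and the classes they
# aggregate into (level U_{i+1}). THETA is the many-to-one mapping Θ.
THETA = {"Mon": "weekday", "Tue": "weekday", "Wed": "weekday",
         "Thu": "weekday", "Fri": "weekday",
         "Sat": "weekend", "Sun": "weekend"}

def aggregation_subset(u_prime):
    """A_u' = Θ⁻¹(u'): all units of level U_i that map onto u' in U_{i+1}."""
    return {u for u, v in THETA.items() if v == u_prime}
```

A non-aggregable level would simply have no entries in such a mapping (Θ(u) = ∅ for all its units).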
T-nodes that Map to Event e

Let T_e be a subset of T that maps to a single event e in E:

Let Λ : T → E be a many-to-one mapping. Fix e ∈ E. T_e = Λ⁻¹(e). The elements of T_e are the T-nodes T_e1 ... T_ek, where k is the size of T_e.

Let T_eA be the subset of all T-nodes in T_e that contain units of aggregation subset A:

T_eA = {T_ei ∈ T_e | T_ei ∩ A ≠ ∅}.

Let T_FILTER be the set of all time units u in T_eA that are also members of aggregation set A:

T_FILTER = (⋃ T_eA) ∩ A.

Let T_REM be the set of all T-nodes in T_eA with the elements of A removed:

T_REM = {T_ei − A | T_ei ∈ T_eA}.

The set of all T-nodes in T_e that do not contain units of A:

(T_eA)^c = {T_ei | T_ei ∈ T_e, T_ei ∩ A = ∅}.

Perfect Aggregation

In the ideal case, if all the units of A_u ∈ U_i appear in otherwise identical members of T_e, then we can either replace those units with a single unit u ∈ U_{i+1}, or remove them.

If
① A = T_FILTER — all time units in A are also in T_eA (and A maps onto a u),
and
② ⋃ T_REM = ⋂ T_REM — the non-A units in T_eA are the same,
such that u′ = Θ(u) for u ∈ A identifies the Θ-element of set A,
then
T_e′ = {{u′} ∪ ⋃ T_REM} ∪ (T_eA)^c — replace all T_eA in T_e with a single node.

For Example

Let e be an event in set E. Let A = {a, b} ⊆ U_i. Let T_e = {{a, 1}, {b, 1}, {q, 0}}, where the glyphs a, b, q, 1, 0 represent arbitrary time units such that, within the TSG hierarchy, {a, b, q} ∈ U_i and {0, 1} ∈ U_j, for i ≠ j. Thus —

T_eA = {{a, 1}, {b, 1}}.
T_FILTER = {a, b}.
T_REM = {{1}, {1}}.
(T_eA)^c = {{q, 0}}.

Condition ① is satisfied since T_FILTER = A. Condition ② is satisfied since ⋃ T_REM = ⋂ T_REM = {1}.

If U_i is aggregable — Then Θ⁻¹(x) = A_x. Thus u′ = x. So we construct the new

T_e′ = {{{x} ∪ {1}} ∪ {{q, 0}}} = {{x, 1}, {q, 0}}

where the sets {a, 1} and {b, 1} have been aggregated into the set {x, 1}.

If U_i is not aggregable — Then for all u ∈ A, Θ(u) = ∅. Thus u′ = ∅.
So we construct the new

T_e′ = {{∅ ∪ {1}} ∪ {{q, 0}}} = {{1}, {q, 0}}

where the sets {a, 1} and {b, 1} have been aggregated into the set {1}.

Partial Aggregation

In the more common case where there is at least some agreement between the members of T_e, a partial pattern can be abstracted into a hierarchical assertion of conjunctions and disjunctions. This assumes that all constituent sub-patterns are unique to a single event e; otherwise they will be "broken out" and shared with other events. In partial aggregation, the sets of T_e are clustered recursively in order of most-to-least-common time unit; in the case where no one unit is most numerous, clustering continues by decreasing time-unit granularity.

For Example

Let e be an event in set E. Let T_e = {{α, 1, a}, {β, 1, b}, {α, 0, c}, {α, 1, q}}, where the glyphs α, β represent arbitrary time units such that, within the TSG hierarchy, {a, b, c, q} ∈ U_i, {0, 1} ∈ U_j, and {α, β} ∈ U_k, for i < j < k. Here α and 1 are equally numerous, so α takes precedence due to its bigger granularity size.

T_e = {{α, 1, a}, {β, 1, b}, {α, 0, c}, {α, 1, q}}
    = {{α, (1, a) | (1, q) | (0, c)}, {β, 1, b}}
    = {{α, (1, a | q) | (0, c)}, {β, 1, b}}.

By contrast, if one T-node is changed such that glyph 1 is most numerous —

T_e = {{α, 1, a}, {β, 1, b}, {α, 1, c}, {α, 1, q}}
    = {1, (α, a) | (α, q) | (α, c) | (β, b)}
    = {1, (α, a | q | c) | (β, b)}.

In our GPS experiment, we chose time units (day, hour, minute) based on the nature of the GPS data set and the desired granularity of results. Since none of the units that we chose were aggregable, we were not able to test aggregation as part of our experiment. Nonetheless, code for aggregation and the temporal subsumption graph (TSG) is implemented, and further testing will be part of our future work, subject to acquisition of a large and diverse GPS trip corpus.
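The perfect-aggregation conditions above can be sketched as a small function. This is a minimal illustration of the set-theoretic definitions, not the thesis implementation: T-nodes are Python sets of time-unit labels, and `u_prime` is the coarser unit Θ maps A onto (`None` when the level is not aggregable).

```python
def perfect_aggregate(t_e, A, u_prime):
    """Sketch of perfect aggregation. t_e is a list of T-nodes (sets of
    time units) referencing the same event; A is an aggregation subset.
    Returns the aggregated T-node list, or None if conditions 1/2 fail."""
    t_eA = [t for t in t_e if t & A]            # T-nodes containing units of A
    t_rem = [t - A for t in t_eA]               # those T-nodes with A removed
    t_filter = set().union(*t_eA) & A if t_eA else set()
    # Condition 1: every unit of A appears; Condition 2: non-A parts agree
    if t_filter != A or not t_rem or set.union(*t_rem) != set.intersection(*t_rem):
        return None
    # replace the aggregable T-nodes with a single node {u'} ∪ T_REM
    merged = (set() if u_prime is None else {u_prime}) | t_rem[0]
    return [merged] + [t for t in t_e if not (t & A)]
```

Run on the worked example from the text, T_e = {{a,1}, {b,1}, {q,0}} with A = {a,b} aggregates to {{x,1}, {q,0}} in the aggregable case and to {{1}, {q,0}} otherwise.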
For aggregation testing to be meaningful, we will need access to some large and detailed GPS data sets. In particular, comparisons between different driving situations, such as for commuters, couriers, taxis, and family use, would be of particular interest.
Appendix B

Data for User Study

B.1 User Tasks

B.1.1 NYT Task Descriptions

Training Tasks

Find articles that discuss teaching the theory of evolution.
Find articles that discuss new internet search technologies.

Experimental Tasks

Tasks are listed by task ID number as used in the result tables —

1. Find articles that discuss reporters (e.g. Judith Miller) going to jail.
2. Find articles that discuss the business dealings of
3. Find articles that discuss a new visualization tool for internet search queries.
4. Find articles that discuss Google's business dealings (not its technology).
5. Find articles about problems in a professional sport league.
11. Find articles which discuss how to shape your virtual internet identity.
12. Find articles in which one company refers to another company.
13. Find articles that criticize U.S. foreign policy in the Iraqi war.
14. Find articles that discuss digitizing books for online reference.

B.1.2 Reuters Task Descriptions

Training Tasks

Find articles that talk about coffee quotas.
Find articles which talk about export problems of European countries.

Experimental Tasks

Tasks are listed by task ID number as used in the result tables —

6. Find articles that talk mainly about people going on strike.
7. Find articles that mention the U.S. selling weapons to another country.
8. Find articles that talk about Canadian banks giving loans to Brazil.
9. Find articles that talk about new products of IBM.
10. Find articles that talk about financial predictions of the future.
15. Find articles that talk about legal issues at the company Texaco.
16. Find articles about weather causing a problem with Argentina's agriculture.
17. Find articles that talk about Canada's airline industry.
18. Find articles which discuss the situation of workers in the American economy.
B.2 User Questionnaires

B.2.1 Pre-Questionnaire

Personal Info

Age — Users provided their age, typed into a text field.
Gender — Users chose one of "Female" or "Male" with radio buttons.
User Name — Users provided a unique anonymous identifier, typed into a text field.

Computer Usage

"Choose your level of expertise in computer usage" — Users moved a slider along the following scale from 0 to 10:
"0 - rarely use a computer"
"10 - expert"

"What search engine do you use most often?" — Users typed their answer into a text field.

"How often do you use a search engine?" — Users chose one of the following with radio buttons:
"Many times each day"
"A couple of times a day"
"Once a day"
"Once every couple of days"
"Less"

Experience

"Choose the level of education you finished or are in now:" — Users chose one of the following with radio buttons:
"High School"
"Undergraduate"
"Graduate"

"Year" — Users typed the year-of-program that they had completed, into a text field.

"How often do you read news? (paper or internet)" — Users chose one of the following with radio buttons:
"every day"
"about every second day"
"about once a week"
"less"

B.2.2 Post-Questionnaire

"In what circumstances did you use the green button?" — Users typed their answer into a text box.

"Which system helped you better to perform the tasks?" — Users chose one of the following with radio buttons:
"SEARCH (without the green button)"
"BROWSE (with the green button)"

"Which system did you prefer?" — Users chose one of the following with radio buttons:
"SEARCH (without the green button)"
"BROWSE (with the green button)"

"In your opinion, in which system did you spend more time overall?" — Users chose one of the following with radio buttons:
"SEARCH (without the green button)"
"BROWSE (with the green button)"

"Did you find the green button useful?" — Users typed their answer into a text box.
"Please rate how useful was the green button feature in your opinion" — Users moved a slider along the following scale from 0 to 10:
"0 - not useful"
"10 - very useful"

"Did the green button option give you results that were hard to find otherwise?" — Users typed their answer into a text box.

Users were also asked to provide any comments that they might have that were not covered by the questions, which were typed into a text box.

B.3 Order of Tasks

[Table B.1 appears here: a grid of the 24 users against the 18 task IDs, grouped into NYT and Reuters blocks, marking the order in which each user completed each block of tasks.]

Table B.1: Order of task completion. There were four main experimental groups, each of which completed the four blocks of questions in different order. In all groups, task blocks 1 and 2 were performed in the Search interface, and task blocks 3 and 4 were performed in the Browse interface. The Search interface is virtually identical to standard Google searching in a browser, and provides a baseline by which the Browse interface can be evaluated.

B.4 Table of Results

[Table B.2 appears here: per-user scores for each of the 18 tasks, grouped into NYT and Reuters blocks, with per-task median scores, number of targets, and extent and directness indicators.]
Table B.2: Data for the user study. Tasks in the first and third columns (task IDs 1 to 10) were taken from the NYT corpus, and tasks in the second and fourth columns (task IDs 11 to 18) were taken from the Reuters corpus. Task IDs are listed across the top row. Median scores per task are given below the table, and table cells above the median value are shaded. The extent of a task is based on a super- or sub-median number of correct targets for that task available in the corpus. The directness of a task is based on the ease with which the task could be completed by simple keyword search. Tasks of high extent and directness are indicated with ones; zeros are omitted for clarity.
Appendix C

Data for Context Experiment

trip  route ID  origin ID  dest ID  routes  termini  guesses  correct  score
1    r01  t00  t01   1   2
2    r02  t01  t02   2   3
3    r03  t02  t00   3   3
4    r04  t03  t04   4   5
5    r05  t04  t01   5   5
6    r06  t01  t00   6   5
7    r07  t01  t05   7   6
8    r08  t05  t06   8   7    1   0  0.0000
9    r09  t06  t07   9   8    1   0  0.0000
10   r10  t07  t01  10   8    2   0  0.0000
11   r06  t01  t00  10   8    2   0  0.0000
12   r01  t00  t01  10   8    2   0  0.0000
13   r06  t01  t00  10   8    2   0  0.0000
14   r01  t00  t01  10   8    2   0  0.0000
15   r06  t01  t00  10   8    2   0  0.0000
16   r01  t00  t01  10   8    2   0  0.0000
17   r11  t01  t08  11   9    2   0  0.0000
18   r12  t08  t00  12   9    2   0  0.0000
19   r13  t00  t04  13   9    2   0  0.0000
20   r14  t09  t01  14  10    3   1  0.3333
21   r06  t01  t00  14  10    3   1  0.3333
22   r07  t01  t05  14  10    3   1  0.3333
23   r15  t05  t10  15  11    3   1  0.3333
24   r16  t10  t01  16  11    5   2  0.4000
25   r06  t01  t00  16  11    6   3  0.5000
26   r01  t00  t01  16  11    7   4  0.5714
27   r06  t01  t00  16  11    8   5  0.6250
28   r01  t00  t01  16  11    9   6  0.6667
29   r11  t01  t08  16  11   10   7  0.7000
30   r12  t08  t00  16  11   11   8  0.7273
31   r01  t00  t01  16  11   12   8  0.6667
32   r07  t01  t05  16  11   12   8  0.6667
33   r17  t05  t00  17  11   12   8  0.6667
34   r13  t00  t04  17  11   12   8  0.6667
35   r16  t10  t01  17  11   12   8  0.6667
36   r07  t01  t05  17  11   13   9  0.6923
37   r18  t05  t01  18  11   14  10  0.7143
38   r06  t01  t00  18  11   15  11  0.7333
39   r01  t00  t01  18  11   16  12  0.7500
40   r11  t01  t08  18  11   16  12  0.7500
41   r12  t08  t00  18  11   16  12  0.7500
42   r01  t00  t01  18  11   17  13  0.7647
43   r06  t01  t00  18  11   17  13  0.7647
44   r01  t00  t01  18  11   18  14  0.7778
45   r11  t01  t08  18  11   19  14  0.7368
46   r12  t08  t00  18  11   19  14  0.7368
47   r01  t00  t01  18  11   20  14  0.7000
48   r06  t01  t00  18  11   21  15  0.7143
49   r07  t01  t05  18  11   22  16  0.7273
50   r15  t05  t10  18  11   23  16  0.6957
51   r16  t10  t01  18  11   23  16  0.6957
52   r11  t01  t08  18  11   23  16  0.6957
53   r19  t08  t11  19  12   23  16  0.6957
54   r20  t11  t01  20  12   23  16  0.6957
55   r06  t01  t00  20  12   24  17  0.7083
56   r01  t00  t01  20  12   25  18  0.7200
57   r11  t01  t08  20  12   26  18  0.6923
58   r12  t08  t00  20  12   27  19  0.7037
59   r13  t00  t04  20  12   27  19  0.7037
60   r05  t04  t01  20  12   27  19  0.7037
61   r06  t01  t00  20  12   28  19  0.6786
62   r21  t00  t12  21  13   29  19  0.6552
63   r06  t01  t00  21  13   30  20  0.6667
64   r01  t00  t01  21  13   31  21  0.6774
65   r06  t01  t00  21  13   32  22  0.6875
66   r01  t00  t01  21  13   33  23  0.6970
67   r11  t01  t08  21  13   34  24  0.7059
68   r12  t08  t00  21  13   35  26  0.7429
69   r22  t00  t11  22  13   35  26  0.7429
70   r20  t11  t01  22  13   36  27  0.7500
71   r23  t01  t13  23  14   37  27  0.7297
72   r24  t13  t00  24  14   38  28  0.7368
73   r01  t00  t01  24  14   39  29  0.7436
74   r06  t01  t00  24  14   40  30  0.7500
75   r07  t01  t05  24  14   41  30  0.7317
76   r08  t05  t06  24  14   42  30  0.7143
77   r25  t06  t01  25  14   43  31  0.7209
78   r06  t01  t00  25  14   44  32  0.7273
79   r01  t00  t01  25  14   45  33  0.7333
80   r06  t01  t00  25  14   46  34  0.7391
81   r01  t00  t01  25  14   47  36  0.7660
82   r11  t01  t08  25  14   48  37  0.7708
83   r12  t08  t00  25  14   49  39  0.7959
84   r01  t00  t01  25  14   50  40  0.8000
85   r06  t01  t00  25  14   51  41  0.8039
86   r11  t01  t08  25  14   52  42  0.8077
87   r12  t08  t00  25  14   53  42  0.7925
88   r01  t00  t01  25  14   54  43  0.7963
89   r06  t01  t00  25  14   55  44  0.8000
90   r01  t00  t01  25  14   55  44  0.8000
91   r11  t01  t08  25  14   56  44  0.7857
92   r12  t08  t00  25  14   56  44  0.7857
93   r13  t00  t04  25  14   56  44  0.7857
94   r05  t04  t01  25  14   57  46  0.8070
95   r23  t01  t13  25  14   58  46  0.7931
96   r24  t13  t00  25  14   59  48  0.8136
97   r01  t00  t01  25  14   60  49  0.8167
98   r26  t01  t10  26  14   60  49  0.8167
99   r27  t10  t00  27  14   61  50  0.8197
100  r11  t01  t08  27  14   62  50  0.8065
101  r12  t08  t00  27  14   63  50  0.7937
102  r28  t00  t00  28  14   64  50  0.7813
103  r28  t00  t00  28  14   65  51  0.7846
104  r01  t00  t01  28  14   66  53  0.8030
105  r07  t01  t05  28  14   67  53  0.7910
106  r17  t05  t00  28  14   67  53  0.7910
107  r01  t00  t01  28  14   68  54  0.7941
108  r23  t01  t13  28  14   68  54  0.7941
109  r24  t13  t00  28  14   69  55  0.7971
110  r01  t00  t01  28  14   70  56  0.8000
111  r26  t01  t10  28  14   70  56  0.8000
112  r27  t10  t00  28  14   71  57  0.8028
113  r26  t01  t10  28  14   71  57  0.8028
114  r27  t10  t00  28  14   72  58  0.8056
115  r13  t00  t04  28  14   72  58  0.8056
116  r05  t04  t01  28  14   72  58  0.8056
117  r23  t01  t13  28  14   72  58  0.8056
118  r24  t13  t00  28  14   73  59  0.8082
119  r01  t00  t01  28  14   74  60  0.8108
120  r11  t01  t08  28  14   75  61  0.8133
121  r12  t08  t00  28  14   76  62  0.8158
122  r01  t00  t01  28  14   77  63  0.8182
123  r26  t01  t10  28  14   78  63  0.8077
124  r27  t10  t00  28  14   79  65  0.8228
125  r01  t00  t01  28  14   80  66  0.8250
126  r23  t01  t13  28  14   81  66  0.8148
127  r24  t13  t00  28  14   82  68  0.8293
128  r06  t01  t00  28  14   83  70  0.8434
129  r01  t00  t01  28  14   84  71  0.8452
130  r06  t01  t00  28  14   85  73  0.8588
131  r28  t00  t00  28  14   86  73  0.8488
132  r29  t01  t14  29  15   87  73  0.8391
133  r30  t15  t01  30  16   88  73  0.8295

Table C.1: Data for the context experiment. The data was gathered for the destination-prediction experiment of Chapter 4: Context-Dependent Information Retrieval. Data was added to the model in real time as trips were read from individual GPS logs. Each trip is logged with a unique identifier for its route, which is made up of two terminus locations, also logged with unique identifiers: origin ID for the starting point and dest ID for the destination. The table also shows the number of routes and termini at each step, as well as the number of predictions (guesses), the number correct, and the resulting accuracy score.

Appendix D

Ethics Approval for User Study
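The guesses, correct, and score columns of Table C.1 follow a simple running-accuracy bookkeeping: the score at each trip is cumulative correct divided by cumulative guesses. A minimal sketch of that bookkeeping is below; the predictor shown (guess the destination most often reached from the current origin so far) is only an illustrative stand-in, not the context model of Chapter 4:

```python
from collections import Counter, defaultdict

def replay(trips):
    """trips: iterable of (origin_id, dest_id) pairs; returns running scores."""
    seen = defaultdict(Counter)   # origin -> Counter of destinations reached
    guesses = correct = 0
    scores = []
    for origin, dest in trips:
        if seen[origin]:          # predict only once the origin has history
            guess = seen[origin].most_common(1)[0][0]
            guesses += 1
            correct += guess == dest
        scores.append(round(correct / guesses, 4) if guesses else 0.0)
        seen[origin][dest] += 1   # update the model with the observed trip
    return scores

# Trips between hypothetical termini, echoing the ID scheme of Table C.1.
print(replay([("t00", "t01"), ("t01", "t00"), ("t00", "t01"), ("t01", "t00")]))
# → [0.0, 0.0, 1.0, 1.0]
```

The first two trips score 0.0 because no history exists yet for either origin; once each origin has been seen, the frequency-based guess is correct and the running score rises.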

