Chapter 4 Frequent users

4.1 Introduction

This chapter summarises some preliminary results from interviews with, and analysis of the transaction logs of, 21 people who were frequent users of the Okapi systems illustrated in Appendix A. The Okapi logs are described in Appendix B. Twenty-one users (15 network and six library users) were identified as having used the system frequently. Frequent users were defined as those with at least five sessions on at least four different days. The library users (those using Okapi with a library card number via the terminal available in the library, who were only able to access the City University Library database) are referred to as LIB-A, LIB-B etc., and the registered network users as NET-A, NET-B and so on.

4.2 Obtaining the data

There is one user log for each online session. The logs are chronological records of what happened during the session. They include the queries, query modifications, the references that the user saw, whether a reference was judged relevant by the user, the terms added by the system to expand the query if the user requested this, some timing information and some summaries of command usage during the session.

The following types of information were gathered:

1. Information about the patterns and context of user searches.
2. Some background information about the users.
3. Information about the terms used by individual users.
4. Information about the variety and frequency of terms taken over all the users.

Information about the patterns and context of the user sessions and user behaviour was obtained by looking at frequent users' logs in depth.

4.2.1 Background information about users

A limited amount of background information was available about those users who registered to use the system over the network. This consisted of their name, department, status (undergraduate, graduate, staff) and email address.
For those who used the system only from the library, this information was not available, as there was no registration of users. Interviews were, however, conducted with the 15 frequent network users to check the log analysis and to elicit information about their attitude to the system. The data collection instrument is shown in Appendix C.2. The interviews were about 20 minutes long. The questions came under four sections:

1. context
2. Okapi in general
3. search behaviour
4. conclusion.

The first section consisted of questions about users' field of work, what stage they were at in their work and whether they regularly exchanged information with colleagues.

The second section related to users' perception of the Okapi system: whether they found it easy to use, which databases they used and which they found most useful, and whether they felt they had had to adapt or change their search strategy over time.

The third section consisted of questions about the way users performed their searches. They were asked how they chose records from the brief record lists, how they decided whether a reference was relevant, whether they felt they had to look through a lot of references to find useful ones, whether they noticed any points indicating different levels of relevance (for example, the point where the system displays "The rest of the books may not match your search very well" (Figure A.8)), and whether they had used the "MORE" (query expansion) command and, if so, whether they found it useful.

The final questions asked whether they had performed any online searches on other systems or any manual literature searches, and whether there were any particular features that they liked or disliked about the Okapi system.

4.2.2 Individuals' use of language

Queries are parsed and the remaining words are then stemmed (for example "translating" and "translate" become "translat").
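The parsing and stemming step can be illustrated with a minimal sketch. It is not the stemmer used by Okapi: the suffix list, the stop-word list and the function names are all hypothetical, chosen only so that the example in the text ("translating" and "translate" both becoming "translat") can be reproduced.

```python
# Toy illustration only: NOT the stemmer used by Okapi.
# Hypothetical suffix list, tried in the order given (longest first).
SUFFIXES = ("ing", "ed", "es", "e", "s")

# Hypothetical stop words dropped when a query is parsed.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_terms(query):
    """Drop stop words from a query and stem the remaining words."""
    return [toy_stem(w) for w in query.split() if w.lower() not in STOP_WORDS]


print(stemmed_terms("the translating of robots"))
```

A production stemmer handles many more cases (doubled consonants, irregular plurals and so on); the point here is only that different surface forms are reduced to one lookup term.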
It is these stemmed terms which are then looked up. Programs were written to produce listings of the (stemmed) terms used by any particular user, along with the total number of times each term was used and the weights of the terms. The weight of a term depends on how frequently it occurs in the database used at the time: the more frequently a term occurs, the lower its weight, and vice versa (see equation 2.1). For example, the user NET-D used the term "robot" six times but "parallel" only once.

4.2.3 Search language in general

Further programs were written to analyse the usage of terms over all users. These programs list all the terms used by all the frequent users, with the total and per-user frequency for each term, together with the weight the system assigned to the term on each usage. For example, the term "vision" was used 47 times (over the network) by eight different users. Amongst these searches the term took on eight different weights.

4.3 Analysing the data

The logs of each frequent user were analysed and notes made on

• the general subject areas that were seen to be queried by the user
• some overall impressions about the sessions
• more specific details and examples.

In addition to these, information was gathered about the number of logs, their size and the time period covered. A transcription of the notes made on one user (NET-D) is given in Appendix D.

Generally, the interviews confirmed the major points about users and their search behaviour which had been gathered from examination of their logs. It was difficult to get more information on the more specific points, as it would have meant reminding users of searches they might have performed several months earlier.

All users had been informed when registering that they were using an experimental system which aims to provide a real service while learning about real users. To this end,
records of their searches would be kept and they might be asked to answer questions about their use of the system. Despite this warning, it was found that some users were slightly disturbed when shown an example log. Hence, some discretion was exercised during the interviews.

The frequency listings for the terms used by the frequent users are consistent with the general impressions about the user subject areas mentioned above. It may be useful to write further programs to produce term co-occurrence statistics or diagrams, but this has not been done at the time of writing.

It seems that some general terms ("computer", for example) are used frequently by a considerable proportion of the network users, whereas other terms may be more specific to particular groups of users. Currently, the listings do not provide information about how the frequencies of the terms are distributed between the users. It would be desirable to try to cluster users with respect to their term usage.

4.4 Results

Having analysed the logs for the 21 frequent users and interviewed 15 of them, some general issues and some more specific impressions can be identified. These points are summarised here. Some of the points relate to machine learning about individuals and groups of users, while others may simply point to possible system improvements in Okapi.

4.4.1 General points about frequent users

It would appear that on average users do not use the system for more than three subject areas, and usually only one or two.

When it is possible to identify queries that a user performs frequently, this information could be used after a database update has taken place.
For example, the queries could be kept in an optional list for subsequent retrieval and re-running. This could be done automatically and the result mailed to the user (the system might then need some contextual information, or some information about which records had previously been judged relevant by the user), or the user could be reminded about these frequent queries upon logging on.

4.4.2 Suggestions for system modifications

A considerable number of records are looked at in brief (see Tables 3.10 and 3.11). It might be worth analysing how the number of records found affects the proportion of those records that are looked at in brief.

When sessions are long, users sometimes lose their place and repeat some or all of the search. Perhaps some kind of optional summary information could be made available. This could include details of how many iterations of query expansion have been done and a listing of the results of the last few searches (in the present system the latter information is only given when the user chooses the "New search" or "Edit search" options).

Particularly in the case of INSPEC, the truncated titles displayed in brief one-line records are often misleading, and may lead to relevant records not being displayed in full. Users may expect some of their search terms to appear in the part of the title which is displayed. Several of the network users interviewed favoured showing two-line brief records.

If an abbreviation (e.g. "AI") is used in a query and the initial letters of the words in a subsequent or previous query match it exactly, then these words could be shown to the user to verify what they mean by the abbreviation.
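The abbreviation check suggested above can be sketched as follows. This is an illustration under stated assumptions, not a description of anything implemented in Okapi: the function names are hypothetical, queries are assumed to be available as plain strings, and the sketch also accepts a run of consecutive words inside a longer query, which goes slightly beyond the whole-query match described in the text.

```python
def initials(words):
    """Upper-cased initial letters of a sequence of words."""
    return "".join(w[0] for w in words if w).upper()


def expansions_for(abbreviation, other_queries):
    """Return phrases from other queries whose initials spell the abbreviation.

    Assumption: a candidate is any run of consecutive words with the same
    number of words as the abbreviation has letters.
    """
    target = abbreviation.upper()
    n = len(target)
    if n == 0:
        return []
    matches = []
    for query in other_queries:
        words = query.split()
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            phrase = " ".join(window)
            if initials(window) == target and phrase not in matches:
                matches.append(phrase)
    return matches


earlier_queries = ["artificial intelligence", "robot vision"]
print(expansions_for("AI", earlier_queries))
```

The candidates returned would be shown to the user for confirmation rather than substituted silently, since initials alone cannot establish what the user meant.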
It would appear that, especially in INSPEC, the relevance level points ("The rest of the references may match your search less well", "The rest of the references may not match your search very well") often do not appear until several screens have been displayed, in some cases about 25 screens down. Within this list it is likely (especially if the query consists of only one or two general terms) that there are long runs of records which all have the same weight. Hence, for the first several screens the user sees nothing but what seems like an alphabetical (by author) list of references. As a result, users sometimes log out in frustration before doing much, miss references which they may pick up in future searches, lose confidence in the relevance judgements of the system, or simply spend a lot of time looking through seemingly unordered references. Almost all of the frequent users analysed had at least one such occasion amongst their searches. Section 5.4.2 outlines a possible way to produce a more fine-grained sequencing of the retrieved records, using knowledge gained by the system about the search language of individual users.
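The long runs of equally weighted records described above could be measured directly from a ranked result list. The following is a minimal sketch, assuming the weights of the retrieved records are available as a sequence in display order; that representation and the function names are assumptions, not a description of the Okapi log format.

```python
from itertools import groupby


def tied_runs(weights):
    """Return (weight, run_length) for each run of equal weights, in order."""
    return [(weight, len(list(group))) for weight, group in groupby(weights)]


def longest_tied_run(weights):
    """Length of the longest run of records sharing a single weight."""
    runs = tied_runs(weights)
    return max((length for _, length in runs), default=0)


# Hypothetical weights for a ranked list of six records.
ranked_weights = [12.0, 12.0, 12.0, 9.5, 9.5, 4.0]
print(tied_runs(ranked_weights))
print(longest_tied_run(ranked_weights))
```

Dividing a run length by the number of records shown per screen would give a rough indication of how many screens a user must page through before the ordering becomes informative.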