2.4 - The extraction and analysis of Webzine logfile data

2.4.1 - Extraction of relevant data
2.4.2 - Limiting assumptions and conditions
2.4.3 - Analysis and interpretation of the data
2.4.4 - Hypotheses
Table 3: Table of selected Hip webserver logfile statistics
Figure 12: Graph of average total page hits per day on Hip's webserver
Figure 13: Graph of the number of sessions per day on Hip's webserver
Figure 14: Graph showing average session length: duration and number of pages
Figure 15: Graph of hits on the Webzine cover page and the first page of the Hip tour
Figure 16: Graph showing hits on the various Hip tour pages
Figure 17: Graph comparing hits on the Webzine cover and the two most popular columns

Web server logfile data provides publishing researchers with a unique opportunity to gain insights into the reading process. Web servers can "serve up" (that is, "make available to browsers") an unlimited number of pages of information, linked together in a variety of labyrinthine ways. Each page has its own address, or URL (for Uniform Resource Locator), which uniquely distinguishes it from all other pages on the World Wide Web. With the present state of the technology, a Web server automatically records the page name of each page requested by every reader accessing that server. Each transaction is time-stamped and identified with the network address of the requesting browser, and all transactions are kept in chronological order. Appendix C provides a sample extract from one of the Hip Web server's logfiles.
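To make the shape of a logfile transaction concrete, the following Python sketch pulls the three items mentioned above (network address, time stamp, page name) out of a single record. It assumes the NCSA "Common Log Format" that was widespread on Web servers of the period; the actual layout of Hip's logfiles is the one shown in Appendix C and may differ. The address and page name in the sample line are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: NCSA Common Log Format. This is an assumption for
# illustration; the real format of Hip's logfiles is shown in Appendix C.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>[^"\s]+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_hit(line):
    """Return (address, timestamp, page) for one logfile line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("when"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), when, m.group("page")

# Hypothetical sample record (address and page name are invented).
sample = ('192.0.2.7 - - [14/Feb/1995:10:15:32 -0800] '
          '"GET /hypothetical/cover.html HTTP/1.0" 200 4521')
hit = parse_hit(sample)
```

Lines that do not match the assumed layout yield None rather than raising an error, so a caller can count and skip damaged records.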

In this raw state, the logfile data provides only a rough sense of the volume of traffic passing through a given server. Simple scripts that perform basic logfile analysis are readily available from various sources on the Internet. (18) However, all of the existing scripts I have examined simply treat the logfile transactions as a collection of discrete page requests. They make no attempt to connect these discrete page requests into a "session," to produce a chronological record of each individual reader's interaction with the entirety of the server's published material.

Typically these existing scripts will summarize logfile transactions by grouping them:

by time of day
by page name
by the requester's network address

providing totals and percentage breakdowns for each category. While such analyses are of interest to a Web server administrator, they do not address the issues most of interest to publishing researchers, particularly those interested in the processes at work when readers interact with hypertext material. Investigating such issues would involve asking questions like the following: (19)

When readers are provided with alternate paths through pages of interconnected material, which paths are chosen most frequently?
How are these choices affected by such things as: overall page design and layout; physical location of links upon the page; the textual context of links?
How long do readers spend upon each page during a session?
How many pages do readers read during a session?
Do readers typically read sequentially presented material in sequence, or do they skip around?

One of my goals with this internship was to begin to explore these kinds of issues, using the Hip Webzine as a "real world" example of an Internet publication. At Hip I was fortunate to have access to archives of logfile data covering most of the time period in which the Hip Webzine was published. As mentioned at the start of Part 2 of this paper, the first issue of the Webzine appeared in September of 1994. As of this writing the most recent issue was Issue 9, which appeared during the third week of August, 1995.

Logfile records at either end of this period were not available for examination: records older than November of 1994 had not been retained, and a new version of the Web server software, installed at Hip in early June of 1995, resulted in invalid logfile data from that date onward. Apart from these two undocumented periods, there were only minor episodes of data loss, the result of Web server failures of one sort or another. As a result the data on hand represented a unique record of a webzine's history, showing both its growth and its decline.

Since Hip's Web server serviced all of Hip's Web clients, the logfiles would contain data about interactions between the pages of any two of these clients' sites, as well as activity within any single site (e.g. the Webzine) itself.

It is important to realize that, despite the richness of the data record, it is far from being a complete picture of reader (or publisher) activity. The various limitations inherent to the data, as well as some unavoidable assumptions that were made, are discussed at greater length below.

2.4.1 - Extraction of relevant data

The volume of data generated by the Web server was enormous. During the month of May, 1995, for example, approximately 4 megabytes of logfile data were being generated daily.

Because of this, and in an attempt to simplify the subsequent analysis, I decided to break down the data extraction process into discrete stages, writing a sequence of three Perl scripts to examine the logfile data, rather than a single, more complex script. The full text of the three scripts is included as Appendix B.

2.4.1.1 - The "sessions.pl" script

The first script in the sequence (referred to as the "sessions.pl" script) goes through a sequence of logfiles, filtering out records that are of no interest, and reducing the information to a compact form that is session (or page-sequence) based, rather than "hit" (or page-reference) based. This reduced information is stored in an intermediate file referred to as the ".ssn" - for "session" - file.

Each record in the ".ssn" file records a single session: one reader's interaction with the Web server's pages of linked material. Each ".ssn" record includes:

the network address of the reader
the time that the first page was requested
the duration of the session
the complete sequence of pages visited during the session in chronological order
the amount of time spent on each page in the sequence

An extract from a typical ".ssn" file is provided in Appendix D of this paper.
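The reduction from hits to sessions can be sketched as follows. This is an illustration in Python, not the Perl code of Appendix B. It assumes that hits arrive as (address, time in seconds, page) triples, that a session is closed after the five-minute gap described in section 2.4.2, and that the time spent on the last page of a session is unknown (recorded here as None). How "sessions.pl" itself treats that last page is not stated here, so that choice is an assumption. The addresses and page names are hypothetical.

```python
from collections import defaultdict

SESSION_GAP = 5 * 60  # seconds; the five-minute cut-off discussed in section 2.4.2

def _close(address, visits):
    """Turn one reader's run of (time, page) visits into a session record."""
    times = [t for t, _ in visits]
    # Time on a page = time until the next request; unknown for the last page.
    dwell = [b - a for a, b in zip(times, times[1:])] + [None]
    return {
        "address": address,
        "start": times[0],
        "duration": times[-1] - times[0],
        "pages": [p for _, p in visits],
        "dwell": dwell,
    }

def build_sessions(hits, gap=SESSION_GAP):
    """Group (address, time_in_seconds, page) hits into per-reader sessions."""
    by_address = defaultdict(list)
    for address, when, page in hits:
        by_address[address].append((when, page))

    sessions = []
    for address, visits in by_address.items():
        visits.sort()
        current = [visits[0]]
        for visit in visits[1:]:
            if visit[0] - current[-1][0] > gap:
                sessions.append(_close(address, current))
                current = []
            current.append(visit)
        sessions.append(_close(address, current))
    sessions.sort(key=lambda s: (s["start"], s["address"]))
    return sessions

# Hypothetical hits: one reader returns after a long pause, another visits once.
hits = [
    ("10.0.0.1", 0, "/cover.html"),
    ("10.0.0.2", 30, "/cover.html"),
    ("10.0.0.1", 40, "/tour1.html"),
    ("10.0.0.1", 100, "/tour2.html"),
    ("10.0.0.1", 1000, "/cover.html"),
]
sessions = build_sessions(hits)
```

Each resulting record carries the five items listed above: address, start time, duration, page sequence, and time per page.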

The "sessions.pl" script also produces a second intermediate file, referred to as the ".pgs" - for "pages" - file. Each record in the ".pgs" file contains information related to the usage of a specific page amongst all those made available on the Web server. Each ".pgs" record includes:

the page name
the number of times that page was requested during the period of time covered by the logfiles
the total time spent on that page in all sessions recorded during the period
the average time spent on that page in all sessions recorded during the period

An extract from a typical ".pgs" file is provided in Appendix D of this paper.
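The per-page totals can be illustrated with a short Python sketch (again, not the original Perl). It assumes sessions are available as lists of (page, seconds) pairs, with None marking a page whose viewing time is unknown, and it assumes that the average is taken only over visits with a known time. The text above does not say which denominator the ".pgs" file uses, so that is a guess. Page names are hypothetical.

```python
from collections import defaultdict

def page_stats(sessions):
    """Summarise page usage.

    sessions: lists of (page, seconds) pairs; seconds is None when unknown.
    Returns {page: (requests, total_time, average_time)}, where average_time
    is None if no visit to that page has a known duration.
    """
    requests = defaultdict(int)
    total = defaultdict(int)
    timed = defaultdict(int)  # number of visits with a known duration
    for session in sessions:
        for page, seconds in session:
            requests[page] += 1
            if seconds is not None:
                total[page] += seconds
                timed[page] += 1
    return {
        page: (requests[page], total[page],
               total[page] / timed[page] if timed[page] else None)
        for page in requests
    }

# Hypothetical sessions.
sessions = [
    [("/cover.html", 40), ("/tour1.html", 60), ("/tour2.html", None)],
    [("/cover.html", 20), ("/tour1.html", None)],
]
stats = page_stats(sessions)
```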

2.4.1.2 - The "transit.pl" script

The second script in the sequence (referred to as the "transit.pl" script) goes through all sessions recorded in a ".ssn" file, extracting information on the different inter-page transitions which occurred within each session. To define terms, an inter-page transition is simply a transition from one page (the "from page") to another page (the "to page") within a linked hypertext publication. Each reader's session would therefore consist of a series of such inter-page transitions, and would reflect the links chosen by each reader from amongst the many defined by a given publication.

As output from this script, two more intermediate files are produced: the ".trn" - for "transitions" - file, and the ".trs" - for "transitions sorted" - file. Both files contain exactly the same data, sorted in two different ways. In the ".trn" file the inter-page transitions are sorted by the "from page," or originating page, of a transition. In the ".trs" file the inter-page transitions are sorted by frequency of transition, from the most frequently used transition to the least frequently used.

Each record in the ".trn" file or the ".trs" file contains information on a specific inter-page transition:

the name of the "from page"
the name of the "to page"
the number of times that this inter-page transition occurred during the period of time covered by the logfiles
the total time spent on the "from page" before the "to page" was referenced, totaled over all uses of the transition during the period
the shortest recorded time spent on the "from page"
the longest recorded time spent on the "from page"
the average time spent on the "from page," averaged over all uses of the transition

An extract from a typical ".trs" file is provided in Appendix D of this paper.
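As an illustration of the transition records and their two sort orders, here is a minimal Python sketch. It is not the "transit.pl" code. It assumes sessions are lists of (page, seconds on page) pairs in visiting order; the page names and helper names are hypothetical.

```python
def transition_stats(sessions):
    """Collect statistics for each (from_page, to_page) pair.

    sessions: lists of (page, seconds_on_page) pairs in visiting order.
    The time on the "from page" is always known, because a later request
    (the "to page") marks when the reader left it.
    """
    stats = {}
    for session in sessions:
        for (from_page, seconds), (to_page, _) in zip(session, session[1:]):
            entry = stats.setdefault(
                (from_page, to_page),
                {"count": 0, "total": 0, "shortest": seconds, "longest": seconds},
            )
            entry["count"] += 1
            entry["total"] += seconds
            entry["shortest"] = min(entry["shortest"], seconds)
            entry["longest"] = max(entry["longest"], seconds)
    for entry in stats.values():
        entry["average"] = entry["total"] / entry["count"]
    return stats

def by_from_page(stats):
    """Ordering analogous to the ".trn" file: sorted by originating page."""
    return sorted(stats)

def by_frequency(stats):
    """Ordering analogous to the ".trs" file: most frequent transition first."""
    return sorted(stats, key=lambda pair: (-stats[pair]["count"], pair))

# Hypothetical sessions.
sessions = [
    [("/cover", 40), ("/tour1", 60), ("/tour2", None)],
    [("/cover", 20), ("/tour1", None)],
    [("/cover", 10), ("/sex1", None)],
]
stats = transition_stats(sessions)
```

Because both orderings are derived from the same dictionary, they necessarily contain the same data, as the two files do.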

2.4.1.3 - The "hipstats.pl" script

The third script in the sequence goes through all of the intermediate files to extract and report on data which is specific to sessions involving pages of the Hip Webzine. Note again that the Hip Web server maintains pages for a variety of Hip's clients, in addition to the Webzine's pages. Since the Webzine's pages represent only a small portion of the server's total, only a subset of the logfile data pertained directly to the Webzine.

Since the two earlier scripts in the sequence make no assumptions about which pages are of interest, the intermediate files will contain a complete record of all Web server activity. The purpose of the third script is to take a closer look at a specific segment of that data. This structure was intended to make it easy for Hip to "plug in" an alternate script as the third stage, to take a closer look at other segments of the server's data: on behalf of a paying client for example.

The nine reports produced by the "hipstats.pl" script are divided into three categories:

Page-based reports: these four reports provide detailed information on specific pages of the Hip Webzine, including the "hit" rate for each page of each Webzine column, the completion ratio for two-page articles, and simple bar charts showing the relative popularity of each page.
Session-based reports: these two reports provide bar charts showing the distribution of session time amongst all visits to the Webzine.
Transition-based reports: these three reports provide a list of the "top ten" inter-page transitions. They focus on ways in which Webzine readers enter and leave the Webzine (as far as can be determined), and point out the most popular paths within the Webzine itself.
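Two of the page-based measures can be sketched in a few lines of Python. The completion ratio is not defined formally in this section; the sketch assumes it is the number of requests for the second page of an article divided by the number of requests for its first page. The bar chart is a plain-text rendering scaled to the busiest page. The page names and the chart width are hypothetical, and none of this is taken from "hipstats.pl".

```python
def completion_ratio(page_hits, first_page, second_page):
    """Assumed definition: requests for page 2 divided by requests for page 1."""
    first = page_hits.get(first_page, 0)
    if first == 0:
        return None  # no one started the article, so the ratio is undefined
    return page_hits.get(second_page, 0) / first

def bar_chart(page_hits, width=40):
    """Return one text line per page, busiest first, with bars scaled to the peak."""
    peak = max(page_hits.values(), default=0)
    lines = []
    for page, hits in sorted(page_hits.items(), key=lambda kv: (-kv[1], kv[0])):
        bar = "#" * (round(width * hits / peak) if peak else 0)
        lines.append(f"{page:<24} {hits:>6} {bar}")
    return lines

# Hypothetical request counts for a cover page and a two-page column.
page_hits = {
    "/zine/cover.html": 200,
    "/zine/col1-p1.html": 80,
    "/zine/col1-p2.html": 20,
}
ratio = completion_ratio(page_hits, "/zine/col1-p1.html", "/zine/col1-p2.html")
chart = bar_chart(page_hits)
```

Under this assumed definition, a ratio well below 1 would indicate that most readers who began an article did not go on to its second page.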

2.4.2 - Limiting assumptions and conditions

Real-world settings rarely provide ideal research conditions, and this case study was no exception. In writing the suite of Perl scripts described above, and in interpreting the results, I was forced to make certain assumptions about the data, and to address a number of limitations inherent in the technology. I will list my assumptions and limitations here.

In the "sessions.pl" script a session is defined as a chronological sequence of Web page references coming from a single IP number (or network address). This definition assumes certain things about the reader which may not necessarily be true.
While many networking environments provide each workstation with its own network address, for security reasons some other environments hide all of their machines behind a single "firewall" machine. This firewall acts as a gateway for all network traffic into and out of that environment, with the side effect that network traffic originating from any of the machines behind the firewall appears to be coming from the same network address. The Web server has no way of distinguishing between individual machines behind the same firewall, with the result that simultaneous visits to the Webzine from two separate machines behind the same firewall will be intermingled, and treated by the "sessions.pl" script as one session. It is my (unsupported) feeling, though, that such intermingling of sessions would be rare, although it is clearly an issue which should be examined more closely.
There is an implicit assumption that the Web server's logfile is an accurate record of all pages from that server which have been viewed by each reader. This is not necessarily so. What the logfile actually records is all pages requested by an external source. For a specific reader interacting with a specific Web server, there are a number of situations where the server's logfile might not record the pages actually viewed by that reader.
As noted earlier in this paper, "page caching" can result in gaps in the Web server's logfile. Page caching can occur either locally, in the reader's browser software, or remotely, on a proxy Web server. In either case, a reader's request for a specific page can be satisfied from a cache of recently (or frequently) referenced pages. Such requests are satisfied (the requested page is displayed) without the server's intervention, resulting in a gap in the session record.
There is an unstated assumption that page references come about as the result of readers clicking on hypertext links. Based on this assumption, it is tempting to draw certain conclusions about the way in which readers interact with hypertext.
Again, however, it is important to note that the logfile only records the server's activity: extrapolating a reader's actions from this record is risky. On many Web servers there is nothing to prevent readers from "hunting" for pages by requesting specific directories by name. Such activity will be logged, but will not provide any insights into the reading of hypertexts.
Certain boundary problems arise in the "sessions.pl" script as it attempts to determine when a session begins and ends:
- Web server logfiles are switched automatically every day in Hip's case, or every week in other settings. This means that any sessions in progress at the time of the switch will be split, appearing as two separate sessions.
- since the logfile only records references to the local server's pages, there is no way of knowing where readers come from, or where they go to when they finish with that server's pages. From the point of view of understanding readers' behavior in the greater hypertext that is the World Wide Web, our "sessions" are more properly thought of as session fragments.
- sessions were deemed to have ended when there was no page reference recorded from a particular reader within a period of five minutes. Under normal circumstances a cut-off point of five minutes would be a strong indication that the reader has gone elsewhere. An examination of the Hip Webzine data seems to bear this out: as Figure 14 shows, the average session time was fairly constant at between four and five minutes, with between four and six pages being referenced during that time. (20) It would be useful to examine this cut-off point more closely though, to see how much other numbers change when values greater than five minutes are used.
Since many World Wide Web surfers use dynamically assigned IP numbers when browsing, we cannot assume that a specific IP number indicates a specific reader continually over multiple sessions. Note however that this does not affect the assumptions about single sessions, since a reader's IP number will remain constant throughout any one session.
However in any longer-term study, it would be important to be able to uniquely identify individual readers, so that their interaction with a given hypertext publication could be studied over an extended period of time.
A proper and extensive exploration of readership within a hypertext publication would need to consider the effects of changes in the publication's layout and structure. It is highly likely that altering such things as:
- the length of pages
- the size and position of graphics on the page
- the number, location, and wording of links
- the ways in which the pages of a publication are inter-connected
would have an effect upon the ways in which readers interact with the publication. Some evidence of this belief has been cited earlier in this paper, during the discussion of the restructuring of the Webzine "tour."

It is therefore not enough to simply know the names of the pages being referenced, and the sequence of those references (in other words the information routinely recorded in the logfile); we must also be aware of the page contents at the time the references occurred. Because of the intimate relationship between publication form, content, and reader interaction, in order to properly interpret Web server logfile data it is imperative to record all editorial changes to the publication, in parallel with the Web server's log of page requests.

2.4.3 - Analysis and interpretation of the data

Given the limitations and cautions described above, it is difficult (and risky) to draw any firm conclusions about hypertext readership behavior from the Hip Web server's logfile data. Despite the risks, however, it is my feeling that the suite of Perl scripts described above does explore new and important areas of reader behavior, even if that exploration is only a prototype for more extensive and more closely controlled research done at a later date. I would be happy to make copies of my scripts available to any interested parties, and would be very interested to pursue this area of research later, should the opportunity arise.

That having been said, the results of the analysis I did undertake, based on the data extracted as described above, are as follows. Table 3 contains a selection of the statistics extracted from archives of the Hip Web server's logfile over the period from November, 1994, to the end of May, 1995. Each column in that table contains the values for a specific measurement, or metric, chosen from the much wider assortment of all metrics gathered by the scripts. In choosing these metrics over others, I was looking for ways to investigate behavior specific to the Hip Webzine (from amongst all server activity). I was also looking for general trends, and patterns of behavior, which might suggest areas for further research.

In most cases the figures represent average daily values for the metric over the stated period. To illustrate with reference to Table 3: during the second week of February, 1995, Hip's Web server recorded an average of 2825 sessions per day. Those sessions averaged 4.56 minutes each, and an average of 4.59 pages were visited in each session. (21) Each day that week an average of 12039 requests were recorded, totaled over all pages on the server. Most of the remaining columns record the average number of daily requests for specific pages: the Webzine cover page (under any of its aliases); the first page of the Hip tour; pages 1 and 2 of two of the Webzine's most popular columns; and for comparison, the Focus catalog cover page. The remaining two columns give the completion ratio for the two Webzine columns.

The first two graphs (Figures 12 and 13) show the steady growth of traffic on Hip's Web server. Both the number of page hits and the number of sessions show a similar growth pattern over the period. When considered against the background of overall Web traffic growth, however, the trends on Hip's server are hardly surprising.

The graph in Figure 14 is more interesting, showing the average length of sessions for readers visiting Hip's server. While there is much anecdotal evidence of Web surfers' brief attention span, this graph provides the first numeric evidence of this behavior that I have seen. Session lengths declined slightly, but remained fairly constant throughout the entire period. One possible reason for the gradual decline in session lengths is the overall growth in traffic on the Hip server: as the demand for pages grows, the server would be less responsive to individual page requests, motivating people to browse elsewhere on the Web. Examining changes in the average intervals between page requests within a session might provide additional data: a less responsive server should result in longer intervals between pages, in addition to the possible effects on total session length.
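The interval measure suggested above could be computed along the following lines. This Python sketch assumes each session is available as a list of request times in seconds; the two sets of sessions are invented purely to show the comparison and are not drawn from Hip's logfiles.

```python
def mean_request_interval(sessions):
    """Average gap, in seconds, between consecutive page requests.

    sessions: lists of request times (seconds) in chronological order.
    Returns None when no session contains more than one request.
    """
    gaps = [b - a for times in sessions for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else None

# Invented data: sessions from an earlier and a later period.
early_sessions = [[0, 30, 70], [0, 20]]
late_sessions = [[0, 50, 120], [0, 60]]

early = mean_request_interval(early_sessions)
late = mean_request_interval(late_sessions)
```

A rise in this figure over time would be consistent with a less responsive server, although longer pages or slower reader connections could produce the same effect.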

It would be valuable to try varying those parameters in the "sessions.pl" script which define a session. For example, the criterion used for deciding when a session ends was "more than five minutes between page requests." If this value were increased to ten minutes or more, with no significant effect on the graph in Figure 14, then we could be more assured of the accuracy of the session length data.
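Such a sensitivity check might be sketched as follows in Python. The sketch assumes each reader's request times are available as a non-empty sorted list of seconds, keyed by network address. The addresses, times, and cut-off values tried are all hypothetical.

```python
def session_lengths(times, gap):
    """Split one reader's sorted, non-empty request times into sessions.

    A new session starts whenever more than `gap` seconds pass between
    consecutive requests. Returns the duration of each session in seconds.
    """
    durations = []
    start = previous = times[0]
    for t in times[1:]:
        if t - previous > gap:
            durations.append(previous - start)
            start = t
        previous = t
    durations.append(previous - start)
    return durations

def average_session_length(readers, gap):
    """Mean session duration over all readers for a given cut-off."""
    durations = [d for times in readers.values()
                 for d in session_lengths(times, gap)]
    return sum(durations) / len(durations)

# Hypothetical request times (seconds) per network address.
readers = {
    "10.0.0.1": [0, 60, 120, 540, 600],
    "10.0.0.2": [0, 100, 200],
}

# Average session length for cut-offs of 5, 10, and 15 minutes.
results = {minutes: average_session_length(readers, minutes * 60)
           for minutes in (5, 10, 15)}
```

In this invented data the average jumps when the cut-off moves from five to ten minutes and then stays flat, which is the kind of pattern that would indicate where the chosen value stops influencing the result.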

Figures 15, 16, and 17 are intended to show the distribution of activity within the pages of the Hip Webzine. Only the most frequently referenced pages are noted: the Webzine cover page (under any of its aliases); the first pages of the "Speaking of Sex" and "Dear Dick" columns; and a selection of the "departmental tour" pages.

The cover pages were referenced much more frequently than any other Webzine page, for the simple reason that the Webzine's structure directed readers back to the cover from every subsidiary page (Figure 4). This is good Web page design at work: always provide readers with a way "back" to some well-known reference point, to help them get their bearings if they become disoriented.

The other feature worth noting with these graphs is the rise and fall of the Webzine's readership. Unfortunately the period covered by the logfiles ends in May, but this same trend can be seen in Figure 10, showing Webzine mailing list activity continuing to decline through July.

2.4.4 - Hypotheses

Based on the data and the analysis above, what hypotheses can be put forward about readership patterns within the Hip Webzine?

Readers came to the Webzine in one of two ways: either by explicitly typing the Webzine's URL into their browser, or by clicking on an existing link from some other Web site. Readers may have read about the Webzine in another media outlet (and the Webzine did get reviewed on one or two occasions), or in typical Web browsing fashion, they may have clicked on a link to see where it would lead. In either case curiosity was likely the motivating force.

Once at the Webzine's cover, a typical new reader would take the Webzine tour, and get a sense of the publication's style. Those who were interested enough would register their EMail addresses with the Hip mailing list, and then proceed back to the cover page to examine the current issue's table of contents. Not surprisingly, the "Sex & Scandal" section attracted the most interest from readers; such material is usually successful in drawing in the curious. Readers would then examine an article or two - rarely to completion, even though the articles were just two pages long - before "surfing" off to some other Web site. The entire visit would have lasted approximately five minutes, and only four or five pages of the Webzine's total of thirty to forty pages would have been viewed. Based on this behaviour, Webzine visitors could more accurately be termed "browsers" than "readers."

One can assume that return visits to the Webzine were rarely spontaneous, since interested readers were likely to have added their names to the mailing list, and would await a message from Jane announcing the appearance of a new issue. There is some evidence in the logfile that messages from Jane were followed by a slight peak in Webzine visits. Although logfile data does not allow us to distinguish between return visits and first time visits, I would expect to find some differences in readership behaviour between first-time and return visits.

Return visitors would do less "exploring," since they would already have a sense of the Webzine's style and structure. This behaviour would be related to the publication schedule: weekly issues would result in readers who were more familiar with the publication; monthly (or less frequent) issues would result in return visits that more closely resembled first-time ones. Return visits should show fewer visits to the Webzine tour pages, and fewer visits to columns already found to be of little interest. With a regular publication schedule I would expect to find readers' return visits shorter in terms of the number of pages visited, and more "linear" in the order that those pages were read.

Footnote 18:
For example, the package known as "wwwstat" is widely used by Web server administrators. It is available at: http://www.ics.uci.edu/WebSoft/wwwstat/

Footnote 19:
Appendix E contains a report written for Dick Hardt, providing a detailed rationale for the gathering of various metrics by Web servers.

Footnote 20:
Note, though, that there is a degree of circularity in this reasoning, since the session length numbers are derived from sessions extracted using the five-minute cutoff value.

Footnote 21:
Note again that, as defined here, sessions need not consist of Webzine pages alone. In fact, because of the cross-linking of material on Hip's Web server, Webzine readers frequently visited other, non-Webzine material on Hip's server, and vice versa.