Appendix E - A Rationale for WWW Server Metrics Gathering

Prepared for Dick Hardt, GITCO, Hip Communications Inc.
by Michael Hayward, M. Pub. candidate, Simon Fraser University

In this writeup, I have attempted to organize my thoughts regarding the gathering of certain WWW server statistics. The report is based on an examination of the logfiles of the Hip Webzine, collected between October 1994 and May 1995. Many of the conclusions are derived from the output of a suite of Perl scripts which I wrote to extract session information and generate reports from these logfiles.

This report suggests a number of metrics to be gathered by the WWW server. These suggestions are "blue sky" in the sense that no attempt has been made to determine the difficulty of gathering each metric - I have left that to the programming experts. For each metric a brief rationale is given, explaining where the metric would be useful to Hip or its clients.

Assumptions
Recommendations
Suggested metrics to be gathered by Hip's WWW server

Assumptions

The rationales below are based on the following assumptions:

Hip will want to monitor the overall efficiency of their WWW server for internal purposes, in order to determine server reliability, variations in load, and the distribution of load amongst clients. This will help Hip to set levels for client charging, to identify problem areas, and to time hardware (and software) upgrades.
Since Hip will be providing a "site" to a variety of clients, these clients will want regular feedback on the activity on their site. This is good client relations. A high level of service can help set Hip above its competitors: it can make or break a small startup company in what is now a highly competitive arena.
Both Hip and their clients will regularly make changes to their site configurations (from changes in overall site structure, to changes in individual page layout), and they will want to be able to monitor the effects of these changes on site activity.

Recommendations

As a general recommendation, it is wise to gather as much information as possible - possibly even beyond the metrics suggested here. Numbers are (generally) small in terms of storage space requirements, and storage is cheap. It is easier to gather more data than you initially think you need than it is to try to retrieve such information retroactively (although obviously some restraint is needed!). You never know what a client might ask for in the way of information reports on their site.
Hip may want to provide a number of "value added" services to their clients in the form of a choice among several packages of statistical reports.
As Hip gets more expert in the area of Customer Support, they can expand their "site setup" services, offering expert site design: the WWW site is a selling mechanism for most clients. If Hip can provide basic advice on how to improve sales, clients will be impressed.
It is important to plan a site as a whole: catch the customer's interest and hold them, lead them through a prepared sales pitch, pace the pitch, make the "sell" at the right point, make it easy for them to commit. Statistics can be used to back up this advice.
It is equally important to compare the actual reported statistics with the expected ones (which would be based upon the planned pitch and the customer's anticipated path through the material). If the actual behavior is not as expected, ask "Why?" and adjust the site structure to try to achieve the desired flow.
Hip should consider flagging "significant" changes in any of the metrics being monitored, either for the client's use, or for Hip's internal use (for example, flagging any pages whose hit rate changes by more than 10% from one report to the next). If there is a significant change, again ask "Why?"
The WWW server should automatically log (and clients should receive a report of) any changes to the site configurations: new pages added, pages deleted or moved, new versions of pages installed etc. This is very important when trying to determine the cause of significant changes in metrics.
If extensive software development is required in putting together a statistics gathering package for Hip's WWW server, this software itself might prove to be a viable product for Hip to market to other Internet Service Providers.
If the Webzine continues (and if it can be made to flourish), its readership may be a viable market which Hip could "sell" to advertisers. To this end, Hip will need to be able to back up any claims it makes about the extent and demographics of this readership, by providing statistics to potential advertisers or ad placement agencies.
If the Webzine continues, the "corporate" Hip may want to distance itself from the "fun" Hip, by renaming the Webzine to something that isn't so closely linked to the company as a whole.
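
The recommendation above on flagging "significant" changes can be illustrated with a minimal sketch in Python. The function name, the sample figures and the treatment of new pages are assumptions made for the example; only the 10% threshold comes from the recommendation itself.

```python
def flag_significant_changes(previous, current, threshold=0.10):
    """Return {page: relative change} for pages whose hit count moved by
    more than `threshold` (a fraction) between two reports.

    Pages with no hits in the previous report have no meaningful relative
    change, so they are flagged with None (an assumption of this sketch).
    """
    flagged = {}
    for page in sorted(set(previous) | set(current)):
        before = previous.get(page, 0)
        after = current.get(page, 0)
        if before == 0:
            if after > 0:
                flagged[page] = None  # new page, or first recorded hits
            continue
        change = (after - before) / before
        if abs(change) > threshold:
            flagged[page] = change
    return flagged


# Hypothetical hit counts from two consecutive reports.
previous_report = {"/index.html": 200, "/order.html": 50, "/about.html": 40}
current_report = {"/index.html": 210, "/order.html": 30, "/new.html": 12}

for page, change in flag_significant_changes(previous_report, current_report).items():
    if change is None:
        print(f"{page}: new since the last report")
    else:
        print(f"{page}: {change:+.0%}")
```

A page that disappears from the current report shows up as a 100% drop, which is one reason for reporting site configuration changes alongside the metrics.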

Suggested metrics to be gathered by Hip's WWW server

Page-based metrics

Hit count for each page
To determine the most active pages, and the least active pages on a site. What material is being visited; what material is being overlooked?
Time spent on each page
Are people reading the material, or are they simply "passing through"? If readers are spending a long time on certain pages, is it because of high data volume (JPGs, MPGs etc.)? Is the page copy possibly confusing to readers? Or are they genuinely interested?
Link used to enter each page
How did people get here? If there are several routes, which one gets the most traffic? Since Hip and the client have planned the site as a whole (see the recommendation above on planning a site as a whole), this metric and the one below help them to determine whether the client's customers are behaving as intended. If not, find out why. Site redesign might be appropriate.
Link used to leave each page
What on the page caught the reader's eye: the link's text itself? The placement of the link on the page/screen?
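
As a rough illustration of the first two page-based metrics, the following Python sketch counts hits per page and estimates the time spent on each page. It assumes the logfile has already been parsed into (host, time, page) records; the record layout, the sample data and the 30-minute cutoff are all assumptions of the example. Time on a page is estimated as the gap before the same host's next request, so the last page of a visit yields no estimate.

```python
from collections import Counter, defaultdict

# Hypothetical pre-parsed log records: (host, seconds since start of log, page).
RECORDS = [
    ("a.example.com", 0, "/index.html"),
    ("a.example.com", 40, "/products.html"),
    ("b.example.com", 50, "/index.html"),
    ("b.example.com", 65, "/products.html"),
    ("a.example.com", 100, "/order.html"),
]

MAX_GAP = 30 * 60  # assumption: a longer gap means the reader had left


def hit_counts(records):
    """Number of requests for each page."""
    return Counter(page for _host, _when, page in records)


def average_time_on_page(records, max_gap=MAX_GAP):
    """Average seconds between a request and the same host's next request."""
    by_host = defaultdict(list)
    for host, when, page in records:
        by_host[host].append((when, page))

    totals = defaultdict(int)
    samples = Counter()
    for requests in by_host.values():
        requests.sort()  # order each host's requests by time
        for (start, page), (next_start, _next_page) in zip(requests, requests[1:]):
            gap = next_start - start
            if gap <= max_gap:
                totals[page] += gap
                samples[page] += 1
    return {page: totals[page] / samples[page] for page in totals}


print(hit_counts(RECORDS).most_common())
print(average_time_on_page(RECORDS))
```

Identifying a reader by host alone is itself an approximation, since several readers may share one host.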

Session-based metrics

Length of each session (in pages, in time)
How long are the customers being held before their attention flags? Average behavior helps to determine where to place the pitch during site design. The longer potential customers can be kept "on site", the greater the chances of making the sale.
Start page for each session
Which pages get read first? (Position important material "earlier" in the session, when reader attention is higher.) Which pages rarely get read? (Consider simplifying the site by eliminating low-use pages.)
Ending page for each session
Why did the reader leave from here? Is it simply the end of the normal attention span (based on average session data), or did they leave earlier (in the session) than average?
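
The session-based metrics above all depend on first grouping individual requests into sessions. The Python sketch below shows one possible way of doing this; it is not the method used by the Perl scripts mentioned earlier. It assumes (host, time, page) records and treats a silence of more than 30 minutes from a host as the end of a session; both the record layout and the timeout are assumptions of the example.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumption: 30 minutes of silence ends a session


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (host, seconds, page) records into per-host sessions."""
    by_host = defaultdict(list)
    for host, when, page in records:
        by_host[host].append((when, page))

    sessions = []
    for host, requests in by_host.items():
        requests.sort()  # order each host's requests by time
        current = [requests[0]]
        for request in requests[1:]:
            if request[0] - current[-1][0] > timeout:
                sessions.append((host, current))  # gap too long: close the session
                current = []
            current.append(request)
        sessions.append((host, current))
    return sessions


def summarize(session):
    """Length (in pages and in time), start page and ending page of a session."""
    host, requests = session
    return {
        "host": host,
        "pages": len(requests),
        "seconds": requests[-1][0] - requests[0][0],
        "start_page": requests[0][1],
        "end_page": requests[-1][1],
    }


# Hypothetical records: one host making two visits, well over 30 minutes apart.
RECORDS = [
    ("a.example.com", 0, "/index.html"),
    ("a.example.com", 60, "/products.html"),
    ("a.example.com", 5000, "/index.html"),
    ("a.example.com", 5030, "/order.html"),
]

for session in build_sessions(RECORDS):
    print(summarize(session))
```

The session time measured this way excludes whatever time the reader spent on the final page, since the server sees no further request.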

Link-based metrics

Usage counts for each link (a link consists of a "from" page and "to" page)
Which links are most used? Which ones are least used? When designing a site the goals should include: minimizing the number of links "off site" (to hold the readers longer); simplifying the overall site structure; providing "roadsigns" back to a site's important pages (the home page, a product-ordering page).
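
Link usage counts can be sketched in Python as a tally of ("from" page, "to" page) pairs. This assumes the server is able to record, for each request, the page the reader came from, which may not be available for every request; the sample data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical link traversals: (page the reader came from, page requested).
TRAVERSALS = [
    ("/index.html", "/products.html"),
    ("/index.html", "/products.html"),
    ("/index.html", "/about.html"),
    ("/products.html", "/order.html"),
    ("/products.html", "/order.html"),
    ("/products.html", "/order.html"),
]


def link_usage(traversals):
    """Usage count for each (from_page, to_page) link."""
    return Counter(traversals)


def ranked_links(usage):
    """Links ordered from most used to least used."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)


usage = link_usage(TRAVERSALS)
for (from_page, to_page), count in ranked_links(usage):
    print(f"{from_page} -> {to_page}: {count}")
```

Links that exist on a page but never appear in the tally are the least used of all, so the tally is best read alongside a list of the links actually present on the site.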

User-based metrics

Automatically gathered user information
What WWW browser is being used? Can an e-mail address be determined? Can some metric on bandwidth or throughput rate to the user be gathered? (The latter can be used to monitor Hip's link through its service provider.)
Standard demographic information
If Hip considers offering some kind of "individual user" access to a client's site at some point (i.e. username and password to access a site), then they should consider trying to gather some standard demographic information for each user: age, interests, income etc. Most commercial clients already make extensive use of this kind of information when planning marketing efforts for other media (print, radio, TV).
Frequency of visits and length of visits for each identifiable user
How often does a user revisit the site? How long do they stay? When combined with demographic information, a WWW server can offer highly accurate profiles of certain demographic groups to potential clients.
Pages visited by repeat visitors
Repeat visitors are different from first-time visitors. Do they come back in order to buy more product? Can their path through to another sale be simplified?
Geographical (or less usefully, subdomain-based) location of each user
Where are the readers coming from? If the client's intended readership is geographically localized (VanEcho), does their customer readership profile match? For clients attempting to market Internet-wide: where are they successful? Where are they weak? They might consider supplementing their WWW marketing efforts with other media (such as print) in weak markets.
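
The (less useful) domain-based approach to location can be sketched in Python by grouping requests on the last label of each host name, for example "ca" or "com". The host names below are hypothetical, and hosts logged only as numeric addresses are counted separately as "unresolved", which is an assumption of the sketch. A domain is only a rough guide to where a reader actually is.

```python
from collections import Counter


def domain_of(host):
    """Last label of a host name, or 'unresolved' for a numeric address."""
    labels = host.lower().rstrip(".").split(".")
    if all(label.isdigit() for label in labels):
        return "unresolved"  # logged as a numeric address only
    return labels[-1]


def hits_by_domain(hosts):
    """Count requests per top-level domain."""
    return Counter(domain_of(host) for host in hosts)


# Hypothetical hosts taken from the host field of a logfile.
HOSTS = ["fraser.example.ca", "www.example.com", "192.0.2.17", "cs.example.ca"]

for domain, count in hits_by_domain(HOSTS).most_common():
    print(f"{domain}: {count}")
```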