About the plot

This appeared in the June 1996 issue of Consumer Reports. The article reviews and rates 69 beers. Each beer has several characteristics recorded, including measures of how "malty" and "bitter" it is. The blobs on the plot represent six categories of beer (light, regular, non-alcoholic, craft lagers, imported lagers, and craft ales). Each blob is actually an envelope that was drawn around the individual beers in the corresponding category.

This is an example of a classification problem, in which we want to predict a categorical variable (in this case, type of beer) using other information (in this case, maltiness and bitterness). It's a fairly easy problem, at least for some of the classes: the plot shows good separation between them (for example, the craft ales, lagers, and regular beers are separated). Using additional information (another variable in the article is the percent alcohol, which is a pretty good predictor for the non-alcoholic beers), we can get even better classification.
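To make the idea concrete, here is a minimal sketch of one very simple classifier, the nearest-centroid rule: average the maltiness and bitterness of the known beers in each class, then assign a new beer to the class whose average is closest. The scores below are invented for illustration and are not the Consumer Reports data; the names `training`, `centroid`, and `classify` are likewise hypothetical, and this is only one of many possible methods.

```python
from math import dist

# Hypothetical (maltiness, bitterness) scores, invented for illustration.
# These are NOT the values from the Consumer Reports article.
training = {
    "light": [(10.0, 8.0), (12.0, 10.0), (14.0, 9.0)],
    "regular": [(30.0, 25.0), (32.0, 28.0), (28.0, 22.0)],
    "craft ale": [(60.0, 70.0), (65.0, 75.0), (58.0, 68.0)],
}


def centroid(points):
    """Return the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


# One "typical" point per class, computed from the labeled beers.
centroids = {label: centroid(pts) for label, pts in training.items()}


def classify(beer):
    """Predict the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(beer, centroids[label]))


print(classify((11.0, 9.0)))
print(classify((62.0, 72.0)))
```

A rule like this works well only when the classes are well separated, as some of them are in the plot; overlapping classes would need more variables or a more flexible method.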

This could also be an example of a clustering problem, if we didn't know the classes of beers in advance. That is, if we just had a lot of beers, and there weren't any classes, we might notice in the plot that there seem to be three "clumps" or clusters.
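As a rough sketch of how clusters might be found without any class labels, the code below runs a basic k-means procedure on unlabeled (maltiness, bitterness) pairs. The data are invented for illustration, not taken from the article; the choice of k-means, the farthest-first starting rule, and the names `kmeans`, `mean`, and `beers` are all assumptions made for this example.

```python
from math import dist


def mean(points):
    """Return the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, iterations=20):
    """Basic k-means: returns (centers, groups) for k clusters."""
    # Deterministic farthest-first start (an assumption of this sketch;
    # random starting centers are also common).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))

    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers, groups


# Hypothetical unlabeled beers, invented for illustration.
beers = [
    (10.0, 8.0), (12.0, 10.0), (14.0, 9.0),
    (30.0, 25.0), (32.0, 28.0), (28.0, 22.0),
    (60.0, 70.0), (65.0, 75.0), (58.0, 68.0),
]

centers, groups = kmeans(beers, k=3)
for center, group in zip(centers, groups):
    print(center, group)
```

Note that the number of clusters (here three, matching the "clumps" seen in the plot) has to be supplied by the user; the procedure does not discover it.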

This is data mining in the sense that we want a flexible way to predict beer type. One characteristic of data mining problems not exhibited in this example is a large volume of data: there are only 69 beers, and around a dozen variables recorded on each beer. In the course we will look at larger problems as well, and explore how algorithms scale to them.

Last modified on March 7, 2001.