In a blog post (since deleted), I used Tableau to create a visualization of total taxes collected versus population for municipalities in British Columbia (BC) and Alberta (AB). This plot has been recreated below in R. Note that both axes are logged to make the graph more readable:
#Replication of Tableau visualization
#Assume library(tidyverse) has been run and data loaded into a tibble called "dat"
ggplot(dat, mapping=aes(x=log10(Population), y=log10(Taxation), color=JurisdictionCode)) +
  geom_point(shape = 1, size=3) +
  scale_shape(solid=FALSE) +
  ylab("Total Tax (log)") +
  xlab("Population (log)") +
  geom_smooth(method=lm, se=FALSE)
The best-fit lines suggest that, on average, AB municipalities have higher per capita taxes than BC municipalities. This is certainly an interesting result. But, as is always the case, a two-dimensional visualization is a good place to generate hypotheses; to test those hypotheses, we need to use statistical tools.
In the interest of replicability, I include the full R code here. Readers with no interest in R can skip the code and focus on the results. The data is available by request.
I start by loading the necessary libraries and performing some basic filtering and recoding of the data. Specifically, I drop all municipalities with a population of 1,000 or less and all municipalities reporting zero taxes (missing data). Then, I log both population and taxes to recreate the relatively clean linear relationship shown in the visualization. Next, I transform the BC/AB field (JurisdictionCode) into a factor variable, which simplifies subsequent analysis. Finally, I use the summary function to provide descriptive statistics for all the variables in the dataframe.
#Preliminaries (libraries and data preparation)
library(tidyverse)
library(QuantPsyc)
library(broom)
dat <- dat %>% filter(Taxation > 0 & Population > 1000)
dat <- dat %>% mutate(JurisdictionCode = as_factor(JurisdictionCode),
                      PerCapitaTax = Taxation/Population,
                      logTaxation = log(Taxation),
                      logPopulation = log(Population),
                      logLandArea = log(LandArea))
summary(dat)
## OrgID OrgName OrgNameFull Latitude
## Min. : 1.0 Length:281 Length:281 Min. :48.37
## 1st Qu.: 95.0 Class :character Class :character 1st Qu.:49.79
## Median :273.0 Mode :character Mode :character Median :51.67
## Mean :262.4 Mean :51.97
## 3rd Qu.:409.0 3rd Qu.:53.83
## Max. :536.0 Max. :59.09
## Longitude OrgTypeID JurisdictionCode YearID
## Min. :-128.6 Min. :1.000 BC:106 Min. :2017
## 1st Qu.:-119.8 1st Qu.:1.000 AB:175 1st Qu.:2017
## Median :-115.1 Median :3.000 Median :2017
## Mean :-116.8 Mean :2.246 Mean :2017
## 3rd Qu.:-113.3 3rd Qu.:3.000 3rd Qu.:2017
## Max. :-110.0 Max. :3.000 Max. :2017
## Taxation Population LandArea PerCapitaTax
## Min. :1.742e+05 Min. : 1001 Min. : 1 Min. : 121.2
## 1st Qu.:2.897e+06 1st Qu.: 2427 1st Qu.: 19 1st Qu.: 925.1
## Median :8.531e+06 Median : 5040 Median : 915 Median : 1200.7
## Mean :3.191e+07 Mean : 20881 Mean : 34112 Mean : 1835.3
## 3rd Qu.:2.129e+07 3rd Qu.: 11349 3rd Qu.: 3716 3rd Qu.: 1773.8
## Max. :1.702e+09 Max. :1306784 Max. :8458900 Max. :15347.5
## logTaxation logPopulation logLandArea
## Min. :12.07 Min. : 6.909 Min. : 0.000
## 1st Qu.:14.88 1st Qu.: 7.794 1st Qu.: 2.944
## Median :15.96 Median : 8.525 Median : 6.819
## Mean :15.95 Mean : 8.713 Mean : 5.920
## 3rd Qu.:16.87 3rd Qu.: 9.337 3rd Qu.: 8.220
## Max. :21.26 Max. :14.083 Max. :15.951
One nice thing about Tableau is that it fits a line to the data in a scatterplot and provides much of the information from the underlying linear model. One bad thing about Tableau is that it does not provide all of the information from the underlying linear model. Specifically, it does not provide regression diagnostics. Fitting a line to some data points is one thing; satisfying the assumptions of the underlying technique is another.
Although I logged the axes in the Tableau visualization, Tableau creates its regression using the unlogged data. In addition, by default Tableau forces the intercept through the point (0,0). Although this makes for a more attractive graph, the practice of forcing the intercept is controversial (especially in this case, in which the intercept is well outside the sample—there are no functioning local governments with zero population).
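In R, the distinction is just a matter of the model formula: including 0 + (or -1) in the formula suppresses the intercept, while omitting it lets the intercept be estimated from the data. For comparison with the zero-intercept models fitted next, the free-intercept specification for BC would simply be:
#Free-intercept counterpart (for comparison only; not part of the analysis below)
lm(Taxation ~ Population, data=dat[dat$JurisdictionCode=='BC',])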
Running separate zero-intercept regressions for BC and AB, we get roughly the same parameter estimates as Tableau. I am not exactly sure why they are not identical, but it is hard to know what is going on under the hood in a tool like Tableau.
#Simple models for BC and AB
mod.simple.BC <- lm(Taxation ~ 0 + Population, data=dat[dat$JurisdictionCode=='BC',])
summary(mod.simple.BC)
##
## Call:
## lm(formula = Taxation ~ 0 + Population, data = dat[dat$JurisdictionCode ==
## "BC", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22670446 -1220205 -109140 1909577 40362346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Population 1102.71 25.43 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8162000 on 105 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9466
## F-statistic: 1881 on 1 and 105 DF, p-value: < 2.2e-16
mod.simple.AB <- lm(Taxation ~ 0 + Population, data=dat[dat$JurisdictionCode=='AB',])
summary(mod.simple.AB)
##
## Call:
## lm(formula = Taxation ~ 0 + Population, data = dat[dat$JurisdictionCode ==
## "AB", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -123373294 -1851340 -302186 6333270 576674592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Population 1396.77 28.34 49.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47510000 on 174 degrees of freedom
## Multiple R-squared: 0.9332, Adjusted R-squared: 0.9328
## F-statistic: 2429 on 1 and 174 DF, p-value: < 2.2e-16
The problem with these two parallel models is that, although they look good when plotted and have very high values for explained variance (\(R^2\)), the regression diagnostics suggest serious problems. Linear regression is finicky and its assumptions must be satisfied for its results to be legitimate. For example, a core requirement of regression is that the residuals (the differences between observed and predicted values) be “random” rather than attributable to a specific cause. In practical terms, this means the residuals should be normally distributed around a mean of zero. The plots of the residuals—which are not provided by Tableau’s simple line-fitting functionality—show that this is not the case. The histogram of residuals for BC makes it clear that the errors are not random, and the boxplot shows asymmetry in the form of many very large positive residuals. Thus, as is often the case, the complexities of real-world data compel us to go beyond the simple line-fitting functionality of Tableau or Excel and construct a more complex regression model.
hist(resid(mod.simple.BC))
boxplot(resid(mod.simple.BC), horizontal = TRUE)
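For readers who prefer a numerical check to eyeballing plots, a Shapiro-Wilk test (built into base R) is one option; a very small p-value indicates a departure from normality. This check is an optional extra, not part of the original analysis:
#Optional numerical check of residual normality
shapiro.test(resid(mod.simple.BC))
shapiro.test(resid(mod.simple.AB))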
Recall that the clean linear relationships in the visualization were the result of logging both axes. Indeed, if we unlog the axes in the visualization, we get a much less compelling representation of the relationship between taxation and population:
#Unlogged version of the Tableau visualization
ggplot(dat, mapping=aes(x=Population, y=Taxation, color=JurisdictionCode)) +
  geom_point(shape = 1, size=3) +
  scale_shape(solid=FALSE) +
  ylab("Total Tax") +
  xlab("Population") +
  geom_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Thus, an obvious first step to improving the regression model is to log taxes and population. Here I use the logged version of the variables created in the preliminaries:
mod.simple <- lm(logTaxation ~ logPopulation + JurisdictionCode, data=dat)
summary(mod.simple)
##
## Call:
## lm(formula = logTaxation ~ logPopulation + JurisdictionCode,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5625 -0.4285 -0.1456 0.2998 2.2733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.97344 0.28516 24.454 < 2e-16 ***
## logPopulation 1.00675 0.03161 31.852 < 2e-16 ***
## JurisdictionCodeAB 0.33713 0.07936 4.248 2.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6437 on 278 degrees of freedom
## Multiple R-squared: 0.7859, Adjusted R-squared: 0.7844
## F-statistic: 510.2 on 2 and 278 DF, p-value: < 2.2e-16
Before we interpret this model, it is good practice to examine the residuals for normality. Below I show a histogram, boxplot, and a q-q plot. Although none of these suggests perfect normality, this is a huge improvement over the non-logged models. I therefore conclude that the model is good enough for high-level root cause analysis and move on to interpretation.
hist(resid(mod.simple))
boxplot(resid(mod.simple), horizontal = TRUE)
qqnorm(resid(mod.simple))
qqline(resid(mod.simple))
The downside of logging one or more variables in a regression model is that we lose the clean linear interpretation of the coefficient for each explanatory variable as a slope. Reading the Tableau trendlines, we could say that each additional person in BC results in an average increase in taxation of $1,102 and each additional person in AB results in an average increase of $1,397. We know from the distribution of the residuals that these estimates are unreliable, but at least they are easy to understand.
More reliable conclusions are provided by the transformed model: the natural logarithm of taxes increases by 1.00675 (on average) when the natural logarithm of population increases by 1.0. Of course, no one likes to think in terms of logarithms, so coefficients in these types of regressions are typically interpreted as elasticities: a one percent increase in the explanatory variable leads to a \({coefficient}\%\) increase in the response variable. This is an approximation, but it is generally close enough. Thus, a 1% increase in population leads to, on average, a roughly 1% increase in taxation.
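As a quick sanity check on the elasticity approximation, we can compute the exact multiplicative effect of a 1% increase in population implied by the fitted coefficient. This small calculation is added here for illustration only:
#Exact effect of a 1% increase in population implied by the log-log model
b.pop <- coef(mod.simple)["logPopulation"]
1.01^b.pop   #approximately 1.010, i.e., taxation rises by roughly 1%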
A more practical interpretation of the transformed model simply focuses on the significance of the coefficients:
logPopulation 1.00675 0.03161 31.852 < 2e-16 ***
JurisdictionCodeAB 0.33713 0.07936 4.248 2.95e-05 ***
First, we note that the coefficient of logPopulation is
highly significant. This comes as no surprise: the whole take-away from
the Tableau visualization is that population almost completely
determines taxes. Second, the coefficient for the dummy variable
JurisdictionCodeAB is positive and significant. What this
suggests is that, controlling for population, Albertan communities
collect significantly more tax than British Columbian communities. This
is an interesting result given its thorny political implications. But
before we ascribe differences in taxation to political jurisdiction, we
need to ask whether there are other sources of this difference.
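Before moving on, it is worth translating the size of that jurisdiction effect out of log units. Because the response is the natural log of taxation, the dummy coefficient has a multiplicative interpretation; the following one-liner (added for illustration) makes it concrete:
#Multiplicative interpretation of the AB dummy in the log-linear model
exp(coef(mod.simple)["JurisdictionCodeAB"])   #about 1.40: controlling for population, AB municipalities collect roughly 40% more tax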
When we build explanatory models, we invariably run into practical limitations. First, we can only test hypotheses that we can imagine. Thus, we have to know something about the underlying mechanisms of the phenomenon we are studying. Second, even if we can imagine a promising hypothesis, we cannot test it unless we have the right data. And data is hard to come by. So, in our search for explanatory models, we are always bound by practical limitations on domain knowledge and data.
In this case, I doubt that Alberta communities are systematically more expensive to run than British Columbian communities. But I am vaguely aware that Alberta tends to be more resource intensive than BC, and my understanding of resource-based communities is that they tend to be small, remote, and perhaps sprawling. So here is a hypothesis: small, remote, sprawling communities are more expensive to operate. There are several potential reasons for this: First, small communities fail to realize the same economies of scale as larger communities. Thus, there is a potential second-order effect of population. We already know that taxes increase with population, but economies of scale mean that this increase slows down as population increases. More nonlinearity—ugh. Second, remoteness means everything is far away and, as a consequence, expensive. And remoteness in Canada typically implies northernness, which creates additional sources of cost (heating, snow removal, need for alternative recreation, and so on). Finally, the costs of sprawl are obvious: more roads to maintain, larger distances to travel, and so on.
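As an aside, the second-order (economies of scale) effect of population could be probed directly by adding a squared term to the logged model. I do not pursue this below, but a minimal sketch would look like this:
#Sketch only: testing for a second-order effect of population (not part of the models below)
mod.quad <- lm(logTaxation ~ logPopulation + I(logPopulation^2) + JurisdictionCode, data=dat)
summary(mod.quad)   #a significant negative coefficient on the squared term is consistent with economies of scale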
The provincial data sets contain data on population (already seen) and (at least recently) physical size. However, they do not contain any direct measures of remoteness (e.g., distance from major centers). It is fairly straightforward, though, to use the Google Maps API to grab the latitude and longitude of each community and use latitude as a proxy for remoteness (this relationship between remoteness and latitude is not universal, but it likely works well enough in Canada).
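For completeness, here is a sketch of one way to pull the coordinates from R. The ggmap package is just one of several wrappers for the Google Maps API; it requires an API key, and the Address string constructed below is an illustrative guess at a geocodable address, not the exact query used:
#One way to geocode municipalities from R (sketch; requires a Google Maps API key)
library(ggmap)
register_google(key = "YOUR_API_KEY")   #placeholder key
dat.geo <- dat %>%
  mutate(Address = paste0(OrgName, ", ", JurisdictionCode, ", Canada")) %>%
  mutate_geocode(Address)   #adds lon and lat columns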
We can thus create a kitchen sink model that includes all
our variables—well almost. I do not include longitude because (a) I have
no a priori reason to think longitude influences taxation and
(b) it is clearly collinear with JurisdictionCode, which I
want to keep in the model to test its significance. The kitchen sink and
its diagnostic plots follow:
#Kitchen sink model
mod.ks <- lm(logTaxation ~ Latitude + logPopulation + logLandArea + JurisdictionCode, data=dat)
summary(mod.ks)
##
## Call:
## lm(formula = logTaxation ~ Latitude + logPopulation + logLandArea +
## JurisdictionCode, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76462 -0.22660 -0.02237 0.21362 1.48200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.233287 0.649584 4.977 1.14e-06 ***
## Latitude 0.064107 0.011895 5.389 1.52e-07 ***
## logPopulation 0.907507 0.021135 42.938 < 2e-16 ***
## logLandArea 0.179133 0.009703 18.462 < 2e-16 ***
## JurisdictionCodeAB 0.679228 0.064503 10.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4013 on 276 degrees of freedom
## Multiple R-squared: 0.9174, Adjusted R-squared: 0.9162
## F-statistic: 766.1 on 4 and 276 DF, p-value: < 2.2e-16
hist(resid(mod.ks))
boxplot(resid(mod.ks), horizontal = TRUE)
qqnorm(resid(mod.ks))
qqline(resid(mod.ks))
This gives a somewhat better result. The \(R^2\) of this final model is a respectable 0.92. More critically, its fit plot looks convincing: the model does a pretty good job predicting overall taxation as a function of population, latitude, and physical extent.
plot(fitted(mod.ks), dat$logTaxation, xlab="predicted value", ylab="observed value")
abline(0, 1)
I like to create a tornado diagram of the standardized coefficients. This gives a handy graphical guide to what is driving the explanatory model. In this case, we can see (not surprisingly) that population is the dominant explanatory variable:
# now need an explicit 0,1 dummy variable (instead of a factor) for a standardized regression in R
dat <- dat %>% mutate(JurisdictionCode_AB.T = ifelse(JurisdictionCode == "AB", 1, 0))
mod.ks <- lm(logTaxation ~ Latitude + logPopulation + logLandArea + JurisdictionCode_AB.T, data=dat)
mod.std <- lm.beta(mod.ks)
#enframe() converts the named vector of standardized betas into a two-column tibble
#(tidy() on a plain numeric vector is deprecated in recent versions of broom)
enframe(mod.std, name="names", value="x") %>% ggplot() +
  geom_col(mapping=aes(x=reorder(names, abs(x)), y=x), fill="blue") +
  xlab("") +
  ylab("Standardized Coefficient") +
  coord_flip()
Given this reasonably accurate predictive model of taxation, we can invite controversy and abuse by comparing predicted taxation to observed taxation for each municipality. Conceptually, municipalities that fall below the 45\(^\circ\) fit line are taxing at a level lower than the model predicts, whereas those above the fit line are taxing more than we might expect given their population, latitude, and physical size.
obs_max <- max(dat$logTaxation)
obs_min <- min(dat$logTaxation)
mod.ks %>% augment(data = dat) %>%
  ggplot(mapping = aes(x=.fitted, y=logTaxation, color=JurisdictionCode)) +
  geom_point() +
  ggtitle("Observed vs. Predicted Taxation") +
  xlab("Predicted") +
  ylab("Observed") +
  xlim(obs_min, obs_max) +
  ylim(obs_min, obs_max) +
  geom_abline(slope=1, intercept=0, color="blue")
I have not included municipality names on the fit plot because hundreds of overlapping labels make the chart unreadable. An alternative is simply to list communities in order of the relative size of their deviation from the model. In the column chart below, I plot \({residual}/{fitted} \times 100\) for each municipality. Note that both the residual and the fitted value are in log dollars, so this is a relative deviation on the log scale rather than a percentage of tax dollars. I do not show all ~300 communities below, just a random sample of 30. If you want the complete list, you can change the sample size in the R code or look at the follow-on Tableau visualizations here.
#Column (sampled)
mod.ks %>% augment(data = dat) %>%
  sample_n(30, replace=FALSE) %>%
  mutate(predict_per = .resid/.fitted * 100) %>%
  ggplot(mapping = aes(x=reorder(OrgNameFull, predict_per), y=predict_per, fill=JurisdictionCode)) +
  geom_col() +
  xlab("") +
  ylab("Relative Deviation from Model") +
  coord_flip()
The point here is not to call out communities that are relatively high or low on this measure. Indeed, the percentage variation from the model is generally small. More than half the communities are within \(\pm\) 2% of their predicted value. (Caveat: recall that the residuals are in log(dollars), so actual dollar deviations from expectations are higher.)
We can predict municipal taxation in Alberta and British Columbia fairly well using just three variables: population, latitude, and physical extent. We could likely do even better with other variables, such as road network size or the existence of other sources of funding (special grants, money-generating utilities, and so on). However, I have simply used data on hand through provincial financial reporting sites and public sources, such as Google Maps.
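To make the predictive claim concrete, here is a minimal sketch of scoring a hypothetical municipality with the final model. All of the input values below are invented for illustration; recall that mod.ks was refit above with the 0/1 dummy JurisdictionCode_AB.T and that predictions come back in log dollars:
#Sketch: predicted taxation for a hypothetical municipality (invented values)
new.muni <- tibble(Latitude = 54,                  #hypothetical latitude
                   logPopulation = log(10000),     #hypothetical population of 10,000
                   logLandArea = log(50),          #hypothetical land area (same units as LandArea)
                   JurisdictionCode_AB.T = 1)      #Alberta
exp(predict(mod.ks, newdata = new.muni))           #back-transform from log dollars to dollars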
We do see some deviation from expectations in taxation from community to community. However, deviation from the model appears to decrease as population increases. This heteroskedasticity is a methodological nuisance, but its practical implication is more reassuring: large communities have predictable property taxation.
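A quick way to check that observation is to plot the residuals of the final model against log population; the lmtest package also provides a formal Breusch-Pagan test. Both are shown below as a sketch rather than as a formal part of the analysis:
#Checking whether residual spread shrinks as population grows
plot(dat$logPopulation, resid(mod.ks), xlab="log(Population)", ylab="Residual")
library(lmtest)
bptest(mod.ks)   #a small p-value indicates heteroskedasticity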