On the Monkey Cage blog, Baptiste Coulmont (a.k.a. @coulmont) recently uploaded a post entitled “You can vote twice ! The many political appeals of proxy votes in France“, coauthored with Joël Gombin (a.k.a. @joelgombin), and myself. The study was initially written in French as mentioned in a previous post. Baptiste posted additional information on his blog (http://coulmont.com/blog/…) and I also wanted to post some lines of code, to mention a model that was not used in that study (more complex to analyze, but more realistic, and with the same conclusions). The econometric study is based on aggregated voted, with a possible ecological misinterpretation.

Regression Model: Possible Explanatory Variables

The first idea was to model proxies using a binomial regression, per pooling station where denote the number of proxy vote, per station , and denotes the number of voters. Proportion can be a function of possible explanatory variables (on Baptiste’s blog there are additional information about the datasets, obtained from insee.fr and opendata.paris.fr)

> bt1=read.table("paris2007-pres-t1.csv",header=TRUE,sep=";") > bt2=read.table("paris2007-pres-t2.csv",header=TRUE,sep=";") > bv=read.table("paris-bv-insee-07.csv",header=TRUE,sep=";") > bv$BV=bv$BVCOM > baset1=merge(bt1,bv,by="BV") > baset2=merge(bt2,bv,by="BV") > baset1$LOGEMENT=baset1$PROPRIO+baset1$LOCNONHLM+baset1$LOCHLM+baset1$GRATUIT > baset2$LOGEMENT=baset2$PROPRIO+baset2$LOCNONHLM+baset2$LOCHLM+baset2$GRATUIT

For instance, assume that is a function of the proportion of owner of the place people live in, denoted in the neighborhood of the pooling station,

> variable="PROPRIO" > reference="LOGEMENT" > baset1$taux=baset1[,variable]/baset1[,reference] > baset2$taux=baset2[,variable]/baset2[,reference]

We can consider a logistic regression

or a logistic regression with splines, if we do not want to assume a linear model

With cubic splines, the code is

> b=hist(baset1$taux,plot=FALSE) > library(splines) > regt1=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset1) > regt2=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset2) > u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001) > ND=data.frame(taux=u) > ug=seq(0,max(baset1$taux)+.05,by=.001) > pt1=predict(regt1,newdata=ND,se=TRUE,type="response") > pt2=predict(regt2,newdata=ND,se=TRUE,type="response") > library(RColorBrewer) > CL=brewer.pal(6, "RdBu") > plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration", + ylim=c(0,.1)) > for(i in 1:(length(b$breaks)-1)){ + polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0) + /max(b$counts)*.05,col="light yellow",border=NA)} > polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)), + border=NA,density=30,col=CL[4])

while a standard logistic regression would be

> lines(u,pt1$fit,col=CL[6],lwd=2) > polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)), + border=NA,density=30,col=CL[3]) > lines(u,pt2$fit,col=CL[1],lwd=2) > regt1l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset1) > regt2l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset2) > ND=data.frame(taux=ug) > pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response") > pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response") > lines(ug,pt1l$fit,col=CL[5],lty=2) > lines(ug,pt2l$fit,col=CL[2],lty=2) > legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c(1,6)], + lwd=2,lty=1,border=NA)

Here it is (the confidence region is for the spline regression) with on blue the first round of the Presidential election, and in red, the second round (in France, it’s a two-round system)

(the legend of the y axis is not correct). We can consider as explanatory variable the rate of H.L.M., low-cost housing or council housing,

If I like the graph, unfortunately, the interpretation of coefficient might be complicated

> summary(regt1l) Call: glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, data = baset1, weights = INSCRITS) Deviance Residuals: Min 1Q Median 3Q Max -12.9549 -1.5722 0.0319 1.6292 13.1303 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.70811 0.01516 -244.6 <2e-16 *** taux 1.49666 0.04012 37.3 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 12507 on 836 degrees of freedom Residual deviance: 11065 on 835 degrees of freedom AIC: 15699 Number of Fisher Scoring iterations: 4 > summary(regt2l) Call: glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, data = baset2, weights = INSCRITS) Deviance Residuals: Min 1Q Median 3Q Max -15.4872 -1.7817 -0.1615 1.6035 12.5596 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.24272 0.01230 -263.61 <2e-16 *** taux 1.45816 0.03266 44.65 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 9424.7 on 836 degrees of freedom Residual deviance: 7362.3 on 835 degrees of freedom AIC: 12531 Number of Fisher Scoring iterations: 4

So we did consider a standard linear regression model, for the proxy rate, per station,

(again, either a model with splines, or a standard linear model). The code is

> regt1=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset1) > regt2=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset2) > u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001) > ND=data.frame(taux=u) > ug=seq(0,max(baset1$taux)+.05,by=.001) > pt1=predict(regt1,newdata=ND,se=TRUE,type="response") > pt2=predict(regt2,newdata=ND,se=TRUE,type="response") > library(RColorBrewer) > CL=brewer.pal(6, "RdBu") > plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration", + ylim=c(0,.1)) > for(i in 1:(length(b$breaks)-1)){ + polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0) + /max(b$counts)*.05,col="light yellow",border=NA)} > polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)), + border=NA,density=30,col=CL[4]) > lines(u,pt1$fit,col=CL[6],lwd=2) > polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)), + border=NA,density=30,col=CL[3]) > lines(u,pt2$fit,col=CL[1],lwd=2) > regt1l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset1) > regt2l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset2) > ND=data.frame(taux=ug) > pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response") > pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response") > lines(ug,pt1l$fit,col=CL[5],lty=2) > lines(ug,pt2l$fit,col=CL[2],lty=2) > legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c(1,6)], + lwd=2,lty=1,border=NA)

Here is again the evolution as a function of the rate of owner of their homes,

The graph is rather close to the one before, and here, the interpretation of the summary table is more conventional,

> summary(regt1l) Call: lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset1, weights = INSCRITS) Weighted Residuals: Min 1Q Median 3Q Max -1.9994 -0.2926 0.0011 0.3173 3.2072 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.021268 0.001739 12.23 <2e-16 *** taux 0.054371 0.004812 11.30 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.646 on 835 degrees of freedom Multiple R-squared: 0.1326, Adjusted R-squared: 0.1316 F-statistic: 127.7 on 1 and 835 DF, p-value: < 2.2e-16 > summary(regt2l) Call: lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset2, weights = INSCRITS) Weighted Residuals: Min 1Q Median 3Q Max -2.9029 -0.4148 -0.0338 0.4029 3.4907 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.033909 0.001866 18.17 <2e-16 *** taux 0.079749 0.005165 15.44 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.6934 on 835 degrees of freedom Multiple R-squared: 0.2221, Adjusted R-squared: 0.2212 F-statistic: 238.4 on 1 and 835 DF, p-value: < 2.2e-16

We have used those codes to produce the graphs mentioned in the post. But before mentioning the residuals of the multiple model we considered, I wanted to share some awesome code that produce maps (I can say that those codes are awesome since Baptiste wrote most of them).

Visualization of Residuals on a Map of Paris

To plot the neighborhood of the pooling stations, one more time the post on Baptiste’s blog, explains how the shapefile was obtained from cartelec.net

> library(maptools) > library(rgdal) > library(classInt) > paris=readShapeSpatial("paris-cartelec.shp")

To visualize the proxy rate (the average of round one and round two), here is the code

> elec=data.frame() > elec=cbind(bt1$BV,(bt1$PROCURATIONS+bt2$PROCURATIONS),(bt1$EXPRIMES+bt2$EXPRIMES)) > colnames(elec)=c("BV","PROCURATIONS","EXPRIMES") > elec=as.data.frame(elec) > elec$BV=bt1$BV

To get nice colors, function of the rates, we use

> m=match(paris$BUREAU,elec$BV) > plotvar=100*elec$PROCURATIONS/elec$EXPRIMES > nclr=7 > plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] >(plotvar[m], nclr, style="fisher",dataPrecision=1) > colcode=findColours(class, plotclr)

and finally

> par(mar=c(1,1,1,1)) > plot(paris,col=colcode,border=colcode) > legend(656274.9, 6867308,legend=names(attr(colcode,"table")), + fill=attr(colcode, "palette"), cex=1, bty="n", + title="Frequence procurations (%)")

If we consider a model with three explanatory variable, to explain the proxy rate,

> regt1=lm(PROCURATIONS/INSCRITS~I(POP65P/POP)+ + I(PROPRIO/LOGEMENT)+I(CS3/POP1564),weights=INSCRITS,data=baset1)

we can plot the residuals using

> m=match(paris$BUREAU,elec$BV) > plotvar=100*residuals(regt1) > nclr=7 > plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] >(plotvar[m], nclr, style="fisher",dataPrecision=1) > colcode=findColours(class, plotclr) > par(mar=c(1,1,1,1)) > plot(paris,col=colcode,border=colcode) > legend(656274.9, 6867308,legend=names(attr(colcode,"table")), + fill=attr(colcode, "palette"), cex=1, bty="n",title="Residus")

It might not be a pure random spatial noise… But we could not get better with our small set of covariates.