A small group of Quantified Selfers tested themselves daily on arithmetic and engaged in a month of meditation. I analyze their scores with a multilevel model with per-subject grouping, and find the expect result: a small decrease in arithmetic errors which is not statistically-significant, with practice & time-of-day effects (but not day-of-week or weekend effects). This suggests a longer experiment by twice as many experimenters in order to detect this effect.

Quantified Self hobbyist Peter Lewis in April 2013 began a self-experiment with 11 other hobbyists (with the results posted on Seth Robert’s blog): they used a cellphone app “Math Workout” to do timed simple arithmetic problems (eg. “3.7 + 7.3, 93 + 18, 14 * 7, and 12² + √9”) in an ABA experimental design, where B was practicing mindfulness meditation. This was motivated in part by the observation that, while there is a ton of scientific research on meditation, the studies are often of doubtful quality (he cited a remarkable 2007 collection of meditation meta-analyses).

Lewis thought the results were positive inasmuch as his timings & error rate fell during the meditation period and the falls seemed visually clear when graphed, but some of the participants dropped out and others had graphs that were not so positive.

The interpretation of the data from a statistical point of view was not terribly clear. Lewis & Roberts did not do the analysis. Reading the post, I thought that it was a perfect dataset to try out multilevel models on, as this sort of psychology setup (multiple groups of data per subject, and multiple subjects) is a standard use-case.

The TestDuration variable jumps out as bizarre: it goes from 143 seconds to 2185.7 seconds? A closer look shows that the last two entries are 368.42, and then 2185.66. This is a bizarre outlier, so I deleted it. The errors also jump from 4 to 25, and a closer look shows a similar jump from 16 to the maximum value 25, which will distort plots later on, so I chose to make 16 the ceiling. We also don’t want the dummy rows which indicate a meditation session, they’re redundant once the Meditated variable has been created. Now that the data is cleaned up and coded, we can look at it. For example, all error scores colored by subject:

The original data was provided by Lewis in an Excel spreadsheet , but formatted in a way I couldn’t make use of. I extracted the data, turned it into a ‘long’ row-by-row format, added IDs for each subject, transformed the dates into R format, and added a variable for each score whether it was on a meditation day. The result was a CSV :

Descriptive We have two response variables, number of errors & how long it took to finish the set of 50 arithmetic problems: library(ggplot2); qplot(Date, jitter(ErrorCount, factor=1.5), color=Subject, data=mdtt, ylab="Number of errors out of 50 problems") qplot(Date, TestDuration, color=Subject, data=mdtt, ylab="Time taken to answer 50 problems (lower=better)") We can clearly see a practice effect in both test duration & number of errors, where subjects are gradually getting better over time. We will need to remember to control for that. Splitting by subject instead and looking at small multiples for any ABA patterns (we delete 3 of the attrited subjects to free up vertical space): mdtt2 <- mdtt[ as.integer (mdtt $ Subject) < 9 ,] qplot (Date, ErrorCount, color= Meditated, facets= Subject ~ ., data= mdtt2) qplot (Date, TestDuration, color= Meditated, facets= Subject ~ ., data= mdtt2) Errors in each phase of ABA , stratified by subject Test time required in each phase, by subject Overall, these are looking equivocal. There might be something going on, there might not be - ErrorCount looks better than TestDuration , but in both cases, the clear practice effect makes comparison hard.

Goals Roberts comments: Another way to improve this experiment would be to do statistical tests that generate p values; this would give a better indication of the strength of the evidence. Because this experiment didn’t reach steady states, the best tests are complicated (e.g., comparison of slopes of fitted lines). With steady-state data, these tests are simple (e.g., comparison of means). If you are sophisticated at statistics, you could look for a time-of-day effect (are tests later in the day faster?), a day-of-week effect, and so on. If these effects exist, their removal would make the experiment more sensitive. In my brain-function experiments, I use a small number of problems so that I can adjust for problem difficulty. That isn’t possible here. p-values are standard in psychology, of course, despite their conceptual & practical problems which help contribute to systematic problems in science, so we’ll calculate one. But we know in advance what the result will be: greater than 0.05, or non-statistically-significant. A p-value is mostly a function of how much data you have and how large the effect size is - we know a priori that this is a small experiment, and we also know that the effect of meditation on arithmetic scores cannot be huge because otherwise we would have heard of the groundbreaking research & society would make kids meditate before standardized exams & the subjects didn’t do much meditation & they were inexperienced etc. A small experiment on a small effect is guaranteed to not turn in a statistically-significant result, even when there genuinely is an effect, because the experiment doesn’t have much statistical power. In fact, a decrease in errors large enough to reach a statistical-significance of p<0.05 would be so bizarre & unexpected that it would be grounds for throwing out the data entirely, as someone must be lying or have screwed something up or something have gone terribly wrong somewhere. So the best we can hope for is to show that there was a benefit at all, rather than a penalty - it’s not clear, visually, that we’ll see better scores once we take into account a practice effect of time. Having shown this, then we can ask additional questions like, “how large an experiment would we need to reach statistical-significance, anyway?”

Multilevel models So, we’ll set up a few models in lme4 and compare them. The first model is the simplest: each subject has a different level of base accuracy (intercept), but are influenced the same way by meditation (meditation is a fixed effect): library (lme4) mlm1 <- lmer (ErrorCount ~ Meditated + ( 1 | Subject), data= mdtt); mlm1 ... AIC BIC logLik deviance REMLdev 3793 3812 -1893 3784 3785 Random effects : Groups Name Variance Std.Dev. Subject (Intercept) 1.37 1.17 Residual 5.77 2.40 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 2.892 0.368 7.87 MeditatedTRUE -0.213 0.179 -1.19 Correlation of Fixed Effects : (Intr) MedittdTRUE -0.139 (This is with a normal distribution family; using a Poisson distribution with family=poisson would give a more accurate model but I am not yet comfortable using or interpreting Poissons.) We get a fixed effect of -0.2 or 1/5 of an error less on meditation days. Not a big effect. The next step is to imagine that besides having differing natural levels of accuracy, meditation may affect some subjects more or less: mlm2 <- lmer (ErrorCount ~ Meditated + ( 1 + Meditated | Subject), data= mdtt); mlm2 ... AIC BIC logLik deviance REMLdev 3794 3822 -1891 3780 3782 Random effects : Groups Name Variance Std.Dev. Corr Subject (Intercept) 1.6378 1.280 MeditatedTRUE 0.0913 0.302 -1.000 Residual 5.7472 2.397 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 2.881 0.398 7.23 MeditatedTRUE -0.144 0.196 -0.74 Correlation of Fixed Effects : (Intr) MedittdTRUE -0.568 In this view of things, there’s a weaker effect of meditation: 1/7th of an error less (rather than 1/5th). But of course, what about the practice effect? The date should matter! Let’s re-run those two models with the date: mlm1 .2 <- lmer (ErrorCount ~ Meditated + Date + ( 1 | Subject), data= mdtt); mlm1 .2 ... AIC BIC logLik deviance REMLdev 3791 3814 -1890 3772 3781 Random effects : Groups Name Variance Std.Dev. Subject (Intercept) 1.58 1.26 Residual 5.69 2.39 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 698.5757 203.2720 3.44 MeditatedTRUE -0.1978 0.1777 -1.11 Date -0.0440 0.0129 -3.42 ... R > mlm2 .2 <- lmer (ErrorCount ~ Meditated + Date + ( 1 + Meditated | Subject), data= mdtt); mlm2 .2 ... AIC BIC logLik deviance REMLdev 3791 3824 -1889 3769 3777 Random effects : Groups Name Variance Std.Dev. Corr Subject (Intercept) 1.867 1.37 MeditatedTRUE 0.109 0.33 -1.000 Residual 5.662 2.38 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 706.2475 203.3645 3.47 MeditatedTRUE -0.0928 0.1987 -0.47 Date -0.0445 0.0129 -3.46 ... R > # what do the per-subject meditation estimates look like? R > # this is similar to what we would see if we did separate regressions on each subject: R > ranef (mlm2 .2 ) $ Subject (Intercept) MeditatedTRUE 00 -0.50140 0.11975 01 -1.32272 0.31590 02 -0.12651 0.03021 03 2.99856 -0.71614 04 0.96203 -0.22976 05 1.33934 -0.31987 06 -0.09078 0.02168 07 0.04431 -0.01058 08 -0.69278 0.16546 09 -0.64673 0.15446 10 -0.12256 0.02927 11 -1.84076 0.43963 R > # 4/11 have negative meditation estimates Both agree that there is definitely a practice effect, it’s probably statistically-significant with such large t-values, and the effect size is cumulatively large - at -0.045 errors per day, we’d expect our error rate to decrease by 1.4 errors a month - which is a lot when we remember that the total average errors per session was 3.1! Time of day So, that’s the practice effect verified. What else? Roberts suggested “time-of-day effect (are tests later in the day faster?)” and “day-of-week effect”. First we’ll convert the times to hours and regress on that: rm (x) for (i in 1 : nrow (mdtt)) { y <- strsplit (mdtt $ TimeOfDay[i], ":" )[[ 1 ]]; x[i] <- as.integer (y[[ 1 ]]) * 60 + as.integer (y[[ 2 ]]); } mdtt $ Hours <- (x / 60 ) mlm3 <- lmer (ErrorCount ~ Meditated + Date + Hours + ( 1 | Subject), data= mdtt); mlm3 ... AIC BIC logLik deviance REMLdev 3789 3817 -1888 3762 3777 Random effects : Groups Name Variance Std.Dev. Subject (Intercept) 1.52 1.23 Residual 5.63 2.37 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 690.4037 202.0927 3.42 MeditatedTRUE -0.1875 0.1767 -1.06 Date -0.0434 0.0128 -3.40 Hours -0.0526 0.0162 -3.25 To my surprise, there really is a time-of-day effect. (I expected that there wouldn’t be enough data to see one, or that it would be a complex squiggle corresponding to circadian rhythms.) And it’s a simple linear one, too, which jumps out when one plots errors against hour: qplot(Hours, ErrorCount, color=Meditated, data=mdtt) The lesson here is that doing arithmetic at 7AM is not a good idea. (Would that American high schools could realize this…) But moving on: I really don’t think there’s enough data to get a day-of-week effect as Roberts suggests, so I’ll also do a simpler split on weekends. mdtt $ Day <- weekdays.Date (mdtt $ Date) mdtt $ WeekEnd <- mdtt $ Day == "Sunday" | mdtt $ Day == "Saturday" mlm4 <- lmer (ErrorCount ~ Meditated + Date + Hours + Day + ( 1 | Subject), data= mdtt); mlm4 ... AIC BIC logLik deviance REMLdev 3801 3858 -1889 3757 3777 Random effects : Groups Name Variance Std.Dev. Subject (Intercept) 1.47 1.21 Residual 5.64 2.37 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 715.0345 204.1072 3.50 MeditatedTRUE -0.1610 0.1803 -0.89 Date -0.0450 0.0129 -3.48 Hours -0.0518 0.0163 -3.17 DayMonday -0.4196 0.3138 -1.34 DaySaturday -0.3912 0.2959 -1.32 DaySunday -0.1379 0.3032 -0.45 DayThursday -0.4688 0.2894 -1.62 DayTuesday -0.3116 0.3164 -0.99 DayWednesday -0.1052 0.3267 -0.32 ... mlm5 <- lmer (ErrorCount ~ Meditated + Date + Hours + WeekEnd + ( 1 | Subject), data= mdtt); mlm5 ... AIC BIC logLik deviance REMLdev 3792 3825 -1889 3762 3778 Random effects : Groups Name Variance Std.Dev. Subject (Intercept) 1.52 1.23 Residual 5.63 2.37 Number of obs : 818 , groups : Subject, 12 Fixed effects : Estimate Std. Error t value (Intercept) 690.57787 202.44963 3.41 MeditatedTRUE -0.18743 0.17684 -1.06 Date -0.04345 0.01280 -3.39 Hours -0.05263 0.01623 -3.24 WeekEndTRUE -0.00412 0.18231 -0.02 ... Neither one is impressive. The model using weekends is completely useless, while the model using days looks like it’s overfitting the data. Which of the 5 models is better, overall and keeping in mind that simplicity is good (we don’t want to overfit the data by throwing in a huge number of random variables which might “explain” variation in error-scores)? anova (mlm1, mlm2, mlm1 .2 , mlm2 .2 , mlm3, mlm4, mlm5) ... Df AIC BIC logLik Chisq Chi Df Pr ( > Chisq) mlm1 4 3792 3810 -1892 mlm1 .2 5 3782 3806 -1886 11.44 1 0.00072 mlm2 6 3792 3821 -1890 0.00 1 1.00000 mlm3 6 3774 3802 -1881 18.75 0 < 2e-16 mlm2 .2 7 3774 3807 -1880 1.68 1 0.19554 mlm5 7 3776 3809 -1881 0.00 0 1.00000 mlm4 12 3781 3838 -1879 4.26 5 0.51227 As we might have guessed, model 3 ( Meditated + Date + Hours + (1|Subject) ) wins because the practice effect & time-of-day effect matter but apparently not day-of-week in the small dataset we have. (In a larger dataset, then the fancier models might do better.) Significance So, in mlm3 the meditation coefficient was -0.1875. How reliable is this meditation coefficient? We’d like a p-value, and the t-values aren’t clear: -1.106? library (languageR) mcmc <- pvals.fnc (mlm3) mcmc $ fixed Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr ( > | t | ) (Intercept) 690.4037 679.5600 268.8057 1069.9631 0.0018 0.0007 MeditatedTRUE -0.1875 -0.1837 -0.5400 0.1511 0.2948 0.2889 Date -0.0434 -0.0427 -0.0679 -0.0172 0.0018 0.0007 Hours -0.0526 -0.0529 -0.0839 -0.0205 0.0012 0.0012 The practice & time-of-day effects are pretty statistically-significant, but we get a p-value of 0.29 for meditation at -0.15 (-0.54-0.15). To put this in perspective: the dataset just isn’t a whole lot of data and the data is heavily loaded on non-meditation data - not such an issue for the date or time-of-day effects which get spread around and estimated evenly, but bad for the meditation data. Another check is by bootstrapping our model from the data and seeing how the meditation estimate varies - does it vary like it did above? library (boot) meditationEstimate <- function (dt, indices) { d <- dt[indices,] # allows boot to select subsample mlm3 <- lmer (ErrorCount ~ Meditated + Date + Hours + ( 1 | Subject), data= d) return ( fixef (mlm3)[ 2 ]) } bs <- boot ( data= mdtt, statistic= meditationEstimate, R= 200000 , parallel= "multicore" , ncpus= 4 ); bs ... Bootstrap Statistics : original bias std. error t1 * -0.1656 -0.002162 0.1741 boot.ci (bs) ... Intervals : Level Percentile 95 % ( - 0.5108 , 0.1735 ) Calculations and Intervals on Original Scale Very similar: -0.16 (-0.51-0.17).