To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.

But unfortunately, simply lowering the timeout will have minimal returns, as Analytics also reports that 82% of readers spend 0-10 seconds on pages. So we are stuck with a severe loss.

This isn’t even an efficient dichotomization: we could improve the fractional bit to 1 bit if we could somehow dichotomize at 50% of readers:

According to my Analytics, the mean reading time (time on page) is 1:47 and the maximum bracket, hit by 1% of viewers, is 1801 seconds; the range 1-1801 takes <10.8 bits to encode ( log2(1801) → 10.81 ), hence each page view could be represented by <10.8 bits (less, since reading time is so highly skewed). But if we dichotomize, then we learn simply that ~14% of readers will read for 40 seconds, hence each reader carries not ~10.8 bits, nor 1 bit (as they would if 50% read that long), but closer to 2/3 of a bit:
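The information-theoretic arithmetic here can be checked in a few lines of Python (using the ~14% conversion rate and 1801-second maximum quoted above):

```python
from math import log2

# Bits needed to encode the full 1-1801s reading-time range:
full_range = log2(1801)   # ~10.81 bits per page view, at most

# Entropy of the dichotomized 40-second-timeout outcome, P(convert) ~ 0.14:
def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

dichotomized = binary_entropy(0.14)   # ~0.58 bits - under 2/3 of a bit
best_case = binary_entropy(0.50)      # exactly 1 bit if we could split at 50%
```

So the 40-second dichotomization throws away all but ~0.6 of the <10.8 available bits per page view.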

https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-AU&answer=74345 “Time on page as a conversion goal” - every page converts, by using a timeout (mine is 40 seconds). Problem: dichotomizing a continuous variable into a single binary variable destroys a massive amount of information. This is well-known in the statistical and psychological literature (eg. MacCallum et al 2002 ) but I’ll illustrate further with some information-theoretical observations.

Oy vey! When I discovered Google had deleted my results, I decided to simply switch to 900px. Running a new test would not provide any better answers.

Ironically, I was warned at the beginning about both of these possible behaviors by papers I read on large-scale corporate A/B testing: http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf and http://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf and http://www.exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf They covered at length how many apparent trends simply evaporated, but they also described a peculiar phenomenon where A/B tests did not converge even after being run on ungodly amounts of data because the standard deviations kept changing (the user composition kept shifting, rendering previous data more uncertain). And it’s a general phenomenon that even for large correlations, the trend will bounce around a lot before it stabilizes ( Schönbrodt & Perugini 2013 ).

The second distressing thing was that Google’s estimated chance of a particular intervention beating the default (which I believe is a Bonferroni-corrected p-value) did not increase! Even as each version received 20,000 hits, the chance stubbornly bounced around the 70-90% range for 900px and 1300px. This remained true all the way to the bitter end: each version had racked up 93,000 hits and was still in the 80% decile. Wow.

The results were initially very promising: ‘conversion’ was defined as staying on a page for 40 seconds (I reasoned that this meant someone was actually reading the page), and had a base of around 70% of readers converting. With a few hundred hits, 900px converted at 10-20% more than the default! I was ecstatic. So when it began falling, I was only a little bothered (one had to expect some regression to the mean since the results were too good to be true). But as the hits increased into the low thousands, the effect kept shrinking all the way down to 0.4% improved conversion. At some points, 1300px actually exceeded 900px.

It ran from mid-June to 2012-08-01. Unfortunately, I cannot be more specific: on 1 August, Google deleted Website Optimizer and told everyone to use ‘Experiments’ in Google Analytics - and deleted all my information. The graph over time, the exact numbers - all gone. So this is from memory.

CSS-3 property: set how wide the page will be in pixels if unlimited screen real estate is available. I noticed some people complained that pages were ‘too wide’ and this made it hard to read, which apparently is a real thing since lines are supposed to fit in eye saccades. So I tossed in 800px, 900px, 1300px, and 1400px to the first A/B test.

Thus, banner ads on gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13% to -16%) and an expected traffic loss of -9.7%, exceeding the decision threshold for disabling ads and rendering further experimentation profitless.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense.
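Why autocorrelation guts power can be sketched with the standard effective-sample-size approximation for an AR(1) process (the ρ=0.7 here is a made-up illustration, not the fitted gwern.net value): the variance of a mean over n correlated days is inflated by roughly (1+ρ)/(1−ρ), shrinking the effective n accordingly.

```python
# Effective sample size of the mean of an AR(1) series with lag-1
# autocorrelation rho: n_eff ~ n * (1 - rho) / (1 + rho).
def effective_n(n, rho):
    return n * (1 - rho) / (1 + rho)

n_days = 280                      # roughly January-October of daily totals
print(effective_n(n_days, 0.0))   # 280.0: independent days, full power
print(effective_n(n_days, 0.7))   # ~49: heavy autocorrelation guts the power
```

Randomizing in short paired blocks and modeling the series with an ARIMA model is how the design claws some of that power back.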

One source of complexity & JavaScript use on gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic through long-term effects on readers or spillover mechanisms between them (eg. social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead, it would be better to analyze total traffic as a time-series experiment.

So I simply removed it. It was a bit of an experiment, and <8.9k searches does not seem worth it.

Even with the most optimistic possible assumptions (perfect conversion, no negative effect), it takes 279,449 page-views to get decent power. This is ridiculous from a cost-benefit perspective, and worse given that my priors are against it due to the extra JS & CSS it entails.

This might seem like a time to A/B test the presence/absence of the CSE div. (I can’t simply hide it using CSS like usual because it will still affect page loads.) Except consider the power issues: if that 1 CSE search converts, then to be profitable, it needs to damage the other 227 page-views’ conversion rate by <1/227. Or to put it the other way, the current conversion rate is ~17% of page-views and CSE search represents 0.44% of page-views, so if the CSE makes that one page-view 100% guaranteed to convert and the rest convert normally, then over 1000 page-views, we have 0.17⋅995+1.0⋅5=174 vs 0.17⋅995+0.17⋅5=170, or 17.4% vs 17.0%.
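The 279,449 page-view figure can be approximately reproduced with the usual normal-approximation power formula for comparing two proportions (assuming the conventional two-sided α=0.05 and 80% power; the 17.0% vs 17.4% rates are the best-case figures worked out above):

```python
from statistics import NormalDist

def two_prop_n(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

# The best case worked out above: 17.0% baseline vs 17.4% with a perfect CSE.
total = 2 * two_prop_n(0.170, 0.174)
print(round(total))   # ~279,000 page-views across both arms
```

The tiny 0.4% difference in the denominator is what blows the required sample size up into the hundreds of thousands.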

To put these 8855 searches in perspective: in that same exact time period, there were 891,790 unique users with 2,010,829 page views. So only 0.44% of page-views involve a use of the CSE, or a ratio of 1:227. Is it net-beneficial to make 227 page-views incur the JS run & loading for the sake of 1 CSE search?

There had been 8974 searches since I installed it 785 days previously or ~11.4 searches per day; at least 119 were searches for “e”, which I assume were user mistakes where they didn’t intend to search and probably annoyed them. (The next most popular searches are “Graeber”/26, “chunking”/22, and “nootropics”/10, with CSE refusing to provide any further queries due to low volume. This suggests a long tail of search queries - but also that they’re not very important since it’s easy to find the DNB FAQ & my nootropics page, and it can hardly be useful if the top search is an error.)

The problem is that the CSE search input takes up space in the sidebar, and is more JS to run on each page load and loads at least one other JS file as well. So on 2015-07-17, I took a look to evaluate whether it was worth keeping.

Google provides HTML & JS for integrating a CSE somewhere, so creating & installing it was straightforward, and it went live 2013-05-24.

A CSE is a Google search query but one specialized in various ways - somewhat like offering a user a form field which redirects to a Google search query like QUERY site:gwern.net/docs/ , but more powerful since you can specify thousands of URLs to blacklist and whitelist and have limited patterns. I have two: one is specialized for searching for anime/manga news sites and makes writing Wikipedia articles much easier (since you can search for a particular anime title and the results will be mostly news and reviews which you can use in a WP article, rather than images, songs, memes, Amazon and commercial sites, blogs, etc); and the second is specialized to search gwern.net , my Reddit, LessWrong, PredictionBook, Good Reads and some other sites, to make it easier to find something I may’ve written. The second I created to put in the sidebar and serve as a website search function. (I threw in the other sites because why not?)

A strange set of results. meta2 performs the best on new visitors, and worst on old visitors; while meta6 is the exact opposite. Because there are more new visitors than old visitors, meta2 is the best on average. Except I hate how meta2 looks and much prefer meta6. The confidence intervals are wide, though - it’s not clear that meta6 is definitely worse than meta2.

On 2015-02-05, the top variant (meta5) outperformed the bottom one (meta1, corresponding to my expectation that the taller variants would be worse than the compactest ones), so the worst was deleted. On 2015-02-08, the new top variant (meta6) outperformed meta4, so I deleted that. On 2015-03-22, it outperformed the none control variant. On 2015-05-25, the difference was not statistically-significant, but I decided to delete meta3 anyway. On 2015-07-02, I deleted meta2 similarly; given the ever-smaller differences between variants, it may be time to kill the experiment.

(This also means that people browsing without Javascript enabled should still continue to see a readable version of the site.)

I define inline in the HTML template each of the 6 variants, as divs with IDs metadata1..metadata6. In default.css, I set them to display: none so the user does not see 6 different metadata blocks taking up 2 screens of space. Then, each A/B variant passed to ABalytics toggles one version back on using display: block. I also include a 7th variant, in which none of the 6 is visible; this is effectively the control condition, roughly matching the status quo of showing the metadata in the sidebar. (“Roughly”, since in the none condition, there won’t be metadata anywhere in the displayed page; but since the previous experiment indicated that removing elements from the sidebar didn’t make any noticeable difference, I decided to simplify the HTML source code by removing the original metadata div entirely, to avoid any collisions or issues with the CSS/HTML I’ve defined.)

As an HTML rather than CSS change, the implementation as an A/B test is more complex.

There are several different ways to format the metadata and levels of density possible, so I created 6 variants of increasing density.

The page metadata is the odd man out, and I’ve noticed that a lot of people seem to not notice the page metadata hiding in the sidebar (eg there will be comments wondering when a page was created, when that’s listed clearly right there in the page’s sidebar). What if I moved the page metadata to underneath the big title? I’d have to change the formatting, since I can’t afford to spend 10+ vertical lines of space the way it must be formatted in the sidebar, but the metadata could fit in 2-5 lines if I combine the logical pairs (so instead of 4 lines for “created: / 2013-05-07 / modified: / 2015-01-09”, just one line “created: 2013-05-07; modified: 2015-01-09”).

Looking at the sidebar some more, it occurred to me that the sidebar was serving 3 different purposes all mixed together:

Junk. The extra sidebar elements may be a tiny bit harmful and the ruler helpful, but it would take much more data than 71k datapoints to show that.

I killed the test in late January. (I had gotten an idea I wanted to test, see next section: if the sidebar is too cluttered with site navigation, donation and metadata, why not move the metadata into the body?)

So, I’d like to try out removing the horizontal rulers as dividers, and hiding the search-engine and donations. Then in another A/B test I can try out different tweaks (maybe re-sort the sections or change the word-breaking).

Looking at my current pages, one of the visual aspects that bother me is the sidebar: it contains links to top-level pages, page-specific metadata, a search interface, and donation widgets (all separated by whitespace and horizontal rulers). It comes off as a little disorganized and messy.

There’s definitely temporal heterogeneity, given the statistical-significance of the time-period dummies, so that is good to know. But the estimated effect for each indentation variant is derisorily small (despite having spent n=159,634), suggesting readers don’t care at all. Since I have no opinion on the matter, I suppose I’ll go with the highest point-estimate, 2em.

A simple analysis of the totals would indicate that 0.1em is the best setting - which is odd since it was the worst-performing and first variant to be deleted, so how could it be the best? The graph of traffic suggests that, like before, the final totals are confounded by time-varying changes in conversion rates plus dropping variants; that is, 0.1em probably only looks good because after it was dropped, a bunch of Hacker News traffic hit and happened to convert at lower rates, making the surviving variants look bad. One might hope that all of that effect would be captured by the Old covariate as HN traffic gets recorded as new visitors, but that would be too much to hope for. So instead, I add a dummy variable for each of the 3 separate time-periods which will absorb some of this heterogeneity and make clearer the effect of the indentation choices.

The conversion data, with new vs returning visitor, segmented by period, and ordered by when a variant was deleted:

On 2014-07-27, since the 95% CIs for the best and worst indent variants no longer overlapped, I deleted the worst variant (0.1em). On 2014-08-23, the 2.0em and 0.0em variants no longer overlapped, and I deleted the latter.

Since we’re back to testing CSS, we can use the old ABalytics approach without having to do JS coding:

In retrospect years later, after learning more about typography and revamping gwern.net CSS a number of times, I think Anonymous was actually talking about text justification: HTML/gwern.net is by default “flush left, ragged right”, with large whitespace gaps left where words of different lengths get moved to the next line but not broken/hyphenated or stretched to fill the line. Some people do not like text justification, describing ragged right as easier to read, but most typographers endorse justification: it was historically the norm for professionally-set print, it still carries connotations of class, and I think the appearance fits in with my overall site esthetic. I eventually enabled text justification on gwern.net in February 2019 (although I was irritated by the discovery that the standard CSS method of doing so does not work in the Chrome browser due to a long-standing failure to implement hyphenation support).

I liked this, but I suppose for lots of small paragraphs, it lends a ragged appearance to the page. So might as well test a few variants of text-indent to see what works best: 0em, 0.1, 0.5, 1.0, 1.5, and 2.0.

Looking at a random page, my best guess is that he’s bothered by the indentation at the start of successive paragraphs: in a sequence of paragraphs, the first paragraph is not indented (because it can’t be visually confused with a preceding paragraph) but the successive paragraphs are indented by 1.5em in order to make reading easier. The CSS is:

I wasn’t sure what he meant, since the text is left-aligned, and I can’t ask for clarification (anonymous means anonymous).

Could you format your pages so that the texts are all aligned at the left? It looks unprofessional when the lines of text break at different areas. Could you make the site like a LaTeX article? The formatting is the only thing preventing you from looking really professional.

I was surprised that the gray variants could perform so wildly differently, from slightly better than the control to horribly worse, considering that they didn’t strike me as looking that different when I was previewing them locally. I also didn’t expect blues to last as long as it did, and thought I would be deleting it as soon as dark. This makes me wonder: are there color themes only subtly different from the ones I tried which might work unpredictably well? Since BLR by default offers only a few themes, I think BLR should try out as many color themes as possible to locate good ones they’ve missed.

An unlikely +0.5% to reading rates isn’t enough for me to want to add a dependency on another JS library, so I will be removing BLR. I’m not surprised by this result: most tests don’t show an improvement, the BLR coloring is pretty unusual for a website, and users wouldn’t have any understanding of what it is or ability to opt out of it. Using BLR by default doesn’t work, but the browser extension might be useful, since there the user expects the coloring & can choose their preferred color scheme.

The results are not impressive: only 2 gray variants out of the 8 variants have a positive estimate, and neither is statistically-significant; the best variant was gray1 (“#222222” & “#FBFBFB”), at an estimated increase from 19.52% to 20.04% conversion rate. More surprisingly, the nesting turns out to not matter at all, and in fact the worst variant was a gray. (The best-fitting multilevel model ignores the variants entirely, although it did not fit better than the regular logistic model incorporating all of the time periods, Old, and variants.)

As usual, a logistic regression on the various BLR themes with new vs returning visitors (Old) as a covariate. Because of the heterogeneity in traffic (and because I bothered breaking out the data by time period this time for the table), I also include each block as a factor. Finally, because I expected the 6 gray variants to perform similarly, I try out a multilevel model nesting the grays together.

The BLR people say that there may be cross-browser differences, so I thought about throwing in browser as a covariate too (an unordered factor of Chrome & Firefox, and maybe I’ll bin everything else as an ‘other’ browser); it seems I may have to use the GA API to extract conversion rates split by variant, visitor status, and browser. This turned out to be enough work that I decided to not bother.

I also received a number of complaints while running the BLR test (principally due to the dark and blues variants, but also apparently triggered by some of the less popular gray variants; the number of complaints dropped off considerably by halfway through):

The conversion data, with new vs returning visitor, segmented by period, and ordered by when a variant was deleted:

Due to caching, the deletions didn’t necessarily drop data collection instantly to zero. Traffic was also heterogeneous: Hacker News traffic is much less likely to spend much time on page than the usual traffic.

On 31 March, with total n having reached 15652 visits, I deleted the worst-performing variant: gray4 , which at 19.21% was substantially underperforming the best-performing variant’s 22.38%, and wasting traffic. On 6 April, two Hacker News submissions having doubled visits to 36533, I deleted the next-worst variant, gray5 (14.66% vs control of 16.25%; p=0.038). On 9 April, the almost as inferior gray6 (15.67% vs 16.26%) was deleted. On 17 April, dark (16.00% vs 16.94%) was deleted. On 30 April, I deleted gray2 (17.56% vs 18.07%). 11 May, blues was gone (18.11% vs 18.53%), and on 31 May, I deleted gray3 (18.04% vs 18.24%).

(Why “bl3”? I don’t know JS, so it took some time; things I learned along the way included always leaving whitespace around a < operator, and that the “none” argument passed into beeline.setOptions causes an error which some browsers ignore (continuing to record A/B data afterward) but most browsers do not; this broke the original test. Then I discovered that BLR by default broke all the MathML/MathJax, causing nasty-looking errors on pages with math expressions; this broke the second test, and I had to get a fixed version.)

The usual implementation using ABalytics doesn’t work because it uses an innerHTML call to substitute the various fragments, and while HTML & CSS get interpreted fine, JavaScript does not; the offered solutions were sufficiently baroque that I wound up implementing a custom subset of ABalytics hardwired for BLR inside the Analytics script:

Since I’m particularly interested in these results, and I think many other people will find the results interesting, I will run this test extra-long: a minimum of 2 months. I’m only interested in the best variant, not estimating each variant exactly (what do I care if the ugly dark is 15% rather than 14%? I just want to know it’s worse than the control) so conceptually I want something like a sequential analysis or adaptive clinical trial or multi-armed bandit where bad variants get dropped over time; unfortunately, I haven’t studied them yet (and MABs would be hard to implement on a static site), so I’ll just ad hoc drop the worst variant every week or two. (Maybe next experiment I’ll do a formal adaptive trial.)
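The ad hoc drop-the-worst scheme is essentially a crude successive-elimination algorithm; a toy sketch in Python (all numbers here - the conversion rates, batch size, and variant count - are invented for illustration, not taken from the actual test):

```python
import random

random.seed(0)   # reproducible illustration

true_rates = [0.14, 0.15, 0.16, 0.20]   # hypothetical per-variant conversion rates
variants = list(range(len(true_rates)))
batch = 5000                            # hits per surviving variant per 'week'

while len(variants) > 1:
    # Simulate a week of binary conversions for each surviving variant...
    observed = {v: sum(random.random() < true_rates[v] for _ in range(batch)) / batch
                for v in variants}
    # ...then cull the worst performer, as in the weekly deletions described above.
    variants.remove(min(observed, key=observed.get))

print(variants)   # with gaps this wide, the 0.20 variant should survive
```

A proper adaptive trial or multi-armed bandit would set the batch sizes and elimination thresholds to control error rates, rather than eyeballing them.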

I asked if there were a JavaScript version I could use in an A/B test; the initial JS implementation was not fast enough, but by 2014-03-10 it was good enough. BLR has several themes, including “gray”; I decided to test the variants no-BLR, “dark”, “blues”, & expanded the gray selection to include grays #222222 / #333333 / #444444 / #555555 / #666666 / #777777 ( gray1 - gray6 ; they vary in how blatant the highlighting is) for a total of 9 equally-randomized variants.

BeeLine Reader (BLR) is an interesting new browser plugin which launched around October 2013; I learned of it from the Hacker News discussion. The idea is that part of the difficulty in reading text is that when one finishes a line and saccades left to the continuation of the next line, the uncertainty of where it is adds a bit of stress, so one can make reading easier by adding some sort of guide to the next line; in this case, each matching pair of half-lines is colored differently, so if you are on a red half-line, when you saccade left, you look for a line also colored red, then you switch to blue in the middle of that line, and so on. A colorful variant on boustrophedon writing. I found the default BLR coloring garish & distracting, but I couldn’t see any reason that a subtle gray variant would not help: the idea seems plausible. And very long text pages (like mine) are where BLR should shine most.

BLR is a JS library for highlighting textual paragraphs with pairs of half-lines to make reading easier. I run a randomized experiment on several differently-colored versions to see if default site-wide usage of BLR will improve time-on-page for gwern.net readers, indicating easier reading of the long-form textual content. Most versions perform worse than the control of no-highlighting; the best version performs slightly better but the improvement is not statistically-significant.

So, as I expected, putting the ToC on the right performed worse; the larger ToC widths don’t seem to be better but it’s unclear what’s going on there. A visual inspection of the Width data ( library(ggplot2); qplot(Width, Rate, color=Alignment, data=rates) ) suggests that 20% width was the best variant, so might as well go with that.

I decided to end this test early on 2014-03-10 because I wanted to move onto the BeeLine Reader test, so it’s underpowered & the results aren’t as clear as usual:

After the page title, the next thing a reader will generally see on my pages is the table of contents. It’s been tweaked over the years (particularly by suggestions from Hacker News) but still has some untested aspects, particularly the first two parts of div#TOC:

Uppercase and ‘none’ beat ‘capitalize’ in both page titles & section headers (interaction does not survive). So I toss in a CSS declaration to uppercase section headers as well as the status quo of the title.

So of the 11 main effects, 9 two-ways, & 2 three-ways, the following were confirmed in the reduced models: 7 mains, 3 two-ways (22%), & 0 three-ways (0%). And of the 2 interactions, only the black/white interaction was important (and even there, if I had instead regressed cbind(Successes, Failures) ~ Black + White , black & white would still have positive coefficients; they just would not be statistically-significant, and so I would likely have made the same choice as I did with the interaction data available).

At this point it seems worth asking whether running multifactorials has been worthwhile. The analysis is a bit more difficult, and the more factors there are, the harder to interpret. I’m also not too keen on encoding the combinatorial explosion into a big JS array for ABalytics. In my tests so far, have there been many interactions? A quick tally of the glm() / step() results:

The two size tweaks turn out to be unambiguously negative compared to the status quo (with an almost negligible interaction term probably reflecting reader preference for consistency in sizes of letters and numbers - as one gets smaller, the other does better if it’s smaller too). The Table of Contents backgrounds also survive (thanks to the new vs old visitor type covariate adding power): there were 3 background types, e / f / r [gb], and f / r turn out to have negative coefficients, implying that e is best - but e is also the status quo, so no change is recommended.

Finally, because I am tired of just 2 factors, I throw in a third factor to make it really multifactorial. I picked the number-sizing from the existing list of suggestions.

And while the blockquote background coloring is a good idea, per the previous test, what about the other place on gwern.net where I use a light background shading: the Table of Contents? Perhaps it would be better with the same background shading as the blockquotes, or no shading?

It was pointed out to me that in my previous font-size test, the clear linear trend may have implied that larger fonts than 100% were bad, but that I was making an unjustified leap in implicitly assuming that 100% was best: if bigger is worse, then mightn’t the optimal font size be something smaller than 100%, like 95%?

The top-performing variant is the status quo (no Readability-style quote, zebra-striped blocks). So we keep it.

This is another 2x2 design since we can use the Readability quotes or not, and the zebra-striping or not.

Another bit of formatting I’ve been meaning to test for a while is seeing how well Readability ’s pull-quotes next to blockquotes perform, and to check whether my zebra-striping of nested blockquotes is helpful or harmful.

100% and squares, however, were the original CSS settings, so this means I will make no changes to the existing CSS based on these results.

Immediately the negative effect of increasing the font size jumps out, but it’s easier to understand the list icon estimates: square performs the best in the 100% (the original default) font size condition but it performs poorly in the other font sizes, which is why it seems to do only medium-well compared to the others. Given how much better 100% performs than the others, I’m inclined to ignore their results and keep the squares.

The results are a little confusing in factorial form: it seems pretty clear that Size is bad and that 100% performs best, but what’s going on with the list icon type? Do we have too little data or is it interacting with the font size somehow? I find it a lot clearer when plotted:

I halted the A/B test on 27 October because I was noticing clear damage as compared to my default CSS. The results were:

I make heavy use of unordered lists in articles; for no particular reason, the symbol denoting the start of each entry in a list is the little black square, rather than the more common little circle. I’ve come to find the little squares a little chunky and ugly, so I want to test that. And I just realized that I never tested font size (just type of font), even though increasing the font size is one of the most common CSS tweaks around. I don’t have any reason to expect an interaction between these two bits of design, unlike the previous A/B test, but I like the idea of getting more out of my data, so I am doing another factorial design, this time not 2x2 but 3x5. The options:

Reader Lucas asks in the comments whether, since we would expect new visitors to the website to be less likely to read a page in full than a returning visitor (who knows what they’re in for & probably wants more), including such a variable (which is something Google Analytics does track) might improve the analysis. It’s easy to ask GA for “New vs Returning Visitor”, so I did:

So, this suggests a change to the CSS: we switch the default background color from #FCFCFC to white , while leaving the default color its current black .

I am a little curious about this one, so I scheduled a full month and a half: 10 September - 20 October. Due to far more traffic than anticipated from submissions to Hacker News, I cut it short by 10 days to avoid wasting traffic on a test which was done (a total n of 231,599 was more than enough). The results:

The hyperlinks, on the other hand, make use of an off-black color: #303C3C , partially motivated by Ian Storm Taylor’s advice to “Never Use Black”. I wonder - should all the text be off-black too? And which combination is best? White/black? Off-white/black? Off-white/off-black? White/off-black? Let’s try all 4 combinations here.

But seriously, it is nice to see that ABalytics does not seem to be broken & favoring either option, or yielding results driven by placement in the array of options.

Ah, but can we reject the null hypothesis that the two variants are equal? In a rare victory for null-hypothesis-significance-testing, we do not commit a Type I error:

While amusingly the first pair of 1k hits resulted in a dramatic 18% vs 14% result, this quickly disappeared into a much more normal-looking set of data:

Since any difference due to the testing framework should be noticeable, this will be a shorter experiment, from 15 August to 29 August.

It’s easy to switch from the lineheight test to the null test; just rename the variables for Google Analytics, and empty the payloads:

One of the suggestions in the A/B testing papers was to run a “null” A/B test (or “A/A test”) where the payload is empty but the A/B testing framework is still measuring conversions etc. By definition, the null hypothesis of “no difference” should be true and at an alpha of 0.05, only 5% of the time would the null tests yield a p<0.05 (which is very different from the usual situation). The interest here is that it’s possible that something is going wrong in one’s A/B setup or in general, and so if one gets a “statistically-significant” result, it may be worthwhile investigating this anomaly.
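A quick simulation shows what the null distribution looks like (a toy A/A setup with a made-up 20% conversion rate and a plain two-proportion z-test, rather than whatever Google’s framework computes internally):

```python
import random
from statistics import NormalDist

random.seed(42)   # reproducible illustration

def aa_test_pvalue(n=1000, p=0.20):
    """One null A/B ('A/A') test: both arms share the same true rate p."""
    a = sum(random.random() < p for _ in range(n))
    b = sum(random.random() < p for _ in range(n))
    pooled = (a + b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = abs(a - b) / n / se
    return 2 * (1 - NormalDist().cdf(z))   # two-sided p-value

trials = 1000
false_positives = sum(aa_test_pvalue() < 0.05 for _ in range(trials)) / trials
print(false_positives)   # should hover around 0.05
```

A false-positive rate far outside that 5% band would suggest the measurement pipeline itself is broken - which is exactly what the null test is for.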

I changed the 150% to 130% for the heck of it, even though the difference between 130 and 150 was trivially small:

Just from looking at the miserably small difference between the most extreme percentages (15.26−14.92=0.34%), we can predict that nothing here was statistically-significant:

Most web design guides seem to suggest a safe default of 120%, rather than my current 150%. If we tested every step of 10% plus one on each side, that’d give us 110, 120, 130, 140, 150, 160, or 6 options, which, combined with the expected small effect, would require an unreasonable sample size (and I have nothing in the pipeline I expect might catch fire like the Google analysis and deliver an excess >50k visits). So I’ll try just 120/130/140/150, and schedule a similar block of time as fonts (ending the experiment on 2013-08-16, with presumably >70k datapoints).

I have seen complaints that lines on gwern.net are “too closely spaced” or “run together” or “cramped”, referring to the line height (the CSS property line-height ). I set the CSS to line-height: 150%; to deal with this objection, but this was a simple hack based on rough eyeballing of it, and it was done before I changed the max-width and font-family settings after the previous testing. So it’s worth testing some variants.

With essentially no meaningful differences between conversion rates, this suggests that however fonts matter, they don’t matter for reading duration. So I feel free to pick the font that appeals to me visually, which is Baskerville.

Since there’s only small differences between individual fonts, I wondered if there might be a difference between the two sans-serifs and the two serifs. If we lump the 4 fonts into those 2 categories and look at the small difference in mean conversion rate:

Picking the most extreme difference, between Trebuchet and Georgia, the difference is close to the usual definition of statistical-significance:

The sample size for each font is 20k higher than I projected due to the enormous popularity of an analysis of the lifetimes of Google services I finished during the test. Regardless, it’s clear that the results - with double the total sample size of the NYT experiment, focused on fewer fonts - are disappointing and there seems to be very little difference between fonts.

I had not used Baskerville but Georgia since Georgia seemed similar and was convenient, but we’ll fix that now. Besides Baskerville & Georgia, we’ll omit Comic Sans (of course), but we can try Trebuchet for a total of 4 fonts (falling back to Georgia):

15000 visitors in each group seems reasonable; at ~16k visitors a week, that suggests a few weeks of testing. Of course I’m testing 4 fonts (see below), but that still fits in the ~2 months I’ve allotted for this test.

Would this font work its magic on gwern.net too? Let’s see. The sample size is quite manageable, as over a month I will easily have 60k visits, and they tested 6 fonts, expanding their necessary sample. What sample size do I actually need? Their professor estimates the effect size of Baskerville at 1.5%; I would like my A/B test to have very high statistical power (0.9) and reach more stringent statistical-significance (p<0.01) so I can go around and in good conscience tell people to use Baskerville. I already know the average “conversion rate” is ~13%, so I get this power calculation:
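The required sample size can be reproduced with the standard normal-approximation formula for a two-proportion test (a sketch roughly equivalent to R’s power.prop.test; the 13%→14.5% rates, α = 0.01, and power = 0.9 are from the text):

```python
from math import sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.01, power=0.90):
    """Normal-approximation sample size per group, two-sided two-proportion test."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    zb = z.inv_cdf(power)           # quantile for the desired power
    pbar = (p1 + p2) / 2
    num = (za * sqrt(2 * pbar * (1 - pbar)) +
           zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

# baseline 13% conversion, hoped-for Baskerville effect of +1.5%
print(round(n_per_group(0.13, 0.145)))  # ≈15,683 per group
```

The ~15,700/group result is consistent with the ~15,000 visitors per group figure above; larger hypothesized effects shrink it rapidly.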

The New York Times ran an informal online experiment with a large number of readers (n=60750) and found that the Baskerville font led to more readers agreeing with a short text passage - this seems plausible enough given their very large sample size and Wikipedia’s note that “The refined feeling of the typeface makes it an excellent choice to convey dignity and tradition.”

But I want to move on to the next test, and by the same logic it is highly unlikely that the difference between them is large or much in 1300px’s favor (the kind of mistake I care about: switching between 2 equivalent choices doesn’t matter, missing out on an improvement does matter - minimizing β, not minimizing α).

1100px is close to my original A/B test indicating 1000px was the leading candidate, so that gives me additional confidence, as does the observation that 1300px and 1200px are the other leading candidates. (Curiously, the site conversion average before was 13.88%; perhaps my underlying traffic changed slightly around the time of the test? This would demonstrate why alternatives need to be tested simultaneously.) A quick and dirty R test of 1100px vs 1300px ( prop.test(c(2632,2581),c(18164,18071)) ) indicates the difference isn’t statistically-significant (at p=0.58), and we might want more data; worse, there is no clear linear relation between conversion and width (the plot is erratic, and a linear fit a dismal p=0.89):
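For reference, the same comparison can be reproduced without R; a plain two-proportion z-test (no continuity correction, so the p-value differs slightly from prop.test’s 0.58) on the same counts:

```python
import math

def two_prop_ztest(c1, n1, c2, n2):
    """Two-sided z-test for equal proportions (pooled, no continuity correction)."""
    p = (c1 + c2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (c1 / n1 - c2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value

# conversions and totals for 1100px vs 1300px, as in prop.test(c(2632,2581),c(18164,18071))
print(two_prop_ztest(2632, 18164, 2581, 18071))  # ≈0.57
```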

In March 2013, I decided to give A/B testing another whack. Google Analytics Experiments did not seem to have improved and the commercial services continued to charge unacceptable prices, so I gave the Google Analytics custom-variable integration approach another try, using ABalytics. The usual puzzling, debugging, and frustration of combining so many disparate technologies (HTML and CSS and JS and Google Analytics) aside, it seemed to work on my test page. The current downside seems to be that the ABalytics approach may be fragile, and the UI in GA is awful (you have to do the statistics yourself).

A/B testing variants one at a time is fine as far as it goes, but it has several drawbacks that have become apparent:

- fixed trials, compared to sequential or adaptive trial approaches, waste data/page-views: looking back, it’s clear that many of these trials didn’t need to run so long
- they are costly to set up, both because of the details of a static site doing A/B tests and because each change requires me to define it, code it up, collect, and analyze the results all by hand
- they are not amenable to testing complicated models or relationships, since factorial designs suffer combinatorial explosion
- they will test only the interventions the experimenter thinks of, which may be a tiny handful of possibilities out of a wide space of possible interventions (this is related to the cost: I won’t test anything that isn’t interesting, controversial, or potentially valuable, because it’s far too much of a hassle to implement/collect/analyze)

The topic of sequential trials leads naturally to multi-armed bandits (MABs), which can be seen as a generalization of regular experimenting which naturally reallocates samples across branches as the posterior probabilities change, in a way which minimizes how many page-views go to bad variants. It’s hard to see how to implement MABs in a static site, so this would probably motivate a shift to a dynamic site, at least to the extent that the server will tweak the served static content based on the current MAB.

MABs work for the current use case of specifying a small number of variants (eg <20) and finding the best one. Depending on implementation detail, they could also make it easy to run factorial trials checking for interactions among those variants, resolving another objection.

They’re still expensive to set up since one still has to come up with concrete variants to pit against each other, but if it’s now a dynamic server, it can at least handle the analysis automatically.
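As a concrete sketch of the MAB idea (Thompson sampling on Bernoulli rewards; the 13%/15% conversion rates are made up for illustration, not measured):

```python
import random

def thompson_step(successes, failures):
    """Pick the arm with the highest draw from its Beta posterior (Bernoulli rewards)."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

random.seed(1)
true_rates = [0.13, 0.15]        # hypothetical conversion rates of two variants
s, f = [0, 0], [0, 0]
for _ in range(20000):
    arm = thompson_step(s, f)    # sample, then serve the chosen variant
    if random.random() < true_rates[arm]:
        s[arm] += 1
    else:
        f[arm] += 1
print("pulls per arm:", [s[i] + f[i] for i in range(2)])
```

Because each arm is chosen with probability proportional to its current chance of being best, traffic drains away from the worse variant automatically instead of staying fixed at 50/50 for the whole experiment.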

MABs themselves are a special case of reinforcement learning (RL), which is a family of approaches to exploring complicated systems to maximize a reward at (hopefully) minimum data cost. Optimizing a website fits naturally into a RL mold: all the possible CSS and HTML variants are a very complicated system, which we are trying to explore as cheaply as possible while maximizing the reward of visitors spending more time reading each page.

To solve the expressivity problem, one could try to equip the RLer with a lot of power over the CSS: parse it into an AST, so instead of specifying by hand ‘100%’ vs ‘105%’ in a CSS declaration like div#sidebar-news a { font-size: 105%; } , the RLer sees a node in the AST like (font-size [Real ~ dnorm(100,20)]) and tries out numbers around 100% to see what yields higher conversion rates. Of course, this yields an enormous number of possibilities and my website traffic is not equally enormous. Informative priors on each node would help if one was using a Bayesian MAB to do the optimization, but a Bayesian model might be too weak to detect many effects. (You can’t easily put in interactions between every node of the AST, after all.)
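To make that concrete, here is a minimal sketch of sampling one numeric AST node (the selector and the dnorm(100,20) prior come from the example above; the dictionary representation of a node is hypothetical):

```python
import random

# hypothetical AST node: (font-size [Real ~ dnorm(100,20)])
node = {"selector": "div#sidebar-news a", "property": "font-size",
        "prior_mean": 100, "prior_sd": 20, "unit": "%"}

def sample_declaration(node, rng):
    """Draw a candidate value from the node's prior and render it as a CSS declaration."""
    value = rng.gauss(node["prior_mean"], node["prior_sd"])
    return "%s { %s: %.0f%s; }" % (node["selector"], node["property"],
                                   value, node["unit"])

rng = random.Random(0)
print(sample_declaration(node, rng))
```

Each page-view would then be served a declaration drawn this way, with the conversion feedback used to update the node’s posterior.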

In a challenging problem like this, deep neural networks come to mind, yielding a deep reinforcement learner (Q-learning) - such a system made a splash in 2013-2015 in learning to play dozens of Atari games (DQN). The deep network handles interpretation of the input, and the RLer handles policy and optimization.

So the loop would go something like this:

1. a web browser requests a page
2. the server asks the RL for CSS to include
3. the RL generates a best guess at the optimal CSS, taking the CSS AST skeleton and returning the defaults, with some fields/parameters randomized for exploration purposes (possibly selected by Bayesian optimization to maximize information gain)
4. the CSS is transcluded into the HTML page, which is sent to the web browser
5. JS analytics in the HTML page report back how long the user spent on that page, plus details like their country, web browser, etc., which predict time on page (explaining variance, making it easier to see effects)
6. this time-on-page constitutes the reward, which is fed back into the RL to update it
7. return to waiting for a request

Learning can be sped up by data augmentation or local training: the developer can browse pages locally and based on whether they look horrible or not, insert pseudo-data. (If one variant looks bad, it can be immediately heavily penalized by adding, say, 100 page-views of that variant with low rewards.) Once previews have stabilized on not-too-terrible-looking, it can be run on live users; the developer’s preferences may introduce some bias compared to the general Internet population, but the developer won’t be too different and this will kill off many of the worst variants. As well, historical information can be inserted as pseudo-data: if the current CSS file has 17% conversion over 1 million page views, one can simulate 1m page views to that CSS variant’s considerable credit.
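The pseudo-data trick can be read as pseudo-counts on a Beta-Bernoulli model (the 17%-over-1m-views history and the 100-view penalty for an ugly variant are the figures from the text):

```python
# historical data for the current CSS: 17% conversion over 1,000,000 page-views
prior_success = int(0.17 * 1_000_000)
prior_failure = 1_000_000 - prior_success

# a variant the developer judged ugly: penalize it with 100 zero-reward pseudo-views
ugly_success, ugly_failure = 0, 100

def posterior_mean(successes, failures):
    """Mean of Beta(successes+1, failures+1): a uniform prior plus pseudo-counts."""
    return (successes + 1) / (successes + failures + 2)

print(posterior_mean(prior_success, prior_failure))  # ≈0.17
print(posterior_mean(ugly_success, ugly_failure))    # ≈0.01
```

The historical pseudo-counts pin the incumbent design near its known 17% rate, while the penalized variant starts from a posterior low enough that a bandit will rarely serve it to live users.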

Parsing CSS into an AST seems difficult, and it is still limited in that it will only ever tweak existing CSS fields.

How to offer more power and expressivity to the RLer without giving it so much freedom that it will hang itself with gibberish CSS before ever finding working CSS, never mind improvements?

A powerful AI tool which could generate CSS on its own is the recurrent neural network (RNN): a NN which generates some output which gets fed back in until a long sequence has been emitted. (They usually also have some special support for storing ‘memories’ over multiple recursive applications, using LSTM units.) RNNs are famous for mimicking text and other sequential material; in one demo, Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks”, he trained a RNN on a Wikipedia dump in XML format and a LaTeX math book (both replicating the syntax quite well) and, more relevantly, 474MB of C source code & headers, where the RNN does a credible job of emitting pseudo-C which looks convincing and is even mostly syntactically-correct in balancing parentheses & brackets, which more familiar Markov-chain approaches would have trouble managing. (Of course, the pseudo-C doesn’t do anything, but that RNN was never asked to make it do something, either.) In another RNN paper, the authors trained it on Python source code and it was able to ‘execute’ very simple Python programs and predict the output; this is perhaps not too surprising given the earlier “Neural Turing Machines” and the solving of the Traveling Salesman Problem (“Pointer Networks”). So RNNs are powerful and have already shown promise in learning how to write simple programs.

This suggests the use of an RNN inside an RLer for generating CSS files. Train the RNN on a few hundred megabytes of CSS files (there are millions online, no shortage there), which teaches the RNN about the full range of possible CSS expressions; then plug it into step 3 of the above website optimization algorithm and begin training it to emit useful CSS. For additional learning, the output can be judged using an oracle (a CSS validator like the W3C CSS Validation Service/w3c-markup-validator package, or possibly CSSTidy), with the error or reward based on how many validation errors there are. The pretraining provides extremely strong priors about what CSS should look like, so mostly syntactically-valid CSS will be emitted without the constraint of operating on a rigid AST; the RL then optimizes particular pieces; and giving the original CSS a high reward prevents the RNN from straying too far from a known-good design.

Can we go further? Perhaps. In the Atari RL paper, the NN was specifically a convolutional neural network (CNN), used almost universally in image classification tasks; the CNN was in charge of understanding the pixel output so it could be manipulated by the RL. The RNN would have considerable understanding of CSS on a textual level, but it wouldn’t easily be able to understand how one CSS declaration changes the appearance of the webpage. A CNN, on the other hand, can look at a page+CSS as rendered by a web browser, and ‘see’ what it looks like; possibly it could learn that ‘messy’ layouts are bad, that fonts shouldn’t be made ‘too big’, that blocks shouldn’t overlap, etc. The RNN generates CSS, the CSS is rendered in a web browser, the rendering is looked at by a CNN… and then what? I’m not sure how to make use of a generative approach here. Something to think about.

Recurrent Q-learning:

Lin & Mitchell 1992 “Memory approaches to reinforcement learning in non-Markovian domains”

Meeden, McGraw & Blank 1993 “Emergent control and planning in an autonomous vehicle”

Schmidhuber 1991b “Reinforcement learning in Markovian and non-Markovian environments”

http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/

Training a neural net to generate CSS

It would be nifty if I could set up a NN to generate and optimize the CSS on gwern.net so I don’t have to learn CSS & devise tests myself; as a first step towards this, I wanted to see how well a recurrent neural network (RNN) could generate CSS after being trained on CSS. (If it can’t do a good job mimicking the ‘average’ syntax/appearance of CSS based on a large CSS corpus, then it’s unlikely it can learn more useful things like generating usable CSS given a particular HTML file, or the ultimate goal: learning to generate optimal CSS given HTML files and user reactions.)

char-rnn

Fortunately, Karpathy has already written an easy-to-use tool, char-rnn, which has already been shown to work well on XML/LaTeX/C. (I was particularly amused by the LaTeX/math textbook, which yielded a compiling and even good-looking document after Karpathy fixed some errors in it; if the RNN had been trained against compile errors/warnings as well, perhaps it would not have needed any fixing at all…?) char-rnn relies on the Torch NN framework & NVIDIA’s CUDA GPU framework (Ubuntu installation guide/download). Torch is fairly easy to install (cheat sheet):

    cd ~/src/
    curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
    git clone https://github.com/torch/distro.git ./torch --recursive
    cd ./torch ; ./install.sh
    export PATH=$HOME/src/torch/install/bin:$PATH
    ## fire up the REPL to check:
    th

Then char-rnn is likewise easy to get running and try out a simple example:

    luarocks install nngraph
    luarocks install optim
    # luarocks install cutorch && luarocks install cunn ## 'cutorch' & 'cunn' need working CUDA
    git clone 'https://github.com/karpathy/char-rnn.git'
    cd ./char-rnn/
    th train.lua -data_dir data/tinyshakespeare/ -gpuid 0 -rnn_size 512 -num_layers 2 -dropout 0.5
    # package cunn not found!
    # package cutorch not found!
    # If cutorch and cunn are installed, your CUDA toolkit may be improperly configured.
    # Check your CUDA toolkit installation, rebuild cutorch and cunn, and try again.
    # Falling back on CPU mode
    # loading data files...
    # cutting off end of data so that the batches/sequences divide evenly
    # reshaping tensor...
    # data load done. Number of data batches in train: 423, val: 23, test: 0
    # vocab size: 65
    # creating an lstm with 2 layers
    # number of parameters in the model: 3320385
    # cloning rnn
    # cloning criterion
    # 1/21150 (epoch 0.002), train_loss = 4.19087871, grad/param norm = 2.1744e-01, time/batch = 4.98s
    # 2/21150 (epoch 0.005), train_loss = 4.99026574, grad/param norm = 1.8453e+00, time/batch = 3.13s
    # 3/21150 (epoch 0.007), train_loss = 4.29807770, grad/param norm = 5.6664e-01, time/batch = 4.30s
    # 4/21150 (epoch 0.009), train_loss = 3.78911860, grad/param norm = 3.1319e-01, time/batch = 3.87s
    # ...

Unfortunately, even on my i7 CPU, training is quite slow: ~3s a batch on the Tiny Shakespeare example. The important parameter here is train_loss; after some experimenting, I found that >3 means the output is total garbage, 1-2 is lousy, <1 is good, and <0.8 is very good. With Tiny Shakespeare, the loss drops quickly at first, getting <4 within seconds and into the 2s within 20 minutes, but then the 1s take a long time to surpass, and <1 even longer (hours of waiting).

GPU vs CPU

This is a toy dataset and suggests that for a real dataset I’d be waiting weeks or months. GPU acceleration is critical. I spent several days trying to get Nvidia’s CUDA to work, even signing up as a developer & using the unreleased version 7.5 preview of CUDA, but it seems that when they say Ubuntu 14.04 and not 15.04 (the latter is what I have installed), they are quite serious: everything I tried yielded bloodcurdling ATA hard drive errors (!) upon boot followed by a hard freeze the instant X began to run.
This made me unhappy since my old laptop began dying in late July 2015 and I had purchased my Acer Aspire V17 Nitro Black Edition VN7-791G-792A laptop with the express goal of using its NVIDIA GeForce GTX 960M for deep learning. But at the moment I am out of ideas for how to get CUDA working aside from either reinstalling to downgrade to Ubuntu 14.04 or simply waiting for version 8 of CUDA, which will hopefully support the latest Ubuntu. (Debian is not an option because on Debian Stretch, I could not even get the GPU driver to work, much less CUDA.) Frustrated, I finally gave up and went the easy way: Torch provides an Amazon OS image preconfigured with Torch, CUDA, and other relevant libraries for deep learning.

EC2

The Torch AMI can be immediately launched if you have an AWS account. (I assume you have signed up, have a valid credit card, IP permission accesses set to allow you to connect to your VM at all, and a SSH public key set up so you can log in.) The two GPU instance types seem to have the same number and kind of GPUs (1 Nvidia) and differ mostly in RAM & CPUs, neither of which are the bottleneck here, so I picked the smaller/cheaper “g2.2xlarge” type. (“Cheaper” here is relative; “g2.2xlarge” still costs $0.65/hr and when I looked at spot pricing that day, ~$0.21.) Once started, you can SSH in using your registered public key like any other EC2 instance. The default username for this image is “ubuntu”, so:

    ssh -i /home/gwern/.ssh/EST.pem ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com

Once in, we set up the $PATH to find the Torch installation like before (I’m not sure why Torch’s image doesn’t already have this done) and grab a copy of char-rnn to run Tiny Shakespeare:

    export PATH=$HOME/torch/install/bin:$PATH
    git clone 'https://github.com/karpathy/char-rnn'
    # etc

Per-batch, this yields a 20x speedup on Tiny Shakespeare compared to my laptop’s CPU, running each batch in ~0.2s. Now we can begin working on what we care about.
CSS

First, to generate a decent-sized CSS corpus: between all the HTML documentation installed by Ubuntu and my own WWW crawls, I have something like 1GB of CSS hanging around my drive. Let’s grab 20MB of it (enough to not take forever to train on, but not so little as to be trivial):

    cd ~/src/char-rnn/
    mkdir ./data/css/
    find / -type f -name "*.css" -exec cat {} \; | head --bytes=20MB >> ./data/css/input.txt
    ## https://www.dropbox.com/s/mvqo8vg5gr9wp21/rnn-css-20mb.txt.xz
    wc --chars ./data/css/input.txt
    # 19,999,924 ./data/input.txt
    scp -i ~/.ssh/EST.pem -C data/css/input.txt ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/

With 19.999M characters, our RNN can afford only <20M parameters; how big can I go with -rnn_size and -num_layers? (Which, as they sound like, specify the size of each layer and how many layers.) The full set of char-rnn training options:

    -data_dir                   data directory. Should contain the file input.txt with input data [data/tinyshakespeare]
    -rnn_size                   size of LSTM internal state [128]
    -num_layers                 number of layers in the LSTM [2]
    -model                      LSTM, GRU or RNN [LSTM]
    -learning_rate              learning rate [0.002]
    -learning_rate_decay        learning rate decay [0.97]
    -learning_rate_decay_after  in number of epochs, when to start decaying the learning rate [10]
    -decay_rate                 decay rate for RMSprop [0.95]
    -dropout                    dropout for regularization, used after each RNN hidden layer. 0 = no dropout [0]
    -seq_length                 number of timesteps to unroll for [50]
    -batch_size                 number of sequences to train on in parallel [50]
    -max_epochs                 number of full passes through the training data [50]
    -grad_clip                  clip gradients at this value [5]
    -train_frac                 fraction of data that goes into train set [0.95]
    -val_frac                   fraction of data that goes into validation set [0.05]
    -init_from                  initialize network parameters from checkpoint at this path []
    -seed                       torch manual random number generator seed [123]
    -print_every                how many steps/minibatches between printing out the loss [1]
    -eval_val_every             every how many iterations should we evaluate on validation data? [1000]
    -checkpoint_dir             output directory where checkpoints get written [cv]
    -savefile                   filename to autosave the checkpoint to. Will be inside checkpoint_dir/ [lstm]
    -gpuid                      which GPU to use. -1 = use CPU [0]
    -opencl                     use OpenCL (instead of CUDA) [0]

Large RNN

Some playing around suggests that the upper limit is 950 neurons and 3 layers, yielding a total of 18,652,422 parameters. (I originally went with 4 layers, but with that many layers, RNNs seem to train very slowly.) Some other settings, to give an idea of how parameter count increases:

512/4: 8,012,032

950/3: 18,652,422

1000/3: 20,634,122

1024/3: 21,620,858

1024/4: 30,703,872

1024/5: 39,100,672

1024/6: 47,497,472

1800/4: 93,081,856

2048/4: 120,127,744

2048/5: 153,698,560

2048/6: 187,269,376

If we really wanted to stress the EC2 image’s hardware, we could go as large as this:

    th train.lua -data_dir data/css/ -rnn_size 1306 -num_layers 4 -dropout 0.5 -eval_val_every 1

This turns out to not be a good idea since it will take forever to train - eg. after ~70m of training, it was still at a train-loss of 3.7! I suspect some of the hyperparameters may be important - the level of dropout doesn’t seem to matter much, but more than 3 layers seems to be unnecessary and slow if there are a lot of neurons to store state (perhaps because RNNs are said to ‘unroll’ computations over each character/time-step instead of being forced to do all their computation in a single deep network with >4 layers?) - but with the EC2 clock ticking and my own impatience, there’s no time to try a few dozen random sets of hyperparameters to see which achieves the best validation scores. Undeterred, I decided to upload all the CSS (using the sort-key trick to reduce the archive size):

    find / -type f -name "*.css" | rev | sort | rev | tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout > ~/src/char-rnn/data/css/all.tar.xz
    cd ~/src/char-rnn/ && scp -C data/css/all.tar.xz ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/
    ## non-ASCII input seems to cause problems, so delete anything not ASCII:
    ## https://disqus.com/home/discussion/karpathyblog/the_unreasonable_effectiveness_of_recurrent_neural_networks_66/#comment-2042588381
    ## https://github.com/karpathy/char-rnn/issues/51
    tar xfJ data/css/all.tar.xz --to-stdout | iconv -c -tascii > data/css/input.txt
    wc --chars data/css/input.txt
    # 1,126,949,128

Unsurprisingly, this did not solve the problem, and with 1GB of data, even 1 pass over the data (1 epoch) would likely take weeks.
Additional problems included -val_frac’s default 0.05 and -eval_val_every’s default 1000: 0.05 of 1GB is 50MB, which means every time char-rnn checked on the validation set, it took ages; and since it only wrote a checkpoint out every 1000 iterations, hours would pass in between checkpoints. 1MB or 0.001 is a more feasible validation data size; and checking every 100 iterations strikes a reasonable balance between being able to run the latest & greatest checkpoint and spending as much GPU time on training as possible.

Small RNN

So I backed off to the 20MB sample and a smaller 3-layer RNN, training it overnight, and was startled to see what happened:

    th train.lua -print_every 5 -data_dir data/css/ -savefile css -eval_val_every 10000 -val_frac 0.001 -rnn_size 1700 -num_layers 3 -dropout 0.8
    # ...
    # 20000/21408250 (epoch 0.047), train_loss = 0.84076253, grad/param norm = 3.4693e-03, time/batch = 3.59s
    # loss is exploding, aborting.

Specifically, the loss on the validation set had exploded to 333.2351 (!). When I looked at samples from the check-pointed copy, it performed both well and poorly; th sample.lua cv/lm_css_epoch0.05_333.2351.t7 yielded: …
    .nest-comments .mb_comments p, .pomenefister .views-content .add-post, .pleuse-product-nav .dropdown-pane .date .intro, .product-nav .shortlist.tooltip_pane .bread-slide, .module-category .vote-content td { font-size: 9px; text-align: center; }
    .teotspaces-left h2, .footer .social-icons .mepubititab-left ul { margin: 0; }
    .mlissbox-box .post-content { margin-top: 10px; }
    .secondary .tooltip-arrow .scegee-category h6, .no-list .comment-footer .module-title, .featured-left-column .lightbox-description h0, .popup-tomimate .social-rover .comment-arrow { list-style: none; }
    .tooltip-post .secondary .module-menu-houder .thumbnail { float: left; }
    .module-content .lightbox-list { display: inline-block; float: none; width: 50%; position: relative; background-position: 0 1px; width: 30px; margin-top: 5px; text-align: center; margin-right: 10px; }
    .logo-events .show-luset .box-content li { width: 30px; -webkit-box-shadow: inset 0 2px 4px rgba(0,0,0,0.3); }
    .search-results .section-featured .content-top-section-left { margin-top: -6px; }
    .secondary .module-mones-grey .down-mod-page { margin-bottom: 20px; }
    .comment-top-content .meviat-alignright { width: 350px; }
    .module-contributor h2 .comment-hold-homicate.sptbed_postnames { display: inline-block; padding-top: -24px; }
    .secondary .module-content .logo div { float: left; }
    .medium-block-grid-1 > li:hover { border-bottom: none #5c466d; }
    .row-image-container .medium-topic-browser { clear: right; }
    .rooc-box-sprite .modal-links .list-group li { display: inline-block; float: left; padding-top: 8px; }
    .vead-video-list { display: block; margin-right: 13px; }
    #no-touch .main-tabs-new-content .widget-top-content { color: #58128c; display: block; padding: 8px 0; border-color: transparent; border-bottom: 1px solid #fff; padding: 5px 12px; text-align: center; }

Aside from the Unicode junk at the beginning, the output actually looks tremendously like CSS! The brackets are matched, the selectors look like selectors, and the fields are properly typed (pixels go into pixel fields, colors go into color fields, etc.). If I validate the non-junk CSS part, the validator remarkably yields only 1 error, at line 52 (.module-contributor h2.comment-hold-homicate.sptbed_postnames), where it notes that “Value Error : padding-top -24px negative values are not allowed : -24px”. Considering it didn’t even finish 1 epoch, the mimicking is almost uncanny: it nails the various aspects like RGB color notation (both hex & rgba()), matching brackets, plausible-sounding identifiers (eg. .scegee-category), etc. If I were shown this without any corresponding HTML, I would not easily be able to tell it’s all gibberish.

Chastened by the exploding-error problem and the mostly-wasted ~26 hours of processing (7:30PM - 9:30PM / $15.6), I tried a yet smaller RNN (500/2), run from 5PM-11AM (so the total bill for all instances, including various playing around, restarting, generating samples, downloading to laptop etc.: $25.58).

Data URI problem

One flaw in the RNN I stumbled across but was unable to reproduce was that it seemed to have a problem with data URIs. A data URI is a special kind of URL which is its own content, letting one write small files inline and avoiding the need for a separate file; for example, the following CSS fragment would yield a PNG image without the user’s browser making additional network requests or the developer needing to create a tiny file just for an icon or something:

    url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD///+l2Z/dAAAAM0lEQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDNeNGe4Ug9C9zwz3gVLMDA/A6P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC')

So it’s a standard prefix like data:image/png;base64, followed by an indefinitely long string of ASCII gibberish, which is a textual base-64 encoding of the underlying binary data.
The RNN sometimes starts a data URI and generates the prefix but then gets stuck continually producing hundreds or thousands of characters of ASCII gibberish without ever closing the data URI with a quote & parenthesis and getting back to writing regular CSS. What’s going on there? Since PNG/JPG are compressed image formats, the binary encoding will be near-random and the base-64 encoding likewise near-random. The RNN can easily generate another character once it has started the base-64, but how does it know when to stop? (“I know how to spell banana, I just don’t know when to stop! BA NA NA NA…”) Possibly it has run into the limits of its ‘memory’, and once it has started emitting base-64 and has reached a plausible length of at least a few score characters (few images can be encoded in less), it’s now too far away from the original CSS, and all it can see is base-64; so of course the maximal probability is an additional base-64 character… This might be fixable by giving the RNN more neurons in the hope that with more memory it can break out of the base-64 trap, by training more (perhaps data URIs are too rare for it to have adequately learned them with the few epochs thus far), or by backpropagating error further in time/the sequence by increasing the amount of unrolling (such as increasing -seq_length from 50). I thought improving the sampling strategy with beam search rather than greedy character-by-character generation would help, but it turns out beam search doesn’t fix it and can perform worse, getting trapped in an even deeper local minimum of repeating the character “A” endlessly. Or of course one could delete data URIs and other undesirable features from the corpus, in which case those problems will never come up; still, I would prefer the RNN to handle issues on its own and have as little domain knowledge engineered in as possible. I wonder if the data URI issue might be what killed the large RNN at the end?
(My other hypothesis is that the sort-key trick accidentally led to a multi-megabyte set of repetitions of the same common CSS file, which caused the large RNN to overfit, and then once the training reached a new section of normal CSS, the large RNN began making extremely confident predictions of more repetition, which were wrong and would lead to very large losses, possibly triggering the exploding-error killer.) Progress This RNN progressed steadily over time, although by the end the performance on the held-out validation dataset seem to have been stagnating when I plot the validation tests: performance <- dget ( textConnection ( "structure(list(Epoch = c(0.13, 0.26, 0.4, 0.53, 0.66, 0.79, 0.92, 1.06, 1.19, 1.32, 1.45, 1.58, 1.71, 1.85, 1.98, 2.11, 2.24, 2.37, 2.51, 2.64, 2.77, 2.9, 3.03, 3.17, 3.3, 3.43, 3.56, 3.69, 3.82, 3.96, 4.09, 4.22, 4.35, 4.48, 4.62, 4.75, 4.88, 5.01, 5.14, 5.28, 5.41, 5.54, 5.67, 5.8, 5.94, 6.07, 6.2, 6.33, 6.46, 6.59, 6.73, 6.86, 6.99, 7.12, 7.25, 7.39, 7.52, 7.65, 7.78, 7.91, 8.05, 8.18, 8.31, 8.44, 8.57, 8.7, 8.84, 8.97, 9.1, 9.23, 9.36, 9.5, 9.63, 9.76, 9.89, 10.02, 10.16, 10.29, 10.42, 10.55, 10.68, 10.82, 10.95, 11.08, 11.21, 11.34, 11.47, 11.61, 11.74, 11.87, 12, 12.13, 12.27, 12.4, 12.53, 12.66, 12.79, 12.93, 13.06, 13.19, 13.32, 13.45, 13.58, 13.72, 13.85, 13.98, 14.11, 14.24, 14.38, 14.51, 14.64, 14.77, 14.9, 15.04, 15.17, 15.3, 15.43, 15.56, 15.7, 15.83, 15.96, 16.09, 16.22, 16.35, 16.49, 16.62, 16.75, 16.88, 17.01, 17.15, 17.28, 17.41, 17.54, 17.67, 17.81, 17.94, 18.07, 18.2, 18.33, 18.46, 18.6, 18.73, 18.86, 18.99, 19.12, 19.26, 19.39, 19.52, 19.65, 19.78, 19.92, 20.05, 20.18, 20.31, 20.44, 20.58, 20.71, 20.84, 20.97, 21.1, 21.23, 21.37, 21.5, 21.63, 21.76, 21.89, 22.03, 22.16, 22.29, 22.42, 22.55, 22.69, 22.82, 22.95, 23.08, 23.21, 23.34, 23.48, 23.61, 23.74, 23.87, 24, 24.14, 24.27, 24.4, 24.53, 24.66, 24.8, 24.93, 25.06, 25.19, 25.32, 25.46, 25.59, 25.72), Validation.loss = c(1.4991, 1.339, 1.3006, 1.2896, 1.2843, 1.1884, 
1.1825, 1.0279, 1.1091, 1.1157, 1.181, 1.1525, 1.1382, 1.0993, 0.9931, 1.0369, 1.0429, 1.071, 1.08, 1.1059, 1.0121, 1.0614, 0.9521, 1.0002, 1.0275, 1.0542, 1.0593, 1.0494, 0.9714, 0.9274, 0.9498, 0.9679, 0.9974, 1.0536, 1.0292, 1.028, 0.9872, 0.8833, 0.9679, 0.962, 0.9937, 1.0054, 1.0173, 0.9486, 0.9015, 0.8815, 0.932, 0.9781, 0.992, 1.0052, 0.981, 0.9269, 0.8523, 0.9251, 0.9228, 0.9838, 0.9807, 1.0066, 0.8873, 0.9604, 0.9155, 0.9242, 0.9259, 0.9656, 0.9892, 0.9715, 0.9742, 0.8606, 0.8482, 0.8879, 0.929, 0.9663, 0.9866, 0.9035, 0.9491, 0.8154, 0.8611, 0.9068, 0.9575, 0.9601, 0.9805, 0.9005, 0.8452, 0.8314, 0.8582, 0.892, 0.9186, 0.9551, 0.9508, 0.9074, 0.7957, 0.8634, 0.8884, 0.8953, 0.9163, 0.9307, 0.8527, 0.8522, 0.812, 0.858, 0.897, 0.9328, 0.9398, 0.9504, 0.8664, 0.821, 0.8441, 0.8832, 0.8891, 0.9422, 0.953, 0.8326, 0.871, 0.8024, 0.8369, 0.8541, 0.895, 0.8892, 0.9275, 0.8378, 0.8172, 0.8078, 0.8353, 0.8602, 0.8863, 0.9176, 0.9335, 0.8561, 0.7952, 0.8423, 0.8833, 0.9052, 0.9202, 0.9354, 0.8477, 0.8271, 0.8187, 0.8714, 0.8714, 0.9089, 0.903, 0.9225, 0.8583, 0.7903, 0.8016, 0.8432, 0.877, 0.8825, 0.9323, 0.8243, 0.8233, 0.7981, 0.8249, 0.826, 0.9109, 0.8875, 0.9265, 0.8239, 0.8026, 0.7934, 0.851, 0.8856, 0.9033, 0.9317, 0.8576, 0.8335, 0.7829, 0.8172, 0.8658, 0.8976, 0.8756, 0.9262, 0.8184, 0.792, 0.7826, 0.8244, 0.861, 0.9144, 0.9244, 0.9106, 0.8327, 0.766, 0.7988, 0.8378, 0.8606, 0.8831, 0.9032, 0.8113, 0.8138, 0.7747, 0.8027, 0.8197, 0.8684, 0.874, 0.912)), .Names = c('Epoch', 'Validation.loss'), class = 'data.frame', row.names = c(NA, -195L ))" )) library (ggplot2) qplot (Epoch, Validation.loss, data= performance) + stat_smooth () Loss of the CSS char- RNN during training As the loss diminished to ~0.8-0.9, the sampled CSS output became even more realistic. At one point I was impressed to see that the RNN had learned to switch between “minified” and unminified CSS formatting. 
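Incidentally, since char-rnn encodes both the epoch and the validation loss into each checkpoint's filename (eg. cv/lm_css_epoch6.07_0.8815.t7), a table like the one above can also be regenerated straight from the checkpoint directory; a quick shell sketch (with two example filenames standing in for ls cv/*.t7):

```shell
# Recover (epoch, validation-loss) pairs from char-rnn checkpoint filenames,
# sorted by epoch; substitute `ls cv/lm_css_epoch*.t7` for the printf:
printf '%s\n' cv/lm_css_epoch6.07_0.8815.t7 cv/lm_css_epoch19.26_0.7952.t7 |
    sed -e 's/.*epoch//' -e 's/\.t7$//' -e 's/_/ /' |
    sort -n
# prints:
# 6.07 0.8815
# 19.26 0.7952
```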
For example, above the output is unminified, but the RNN at 0.88 sometimes writes minified (following has been line-broken from a single line): $ th sample.lua cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 0px; }' -length 2000 div #sidebar { margin : 0 px ; } #flashTopgip ul li h3 { clear : both ; padding : 0 ; height : 25 px ; background : url( /images/exporibox.png ) no-repeat 0 0 ; } .col_description { text-align : left !important ; display : block ; height : 44 px ; top : -3 px ; left : 68 % ; width : 150 px ; } .front .content h3{ display : inline-block ; width : 100 % ; position : fixed ; position : absolute ; left : 0 ; } .date-repeat #right { list-style : none ; } .rtl #block-agned-header { padding : 10 px ; line-height : 14 px ; width : auto ; } #block-column-right { background : #63c ; } .block-document_body #content , .activism-content-box .content , .rtl .archive-wide .button.input-rawignad { float : left ; } .rtl .panel-pane .social-view table .lim h1 , .page-news h3 .pane-title { * zoom : 2 !important ; } .rtl .arow-right , #right-fields-img { display : none ; }div .error { background-color : #ededed ; }div .page-term span .recimsregoor_contemt #aconds , .input-admin-widget-fill div .inner .form-submit { display : block ; margin-right : .1 em ; } #edit-activism-field-actionpoint { color : #8c0000 ; background : url( /sites/all/themes/zrw/sprites/hadd.png ) no-repeat 3 px 0 px ; calse: 0 } .login-form p{ margin : 4 px 25 px ; } .rtl .note-ssTitle { margin : 0 0 3 px 0 }ul .secondary .page , #node-region { background : url( /sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png ) no-repeat 12 px 0 #016 !important ; } #network-footer :active {} #rtl #newsletter-recitients-work_latest .center a{ background-position : 5 px -154 px ; } #product-item { margin-bottom : 10 px ; } .node-type-article .home .field-popup-widget-form { padding : 20 px 10 px 10 px 4 px ; text-align : right ; } .rtl .view-filters , .rtl 
#comments-albumang_sprite { float : left ; } .node-type-nodes .field-actionpoints-view-filters { padding : 19 px 28 px 8 px 0 ; } .rtl #multimedia-latest .field-body , .view-content div .field-view-layout {ulline-color: white ; } .view-experts .views-field-title { padding : 4 px ; text-align : center ; } .node-description .views-exposed-form { overflow : visible ; } #content .views-view-grid tr .format { padding-bottom : 10 px ; background : #030000 ; } .view-forword-source .views-exposed-form #edit-submit { margin-right : 0 ; } This initially does not look impressive, but if we run it through a unminifier: div #sidebar { margin : 0 px ; } #flashTopgip ul li h3 { clear : both ; padding : 0 ; height : 25 px ; background : url( /images/exporibox.png ) no-repeat 0 0 ; } .col_description { text-align : left !important ; display : block ; height : 44 px ; top : -3 px ; left : 68 % ; width : 150 px ; } .front .content h3 { display : inline-block ; width : 100 % ; position : fixed ; position : absolute ; left : 0 ; } .date-repeat #right { list-style : none ; } .rtl #block-agned-header { padding : 10 px ; line-height : 14 px ; width : auto ; } #block-column-right { background : #63c ; } .block-document_body #content , .activism-content-box .content , .rtl .archive-wide .button.input-rawignad { float : left ; } .rtl .panel-pane .social-view table .lim h1 , .page-news h3 .pane-title { * zoom : 2 !important ; } .rtl .arow-right , #right-fields-img { display : none ; } div .error { background-color : #ededed ; } div .page-term span .recimsregoor_contemt #aconds , .input-admin-widget-fill div .inner .form-submit { display : block ; margin-right : .1 em ; } #edit-activism-field-actionpoint { color : #8c0000 ; background : url( /sites/all/themes/zrw/sprites/hadd.png ) no-repeat 3 px 0 px ; calse: 0 } .login-form p { margin : 4 px 25 px ; } .rtl .note-ssTitle { margin : 0 0 3 px 0 } ul .secondary .page , #node-region { background : url( 
/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png ) no-repeat 12 px 0 #016 !important ; } #network-footer :active {} #rtl #newsletter-recitients-work_latest .center a { background-position : 5 px -154 px ; } #product-item { margin-bottom : 10 px ; } .node-type-article .home .field-popup-widget-form { padding : 20 px 10 px 10 px 4 px ; text-align : right ; } .rtl .view-filters , .rtl #comments-albumang_sprite { float : left ; } .node-type-nodes .field-actionpoints-view-filters { padding : 19 px 28 px 8 px 0 ; } .rtl #multimedia-latest .field-body , .view-content div .field-view-layout { ulline-color: white ; } .view-experts .views-field-title { padding : 4 px ; text-align : center ; } .node-description .views-exposed-form { overflow : visible ; } #content .views-view-grid tr .format { padding-bottom : 10 px ; background : #030000 ; } .view-forword-source .views-exposed-form #edit-submit { margin-right : 0 ; } Now it’s readable, and we can see the RNN has done an excellent job of still writing CSS while in minified mode; around this level of loss, I noticed the RNN had learned to write valid-looking URLs - fragments like background : url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) look exactly like what a human CSS programmer would write. (Unfortunately, this sample has 4 validation errors: 1 from an imbalanced bracket; 1 parse error on *zoom: 2 !important due to the asterisk, which is an old IE hack, so arguably the RNN isn’t wrong; and 2 properties which don’t exist. In the RNN’s favor, I should also note that lots of CSS in the wild will not validate cleanly either.) At 0.88, I also noticed the RNN was now making a valiant attempt to write comments. Bad comments, but still: /* ubuntu@ip-172-31-30-222:~/char-rnn$ th sample.lua cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 100px; }' -length 2000 -seed 1 using CUDA on GPU 0... creating an lstm...
seeding with div#sidebar { margin: 100px; } -------------------------- */ div #sidebar { margin : 100 px ; } viv .yeah-company :first-child , .news-row0 .colsetIcob img , .content .content-number { background-position : 0 -340 px ; text-decoration : repeat-x ; } #content .rcper { display : none ; display : block ; } #coftelNotif .topUy { background : url( '/assets/css/epwide-datetherator.png' ) ; } #leftCol span .scord img { background : url( /img/text/about_links.png ) no-repeat 0 -1050 px ; } div .subkit_snav_created , ul .up_tains li .active { width : 64 % !important ; } .hdr_outer { text-align : center ; } active , img { top : auto ; margin-right : 20 px ; margin : 0 !important ; text-align : center ; -webkit-box-shadow : #205575 1 px 0 0 rgba( 0 , 0 , 0 , 0.6 ) 1 px 0 px px ; box-shadow : 0 0 5 px rgba( 0 , 0 , 0 , .5 ) ; } #ywip_section p .tab_promo , #search_container #slideshow .page_inner #triabel_left { background : url( drop, sanc-email' } simple{ box-sizing: border-box; } span.naveptivionNav} a.nav, pre, html { */ background-color: #8ccedc; background: #22a82c; float: left; color: #451515; border: 1px solid #701020; color: #0000ab; font-family: Arial, sans-serif; text-align: center; margin-bottom: 50px; line-height: 16px; height: 49px; padding: 15px 0 0 0; font-size: 15px; font-weight: bold; background-color: #cbd2eb; } a.widespacer2, #jomList, #frq { margin: 0 0 0 0; padding: 10px -4px; background-color: #FFCFCF; border: 1px solid #CBD7DD; padding: 0 0 4px 12px; min-height: 178px; } .eventmenu-item, .navtonbar .article ul, .creditOd_Dectls { border-top: 1px #CCC gradsed 1px solid; font-size: 0.75em; } h2, div.horingnav img { font-size: 5px; } body { margin: 0 0 5px 20px; } .n-cmenuamopicated, .teasicOd-view td { border-top: 4px solid #606c98; } /* Rpp-fills*/ .ads{padding: 0 10px;}.statearch-header div.title img{display:table-call(} fieldset legend span, blockquote.inner ul {padding:0;}} ... 
/* Ableft Title */ /* ======================================================== helper column parting if nofis calendar image Andy "Heading Georgia" */ .right_content { position: relative; width: 560px; height: 94px; } Ultimately, the best RNN achieved a loss of 0.7660 before I decided to shut it down because it wasn’t making much further progress.
Samples
It stalwartly continued to try to write comments, slightly approximating English (even though there is not that much English text in those 20MB - only 8.5k lines with /* in them; it’s CSS, not text). Examples of comments extracted from a large sample of 0.766’s output ( fgrep '/*' best.txt ): * //* COpToMNINW BDFER /* .snc .footer li a.diprActy a:hover, #sciam table {/*height: 164px;*//*/* } body.node-type-xplay-info #newsletter,body.node-type-update #header{min-width:128px;height:153px;float:left;}#main-content .newsletternav,#ntype-audio .block-title{background:url(/sites/www.amnesty.org/modules/civicrm/print-widget.clu)) /*gray details */ /* Grid >> 1px 0 : k0004_0 */ /* corner */ /* ST LETTOTE/ CORCRE TICEm langs 7 us1 Q+S.
Sap q i blask */ /*/*/ /* Side /**/ /* Loading Text version Links white to 10ths */ /*-modaty pse */ /**/ div #sb-adrom { display : none !important ; } /* /* `Grid >> Global /* `Grid >> 16 Columns /* `Grid >> 16 Columns /* `Suffix Extra Space >> 16 Columns /* `Prefix Extra Space >> 12 Columns /* `Prefix Extra Space >> 12 Columns /* `Clear Floated Elements /* `Prefix Extra Space >> 12 Columns /* `Push Space >> 16 Columns /* `Suffix Extra Space >> 16 Columns /* `Suffix Extra Space >> 16 Columns /* `Suffix Extra Space >> 16 Columns /* `Prefix Extra Space >> 16 Columns /* `Suffix Extra Space >> 16 Columns /* IE7 inline-block hack */ /* T* */ Not too great, but still more than I expected. Still, the (unminified) CSS looks good: div #sidebar { margin : 100 px ; } .ep_summary_box_body { float : left ; width : 550 px ; } .dark_search span { margin-right : 5 px ; } h1 .highlight_column { text-align : right ; display : block ; font-size : 18 px ; } h3 { font-weight : bold ; font-size : 12 px ; } col .teas h2 { clear : both ; width : 100 % ; z-index : 190 ; action: !important ; } #full_content .fancybox.no-float { background-image : url( '/static/onion/img/description.png' ) ; max-width : 33 px ; height : 40 px ; margin-top : 20 px ; color : #3D5042 ; font-size : 0.75 em ; padding-left : 25 px !important ; } .filter-container iframe{ width : 990 px ; } #funcy-oneTom { margin : 0 ; padding : 10 px 1 % ; line-height : 30 px ; } #utb_documentAlert { color : #222 ; } #utb_column02 a .button :focus { display : block ; font-family : Arial , Helvetica , sans-serif ; } #utb_column02 ul .blogs-listing aundoc1 ul :before , #utb_column01 a :active , h1 { font-weight : bold ; font-family : line-heetprind , AnimarzPromo , Atial ; line-height : 1.4 ; font-size : 1 9 px ; } #utb_column03 ul .fourder { width : 500 px ; padding : 4 px 10 px ; } The RNN also seems to have a thing for Amnesty International, regularly spitting out Amnesty URLs like
url(/sites/www.amnesty.org/modules/civicrm/i/mast2adCbang.png) (not actually valid URLs). Once that was done, I generated samples from all the checkpoints: for NN in cv/*.t7; do th sample.lua $NN -primetext 'div#sidebar { margin: 0px; }' -length 2000 > $NN.txt; done ## https://www.dropbox.com/s/xgstn9na3efxb43/smallrnn-samples.tar.xz ## if we want to watch the CSS evolve as the loss decreased: for SAMPLE in `ls cv/lm_css*.txt | sort --field-separator="_" --key=4 --numeric-sort --reverse`; do echo $SAMPLE: && tail -5 $SAMPLE | head -5; done
Evaluation
In under a day of GPU training on 20MB of CSS, a medium-sized RNN (~30M parameters) learned to produce high-quality CSS which passes visual inspection and on some batches yields few CSS syntactic errors. This strikes me as fairly impressive: I did not train a very large RNN, did not train it for very long, did not train it on very much, did no optimization of the many hyper-parameters, and it is doing unsupervised learning in the sense that it doesn’t know how well the emitted CSS validates or renders in web browsers - yet the results still look good. I would say this is a positive first step. Lessons learned:

GPUs > CPUs

char-rnn, while rough-edged, is excellent for quick prototyping

NNs are slow: major computation is required for the best results, and meaningful exploration of NN sizes or other hyperparameters will be challenging when a single run can cost days

computing large datasets or NNs on Amazon EC2 entails substantial financial costs; it’s adequate for short runs, but bills of ~$25 for two days of playing around are not a long-term solution

pretraining an RNN on CSS may be useful for a CSS reinforcement learner
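On that last point: a reinforcement learner would need some automatic reward signal for its generated CSS. A real CSS validator would be the right choice, but even something as crude as counting the mismatch between opening and closing curly braces (one of the error classes seen in the samples above) would be a start; the following is purely an illustrative sketch, not anything used in this experiment:

```shell
# Crude automatic "reward" sketch for generated CSS: the difference between
# opening and closing curly braces (0 = balanced); a real reinforcement
# learner would want a proper CSS validator instead.
css='a { color: red; } b { '   # sample with one unclosed brace
opens=$(printf '%s' "$css" | tr -cd '{' | wc -c)
closes=$(printf '%s' "$css" | tr -cd '}' | wc -c)
echo $((opens - closes))
# prints 1
```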