Maybe you know that the Github public archive can be analyzed with Google BigQuery. That’s 3Tb of data! This helped people run analysis on languages usage or framework popularity.

I wanted to produce similar results with projects using Docker. But what’s a project using Docker? For this article, I will consider a Docker project, a project that has a least one Dockerfile file.

So, why don’t we start by counting the number of Dockerfile files?

Count the Docker projects

The following query lists all files on a master branch that contain Dockerfile in their name. That’s not perfect but should be good enough for statistics.

SELECT count ( * ) FROM [bigquery - public - data :github_repos.files] WHERE path CONTAINS 'Dockerfile' AND ref = 'refs/heads/master'

The number is: 281,212 . That’s quite a lot! About 1% of all 2.9M Github projects.

Compare all the things!

We could compare it to the number of:

pom.xml : 840,328

: 840,328 travis.yml : 2,187,893

: 2,187,893 Makefile : 18,103,807

: 18,103,807 README.md : 10,774,468

A LOT of project seem to have a LOT of Makefiles! For example, shengshanbai/mini2440_buildroot contains 8628 Makefiles.

SELECT repo_name, count ( * ) count FROM ( SELECT repo_name FROM [bigquery - public - data :github_repos.files] WHERE path CONTAINS 'Makefile' AND ref = 'refs/heads/master' ) GROUP BY repo_name ORDER BY count DESC LIMIT 1

Which base images are most used?

Now, it would be interesting to know which base images are used most often. There’s two ways of doing that. We can count the occurrences of each distinct FROM clause. Or we could group by image name, without the tag.

Let’s try both!

To make the next queries easier to write, I’ve created a view called dockerfiles with this query. This view contains the id of every Dockerfile we found:

SELECT id FROM [bigquery - public - data :github_repos.files] WHERE path CONTAINS 'Dockerfile' AND ref = 'refs/heads/master'

And another view called dockerfiles_content that contains the actual content of all those Dockerfiles:

SELECT contents.content as dockerfile FROM [code - story - blog:github.dockerfiles] as dockerfiles JOIN [bigquery - public - data :github_repos.contents] as contents ON dockerfiles.id = contents.id

Now let’s extract the images names and counts:

SELECT RTRIM(LTRIM(SUBSTR(line, 5 ))) as image, count ( * ) as count FROM ( SELECT SPLIT(dockerfile, '

' ) as line FROM [code - story - blog:github.dockerfiles_content] HAVING LEFT (line, 5 ) = 'FROM ' ) GROUP BY image ORDER BY count DESC LIMIT 10

position image count 1 ubuntu:14.04 26746 2 debian:jessie 14031 3 scratch 11891 4 busybox 8064 5 ubuntu:trusty 7516 6 ubuntu 5597 7 golang:1.4 3395 8 ubuntu:latest 3383 9 debian:wheezy 2897 10 ubuntu:12.04 2545

Old Ubuntus are used a lot! The first alpine image is in 30th position. That needs to be improved! (See Alpine Based Docker Images Make a Difference in Real World Apps)

Now, what if we ignore the tags:

Let’s define an images view with all image names:

SELECT RTRIM(LTRIM(SUBSTR(line, 5 ))) as line_group.base_image FROM ( SELECT SPLIT(dockerfile, '

' ) as line FROM [code - story - blog:github.dockerfiles_content] HAVING LEFT (line, 5 ) = 'FROM ' )

And count the occurrences of each image name:

SELECT image, count ( * ) as count FROM ( SELECT FIRST (SPLIT(line_group.base_image, ':' )) as image FROM [code - story - blog:github.images] ) GROUP BY image ORDER BY count DESC LIMIT 15

position image count 1 ubuntu 54453 2 debian 23079 3 golang 13127 4 scratch 11894 5 busybox 11059 6 centos 10087 7 fedora 7340 8 node 6764 9 python 6239 10 java 6102 11 alpine 4945 12 nginx 3074 13 redis 2970 14 php 2611 15 brimstone/ubuntu 2536

Ubuntu still leads and it’s good to see that people also use a lot the base images created especially for each language: Go, Node, Python and Java base images make it into the top 10.

OK, that’s it for counting. One can get nice information by counting but can we do better? For example, could we compare Dockerfiles with the best practices?

One of those best practices says we should use COPY over ADD .

ADD or COPY?

I tried to count the number of ADD and COPY commands. Clearly, the number of COPY should be higher since ADD should be used for only very specific cases. But since ADD also does what COPY does, people tend to overuse ADD .

That’s easy to check!

We already have a view called dockerfiles_content that contains all the lines of all the Dockerfiles. Let’s extract the lines that start with ADD or COPY and count the occurrences of both:

SELECT command, count ( * ) as count FROM ( SELECT first (split(line, ' ' )) as command FROM ( SELECT SPLIT(dockerfile, '

' ) as line FROM [code - story - blog:github.dockerfiles_content] HAVING LEFT (line, 5 ) = 'COPY ' || LEFT (line, 4 ) = 'ADD ' ) ) GROUP BY command

And here’s the result:

command count ADD 101042 COPY 65066

Clearly, too many ADD s! We could analyze the file names passed to ADD and check that they are valid urls or tarball files. That would give us the percentage of valid usages. I let that exercise to my beloved readers!

COPY or ADD after USER?

Sebastian, a colleague of mine, asked me to search for a well known bug pattern. Because of a bug / bad decision during implementation of COPY and ADD , those directives don’t follow USER directives. So files are not added with ownership of the USER , resulting in people having to RUN chown afterwards, which balloons the image size.

Le’s see what we can do…

SELECT count ( * ) FROM ( SELECT INSTR(dockerfile, '

COPY ' ) as copy , INSTR(dockerfile, '

ADD ' ) as add , INSTR(dockerfile, '

USER ' ) as user FROM [code - story - blog:github.dockerfiles_content] HAVING ( user != 0 ) AND ( copy != 0 OR add != 0 ) AND ( user < copy OR user < add ) )

That’s 4,254 out of 281,212 Dockerfiles (1.5%). Out of those 4,254 , 854 seem to run a chmod afterwards.

That’s it for today! I hope this will motivate you to either run similar analysis or to improve your Dockerfiles!